From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-user+bounces-91435-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1Lbuzs-0002qF-Mb
	for garchives@archives.gentoo.org; Tue, 24 Feb 2009 10:58:44 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 9DA9CE01E2;
	Tue, 24 Feb 2009 10:58:42 +0000 (UTC)
Received: from dcnode-01.unlimitedmail.net (smtp.unlimitedmail.net [94.127.184.242])
	by pigeon.gentoo.org (Postfix) with ESMTP id 4B9CDE01E2
	for <gentoo-user@lists.gentoo.org>; Tue, 24 Feb 2009 10:58:42 +0000 (UTC)
Received: from ppp.zz ([137.204.208.98])
	(authenticated bits=0)
	by dcnode-01.unlimitedmail.net (8.14.3/8.14.3) with ESMTP id n1OAwONQ006017
	for <gentoo-user@lists.gentoo.org>; Tue, 24 Feb 2009 11:58:24 +0100
From: Etaoin Shrdlu <shrdlu@unlimitedmail.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 11:56:49 +0100
User-Agent: KMail/1.9.9
References: <5bdc1c8b0902221106h71a8783y698aa209ace59a6@mail.gmail.com> <200902232318.38352.shrdlu@unlimitedmail.org> <5bdc1c8b0902231826t65cd1b4bvaaa203623e9f262b@mail.gmail.com>
In-Reply-To: <5bdc1c8b0902231826t65cd1b4bvaaa203623e9f262b@mail.gmail.com>
Precedence: bulk
List-Post: <mailto:gentoo-user@lists.gentoo.org>
List-Help: <mailto:gentoo-user+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-user+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-user+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-user.gentoo.org>
X-BeenThere: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200902241156.50173.shrdlu@unlimitedmail.org>
X-UnlimitedMail-MailScanner-From: shrdlu@unlimitedmail.org
X-Spam-Status: No
X-Archives-Salt: 2919b4c7-b1e7-4931-95db-029b1ddc874e
X-Archives-Hash: 9d0839d16780ec37d5c2732644cd7644

On Tuesday 24 February 2009, 03:26, Mark Knecht wrote:

> If I drop columns - and I do need to - then something like how cut
> works would be good, but it needs to repeat across all the rows being
> used. For instance, if I'm dropping columns 6 & 12 from a 20 column
> wide data set, then I'm dropping 6 & 12 from all N lines. This is
> where using cut after the line is built is difficult as I'm forced to
> figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
> doing that. If I could say something like "Drop 6 & 12 from all rows,
> and 1 & 2 from all rows higher than the first that make up this new
> line" then that would be great. That's a lot to ask though.
>
> D1,T1,A1,B1,C1,D1,
> D2,T2,A2,B2,C2,D2,
> D3,T3,A3,B3,C3,D3,
> D4,T4,A4,B4,C4,D4,
> D5,T5,A5,B5,C5,D5,
>
> In the data above if I drop column A, then I drop it for all rows.
> (For instance, A contains 0 and isn't necessary, etc.)  Assuming 3
> wide I'd get
>
> D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
> D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
> D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
>
> Making that completely flexible - where I can drop 4 or 5 random
> columns - is probably a bit too much work. On the other hand maybe
> sending it to cut as part of the whole process, line by line or
> something, is more reasonable? I don't know.

The current "dropcol" variable only drops fields from the beginning of 
the line. Dropping arbitrary columns is possible, but it requires an 
array that holds the numbers of the columns to drop. 
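
As a rough sketch of the array idea (this is not the script from earlier 
in the thread; the function name "dropcols" and the awk variable "drop" 
are made up for illustration), the list of column numbers can be split 
into a lookup array once, and every field whose number is in that array 
is skipped on each line. It assumes plain comma-separated data with no 
quoted fields:

```shell
# Hypothetical helper: drop the listed columns from every line on stdin.
# Usage: dropcols 3,5 < input.csv
dropcols() {
  awk -F, -v OFS=, -v drop="$1" '
    BEGIN {
      # turn the list "3,5" into a lookup array: skip[3], skip[5]
      cnt = split(drop, tmp, ",")
      for (k = 1; k <= cnt; k++) skip[tmp[k]] = 1
    }
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        if (i in skip) continue      # this column is in the drop list
        out = out sep $i
        sep = OFS
      }
      print out
    }'
}

# Sample run: drop columns 3 and 5 from a 6 column file
printf 'D1,T1,A1,B1,C1,R1\nD2,T2,A2,B2,C2,R2\n' | dropcols 3,5
```

With that sample input, the two lines come out as D1,T1,B1,R1 and 
D2,T2,B2,R2.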

So, in my understanding this is what we want to accomplish so far:

given an input of the form

D1,T1,a1,b1,c1,d1,...,R1
D2,T2,a2,b2,c2,d2,...,R2
D3,T3,a3,b3,c3,d3,...,R3
D4,T4,a4,b4,c4,d4,...,R4
D5,T5,a5,b5,c5,d5,...,R5

(the ... means that an arbitrary number of columns can follow)

You want to group lines n at a time, as a sliding window, keeping the D 
and T columns from the first line of each group and the R column from 
the last line of the group. For example, with n=3 we would have:

D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5

(and you're right, that produces an output that is roughly n times the 
size of the original file)
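
To make the grouping concrete, here is a minimal sketch of that sliding 
window, under my reading of the requirements above. The function name 
"window" and the awk variable "n" are invented for the example, and it 
assumes every line has at least one column between T and R:

```shell
# Hypothetical helper: join each run of n consecutive lines into one line.
# Usage: window 3 < input.csv
window() {
  awk -F, -v OFS=, -v n="$1" '
    {
      # remember D,T and the middle columns (everything between T and R)
      head[NR] = $1 OFS $2
      mid = ""; sep = ""
      for (i = 3; i < NF; i++) { mid = mid sep $i; sep = OFS }
      body[NR] = mid

      if (NR >= n) {
        first = NR - n + 1           # first line of the current group
        line = head[first]
        for (j = first; j <= NR; j++) line = line OFS body[j]
        print line OFS $NF           # R comes from the last line of the group
        delete head[first]           # that line is no longer needed
        delete body[first]
      }
    }'
}

# Sample run with 5 input lines and n=3
printf 'D1,T1,a1,b1,R1\nD2,T2,a2,b2,R2\nD3,T3,a3,b3,R3\nD4,T4,a4,b4,R4\nD5,T5,a5,b5,R5\n' | window 3
```

Only the last n lines are kept in memory at any time, so the size of the 
input file should not matter much.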

Now, in addition to that, you also want to drop an arbitrary number of 
columns in the a,b,c... group. For example, if you want to drop columns 
2 and 3 of that group (b and c in the example), you'd end up with 
something like

D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
D3,T3,a3,d3,...a4,d4,...a5,d5,...R5

Please confirm that my understanding is correct, so I can come up with 
some code to do that.
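
In the meantime, assuming that reading is right, the whole thing could 
be sketched roughly as follows. This is only a draft under those 
assumptions: the function name "window_drop" and the variables "n" and 
"drop" are made up, the drop list is numbered relative to the a,b,c... 
group (so 1 is a, 2 is b, and so on), and at least one column of that 
group must survive the drop.

```shell
# Hypothetical helper: sliding window of n lines, dropping some middle columns.
# Usage: window_drop 3 2,3 < input.csv
window_drop() {
  awk -F, -v OFS=, -v n="$1" -v drop="$2" '
    BEGIN {
      # lookup array of column numbers to drop, relative to the a,b,c... group
      cnt = split(drop, tmp, ",")
      for (k = 1; k <= cnt; k++) skip[tmp[k]] = 1
    }
    {
      head[NR] = $1 OFS $2           # D and T of this line
      mid = ""; sep = ""
      for (i = 3; i < NF; i++) {
        if ((i - 2) in skip) continue   # field 3 is column 1 of the group
        mid = mid sep $i
        sep = OFS
      }
      body[NR] = mid

      if (NR >= n) {
        first = NR - n + 1
        line = head[first]           # D,T from the first line of the group
        for (j = first; j <= NR; j++) line = line OFS body[j]
        print line OFS $NF           # R from the last line of the group
        delete head[first]
        delete body[first]
      }
    }'
}

# Sample run: n=3, dropping b and c
printf 'D1,T1,a1,b1,c1,d1,R1\nD2,T2,a2,b2,c2,d2,R2\nD3,T3,a3,b3,c3,d3,R3\nD4,T4,a4,b4,c4,d4,R4\n' | window_drop 3 2,3
```

For the four sample lines this prints D1,T1,a1,d1,a2,d2,a3,d3,R3 and 
D2,T2,a2,d2,a3,d3,a4,d4,R4.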

> I found a web site to study awk so I'm starting to see more or less
> how your example works when I have the code in front of me. Creating
> the code out of thin air might be a bit of a stretch for me at this
> point though.

I suggest you start from the gawk manual:

http://www.gnu.org/software/gawk/manual/gawk.html

It is really complete, but gradual, so you can have an easy start and 
move on to the complexities later.