From: Etaoin Shrdlu
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 11:56:49 +0100

On Tuesday 24 February 2009, 03:26, Mark Knecht wrote:

> If I drop columns - and I do need to - then something like how cut
> works would be good, but it needs to repeat across all the rows being
> used.
> For instance, if I'm dropping columns 6 & 12 from a 20 column
> wide data set, then I'm dropping 6 & 12 from all N lines. This is
> where using cut after the line is built is difficult as I'm forced to
> figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
> doing that. If I could say something like "Drop 6 & 12 from all rows,
> and 1 & 2 from all rows higher than the first that make up this new
> line" then that would be great. That's a lot to ask though.
>
> D1,T1,A1,B1,C1,D1,
> D2,T2,A2,B2,C2,D2,
> D3,T3,A3,B3,C3,D3,
> D4,T4,A4,B4,C4,D4,
> D5,T5,A5,B5,C5,D5,
>
> In the data above if I drop column A, then I drop it for all rows.
> (For instance, A contains 0 and isn't necessary, etc.) Assuming 3
> wide I'd get
>
> D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
> D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
> D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
>
> Making that completely flexible - where I can drop 4 or 5 random
> columns - is probably a bit too much work. On the other hand maybe
> sending it to cut as part of the whole process, line by line or
> something, is more reasonable? I don't know.

The current "dropcol" variable drops fields from the beginning of the line.
Dropping arbitrary columns can be done, but requires an array in which to
store the numbers of the columns to drop.

So, in my understanding this is what we want to accomplish so far:

given an input of the form

D1,T1,a1,b1,c1,d1,...,R1
D2,T2,a2,b2,c2,d2,...,R2
D3,T3,a3,b3,c3,d3,...,R3
D4,T4,a4,b4,c4,d4,...,R4
D5,T5,a5,b5,c5,d5,...,R5

(the ...
mean that an arbitrary number of columns can follow)

You want to group lines n at a time, keeping the D and T columns from the
first line of each group and the R column from the last line of the group,
so for example with n=3 we would have:

D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5

(and you're right, that produces an output that is roughly n times the size
of the original file)

Now, in addition to that, you also want to drop an arbitrary number of
columns from the a,b,c... group. For example, if you drop columns 2 and 3
(b and c in the example), you'd end up with something like

D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
D3,T3,a3,d3,...a4,d4,...a5,d5,...R5

Please confirm that my understanding is correct, so I can come up with some
code to do that.

> I found a web site to study awk so I'm starting to see more or less
> how your example works when I have the code in front of me. Creating
> the code out of thin air might be a bit of a stretch for me at this
> point though.

I suggest you start from

http://www.gnu.org/software/gawk/manual/gawk.html

It is really complete, but gradual, so you can have an easy start and move
on to the complexities later.
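
In the meantime, here is a rough sketch of the transformation as I understand
it, not a final version. It assumes plain comma-separated input with no quoted
fields and no trailing comma, that the first two fields are D and T, and that
the last field is R. Each output line takes D and T from the first line of its
group (so the second output line starts with D2,T2). The function name
group_rows and its two arguments are made up for illustration; the drop list
is counted within the a,b,c... group (1 = a, 2 = b, and so on).

```shell
#!/bin/sh
# group_rows N DROP < input
#   N    - how many consecutive input lines make up one output line
#   DROP - comma-separated positions to drop within the a,b,c,... group
#          (1 = a, 2 = b, ...); pass "" to keep every column
# Hypothetical helper; assumes no quoted fields and no trailing comma.
group_rows() {
    awk -F, -v n="$1" -v drop="$2" '
    BEGIN {
        # store the positions to drop as keys of the "skip" array
        cnt = split(drop, d, ",")
        for (i = 1; i <= cnt; i++) skip[d[i]] = 1
    }
    {
        head[NR] = $1 "," $2        # D and T columns of this line
        mid[NR] = ""
        # fields 3 .. NF-1 are the a,b,c,... group; keep those not dropped
        for (i = 3; i < NF; i++)
            if (!((i - 2) in skip))
                mid[NR] = mid[NR] "," $i
        # once n lines have been read, print the group ending at this line
        if (NR >= n) {
            first = NR - n + 1
            line = head[first]
            for (j = first; j <= NR; j++) line = line mid[j]
            print line "," $NF      # R column comes from the last line
            # the oldest line of the group is not needed any more
            delete head[first]
            delete mid[first]
        }
    }'
}
```

With n=3 and columns 2 and 3 dropped, it would be called like this (the file
names are placeholders):

group_rows 3 2,3 < input.csv > output.csv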