From: Etaoin Shrdlu <shrdlu@unlimitedmail.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 18:48:56 +0100

On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote:
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an arbitrary number of columns can follow)
> >
> > You want to group lines n at a time, keeping the D and T columns
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So, for example, if you drop
> > columns 2 and 3 (b and c in the example), you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns, which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically append the number N to each duplicated name (or
> some string like _N)...

So in the last example the header would be something like
D,T,a,b,c,d,...,R in the original file, and would become

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.
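Just to illustrate, the header step alone could be done with a small
script along these lines (an untested sketch - it expects the same n
and dcols variables used by the full program further down, and
header.awk is just a name I'm making up):

# header.awk - sketch: builds only the new header line
BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]            # columns to drop, as below
}
NR==1{
  l=$1 OFS $2                            # keep the date/time names as-is
  for(i=1;i<=n;i++)                      # repeat the data column names n times,
    for(j=3;j<NF;j++)                    # skipping the dropped columns,
      if(!((j-2) in dc))l=l OFS $j "_" i # with an _i suffix appended
  print l OFS $NF                        # keep the last (R) name as-is
  exit                                   # we only want the header
}

so that e.g. 'awk -v n=3 -v dcols='1,3' -f header.awk file.csv' would
print just D,T,b_1,d_1,b_2,d_2,b_3,d_3,R.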
> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in its
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?

Maybe, but while we're at it, my personal preference would be to just
add header handling to the existing program. Here's the revised code
(also cleaned up a bit and more structured):

# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc))line=line OFS $j (num>0?"_"num:"")
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)
BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}

# process the header
NR==1{
  l=$1 OFS $2
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

# process the data rows, keeping a sliding window of the last n rows
# (s[] holds the processed middle columns, dt[] the date/time columns)
NR>=2{
  for(i=1;i<n;i++){s[i]=s[i+1];dt[i]=dt[i+1]}
  s[n]=do_line(0)
  dt[n]=$1 OFS $2
  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}

I think two examples are enough here to demonstrate how it works (blank
lines between commands added for clarity - remember the first line is
the header line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6

The program still assumes that the first two columns and the last
column must be excluded from the general processing, and just prepended
and appended to the output lines. That can be changed though.
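For instance (another untested sketch - pre and post are just names I'm
making up here), the number of fixed leading and trailing columns could
be passed in as variables, with -v pre=2 -v post=1 reproducing the
current behavior:

# sketch: pre leading and post trailing columns are kept fixed
# (defaults reproduce the current program: 2 leading, 1 trailing)

# joins the fixed columns $from..$to into one comma separated piece
function fixed(from,to,   k,f) {
  f=$from
  for(k=from+1;k<=to;k++)f=f OFS $k
  return f
}

# returns a ",a1,b1,c1,.." line, drops unwanted columns
function do_line(num) {
  line=""
  for(j=pre+1;j<=NF-post;j++)
    if(!((j-pre) in dc))line=line OFS $j (num>0?"_"num:"")
  return line
}

BEGIN {FS=OFS=","
  if(!pre)pre=2                  # defaults, as in the current program
  if(!post)post=1
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}

# header
NR==1{
  l=fixed(1,pre)
  for(i=1;i<=n;i++)l=l do_line(i)
  print l OFS fixed(NF-post+1,NF)
}

# data rows, sliding window as before
NR>=2{
  for(i=1;i<n;i++){s[i]=s[i+1];dt[i]=dt[i+1]}
  s[n]=do_line(0)
  dt[n]=fixed(1,pre)
  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    print l OFS fixed(NF-post+1,NF)
  }
}

The column numbers in dcols would still be counted from the first
non-fixed column, as before.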