From mboxrd@z Thu Jan 1 00:00:00 1970
List-Id: Gentoo Linux mail
Reply-To: gentoo-user@lists.gentoo.org
MIME-Version: 1.0
In-Reply-To: <200902241848.57154.shrdlu@unlimitedmail.org>
References: <5bdc1c8b0902221106h71a8783y698aa209ace59a6@mail.gmail.com>
	<200902241156.50173.shrdlu@unlimitedmail.org>
	<5bdc1c8b0902240641x332b57a6o524f9c29cc3e4fa1@mail.gmail.com>
	<200902241848.57154.shrdlu@unlimitedmail.org>
Date: Tue, 24 Feb 2009 14:51:30 -0800
Message-ID: <5bdc1c8b0902241451t154dbca1ub02b1a140466ca52@mail.gmail.com>
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht
To: gentoo-user@lists.gentoo.org
Content-Type: text/plain; charset=UTF-8

On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu wrote:
> On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
>> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu wrote:
>>
>> > So, in my understanding this is what we want to accomplish so far:
>> >
>> > given an input of the form
>> >
>> > D1,T1,a1,b1,c1,d1,...,R1
>> > D2,T2,a2,b2,c2,d2,...,R2
>> > D3,T3,a3,b3,c3,d3,...,R3
>> > D4,T4,a4,b4,c4,d4,...,R4
>> > D5,T5,a5,b5,c5,d5,...,R5
>> >
>> > (the ... mean that an arbitrary number of columns can follow)
>> >
>> > You want to group lines by n at a time, keeping the D and T columns
>> > from the first line of each group, and keeping the R column from the
>> > last line of the group, so for example with n=3 we would have:
>> >
>> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
>> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
>> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>> >
>> > (and you're right, that produces an output that is roughly n times
>> > the size of the original file)
>> >
>> > Now, in addition to that, you also want to drop an arbitrary number
>> > of columns in the a,b,c... group.
>> > So for example, you want to drop columns 2 and 3 (b and c in the
>> > example), so you'd end up with something like
>> >
>> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
>> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
>> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>> >
>> > Please confirm that my understanding is correct, so I can come up
>> > with some code to do that.
>>
>> Perfectly correct for all the data rows.
>>
>> For the header I now see that we have a slightly harder job. What we'd
>> need to do is read the first line of the file, duplicate it N times,
>> and then drop the same columns as we drop in the rows. The problem is
>> that now I have the same header value for N columns, which won't make
>> sense to the tool that uses this data. It would be better if we could
>> read the header and then automatically append the number N (or some
>> string like _N) to each duplicated name.
>
> So in the last example the header would be something like
>
> D,T,a,b,c,d,...,R
>
> in the original file, and would become
>
> D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
>
> and
>
> D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
>
> respectively for the two sample outputs above.
>
>> Maybe better would be a separate small program to do the header part;
>> this program could then read that header and make it the first line of
>> the output file. My worry is that when this data file becomes very
>> large - say 1GB or more of data - I probably cannot open the file with
>> vi to edit the header. Better if I could put the header in its own
>> file. That file would be one line long. I could check it for the name
>> edits, make sure it's right, and then the program you are so kindly
>> building would just read it, cut out columns, and put it at the start
>> of the new large file.
>>
>> Does that make sense?
>
> Maybe, but while we're at it my personal preference would be to just add
> header handling to the existing program.
> Here's the revised code (also cleaned up a bit and more structured):
>
> # returns a ",a1,b1,c1,..." line
> # drops unwanted columns
> function do_line(num) {
>   line=""
>   for(j=3;j<NF;j++)
>     if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
>   return line
> }
>
> # dcols is a string like '2,3,4,7' with a comma separated list of
> # the columns to drop (first data column after date/time is column 1
> # here)
>
> BEGIN {FS=OFS=","
>        t=split(dcols,tmp,/,/)
>        for(i=1;i<=t;i++)dc[tmp[i]]
>       }
>
> # process the header
> NR==1{
>   l=$1 OFS $2
>   for(i=1;i<=n;i++)l=l do_line(i)
>   l=l OFS $NF
>   print l
> }
>
> NR>=2{
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>   }
>
>   dt[n]=$1 OFS $2
>   s[n]=do_line(-1)
>
>   if(NR>=n+1){
>     l=dt[1]
>     for(i=1;i<=n;i++)l=l s[i]
>     l=l OFS $NF
>     print l
>   }
> }'
>
> I think two examples are enough here to demonstrate how it works (blank
> lines between commands added for clarity - remember the first line is
> the header line):
>
> $ cat file.csv
> D,T,a,b,c,d,R
> D1,T1,a1,b1,c1,d1,R1
> D2,T2,a2,b2,c2,d2,R2
> D3,T3,a3,b3,c3,d3,R3
> D4,T4,a4,b4,c4,d4,R4
> D5,T5,a5,b5,c5,d5,R5
> D6,T6,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -f program.awk file.csv
> D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
> D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
> D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
> D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
> D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
> D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
> D1,T1,b1,d1,b2,d2,b3,d3,R3
> D2,T2,b2,d2,b3,d3,b4,d4,R4
> D3,T3,b3,d3,b4,d4,b5,d5,R5
> D4,T4,b4,d4,b5,d5,b6,d6,R6
>
> The program still assumes that the first two columns and last column
> must be excluded from the
> general processing, and just prepended and appended to the output
> lines. That can be changed though.

Excellent! It worked identically for me once I removed an extra single
quote after the last brace that somehow crept into your copy/paste in
this email. Really good stuff.

I tried it on a 40MB data file from TradeStation with 6 years of
5-minute Dow Futures data. Going 10 wide and deleting 1 column, it
generated a 292MB output file in about 1 minute on my machine. That's
more than acceptable performance - the data mining will likely take
hours if not days, certainly more than a few minutes.

Looks like I'm running into one more problem, and then I'm ready to
give it a try for real. Unfortunately, one vendor platform puts quotes
around the names in the header row, so your _N increment looks like
"High"_4 instead of High_4 or "High_4". I'd like to fix that, as I'm
pretty sure the way we have it now won't be acceptable, but I don't
know whether it would be better to keep the quotes or drop them. My two
target data mining platforms are R, which is in portage, and
RapidMiner, which is available as Open Source from the Rapid-i web
site. I'll try both header formats with both platforms and see what
happens.

I had worried about checking the header on a really large file to see
whether I had cut the correct columns, but it turns out that running
cat awkDataOut.csv | more in a terminal writes the first few lines very
quickly. From there I can either just look at them, or copy/paste them
into a new csv file and load that into something like OpenOffice Calc
to make sure I got the right columns. So I don't think there's any
practical need to do anything more with the header, beyond whatever
turns out to be the right answer on the quotes. My worry had been that
when I request 5 data columns it's not obvious what order they arrive
in, so I'd have to look at the file and figure out where everything
was. Turns out it's not such a big deal.

This has been extremely helpful.
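For what it's worth, one way to end up with High_4 rather than "High"_4 would be to strip the double quotes from the header line before the suffixes get attached. A minimal sketch, separate from the program above (quoted.csv is just a placeholder file name):

```shell
# Remove double quotes from every field of the header row only;
# data rows pass through untouched. Reassigning $i makes awk
# rebuild the line with OFS, so the output stays comma separated.
awk 'BEGIN{FS=OFS=","}
     NR==1{for(i=1;i<=NF;i++) gsub(/"/,"",$i)}
     {print}' quoted.csv
```

Running this as a pre-pass before program.awk (or folding the gsub loop into its NR==1 block) would let the _N suffix attach to a bare name.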
I really am embarrassed that you did all the work. I owe you at least a
beer or two (or whatever your preferred drink might be) should we ever
have a chance to exercise our right arms together one day.

I'll drop you a note back when I figure out what makes sense about the
header quotes.

Cheers,
Mark
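As an aside, the standalone header-only program discussed earlier in the thread would be short. A sketch under the same n/dcols conventions as program.awk (header.csv is a hypothetical one-line file holding just the original header):

```shell
# Expand a one-line header D,T,a,b,...,R into D,T followed by n
# copies of the middle columns (suffixed _1.._n, with the data
# columns listed in dcols dropped) and then the final R column.
awk -v n=3 -v dcols='1,3' '
BEGIN{FS=OFS=","; t=split(dcols,tmp,/,/); for(i=1;i<=t;i++)dc[tmp[i]]}
NR==1{
  l=$1 OFS $2
  for(k=1;k<=n;k++)
    for(j=3;j<NF;j++)
      if(!((j-2) in dc)) l=l OFS $j "_" k
  print l OFS $NF
}' header.csv
```

On the header D,T,a,b,c,d,R this prints D,T,b_1,d_1,b_2,d_2,b_3,d_3,R, matching the header row of the dcols='1,3' example above; the result could then be prepended to the large output file with cat.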