From mboxrd@z Thu Jan 1 00:00:00 1970
List-Id: Gentoo Linux mail
Reply-To: gentoo-user@lists.gentoo.org
MIME-Version: 1.0
In-Reply-To: <200902241848.57154.shrdlu@unlimitedmail.org>
References: <5bdc1c8b0902221106h71a8783y698aa209ace59a6@mail.gmail.com>
	<200902241156.50173.shrdlu@unlimitedmail.org>
	<5bdc1c8b0902240641x332b57a6o524f9c29cc3e4fa1@mail.gmail.com>
	<200902241848.57154.shrdlu@unlimitedmail.org>
Date: Tue, 24 Feb 2009 14:51:30 -0800
Message-ID: <5bdc1c8b0902241451t154dbca1ub02b1a140466ca52@mail.gmail.com>
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht
To: gentoo-user@lists.gentoo.org
Content-Type: text/plain; charset=UTF-8

On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu wrote:
> On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
>> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu wrote:
>>
>> > So, in my understanding this is what we want to accomplish so far:
>> >
>> > given an input of the form
>> >
>> > D1,T1,a1,b1,c1,d1,...,R1
>> > D2,T2,a2,b2,c2,d2,...,R2
>> > D3,T3,a3,b3,c3,d3,...,R3
>> > D4,T4,a4,b4,c4,d4,...,R4
>> > D5,T5,a5,b5,c5,d5,...,R5
>> >
>> > (the ... mean that an arbitrary number of columns can follow)
>> >
>> > You want to group lines by n at a time, keeping the D and T columns
>> > from the first line of each group, and keeping the R column from the
>> > last line of the group, so for example with n=3 we would have:
>> >
>> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
>> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
>> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>> >
>> > (and you're right, that produces an output that is roughly n times
>> > the size of the original file)
>> >
>> > Now, in addition to that, you also want to drop an arbitrary number
>> > of columns in the a,b,c... group.
>> > So for example, you want to drop columns 2 and 3 (b and c in the
>> > example), so you'd end up with something like
>> >
>> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
>> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
>> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>> >
>> > Please confirm that my understanding is correct, so I can come up
>> > with some code to do that.
>>
>> Perfectly correct for all the data rows.
>>
>> For the header I now see that we have a slightly harder job. What we'd
>> need to do is read the first line of the file, duplicate it N times,
>> and then drop the same columns as we drop in the rows. The problem is
>> that now I have the same header value for N columns, which won't make
>> sense to the tool that uses this data. It would be better if we could
>> read the header and then automatically append the number N (or some
>> string like _N) to each duplicated name.
>
> So in the last example the header would be something like
>
> D,T,a,b,c,d,...,R
>
> in the original file, and would become
>
> D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
>
> and
>
> D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
>
> respectively for the two sample outputs above.
>
>> Maybe better would be a separate small program to do the header part;
>> this program could then read that header and make it the first line of
>> the output file. My worry is that when this data file becomes very
>> large - say 1GB or more of data - I probably cannot open the file with
>> vi to edit the header. Better if I could put the header in its own
>> file. That file would be one line long. I could check it for the name
>> edits, make sure it's right, and then the program you are so kindly
>> building would just read it, cut out columns, and put it at the start
>> of the new large file.
>>
>> Does that make sense?
>
> Maybe, but while we're at it my personal preference would be to just add
> header handling to the existing program.
> Here's the revised code (also cleaned up a bit and more structured):
>
> # returns a ",a1,b1,c1,..." line
> # drops unwanted columns
> function do_line(num) {
>   line=""
>   for(j=3;j<NF;j++)
>     if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
>   return line
> }
>
> # dcols is a string like '2,3,4,7' with a comma separated list of
> # the columns to drop (first data column after date/time is column 1
> # here)
>
> BEGIN {FS=OFS=","
>        t=split(dcols,tmp,/,/)
>        for(i=1;i<=t;i++)dc[tmp[i]]
>       }
>
> # process the header
> NR==1{
>   l=$1 OFS $2
>   for(i=1;i<=n;i++)l=l do_line(i)
>   l=l OFS $NF
>   print l
> }
>
> NR>=2{
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>   }
>
>   dt[n]=$1 OFS $2
>   s[n]=do_line(-1)
>
>   if(NR>=n+1){
>     l=dt[1]
>     for(i=1;i<=n;i++)l=l s[i]
>     l=l OFS $NF
>     print l
>   }
> }'
>
> I think two examples are enough here to demonstrate how it works (blank
> lines between commands added for clarity - remember the first line is
> the header line):
>
> $ cat file.csv
> D,T,a,b,c,d,R
> D1,T1,a1,b1,c1,d1,R1
> D2,T2,a2,b2,c2,d2,R2
> D3,T3,a3,b3,c3,d3,R3
> D4,T4,a4,b4,c4,d4,R4
> D5,T5,a5,b5,c5,d5,R5
> D6,T6,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -f program.awk file.csv
> D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
> D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
> D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
> D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
> D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
> D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
> D1,T1,b1,d1,b2,d2,b3,d3,R3
> D2,T2,b2,d2,b3,d3,b4,d4,R4
> D3,T3,b3,d3,b4,d4,b5,d5,R5
> D4,T4,b4,d4,b5,d5,b6,d6,R6
>
> The program still assumes that the first two columns and last column
> must be excluded from the
> general processing, and just prepended and appended to the output
> lines. That can be changed though.

Excellent! It worked identically for me once I removed an extra single
quote after the last brace that somehow crept into your copy/paste in
this email. Really good stuff.

I tried it on a 40MB data file from TradeStation with 6 years of
5-minute Dow Futures data. Going 10 wide and deleting 1 column, it
generated a 292MB output file in about 1 minute on my machine. That's
more than acceptable performance - the data mining will likely take
hours if not days, certainly more than a few minutes.

Looks like I'm running into one more problem, and then I'm ready to
give it a try for real. Unfortunately, one vendor platform puts quotes
around the names in the header row, so your _N increment looks like
"High"_4 instead of High_4 or "High_4". I'd like to fix that, as I'm
pretty sure the way we have it now won't be acceptable, but I don't
know whether it would be better to keep the quotes or drop them. My two
target data mining platforms are R, which is in portage, and
RapidMiner, which is available as Open Source from the Rapid-i web
site. I'll try both header formats with both platforms and see what
happens.

I had worried about checking the header on a really large file to see
whether I had cut the correct columns, but it turns out that running
cat awkDataOut.csv | more in a terminal writes the first few lines very
quickly. From there I can either just look at them, or copy/paste them
into a new csv file and load that into something like OpenOffice Calc
to make sure I got the right columns. So I don't think there's any
practical need to do anything more with the header, beyond whatever
turns out to be the right answer on the quotes. My worry had been that
when I request 5 data columns it's not obvious what order they arrive
in, so I'd have to look at the file and figure out where everything
was. Turns out it's not such a big deal.

This has been extremely helpful.
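For what it's worth, one way to end up with High_4 rather than "High"_4 would be to strip the double quotes from the header line before the suffixes get attached. A minimal sketch, separate from the program above (quoted.csv is just a placeholder file name):

```shell
# Remove double quotes from every field of the header row only;
# data rows pass through untouched. Reassigning $i makes awk
# rebuild the line with OFS, so the output stays comma separated.
awk 'BEGIN{FS=OFS=","}
     NR==1{for(i=1;i<=NF;i++) gsub(/"/,"",$i)}
     {print}' quoted.csv
```

Running this as a pre-pass before program.awk (or folding the gsub loop into its NR==1 block) would let the _N suffix attach to a bare name.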
I really am embarrassed that you did all the work. I owe you at least a
beer or two (or whatever your preferred drink might be) should we ever
have a chance to exercise our right arms together one day.

I'll drop you a note back when I figure out what makes sense about the
header quotes.

Cheers,
Mark
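As an aside, the standalone header-only program discussed earlier in the thread would be short. A sketch under the same n/dcols conventions as program.awk (header.csv is a hypothetical one-line file holding just the original header):

```shell
# Expand a one-line header D,T,a,b,...,R into D,T followed by n
# copies of the middle columns (suffixed _1.._n, with the data
# columns listed in dcols dropped) and then the final R column.
awk -v n=3 -v dcols='1,3' '
BEGIN{FS=OFS=","; t=split(dcols,tmp,/,/); for(i=1;i<=t;i++)dc[tmp[i]]}
NR==1{
  l=$1 OFS $2
  for(k=1;k<=n;k++)
    for(j=3;j<NF;j++)
      if(!((j-2) in dc)) l=l OFS $j "_" k
  print l OFS $NF
}' header.csv
```

On the header D,T,a,b,c,d,R this prints D,T,b_1,d_1,b_2,d_2,b_3,d_3,R, matching the header row of the dcols='1,3' example above; the result could then be prepended to the large output file with cat.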