From: Mark Knecht
To: gentoo-user@lists.gentoo.org
Date: Mon, 23 Feb 2009 18:26:15 -0800
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Message-ID: <5bdc1c8b0902231826t65cd1b4bvaaa203623e9f262b@mail.gmail.com>
In-Reply-To: <200902232318.38352.shrdlu@unlimitedmail.org>

On Mon, Feb 23, 2009 at 2:18 PM, Etaoin Shrdlu wrote:
> On Monday 23 February 2009, 17:05, Mark Knecht wrote:
>
>> I'm attaching a small (100 line) data file out of TradeStation. Zipped
>> it's about 2K. It should expand to about 10K. When I run the command
>> to get 10 lines put together it works correctly and gives me a file
>> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>>
>> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>>
>> No mangling of the first line - that must have been something earlier
>> I guess. Sorry for the confusion on that front.
>>
>> One other item has come up as I start to play with this farther down
>> the tool chain. I want to use this data in either R or RapidMiner to
>> data mine for patterns. Both of those tools are easier to use if the
>> first line in the file has column titles. I had originally asked
>> TradeStation not to output the column titles but if I do then for the
>> first line of our new file I should actually copy the first line of
>> the input file N times. Something like
>>
>> For i=1; read line, write N times, write \n
>>
>> and then
>>
>> for i>=2 do what we're doing right now.
>
> That is actually accomplished just by adding a bit of code:
>
> BEGIN {FS=OFS=","}
>
> NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print""} # header
> NR>=2{
>   r=$NF;NF--
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>     if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
>     if(NR>=n+1)printf "%s%s",s[i],OFS
>   }
>   sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
>   sub("^([^,]*,){"dropcol"}","")
>   s[n]=$0
>   if(NR>=n+1)printf "%s,%s\n", s[n],r
> }
>
> Note that no column is dropped from the header. If you need to do that,
> just tell us how you want to do that.
>

Thanks. That's a good add.

If I drop columns - and I do need to - then something like how cut
works would be good, but it needs to repeat across all the rows being
used. For instance, if I'm dropping columns 6 & 12 from a 20 column
wide data set, then I'm dropping 6 & 12 from all N lines. This is where
using cut after the line is built is difficult, as I'm forced to figure
out a list like 6,12,26,32,46,52, etc. It's easy to make a mistake
doing that.

If I could say something like "Drop 6 & 12 from all rows, and 1 & 2
from all rows higher than the first that make up this new line" then
that would be great. That's a lot to ask though.

D1,T1,A1,B1,C1,D1,
D2,T2,A2,B2,C2,D2,
D3,T3,A3,B3,C3,D3,
D4,T4,A4,B4,C4,D4,
D5,T5,A5,B5,C5,D5,

In the data above, if I drop column A, then I drop it for all rows.
(For instance, A contains 0 and isn't necessary, etc.) Assuming 3 wide
I'd get

D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5

Making that completely flexible - where I can drop 4 or 5 random
columns - is probably a bit too much work. On the other hand, maybe
sending it to cut as part of the whole process, line by line or
something, is more reasonable? I don't know.

I found a web site to study awk, so I'm starting to see more or less
how your example works when I have the code in front of me.
Creating the code out of thin air might be a bit of a stretch for me at
this point, though.

Thanks for your help!

Cheers,
Mark
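
For reference, here is a minimal sketch of one way the "drop some columns from every row, and a few more from every row after the first" request above might be written in awk. It is not the script from this thread: the function name slide and the variable names n, drop and later are made up for illustration, it assumes a trailing comma on each input row should be ignored (as in the sample data), and it leaves out the header line and the special handling of the last column.

```sh
#!/bin/sh
# Hypothetical helper, not taken from the thread.
# Usage: slide N DROP LATER < input.csv
#   N     - number of consecutive input rows joined into one output line
#   DROP  - comma-separated column numbers removed from every row
#   LATER - comma-separated column numbers also removed from rows 2..N
slide() {
    awk -v n="$1" -v drop="$2" -v later="$3" '
    BEGIN {
        FS = OFS = ","
        # Turn the two lists into lookup tables keyed by column number.
        k = split(drop, a, ",")
        for (i = 1; i <= k; i++) dropAll[a[i] + 0] = 1
        k = split(later, a, ",")
        for (i = 1; i <= k; i++) dropLater[a[i] + 0] = 1
    }
    {
        # Assumption: one trailing comma is an export artifact, so remove it.
        sub(/,$/, "")

        # Build two versions of the row: "first" is used when the row opens
        # a window, "rest" when it sits in any later slot of a window.
        first = rest = ""
        fsep = rsep = ""
        for (c = 1; c <= NF; c++) {
            if (c in dropAll) continue
            first = first fsep $c; fsep = OFS
            if (c in dropLater) continue
            rest = rest rsep $c; rsep = OFS
        }
        head[NR] = first
        tail[NR] = rest

        # Once n rows have been seen, print the window ending at this row.
        if (NR >= n) {
            line = head[NR - n + 1]
            for (j = NR - n + 2; j <= NR; j++) line = line OFS tail[j]
            print line
            # The oldest row is no longer needed by any later window.
            delete head[NR - n + 1]
            delete tail[NR - n + 1]
        }
    }'
}
```

With the five sample rows above saved in a file, "slide 3 3 1,2 < file" drops column 3 (A) everywhere and columns 1 and 2 (D and T) from the second and third row of each window, which reproduces the three output lines listed earlier. Because the column numbers always refer to the original input row, there is no need to work out a list like 6,12,26,32,46,52 by hand.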