Re: [gentoo-user] [OT] - command line read *.csv & create new file


From: Mark Knecht <markknecht@gmail.com>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Mon, 23 Feb 2009 18:26:15 -0800
Message-ID: <5bdc1c8b0902231826t65cd1b4bvaaa203623e9f262b@mail.gmail.com>
In-Reply-To: <200902232318.38352.shrdlu@unlimitedmail.org>

On Mon, Feb 23, 2009 at 2:18 PM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 17:05, Mark Knecht wrote:
>
>> I'm attaching a small (100 line) data file out of TradeStation. Zipped
>> it's about 2K. It should expand to about 10K. When I run the command
>> to get 10 lines put together it works correctly and gives me a file
>> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>>
>> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>>
>> No mangling of the first line - that must have been something earlier
>> I guess. Sorry for the confusion on that front.
>>
>> One other item has come up as I start to play with this farther down
>> the tool chain. I want to use this data in either R or RapidMiner to
>> data mine for patterns. Both of those tools are easier to use if the
>> first line in the file has column titles. I had originally asked
>> TradeStation not to output the column titles but if I do then for the
>> first line of our new file I should actually copy the first line of
>> the input file N times. Something like
>>
>> For i=1; read line, write N times, write \n
>>
>> and then
>>
>> for i>=2 do what we're doing right now.
>
> That is actually accomplished just by adding a bit of code:
>
> BEGIN {FS=OFS=","}
>
> NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print""} # header
> NR>=2{
>   r=$NF;NF--
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>     if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
>     if(NR>=n+1)printf "%s%s",s[i],OFS
>   }
>   sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
>   sub("^([^,]*,){"dropcol"}","")
>   s[n]=$0
>   if(NR>=n+1)printf "%s,%s\n", s[n],r
> }
>
> Note that no column is dropped from the header. If you need to do that,
> just tell us how you want to do that.
>

Thanks. That's a good addition.

If I drop columns - and I do need to - then something like how cut
works would be good, but the drop needs to repeat across all the rows
being used. For instance, if I'm dropping columns 6 & 12 from a
20-column-wide data set, then I'm dropping 6 & 12 from all N lines.
This is where using cut after the line is built gets difficult, as I'm
forced to work out a list like 6,12,26,32,46,52, etc. It's easy to make
a mistake doing that. If I could say something like "Drop 6 & 12 from
all rows, and 1 & 2 from all rows after the first that make up this new
line" then that would be great. That's a lot to ask though.
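
A minimal sketch of generating that cut list mechanically instead of by
hand. The helper name drop_list and its three arguments are hypothetical,
and it assumes the joined line is simply N rows of equal width laid end
to end, which is not exactly what the script quoted above produces (that
one moves the last column around), so treat it as an illustration only.

```sh
#!/bin/sh
# Hypothetical helper: print the field list to hand to cut.
#   $1 = columns to drop within one input row (comma-separated)
#   $2 = number of columns in one input row
#   $3 = number of rows joined into one output line
drop_list() {
  awk -v drop="$1" -v width="$2" -v n="$3" 'BEGIN {
    k = split(drop, d, ",")
    sep = ""
    for (row = 0; row < n; row++)        # one block of columns per joined row
      for (j = 1; j <= k; j++) {
        printf "%s%d", sep, d[j] + row * width
        sep = ","
      }
    print ""
  }'
}

drop_list 6,12 20 3    # prints 6,12,26,32,46,52
```

With GNU cut the result could then be used as
cut -d, --complement -f "$(drop_list 6,12 20 3)"; the --complement
option is a GNU extension and may not exist in other cut implementations.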

D1,T1,A1,B1,C1,D1,
D2,T2,A2,B2,C2,D2,
D3,T3,A3,B3,C3,D3,
D4,T4,A4,B4,C4,D4,
D5,T5,A5,B5,C5,D5,

In the data above, if I drop column A, then I drop it for all rows.
(For instance, A contains 0 and isn't necessary, etc.) Assuming 3 rows
wide, and also dropping D and T from every row after the first, I'd get

D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
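
One way this could be sketched in awk, dropping the columns from each row
as it is read rather than from the finished line. This is not the script
quoted above: the function name join_rows and the variables drop and lead
are made up for the illustration, the header line and the special last
column are not handled, and the sample input is assumed to have no
trailing comma.

```sh
#!/bin/sh
# Hypothetical sketch: join n consecutive rows into one output line.
#   $1 = n     rows per output line
#   $2 = drop  columns removed from every row (comma-separated positions)
#   $3 = lead  number of leading columns also removed from rows after the first
join_rows() {
  awk -v n="$1" -v drop="$2" -v lead="$3" '
  BEGIN {
    FS = OFS = ","
    k = split(drop, d, ",")
    for (j = 1; j <= k; j++) skip[d[j]] = 1
  }
  {
    # Build two versions of the row: "full" keeps the leading columns,
    # "rest" omits them. Both omit the columns listed in drop.
    full = rest = ""; sepf = sepr = ""
    for (i = 1; i <= NF; i++) {
      if (i in skip) continue
      full = full sepf $i; sepf = OFS
      if (i > lead) { rest = rest sepr $i; sepr = OFS }
    }
    # Slide the window of the last n rows along by one.
    for (i = 1; i < n; i++) { f[i] = f[i+1]; r[i] = r[i+1] }
    f[n] = full; r[n] = rest
    # Once n rows have been seen, print the oldest in full, the others trimmed.
    if (NR >= n) {
      line = f[1]
      for (i = 2; i <= n; i++) line = line OFS r[i]
      print line
    }
  }'
}

join_rows 3 3 2 <<EOF
D1,T1,A1,B1,C1,D1
D2,T2,A2,B2,C2,D2
D3,T3,A3,B3,C3,D3
D4,T4,A4,B4,C4,D4
D5,T5,A5,B5,C5,D5
EOF
```

Run on the five sample rows, this prints the three lines shown above.
Dropping several columns is then just a longer list, e.g.
join_rows 10 6,12 2 <awkDataIn.csv.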

Making that completely flexible - where I can drop 4 or 5 random
columns - is probably a bit too much work. On the other hand, maybe
sending it to cut as part of the whole process, line by line or
something, is more reasonable? I don't know.

I found a web site to study awk so I'm starting to see more or less
how your example works when I have the code in front of me. Creating
the code out of thin air might be a bit of a stretch for me at this
point though.

Thanks for your help!

Cheers,
Mark
