From: Mark Knecht <markknecht@gmail.com>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Mon, 23 Feb 2009 08:05:16 -0800
Message-ID: <5bdc1c8b0902230805t575e97deg9c8b9fb271f4296@mail.gmail.com>
In-Reply-To: <200902231057.56850.shrdlu@unlimitedmail.org>

[-- Attachment #1: Type: text/plain, Size: 3461 bytes --]

On Mon, Feb 23, 2009 at 1:57 AM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 00:31, Mark Knecht wrote:
>
>> Yeah, that's probably almost usable as it is. I tried it with n=3 and
>> n=10. Worked both times just fine. The initial issue might be (as with
>> Willie's sed code) that the first line wasn't quite right and required
>> some hand editing. I'd prefer not to have to hand edit anything as the
>> files are large and that step will be slow. I can work on that.
>
> But then could you paste an example of such line, so we can see it? The
> first line was not special in the sample you posted...
>
>> As per the message to Willie it would be nice to be able to drop
>> columns out but technically I suppose it's not really required. All of
>> this is going into another program which must at some level understand
>> what the columns are. If I have extra dates and don't use them that's
>> probably workable.
>
> Anyway, it's not difficult to add that feature:
>
> BEGIN { FS=OFS=","}
> {
>  r=$NF;NF--
>  for(i=1;i<n;i++){
>    s[i]=s[i+1]
>    dt[i]=dt[i+1]
>    if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
>    if(NR>=n)printf "%s%s",s[i],OFS
>  }
>  sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
>  sub("^([^,]*,){"dropcol"}","")
>  s[n]=$0
>  if(NR>=n)printf "%s,%s\n", s[n],r
> }
>
> There is a new variable "dropcol" which contains the number of columns to
> drop. Also, for the above to work, you must add the --re-interval
> command line switch to awk, eg
>
> awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv
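
For awk builds without interval-expression support, the column drop can also be sketched with a plain loop. This is a swapped-in alternative to the sub() call quoted above, not the quoted program itself, and the sample row is made up for illustration:

```sh
# Hypothetical row: two leading columns to drop, then data, then the result field.
row='d1,t1,1,2,r'

dropped=$(printf '%s\n' "$row" | awk -v dropcol=2 '
BEGIN { FS = OFS = "," }
{
    # Rebuild the line from field dropcol+1 onward, joining with OFS.
    rest = ""
    for (i = dropcol + 1; i <= NF; i++)
        rest = rest (i > dropcol + 1 ? OFS : "") $i
    print rest
}')
printf '%s\n' "$dropped"
```

The loop only shows the stripping step; the dropped fields would still have to be saved in dt[n] as in the quoted program.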

Thanks. I'll give that a try later today. I also like Willie's idea
about using cut. That seems pretty flexible without any programming.
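
A minimal sketch of the cut approach, assuming comma-separated input and that the two leading columns are the ones to discard (the sample row is made up for illustration, not real TradeStation output):

```sh
# Hypothetical row: date, time, then two data columns.
row='02/23/2009,0805,1.50,2.50'

# -d, sets the field delimiter; -f3- keeps field 3 through the end of the
# line, which drops the first two columns.
kept=$(printf '%s\n' "$row" | cut -d, -f3-)
printf '%s\n' "$kept"
```

A field list such as -f1,4- would instead keep column 1 and drop columns 2 and 3.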

>
>> The down side is the output file is 10x larger than the input file -
>> roughly - and my current input files are 40-60MB so the output files
>> will be 600MB. Not huge but if they grew too much more I might get
>> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
>
> That is strange, the output file could be bigger but not by that
> factor...if you don't mind, again could you paste a sample input file
> (maybe just some lines, to get an idea...)?
>
>

I'm attaching a small (100-line) data file out of TradeStation. Compressed,
it's about 2K; it should expand to about 10K. When I run the command
to get 10 lines put together it works correctly and gives me a file
with 91 lines and about 100K in size (i.e., about 10x on my disk).

awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
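
Those numbers are what a sliding window predicts: 100 input lines with n=10 give 100 - 10 + 1 = 91 output lines, each carrying ten input rows. A rough check, assuming generated stand-in data (not the attached file) and with the attached script inlined so it runs on its own:

```sh
n=10
tmp=$(mktemp -d)

# Stand-in input: 100 rows of made-up values; the last column is the result field.
awk 'BEGIN { for (i = 1; i <= 100; i++) printf "d%d,%d,r%d\n", i, i, i }' > "$tmp/in.csv"

# Same logic as awkScript1.awk, inlined so the check is self-contained.
awk -v n="$n" '
BEGIN { FS = OFS = "," }
{
    r = $NF; NF--
    for (i = 1; i < n; i++) {
        s[i] = s[i+1]
        if (NR >= n) printf "%s%s", s[i], OFS
    }
    s[n] = $0
    if (NR >= n) printf "%s,%s\n", s[n], r
}' "$tmp/in.csv" > "$tmp/out.csv"

# Arithmetic expansion strips any padding that wc adds on some systems.
in_lines=$(( $(wc -l < "$tmp/in.csv") ))
out_lines=$(( $(wc -l < "$tmp/out.csv") ))
in_bytes=$(( $(wc -c < "$tmp/in.csv") ))
out_bytes=$(( $(wc -c < "$tmp/out.csv") ))
echo "lines: $in_lines -> $out_lines"
echo "bytes: $in_bytes -> $out_bytes"
rm -rf "$tmp"
```

The byte ratio comes out somewhat below n because the last column is written only once per output line.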

No mangling of the first line; that must have been something earlier,
I guess. Sorry for the confusion on that front.

One other item has come up as I start to play with this further down
the toolchain. I want to use this data in either R or RapidMiner to
data mine for patterns. Both of those tools are easier to use if the
first line in the file has column titles. I had originally asked
TradeStation not to output the column titles, but if I do have it output
them, then for the first line of our new file I should actually copy the
first line of the input file N times. Something like

for line 1: read the line, write it N times, write \n

and then

for lines >= 2: do what we're doing right now.

After I did that I could run it through cut and drop whatever columns
I need to drop, I think... ;-)
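
A rough sketch of that header handling, assuming the title line has the same columns as the data and that the final column title should appear only once, to line up with how the script moves the last field to the end of each output line. The column names and rows are made up for illustration:

```sh
tmp=$(mktemp -d)
# Hypothetical input: a title line plus four data rows.
printf '%s\n' 'Date,Open,Result' 'd1,1,a' 'd2,2,b' 'd3,3,c' 'd4,4,d' > "$tmp/in.csv"

out=$(awk -v n=3 '
BEGIN { FS = OFS = "," }
NR == 1 {
    last = $NF; NF--            # split the final column title off the header
    for (i = 1; i <= n; i++) printf "%s%s", $0, OFS
    print last                  # header titles n times, final title once
    next
}
{
    # Data lines: same sliding window as before, but counting from line 2,
    # so the tests use NR - 1 instead of NR.
    r = $NF; NF--
    for (i = 1; i < n; i++) {
        s[i] = s[i+1]
        if (NR - 1 >= n) printf "%s%s", s[i], OFS
    }
    s[n] = $0
    if (NR - 1 >= n) printf "%s,%s\n", s[n], r
}' "$tmp/in.csv")
printf '%s\n' "$out"
rm -rf "$tmp"
```

R and RapidMiner may want unique column titles, so the repeated names might still need a suffix per copy; that part is left out here.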

This is a great help from you all. For someone who doesn't really program
or use the command line much, it's a big advantage. Thanks!

Cheers,
Mark

[-- Attachment #2: awkDataIn.csv.bz2 --]
[-- Type: application/x-bzip2, Size: 1929 bytes --]

[-- Attachment #3: awkScript1.awk --]
[-- Type: application/octet-stream, Size: 153 bytes --]

BEGIN { FS=OFS="," }

{
  r=$NF; NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    if(NR>=n) printf "%s%s",s[i],OFS
  }
  s[n]=$0
  if(NR>=n) printf "%s,%s\n",s[n],r
}
