public inbox for gentoo-user@lists.gentoo.org
 help / color / mirror / Atom feed
From: Etaoin Shrdlu <shrdlu@unlimitedmail.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 18:48:56 +0100	[thread overview]
Message-ID: <200902241848.57154.shrdlu@unlimitedmail.org> (raw)
In-Reply-To: <5bdc1c8b0902240641x332b57a6o524f9c29cc3e4fa1@mail.gmail.com>

On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote: <SNIP>
>
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an  arbitrary number of columns can follow)
> >
> > You want to group lines by n at a time, keeping the D and T column
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So for example, you want to drop
> > columns 2 and 3 (b and c in the example), so you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
> > D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically postpend the number N to each duplicated name. (or
> some string like _N)

So in the last example the header would be something like

D,T,a,b,c,d,...,R

in the original file, and would become 

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.

> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in it's
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?

Maybe, but while we're at it my personal preference would be to just add 
header handling to the existing program. Here's the revised code (also 
cleaned up a bit and more structured):

# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)

BEGIN {FS=OFS=","
       t=split(dcols,tmp,/,/)
       for(i=1;i<=t;i++)dc[tmp[i]]
      }

# process the header
NR==1{
  l=$1 OFS $2;
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }

  dt[n]=$1 OFS $2
  s[n]=do_line(-1)

  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}'

I think two examples are enough here to demonstrate how it works (blank 
lines between commands added for clarity - remember first line is header 
line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6

The program still assumes that first two columns and last column must be 
excluded from the general processing, and just prepended and appended to 
the output lines. That can be changed though.



  reply	other threads:[~2009-02-24 17:51 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-02-22 19:06 [gentoo-user] [OT] - command line read *.csv & create new file Mark Knecht
2009-02-22 20:15 ` Etaoin Shrdlu
2009-02-22 22:28   ` Mark Knecht
2009-02-22 22:57     ` Etaoin Shrdlu
2009-02-22 23:31       ` Mark Knecht
2009-02-23  6:17         ` Paul Hartman
2009-02-23  9:57         ` Etaoin Shrdlu
2009-02-23 16:05           ` Mark Knecht
2009-02-23 22:18             ` Etaoin Shrdlu
2009-02-24  2:26               ` Mark Knecht
2009-02-24 10:56                 ` Etaoin Shrdlu
2009-02-24 14:41                   ` Mark Knecht
2009-02-24 17:48                     ` Etaoin Shrdlu [this message]
2009-02-24 22:51                       ` Mark Knecht
2009-02-25 10:27                         ` Etaoin Shrdlu
2009-02-22 20:59 ` Willie Wong
2009-02-22 23:15   ` Mark Knecht
2009-02-23  0:57     ` Willie Wong
2009-02-23  1:54       ` Mark Knecht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200902241848.57154.shrdlu@unlimitedmail.org \
    --to=shrdlu@unlimitedmail.org \
    --cc=gentoo-user@lists.gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox