From: Etaoin Shrdlu <shrdlu@unlimitedmail.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 18:48:56 +0100 [thread overview]
Message-ID: <200902241848.57154.shrdlu@unlimitedmail.org> (raw)
In-Reply-To: <5bdc1c8b0902240641x332b57a6o524f9c29cc3e4fa1@mail.gmail.com>
On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote: <SNIP>
>
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an arbitrary number of columns can follow)
> >
> > You want to group lines by n at a time, keeping the D and T column
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So for example, you want to drop
> > columns 2 and 3 (b and c in the example), so you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
> > D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically postpend the number N to each duplicated name. (or
> some string like _N)
So in the last example the header would be something like
D,T,a,b,c,d,...,R
in the original file, and would become
D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
and
D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
respectively for the two sample outputs above.
> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in it's
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?
Maybe, but while we're at it my personal preference would be to just add
header handling to the existing program. Here's the revised code (also
cleaned up a bit and more structured):
# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
line=""
for(j=3;j<NF;j++)
if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
return line
}
# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)
BEGIN {FS=OFS=","
t=split(dcols,tmp,/,/)
for(i=1;i<=t;i++)dc[tmp[i]]
}
# process the header
NR==1{
l=$1 OFS $2;
for(i=1;i<=n;i++)l=l do_line(i)
l=l OFS $NF
print l
}
NR>=2{
for(i=1;i<n;i++){
s[i]=s[i+1]
dt[i]=dt[i+1]
}
dt[n]=$1 OFS $2
s[n]=do_line(-1)
if(NR>=n+1){
l=dt[1]
for(i=1;i<=n;i++)l=l s[i]
l=l OFS $NF
print l
}
}'
I think two examples are enough here to demonstrate how it works (blank
lines between commands added for clarity - remember first line is header
line):
$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6
$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6
The program still assumes that first two columns and last column must be
excluded from the general processing, and just prepended and appended to
the output lines. That can be changed though.
next prev parent reply other threads:[~2009-02-24 17:51 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-02-22 19:06 [gentoo-user] [OT] - command line read *.csv & create new file Mark Knecht
2009-02-22 20:15 ` Etaoin Shrdlu
2009-02-22 22:28 ` Mark Knecht
2009-02-22 22:57 ` Etaoin Shrdlu
2009-02-22 23:31 ` Mark Knecht
2009-02-23 6:17 ` Paul Hartman
2009-02-23 9:57 ` Etaoin Shrdlu
2009-02-23 16:05 ` Mark Knecht
2009-02-23 22:18 ` Etaoin Shrdlu
2009-02-24 2:26 ` Mark Knecht
2009-02-24 10:56 ` Etaoin Shrdlu
2009-02-24 14:41 ` Mark Knecht
2009-02-24 17:48 ` Etaoin Shrdlu [this message]
2009-02-24 22:51 ` Mark Knecht
2009-02-25 10:27 ` Etaoin Shrdlu
2009-02-22 20:59 ` Willie Wong
2009-02-22 23:15 ` Mark Knecht
2009-02-23 0:57 ` Willie Wong
2009-02-23 1:54 ` Mark Knecht
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200902241848.57154.shrdlu@unlimitedmail.org \
--to=shrdlu@unlimitedmail.org \
--cc=gentoo-user@lists.gentoo.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox