From: Etaoin Shrdlu <shrdlu@unlimitedmail.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Tue, 24 Feb 2009 18:48:56 +0100

On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote:
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an arbitrary number of columns can follow)
> >
> > You want to group lines n at a time, keeping the D and T columns
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So, for example, if you drop
> > columns 2 and 3 (b and c in the example), you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns, which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically append the number N to each duplicated name (or
> some string like _N)...

So in the last example the header would be something like
D,T,a,b,c,d,...,R in the original file, and would become

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.
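Just to illustrate, the header step alone could be done with a small
script along these lines (an untested sketch - it expects the same n
and dcols variables used by the full program further down, and
header.awk is just a name I'm making up):

# header.awk - sketch: builds only the new header line
BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]            # columns to drop, as below
}
NR==1{
  l=$1 OFS $2                            # keep the date/time names as-is
  for(i=1;i<=n;i++)                      # repeat the data column names n times,
    for(j=3;j<NF;j++)                    # skipping the dropped columns,
      if(!((j-2) in dc))l=l OFS $j "_" i # with an _i suffix appended
  print l OFS $NF                        # keep the last (R) name as-is
  exit                                   # we only want the header
}

so that e.g. 'awk -v n=3 -v dcols='1,3' -f header.awk file.csv' would
print just D,T,b_1,d_1,b_2,d_2,b_3,d_3,R.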
> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in its
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?

Maybe, but while we're at it, my personal preference would be to just
add header handling to the existing program. Here's the revised code
(also cleaned up a bit and more structured):

# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc))line=line OFS $j (num>0?"_"num:"")
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)
BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}

# process the header
NR==1{
  l=$1 OFS $2
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

# process the data rows, keeping a sliding window of the last n rows
# (s[] holds the processed middle columns, dt[] the date/time columns)
NR>=2{
  for(i=1;i<n;i++){s[i]=s[i+1];dt[i]=dt[i+1]}
  s[n]=do_line(0)
  dt[n]=$1 OFS $2
  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}

I think two examples are enough here to demonstrate how it works (blank
lines between commands added for clarity - remember the first line is
the header line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6

The program still assumes that the first two columns and the last
column must be excluded from the general processing, and just prepended
and appended to the output lines. That can be changed though.
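For instance (another untested sketch - pre and post are just names I'm
making up here), the number of fixed leading and trailing columns could
be passed in as variables, with -v pre=2 -v post=1 reproducing the
current behavior:

# sketch: pre leading and post trailing columns are kept fixed
# (defaults reproduce the current program: 2 leading, 1 trailing)

# joins the fixed columns $from..$to into one comma separated piece
function fixed(from,to,   k,f) {
  f=$from
  for(k=from+1;k<=to;k++)f=f OFS $k
  return f
}

# returns a ",a1,b1,c1,.." line, drops unwanted columns
function do_line(num) {
  line=""
  for(j=pre+1;j<=NF-post;j++)
    if(!((j-pre) in dc))line=line OFS $j (num>0?"_"num:"")
  return line
}

BEGIN {FS=OFS=","
  if(!pre)pre=2                  # defaults, as in the current program
  if(!post)post=1
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}

# header
NR==1{
  l=fixed(1,pre)
  for(i=1;i<=n;i++)l=l do_line(i)
  print l OFS fixed(NF-post+1,NF)
}

# data rows, sliding window as before
NR>=2{
  for(i=1;i<n;i++){s[i]=s[i+1];dt[i]=dt[i+1]}
  s[n]=do_line(0)
  dt[n]=fixed(1,pre)
  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    print l OFS fixed(NF-post+1,NF)
  }
}

The column numbers in dcols would still be counted from the first
non-fixed column, as before.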