Date: Sun, 22 Feb 2009 15:59:39 -0500
From: Willie Wong <wwong@Princeton.EDU>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file

On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
>    I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.

Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/

Unfortunately I haven't used awk in the longest time and can't
remember how it would go. The following sed recipe may work, modulo
some small modifications.

>    The basic idea goes something like this:
> 
> 1) The input file might look like the following, where some of it is
> attributes (shown as letters) and other parts are results. (shown as
> numbers)
> 
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
> 
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
> 
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
> 
> 3) This must be done as a read/process/write operation of some sort
> because the input file may be far larger than system memory.
> (Currently it isn't, but it likely will eventually be.)
> 
> 4) In my example above I suggested that there is a single result but
> there may be more than one. (Don't know yet.) I showed 3 lines but
> might be doing 10. I don't know. It's important to me to pick a
> moderately flexible way of dealing with this as the order of columns
> and number of results will likely change over time and I'll certainly
> need to adjust.

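First create the sed script

sedscript1:
--------------------------
1 {
	N
	N
	b show
}
$ d
N
:show
p
D
--------------------------

The first block only fires when the first line of input is read. It
pulls in the next two lines so the pattern space holds three, then
jumps straight to the print step.

On every later cycle the pattern space already holds two lines, so N
appends the next one. The script then prints the three-line block and
D deletes its first line. D restarts the script without reading new
input, so any command placed after it would never run; that is why N
comes before the print. The "$ d" line stops the script once the
input is exhausted, so no short trailing block is printed.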
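
As a quick check of the windowing stage, here is a small sketch that
feeds the six sample lines from the quoted message through a
sliding-window sed script written inline. It assumes GNU sed; the
function name "window" and the label "show" are made up for
illustration. The script reads the next line before printing, because
nothing placed after D is ever executed.

```shell
# Six-line sample input from the quoted message.
sample='A,B,C,D,1
E,F,G,H,2
I,J,K,L,3
M,N,O,P,4
Q,R,S,T,5
U,V,W,X,6'

# Sliding 3-line window, one -e per script line.
# Every cycle ends in D or d, so nothing is auto-printed and only the
# explicit p produces output.
window() {
    sed -e '1{' -e 'N' -e 'N' -e 'b show' -e '}' \
        -e '$d' -e 'N' -e ':show' -e 'p' -e 'D'
}

windows=$(printf '%s\n' "$sample" | window)
printf '%s\n' "$windows"
```

With six input lines this should emit four overlapping windows of
three lines each (lines 1-3, 2-4, 3-5, 4-6), twelve lines in total.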

Now create the second sed script

sedscript2:
--------------------------
{
	N
	N
	s/,[^,]*\n/,/gp
	d
}
--------------------------

This reads a three-line block at a time, removes the last field (and
the newline character) from all but the last line, replacing it with
a comma. Then it prints, and then it clears the pattern space.
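
To illustrate the joining step on its own, the sketch below runs a
single three-line block through the same kind of substitution. It
assumes GNU sed, and it matches the last field with "[^,]*" so that
results longer than one character are handled too; that is an
assumption beyond the single-digit sample data. The function name
"join_block" is hypothetical.

```shell
# One three-line window, as the first stage would emit it.
block='A,B,C,D,1
E,F,G,H,2
I,J,K,L,3'

# N;N gathers three lines into the pattern space. The substitution
# drops ",<last field><newline>" from the first two lines and leaves a
# comma in its place; the p flag prints the joined line, and d ends
# the cycle so the default auto-print does not repeat it.
join_block() {
    sed -e 'N' -e 'N' -e 's/,[^,]*\n/,/gp' -e 'd'
}

joined=$(printf '%s\n' "$block" | join_block)
printf '%s\n' "$joined"   # A,B,C,D,E,F,G,H,I,J,K,L,3
```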

So running

cat INPUT | sed -f sedscript1 | sed -f sedscript2

should give you what you want. Like I said, the whole thing can
probably be done a lot more elegantly in awk. But my awk-fu is not
what it used to be.
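
For comparison, one possible awk version is sketched below. It is not
from the original recipe: the function name "window_join", the
variable names, and the assumption that the result is always the
single last field are all illustrative. It keeps only the most recent
n lines in memory, so it streams through input larger than RAM, and
the window size is a parameter rather than being hard-coded.

```shell
# window_join N: slide a window of N lines over comma-separated stdin.
# For each window, print the attributes (all fields but the last) of
# every line in it, followed by the result (last field) of its final
# line. N defaults to 3.
window_join() {
    awk -F, -v n="${1:-3}" '
    {
        # Attributes of this line: everything before the last comma.
        attrs = $0
        sub(/,[^,]*$/, "", attrs)
        # Ring buffer: only the latest n attribute strings are kept.
        buf[NR % n] = attrs
        if (NR >= n) {
            out = ""
            for (i = NR - n + 1; i <= NR; i++)
                out = out buf[i % n] ","
            print out $NF
        }
    }'
}

result=$(printf '%s\n' 'A,B,C,D,1' 'E,F,G,H,2' 'I,J,K,L,3' \
                       'M,N,O,P,4' 'Q,R,S,T,5' 'U,V,W,X,6' | window_join 3)
printf '%s\n' "$result"
```

On a real file this would be used as "window_join 10 < INPUT > OUTPUT".
Handling more than one result column would mean changing both the
sub() pattern and the final print.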

For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html

W
-- 
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51