From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-user+bounces-91374-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1LbNXT-00053g-Nw
	for garchives@archives.gentoo.org; Sun, 22 Feb 2009 23:15:12 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 4F917E02AB;
	Sun, 22 Feb 2009 23:15:10 +0000 (UTC)
Received: from rv-out-0708.google.com (rv-out-0708.google.com [209.85.198.245])
	by pigeon.gentoo.org (Postfix) with ESMTP id 10F88E02AB
	for <gentoo-user@lists.gentoo.org>; Sun, 22 Feb 2009 23:15:09 +0000 (UTC)
Received: by rv-out-0708.google.com with SMTP id f25so1730201rvb.46
        for <gentoo-user@lists.gentoo.org>; Sun, 22 Feb 2009 15:15:09 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=gamma;
        h=domainkey-signature:mime-version:received:in-reply-to:references
         :date:message-id:subject:from:to:content-type
         :content-transfer-encoding;
        bh=nZdtier3BBkbXvMTT7fle/L/PZRk963uu24EdrhA5lY=;
        b=PvwqoG9pFi9VQRTtN9kg8PFVvYvWzQnWswlE2MLDBAa0d+bo941WHqevY9gRNz6Qsn
         oVCHf2OInFSdpN3xrWwZUt7JTC/WDousAM79wTphooK2gjpBBskiNck67U5/M49tK9BY
         gab2eywlKZeKKcjSwTECmoIulvIQbBtlPCwAY=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type:content-transfer-encoding;
        b=eRDLtUzg8KL407OTZFVFaOdIhQI/K/QnRAWldnoueG5Q5lwwi9mLeQj0OjnoZ8B4lv
         MWNXF69WIFnzwJSiORLc7bkIjNYholPN4PAZs3hZvlRuY3ulEyubS70/mdMkyn0Qoeb4
         Wk5cDLnGeCNqGODXe0u5ROREFl+uKg9jZK1AE=
Precedence: bulk
List-Post: <mailto:gentoo-user@lists.gentoo.org>
List-Help: <mailto:gentoo-user+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-user+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-user+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-user.gentoo.org>
X-BeenThere: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org
MIME-Version: 1.0
Received: by 10.142.241.10 with SMTP id o10mr1666482wfh.23.1235344509394; Sun, 
	22 Feb 2009 15:15:09 -0800 (PST)
In-Reply-To: <20090222205938.GA455@princeton.edu>
References: <5bdc1c8b0902221106h71a8783y698aa209ace59a6@mail.gmail.com>
	 <20090222205938.GA455@princeton.edu>
Date: Sun, 22 Feb 2009 15:15:09 -0800
Message-ID: <5bdc1c8b0902221515g3b932654k47568031f45d76d8@mail.gmail.com>
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht <markknecht@gmail.com>
To: gentoo-user@lists.gentoo.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Archives-Salt: 504501a1-6509-4618-afcd-01d0a11034f9
X-Archives-Hash: d6fb6591fbe4bf47c8c4187e95e7d8ac

On Sun, Feb 22, 2009 at 12:59 PM, Willie Wong <wwong@princeton.edu> wrote:
> On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
>>    I've got a really big data file in essentially a *.csv format.
>> (comma delimited) I need to scan this file and create a new output
>> file. I'm wondering if there is a reasonably easy command line way of
>> doing this using something like sed or awk which I know nothing about.
>> Thanks in advance.
>
> Definitely more than doable in sed or awk. If you want a reference
> book, try http://oreilly.com/catalog/9781565922259/
>
> Unfortunately I haven't used awk in the longest time and can't
> remember how it will go. The following sed recipe may work, modulo
> some small modifications
>
>>    The basic idea goes something like this:
>>
>> 1) The input file might look like the following where some of it is
>> attributes (shown as letters) and other parts are results. (shown as
>> numbers)
>>
>> A,B,C,D,1
>> E,F,G,H,2
>> I,J,K,L,3
>> M,N,O,P,4
>> Q,R,S,T,5
>> U,V,W,X,6
>>
>> 2) From the above data input file I want to take the attributes from a
>> few preceding lines (say 3 in this example) and write them to the
>> output file along with the result on the last of the 3 lines. The
>> output file might look like this:
>>
>> A,B,C,D,E,F,G,H,I,J,K,L,3
>> E,F,G,H,I,J,K,L,M,N,O,P,4
>> I,J,K,L,M,N,O,P,Q,R,S,T,5
>> M,N,O,P,Q,R,S,T,U,V,W,X,6
>>
>> 3) This must be done as a read/process/write operation of some sort
>> because the input file may be far larger than system memory.
>> (Currently it isn't, but it likely will eventually be.)
>>
>> 4) In my example above I suggested that there is a single result but
>> there may be more than one. (Don't know yet.) I showed 3 lines but
>> might be doing 10. I don't know. It's important to me to pick a
>> moderately flexible way of dealing with this as the order of columns
>> and number of results will likely change over time and I'll certainly
>> need to adjust.
>
> First create the sedscript
>
> sedscript1:
> --------------------------
> 1 {
>        N
>        N
> }
> {
>        p
>        D
>        N
> }
> --------------------------
>
> The first block only hits when the first line of input is read. It
> forces it to read the next two lines.
>
> The second block hits for every pattern space, it prints the three
> line blocks, deletes the first line, and reads the next line.
>
> Now create the sedscript
>
> sedscript2:
> --------------------------
> {
>        N
>        N
>        s/,[^,]*\n/,/gp
>        d
> }
> --------------------------
>
> This reads a three-line block at a time, removes the last field (and
> the new line character) from all but the last line, replacing it with
> a comma. Then it prints. And then it clears the pattern space.
>
> So you can do
>
> cat INPUT | sed -f sedscript1 | sed -f sedscript2
>
> should give you what you want. Like I said, the whole thing can
> probably be done a lot more eloquently in awk. But my awk-fu is not
> what it used to be.
>
> For a quick reference for sed, try
> http://www.grymoire.com/Unix/Sed.html
>
> W
> --
> Ever stop to think, and forget to start again?
> Sortir en Pantoufles: up 807 days, 18:51

Thanks Willie. That's a good start. The first two lines out were
mangled but that's totally cool. I can deal with that by hand.

There are two places where I'd like to improve things which probably
apply to the awk code Etaoin just sent me also. Both have to do with
excluding columns but in different ways.

1) My actual input data starts with two fields, which are date & time.
The date & time from lines 2 & 3 need to be excluded from the output,
keeping only those from line 1, so these 3 lines:

Date1,Time1,A,B,C,D,0
Date2,Time2,E,F,G,H,1
Date3,Time3,I,J,K,L,2

should generate

Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

Essentially Date & Time from line 1, results from line 3.
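
A minimal awk sketch of this requirement, meant as a starting point
under stated assumptions and not as a finished solution for the real
data. It keeps only the last n lines in memory, takes the leading k key
columns (date and time) from the first line of each window, the
trailing r result columns from the last line, and the attribute columns
in between from every line. The function name merge_window and its
three arguments are made up for illustration, and the sketch assumes
plain comma-separated fields with no quoted commas.

```shell
# merge_window: hypothetical helper. Reads CSV on stdin, writes merged rows.
#   $1 = lines per window                                  (default 3)
#   $2 = leading key columns, kept from the first line     (default 2)
#   $3 = trailing result columns, kept from the last line  (default 1)
merge_window() {
    awk -F, -v n="${1:-3}" -v k="${2:-2}" -v r="${3:-1}" '
    {
        # Split the current line into key, attribute and result parts.
        # Every part is built with a leading comma to keep joining simple.
        key = ""; attrs = ""; res = ""
        for (i = 1; i <= NF; i++) {
            if (i <= k)          key   = key   "," $i
            else if (i > NF - r) res   = res   "," $i
            else                 attrs = attrs "," $i
        }

        # Ring buffer: only the last n lines are ever held in memory.
        keys[NR % n] = key
        body[NR % n] = attrs

        if (NR >= n) {
            # The oldest line in the window is NR - n + 1; its slot is (NR + 1) % n.
            line = keys[(NR + 1) % n]
            for (j = NR - n + 1; j <= NR; j++)
                line = line body[j % n]
            # Strip the single leading comma before printing.
            print substr(line res, 2)
        }
    }'
}
```

Something like "merge_window 3 2 1 < INPUT > OUTPUT" would then cover
the three-line case above, and a ten-line window is only a different
first argument. Passing 0 as the second argument handles input without
the date and time columns.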

2) The second is that I possibly don't need attribute G in my output
file. I'm thinking of a 3rd sed script that counts a certain number of
commas and then skips everything up through the next comma. That's
messy in the sense that I probably need to drop 10-15 columns, as my
real data is maybe 100 fields wide, so I'd have 10-15 additional
scripts, which is too much of a hack to be maintainable.
Anyway, I appreciate the ideas. What you sent worked great.

I suspect this is somehow similar to what you did in the second
script? I'll go play around and see if I can figure that out.
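
For dropping columns, a single pass driven by a list of column numbers
may be easier to maintain than one script per column. A rough sketch,
again assuming plain comma-separated fields with no quoted commas;
drop_cols is a hypothetical helper name, and its argument is a
comma-separated list of 1-based column numbers:

```shell
# drop_cols: hypothetical helper. Reads CSV on stdin and prints each line
# without the listed columns.
#   $1 = comma-separated list of 1-based column numbers to drop, e.g. "5,12,40"
drop_cols() {
    awk -F, -v drop="$1" '
    BEGIN {
        # Turn the list into a lookup table of columns to skip.
        m = split(drop, d, ",")
        for (i = 1; i <= m; i++) skip[d[i]] = 1
    }
    {
        out = ""
        for (i = 1; i <= NF; i++)
            if (!(i in skip)) out = out "," $i
        # Strip the single leading comma before printing.
        print substr(out, 2)
    }'
}
```

With the Date,Time layout above, G is column 5, so something like
"drop_cols 5 < INPUT | sed -f sedscript1 | sed -f sedscript2" would
remove it before the windowing step. Filtering first means the column
numbers always refer to the original input layout, and dropping 10-15
columns is a longer list and not more scripts.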

In reality I'm not sure yet whether the results can be guaranteed to
be at the end in the real file, and there will probably be more than
one result column, although if I have to I might be able to combine
two results into a single value at the data source.

Great help! Thanks!

Cheers,
Mark