* [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-22 19:06 UTC
To: gentoo-user
Hi,
Very off topic other than I'd do this on my Gentoo box prior to
using R on my Gentoo box. Please ignore if not of interest.
I've got a really big data file in essentially a *.csv format.
(comma delimited) I need to scan this file and create a new output
file. I'm wondering if there is a reasonably easy command line way of
doing this using something like sed or awk which I know nothing about.
Thanks in advance.
The basic idea goes something like this:
1) The input file might look like the following, where some of it is
attributes (shown as letters) and other parts are results. (shown as
numbers)
A,B,C,D,1
E,F,G,H,2
I,J,K,L,3
M,N,O,P,4
Q,R,S,T,5
U,V,W,X,6
2) From the above data input file I want to take the attributes from a
few preceding lines (say 3 in this example) and write them to the
output file along with the result on the last of the 3 lines. The
output file might look like this:
A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6
3) This must be done as a read/process/write operation of some sort
because the input file may be far larger than system memory.
(Currently it isn't, but it likely will eventually be.)
4) In my example above I suggested that there is a single result but
there may be more than one. (Don't know yet.) I showed 3 lines but
might be doing 10. I don't know. It's important to me to pick a
moderately flexible way of dealing with this as the order of columns
and number of results will likely change over time and I'll certainly
need to adjust.
Thanks in advance for any pointers. Happy to buy a good book if
someone knows what I should look for.
Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-22 20:15 UTC
To: gentoo-user
On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
> Hi,
> Very off topic other than I'd do this on my Gentoo box prior to
> using R on my Gentoo box. Please ignore if not of interest.
>
> I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.
>
> The basic idea goes something like this:
>
> 1) The input file might look like the following, where some of it is
> attributes (shown as letters) and other parts are results. (shown as
> numbers)
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
Are the results always in the last field, and only a single field?
Is the total number of fields per line always fixed?
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
Is the number of lines you pick for the operation always 3 or can it
vary? And, once you choose a number n of lines, should the whole file be
processed concatenating n lines at a time, and the resulting single line
be ended with the result of the nth line? In other words, does the
following hold for the output format:
<concatenation of attributes of lines 1..n> <result of line n>
<concatenation of attributes of lines 2..n+1> <result of line n+1>
<concatenation of attributes of lines 3..n+2> <result of line n+2>
<concatenation of attributes of lines 4..n+3> <result of line n+3>
...
With answers to the above questions, it's probably possible to hack
together a solution.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Willie Wong @ 2009-02-22 20:59 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
> I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.
Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/
Unfortunately I haven't used awk in the longest time and can't
remember how it will go. The following sed recipe may work, modulo
some small modifications.
> The basic idea goes something like this:
>
> 1) The input file might look like the following, where some of it is
> attributes (shown as letters) and other parts are results. (shown as
> numbers)
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
>
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> 3) This must be done as a read/process/write operation of some sort
> because the input file may be far larger than system memory.
> (Currently it isn't, but it likely will eventually be.)
>
> 4) In my example above I suggested that there is a single result but
> there may be more than one. (Don't know yet.) I showed 3 lines but
> might be doing 10. I don't know. It's important to me to pick a
> moderately flexible way of dealing with this as the order of columns
> and number of results will likely change over time and I'll certainly
> need to adjust.
First create the sedscript
sedscript1:
--------------------------
1 {
N
N
}
{
p
$d
N
D
}
--------------------------
The first block only hits when the first line of input is read. It
forces it to read the next two lines.
The second block hits for every pattern space: it prints the current
three-line window, quits at the last line, and otherwise appends the
next input line and deletes the first line of the window.
Now create the sedscript
sedscript2:
--------------------------
{
N
N
s/,[^,]*\n/,/gp
d
}
--------------------------
This reads a three-line block at a time, removes the last field (and
the newline character) from all but the last line, replacing it with
a comma. Then it prints. And then it clears the pattern space.
So running
cat INPUT | sed -f sedscript1 | sed -f sedscript2
should give you what you want. Like I said, the whole thing can
probably be done a lot more eloquently in awk. But my awk-fu is not
what it used to be.
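If I got the scripts right, running them over your sample input should
give exactly the output you described:

$ cat INPUT | sed -f sedscript1 | sed -f sedscript2
A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6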
For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html
W
--
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-22 22:28 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 12:15 PM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
>> Hi,
>> Very off topic other than I'd do this on my Gentoo box prior to
>> using R on my Gentoo box. Please ignore if not of interest.
>>
>> I've got a really big data file in essentially a *.csv format.
>> (comma delimited) I need to scan this file and create a new output
>> file. I'm wondering if there is a reasonably easy command line way of
>> doing this using something like sed or awk which I know nothing about.
>> Thanks in advance.
>>
>> The basic idea goes something like this:
>>
>> 1) The input file might look like the following, where some of it is
>> attributes (shown as letters) and other parts are results. (shown as
>> numbers)
>>
>> A,B,C,D,1
>> E,F,G,H,2
>> I,J,K,L,3
>> M,N,O,P,4
>> Q,R,S,T,5
>> U,V,W,X,6
>
> Are the results always in the last field, and only a single field?
> Is the total number of fields per line always fixed?
I don't know that for certain yet but I think the results will not
always be in the last field.
The total number of fields per line is always fixed in a given file
but might change from file to file. If it does I'm willing to do minor
edits (heck - I'll do major edits if I have to!!) to get it working.
>
>> 2) From the above data input file I want to take the attributes from a
>> few preceding lines (say 3 in this example) and write them to the
>> output file along with the result on the last of the 3 lines. The
>> output file might look like this:
>>
>> A,B,C,D,E,F,G,H,I,J,K,L,3
>> E,F,G,H,I,J,K,L,M,N,O,P,4
>> I,J,K,L,M,N,O,P,Q,R,S,T,5
>> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> Is the number of lines you pick for the operation always 3 or can it
> vary? And, once you choose a number n of lines, should the whole file be
> processed concatenating n lines at a time, and the resulting single line
> be ended with the result of the nth line? In other words, does the
> following hold for the output format:
>
> <concatenation of attributes of lines 1..n> <result of line n>
> <concatenation of attributes of lines 2..n+1> <result of line n+1>
> <concatenation of attributes of lines 3..n+2> <result of line n+2>
> <concatenation of attributes of lines 4..n+3> <result of line n+3>
The above diagram is correct when the number of lines chosen is 3. I
suspect that I might choose 10 or 15 lines once I get real data and do some
testing but that was harder to show in this email. A good design for
me would be a single variable I could set. Once a value is chosen I
want to process every line in the input file the same way. I don't use
5 lines sometimes and 10 lines other times. In a given file it's
always the same number of lines.
> ...
>
> With answers to the above questions, it's probably possible to hack
> together a solution.
Thanks!
- Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-22 22:57 UTC
To: gentoo-user
On Sunday 22 February 2009, 23:28, Mark Knecht wrote:
> > <concatenation of attributes of lines 1..n> <result of line n>
> > <concatenation of attributes of lines 2..n+1> <result of line n+1>
> > <concatenation of attributes of lines 3..n+2> <result of line n+2>
> > <concatenation of attributes of lines 4..n+3> <result of line n+3>
>
>> The above diagram is correct when the number of lines chosen is 3. I
>> suspect that I might choose 10 or 15 lines once I get real data and do some
> testing but that was harder to show in this email. A good design for
> me would be a single variable I could set. Once a value is chosen I
> want to process every line in the input file the same way. I don't use
> 5 lines sometimes and 10 lines other times. In a given file it's
> always the same number of lines.
Ok, try this for a start:
BEGIN { FS=OFS=","}
{
  r=$NF; NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  s[n]=$0; if(NR>=n)printf "%s,%s\n", s[n],r
}
Save the above code in a file (eg, program.awk) and run it with
awk -v n=3 -f program.awk datafile.csv
where the "n=3" part is to be replaced with the actual number of lines
you want to group (eg, n=5, n=4, etc.)
With your sample input and n=3, the above awk program produces the output
you show.
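That is (assuming the sample is saved as datafile.csv):

$ awk -v n=3 -f program.awk datafile.csv
A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6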
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-22 23:15 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 12:59 PM, Willie Wong <wwong@princeton.edu> wrote:
> On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
>> I've got a really big data file in essentially a *.csv format.
>> (comma delimited) I need to scan this file and create a new output
>> file. I'm wondering if there is a reasonably easy command line way of
>> doing this using something like sed or awk which I know nothing about.
>> Thanks in advance.
>
> Definitely more than doable in sed or awk. If you want a reference
> book, try http://oreilly.com/catalog/9781565922259/
>
> Unfortunately I haven't used awk in the longest time and can't
> remember how it will go. The following sed recipe may work, modulo
> some small modifications.
>
>> The basic idea goes something like this:
>>
>> 1) The input file might look like the following, where some of it is
>> attributes (shown as letters) and other parts are results. (shown as
>> numbers)
>>
>> A,B,C,D,1
>> E,F,G,H,2
>> I,J,K,L,3
>> M,N,O,P,4
>> Q,R,S,T,5
>> U,V,W,X,6
>>
>> 2) From the above data input file I want to take the attributes from a
>> few preceding lines (say 3 in this example) and write them to the
>> output file along with the result on the last of the 3 lines. The
>> output file might look like this:
>>
>> A,B,C,D,E,F,G,H,I,J,K,L,3
>> E,F,G,H,I,J,K,L,M,N,O,P,4
>> I,J,K,L,M,N,O,P,Q,R,S,T,5
>> M,N,O,P,Q,R,S,T,U,V,W,X,6
>>
>> 3) This must be done as a read/process/write operation of some sort
>> because the input file may be far larger than system memory.
>> (Currently it isn't, but it likely will eventually be.)
>>
>> 4) In my example above I suggested that there is a single result but
>> there may be more than one. (Don't know yet.) I showed 3 lines but
>> might be doing 10. I don't know. It's important to me to pick a
>> moderately flexible way of dealing with this as the order of columns
>> and number of results will likely change over time and I'll certainly
>> need to adjust.
>
> First create the sedscript
>
> sedscript1:
> --------------------------
> 1 {
> N
> N
> }
> {
> p
> $d
> N
> D
> }
> --------------------------
>
> The first block only hits when the first line of input is read. It
> forces it to read the next two lines.
>
> The second block hits for every pattern space: it prints the current
> three-line window, quits at the last line, and otherwise appends the
> next input line and deletes the first line of the window.
>
> Now create the sedscript
>
> sedscript2:
> --------------------------
> {
> N
> N
> s/,[^,]*\n/,/gp
> d
> }
> --------------------------
>
> This reads a three-line block at a time, removes the last field (and
> the newline character) from all but the last line, replacing it with
> a comma. Then it prints. And then it clears the pattern space.
>
> So running
>
> cat INPUT | sed -f sedscript1 | sed -f sedscript2
>
> should give you what you want. Like I said, the whole thing can
> probably be done a lot more eloquently in awk. But my awk-fu is not
> what it used to be.
>
> For a quick reference for sed, try
> http://www.grymoire.com/Unix/Sed.html
>
> W
> --
> Ever stop to think, and forget to start again?
> Sortir en Pantoufles: up 807 days, 18:51
Thanks Willie. That's a good start. The first two lines out were
mangled but that's totally cool. I can deal with that by hand.
There are two places where I'd like to improve things which probably
apply to the awk code Etaoin just sent me also. Both have to do with
excluding columns but in different ways.
1) My actual input data starts with two fields, which are date & time. For
lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
corresponding to line 1, so these 3 lines:
Date1,Time1,A,B,C,D,0
Date2,Time2,E,F,G,H,1
Date3,Time3,I,J,K,L,2
should generate
Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
Essentially Date & Time from line 1, results from line 3.
2) The second is that possibly I don't need attribute G in my output
file. I'm thinking that possibly a 3rd sed script that counts a
certain number of commas and then doesn't copy up through the next
comma? That's messy in the sense that I probably need to drop 10-15
columns out as my real data is maybe 100 fields wide so I'd have 10-15
additional scripts, which is too much of a hack to be maintainable.
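Something like this, maybe - an untested sketch that copies the first 10
fields and drops the 11th (which would be G in the joined line above):

sed 's/^\(\([^,]*,\)\{10\}\)[^,]*,/\1/'

but I'd need one of those per dropped column.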
Anyway, I appreciate the ideas. What you sent worked great.
I suspect this is somehow similar to what you did in the second
script? I'll go play around and see if I can figure that out.
In reality I'm not sure yet whether the results can be guaranteed to
be at the end in the real file, and probably there will be more than
one result column although if I have to I might be able to take care
of combining two results into a single value at the data source if
necessary.
Great help! Thanks!
Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-22 23:31 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 2:57 PM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Sunday 22 February 2009, 23:28, Mark Knecht wrote:
>
>> > <concatenation of attributes of lines 1..n> <result of line n>
>> > <concatenation of attributes of lines 2..n+1> <result of line n+1>
>> > <concatenation of attributes of lines 3..n+2> <result of line n+2>
>> > <concatenation of attributes of lines 4..n+3> <result of line n+3>
>>
>> The above diagram is correct when the number of lines chosen is 3. I
>> suspect that I might choose 10 or 15 lines once I get real data and do some
>> testing but that was harder to show in this email. A good design for
>> me would be a single variable I could set. Once a value is chosen I
>> want to process every line in the input file the same way. I don't use
>> 5 lines sometimes and 10 lines other times. In a given file it's
>> always the same number of lines.
>
> Ok, try this for a start:
>
> BEGIN { FS=OFS=","}
>
> {
> r=$NF;NF--
> for(i=1;i<n;i++){
> s[i]=s[i+1]
> if(NR>=n)printf "%s%s",s[i],OFS
> }
> s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
> }
>
> Save the above code in a file (eg, program.awk) and run it with
>
> awk -v n=3 -f program.awk datafile.csv
>
> where the "n=3" part is to be replaced with the actual number of lines
> you want to group (eg, n=5, n=4, etc.)
>
> With your sample input and n=3, the above awk program produces the output
> you show.
>
>
Yeah, that's probably almost usable as it is. I tried it with n=3 and
n=10. Worked both times just fine. The initial issue might be (as with
Willie's sed code) that the first line wasn't quite right and required
some hand editing. I'd prefer not to have to hand edit anything as the
files are large and that step will be slow. I can work on that.
As per the message to Willie it would be nice to be able to drop
columns out but technically I suppose it's not really required. All of
this is going into another program which must at some level understand
what the columns are. If I have extra dates and don't use them that's
probably workable.
The down side is the output file is 10x larger than the input file -
roughly - and my current input files are 40-60MB so the output files
will be 600MB. Not huge but if they grew too much more I might get
beyond what a single file can be on ext3, right? Isn't that 2GB or so?
Thanks very much,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Willie Wong @ 2009-02-23 0:57 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
> 1) My actual input data starts with two fields, which are date & time. For
> lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
> corresponding to line 1, so these 3 lines:
>
> Date1,Time1,A,B,C,D,0
> Date2,Time2,E,F,G,H,1
> Date3,Time3,I,J,K,L,2
>
> should generate
>
> Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
>
> Essentially Date & Time from line 1, results from line 3.
>
> 2) The second is that possibly I don't need attribute G in my output
> file. I'm thinking that possibly a 3rd sed script that counts a
> certain number of commas and then doesn't copy up through the next
> comma? That's messy in the sense that I probably need to drop 10-15
> columns out as my real data is maybe 100 fields wide so I'd have 10-15
> additional scripts, which is too much of a hack to be maintainable.
> Anyway, I appreciate the ideas. What you sent worked great.
>
For both of these cases, since you are dropping columns and not
re-organizing, you'd have a much easier time just piping the command
through "cut". Try 'man cut' (it is only a few hundred words) for
usage. But with the sample you gave me, you just need to post process
with
.... | cut -d , -f 1-6,9,10,12,15-
and the Date2, Time2, G, Date3, Time3 columns will be dropped.
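For example, taking the joined line from your sample (if I counted the
fields right):

$ echo 'Date1,Time1,A,B,C,D,Date2,Time2,E,F,G,H,Date3,Time3,I,J,K,L,2' | cut -d , -f 1-6,9,10,12,15-
Date1,Time1,A,B,C,D,E,F,H,I,J,K,L,2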
As to your problem with the first two lines being mangled: I suspect
that the first two lines were formatted differently? Maybe stray
control characters got into your file or maybe there are leading
spaces? It's bizarre for both Etaoin's and my scripts to
coincidentally mess up the same lines.
(Incidentally, where did you get the csv files from? When I worked in
a physics lab and collected data, I found that a lot of times the
processing of data using basic command-line tools like sed, bash,
perl, and bc can be done a lot more quickly if the initial datasets were
formatted in a sensible fashion. Of course there are times when such
luxury cannot be afforded.)
Best,
W
--
"What's the Lagrangian for a suction dart?"
~DeathMech, Some Student. P-town PHY 205
Sortir en Pantoufles: up 807 days, 23:29
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-23 1:54 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 4:57 PM, Willie Wong <wwong@princeton.edu> wrote:
> On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
>> 1) My actual input data starts with two fields, which are date & time. For
>> lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
>> corresponding to line 1, so these 3 lines:
>>
>> Date1,Time1,A,B,C,D,0
>> Date2,Time2,E,F,G,H,1
>> Date3,Time3,I,J,K,L,2
>>
>> should generate
>>
>> Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
>>
>> Essentially Date & Time from line 1, results from line 3.
>>
>> 2) The second is that possibly I don't need attribute G in my output
>> file. I'm thinking that possibly a 3rd sed script that counts a
>> certain number of commas and then doesn't copy up through the next
>> comma? That's messy in the sense that I probably need to drop 10-15
>> columns out as my real data is maybe 100 fields wide so I'd have 10-15
>> additional scripts, which is too much of a hack to be maintainable.
>> Anyway, I appreciate the ideas. What you sent worked great.
>>
>
> For both of these cases, since you are dropping columns and not
> re-organizing, you'd have a much easier time just piping the command
> through "cut". Try 'man cut' (it is only a few hundred words) for
> usage. But with the sample you gave me, you just need to post process
> with
>
> .... | cut -d , -f 1-6,9,10,12,15-
>
> and the Date2, Time2, G, Date3, Time3 columns will be dropped.
Thanks. I'll investigate that tomorrow.
>
> As to your problem with the first two lines being mangled: I suspect
> that the first two lines were formatted differently? Maybe stray
> control characters got into your file or maybe there are leading
> spaces? It's bizarre for both Etaoin's and my scripts to
> coincidentally mess up the same lines.
>
> (Incidentally, where did you get the csv files from? When I worked in
> a physics lab and collected data, I found that a lot of times the
> processing of data using basic command-line tools like sed, bash,
> perl, and bc can be done a lot more quickly if the initial datasets were
> formatted in a sensible fashion. Of course there are times when such
> luxury cannot be afforded.)
They are primarily coming from TradeStation. The data that I'm
working with is stock pricing data along with technical indicators
coming off of charts. Unfortunately I don't seem to have any control
at all over the order in which the columns show up. It doesn't seem to be
based on how I build the chart and certain things on the chart I don't
need are still output to the file. It's pretty much take 100% of
what's on the chart or take nothing.
Fortunately the csv files are very good in terms of not dropping out
data. At least every row has all the data.
Cheers,
Mark
>
> Best,
>
> W
> --
> "What's the Lagrangian for a suction dart?"
> ~DeathMech, Some Student. P-town PHY 205
> Sortir en Pantoufles: up 807 days, 23:29
>
>
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Paul Hartman @ 2009-02-23 6:17 UTC
To: gentoo-user
On Sun, Feb 22, 2009 at 5:31 PM, Mark Knecht <markknecht@gmail.com> wrote:
> The down side is the output file is 10x larger than the input file -
> roughly - and my current input files are 40-60MB so the output files
> will be 600MB. Not huge but if they grew too much more I might get
> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
The maximum file size for ext3 depends on the block size you're using,
but I believe with the default settings it is 2TB max, not 2GB. I have
personally had files as large as 26GB on ext3 without issues.
You could always pipe the whole operation through gzip/bzip2/lzma
or similar if you want the file on disk to be much smaller.
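For example, reusing the awk program posted earlier in the thread (a
sketch - adjust the file names to taste):

awk -v n=10 -f program.awk datafile.csv | gzip > awkDataOut.csv.gz
zcat awkDataOut.csv.gz | head

so the uncompressed output never has to sit on disk.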
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-23 9:57 UTC
To: gentoo-user
On Monday 23 February 2009, 00:31, Mark Knecht wrote:
> Yeah, that's probably almost usable as it is. I tried it with n=3 and
> n=10. Worked both times just fine. The initial issue might be (as with
> Willie's sed code) that the first line wasn't quite right and required
> some hand editing. I'd prefer not to have to hand edit anything as the
> files are large and that step will be slow. I can work on that.
But then could you paste an example of such a line, so we can see it? The
first line was not special in the sample you posted...
> As per the message to Willie it would be nice to be able to drop
> columns out but technically I suppose it's not really required. All of
> this is going into another program which must at some level understand
> what the columns are. If I have extra dates and don't use them that's
> probably workable.
Anyway, it's not difficult to add that feature:
BEGIN { FS=OFS=","}
{
  r=$NF; NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
    if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  sep=dt[n]=""; for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i; sep=OFS}
  sub("^([^,]*,){"dropcol"}","")
  s[n]=$0
  if(NR>=n)printf "%s,%s\n", s[n],r
}
There is a new variable "dropcol" which contains the number of columns to
drop. Also, for the above to work, you must add the --re-interval
command line switch to awk, eg
awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv
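With the three Date/Time sample lines you posted and n=3, dropcol=2,
that gives:

$ awk --re-interval -v n=3 -v dropcol=2 -f program.awk datafile.csv
Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

ie, date/time are kept only from the first line of each group.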
> The down side is the output file is 10x larger than the input file -
> roughly - and my current input files are 40-60MB so the output files
> will be 600MB. Not huge but if they grew too much more I might get
> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
That is strange, the output file could be bigger but not by that
factor...if you don't mind, again could you paste a sample input file
(maybe just some lines, to get an idea...)?
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-23 16:05 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 3461 bytes --]
On Mon, Feb 23, 2009 at 1:57 AM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 00:31, Mark Knecht wrote:
>
>> Yeah, that's probably almost usable as it is. I tried it with n=3 and
>> n=10. Worked both times just fine. The initial issue might be (as with
>> Willie's sed code) that the first line wasn't quite right and required
>> some hand editing. I'd prefer not to have to hand edit anything as the
>> files are large and that step will be slow. I can work on that.
>
> But then could you paste an example of such a line, so we can see it? The
> first line was not special in the sample you posted...
>
>> As per the message to Willie it would be nice to be able to drop
>> columns out but technically I suppose it's not really required. All of
>> this is going into another program which must at some level understand
>> what the columns are. If I have extra dates and don't use them that's
>> probably workable.
>
> Anyway, it's not difficult to add that feature:
>
> BEGIN { FS=OFS=","}
> {
> r=$NF;NF--
> for(i=1;i<n;i++){
> s[i]=s[i+1]
> dt[i]=dt[i+1]
> if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
> if(NR>=n)printf "%s%s",s[i],OFS
> }
> sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
> sub("^([^,]*,){"dropcol"}","")
> s[n]=$0
> if(NR>=n)printf "%s,%s\n", s[n],r
> }
>
> There is a new variable "dropcol" which contains the number of columns to
> drop. Also, for the above to work, you must add the --re-interval
> command line switch to awk, eg
>
> awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv
Thanks. I'll give that a try later today. I also like Willie's idea
about using cut. That seems pretty flexible without any programming.
>
>> The down side is the output file is 10x larger than the input file -
>> roughly - and my current input files are 40-60MB so the output files
>> will be 600MB. Not huge but if they grew too much more I might get
>> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
>
> That is strange, the output file could be bigger but not by that
> factor...if you don't mind, again could you paste a sample input file
> (maybe just some lines, to get an idea...)?
>
>
I'm attaching a small (100 line) data file out of TradeStation. Zipped
it's about 2K. It should expand to about 10K. When I run the command
to get 10 lines put together it works correctly and gives me a file
with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
No mangling of the first line - that must have been something earlier
I guess. Sorry for the confusion on that front.
One other item has come up as I start to play with this farther down
the tool chain. I want to use this data in either R or RapidMiner to
data mine for patterns. Both of those tools are easier to use if the
first line in the file has column titles. I had originally asked
TradeStation not to output the column titles, but if I do include them
then for the first line of our new file I should actually copy the
first line of the input file N times. Something like
For i=1; read line, write N times, write \n
and then
for i>=2 do what we're doing right now.
After I did that I could run it through cut and drop whatever columns
I need to drop, I think... ;-)
This is great help from you all. As someone who doesn't really program
or use the command line too much it's a big advantage. Thanks!
Cheers,
Mark
[-- Attachment #2: awkDataIn.csv.bz2 --]
[-- Type: application/x-bzip2, Size: 1929 bytes --]
[-- Attachment #3: awkScript1.awk --]
[-- Type: application/octet-stream, Size: 153 bytes --]
BEGIN { FS=OFS=","}
{
  r=$NF; NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  s[n]=$0; if(NR>=n)printf "%s,%s\n", s[n],r
}
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-23 22:18 UTC
To: gentoo-user
On Monday 23 February 2009, 17:05, Mark Knecht wrote:
> I'm attaching a small (100 line) data file out of TradeStation. Zipped
> it's about 2K. It should expand to about 10K. When I run the command
> to get 10 lines put together it works correctly and gives me a file
> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>
> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>
> No mangling of the first line - that must have been something earlier
> I guess. Sorry for the confusion on that front.
>
> One other item has come up as I start to play with this farther down
> the tool chain. I want to use this data in either R or RapidMiner to
> data mine for patterns. Both of those tools are easier to use if the
> first line in the file has column titles. I had originally asked
> TradeStation not to output the column titles, but if I do include them
> then for the first line of our new file I should actually copy the
> first line of the input file N times. Something like
>
> For i=1; read line, write N times, write \n
>
> and then
>
> for i>=2 do what we're doing right now.
That is actually accomplished just by adding a bit of code:
BEGIN {FS=OFS=","}
NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0; sep=OFS}; print ""} # header
NR>=2{
  r=$NF; NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
    if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
    if(NR>=n+1)printf "%s%s",s[i],OFS
  }
  sep=dt[n]=""; for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i; sep=OFS}
  sub("^([^,]*,){"dropcol"}","")
  s[n]=$0
  if(NR>=n+1)printf "%s,%s\n", s[n],r
}
Note that no column is dropped from the header. If you need to do that,
just tell us how you want to do that.
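For example, with an abstract header line and n=3, dropcol=2:

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4

$ awk --re-interval -v n=3 -v dropcol=2 -f program.awk file.csv
D,T,a,b,c,d,R,D,T,a,b,c,d,R,D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4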
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-24 2:26 UTC
To: gentoo-user
On Mon, Feb 23, 2009 at 2:18 PM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 17:05, Mark Knecht wrote:
>
>> I'm attaching a small (100 line) data file out of TradeStation. Zipped
>> it's about 2K. It should expand to about 10K. When I run the command
>> to get 10 lines put together it works correctly and gives me a file
>> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>>
>> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>>
>> No mangling of the first line - that must have been something earlier
>> I guess. Sorry for the confusion on that front.
>>
>> One other item has come up as I start to play with this farther down
>> the tool chain. I want to use this data in either R or RapidMiner to
>> data mine for patterns. Both of those tools are easier to use if the
>> first line in the file has column titles. I had originally asked
>> TradeStation not to output the column titles, but if I do include them
>> then for the first line of our new file I should actually copy the
>> first line of the input file N times. Something like
>>
>> For i=1; read line, write N times, write \n
>>
>> and then
>>
>> for i>=2 do what we're doing right now.
>
> That is actually accomplished just by adding a bit of code:
>
> BEGIN {FS=OFS=","}
>
> NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print""} # header
> NR>=2{
> r=$NF;NF--
> for(i=1;i<n;i++){
> s[i]=s[i+1]
> dt[i]=dt[i+1]
> if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
> if(NR>=n+1)printf "%s%s",s[i],OFS
> }
> sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
> sub("^([^,]*,){"dropcol"}","")
> s[n]=$0
> if(NR>=n+1)printf "%s,%s\n", s[n],r
> }
>
> Note that no column is dropped from the header. If you need to do that,
> just tell us how you want to do that.
>
Thanks. That's a good add.
If I drop columns - and I do need to - then something like how cut
works would be good, but it needs to repeat across all the rows being
used. For instance, if I'm dropping columns 6 & 12 from a 20 column
wide data set, then I'm dropping 6 & 12 from all N lines. This is
where using cut after the line is built is difficult as I'm forced to
figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
doing that. If I could say something like "Drop 6 & 12 from all rows,
and 1 & 2 from all rows higher than the first that make up this new
line" then that would be great. That's a lot to ask though.
D1,T1,A1,B1,C1,D1,
D2,T2,A2,B2,C2,D2,
D3,T3,A3,B3,C3,D3,
D4,T4,A4,B4,C4,D4,
D5,T5,A5,B5,C5,D5,
In the data above if I drop column A, then I drop it for all rows.
(For instance, A contains 0 and isn't necessary, etc.) Assuming 3
wide I'd get
D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
Making that completely flexible - where I can drop 4 or 5 random
columns - is probably a bit too much work. On the other hand maybe
sending it to cut as part of the whole process, line by line or
something, is more reasonable? I don't know.
I found a web site to study awk so I'm starting to see more or less
how your example works when I have the code in front of me. Creating
the code out of thin air might be a bit of a stretch for me at this
point though.
thanks for your help!
Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-24 10:56 UTC
To: gentoo-user
On Tuesday 24 February 2009, 03:26, Mark Knecht wrote:
> If I drop columns - and I do need to - then something like how cut
> works would be good, but it needs to repeat across all the rows being
> used. For instance, if I'm dropping columns 6 & 12 from a 20 column
> wide data set, then I'm dropping 6 & 12 from all N lines. This is
> where using cut after the line is built is difficult as I'm forced to
> figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
> doing that. If I could say something like "Drop 6 & 12 from all rows,
> and 1 & 2 from all rows higher than the first that make up this new
> line" then that would be great. That's a lot to ask though.
>
> D1,T1,A1,B1,C1,D1,
> D2,T2,A2,B2,C2,D2,
> D3,T3,A3,B3,C3,D3,
> D4,T4,A4,B4,C4,D4,
> D5,T5,A5,B5,C5,D5,
>
> In the data above if I drop column A, then I drop it for all rows.
> (For instance, A contains 0 and isn't necessary, etc.) Assuming 3
> wide I'd get
>
> D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
> D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
> D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
>
> Making that completely flexible - where I can drop 4 or 5 random
> columns - is probably a bit too much work. On the other hand maybe
> sending it to cut as part of the whole process, line by line or
> something, is more reasonable? I don't know.
The current "dropcol" variable drops fields from the beginning of line.
Doing that for arbitrary columns can be done, but requires an array
where to save the numbers of the columns to drop.
So, in my understanding this is what we want to accomplish so far:
given an input of the form
D1,T1,a1,b1,c1,d1,...,R1
D2,T2,a2,b2,c2,d2,...,R2
D3,T3,a3,b3,c3,d3,...,R3
D4,T4,a4,b4,c4,d4,...,R4
D5,T5,a5,b5,c5,d5,...,R5
(the ... mean that an arbitrary number of columns can follow)
You want to group lines by n at a time, keeping the D and T column from
the first line of each group, and keeping the R column from the last
line of the group, so for example with n=3 we would have:
D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
(and you're right, that produces an output that is roughly n times the
size of the original file)
Now, in addition to that, you also want to drop an arbitrary number of
columns in the a,b,c... group. So for example, you want to drop columns
2 and 3 (b and c in the example), so you'd end up with something like
D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
Please confirm that my understanding is correct, so I can come up with
some code to do that.
> I found a web site to study awk so I'm starting to see more or less
> how your example works when I have the code in front of me. Creating
> the code out of thin air might be a bit of a stretch for me at this
> point though.
I suggest you start from
http://www.gnu.org/software/gawk/manual/gawk.html
really complete, but gradual so you can have an easy start and move on to
the complexities later.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-24 14:41 UTC
To: gentoo-user
On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
<SNIP>
>
> So, in my understanding this is what we want to accomplish so far:
>
> given an input of the form
>
> D1,T1,a1,b1,c1,d1,...,R1
> D2,T2,a2,b2,c2,d2,...,R2
> D3,T3,a3,b3,c3,d3,...,R3
> D4,T4,a4,b4,c4,d4,...,R4
> D5,T5,a5,b5,c5,d5,...,R5
>
> (the ... mean that an arbitrary number of columns can follow)
>
> You want to group lines by n at a time, keeping the D and T column from
> the first line of each group, and keeping the R column from the last
> line of the group, so for example with n=3 we would have:
>
> D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>
> (and you're right, that produces an output that is roughly n times the
> size of the original file)
>
> Now, in addition to that, you also want to drop an arbitrary number of
> columns in the a,b,c... group. So for example, you want to drop columns
> 2 and 3 (b and c in the example), so you'd end up with something like
>
> D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>
> Please confirm that my understanding is correct, so I can come up with
> some code to do that.
Perfectly correct for all the data rows.
For the header I now see that we have a slightly harder job. What we'd
need to do is read the first line of the file, duplicate it N times,
and then drop the same columns as we drop in the rows. The problem is
that now I have the same header value for N columns which won't make
sense to the tool that uses this data. If we could read the header and
then automatically append the number N to each duplicated name (or some
string like _N), that would solve it.
Maybe better would be a separate small program to do the header part
and then this program could read that header and make it the first
line of the output file. My worry is that when this data file becomes
very large - say 1GB or more of data - I probably cannot open the file
with vi to edit the header. Better if I could put the header in its
own file. That file would be 1 line long. I could check it for the
name edits, make sure it's right, and then the program you are so
kindly building would just read it, cut out columns, and put it at the
start of the new large file.
Does that make sense?
>
>> I found a web site to study awk so I'm starting to see more or less
>> how your example works when I have the code in front of me. Creating
>> the code out of thin air might be a bit of a stretch for me at this
>> point though.
>
> I suggest you start from
>
> http://www.gnu.org/software/gawk/manual/gawk.html
>
> really complete, but gradual so you can have an easy start and move on to
> the complexities later.
>
>
Yes, very complete. A good reference. Thanks!
Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-24 17:48 UTC
To: gentoo-user
On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote: <SNIP>
>
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an arbitrary number of columns can follow)
> >
> > You want to group lines by n at a time, keeping the D and T column
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So for example, you want to drop
> > columns 2 and 3 (b and c in the example), so you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically append the number N to each duplicated name (or some
> string like _N), that would solve it.
So in the last example the header would be something like
D,T,a,b,c,d,...,R
in the original file, and would become
D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
and
D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
respectively for the two sample outputs above.
> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in its
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?
Maybe, but while we're at it my personal preference would be to just add
header handling to the existing program. Here's the revised code (also
cleaned up a bit and more structured):
# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}
# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)
BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}
# process the header
NR==1{
  l=$1 OFS $2
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}
NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }
  dt[n]=$1 OFS $2
  s[n]=do_line(-1)
  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}'
I think two examples are enough here to demonstrate how it works (blank
lines between commands added for clarity - remember first line is header
line):
$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6
$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6
The program still assumes that the first two columns and last column must be
excluded from the general processing, and just prepended and appended to
the output lines. That can be changed though.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Mark Knecht @ 2009-02-24 22:51 UTC
To: gentoo-user
On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu <shrdlu@unlimitedmail.org> wrote:
> On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
>> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
>> <shrdlu@unlimitedmail.org> wrote: <SNIP>
>>
>> > So, in my understanding this is what we want to accomplish so far:
>> >
>> > given an input of the form
>> >
>> > D1,T1,a1,b1,c1,d1,...,R1
>> > D2,T2,a2,b2,c2,d2,...,R2
>> > D3,T3,a3,b3,c3,d3,...,R3
>> > D4,T4,a4,b4,c4,d4,...,R4
>> > D5,T5,a5,b5,c5,d5,...,R5
>> >
>> > (the ... mean that an arbitrary number of columns can follow)
>> >
>> > You want to group lines by n at a time, keeping the D and T column
>> > from the first line of each group, and keeping the R column from the
>> > last line of the group, so for example with n=3 we would have:
>> >
>> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
>> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
>> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>> >
>> > (and you're right, that produces an output that is roughly n times
>> > the size of the original file)
>> >
>> > Now, in addition to that, you also want to drop an arbitrary number
>> > of columns in the a,b,c... group. So for example, you want to drop
>> > columns 2 and 3 (b and c in the example), so you'd end up with
>> > something like
>> >
>> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
>> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
>> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>> >
>> > Please confirm that my understanding is correct, so I can come up
>> > with some code to do that.
>>
>> Perfectly correct for all the data rows.
>>
>> For the header I now see that we have a slightly harder job. What we'd
>> need to do is read the first line of the file, duplicate it N times,
>> and then drop the same columns as we drop in the rows. The problem is
>> that now I have the same header value for N columns which won't make
>> sense to the tool that uses this data. If we could read the header and
>> then automatically append the number N to each duplicated name (or some
>> string like _N), that would solve it.
>
> So in the last example the header would be something like
>
> D,T,a,b,c,d,...,R
>
> in the original file, and would become
>
> D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
>
> and
>
> D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
>
> respectively for the two sample outputs above.
>
>> Maybe better would be a separate small program to do the header part
>> and then this program could read that header and make it the first
>> line of the output file. My worry is that when this data file becomes
>> very large - say 1GB or more of data - I probably cannot open the file
>> with vi to edit the header. Better if I could put the header in its
>> own file. That file would be 1 line long. I could check it for the
>> name edits, make sure it's right, and then the program you are so
>> kindly building would just read it, cut out columns, and put it at the
>> start of the new large file.
>>
>> Does that make sense?
>
> Maybe, but while we're at it my personal preference would be to just add
> header handling to the existing program. Here's the revised code (also
> cleaned up a bit and more structured):
>
> # returns a ",a1,b1,c1,.." line
> # drops unwanted columns
> function do_line(num) {
> line=""
> for(j=3;j<NF;j++)
> if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
> return line
> }
>
> # dcols is a string like '2,3,4,7' with a comma separated values list of
> # the columns to drop (first data column after date/time is column 1
> # here)
>
> BEGIN {FS=OFS=","
> t=split(dcols,tmp,/,/)
> for(i=1;i<=t;i++)dc[tmp[i]]
> }
>
> # process the header
> NR==1{
> l=$1 OFS $2;
> for(i=1;i<=n;i++)l=l do_line(i)
> l=l OFS $NF
> print l
> }
>
> NR>=2{
> for(i=1;i<n;i++){
> s[i]=s[i+1]
> dt[i]=dt[i+1]
> }
>
> dt[n]=$1 OFS $2
> s[n]=do_line(-1)
>
> if(NR>=n+1){
> l=dt[1]
> for(i=1;i<=n;i++)l=l s[i]
> l=l OFS $NF
> print l
> }
> }'
>
> I think two examples are enough here to demonstrate how it works (blank
> lines between commands added for clarity - remember first line is header
> line):
>
> $ cat file.csv
> D,T,a,b,c,d,R
> D1,T1,a1,b1,c1,d1,R1
> D2,T2,a2,b2,c2,d2,R2
> D3,T3,a3,b3,c3,d3,R3
> D4,T4,a4,b4,c4,d4,R4
> D5,T5,a5,b5,c5,d5,R5
> D6,T6,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -f program.awk file.csv
> D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
> D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
> D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
> D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
> D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
> D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
> D1,T1,b1,d1,b2,d2,b3,d3,R3
> D2,T2,b2,d2,b3,d3,b4,d4,R4
> D3,T3,b3,d3,b4,d4,b5,d5,R5
> D4,T4,b4,d4,b5,d5,b6,d6,R6
>
> The program still assumes that the first two columns and last column must be
> excluded from the general processing, and just prepended and appended to
> the output lines. That can be changed though.
>
>
Excellent! Worked identically for me once I removed an extra single
quote after the last brace that somehow crept into your copy/paste in
this email. Really good stuff.
I tried it on a 40MB data file from TradeStation with 6 years of 5
minute Dow Futures data. Going 10 wide and deleting 1 column it
generated a 292MB output file in about 1 minute on my machine. That's
more than acceptable performance. Data mining will likely take hours
if not days. Certainly more than a few minutes.
Looks like I'm running into one more problem and then I'm ready to
give it a try for real. Unfortunately one vendor platform is putting
quotes around the names in the header row so your _N increment looks
like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
I'm pretty sure that the way we have it won't be acceptable, but I
don't know whether it would be best to have the quotes or not have the
quotes. My two target data mining platforms are R, which is in
portage, and RapidMiner which is available as Open Source from the
Rapid-i web site. I'll try it both ways with both header formats and
see what happens.
I had worried about checking the header on a really large file to see
if I had cut the correct columns but it turns out that
cat awkDataOut.csv | more
in a terminal writes the first few lines very quickly. From there I
can either just look at it or copy/paste into a new csv file, load it
into something like Open Office Calc and make sure I got the right
columns so I don't think there's any practical need to do anything
more with the header other than whatever turns out to be the right
answer with the quotes. My worry had been that when I request 5 data
columns it's not obvious what order they are provided so I'd have to
look at the file and figure out where everything was. Turns out it's
not such a big deal.
This has been extremely helpful. I really am embarrassed that you did
all the work. I owe you at least a beer or two (or whatever your
preferred drink might be) should we ever have a chance to exercise our
right arms together one day. I'll drop you a note back when I figure
out what makes sense about the header quotes.
Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
From: Etaoin Shrdlu @ 2009-02-25 10:27 UTC
To: gentoo-user
On Tuesday 24 February 2009, 23:51, Mark Knecht wrote:
> Looks like I'm running into one more problem and then I'm ready to
> give it a try for real. Unfortunately one vendor platform is putting
> quotes around the names in the header row so your _N increment looks
> like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
> I'm pretty sure that the way we have it won't be acceptable, but I
> don't know whether it would be best to have the quotes or not have the
> quotes. My two target data mining platforms are R, which is in
> portage, and RapidMiner which is available as Open Source from the
> Rapid-i web site. I'll try it both ways with both header formats and
> see what happens.
Ok, in any case that is a minor fix and adjusting the program is no big
deal.
> I had worried about checking the header on a really large file to see
> if I had cut the correct columns but it turns out that
>
> cat awkDataOut.csv | more
>
> in a terminal writes the first few lines very quickly. From there I
> can either just look at it or copy/paste into a new csv file, load it
> into something like Open Office Calc and make sure I got the right
> columns so I don't think there's any practical need to do anything
> more with the header other than whatever turns out to be the right
> answer with the quotes. My worry had been that when I request 5 data
> columns it's not obvious what order they are provided so I'd have to
> look at the file and figure out where everything was. Turns out it's
> not such a big deal.
The currently implemented rule is as follows: if you request to not have,
say, columns 2 and 5 (out of, say, a total of 5 in the original file -
besides date/time and result), you get the columns in the same order
they were in the original file, minus the ones you don't want, so in
this example you will get columns 1, 3 and 4 in that order.