* [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-22 19:06 UTC
  To: gentoo-user

Hi,
   Very off topic other than I'd do this on my Gentoo box prior to
using R on my Gentoo box. Please ignore if not of interest.

   I've got a really big data file in essentially a *.csv format.
(comma delimited) I need to scan this file and create a new output
file. I'm wondering if there is a reasonably easy command line way of
doing this using something like sed or awk which I know nothing about.
Thanks in advance.

The basic idea goes something like this:

1) The input file might look like the following where some of it is
attributes (shown as letters) and other parts are results (shown as
numbers):

A,B,C,D,1
E,F,G,H,2
I,J,K,L,3
M,N,O,P,4
Q,R,S,T,5
U,V,W,X,6

2) From the above data input file I want to take the attributes from a
few preceding lines (say 3 in this example) and write them to the
output file along with the result on the last of the 3 lines. The
output file might look like this:

A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6

3) This must be done as a read/process/write operation of some sort
because the input file may be far larger than system memory.
(Currently it isn't, but it likely will eventually be.)

4) In my example above I suggested that there is a single result but
there may be more than one. (Don't know yet.) I showed 3 lines but
might be doing 10. I don't know. It's important to me to pick a
moderately flexible way of dealing with this as the order of columns
and number of results will likely change over time and I'll certainly
need to adjust.

Thanks in advance for any pointers. Happy to buy a good book if
someone knows what I should look for.

Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-22 20:15 UTC
  To: gentoo-user

On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
> Hi,
>    Very off topic other than I'd do this on my Gentoo box prior to
> using R on my Gentoo box. Please ignore if not of interest.
>
>    I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.
>
> The basic idea goes something like this:
>
> 1) The input file might look like the following where some of it is
> attributes (shown as letters) and other parts are results (shown as
> numbers):
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6

Are the results always in the last field, and only a single field?
Is the total number of fields per line always fixed?

> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6

Is the number of lines you pick for the operation always 3 or can it
vary? And, once you choose a number n of lines, should the whole file
be processed concatenating n lines at a time, with the resulting
single line ended by the result of the nth line? In other words, does
the following hold for the output format:

<concatenation of attributes of lines 1..n> <result of line n>
<concatenation of attributes of lines 2..n+1> <result of line n+1>
<concatenation of attributes of lines 3..n+2> <result of line n+2>
<concatenation of attributes of lines 4..n+3> <result of line n+3>

...

With answers to the above questions, it's probably possible to hack
together a solution.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-22 22:28 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 12:15 PM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
>> Hi,
>>    Very off topic other than I'd do this on my Gentoo box prior to
>> using R on my Gentoo box. Please ignore if not of interest.
>>
>>    I've got a really big data file in essentially a *.csv format.
>> (comma delimited) I need to scan this file and create a new output
>> file. I'm wondering if there is a reasonably easy command line way of
>> doing this using something like sed or awk which I know nothing about.
>> Thanks in advance.
>>
>> The basic idea goes something like this:
>>
>> 1) The input file might look like the following where some of it is
>> attributes (shown as letters) and other parts are results (shown as
>> numbers):
>>
>> A,B,C,D,1
>> E,F,G,H,2
>> I,J,K,L,3
>> M,N,O,P,4
>> Q,R,S,T,5
>> U,V,W,X,6
>
> Are the results always in the last field, and only a single field?
> Is the total number of fields per line always fixed?

I don't know that for certain yet but I think the results will not
always be in the last field. The total number of fields per line is
always fixed in a given file but might change from file to file. If it
does I'm willing to do minor edits (heck - I'll do major edits if I
have to!!) to get it working.

>> 2) From the above data input file I want to take the attributes from a
>> few preceding lines (say 3 in this example) and write them to the
>> output file along with the result on the last of the 3 lines. The
>> output file might look like this:
>>
>> A,B,C,D,E,F,G,H,I,J,K,L,3
>> E,F,G,H,I,J,K,L,M,N,O,P,4
>> I,J,K,L,M,N,O,P,Q,R,S,T,5
>> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> Is the number of lines you pick for the operation always 3 or can it
> vary? And, once you choose a number n of lines, should the whole file
> be processed concatenating n lines at a time, with the resulting
> single line ended by the result of the nth line? In other words, does
> the following hold for the output format:
>
> <concatenation of attributes of lines 1..n> <result of line n>
> <concatenation of attributes of lines 2..n+1> <result of line n+1>
> <concatenation of attributes of lines 3..n+2> <result of line n+2>
> <concatenation of attributes of lines 4..n+3> <result of line n+3>

The above diagram is correct when the lines chosen is 3. I suspect
that I might choose 10 or 15 lines once I get real data and do some
testing but that was harder to show in this email. A good design for
me would be a single variable I could set. Once a value is chosen I
want to process every line in the input file the same way. I don't use
5 lines sometimes and 10 lines other times. In a given file it's
always the same number of lines.

> ...
>
> With answers to the above questions, it's probably possible to hack
> together a solution.

Thanks!
- Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-22 22:57 UTC
  To: gentoo-user

On Sunday 22 February 2009, 23:28, Mark Knecht wrote:

> > <concatenation of attributes of lines 1..n> <result of line n>
> > <concatenation of attributes of lines 2..n+1> <result of line n+1>
> > <concatenation of attributes of lines 3..n+2> <result of line n+2>
> > <concatenation of attributes of lines 4..n+3> <result of line n+3>
>
> The above diagram is correct when the lines chosen is 3. I suspect
> that I might choose 10 or 15 lines once I get real data and do some
> testing but that was harder to show in this email. A good design for
> me would be a single variable I could set. Once a value is chosen I
> want to process every line in the input file the same way. I don't use
> 5 lines sometimes and 10 lines other times. In a given file it's
> always the same number of lines.

Ok, try this for a start:

BEGIN { FS=OFS=","}

{
r=$NF;NF--
for(i=1;i<n;i++){
  s[i]=s[i+1]
  if(NR>=n)printf "%s%s",s[i],OFS
}
s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
}

Save the above code in a file (eg, program.awk) and run it with

awk -v n=3 -f program.awk datafile.csv

where the "n=3" part is to be replaced with the actual number of lines
you want to group (eg, n=5, n=4, etc.)

With your sample input and n=3, the above awk program produces the
output you show.
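[Note: a quick way to check the above, assuming GNU awk - decrementing
NF to chop off the last field relies on gawk rebuilding $0, so a
strictly POSIX awk may reject it. With the sample input from the first
message saved as datafile.csv:

$ awk -v n=3 -f program.awk datafile.csv
A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6

The first n-1 input lines produce no output, since a full window of n
lines isn't available until line n.]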
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-22 23:31 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 2:57 PM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Sunday 22 February 2009, 23:28, Mark Knecht wrote:
>
>> > <concatenation of attributes of lines 1..n> <result of line n>
>> > <concatenation of attributes of lines 2..n+1> <result of line n+1>
>> > <concatenation of attributes of lines 3..n+2> <result of line n+2>
>> > <concatenation of attributes of lines 4..n+3> <result of line n+3>
>>
>> The above diagram is correct when the lines chosen is 3. I suspect
>> that I might choose 10 or 15 lines once I get real data and do some
>> testing but that was harder to show in this email. A good design for
>> me would be a single variable I could set. Once a value is chosen I
>> want to process every line in the input file the same way. I don't use
>> 5 lines sometimes and 10 lines other times. In a given file it's
>> always the same number of lines.
>
> Ok, try this for a start:
>
> BEGIN { FS=OFS=","}
>
> {
> r=$NF;NF--
> for(i=1;i<n;i++){
>   s[i]=s[i+1]
>   if(NR>=n)printf "%s%s",s[i],OFS
> }
> s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
> }
>
> Save the above code in a file (eg, program.awk) and run it with
>
> awk -v n=3 -f program.awk datafile.csv
>
> where the "n=3" part is to be replaced with the actual number of lines
> you want to group (eg, n=5, n=4, etc.)
>
> With your sample input and n=3, the above awk program produces the
> output you show.

Yeah, that's probably almost usable as it is. I tried it with n=3 and
n=10. Worked both times just fine. The initial issue might be (as with
Willie's sed code) that the first line wasn't quite right and required
some hand editing. I'd prefer not to have to hand edit anything as the
files are large and that step will be slow. I can work on that.

As per the message to Willie it would be nice to be able to drop
columns out but technically I suppose it's not really required. All of
this is going into another program which must at some level understand
what the columns are. If I have extra dates and don't use them that's
probably workable.

The down side is the output file is 10x larger than the input file -
roughly - and my current input files are 40-60MB so the output files
will be 600MB. Not huge but if they grew too much more I might get
beyond what a single file can be on ext3, right? Isn't that 2GB or so?

Thanks very much,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Paul Hartman @ 2009-02-23 6:17 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 5:31 PM, Mark Knecht <markknecht@gmail.com> wrote:
> The down side is the output file is 10x larger than the input file -
> roughly - and my current input files are 40-60MB so the output files
> will be 600MB. Not huge but if they grew too much more I might get
> beyond what a single file can be on ext3, right? Isn't that 2GB or so?

The maximum file size for ext3 depends on the block size you're using,
but I believe with the default settings it is 2TB max, not 2GB. I have
personally had files as large as 26GB on ext3 without issues.

You could always pipe the whole operation through gzip/bzip2/lzma or
similar if you want the file on disk to be much smaller.
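[Note: a sketch of that suggestion, reusing the file names that appear
later in the thread - the choice of compressor is arbitrary:

$ awk -v n=10 -f program.awk awkDataIn.csv | gzip > awkDataOut.csv.gz
$ zcat awkDataOut.csv.gz | head

The second command spot-checks the first few output lines without
decompressing the whole file to disk.]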
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-23 9:57 UTC
  To: gentoo-user

On Monday 23 February 2009, 00:31, Mark Knecht wrote:

> Yeah, that's probably almost usable as it is. I tried it with n=3 and
> n=10. Worked both times just fine. The initial issue might be (as with
> Willie's sed code) that the first line wasn't quite right and required
> some hand editing. I'd prefer not to have to hand edit anything as the
> files are large and that step will be slow. I can work on that.

But then could you paste an example of such a line, so we can see it?
The first line was not special in the sample you posted...

> As per the message to Willie it would be nice to be able to drop
> columns out but technically I suppose it's not really required. All of
> this is going into another program which must at some level understand
> what the columns are. If I have extra dates and don't use them that's
> probably workable.

Anyway, it's not difficult to add that feature:

BEGIN { FS=OFS=","}
{
r=$NF;NF--
for(i=1;i<n;i++){
  s[i]=s[i+1]
  dt[i]=dt[i+1]
  if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
  if(NR>=n)printf "%s%s",s[i],OFS
}
sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
sub("^([^,]*,){"dropcol"}","")
s[n]=$0
if(NR>=n)printf "%s,%s\n", s[n],r
}

There is a new variable "dropcol" which contains the number of columns
to drop. Also, for the above to work, you must add the --re-interval
command line switch to awk, eg

awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv

> The down side is the output file is 10x larger than the input file -
> roughly - and my current input files are 40-60MB so the output files
> will be 600MB. Not huge but if they grew too much more I might get
> beyond what a single file can be on ext3, right? Isn't that 2GB or so?

That is strange, the output file could be bigger but not by that
factor... if you don't mind, again could you paste a sample input file
(maybe just some lines, to get an idea...)?
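[Note: to illustrate the dropcol behavior, here is a run on a made-up
five-column sample, with the first two columns standing in for
date/time:

$ cat sample.csv
D1,T1,a1,b1,1
D2,T2,a2,b2,2
D3,T3,a3,b3,3
D4,T4,a4,b4,4

$ awk --re-interval -v n=3 -v dropcol=2 -f program.awk sample.csv
D1,T1,a1,b1,a2,b2,a3,b3,3
D2,T2,a2,b2,a3,b3,a4,b4,4

The first dropcol columns are stripped from every line of a group but
carried over once from the group's first line.]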
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-23 16:05 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 3461 bytes --]

On Mon, Feb 23, 2009 at 1:57 AM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 00:31, Mark Knecht wrote:
>
>> Yeah, that's probably almost usable as it is. I tried it with n=3 and
>> n=10. Worked both times just fine. The initial issue might be (as with
>> Willie's sed code) that the first line wasn't quite right and required
>> some hand editing. I'd prefer not to have to hand edit anything as the
>> files are large and that step will be slow. I can work on that.
>
> But then could you paste an example of such a line, so we can see it?
> The first line was not special in the sample you posted...
>
>> As per the message to Willie it would be nice to be able to drop
>> columns out but technically I suppose it's not really required. All of
>> this is going into another program which must at some level understand
>> what the columns are. If I have extra dates and don't use them that's
>> probably workable.
>
> Anyway, it's not difficult to add that feature:
>
> BEGIN { FS=OFS=","}
> {
> r=$NF;NF--
> for(i=1;i<n;i++){
>   s[i]=s[i+1]
>   dt[i]=dt[i+1]
>   if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
>   if(NR>=n)printf "%s%s",s[i],OFS
> }
> sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
> sub("^([^,]*,){"dropcol"}","")
> s[n]=$0
> if(NR>=n)printf "%s,%s\n", s[n],r
> }
>
> There is a new variable "dropcol" which contains the number of columns
> to drop. Also, for the above to work, you must add the --re-interval
> command line switch to awk, eg
>
> awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv

Thanks. I'll give that a try later today. I also like Willie's idea
about using cut. That seems pretty flexible without any programming.

>> The down side is the output file is 10x larger than the input file -
>> roughly - and my current input files are 40-60MB so the output files
>> will be 600MB. Not huge but if they grew too much more I might get
>> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
>
> That is strange, the output file could be bigger but not by that
> factor... if you don't mind, again could you paste a sample input file
> (maybe just some lines, to get an idea...)?

I'm attaching a small (100 line) data file out of TradeStation. Zipped
it's about 2K. It should expand to about 10K. When I run the command
to get 10 lines put together it works correctly and gives me a file
with 91 lines and about 100K in size. (I.e. - 10x on my disk.)

awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv

No mangling of the first line - that must have been something earlier
I guess. Sorry for the confusion on that front.

One other item has come up as I start to play with this farther down
the tool chain. I want to use this data in either R or RapidMiner to
data mine for patterns. Both of those tools are easier to use if the
first line in the file has column titles. I had originally asked
TradeStation not to output the column titles but if I do include them
then for the first line of our new file I should actually copy the
first line of the input file N times. Something like

For i=1; read line, write N times, write \n

and then

for i>=2 do what we're doing right now.

After I did that I could run it through cut and drop whatever columns
I need to drop, I think... ;-)

This is great help from you all. As someone who doesn't really program
or use the command line too much it's a big advantage. Thanks!

Cheers,
Mark

[-- Attachment #2: awkDataIn.csv.bz2 --]
[-- Type: application/x-bzip2, Size: 1929 bytes --]

[-- Attachment #3: awkScript1.awk --]
[-- Type: application/octet-stream, Size: 153 bytes --]

BEGIN { FS=OFS=","}

{
r=$NF;NF--
for(i=1;i<n;i++){
  s[i]=s[i+1]
  if(NR>=n)printf "%s%s",s[i],OFS
}
s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
}
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-23 22:18 UTC
  To: gentoo-user

On Monday 23 February 2009, 17:05, Mark Knecht wrote:

> I'm attaching a small (100 line) data file out of TradeStation. Zipped
> it's about 2K. It should expand to about 10K. When I run the command
> to get 10 lines put together it works correctly and gives me a file
> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>
> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>
> No mangling of the first line - that must have been something earlier
> I guess. Sorry for the confusion on that front.
>
> One other item has come up as I start to play with this farther down
> the tool chain. I want to use this data in either R or RapidMiner to
> data mine for patterns. Both of those tools are easier to use if the
> first line in the file has column titles. I had originally asked
> TradeStation not to output the column titles but if I do include them
> then for the first line of our new file I should actually copy the
> first line of the input file N times. Something like
>
> For i=1; read line, write N times, write \n
>
> and then
>
> for i>=2 do what we're doing right now.

That is actually accomplished just by adding a bit of code:

BEGIN {FS=OFS=","}

NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print""} # header
NR>=2{
r=$NF;NF--
for(i=1;i<n;i++){
  s[i]=s[i+1]
  dt[i]=dt[i+1]
  if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
  if(NR>=n+1)printf "%s%s",s[i],OFS
}
sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
sub("^([^,]*,){"dropcol"}","")
s[n]=$0
if(NR>=n+1)printf "%s,%s\n", s[n],r
}

Note that no column is dropped from the header. If you need to do
that, just tell us how you want to do that.
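[Note: with a hypothetical header row of D,T,a,b,c,d,R and n=3, the
first output line of this version would be the whole header repeated
three times,

D,T,a,b,c,d,R,D,T,a,b,c,d,R,D,T,a,b,c,d,R

with every column - date/time and result included - duplicated. The
_1/_2 suffixes and header-aware column dropping only arrive in the
revised program further down the thread.]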
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-24 2:26 UTC
  To: gentoo-user

On Mon, Feb 23, 2009 at 2:18 PM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 17:05, Mark Knecht wrote:
>
>> I'm attaching a small (100 line) data file out of TradeStation. Zipped
>> it's about 2K. It should expand to about 10K. When I run the command
>> to get 10 lines put together it works correctly and gives me a file
>> with 91 lines and about 100K in size. (I.e. - 10x on my disk.)
>>
>> awk -v n=10 -f awkScript1.awk awkDataIn.csv >awkDataOut.csv
>>
>> No mangling of the first line - that must have been something earlier
>> I guess. Sorry for the confusion on that front.
>>
>> One other item has come up as I start to play with this farther down
>> the tool chain. I want to use this data in either R or RapidMiner to
>> data mine for patterns. Both of those tools are easier to use if the
>> first line in the file has column titles. I had originally asked
>> TradeStation not to output the column titles but if I do include them
>> then for the first line of our new file I should actually copy the
>> first line of the input file N times. Something like
>>
>> For i=1; read line, write N times, write \n
>>
>> and then
>>
>> for i>=2 do what we're doing right now.
>
> That is actually accomplished just by adding a bit of code:
>
> BEGIN {FS=OFS=","}
>
> NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print""} # header
> NR>=2{
> r=$NF;NF--
> for(i=1;i<n;i++){
>   s[i]=s[i+1]
>   dt[i]=dt[i+1]
>   if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
>   if(NR>=n+1)printf "%s%s",s[i],OFS
> }
> sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
> sub("^([^,]*,){"dropcol"}","")
> s[n]=$0
> if(NR>=n+1)printf "%s,%s\n", s[n],r
> }
>
> Note that no column is dropped from the header. If you need to do
> that, just tell us how you want to do that.

Thanks. That's a good add.

If I drop columns - and I do need to - then something like how cut
works would be good, but it needs to repeat across all the rows being
used. For instance, if I'm dropping columns 6 & 12 from a 20 column
wide data set, then I'm dropping 6 & 12 from all N lines. This is
where using cut after the line is built is difficult as I'm forced to
figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
doing that. If I could say something like "Drop 6 & 12 from all rows,
and 1 & 2 from all rows higher than the first that make up this new
line" then that would be great. That's a lot to ask though.

D1,T1,A1,B1,C1,D1,
D2,T2,A2,B2,C2,D2,
D3,T3,A3,B3,C3,D3,
D4,T4,A4,B4,C4,D4,
D5,T5,A5,B5,C5,D5,

In the data above if I drop column A, then I drop it for all rows.
(For instance, A contains 0 and isn't necessary, etc.) Assuming 3
wide I'd get

D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5

Making that completely flexible - where I can drop 4 or 5 random
columns - is probably a bit too much work. On the other hand maybe
sending it to cut as part of the whole process, line by line or
something, is more reasonable? I don't know.

I found a web site to study awk so I'm starting to see more or less
how your example works when I have the code in front of me. Creating
the code out of thin air might be a bit of a stretch for me at this
point though.

Thanks for your help!

Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-24 10:56 UTC
  To: gentoo-user

On Tuesday 24 February 2009, 03:26, Mark Knecht wrote:

> If I drop columns - and I do need to - then something like how cut
> works would be good, but it needs to repeat across all the rows being
> used. For instance, if I'm dropping columns 6 & 12 from a 20 column
> wide data set, then I'm dropping 6 & 12 from all N lines. This is
> where using cut after the line is built is difficult as I'm forced to
> figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
> doing that. If I could say something like "Drop 6 & 12 from all rows,
> and 1 & 2 from all rows higher than the first that make up this new
> line" then that would be great. That's a lot to ask though.
>
> D1,T1,A1,B1,C1,D1,
> D2,T2,A2,B2,C2,D2,
> D3,T3,A3,B3,C3,D3,
> D4,T4,A4,B4,C4,D4,
> D5,T5,A5,B5,C5,D5,
>
> In the data above if I drop column A, then I drop it for all rows.
> (For instance, A contains 0 and isn't necessary, etc.) Assuming 3
> wide I'd get
>
> D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
> D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
> D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5
>
> Making that completely flexible - where I can drop 4 or 5 random
> columns - is probably a bit too much work. On the other hand maybe
> sending it to cut as part of the whole process, line by line or
> something, is more reasonable? I don't know.

The current "dropcol" variable drops fields from the beginning of the
line. Doing that for arbitrary columns can be done, but requires an
array in which to save the numbers of the columns to drop.

So, in my understanding this is what we want to accomplish so far:

given an input of the form

D1,T1,a1,b1,c1,d1,...,R1
D2,T2,a2,b2,c2,d2,...,R2
D3,T3,a3,b3,c3,d3,...,R3
D4,T4,a4,b4,c4,d4,...,R4
D5,T5,a5,b5,c5,d5,...,R5

(the ... mean that an arbitrary number of columns can follow)

You want to group lines by n at a time, keeping the D and T columns
from the first line of each group, and keeping the R column from the
last line of the group, so for example with n=3 we would have:

D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5

(and you're right, that produces an output that is roughly n times the
size of the original file)

Now, in addition to that, you also want to drop an arbitrary number of
columns in the a,b,c... group. So for example, if you drop columns 2
and 3 (b and c in the example), you'd end up with something like

D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
D3,T3,a3,d3,...a4,d4,...a5,d5,...R5

Please confirm that my understanding is correct, so I can come up with
some code to do that.

> I found a web site to study awk so I'm starting to see more or less
> how your example works when I have the code in front of me. Creating
> the code out of thin air might be a bit of a stretch for me at this
> point though.

I suggest you start from

http://www.gnu.org/software/gawk/manual/gawk.html

really complete, but gradual so you can have an easy start and move on
to the complexities later.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-24 14:41 UTC
  To: gentoo-user

On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
<SNIP>
>
> So, in my understanding this is what we want to accomplish so far:
>
> given an input of the form
>
> D1,T1,a1,b1,c1,d1,...,R1
> D2,T2,a2,b2,c2,d2,...,R2
> D3,T3,a3,b3,c3,d3,...,R3
> D4,T4,a4,b4,c4,d4,...,R4
> D5,T5,a5,b5,c5,d5,...,R5
>
> (the ... mean that an arbitrary number of columns can follow)
>
> You want to group lines by n at a time, keeping the D and T columns
> from the first line of each group, and keeping the R column from the
> last line of the group, so for example with n=3 we would have:
>
> D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>
> (and you're right, that produces an output that is roughly n times the
> size of the original file)
>
> Now, in addition to that, you also want to drop an arbitrary number of
> columns in the a,b,c... group. So for example, if you drop columns 2
> and 3 (b and c in the example), you'd end up with something like
>
> D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>
> Please confirm that my understanding is correct, so I can come up with
> some code to do that.

Perfectly correct for all the data rows.

For the header I now see that we have a slightly harder job. What we'd
need to do is read the first line of the file, duplicate it N times,
and then drop the same columns as we drop in the rows. The problem is
that now I have the same header value for N columns which won't make
sense to the tool that uses this data. If we could read the header and
then automatically append the number N to each duplicated name (or
some string like _N), that would do it.

Maybe better would be a separate small program to do the header part
and then this program could read that header and make it the first
line of the output file. My worry is that when this data file becomes
very large - say 1GB or more of data - I probably cannot open the file
with vi to edit the header. Better if I could put the header in its
own file. That file would be 1 line long. I could check it for the
name edits, make sure it's right, and then the program you are so
kindly building would just read it, cut out columns, and put it at the
start of the new large file.

Does that make sense?

> I suggest you start from
>
> http://www.gnu.org/software/gawk/manual/gawk.html
>
> really complete, but gradual so you can have an easy start and move on
> to the complexities later.

Yes, very complete. A good reference. Thanks!

Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-24 17:48 UTC
  To: gentoo-user

On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shrdlu@unlimitedmail.org> wrote:
> <SNIP>
>
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... mean that an arbitrary number of columns can follow)
> >
> > You want to group lines by n at a time, keeping the D and T columns
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So for example, if you drop
> > columns 2 and 3 (b and c in the example), you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that now I have the same header value for N columns which won't make
> sense to the tool that uses this data. If we could read the header and
> then automatically append the number N to each duplicated name (or
> some string like _N), that would do it.

So in the last example the header would be something like

D,T,a,b,c,d,...,R

in the original file, and would become

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.

> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in its
> own file. That file would be 1 line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?

Maybe, but while we're at it my personal preference would be to just
add header handling to the existing program. Here's the revised code
(also cleaned up a bit and more structured):

# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)

BEGIN {FS=OFS=","
  t=split(dcols,tmp,/,/)
  for(i=1;i<=t;i++)dc[tmp[i]]
}

# process the header
NR==1{
  l=$1 OFS $2;
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }

  dt[n]=$1 OFS $2
  s[n]=do_line(-1)

  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}'

I think two examples are enough here to demonstrate how it works
(blank lines between commands added for clarity - remember the first
line is the header line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6

The program still assumes that the first two columns and the last
column must be excluded from the general processing, and just
prepended and appended to the output lines. That can be changed
though.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-24 22:51 UTC
  To: gentoo-user

On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu
<shrdlu@unlimitedmail.org> wrote:
> On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
>> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
>> <shrdlu@unlimitedmail.org> wrote:
>> <SNIP>
>>
>> > So, in my understanding this is what we want to accomplish so far:
>> >
>> > given an input of the form
>> >
>> > D1,T1,a1,b1,c1,d1,...,R1
>> > D2,T2,a2,b2,c2,d2,...,R2
>> > D3,T3,a3,b3,c3,d3,...,R3
>> > D4,T4,a4,b4,c4,d4,...,R4
>> > D5,T5,a5,b5,c5,d5,...,R5
>> >
>> > (the ... mean that an arbitrary number of columns can follow)
>> >
>> > You want to group lines by n at a time, keeping the D and T columns
>> > from the first line of each group, and keeping the R column from the
>> > last line of the group, so for example with n=3 we would have:
>> >
>> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
>> > D2,T2,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
>> > D3,T3,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>> >
>> > Now, in addition to that, you also want to drop an arbitrary number
>> > of columns in the a,b,c... group. So for example, if you drop
>> > columns 2 and 3 (b and c in the example), you'd end up with
>> > something like
>> >
>> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
>> > D2,T2,a2,d2,...a3,d3,...a4,d4,...R4
>> > D3,T3,a3,d3,...a4,d4,...a5,d5,...R5
>> >
>> > Please confirm that my understanding is correct, so I can come up
>> > with some code to do that.
>>
>> Perfectly correct for all the data rows.
>>
>> For the header I now see that we have a slightly harder job. What we'd
>> need to do is read the first line of the file, duplicate it N times,
>> and then drop the same columns as we drop in the rows. The problem is
>> that now I have the same header value for N columns which won't make
>> sense to the tool that uses this data. If we could read the header and
>> then automatically append the number N to each duplicated name (or
>> some string like _N), that would do it.
>
> So in the last example the header would be something like
>
> D,T,a,b,c,d,...,R
>
> in the original file, and would become
>
> D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
>
> and
>
> D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
>
> respectively for the two sample outputs above.
>
>> Maybe better would be a separate small program to do the header part
>> and then this program could read that header and make it the first
>> line of the output file. My worry is that when this data file becomes
>> very large - say 1GB or more of data - I probably cannot open the file
>> with vi to edit the header. Better if I could put the header in its
>> own file. That file would be 1 line long. I could check it for the
>> name edits, make sure it's right, and then the program you are so
>> kindly building would just read it, cut out columns, and put it at the
>> start of the new large file.
>>
>> Does that make sense?
>
> Maybe, but while we're at it my personal preference would be to just
> add header handling to the existing program. Here's the revised code
> (also cleaned up a bit and more structured):
>
> # returns a ",a1,b1,c1,.." line
> # drops unwanted columns
> function do_line(num) {
>   line=""
>   for(j=3;j<NF;j++)
>     if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
>   return line
> }
>
> # dcols is a string like '2,3,4,7' with a comma separated values list of
> # the columns to drop (first data column after date/time is column 1
> # here)
>
> BEGIN {FS=OFS=","
>   t=split(dcols,tmp,/,/)
>   for(i=1;i<=t;i++)dc[tmp[i]]
> }
>
> # process the header
> NR==1{
>   l=$1 OFS $2;
>   for(i=1;i<=n;i++)l=l do_line(i)
>   l=l OFS $NF
>   print l
> }
>
> NR>=2{
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>   }
>
>   dt[n]=$1 OFS $2
>   s[n]=do_line(-1)
>
>   if(NR>=n+1){
>     l=dt[1]
>     for(i=1;i<=n;i++)l=l s[i]
>     l=l OFS $NF
>     print l
>   }
> }'
>
> I think two examples are enough here to demonstrate how it works
> (blank lines between commands added for clarity - remember the first
> line is the header line):
>
> $ cat file.csv
> D,T,a,b,c,d,R
> D1,T1,a1,b1,c1,d1,R1
> D2,T2,a2,b2,c2,d2,R2
> D3,T3,a3,b3,c3,d3,R3
> D4,T4,a4,b4,c4,d4,R4
> D5,T5,a5,b5,c5,d5,R5
> D6,T6,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -f program.awk file.csv
> D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
> D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
> D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
> D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
> D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
> D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
> D1,T1,b1,d1,b2,d2,b3,d3,R3
> D2,T2,b2,d2,b3,d3,b4,d4,R4
> D3,T3,b3,d3,b4,d4,b5,d5,R5
> D4,T4,b4,d4,b5,d5,b6,d6,R6
>
> The program still assumes that the first two columns and the last
> column must be excluded from the general processing, and just
> prepended and appended to the output lines. That can be changed
> though.

Excellent! Worked identically for me once I removed an extra single
quote after the last brace that somehow crept into your copy/paste in
this email. Really good stuff.

I tried it on a 40MB data file from TradeStation with 6 years of 5
minute Dow Futures data. Going 10 wide and deleting 1 column it
generated a 292MB output file in about 1 minute on my machine. That's
more than acceptable performance. Data mining will likely take hours
if not days. Certainly more than a few minutes.

Looks like I'm running into one more problem and then I'm ready to
give it a try for real. Unfortunately one vendor platform is putting
quotes around the names in the header row so your _N increment looks
like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
I'm pretty sure that the way we have it won't be acceptable, but I
don't know whether it would be best to have the quotes or not have the
quotes. My two target data mining platforms are R, which is in
portage, and RapidMiner which is available as Open Source from the
Rapid-i web site. I'll try it both ways with both header formats and
see what happens.

I had worried about checking the header on a really large file to see
if I had cut the correct columns but it turns out that

cat awkDataOut.csv | more

in a terminal writes the first few lines very quickly. From there I
can either just look at it or copy/paste into a new csv file, load it
into something like Open Office Calc and make sure I got the right
columns, so I don't think there's any practical need to do anything
more with the header other than whatever turns out to be the right
answer with the quotes. My worry had been that when I request 5 data
columns it's not obvious what order they are provided in, so I'd have
to look at the file and figure out where everything was. Turns out
it's not such a big deal.

This has been extremely helpful. I really am embarrassed that you did
all the work. I owe you at least a beer or two (or whatever your
preferred drink might be) should we ever have a chance to exercise our
right arms together one day.

I'll drop you a note back when I figure out what makes sense about the
header quotes.

Cheers,
Mark
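[Note: for that kind of spot check, another handy one-liner - using
the file name from the thread - prints the header with one numbered
column title per line:

$ head -1 awkDataOut.csv | tr ',' '\n' | nl | less

which makes it easy to verify which columns survived, without opening
the whole file.]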
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Etaoin Shrdlu @ 2009-02-25 10:27 UTC
  To: gentoo-user

On Tuesday 24 February 2009, 23:51, Mark Knecht wrote:

> Looks like I'm running into one more problem and then I'm ready to
> give it a try for real. Unfortunately one vendor platform is putting
> quotes around the names in the header row so your _N increment looks
> like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
> I'm pretty sure that the way we have it won't be acceptable, but I
> don't know whether it would be best to have the quotes or not have the
> quotes. My two target data mining platforms are R, which is in
> portage, and RapidMiner which is available as Open Source from the
> Rapid-i web site. I'll try it both ways with both header formats and
> see what happens.

Ok, in any case that is a minor fix and adjusting the program is no
big deal.

> I had worried about checking the header on a really large file to see
> if I had cut the correct columns but it turns out that
>
> cat awkDataOut.csv | more
>
> in a terminal writes the first few lines very quickly. From there I
> can either just look at it or copy/paste into a new csv file, load it
> into something like Open Office Calc and make sure I got the right
> columns, so I don't think there's any practical need to do anything
> more with the header other than whatever turns out to be the right
> answer with the quotes. My worry had been that when I request 5 data
> columns it's not obvious what order they are provided in, so I'd have
> to look at the file and figure out where everything was. Turns out
> it's not such a big deal.

The currently implemented rule is as follows: if you request to not
have, say, columns 2 and 5 (out of, say, a total of 5 in the original
file - besides date/time and result), you get the columns in the same
order they were in the original file, minus the ones you don't want.
So in this example you will get columns 1, 3 and 4, in that order.
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Willie Wong @ 2009-02-22 20:59 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
> I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.

Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/

Unfortunately I haven't used awk in the longest time and can't
remember how it will go. The following sed recipe may work, modulo
some small modifications.

> The basic idea goes something like this:
>
> 1) The input file might look like the following where some of it is
> attributes (shown as letters) and other parts are results (shown as
> numbers):
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
>
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> 3) This must be done as a read/process/write operation of some sort
> because the input file may be far larger than system memory.
> (Currently it isn't, but it likely will eventually be.)
>
> 4) In my example above I suggested that there is a single result but
> there may be more than one. (Don't know yet.) I showed 3 lines but
> might be doing 10. I don't know. It's important to me to pick a
> moderately flexible way of dealing with this as the order of columns
> and number of results will likely change over time and I'll certainly
> need to adjust.

First create the sedscript

sedscript1:
--------------------------
1 {
N
N
}
{
p
$d
N
D
}
--------------------------

The first block only hits when the first line of input is read. It
forces it to read the next two lines.

The second block hits for every pattern space: it prints the
three-line block, stops after the block that ends on the last input
line, and otherwise reads the next line and deletes the first line of
the block, sliding the window down by one.

Now create the sedscript

sedscript2:
--------------------------
{
N
N
s/,[^,]*\n/,/gp
d
}
--------------------------

This reads a three-line block at a time, removes the last field (and
the newline character) from all but the last line, replacing it with a
comma. Then it prints. And then it clears the pattern space.

So you can do

cat INPUT | sed -f sedscript1 | sed -f sedscript2

and that should give you what you want. Like I said, the whole thing
can probably be done a lot more eloquently in awk. But my awk-fu is
not what it used to be.

For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html

W
--
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51
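[Note: for completeness, a rough awk equivalent of the two-script sed
pipeline above - a sketch only, and the NF-- idiom for dropping the
last field assumes GNU awk:

awk -F, -v OFS=, '{ r=$NF; NF--; a[NR%3]=$0 }
    NR>=3 { print a[(NR-2)%3], a[(NR-1)%3], a[NR%3], r }' INPUT

A three-slot ring buffer keeps the last three result-stripped lines;
from the third line on, each input line emits one merged line ending
with that line's result.]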
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-22 23:15 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 12:59 PM, Willie Wong <wwong@princeton.edu> wrote:
> On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
>> I've got a really big data file in essentially a *.csv format.
>> (comma delimited) I need to scan this file and create a new output
>> file. I'm wondering if there is a reasonably easy command line way of
>> doing this using something like sed or awk which I know nothing about.
>> Thanks in advance.
>
> Definitely more than doable in sed or awk. If you want a reference
> book, try http://oreilly.com/catalog/9781565922259/
>
> Unfortunately I haven't used awk in the longest time and can't
> remember how it will go. The following sed recipe may work, modulo
> some small modifications.
>
>> The basic idea goes something like this:
>>
>> 1) The input file might look like the following where some of it is
>> attributes (shown as letters) and other parts are results (shown as
>> numbers):
>>
>> A,B,C,D,1
>> E,F,G,H,2
>> I,J,K,L,3
>> M,N,O,P,4
>> Q,R,S,T,5
>> U,V,W,X,6
>>
>> 2) From the above data input file I want to take the attributes from a
>> few preceding lines (say 3 in this example) and write them to the
>> output file along with the result on the last of the 3 lines. The
>> output file might look like this:
>>
>> A,B,C,D,E,F,G,H,I,J,K,L,3
>> E,F,G,H,I,J,K,L,M,N,O,P,4
>> I,J,K,L,M,N,O,P,Q,R,S,T,5
>> M,N,O,P,Q,R,S,T,U,V,W,X,6
>>
>> 3) This must be done as a read/process/write operation of some sort
>> because the input file may be far larger than system memory.
>> (Currently it isn't, but it likely will eventually be.)
>>
>> 4) In my example above I suggested that there is a single result but
>> there may be more than one. (Don't know yet.) I showed 3 lines but
>> might be doing 10. I don't know. It's important to me to pick a
>> moderately flexible way of dealing with this as the order of columns
>> and number of results will likely change over time and I'll certainly
>> need to adjust.
>
> First create the sedscript
>
> sedscript1:
> --------------------------
> 1 {
> N
> N
> }
> {
> p
> $d
> N
> D
> }
> --------------------------
>
> The first block only hits when the first line of input is read. It
> forces it to read the next two lines.
>
> The second block hits for every pattern space: it prints the
> three-line block, stops after the block that ends on the last input
> line, and otherwise reads the next line and deletes the first line of
> the block, sliding the window down by one.
>
> Now create the sedscript
>
> sedscript2:
> --------------------------
> {
> N
> N
> s/,[^,]*\n/,/gp
> d
> }
> --------------------------
>
> This reads a three-line block at a time, removes the last field (and
> the newline character) from all but the last line, replacing it with a
> comma. Then it prints. And then it clears the pattern space.
>
> So you can do
>
> cat INPUT | sed -f sedscript1 | sed -f sedscript2
>
> and that should give you what you want. Like I said, the whole thing
> can probably be done a lot more eloquently in awk. But my awk-fu is
> not what it used to be.
>
> For a quick reference for sed, try
> http://www.grymoire.com/Unix/Sed.html
>
> W
> --
> Ever stop to think, and forget to start again?
> Sortir en Pantoufles: up 807 days, 18:51

Thanks Willie. That's a good start. The first two lines out were
mangled but that's totally cool. I can deal with that by hand.

There are two places where I'd like to improve things which probably
apply to the awk code Etaoin just sent me also. Both have to do with
excluding columns but in different ways.

1) My actual input data starts with two fields, date & time. For
lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the
output corresponding to line 1, so these 3 lines:

Date1,Time1,A,B,C,D,0
Date2,Time2,E,F,G,H,1
Date3,Time3,I,J,K,L,2

should generate

Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

Essentially Date & Time from line 1, results from line 3.

2) The second is that possibly I don't need attribute G in my output
file. I'm thinking that possibly a 3rd sed script that counts a
certain number of commas and then doesn't copy up through the next
comma? That's messy in the sense that I probably need to drop 10-15
columns out as my real data is maybe 100 fields wide, so I'd have
10-15 additional scripts, which is too much of a hack to be
maintainable. Anyway, I appreciate the ideas. What you sent worked
great.

I suspect this is somehow similar to what you did in the second
script? I'll go play around and see if I can figure that out.

In reality I'm not sure yet whether the results can be guaranteed to
be at the end in the real file, and probably there will be more than
one result column although if I have to I might be able to take care
of combining two results into a single value at the data source if
necessary.

Great help! Thanks!

Cheers,
Mark
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Willie Wong @ 2009-02-23 0:57 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
> 1) My actual input data starts with two fields, date & time. For
> lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the
> output corresponding to line 1, so these 3 lines:
>
> Date1,Time1,A,B,C,D,0
> Date2,Time2,E,F,G,H,1
> Date3,Time3,I,J,K,L,2
>
> should generate
>
> Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
>
> Essentially Date & Time from line 1, results from line 3.
>
> 2) The second is that possibly I don't need attribute G in my output
> file. I'm thinking that possibly a 3rd sed script that counts a
> certain number of commas and then doesn't copy up through the next
> comma? That's messy in the sense that I probably need to drop 10-15
> columns out as my real data is maybe 100 fields wide, so I'd have
> 10-15 additional scripts, which is too much of a hack to be
> maintainable. Anyway, I appreciate the ideas. What you sent worked
> great.

For both of these cases, since you are dropping columns and not
re-organizing, you'd have a much easier time just piping the command
through "cut". Try 'man cut' (it is only a few hundred words) for
usage. But with the sample you gave me, you just need to post-process
with

.... | cut -d , -f 1-6,9,10,12,15-

and the Date2, Time2, G, Date3, Time3 columns will be dropped.

As to your problem with the first two lines being mangled: I suspect
that the first two lines were formatted differently? Maybe stray
control characters got into your file or maybe there are leading
spaces? It's bizarre for both Etaoin's and my scripts to
coincidentally mess up the same lines.

(Incidentally, where did you get the csv files from? When I worked in
a physics lab and collected data, I found that a lot of times the
processing of data using basic command-line tools like sed, bash,
perl, and bc can be done a lot more quickly if the initial datasets
were formatted in a sensible fashion. Of course there are times when
such luxury cannot be afforded.)

Best,

W
--
"What's the Lagrangian for a suction dart?"
~DeathMech, Some Student. P-town PHY 205
Sortir en Pantoufles: up 807 days, 23:29
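[Note: walking that field list through the merged line from the
earlier example (after the second sed script has already stripped the
intermediate results):

$ echo 'Date1,Time1,A,B,C,D,Date2,Time2,E,F,G,H,Date3,Time3,I,J,K,L,2' \
    | cut -d , -f 1-6,9,10,12,15-
Date1,Time1,A,B,C,D,E,F,H,I,J,K,L,2

Fields 7-8 (Date2,Time2), 11 (G) and 13-14 (Date3,Time3) are dropped;
field 15 onward is kept.]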
* Re: [gentoo-user] [OT] - command line read *.csv & create new file
  From: Mark Knecht @ 2009-02-23 1:54 UTC
  To: gentoo-user

On Sun, Feb 22, 2009 at 4:57 PM, Willie Wong <wwong@princeton.edu> wrote:
> On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
>> 1) My actual input data starts with two fields, date & time. For
>> lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the
>> output corresponding to line 1, so these 3 lines:
>>
>> Date1,Time1,A,B,C,D,0
>> Date2,Time2,E,F,G,H,1
>> Date3,Time3,I,J,K,L,2
>>
>> should generate
>>
>> Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
>>
>> Essentially Date & Time from line 1, results from line 3.
>>
>> 2) The second is that possibly I don't need attribute G in my output
>> file. I'm thinking that possibly a 3rd sed script that counts a
>> certain number of commas and then doesn't copy up through the next
>> comma? That's messy in the sense that I probably need to drop 10-15
>> columns out as my real data is maybe 100 fields wide, so I'd have
>> 10-15 additional scripts, which is too much of a hack to be
>> maintainable. Anyway, I appreciate the ideas. What you sent worked
>> great.
>
> For both of these cases, since you are dropping columns and not
> re-organizing, you'd have a much easier time just piping the command
> through "cut". Try 'man cut' (it is only a few hundred words) for
> usage. But with the sample you gave me, you just need to post-process
> with
>
> .... | cut -d , -f 1-6,9,10,12,15-
>
> and the Date2, Time2, G, Date3, Time3 columns will be dropped.

Thanks. I'll investigate that tomorrow.

> As to your problem with the first two lines being mangled: I suspect
> that the first two lines were formatted differently? Maybe stray
> control characters got into your file or maybe there are leading
> spaces? It's bizarre for both Etaoin's and my scripts to
> coincidentally mess up the same lines.
>
> (Incidentally, where did you get the csv files from? When I worked in
> a physics lab and collected data, I found that a lot of times the
> processing of data using basic command-line tools like sed, bash,
> perl, and bc can be done a lot more quickly if the initial datasets
> were formatted in a sensible fashion. Of course there are times when
> such luxury cannot be afforded.)

They are primarily coming from TradeStation. The data that I'm working
with is stock pricing data along with technical indicators coming off
of charts. Unfortunately I don't seem to have any control at all over
the order in which the columns show up. It doesn't seem to be based on
how I build the chart, and certain things on the chart I don't need
are still output to the file. It's pretty much take 100% of what's on
the chart or take nothing. Fortunately the csv files are very good in
terms of not dropping out data. At least every row has all the data.

Cheers,
Mark

> Best,
>
> W
> --
> "What's the Lagrangian for a suction dart?"
> ~DeathMech, Some Student. P-town PHY 205
> Sortir en Pantoufles: up 807 days, 23:29