From: Etaoin Shrdlu
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Wed, 25 Feb 2009 11:27:43 +0100

On Tuesday 24 February 2009, 23:51, Mark Knecht wrote:

> Looks like I'm running into one more problem and then I'm ready to
> give it a try for real.
> Unfortunately one vendor platform is putting
> quotes around the names in the header row so your _N increment looks
> like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
> I'm pretty sure that the way we have it won't be acceptable, but I
> don't know whether it would be best to have the quotes or not have the
> quotes. My two target data mining platforms are R, which is in
> portage, and RapidMiner which is available as Open Source from the
> Rapid-i web site. I'll try it both ways with both header formats and
> see what happens.

Ok, in any case that is a minor fix and adjusting the program is no big
deal.

> I had worried about checking the header on a really large file to see
> if I had cut the correct columns but it turns out that
>
> cat awkDataOut.csv | more
>
> in a terminal writes the first few lines very quickly. From there I
> can either just look at it or copy/paste into a new csv file, load it
> into something like Open Office Calc and make sure I got the right
> columns so I don't think there's any practical need to do anything
> more with the header other than whatever turns out to be the right
> answer with the quotes. My worry had been that when I request 5 data
> columns it's not obvious what order they are provided so I'd have to
> look at the file and figure out where everything was. Turns out it's
> not such a big deal.

The currently implemented rule is as follows: if you request to not
have, say, columns 2 and 5 (out of, say, a total of 5 in the original
file - besides date/time and result), you get the columns in the same
order they were in the original file, minus the ones you don't want, so
in this example you will get columns 1, 3 and 4 in that order.
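
To make the two points above concrete, here is a minimal sketch in Python. It is not the awk program from this thread, only an illustration of the rules as described. The function names kept_columns and suffixed_name, and the keep_quotes switch, are made up for the example. It also assumes the column numbers count only the data columns (date/time and result left out) and that a quoted header name is wrapped in plain double quotes.

```python
def kept_columns(total, excluded):
    """Return the 1-based data column numbers that survive, in original order.

    `total` is the number of data columns in the input file and `excluded`
    lists the ones the user asked to drop (both are illustrative parameters).
    """
    excluded = set(excluded)
    return [c for c in range(1, total + 1) if c not in excluded]


def suffixed_name(name, n, keep_quotes=False):
    """Append _n to a header name, handling names wrapped in double quotes.

    The suffix is always attached to the bare name, so '"High"' never turns
    into '"High"_4'. With keep_quotes=True the quotes are put back around
    the suffixed name; otherwise they are dropped.
    """
    quoted = len(name) >= 2 and name[0] == '"' and name[-1] == '"'
    bare = name[1:-1] if quoted else name
    result = f"{bare}_{n}"
    return f'"{result}"' if quoted and keep_quotes else result


print(kept_columns(5, [2, 5]))                       # [1, 3, 4]
print(suffixed_name('"High"', 4))                    # High_4
print(suffixed_name('"High"', 4, keep_quotes=True))  # "High_4"
```

Whether the quotes should be kept or dropped depends on what R and RapidMiner accept, so the sketch leaves that as a switch rather than picking one.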