It might be that your hard drive is not that much slower than memory,
then, but I really doubt that ...or it could mean that loading the gzip
binary is much slower than loading cat - and that one is highly
probable; I mean the file size of the gzip executable itself. Actually,
it's elementary logic that decompressing should be faster than just
loading - less data has to come off the disk. What I personally used
was *much* faster than without compression, but that was a C++
application which kept the compressed data in memory at all times, and
the data was also very inefficiently stored to begin with. To see what
I mean, try this test: create a file and write the character "a" into
it about 10,000,000 times to get it that big, then run the same timing
on that file (a sketch in code follows after Alec's reply below). I
think it's likely that decompressing the resulting file will be really
fast. Anyway, you have still made me think that, at first, no zip
should be used :) - just because your tests took in several new
variables, like the speed of reading the decompression utility from
disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves
to just doing.

2008/12/2 Alec Warner

> On Tue, Dec 2, 2008 at 4:42 AM, Tambet wrote:
> > About zipping.. Default settings might not really be a good idea -
> > I think that "fastest" might even be better. Considering that the
> > portage tree contains the same words again and again (like
> > "applications"), it needs a pretty small dictionary to make it much
> > smaller. Decompressing will not mean reading from disk,
> > decompressing and writing back to disk, as it probably did in your
> > case - try decompressing to a memory drive and you might get better
> > numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
>
> I tried again with gzip -1 and it still takes 29ms to decompress
> (even with gzip -1) where a bare read takes 26ms. (I have a 2.6GHz
> X2, which is probably relevant to gzip decompression speed.)
>
> > I have personally used compression in one C++ application, and with
> > optimum settings it made things much faster - those were files
> > where I had, for example, 65536 16-byte integers, which could be
> > zeros and mostly were; I didn't care about creating a better file
> > format, but just compressed the whole thing.
>
> I'm not saying compression won't make the index smaller. I'm saying
> making the index smaller does not improve performance. If you have a
> 10 meg file and you make it 1 meg, you do not increase performance,
> because (on average) you are not saving enough time reading the
> smaller file; you pay it back when decompressing the smaller file
> later.
>
> > I suggest you compress the esearch db, then decompress it to a
> > memory drive and give us those numbers - it might be considerably
> > faster.
>
> gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1)
> takes on average (6 trials, dropped caches between trials) 35.1666ms.
>
> cat esearchdb.py > /dev/null (uncompressed) takes, on average of 6
> trials, 24ms.
>
> The point is you use compression when you need to save space (sending
> data over the network, or storing large amounts of data or a lot of
> something). The index isn't going to be big (if it is bigger than 20
> or 30 meg I'll be surprised), the index isn't going over the network,
> and there is only 1 index, not, say, a million indexes (where
> compression might actually be useful for some kind of LRU subset of
> indexes to meet disk requirements).
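The test Tambet proposes is easy to script. Here is a minimal sketch in
Python (2.5-era style, to match the gzip docs linked further down); the
file names are made up, and for a cold-cache comparison like Alec's you
would drop the page cache between writing and timing:

    import gzip
    import time

    # Build the highly redundant file Tambet describes: the character
    # "a" written about 10,000,000 times. gzip shrinks this to a few
    # kilobytes, so it is the best possible case for compression.
    data = "a" * 10000000
    f = open("redundant.txt", "wb")
    f.write(data)
    f.close()

    gz = gzip.open("redundant.txt.gz", "wb", 1)  # level 1 = "fastest"
    gz.write(data)
    gz.close()

    # For a cold-cache run, drop caches here first (as root):
    #   echo 3 > /proc/sys/vm/drop_caches
    def timed_read(opener, path):
        start = time.time()
        opener(path, "rb").read()
        return (time.time() - start) * 1000.0    # milliseconds

    print "raw read:  %.1f ms" % timed_read(open, "redundant.txt")
    print "gzip read: %.1f ms" % timed_read(gzip.open, "redundant.txt.gz")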
>
> Anyway this is all moot since, as you stated so well earlier,
> optimization comes last, so stop trying to do it now ;)
>
> -Alec
>
> > http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> > support. Try opening the esearch db with that and with a normal
> > open; also compress it with the same lib to get the right kind of
> > file.
> >
> > Anyway - maybe this compression should be added later, and be
> > optional.
> >
> > Tambet - technique evolves to art, art evolves to magic, magic
> > evolves to just doing.
> >
> > 2008/12/2 Alec Warner
> >>
> >> On Mon, Dec 1, 2008 at 4:20 PM, Tambet wrote:
> >> > 2008/12/2 Emma Strubell
> >> >>
> >> >> True, true. Like I said, I don't really use overlays, so excuse
> >> >> my ignorance.
> >> >
> >> > Do you know the proper order of doing things?
> >> >
> >> > Rules of Optimization:
> >> >
> >> > Rule 1: Don't do it.
> >> > Rule 2 (for experts only): Don't do it yet.
> >> >
> >> > What this actually means: functionality comes first. Readability
> >> > comes next. Optimization comes last. Unless you are creating a
> >> > fancy 3D engine for a kung fu game.
> >> >
> >> > If you are going to exclude overlays, you are removing
> >> > functionality - and, indeed, absolutely has-to-be-there
> >> > functionality, because no one would intuitively expect the search
> >> > function to search only one subset of packages, however
> >> > reasonable that subset might be. So you can't, just can't, add
> >> > this package into the portage base - you could write just another
> >> > external search package for portage.
> >> >
> >> > I looked at this code a bit, and:
> >> > Portage's "__init__.py" contains the comment "# search
> >> > functionality". After this comment there is a nice and simple
> >> > search class.
> >> > It also contains the method "def action_sync(...)", which holds
> >> > the synchronization stuff.
> >> >
> >> > Now, the search class is initialized by setting up 3 databases -
> >> > porttree, bintree and vartree, whatever those are. Those end up
> >> > in the self._dbs array, and porttree also in self._portdb.
> >> >
> >> > It contains some more methods:
> >> > _findname(...) returns the result of self._portdb.findname(...)
> >> > with the same parameters, or None if it does not exist.
> >> > The other methods do similar things - each maps to one method or
> >> > another.
> >> > execute does the real search... Now - "for package in
> >> > self.portdb.cp_all()" is important here ...it currently loops
> >> > over the whole portage tree, and all kinds of matching are done
> >> > inside the loop.
> >> > self.portdb obviously points to porttree.py (unless it points to
> >> > a fake tree).
> >> > cp_all takes all porttrees and does a simple file search inside
> >> > them. This method is where the optional index search should go.
> >> >
> >> > self.porttrees = [self.porttree_root] + \
> >> >     [os.path.realpath(t) for t in
> >> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >> >
> >> > So, self.porttrees contains the list of trees - the first of them
> >> > is the root, the others are overlays.
> >> >
> >> > Now, what you have to do will not be any harder just because
> >> > overlay search is included, too.
> >> >
> >> > You have to create a method def cp_index(self), which returns a
> >> > dictionary with package names as keys. For oroot... it will be
> >> > "self.porttrees[1:]", not "self.porttrees" - that searches only
> >> > the overlays. d = {} will be replaced with d = self.cp_index()
> >> > (a sketch follows below).
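A rough sketch of what that cp_index() could look like - to be clear,
the index path and the pickled-dict layout are invented here for
illustration; portage has no such file:

    import cPickle
    import os

    INDEX_PATH = "/var/cache/edb/pkg-index.pickle"  # hypothetical

    def cp_index(self):
        """Return a dict keyed by "category/package" names, or None
        when no index has been built (callers fall back to cp_all)."""
        if not os.path.exists(INDEX_PATH):
            return None
        f = open(INDEX_PATH, "rb")
        try:
            return cPickle.load(f)
        finally:
            f.close()

    # In search.execute(), instead of d = {}:
    #     d = self.cp_index()
    #     if d is None:
    #         d = {}  # no index - walk self.porttrees as before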
> >> > If the index is not there, the old version will be used (thus,
> >> > you have to make an internal porttrees variable which contains
> >> > all the trees, or all except the first).
> >> >
> >> > The other methods used by search are xmatch and aux_get - the
> >> > first is used several times and the last is used to get the
> >> > description. You have to cache the results of those specific
> >> > queries and make the methods use your cache - as you can see,
> >> > those parts of portage are already able to use overlays. Thus,
> >> > you again put your code at the beginning of those functions -
> >> > create index_xmatch and index_aux_get methods, have the originals
> >> > call them and return their results unless those are None (or
> >> > something else, in case None is already a legal result); if they
> >> > return None, the old code runs and does its job. If the index has
> >> > not been created, the result is None. In the index_* methods,
> >> > just check whether the query is one you can answer, and if it is,
> >> > answer it (this wrapper pattern is sketched in code after this
> >> > message).
> >> >
> >> > Obviously, the simplest way to create your index is to delete the
> >> > old index, then use those same methods to query for all the
> >> > necessary information - and the fastest way would be to add
> >> > updating of the index directly into sync, which you could do
> >> > later.
> >> >
> >> > Please also add commands to turn the index on and off (the latter
> >> > should also delete it to save disk space). The default should be
> >> > off until it's fast, small and reliable. Also notice that if the
> >> > index is kept on the hard drive, it might be faster if it's
> >> > compressed (gz, for example) - decompressing takes less time, and
> >> > more processing power, than reading it out in full.
> >>
> >> I'm pretty sure you're mistaken here, unless your index is stored
> >> on a floppy or something really slow.
> >>
> >> A disk read has 2 primary costs.
> >>
> >> Seek Time: time for the head to seek to the sector of disk you
> >> want.
> >> Spin Time: time for the platter to spin around such that the sector
> >> you want is under the read head.
> >>
> >> Spin Time is based on rpm, so on average 7200 rpm / 60 seconds =
> >> 120 rotations per second; worst case (you just passed the sector
> >> you need) you wait 1/120th of a second (about 8ms).
> >>
> >> Seek Time varies per hard drive, but most drives have average seek
> >> times under 10ms.
> >>
> >> So it takes on average 18ms to get to your data, and then you start
> >> reading. The index will not be that large (my esearchdb is 2 megs,
> >> but let's assume 10MB for this compressed index).
> >>
> >> I took a 10MB sqlite database and compressed it with gzip (default
> >> settings) down to 5 megs.
> >> gzip -d on the database takes 300ms; catting the decompressed
> >> database takes 88ms (average of 5 runs, dropping disk caches
> >> between runs).
> >>
> >> I then tried my vdb_metadata.pickle from
> >> /var/cache/edb/vdb_metadata.pickle
> >>
> >> 1.3 megs compresses to 390k.
> >>
> >> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> >> from disk.
> >>
> >> Your index would have to be very large or very fragmented on disk
> >> (requiring more than one seek) to see a significant gain from
> >> compression (gzip scales linearly).
> >>
> >> In short, don't compress the index ;p
> >>
> >> >
> >> > Have luck!
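Tambet's wrapper idea, sketched in code. Hedged as before: self._index
and self._old_aux_get are hypothetical names for illustration; only the
query-a-list-of-fields shape of aux_get is taken from portage:

    def index_aux_get(self, mycpv, wants):
        """Try to answer an aux_get query from the index. Return None
        when there is no index, the package is unknown, or a requested
        field is not cached - the caller then falls back."""
        if self._index is None:
            return None
        entry = self._index.get(mycpv)
        if entry is None:
            return None
        try:
            return [entry[key] for key in wants]
        except KeyError:
            return None

    def aux_get(self, mycpv, wants):
        result = self.index_aux_get(mycpv, wants)
        if result is not None:
            return result
        return self._old_aux_get(mycpv, wants)  # original slow path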
> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> Hash: SHA1
> >> >>>
> >> >>> Emma Strubell schrieb:
> >> >>> > 2) does anyone really need to search an overlay anyway?
> >> >>>
> >> >>> Of course. Take large (semi-)official overlays like sunrise.
> >> >>> They can easily be seen as a second portage tree.
> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >> >>>
> >> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >> >>> =+lCO
> >> >>> -----END PGP SIGNATURE-----
> >> >>>
> >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann
> >> >> wrote: