It might be that your hard drive is not that much slower than memory,
then, but I really doubt that ...or it could mean that loading the gzip
binary is much slower than loading cat - and that one is highly
probable; I mean the file size of the gzip executable itself. Actually,
it's elementary logic that decompressing should be faster than just
loading - less data has to come off the disk. What I personally used
was *much* faster than without compression, but that was a C++
application which kept the compressed data in memory at all times, and
the data was also very inefficiently stored to begin with. To see what
I mean, try this test: create a file and write the character "a" into
it about 10,000,000 times to get it that big, then run the same timing
on that file (a sketch in code follows after Alec's reply below). I
think it's likely that decompressing the resulting file will be really
fast. Anyway, you have still made me think that, at first, no zip
should be used :) - just because your tests took in several new
variables, like the speed of reading the decompression utility from
disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves
to just doing.

2008/12/2 Alec Warner

> On Tue, Dec 2, 2008 at 4:42 AM, Tambet wrote:
> > About zipping.. Default settings might not really be a good idea -
> > I think that "fastest" might even be better. Considering that the
> > portage tree contains the same words again and again (like
> > "applications"), it needs a pretty small dictionary to make it much
> > smaller. Decompressing will not mean reading from disk,
> > decompressing and writing back to disk, as it probably did in your
> > case - try decompressing to a memory drive and you might get better
> > numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
>
> I tried again with gzip -1 and it still takes 29ms to decompress
> (even with gzip -1) where a bare read takes 26ms. (I have a 2.6GHz
> X2, which is probably relevant to gzip decompression speed.)
>
> > I have personally used compression in one C++ application, and with
> > optimum settings it made things much faster - those were files
> > where I had, for example, 65536 16-byte integers, which could be
> > zeros and mostly were; I didn't care about creating a better file
> > format, but just compressed the whole thing.
>
> I'm not saying compression won't make the index smaller. I'm saying
> making the index smaller does not improve performance. If you have a
> 10 meg file and you make it 1 meg, you do not increase performance,
> because (on average) you are not saving enough time reading the
> smaller file; you pay it back when decompressing the smaller file
> later.
>
> > I suggest you compress the esearch db, then decompress it to a
> > memory drive and give us those numbers - it might be considerably
> > faster.
>
> gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1)
> takes on average (6 trials, dropped caches between trials) 35.1666ms.
>
> cat esearchdb.py > /dev/null (uncompressed) takes, on average of 6
> trials, 24ms.
>
> The point is you use compression when you need to save space (sending
> data over the network, or storing large amounts of data or a lot of
> something). The index isn't going to be big (if it is bigger than 20
> or 30 meg I'll be surprised), the index isn't going over the network,
> and there is only 1 index, not, say, a million indexes (where
> compression might actually be useful for some kind of LRU subset of
> indexes to meet disk requirements).
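The test Tambet proposes is easy to script. Here is a minimal sketch in
Python (2.5-era style, to match the gzip docs linked further down); the
file names are made up, and for a cold-cache comparison like Alec's you
would drop the page cache between writing and timing:

    import gzip
    import time

    # Build the highly redundant file Tambet describes: the character
    # "a" written about 10,000,000 times. gzip shrinks this to a few
    # kilobytes, so it is the best possible case for compression.
    data = "a" * 10000000
    f = open("redundant.txt", "wb")
    f.write(data)
    f.close()

    gz = gzip.open("redundant.txt.gz", "wb", 1)  # level 1 = "fastest"
    gz.write(data)
    gz.close()

    # For a cold-cache run, drop caches here first (as root):
    #   echo 3 > /proc/sys/vm/drop_caches
    def timed_read(opener, path):
        start = time.time()
        opener(path, "rb").read()
        return (time.time() - start) * 1000.0    # milliseconds

    print "raw read:  %.1f ms" % timed_read(open, "redundant.txt")
    print "gzip read: %.1f ms" % timed_read(gzip.open, "redundant.txt.gz")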
>
> Anyway this is all moot since, as you stated so well earlier,
> optimization comes last, so stop trying to do it now ;)
>
> -Alec
>
> > http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> > support. Try opening the esearch db with that and with a normal
> > open; also compress it with the same lib to get the right kind of
> > file.
> >
> > Anyway - maybe this compression should be added later, and be
> > optional.
> >
> > Tambet - technique evolves to art, art evolves to magic, magic
> > evolves to just doing.
> >
> > 2008/12/2 Alec Warner
> >>
> >> On Mon, Dec 1, 2008 at 4:20 PM, Tambet wrote:
> >> > 2008/12/2 Emma Strubell
> >> >>
> >> >> True, true. Like I said, I don't really use overlays, so excuse
> >> >> my ignorance.
> >> >
> >> > Do you know the proper order of doing things?
> >> >
> >> > Rules of Optimization:
> >> >
> >> > Rule 1: Don't do it.
> >> > Rule 2 (for experts only): Don't do it yet.
> >> >
> >> > What this actually means: functionality comes first. Readability
> >> > comes next. Optimization comes last. Unless you are creating a
> >> > fancy 3D engine for a kung fu game.
> >> >
> >> > If you are going to exclude overlays, you are removing
> >> > functionality - and, indeed, absolutely has-to-be-there
> >> > functionality, because no one would intuitively expect the search
> >> > function to search only one subset of packages, however
> >> > reasonable that subset might be. So you can't, just can't, add
> >> > this package into the portage base - you could write just another
> >> > external search package for portage.
> >> >
> >> > I looked at this code a bit, and:
> >> > Portage's "__init__.py" contains the comment "# search
> >> > functionality". After this comment there is a nice and simple
> >> > search class.
> >> > It also contains the method "def action_sync(...)", which holds
> >> > the synchronization stuff.
> >> >
> >> > Now, the search class is initialized by setting up 3 databases -
> >> > porttree, bintree and vartree, whatever those are. Those end up
> >> > in the self._dbs array, and porttree also in self._portdb.
> >> >
> >> > It contains some more methods:
> >> > _findname(...) returns the result of self._portdb.findname(...)
> >> > with the same parameters, or None if it does not exist.
> >> > The other methods do similar things - each maps to one method or
> >> > another.
> >> > execute does the real search... Now - "for package in
> >> > self.portdb.cp_all()" is important here ...it currently loops
> >> > over the whole portage tree, and all kinds of matching are done
> >> > inside the loop.
> >> > self.portdb obviously points to porttree.py (unless it points to
> >> > a fake tree).
> >> > cp_all takes all porttrees and does a simple file search inside
> >> > them. This method is where the optional index search should go.
> >> >
> >> > self.porttrees = [self.porttree_root] + \
> >> >     [os.path.realpath(t) for t in
> >> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >> >
> >> > So, self.porttrees contains the list of trees - the first of them
> >> > is the root, the others are overlays.
> >> >
> >> > Now, what you have to do will not be any harder just because
> >> > overlay search is included, too.
> >> >
> >> > You have to create a method def cp_index(self), which returns a
> >> > dictionary with package names as keys. For oroot... it will be
> >> > "self.porttrees[1:]", not "self.porttrees" - that searches only
> >> > the overlays. d = {} will be replaced with d = self.cp_index()
> >> > (a sketch follows below).
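A rough sketch of what that cp_index() could look like - to be clear,
the index path and the pickled-dict layout are invented here for
illustration; portage has no such file:

    import cPickle
    import os

    INDEX_PATH = "/var/cache/edb/pkg-index.pickle"  # hypothetical

    def cp_index(self):
        """Return a dict keyed by "category/package" names, or None
        when no index has been built (callers fall back to cp_all)."""
        if not os.path.exists(INDEX_PATH):
            return None
        f = open(INDEX_PATH, "rb")
        try:
            return cPickle.load(f)
        finally:
            f.close()

    # In search.execute(), instead of d = {}:
    #     d = self.cp_index()
    #     if d is None:
    #         d = {}  # no index - walk self.porttrees as before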
> >> > If the index is not there, the old version will be used (thus,
> >> > you have to make an internal porttrees variable which contains
> >> > all the trees, or all except the first).
> >> >
> >> > The other methods used by search are xmatch and aux_get - the
> >> > first is used several times and the last is used to get the
> >> > description. You have to cache the results of those specific
> >> > queries and make the methods use your cache - as you can see,
> >> > those parts of portage are already able to use overlays. Thus,
> >> > you again put your code at the beginning of those functions -
> >> > create index_xmatch and index_aux_get methods, have the originals
> >> > call them and return their results unless those are None (or
> >> > something else, in case None is already a legal result); if they
> >> > return None, the old code runs and does its job. If the index has
> >> > not been created, the result is None. In the index_* methods,
> >> > just check whether the query is one you can answer, and if it is,
> >> > answer it (this wrapper pattern is sketched in code after this
> >> > message).
> >> >
> >> > Obviously, the simplest way to create your index is to delete the
> >> > old index, then use those same methods to query for all the
> >> > necessary information - and the fastest way would be to add
> >> > updating of the index directly into sync, which you could do
> >> > later.
> >> >
> >> > Please also add commands to turn the index on and off (the latter
> >> > should also delete it to save disk space). The default should be
> >> > off until it's fast, small and reliable. Also notice that if the
> >> > index is kept on the hard drive, it might be faster if it's
> >> > compressed (gz, for example) - decompressing takes less time, and
> >> > more processing power, than reading it out in full.
> >>
> >> I'm pretty sure you're mistaken here, unless your index is stored
> >> on a floppy or something really slow.
> >>
> >> A disk read has 2 primary costs.
> >>
> >> Seek Time: time for the head to seek to the sector of disk you
> >> want.
> >> Spin Time: time for the platter to spin around such that the sector
> >> you want is under the read head.
> >>
> >> Spin Time is based on rpm, so on average 7200 rpm / 60 seconds =
> >> 120 rotations per second; worst case (you just passed the sector
> >> you need) you wait 1/120th of a second (about 8ms).
> >>
> >> Seek Time varies per hard drive, but most drives have average seek
> >> times under 10ms.
> >>
> >> So it takes on average 18ms to get to your data, and then you start
> >> reading. The index will not be that large (my esearchdb is 2 megs,
> >> but let's assume 10MB for this compressed index).
> >>
> >> I took a 10MB sqlite database and compressed it with gzip (default
> >> settings) down to 5 megs.
> >> gzip -d on the database takes 300ms; catting the decompressed
> >> database takes 88ms (average of 5 runs, dropping disk caches
> >> between runs).
> >>
> >> I then tried my vdb_metadata.pickle from
> >> /var/cache/edb/vdb_metadata.pickle
> >>
> >> 1.3 megs compresses to 390k.
> >>
> >> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> >> from disk.
> >>
> >> Your index would have to be very large or very fragmented on disk
> >> (requiring more than one seek) to see a significant gain from
> >> compression (gzip scales linearly).
> >>
> >> In short, don't compress the index ;p
> >>
> >> >
> >> > Have luck!
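Tambet's wrapper idea, sketched in code. Hedged as before: self._index
and self._old_aux_get are hypothetical names for illustration; only the
query-a-list-of-fields shape of aux_get is taken from portage:

    def index_aux_get(self, mycpv, wants):
        """Try to answer an aux_get query from the index. Return None
        when there is no index, the package is unknown, or a requested
        field is not cached - the caller then falls back."""
        if self._index is None:
            return None
        entry = self._index.get(mycpv)
        if entry is None:
            return None
        try:
            return [entry[key] for key in wants]
        except KeyError:
            return None

    def aux_get(self, mycpv, wants):
        result = self.index_aux_get(mycpv, wants)
        if result is not None:
            return result
        return self._old_aux_get(mycpv, wants)  # original slow path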
> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> Hash: SHA1
> >> >>>
> >> >>> Emma Strubell schrieb:
> >> >>> > 2) does anyone really need to search an overlay anyway?
> >> >>>
> >> >>> Of course. Take large (semi-)official overlays like sunrise.
> >> >>> They can easily be seen as a second portage tree.
> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >> >>>
> >> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >> >>> =+lCO
> >> >>> -----END PGP SIGNATURE-----
> >> >>>
> >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann
> >> >> wrote: