From: Tambet <qtvali@gmail.com>
To: gentoo-portage-dev@lists.gentoo.org
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 2 Dec 2008 23:47:20 +0200
Message-ID: <cea53e3c0812021347k21138fcet66d9dfd19a195886@mail.gmail.com>
In-Reply-To: <b41005390812021154kf3e0494k37bd3c54cc0e316d@mail.gmail.com>
It might be that your hard drive is not that much slower than memory, but I
really doubt it ...or it could mean that reading through gzip is much slower
than plain cat - and that is highly probable; I mean the cost of loading the
gzip binary itself.

In principle, decompressing can beat plain loading: if the compressed file
is much smaller, the disk-read time saved can outweigh the CPU cost of
decompression. What I personally used was *much* faster than without
compression, but that was a C++ application that kept the zip in memory at
all times, and the data was stored highly inefficiently to begin with.
To see what I mean, try this test: make a file and write the character "a"
into it about 10,000,000 times, then run the same timing comparison on that
file. Such a file compresses to almost nothing, so I think it's probable
that decompressing it will be really fast.
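
For example, a quick sketch of that experiment in Python (using the gzip
module whose docs are linked further down in this thread; file names are
just examples, and unlike your trials the OS cache is not dropped here):

import gzip, time

# Build a highly compressible ~10 MB test file.
data = "a" * 10000000
f = open("test.txt", "w")
f.write(data)
f.close()
g = gzip.open("test.txt.gz", "wb", 1)   # compresslevel 1 = "fastest"
g.write(data)
g.close()

def timed_read(opener, path):
    start = time.time()
    opener(path).read()
    return (time.time() - start) * 1000.0   # milliseconds

print "plain read: %.1f ms" % timed_read(open, "test.txt")
print "gzip read:  %.1f ms" % timed_read(gzip.open, "test.txt.gz")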

Anyway, you have still made me think that, at first, no zip should be used
:) - just because your tests pulled in several new variables, like the time
it takes to load the decompression utility itself from disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.
2008/12/2 Alec Warner <antarus@gentoo.org>
> On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@gmail.com> wrote:
> > About zipping.. The default settings might not really be a good idea -
> > I think "fastest" might be even better. Considering that the portage
> > tree contains the same words again and again (like "applications"), it
> > needs only a pretty small dictionary to become much smaller.
> > Decompressing will not be reading from disc, decompressing and writing
> > back to disc, as it probably was in your case - try decompressing to a
> > memory drive and you might get better numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
>
> I tried again with gzip -1 and it still takes 29ms to decompress, where
> a bare read takes 26ms. (I have a 2.6GHz X2, which is probably relevant
> to gzip decompression speed.)
>
> >
> > I have personally used compression in one C++ application and, with
> > optimum settings, it made things much faster - those were files where
> > I had, for example, 65536 16-byte integers, which could be zeros and
> > mostly were; I didn't care about creating a better file format, but
> > just compressed the whole thing.
>
> I'm not saying compression won't make the index smaller. I'm saying
> making the index smaller does not improve performance. If you have a
> 10 meg file and you make it 1 meg, you do not increase performance,
> because (on average) the time you save reading the smaller file is
> paid back decompressing it later.
>
> >
> > I suggest you compress the esearch db, then decompress it to a memory
> > drive and give us those numbers - it might be considerably faster.
>
> gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes
> 35.1666ms on average (6 trials, disk caches dropped between trials).
>
> cat esearchdb.py > /dev/null (uncompressed) takes 24ms on average over
> 6 trials.
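>
> For the record, the trials can be reproduced with something along these
> lines (the cache drop needs root; file names as above):
>
> import os, subprocess, time
>
> def drop_caches():
>     # Ensure each trial reads from disk, not the page cache.
>     os.system("sync")
>     open("/proc/sys/vm/drop_caches", "w").write("3")
>
> def time_cmd(cmd):
>     start = time.time()
>     subprocess.call(cmd, shell=True, stdout=open(os.devnull, "w"))
>     return (time.time() - start) * 1000.0
>
> for cmd in ["gzip -d -c esearchdb.py.gz", "cat esearchdb.py"]:
>     drop_caches()
>     print "%s: %.1f ms" % (cmd, time_cmd(cmd))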
>
> The point is you use compression when you need to save space (sending
> data over the network, storing large amounts of data, or storing many
> copies of something). The index isn't going to be big (if it is bigger
> than 20 or 30 meg I'll be surprised), the index isn't going over the
> network, and there is only 1 index - not, say, a million indexes (where
> compression might actually be useful for keeping some kind of LRU
> subset of indexes within a disk budget).
>
> Anyway this is all moot since as you stated so well earlier,
> optimization comes last, so stop trying to do it now ;)
>
> -Alec
>
> >
> > http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> > support. Try opening the esearch db with that and with a normal open;
> > also compress it with the same lib to get the right kind of file.
> >
> > Anyway - maybe this compression should be added later, and be optional.
> >
> > Tambet - technique evolves to art, art evolves to magic, magic evolves to
> > just doing.
> >
> >
> > 2008/12/2 Alec Warner <antarus@gentoo.org>
> >>
> >> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote:
> >> > 2008/12/2 Emma Strubell <emma.strubell@gmail.com>
> >> >>
> >> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> >> ignorance.
> >> >
> >> > Do you know the order of doing things?
> >> >
> >> > Rules of Optimization:
> >> >
> >> > Rule 1: Don't do it.
> >> > Rule 2 (for experts only): Don't do it yet.
> >> >
> >> > What this actually means - functionality comes first. Readability
> >> > comes next. Optimization comes last. Unless you are creating a fancy
> >> > 3D engine for a kung fu game.
> >> >
> >> > If you are going to exclude overlays, you are removing functionality
> >> > - and, indeed, absolutely has-to-be-there functionality, because no
> >> > one would intuitively expect the search function to search only one
> >> > subset of packages, however reasonable this subset would be. So you
> >> > can't, just can't, add this into the portage base - you could write
> >> > just another external search package for portage.
> >> >
> >> > I looked at this code a bit and:
> >> > Portage's "__init__.py" contains the comment "# search functionality".
> >> > After this comment, there is a nice and simple search class.
> >> > It also contains the method "def action_sync(...)", which holds the
> >> > synchronization stuff.
> >> >
> >> > Now, the search class is initialized by setting up 3 databases -
> >> > porttree, bintree and vartree, whatever those are. Those will be in
> >> > the self._dbs array, and porttree will be in self._portdb.
> >> >
> >> > It contains some more methods:
> >> > _findname(...) returns the result of self._portdb.findname(...) with
> >> > the same parameters, or None if it does not exist.
> >> > Other methods do similar things - each maps to one or another method.
> >> > execute does the real search...
> >> > Now - "for package in self.portdb.cp_all()" is important here ...it
> >> > currently loops over the whole portage tree, with all kinds of
> >> > matching done inside.
> >> > self.portdb obviously points to porttree.py (unless it points to a
> >> > fake tree).
> >> > cp_all takes all porttrees and does a simple file search inside. This
> >> > method is where the optional index search should go.
> >> >
> >> > self.porttrees = [self.porttree_root] + \
> >> >     [os.path.realpath(t) for t in
> >> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >> >
> >> > So, self.porttrees contains the list of trees - the first of them is
> >> > the root, the others are overlays.
> >> >
> >> > Now, what you have to do will not be any harder just because of
> >> > having overlay search, too.
> >> >
> >> > You have to create a method def cp_index(self), which returns a
> >> > dictionary containing package names as keys. For oroot... it will be
> >> > "self.porttrees[1:]", not "self.porttrees" - this will search only
> >> > the overlays. d = {} will be replaced with d = self.cp_index(). If
> >> > the index is not there, the old version will be used (thus, you have
> >> > to make an internal porttrees variable, which contains all trees, or
> >> > all except the first).
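> >> >
> >> > A minimal sketch of such a method (the index file name and pickle
> >> > format here are just my invention, not anything portage already has):
> >> >
> >> > import os
> >> > import cPickle as pickle
> >> >
> >> > def cp_index(self):
> >> >     # Return {"category/package": True, ...} from a prebuilt index
> >> >     # file, or None so callers fall back to the old cp_all() scan.
> >> >     index_file = os.path.join(self.porttree_root, ".cp_index.pickle")
> >> >     if not os.path.exists(index_file):
> >> >         return None
> >> >     f = open(index_file, "rb")
> >> >     try:
> >> >         return pickle.load(f)
> >> >     finally:
> >> >         f.close()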
> >> >
> >> > Other methods used by search are xmatch and aux_get - the first is
> >> > used several times and the last is used to get descriptions. You have
> >> > to cache the results of those specific queries and make them use your
> >> > cache - as you can see, those parts of portage are already able to
> >> > use overlays. Thus, you have to put your code at the beginning of
> >> > those functions too - create index_xmatch and index_aux_get methods,
> >> > then make xmatch and aux_get use them and return their results unless
> >> > those are None (or something else, in case None is already a legal
> >> > result) - if they return None, the old code runs and does its job. If
> >> > the index is not created, the result is None. In the index_* methods,
> >> > just check if the query is something you can answer, and if it is,
> >> > answer it.
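> >> >
> >> > The wrapping pattern would look roughly like this (index_xmatch is
> >> > the method proposed above; _old_xmatch just stands for the existing
> >> > code path):
> >> >
> >> > def xmatch(self, level, origdep):
> >> >     # Answer from the index when possible; None means "not indexed",
> >> >     # so fall through to the original, slower lookup.
> >> >     result = self.index_xmatch(level, origdep)
> >> >     if result is not None:
> >> >         return result
> >> >     return self._old_xmatch(level, origdep)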
> >> >
> >> > Obviously, the simplest way to create your index is to delete the
> >> > old index, then use those same methods to query all the necessary
> >> > information - and the fastest way would be to update the index
> >> > directly as part of sync, which you could do later.
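> >> >
> >> > The rebuild step could be as simple as this sketch (cp_all() is the
> >> > real portage call; the pickle layout matches the loader above and is
> >> > made up):
> >> >
> >> > import cPickle as pickle
> >> >
> >> > def build_index(portdb, index_file):
> >> >     # One slow pass over the whole tree now, so that later searches
> >> >     # read only this one file.
> >> >     index = {}
> >> >     for cp in portdb.cp_all():
> >> >         index[cp] = True
> >> >     f = open(index_file, "wb")
> >> >     try:
> >> >         pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)
> >> >     finally:
> >> >         f.close()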
> >> >
> >> > Please also make commands to turn the index on and off (the latter
> >> > should also delete it, to save disk space). The default should be off
> >> > until it's fast, small and reliable. Also notice that if the index is
> >> > kept on the hard drive, it might be faster if it's compressed (gz,
> >> > for example) - decompressing takes less time, at the cost of more
> >> > processing power, than reading the whole thing from disk.
> >>
> >> I'm pretty sure you're mistaken here, unless your index is stored on
> >> a floppy or something really slow.
> >>
> >> A disk read has 2 primary costs.
> >>
> >> Seek Time: Time for the head to seek to the sector of disk you want.
> >> Spin Time: Time for the platter to spin around such that the sector
> >> you want is under the read head.
> >>
> >> Spin Time is based on rpm: 7200 rpm / 60 seconds = 120 rotations per
> >> second, so in the worst case (you just passed the sector you need)
> >> you wait a full rotation, 1/120th of a second (about 8ms).
> >>
> >> Seek Time is per hard drive, but most drives provide average seek
> >> times under 10ms.
> >>
> >> So it takes on average 18ms to get to your data, then you start
> >> reading. The index will not be that large (my esearchdb is 2 megs,
> >> but let's assume 10MB for this compressed index).
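> >>
> >> To put rough numbers on it (the ~60MB/s sequential read rate is an
> >> assumption about current disks, not a measurement):
> >>
> >> seek_ms = 18.0                            # average seek + rotation
> >> rate = 60.0                               # MB/s, assumed
> >> plain = seek_ms + 10.0 / rate * 1000      # ~185ms to read 10MB
> >> compressed = seek_ms + 5.0 / rate * 1000  # ~101ms to read 5MB
> >> saved = plain - compressed                # ~83ms of I/O saved
> >> # Compression only wins if decompressing 5MB costs less than ~83ms.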
> >>
> >> I took a 10 meg sqlite database and compressed it with gzip (default
> >> settings) down to 5 megs.
> >> gzip -d on the database takes 300ms; catting the decompressed
> >> database takes 88ms (average of 5 runs, disk caches dropped between
> >> runs).
> >>
> >> I then tried my vdb_metadata.pickle from
> >> /var/cache/edb/vdb_metadata.pickle
> >>
> >> 1.3megs compresses to 390k.
> >>
> >> 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
> >> disk.
> >>
> >> Your index would have to be very large or very fragmented on disk
> >> (requiring more than one seek) to see a significant gain in file
> >> compression (gzip scales linearly).
> >>
> >> In short, don't compress the index ;p
> >>
> >> >
> >> > Good luck!
> >> >
> >> >>> Emma Strubell wrote:
> >> >>> > 2) does anyone really need to search an overlay anyway?
> >> >>>
> >> >>> Of course. Take large (semi-)official overlays like sunrise. They
> >> >>> can easily be seen as a second portage tree.
> >> >>>
> >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann
> >> >> <lists@necoro.eu> wrote:
> >> >>
> >> >
> >> >
> >
> >
>