Re: [gentoo-portage-dev] Re: search functionality in emerge

public inbox for gentoo-portage-dev@lists.gentoo.org
 help / color / mirror / Atom feed

From: Tambet <qtvali@gmail.com>
To: gentoo-portage-dev@lists.gentoo.org
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 2 Dec 2008 19:42:56 +0200	[thread overview]
Message-ID: <cea53e3c0812020942n2b8da1d4v370e9b4f8cf7b1a5@mail.gmail.com> (raw)
In-Reply-To: <b41005390812020221s3645d862t62866ab2d62ce85@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8967 bytes --]

To remove this fuzz maybe some useflags would be possible - for example,
useflag which makes portage use SQLite. Then, it's db can used for indexing
and portage searches could be implemented as SQL queries. SQLite then does
what it can to make it fast and disk cache will make sure that for
simultaneous searches it's kept in memory.


Just some thoughts on compression...

http://en.wikipedia.org/wiki/Hard_disk 1.6GBit/s
http://www.dewassoc.com/performance/memory/memory_speeds.htm 1.6GB/s
So the *fastest *hard drives are 8 times slower, than the *fastest *memory
devices in pair.

Anyway, we live in the real world...
http://www.amazon.com/Maxtor-L01P200-Internal-7200-Drive/dp/B00007KVHZ up to
133MB/s
Therefore, the peak bandwidth for PC-266 DDR is 266 MHz x 8 Bytes = 2.1GB/s

Now, be fair - if you can read 133MB/s out of hard drive, it means 15 ms for
2MB file without seek times and partitioning. At the same time, you can
read/write about 31.5MB of data in memory, doing different operations with
it. If your compression algorithm has dictionary of repeated words, which
repeat enough times to make actual file size 0.5MB smaller, you win 7.5ms
and loose about 1ms for seeking this thing. Processor usage might be
somewhat bigger, but you really don't care about processor usage when
searching packages and it's not *that* big.

What it really gives us is ability to make our file format highly
inefficient in terms of disk space - for example, what I mentioned before
was 65536 64-bit integers (there was a mistake as I said they were 16 byte,
they were 8 bytes long). Most of them started with at least 32 bits of
zeros, which were wiped out by compression lib, but which I needed to do
fast operations. It doesnt matter much, how data is stored, but there are
repeated patterns if you want to make it fast.

I really don't know if pickle has some good file format included or not. For
common pickle files, I think that compression is noticeable. Anyway, using
database, for example, would be clever idea and making keyword search fully
SQL-query specific makes it clearly better and removes chance to compress.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@gentoo.org>

> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote:
> > 2008/12/2 Emma Strubell <emma.strubell@gmail.com>
> >>
> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> igonrance.
> >
> > Do you know an order of doing things:
> >
> > Rules of Optimization:
> >
> > Rule 1: Don't do it.
> > Rule 2 (for experts only): Don't do it yet.
> >
> > What this actually means - functionality comes first. Readability comes
> > next. Optimization comes last. Unless you are creating a fancy 3D engine
> for
> > kung fu game.
> >
> > If you are going to exclude overlays, you are removing functionality -
> and,
> > indeed, absolutely has-to-be-there functionality, because noone would
> > intuitively expect search function to search only one subset of packages,
> > however reasonable this subset would be. So, you can't, just can't, add
> this
> > package into portage base - you could write just another external search
> > package for portage.
> >
> > I looked this code a bit and:
> > Portage's "__init__.py" contains comment "# search functionality". After
> > this comment, there is a nice and simple search class.
> > It also contains method "def action_sync(...)", which contains
> > synchronization stuff.
> >
> > Now, search class will be initialized by setting up 3 databases -
> porttree,
> > bintree and vartree, whatever those are. Those will be in self._dbs array
> > and porttree will be in self._portdb.
> >
> > It contains some more methods:
> > _findname(...) will return result of self._portdb.findname(...) with same
> > parameters or None if it does not exist.
> > Other methods will do similar things - map one or another method.
> > execute will do the real search...
> > Now - "for package in self.portdb.cp_all()" is important here ...it
> > currently loops over whole portage tree. All kinds of matching will be
> done
> > inside.
> > self.portdb obviously points to porttree.py (unless it points to fake
> tree).
> > cp_all will take all porttrees and do simple file search inside. This
> method
> > should contain optional index search.
> >
> >               self.porttrees = [self.porttree_root] + \
> >                       [os.path.realpath(t) for t in
> self.mysettings["PORTDIR_OVERLAY"].split()]
> >
> > So, self.porttrees contains list of trees - first of them is root, others
> > are overlays.
> >
> > Now, what you have to do will not be harder just because of having
> overlay
> > search, too.
> >
> > You have to create method def cp_index(self), which will return
> dictionary
> > containing package names as keys. For oroot... will be
> "self.porttrees[1:]",
> > not "self.porttrees" - this will only search overlays. d = {} will be
> > replaced with d = self.cp_index(). If index is not there, old version
> will
> > be used (thus, you have to make internal porttrees variable, which
> contains
> > all or all except first).
> >
> > Other methods used by search are xmatch and aux_get - first used several
> > times and last one used to get description. You have to cache results of
> > those specific queries and make them use your cache - as you can see,
> those
> > parts of portage are already able to use overlays. Thus, you have to put
> > your code again in beginning of those functions - create index_xmatch and
> > index_aux_get methods, then make those methods use them and return their
> > results unless those are None (or something other in case none is already
> > legal result) - if they return None, old code will be run and do it's
> job.
> > If index is not created, result is None. In index_** methods, just check
> if
> > query is what you can answer and if it is, then answer it.
> >
> > Obviously, the simplest way to create your index is to delete index, then
> > use those same methods to query for all nessecary information - and
> fastest
> > way would be to add updating index directly into sync, which you could do
> > later.
> >
> > Please, also, make those commands to turn index on and off (last one
> should
> > also delete it to save disk space). Default should be off until it's
> fast,
> > small and reliable. Also notice that if index is kept on hard drive, it
> > might be faster if it's compressed (gz, for example) - decompressing
> takes
> > less time and more processing power than reading it fully out.
>
> I'm pretty sure your mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: Time for the head to seek to the sector of disk you want.
> Spin Time: Time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120
> rotations per second, so worst case (you just passed the sector you
> need) you need to wait 1/120th of a second (or 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
>
> So it takes on average 18ms to get to your data, then you start
> reading.  The index will not be that large (my esearchdb is 2 megs,
> but lets assume 10MB for this compressed index).
>
> I took a 10MB meg sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms, catting the decompressed data
> base takes 88ms (average of 5 runs, drop disk caches between runs).
>
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
> disk.
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain in file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
>
> >
> > Have luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >>> =+lCO
> >>> -----END PGP SIGNATURE-----
> >>>
> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu>
> >> wrote:
> >>
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 10697 bytes --]

next prev parent reply	other threads:[~2008-12-02 17:56 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-11-23 12:17 [gentoo-portage-dev] search functionality in emerge Emma Strubell
2008-11-23 14:01 ` tvali
2008-11-23 14:33   ` Pacho Ramos
2008-11-23 14:43     ` Emma Strubell
2008-11-23 16:56       ` Lucian Poston
2008-11-23 18:49         ` Emma Strubell
2008-11-23 20:00           ` tvali
2008-11-23 21:20           ` Mike Auty
2008-11-23 21:59             ` René 'Necoro' Neumann
2008-11-24  0:53               ` tvali
2008-11-24  9:34                 ` René 'Necoro' Neumann
2008-11-24  9:48                   ` Fabian Groffen
2008-11-24 14:30                     ` tvali
2008-11-24 15:14                       ` tvali
2008-11-24 15:15                       ` René 'Necoro' Neumann
2008-11-24 15:18                         ` tvali
2008-11-24 17:15                           ` tvali
2008-11-30 23:42                             ` Emma Strubell
2008-12-01  7:34                               ` [gentoo-portage-dev] " Duncan
2008-12-01 10:40                                 ` Emma Strubell
2008-12-01 17:52                                   ` Zac Medico
2008-12-01 21:25                                     ` Emma Strubell
2008-12-01 21:52                                       ` Tambet
2008-12-01 22:08                                         ` Emma Strubell
2008-12-01 22:17                                           ` René 'Necoro' Neumann
2008-12-01 22:47                                             ` Emma Strubell
2008-12-02  0:20                                               ` Tambet
2008-12-02  2:23                                                 ` Emma Strubell
2008-12-02 10:21                                                 ` Alec Warner
2008-12-02 12:42                                                   ` Tambet
2008-12-02 13:51                                                     ` Tambet
2008-12-02 19:54                                                     ` Alec Warner
2008-12-02 21:47                                                       ` Tambet
2008-12-02 17:42                                                   ` Tambet [this message]
2008-11-23 14:56     ` [gentoo-portage-dev] " Douglas Anderson
2008-11-24  3:12 ` Marius Mauch
2008-11-24  5:01   ` devsk
2008-11-24  6:25     ` Marius Mauch
2008-11-24  6:47     ` [gentoo-portage-dev] " Duncan
2009-02-12 19:16 ` [gentoo-portage-dev] " René 'Necoro' Neumann
     [not found]   ` <5a8c638a0902121258s7402d9d7l1ad2b9a8ecf9820d@mail.gmail.com>
2009-02-12 21:01     ` Fwd: " Emma Strubell
2009-02-12 21:05       ` Mike Auty
2009-02-12 21:14         ` Emma Strubell
2009-02-13 13:37       ` Marijn Schouten (hkBst)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cea53e3c0812020942n2b8da1d4v370e9b4f8cf7b1a5@mail.gmail.com \
    --to=qtvali@gmail.com \
    --cc=gentoo-portage-dev@lists.gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox