From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-portage-dev+bounces-2208-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1L7Ua4-0002PR-C4
	for garchives@archives.gentoo.org; Tue, 02 Dec 2008 12:42:20 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 70ADAE00C2;
	Tue,  2 Dec 2008 12:42:19 +0000 (UTC)
Received: from yx-out-1718.google.com (yx-out-1718.google.com [74.125.44.153])
	by pigeon.gentoo.org (Postfix) with ESMTP id 3258AE00C2
	for <gentoo-portage-dev@lists.gentoo.org>; Tue,  2 Dec 2008 12:42:19 +0000 (UTC)
Received: by yx-out-1718.google.com with SMTP id 4so1304620yxp.46
        for <gentoo-portage-dev@lists.gentoo.org>; Tue, 02 Dec 2008 04:42:19 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=gamma;
        h=domainkey-signature:received:received:message-id:date:from:to
         :subject:in-reply-to:mime-version:content-type:references;
        bh=N+Tvxyt6+4v/FJ5tqtzOnSCenr61p8Q7kPPnk6c9eZo=;
        b=K2a2xIKPVZbnu87kCHIswo3yBAnbgJ6kNGW4wHvfrEFCDHkox/0iAUR0zDeavD2bJ2
         yPhh5PxxcZ7DCVovSky0L9chO/JDXVlz+NRgo01ny0iTUo2e+pMmnxOdAIcFcqrISOWC
         MohdYxpVVCnip/z7NT90GmkpPGdIuo4yYaVYA=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:to:subject:in-reply-to:mime-version
         :content-type:references;
        b=VsX/HGWy7xc9v5DeLukNAjWGyyfW4mSQFEAJkLOSnTCKgVluju3z7Lb/KYheneQcoM
         4FxhtMzR8k9COKwRdQrjPY/aYuKqjwYyZzOLFtdvy0f3Me25SGH6q/rOj5gGRRT4fupx
         T6x6Ief83cmM7af/Rc1xAl9lfDSuwEHWhO67g=
Received: by 10.143.2.19 with SMTP id e19mr2766303wfi.96.1228221738419;
        Tue, 02 Dec 2008 04:42:18 -0800 (PST)
Received: by 10.142.76.20 with HTTP; Tue, 2 Dec 2008 04:42:18 -0800 (PST)
Message-ID: <cea53e3c0812020442s6fcb36ao7e8fcee400a3370b@mail.gmail.com>
Date: Tue, 2 Dec 2008 14:42:18 +0200
From: Tambet <qtvali@gmail.com>
To: gentoo-portage-dev@lists.gentoo.org
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
In-Reply-To: <b41005390812020221s3645d862t62866ab2d62ce85@mail.gmail.com>
Precedence: bulk
List-Post: <mailto:gentoo-portage-dev@lists.gentoo.org>
List-Help: <mailto:gentoo-portage-dev+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-portage-dev+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-portage-dev+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-portage-dev.gentoo.org>
X-BeenThere: gentoo-portage-dev@lists.gentoo.org
Reply-to: gentoo-portage-dev@lists.gentoo.org
MIME-Version: 1.0
Content-Type: multipart/alternative; 
	boundary="----=_Part_65782_20403769.1228221738426"
References: <5a8c638a0811230417r5bcf912fka14a18edc9c711b6@mail.gmail.com>
	 <5a8c638a0812010240m1e9a64a1t6ea0980dcb1baffa@mail.gmail.com>
	 <49342452.1050606@gentoo.org>
	 <5a8c638a0812011325l7c4231d7n85bbe63e69f2d0fe@mail.gmail.com>
	 <cea53e3c0812011352tf83a107kb4548d71aab1ec52@mail.gmail.com>
	 <5a8c638a0812011408j792bbda4r3716d04088efab4f@mail.gmail.com>
	 <49346292.2000307@necoro.eu>
	 <5a8c638a0812011447g37c4900y1446401b85beb87a@mail.gmail.com>
	 <cea53e3c0812011620w94e8847vb3777d2b05832ded@mail.gmail.com>
	 <b41005390812020221s3645d862t62866ab2d62ce85@mail.gmail.com>
X-Archives-Salt: 71d0c4af-099e-4e18-88cd-7314879cd38f
X-Archives-Hash: ecb6dd6c92dba5051d11944b13b11a16

------=_Part_65782_20403769.1228221738426
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

About zipping... Default settings might not really be a good idea - I think
that "fastest" might be even better. Considering that the portage tree
contains the same words again and again (like "applications"), it needs only
a pretty small dictionary to make it much smaller. Decompressing will not be
reading from disc, decompressing and writing back to disc as it probably was
in your case - try decompressing to a memory drive and you might get better
numbers.
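
To show what I mean by "fastest": gzip compression levels go from 1 (fastest)
to 9 (best, which is the Python module's default). A rough sketch - the
esearch db path here is just a guess, point it at wherever the file really
lives:

    import gzip
    import shutil

    src = "/var/cache/edb/esearchdb.py"   # assumed path to the esearch db

    for level in (1, 9):                  # 1 = "fastest", 9 = default "best"
        f_in = open(src, "rb")
        f_out = gzip.open("%s.%d.gz" % (src, level), "wb", level)
        shutil.copyfileobj(f_in, f_out)   # compress the whole file
        f_out.close()
        f_in.close()

Comparing the two output sizes and how long each run takes should show
whether level 1 is already good enough for this kind of text.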

I have personally used compression in one C++ application and, with optimum
settings, it made things much faster - those were files where I had, for
example, 65536 16-byte integers, which could be zeros and mostly were; I
didn't care about creating a better file format, but just compressed the
whole thing.

I suggest you compress the esearch db, then decompress it to a memory drive
and give us those numbers - it might be considerably faster.

http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.
Try opening the esearch db through that and through a normal open; also
compress it with the same lib to get the right kind of file.
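
Reading it back is pretty much a drop-in replacement for the normal open - a
minimal sketch, again assuming the paths from above:

    import gzip

    db = "/var/cache/edb/esearchdb.py"            # assumed path
    plain = open(db, "rb").read()                 # the usual way
    packed = gzip.open(db + ".gz", "rb").read()   # compressed copy from above
    assert plain == packed                        # same bytes either way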

Anyway - maybe this compression should be added later and made optional.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@gentoo.org>

> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote:
> > 2008/12/2 Emma Strubell <emma.strubell@gmail.com>
> >>
> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> ignorance.
> >
> > Do you know the order of doing things:
> >
> > Rules of Optimization:
> >
> > Rule 1: Don't do it.
> > Rule 2 (for experts only): Don't do it yet.
> >
> > What this actually means - functionality comes first. Readability comes
> > next. Optimization comes last. Unless you are creating a fancy 3D engine
> > for a kung fu game.
> >
> > If you are going to exclude overlays, you are removing functionality -
> > and, indeed, absolutely has-to-be-there functionality, because no one
> > would intuitively expect the search function to search only one subset
> > of packages, however reasonable this subset would be. So, you can't,
> > just can't, add this package into the portage base - you could write
> > just another external search package for portage.
> >
> > I looked at this code a bit:
> > Portage's "__init__.py" contains the comment "# search functionality".
> > After this comment, there is a nice and simple search class.
> > It also contains the method "def action_sync(...)", which contains
> > synchronization stuff.
> >
> > Now, the search class will be initialized by setting up 3 databases -
> > porttree, bintree and vartree, whatever those are. Those will be in the
> > self._dbs array and porttree will be in self._portdb.
> >
> > It contains some more methods:
> > _findname(...) will return the result of self._portdb.findname(...) with
> > the same parameters, or None if it does not exist.
> > Other methods will do similar things - map one or another method.
> > execute will do the real search...
> > Now - "for package in self.portdb.cp_all()" is important here ...it
> > currently loops over the whole portage tree. All kinds of matching will
> > be done inside.
> > self.portdb obviously points to porttree.py (unless it points to a fake
> > tree).
> > cp_all will take all porttrees and do a simple file search inside. This
> > method should contain the optional index search.
> >
> >               self.porttrees = [self.porttree_root] + \
> >                       [os.path.realpath(t) for t in
> >                       self.mysettings["PORTDIR_OVERLAY"].split()]
> >
> > So, self.porttrees contains a list of trees - the first of them is the
> > root, the others are overlays.
> >
> > Now, what you have to do will not be harder just because of having
> > overlay search, too.
> >
> > You have to create a method def cp_index(self), which will return a
> > dictionary containing package names as keys. For oroot... it will be
> > "self.porttrees[1:]", not "self.porttrees" - this will only search
> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
> > is not there, the old version will be used (thus, you have to make an
> > internal porttrees variable, which contains all or all except the first).
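
Roughly what I have in mind - only a sketch, the index file name and its
format are made up here, this is not actual portage code:

    import os
    import pickle

    def cp_index(self):
        # Return a dict with category/package names as keys, loaded from a
        # prebuilt index file, or None if no index has been built yet.
        index_file = os.path.join(self.porttree_root, ".cp_index.pickle")
        if not os.path.exists(index_file):
            return None      # no index - caller falls back to the old code
        f = open(index_file, "rb")
        try:
            return pickle.load(f)
        finally:
            f.close()

cp_all would then start with d = self.cp_index() and only do the directory
walk when that returns None.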
> >
> > Other methods used by search are xmatch and aux_get - the first is used
> > several times and the last one is used to get the description. You have
> > to cache the results of those specific queries and make them use your
> > cache - as you can see, those parts of portage are already able to use
> > overlays. Thus, you have to put your code again at the beginning of
> > those functions - create index_xmatch and index_aux_get methods, then
> > make those methods use them and return their results unless those are
> > None (or something else in case None is already a legal result) - if
> > they return None, the old code will be run and do its job. If the index
> > is not created, the result is None. In the index_** methods, just check
> > if the query is something you can answer and if it is, then answer it.
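
Sketched out, the wrapping could look something like this (self._index, the
stored query format and _old_xmatch are invented for the example, they are
not existing portage attributes):

    def index_xmatch(self, level, atom):
        # Answer from the prebuilt index if possible, else return None so
        # the caller falls through to the normal xmatch code.
        if self._index is None:
            return None
        return self._index.get("xmatch", {}).get((level, atom))

    def xmatch(self, level, origdep):
        cached = self.index_xmatch(level, origdep)
        if cached is not None:
            return cached
        return self._old_xmatch(level, origdep)  # original implementation

index_aux_get would follow the same pattern for the description lookups.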
> >
> > Obviously, the simplest way to create your index is to delete the index,
> > then use those same methods to query for all necessary information - and
> > the fastest way would be to add updating the index directly into sync,
> > which you could do later.
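
As a sketch, a full rebuild could be as dumb as this (the index file path
and layout are again made up; cp_all, xmatch and aux_get are the real
portage calls discussed above):

    import gzip
    import pickle

    def rebuild_index(portdb,
                      index_file="/var/cache/edb/search_index.gz"):
        index = {"packages": {}, "xmatch": {}, "aux": {}}
        for cp in portdb.cp_all():            # every category/package name
            index["packages"][cp] = None
            matches = portdb.xmatch("match-all", cp)
            index["xmatch"][("match-all", cp)] = matches
            if matches:
                # description of the newest version, for the search output
                index["aux"][matches[-1]] = portdb.aux_get(
                        matches[-1], ["DESCRIPTION"])
        f = gzip.open(index_file, "wb")
        f.write(pickle.dumps(index, -1))
        f.close()

Hooking the same loop into action_sync later would keep the index fresh
without a separate rebuild step.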
> >
> > Please, also, make commands to turn the index on and off (the latter
> > should also delete it to save disk space). The default should be off
> > until it's fast, small and reliable. Also notice that if the index is
> > kept on the hard drive, it might be faster if it's compressed (gz, for
> > example) - decompressing takes less time and more processing power than
> > reading it fully out.
>
> I'm pretty sure you're mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: Time for the head to seek to the sector of disk you want.
> Spin Time: Time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120
> rotations per second, so worst case (you just passed the sector you
> need) you need to wait 1/120th of a second (or 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
>
> So it takes on average 18ms to get to your data, then you start
> reading.  The index will not be that large (my esearchdb is 2 megs,
> but let's assume 10MB for this compressed index).
>
> I took a 10MB sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms, catting the decompressed database
> takes 88ms (average of 5 runs, drop disk caches between runs).
>
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
> disk.
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain in file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
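
For anyone who wants to repeat that comparison, a rough timing sketch (the
pickle path is the one mentioned above; drop the page cache between runs as
root, otherwise you are only timing cached reads):

    import gzip
    import os
    import time

    path = "/var/cache/edb/vdb_metadata.pickle"
    gz_path = path + ".gz"

    if not os.path.exists(gz_path):       # one-off compressed copy
        gzip.open(gz_path, "wb").write(open(path, "rb").read())

    # as root, between runs:  echo 3 > /proc/sys/vm/drop_caches

    t = time.time()
    open(path, "rb").read()
    print("plain read:      %.0f ms" % ((time.time() - t) * 1000))

    t = time.time()
    gzip.open(gz_path, "rb").read()
    print("gzip decompress: %.0f ms" % ((time.time() - t) * 1000))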
>
> >
> > Have luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >>> =+lCO
> >>> -----END PGP SIGNATURE-----
> >>>
> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu>
> >> wrote:
> >>
> >
> >
>

------=_Part_65782_20403769.1228221738426--