From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-portage-dev+bounces-2160-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1L4RtF-00070M-Eq
	for garchives@archives.gentoo.org; Mon, 24 Nov 2008 03:13:33 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 0CF45E06BF;
	Mon, 24 Nov 2008 03:13:34 +0000 (UTC)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.9])
	by pigeon.gentoo.org (Postfix) with ESMTP id 1079CE06BF
	for <gentoo-portage-dev@lists.gentoo.org>; Mon, 24 Nov 2008 03:13:33 +0000 (UTC)
Received: from sheridan (dslb-088-070-220-113.pools.arcor-ip.net [88.70.220.113])
	by mrelayeu.kundenserver.de (node=mrelayeu6) with ESMTP (Nemesis)
	id 0ML29c-1L4RtC1j23-00031b; Mon, 24 Nov 2008 04:13:30 +0100
Date: Mon, 24 Nov 2008 04:12:57 +0100
From: Marius Mauch <genone@gentoo.org>
To: gentoo-portage-dev@lists.gentoo.org
Subject: Re: [gentoo-portage-dev] search functionality in emerge
Message-Id: <20081124041257.755679eb.genone@gentoo.org>
In-Reply-To: <5a8c638a0811230417r5bcf912fka14a18edc9c711b6@mail.gmail.com>
References: <5a8c638a0811230417r5bcf912fka14a18edc9c711b6@mail.gmail.com>
X-Mailer: Sylpheed 2.4.8 (GTK+ 2.10.14; i686-pc-mingw32)
Precedence: bulk
List-Post: <mailto:gentoo-portage-dev@lists.gentoo.org>
List-Help: <mailto:gentoo-portage-dev+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-portage-dev+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-portage-dev+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-portage-dev.gentoo.org>
X-BeenThere: gentoo-portage-dev@lists.gentoo.org
Reply-to: gentoo-portage-dev@lists.gentoo.org
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Provags-ID: V01U2FsdGVkX1/AkhFFPJAP5brgZ2JXitozxyT/cV2yDMPH0rO
 CzQWI8F5HBLivftIf5YfT+P8NRMeEOHorgb1cgj7/R0Iuv6YXJ
 OfcTjb7WeBV+vkwFtGboA==
X-Archives-Salt: e6ae76fc-7912-40e5-ba8e-57d9e9faf8b3
X-Archives-Hash: dd544c6e95c8a07351103cb1ae8f946c

On Sun, 23 Nov 2008 07:17:40 -0500
"Emma Strubell" <emma.strubell@gmail.com> wrote:

> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone
> on here could give me a quick overview of how the search function
> currently works, an idea as to what could be modified or implemented
> in order to improve the running time of this code, or any tip really
> as to where I should start or what I should start looking at. I'd
> really appreciate any help or advice!!

Well, it depends how much effort you want to put into this. The current
interface doesn't actually provide a "search" interface, but merely
functions to
1) list all package names - dbapi.cp_all()
2) list all package names and versions - dbapi.cpv_all()
3) list all versions for a given package name - dbapi.cp_list()
4) read metadata (like DESCRIPTION) for a given package name and
version - dbapi.aux_get()

One of the main performance problems of --search is that there is no
persistent cache for functions 1, 2 and 3, so if you're "just"
interested in performance aspects you might want to look into that.
The issue with implementing a persistent cache is that you have to
consider both cold and hot filesystem cache cases: Loading an index
file with package names and versions might improve the cold-cache case,
but slow things down when the filesystem cache is populated.
As has been mentioned, keeping the index updated is the other major
issue, especially as it has to be portable and should require little or
no configuration/setup for the user (so no extra daemons or special
filesystems running permanently in the background). The obvious
solution would be to generate the cache after `emerge --sync` (and other
sync implementations) and hope that people don't modify their tree and
search for the changes in between (that's what all the external tools
do). I don't know if there is actually a way to do online updates while
still improving performance and not relying on custom system daemons
running in the background.

As for --searchdesc, one problem is that dbapi.aux_get() can only
operate on a single package-version on each call (though it can read
multiple metadata variables). So for description searches the control
flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
    # determine the current version of each package, this is 
    # another performance issue.
    version = get_current_version(package)
    # read package description from metadata cache
    description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
    # check if the description matches
    if matches(description, searchkey):
        result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package
list, the version lookup for *each* package and the actual metadata
read. I've already talked about the first, so lets look at the other
two. The core problem there is that DESCRIPTION (like all standard
metadata variables) is version specific, so to access it you need to
determine a version to use, even though in almost all cases the
description is the same (or very similar) for all versions. So the
proper solution would be to make the description a property of the
package name instead of the package version, but that's a _huge_ task
you're probably not interested in. What _might_ work here is to add
support for an optional package-name->description cache that can be
generated offline and includes those packages where all versions have
the same description, and fall back to the current method if the
package is not included in the cache. (Don't think about caching the
version lookup, that's system dependent and therefore not suitable for
caching).

Hope it has become clear that while the actual search algorithm might
be simple and not very efficient, the real problem lies in getting the
data to operate on. 

That and the somewhat limited dbapi interface.

Disclaimer: The stuff below involves extending and redesigning some
core portage APIs. This isn't something you can do on a weekend, only
work on this if you want to commit yourself to portage development
for a long time.

The functions listed above are the bare minimum to
perform queries on the package repositories, but they're very
low-level. That means that whenever you want to select packages by
name, description, license, dependencies or other variables you need
quite a bit of custom code, more if you want to combine multiple
searches, and much more if you want to do it efficient and flexible.
See http://dev.gentoo.org/~genone/scripts/metalib.py and
http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible,
but very inefficient search tool (might not work anymore due to old
age).

Ideally repository searches could be done without writing any
application code using some kind of query language, similar to how SQL
works for generic database searches (obviously not that complex).
But before thinking about that we'd need a query API that actually
a) allows tools to assemble queries without having to worry about
implementation details
b) run them efficiently without bothering the API user

Simple example: Find all package-versions in the sys-apps category that
are BSD-licensed.

Currently that would involve something like:

result = []
for package is dbapi.cp_all():
    if not package.startswith("sys-apps/"):
        continue
    for version in dbapi.cp_list(package):
        license = dbapi.aux_get(version, ["LICENSE"])[0]
        # for simplicity perform a equivalence check, in reality you'd
        # have to account for complex license definitions 
        if license == "BSD":
            result.append(version)

Not very friendly to maintain, and not very efficient (we'd only need
to iterate over packages in the 'sys-apps' category, but the interface
doesn't allow that).
And now how it might look with a extensive query interface:

query = AndQuery()
query.add(CategoryQuery("sys-apps", FullStringMatch()))
query.add(MetadataQuery("BSD", FullStringMatch()))
result = repository.selectPackages(query)

Much nicer, don't you think?

As said, implementing such a thing would be a huge amount of work, even
if just implemented as wrappers on top of the current interface (which
would prevent many efficiency improvements), but if you (or anyone else
for that matter) are truly interested in this contact me off-list,
maybe I can find some of my old design ideas and (incomplete)
prototypes to give you a start.

Marius