From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org) by finch.gentoo.org with esmtp (Exim 4.60) (envelope-from <gentoo-portage-dev+bounces-2160-garchives=archives.gentoo.org@lists.gentoo.org>) id 1L4RtF-00070M-Eq for garchives@archives.gentoo.org; Mon, 24 Nov 2008 03:13:33 +0000 Received: from pigeon.gentoo.org (localhost [127.0.0.1]) by pigeon.gentoo.org (Postfix) with SMTP id 0CF45E06BF; Mon, 24 Nov 2008 03:13:34 +0000 (UTC) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.9]) by pigeon.gentoo.org (Postfix) with ESMTP id 1079CE06BF for <gentoo-portage-dev@lists.gentoo.org>; Mon, 24 Nov 2008 03:13:33 +0000 (UTC) Received: from sheridan (dslb-088-070-220-113.pools.arcor-ip.net [88.70.220.113]) by mrelayeu.kundenserver.de (node=mrelayeu6) with ESMTP (Nemesis) id 0ML29c-1L4RtC1j23-00031b; Mon, 24 Nov 2008 04:13:30 +0100 Date: Mon, 24 Nov 2008 04:12:57 +0100 From: Marius Mauch <genone@gentoo.org> To: gentoo-portage-dev@lists.gentoo.org Subject: Re: [gentoo-portage-dev] search functionality in emerge Message-Id: <20081124041257.755679eb.genone@gentoo.org> In-Reply-To: <5a8c638a0811230417r5bcf912fka14a18edc9c711b6@mail.gmail.com> References: <5a8c638a0811230417r5bcf912fka14a18edc9c711b6@mail.gmail.com> X-Mailer: Sylpheed 2.4.8 (GTK+ 2.10.14; i686-pc-mingw32) Precedence: bulk List-Post: <mailto:gentoo-portage-dev@lists.gentoo.org> List-Help: <mailto:gentoo-portage-dev+help@lists.gentoo.org> List-Unsubscribe: <mailto:gentoo-portage-dev+unsubscribe@lists.gentoo.org> List-Subscribe: <mailto:gentoo-portage-dev+subscribe@lists.gentoo.org> List-Id: Gentoo Linux mail <gentoo-portage-dev.gentoo.org> X-BeenThere: gentoo-portage-dev@lists.gentoo.org Reply-to: gentoo-portage-dev@lists.gentoo.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Provags-ID: V01U2FsdGVkX1/AkhFFPJAP5brgZ2JXitozxyT/cV2yDMPH0rO CzQWI8F5HBLivftIf5YfT+P8NRMeEOHorgb1cgj7/R0Iuv6YXJ OfcTjb7WeBV+vkwFtGboA== X-Archives-Salt: e6ae76fc-7912-40e5-ba8e-57d9e9faf8b3 X-Archives-Hash: dd544c6e95c8a07351103cb1ae8f946c On Sun, 23 Nov 2008 07:17:40 -0500 "Emma Strubell" <emma.strubell@gmail.com> wrote: > However, I've started looking at the code, and I must admit I'm pretty > overwhelmed! I don't know where to start. I was wondering if anyone > on here could give me a quick overview of how the search function > currently works, an idea as to what could be modified or implemented > in order to improve the running time of this code, or any tip really > as to where I should start or what I should start looking at. I'd > really appreciate any help or advice!! Well, it depends how much effort you want to put into this. The current interface doesn't actually provide a "search" interface, but merely functions to 1) list all package names - dbapi.cp_all() 2) list all package names and versions - dbapi.cpv_all() 3) list all versions for a given package name - dbapi.cp_list() 4) read metadata (like DESCRIPTION) for a given package name and version - dbapi.aux_get() One of the main performance problems of --search is that there is no persistent cache for functions 1, 2 and 3, so if you're "just" interested in performance aspects you might want to look into that. The issue with implementing a persistent cache is that you have to consider both cold and hot filesystem cache cases: Loading an index file with package names and versions might improve the cold-cache case, but slow things down when the filesystem cache is populated. As has been mentioned, keeping the index updated is the other major issue, especially as it has to be portable and should require little or no configuration/setup for the user (so no extra daemons or special filesystems running permanently in the background). The obvious solution would be to generate the cache after `emerge --sync` (and other sync implementations) and hope that people don't modify their tree and search for the changes in between (that's what all the external tools do). I don't know if there is actually a way to do online updates while still improving performance and not relying on custom system daemons running in the background. As for --searchdesc, one problem is that dbapi.aux_get() can only operate on a single package-version on each call (though it can read multiple metadata variables). So for description searches the control flow is like this (obviously simplified): result = [] # iterate over all packages for package in dbapi.cp_all(): # determine the current version of each package, this is # another performance issue. version = get_current_version(package) # read package description from metadata cache description = dbapi.aux_get(version, ["DESCRIPTION"])[0] # check if the description matches if matches(description, searchkey): result.append(package) There you see the three bottlenecks: the lack of a pregenerated package list, the version lookup for *each* package and the actual metadata read. I've already talked about the first, so lets look at the other two. The core problem there is that DESCRIPTION (like all standard metadata variables) is version specific, so to access it you need to determine a version to use, even though in almost all cases the description is the same (or very similar) for all versions. So the proper solution would be to make the description a property of the package name instead of the package version, but that's a _huge_ task you're probably not interested in. What _might_ work here is to add support for an optional package-name->description cache that can be generated offline and includes those packages where all versions have the same description, and fall back to the current method if the package is not included in the cache. (Don't think about caching the version lookup, that's system dependent and therefore not suitable for caching). Hope it has become clear that while the actual search algorithm might be simple and not very efficient, the real problem lies in getting the data to operate on. That and the somewhat limited dbapi interface. Disclaimer: The stuff below involves extending and redesigning some core portage APIs. This isn't something you can do on a weekend, only work on this if you want to commit yourself to portage development for a long time. The functions listed above are the bare minimum to perform queries on the package repositories, but they're very low-level. That means that whenever you want to select packages by name, description, license, dependencies or other variables you need quite a bit of custom code, more if you want to combine multiple searches, and much more if you want to do it efficient and flexible. See http://dev.gentoo.org/~genone/scripts/metalib.py and http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible, but very inefficient search tool (might not work anymore due to old age). Ideally repository searches could be done without writing any application code using some kind of query language, similar to how SQL works for generic database searches (obviously not that complex). But before thinking about that we'd need a query API that actually a) allows tools to assemble queries without having to worry about implementation details b) run them efficiently without bothering the API user Simple example: Find all package-versions in the sys-apps category that are BSD-licensed. Currently that would involve something like: result = [] for package is dbapi.cp_all(): if not package.startswith("sys-apps/"): continue for version in dbapi.cp_list(package): license = dbapi.aux_get(version, ["LICENSE"])[0] # for simplicity perform a equivalence check, in reality you'd # have to account for complex license definitions if license == "BSD": result.append(version) Not very friendly to maintain, and not very efficient (we'd only need to iterate over packages in the 'sys-apps' category, but the interface doesn't allow that). And now how it might look with a extensive query interface: query = AndQuery() query.add(CategoryQuery("sys-apps", FullStringMatch())) query.add(MetadataQuery("BSD", FullStringMatch())) result = repository.selectPackages(query) Much nicer, don't you think? As said, implementing such a thing would be a huge amount of work, even if just implemented as wrappers on top of the current interface (which would prevent many efficiency improvements), but if you (or anyone else for that matter) are truly interested in this contact me off-list, maybe I can find some of my old design ideas and (incomplete) prototypes to give you a start. Marius