* [gentoo-portage-dev] search functionality in emerge
@ 2008-11-23 12:17 Emma Strubell
  2008-11-23 14:01 ` tvali (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Emma Strubell @ 2008-11-23 12:17 UTC
To: gentoo-portage-dev

Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!!

Anyway, for my final project in my Data Structures & Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependencies - whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!)

So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing. However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip as to where I should start looking. I'd really appreciate any help or advice!!
Thanks a lot, and keep on making my Debian-using professor jealous :]
Emma
* Re: [gentoo-portage-dev] search functionality in emerge
From: tvali @ 2008-11-23 14:01 UTC
To: gentoo-portage-dev

Try esearch:

    emerge esearch
    esearch ...

2008/11/23 Emma Strubell <emma.strubell@gmail.com>:
> [...]

--
tvali

From some forum: http://www.cooltests.com - if you know English. By the way, over 120 you're very smart, over 140 you're a genius, somewhere around 170 your head is like a total trash can...
* Re: [gentoo-portage-dev] search functionality in emerge
From: Pacho Ramos @ 2008-11-23 14:33 UTC
To: gentoo-portage-dev

On Sun, 2008-11-23 at 16:01 +0200, tvali wrote:
> Try esearch.
>
>     emerge esearch
>     esearch ...
> [...]

I use eix:

    emerge eix

;-)
* Re: [gentoo-portage-dev] search functionality in emerge
From: Emma Strubell @ 2008-11-23 14:43 UTC
To: gentoo-portage-dev

Thanks for the replies! I know there are a couple of programs out there that basically already do what I'm looking to do... Unfortunately I wasn't aware of these pre-existing utilities until after I submitted my project proposal to my professor. So, I'm looking to implement a better search myself, preferably by editing the existing portage code, not writing a separate program. So if anyone can offer any help regarding the actual implementation of search in portage, I would greatly appreciate it!

Or, if anyone has an idea for a more productive/useful project I could work on relating to portage (about the same difficulty, preferably at least a little bit data-structure related), please let me know!

Thanks again guys,
Emma

On Sun, Nov 23, 2008 at 9:33 AM, Pacho Ramos <pacho@condmat1.ciencias.uniovi.es> wrote:
> [...]
* Re: [gentoo-portage-dev] search functionality in emerge
From: Lucian Poston @ 2008-11-23 16:56 UTC
To: gentoo-portage-dev

> Thanks for the replies! [...] So if anyone can offer any help regarding
> the actual implementation of search in portage, I would greatly
> appreciate it!

Most of the search implementation is in /usr/lib/portage/pym/_emerge/__init__.py, in the class search. The class's execute() method simply iterates over all packages (and descriptions and package sets) and matches each against the searchkey. You might need to look into pym/portage/dbapi/porttree.py for portdbapi as well.

If you intend to index and support fast regex lookup, then you need to do some fancy indexing, which I'm not terribly familiar with. You could follow in the footsteps of eix[1] or other indexed search utilities and design some sort of index layout, which is easier than the following idea: you might consider implementing a suffix trie or a similar structure that has sublinear regexp lookup, and marshalling the structure to disk for the index. I couldn't find a Python implementation of something like this, but here is a general trie class[2] that you might start with if you go that route. There is a Perl module[3], Tie::Hash::Regex, that does this, but properly implementing it in Python would be a chore. :)

That project sounds interesting and fun. Good luck!

Lucian Poston

[1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout
[2] http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
[3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm
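[Editor's note: the execute() loop described above is linear in the number of packages. As a rough illustration of the indexing idea - not Portage's actual code, and with made-up package atoms - a prefix trie over package names answers prefix queries without scanning every entry:]

```python
class _Node:
    """One trie node: children keyed by character, atoms stored at word ends."""
    __slots__ = ("children", "atoms")

    def __init__(self):
        self.children = {}
        self.atoms = []

class PackageTrie:
    """Toy prefix trie over package names (a sketch, not Portage code)."""

    def __init__(self):
        self.root = _Node()

    def insert(self, name, atom):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, _Node())
        node.atoms.append(atom)

    def search_prefix(self, prefix):
        # Walk down to the prefix node, then collect every atom below it.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [node]
        while stack:
            n = stack.pop()
            out.extend(n.atoms)
            stack.extend(n.children.values())
        return sorted(out)

trie = PackageTrie()
for atom in ["app-portage/eix", "app-portage/esearch", "sys-apps/portage"]:
    trie.insert(atom.split("/")[1], atom)

print(trie.search_prefix("e"))  # -> ['app-portage/eix', 'app-portage/esearch']
```

Lookup cost is proportional to the prefix length plus the number of matches, not the size of the tree; the whole structure could be pickled to disk as the index, as suggested above.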
* Re: [gentoo-portage-dev] search functionality in emerge
From: Emma Strubell @ 2008-11-23 18:49 UTC
To: gentoo-portage-dev

Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the suffix trie sounds like a great idea. The trie class example is really helpful too, because this will be my first time programming in Python, and it's a bit easier to figure out what's going on syntax-wise in that simple trie class than in the middle of the portage source code!

Seriously, thanks again :]

On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston <lucianposton@gmail.com> wrote:
> [...]
* Re: [gentoo-portage-dev] search functionality in emerge
From: tvali @ 2008-11-23 20:00 UTC
To: gentoo-portage-dev

Yes - if it were a low-level implementation in portage, speeding up its native code for searching by using indexes, then it would make everything faster, including emerge itself (because emerge first searches for package relations). I have actually wanted to do this myself for several years, so I'm reacting here to have my ideas discussed, too.

Douglas Anderson's 16:46 reply is about locks, and I think it would require rethinking portage's locking methods - what, when and why it locks. That is probably quite a hard task by itself. Anyway, since portage already lets the user run two emerges at the same time, the locking might be OK as it is.

I think the best approach would be bottom-up refactoring: first make a list of the lowest-level functions which read data from the portage tree or write to it, then write an indexer class which is used by all of those low-level functions. To keep it OOP, it could be implemented like this:

- A low-level portage tree handler does everything with the portage tree; no other function in portage touches the tree directly.
- The tree handler has several required and several optional methods - so that implementing a new handler is easy, but things like native regex search remain possible.
- One could then implement a new tree handler with SQLite or another backend instead of the filesystem, and do other tricks through this interface - for example, boost it.

So a nice way to go would be:

1. Implement the portage tree handler and its proxy, which uses the current portage tree in a non-indexed way and simply provides the same kind of access as currently implemented.
2. Refactor portage to rely only on the portage tree handler and to touch the portage tree directly nowhere. To test this, move the portage tree to another directory which only the handler knows about, and check whether portage still works. Mark all places where portage uses its tree handler (with a standard comment, for example) and make clear which methods contain all of its boostable code.
3. Implement those methods in the proxy, which could simulate fast regex search and other operations on top of the simplest possible portage tree handler interface (something like four methods: add, remove, get, list). The proxy should use the handler's own methods when they are implemented.
4. Refactor portage to use the advanced methods of the proxy.
5. Now, with all the code gathered in one place as nice readable code, real optimizations can be discussed - here, for example.

Ideally, I think, portage would have these tree handlers:

- Filesystem handler - fast searches over the current portage tree structure.
- SQL handler - a rewrite of the tree functions into SQL queries.
- Network-based handler - it might sometimes be nice to have the portage tree on only one machine of a cluster, for example if I want 100 really small computers with a fast connection to a mother computer and the portage tree is too big to be copied to all of them.
- Memory-buffered handler with a daemon, which is actually a proxy to some other handler - a daemon which reads the whole tree (from the filesystem or SQL) into memory on boot or first use, creates a really fast index (because now better indexing pays off), optionally drops some less-needed parts of its index from memory as memory fills up, and behaves as a really simple proxy if it stays full. This should be implemented after the critical parts of the filesystem or SQL handler.

2008/11/23 Emma Strubell <emma.strubell@gmail.com>:
> [...]
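[Editor's note: the required-plus-optional-methods handler proposed above can be made concrete with an abstract base class and a proxy that falls back to a scan when a backend lacks the fast path. All class and method names here are invented for illustration, not actual Portage API:]

```python
import re
from abc import ABC, abstractmethod

class PortageTreeHandler(ABC):
    """Required low-level interface (hypothetical names, roughly the
    add/remove/get/list set proposed above)."""

    @abstractmethod
    def categories(self): ...

    @abstractmethod
    def list(self, category): ...

    # Optional fast path; indexing backends can override this.
    def search_regex(self, pattern):
        raise NotImplementedError

class TreeProxy:
    """Uses the backend's fast path if implemented, else a plain scan."""

    def __init__(self, handler):
        self.handler = handler

    def search(self, pattern):
        try:
            return self.handler.search_regex(pattern)
        except NotImplementedError:
            rx = re.compile(pattern)
            return [pkg
                    for cat in self.handler.categories()
                    for pkg in self.handler.list(cat)
                    if rx.search(pkg)]

class DictTreeHandler(PortageTreeHandler):
    """Toy in-memory backend for demonstration."""

    def __init__(self, tree):
        self.tree = tree  # {category: [package, ...]}

    def categories(self):
        return self.tree.keys()

    def list(self, category):
        return self.tree[category]

proxy = TreeProxy(DictTreeHandler({"app-portage": ["eix", "esearch"]}))
print(proxy.search("^e.*x$"))  # -> ['eix']
```

The point of the design: callers only ever see TreeProxy.search(), so an SQLite- or daemon-backed handler can later override search_regex() without touching any caller.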
* Re: [gentoo-portage-dev] search functionality in emerge
From: Mike Auty @ 2008-11-23 21:20 UTC
To: gentoo-portage-dev

Hiya Emma,

Good luck on your project. A couple of things to be wary of are disk I/O, metadata cache backends and overlays.

Disk I/O can be a significant bottleneck. Loading up a lot of files from disk (be it the metadata cache or whatever) can take a long time initially, but the data is then cached in RAM and is much faster to access afterwards.

Portage allows its internal metadata cache to be stored in a variety of formats, as long as there's a backend to support it. This means simple speedups can be achieved using cdb or sqlite (if you google these and portage you'll get gentoo-wiki tips, which unfortunately you'll have to read from google's cache at the moment). It also means that if you want to make use of this metadata from within portage, you'll have to rely on the API to tell the backend to get you all the data (and it may be difficult to speed that up without writing your own backend).

Finally, there are overlays, and since these can change outside of an "emerge --sync" (as indeed can the main tree), you'll have to reindex them before each search request, or give the user stale data until they manually reindex.

If you're interested in implementing this in python, you may want to look at another package manager that can handle the main tree, also implemented in python, called pkgcore. From what I understand it shares a similar code-base with portage, but its internal architecture may have changed a lot.

I hope some of that helps, and isn't off-putting. I look forward to seeing the results!

5:) Mike 5:)
* Re: [gentoo-portage-dev] search functionality in emerge
From: René 'Necoro' Neumann @ 2008-11-23 21:59 UTC
To: gentoo-portage-dev

Mike Auty schrieb:
> Finally there are overlays, and since these can change outside of an
> "emerge --sync" (as indeed can the main tree), you'll have to reindex
> these before each search request, or give the user stale data until they
> manually reindex.

Determining whether there has been a change to the ebuild system is a major point in the whole thing. What good does a great index do you if it does not notice the changes the user made in his own local overlay? :) Manually re-indexing is not a good choice, I think...

If somebody comes up with a good (and fast) solution here, that would be a nice thing ;) (I need it myself).

Regards,
René
* Re: [gentoo-portage-dev] search functionality in emerge
From: tvali @ 2008-11-24 0:53 UTC
To: gentoo-portage-dev

There is a daemon which notices filesystem changes - http://pyinotify.sourceforge.net/ would be a good choice. In case many different applications use the portage tree directly without going through any portage API (which is a bad choice, I think, and should be deprecated), there is a kind of "hack" - using http://www.freenet.org.nz/python/lufs-python/ to create a new filesystem (damn, now I would like to have some time to join this game). I hope it's possible to build it everywhere gentoo is supposed to work, but it's no problem if it's not - you can implement things so that it isn't needed. I totally agree that the filesystem is a bottleneck, but this suffix trie would check directories first, I guess.

Now, having this custom filesystem, which actually serves the portage tree like some odd API, you keep backwards compatibility and can still create your own thing. You would have these classes (the numbers show implementation order; whether the proxies are abstract classes, base classes or something else is not specified here - this just shows some relations between some imaginary objects):

- 1. PortageTreeApi - proxy for different portage trees on FS, SQL or other backends.
- 2. PortageTreeCachedApi - same as the previous, but contains a boosted memory cache. It should be able to save its state, which is simply writing its internal variables into a file.
- 3. PortageTreeDaemon - has an interface compatible with PortageTreeApi; this daemon serves the portage tree to PortageTreeFS and to portage itself. In reality it should be a base class of PortageTreeApi and PortageTreeCachedApi, so that they can be used directly as daemons. When the cached API is used as a daemon, it should be able to detect filesystem changes - thus implementations should provide change-trigger callbacks.
- 4. PortageTreeFS - a filesystem which can map any of these to the filesystem. Connectable to PortageTreeApi or PortageTreeDaemon. This provides filesystems usable for backwards compatibility. It cannot be used on architectures which don't implement lufs-python or an analog.
- 6. PortageTreeServer - a server which serves data from PortageTreeDaemon, PortageTreeCachedApi or PortageTreeApi to another computer.
- Implementations, which can be proxied through PortageTreeApi, PortageTreeCachedApi or PortageTreeDaemon:
  - 5. PortageTreeImplementationAsSqlDb
  - 1. PortageTreeImplementationAsFilesystem
  - 3. PortageTreeImplementationAsDaemon - a client, actually.
  - 6. PortageTreeImplementationAsServer - a client, too.

So, step 1 - creating PortageTreeApi and PortageTreeImplementationAsFilesystem - is a pure refactoring task at first. Adding more advanced functions to PortageTreeApi is basically refactoring, too. PortageTreeApi should not become too complex or contain any advanced tasks which are not purely db-specific, so some common base class could implement the more high-level things.

Then, step 2 - this finishes your schoolwork, but not yet in the most powerful way, as we then only have an index and the first search is still slow. At the beginning this cache cannot provide data about changes in the portage tree (which could be implemented by some versioning once this new API is the only place where the tree is updated), so it should have an index-update command and be used only for search.

Then, step 3 - having a portage tree daemon means that things can really be cached now and this cache can be kept in memory; it also means updates on filesystem changes.

Then, step 4 - having PortageTreeFS means that you can easily implement the portage tree on a faster medium without losing backwards compatibility.

Now, step 5 - implementation as an SQL DB is logical, as SQL is a standardized and common language for creating fast databases.

Eventually, step 6 - this has really nothing to do with boosting search, but on a fast network it could still speed up emerge by removing the need for "emerge --sync" on local networks. I think synchronization would then also be considered in those classes - CachedApi almost needs it to be faster over server-client connections. After that, ImplementationAsSync and ImplementationAsWebRsSync could be added and a sync server built onto this daemon. And since emerge --sync currently also seems quite slow - I see no point in waiting a long time just to fetch a few new items, as currently seems to happen - it would boost another life-critical part of portage.

So, hope that helps a bit - good luck!

2008/11/23 René 'Necoro' Neumann <lists@necoro.eu>:
> [...]
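[Editor's note: the remark above that PortageTreeCachedApi "should be able to save its state, which is simply writing its internal variables into a file" can be as small as pickling the in-memory index. A sketch - the class name comes from the list above, everything else is invented:]

```python
import os
import pickle
import tempfile

class PortageTreeCachedApi:
    """Toy cached API: builds a name -> atoms index and persists it."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.index = {}

    def rebuild(self, atoms):
        # Real code would walk the portage tree; here we index package names.
        self.index = {}
        for atom in atoms:
            name = atom.split("/", 1)[1]
            self.index.setdefault(name, []).append(atom)

    def save(self):
        # Write to a temp file and rename, so a crash never leaves a
        # half-written cache behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.cache_path) or ".")
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(self.index, fh)
        os.replace(tmp, self.cache_path)

    def load(self):
        with open(self.cache_path, "rb") as fh:
            self.index = pickle.load(fh)
```

The atomic rename matters here: a search process may load() at any moment, so the cache file on disk must always be complete.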
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-24 0:53 ` tvali @ 2008-11-24 9:34 ` René 'Necoro' Neumann 2008-11-24 9:48 ` Fabian Groffen 0 siblings, 1 reply; 44+ messages in thread From: René 'Necoro' Neumann @ 2008-11-24 9:34 UTC (permalink / raw To: gentoo-portage-dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 tvali schrieb: > There is daemon, which notices about filesystem changes - > http://pyinotify.sourceforge.net/ would be a good choice. Disadvantage: Has to run all the time (I see already some people crying: "oh noez. not yet another daemon..."). Problem with offline changes (which might be overcome by a one-time check on daemon-startup ... but this would really increase the startup time). I have built an algorithm, which does sth like: for overlay in OVERLAYS + PORTDIR: db[overlay] = md5("".join(f.st_mtime for files(overlay))) and then compare the MD5-values on later runs. This is fast if the portage stuff is already cached - else it is quite slow ;). Another disadvantage is, that it does not know, WHAT changes do have occurred and thus has to re-read the complete overlay. I like the filesystem idea more, than the one with the daemon :). Write a new FS (using FUSE f.ex. (LUFS is deprecated)) which provides a logfile. This logfile can either just contain the time of the latest change in the complete subtree, or even some kind of log stating WHICH files have been changed. I think, this should even be possible, if the tree is not on its own partition. Of course, this should be clearly an opt-in solution: If the user does not modify the trees by hand, or does so seldomly, the "create index after sync" (similarly to 'eix-sync') is sufficient. 
Regards,
René
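The loop in René's message above is pseudocode; a runnable sketch of the same idea - hashing the mtimes of every file under each tree and comparing digests between runs - might look like this. The `PORTDIR`/`OVERLAYS` values are placeholder paths, not portage's real configuration API:

```python
import hashlib
import os

# Placeholder tree locations; real values would come from portage's config.
PORTDIR = "/usr/portage"
OVERLAYS = ["/usr/local/portage"]

def tree_fingerprint(root):
    """Hash the mtimes of every file under root; touching, adding or
    removing any file changes the digest."""
    md5 = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()           # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                md5.update(str(os.stat(path).st_mtime).encode())
            except OSError:
                continue          # file vanished while we were walking
    return md5.hexdigest()

def changed_trees(saved):
    """Return the trees whose fingerprint differs from the one saved
    on the previous run (saved: {tree_path: hexdigest})."""
    return [tree for tree in OVERLAYS + [PORTDIR]
            if tree_fingerprint(tree) != saved.get(tree)]
```

As the message notes, this detects THAT a tree changed but not WHAT changed, so the whole tree still has to be re-read to rebuild its part of the index.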
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-24 9:34 ` René 'Necoro' Neumann @ 2008-11-24 9:48 ` Fabian Groffen 2008-11-24 14:30 ` tvali 0 siblings, 1 reply; 44+ messages in thread From: Fabian Groffen @ 2008-11-24 9:48 UTC (permalink / raw To: gentoo-portage-dev On 24-11-2008 10:34:28 +0100, René 'Necoro' Neumann wrote: > tvali schrieb: > > There is daemon, which notices about filesystem changes - > > http://pyinotify.sourceforge.net/ would be a good choice. > > Disadvantage: Has to run all the time (I see already some people crying: > "oh noez. not yet another daemon..."). ... and it is Linux only, which spoils the fun. -- Fabian Groffen Gentoo on a different level ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24  9:48 ` Fabian Groffen
@ 2008-11-24 14:30 ` tvali
  2008-11-24 15:14 ` tvali
  2008-11-24 15:15 ` René 'Necoro' Neumann
  0 siblings, 2 replies; 44+ messages in thread
From: tvali @ 2008-11-24 14:30 UTC (permalink / raw)
To: gentoo-portage-dev

So, mornings are smarter than evenings (it's an Estonian saying)... At night I thought more about this filesystem thing and found that it actually answers all the needs. I have now read some messages here and thought about how it could be made really simple, at least as I understand that word. Yesterday I looked for custom filesystems with custom functionality and did not find any, which is why I wrote that list of a big bunch of classes - which, as I think now, might be overkill.

First, about indexing: if you create neither a daemon nor a filesystem, you can create commands "emerge --indexon", "emerge --indexoff", "emerge --indexrenew". The index is then renewed on "emerge --sync" and the like, but when the user changes files manually, she has to renew the index manually - not much to ask, is it? If someone is going to open the cover of her computer, she takes on the responsibility of knowing some basic things about electricity, and of changing something in the BIOS after adding or removing parts. Maybe it should even be "emerge --commithandmadechanges", which would index or do whatever else is needed after hand-made changes. More such things might emerge in the future, I guess.

But about the filesystem...

Consider that in a filesystem you might have a directory which you cannot list, but in which you can read files. Imagine a function which is able to encode and decode queries into filesystem paths.
If you have a function search(packagename, "dependencies"), you can write it as a file path: /cgi-bin/search/packagename/dependencies - packagename can be encoded by replacing some characters with codes and splitting long strings with /. You could also have an API with one file in a directory from which you read a temporary filename, then write your query to that file and read the result from the same (or a similarly named) file with a different extension. So an FS provides several ways to create custom queries - the idea actually came from the suggestion on the LUFS page of building an FS as a CGI server, hence the "cgi-bin" prefix here to keep it simple. I think it's similar to how files in the /dev/ directory behave - you open a file and start writing and reading, but the file is actually zero-sized and contains nothing.

In that case, an API could be written to provide this filesystem and nothing more. If it is a custom-mapped filesystem, it could provide search and similar directories, which can be used by portage and others. If not, it would work as it always has.

So, a filesystem containing things like (I call this subdir "dev" here):

- /dev/search - write your query here and read the result.
- /dev/search/searchstring - another way for the user to read listings from her own custom script.
- /portage/directory/category/packagename/depslist.dev - contains a dynamic list of package dependencies.
- /dev/version - an integer which grows every time any change is made to the portage tree.

Other functions could be added over time.

Now, to keep things simple:

- Create a standard filesystem which can contain the portage tree.
- Add all necessary notifications for changed and updated files.
- Mount this filesystem over the same directory where the actual files live - if it's not mounted, portage will barely notice (so in an emergency, things are just slower).
You can navigate into a directory and then mount a new one over it - I am not on a Linux box right now, but if I remember correctly, a process already inside the real directory can keep using its files after something else is mounted over it.
- Create indexes and other stuff.

2008/11/24 Fabian Groffen <grobian@gentoo.org>

> On 24-11-2008 10:34:28 +0100, René 'Necoro' Neumann wrote:
> > tvali schrieb:
> > > There is a daemon which notices filesystem changes -
> > > http://pyinotify.sourceforge.net/ would be a good choice.
> >
> > Disadvantage: it has to run all the time (I can already see some people
> > crying: "oh noez, not yet another daemon...").
>
> ... and it is Linux only, which spoils the fun.
>
> --
> Fabian Groffen
> Gentoo on a different level

--
tvali
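The query-as-path idea above can be illustrated with a small encode/decode pair. The escaping scheme here (percent-encoding, as used in URLs) is just one possible choice for illustration, not anything portage or LUFS defines:

```python
import urllib.parse

def query_to_path(function, *args):
    """Encode a query such as search('app-editors/vim', 'dependencies')
    into a virtual-filesystem path.  Each argument becomes one path
    component, with unsafe characters (including '/') percent-escaped."""
    parts = [urllib.parse.quote(str(a), safe="") for a in args]
    return "/cgi-bin/%s/%s" % (function, "/".join(parts))

def path_to_query(path):
    """Decode a virtual path produced by query_to_path back into
    (function, args) - what the filesystem handler would do on open()."""
    parts = path.strip("/").split("/")
    if parts[0] != "cgi-bin":
        raise ValueError("not a query path: %r" % path)
    return parts[1], [urllib.parse.unquote(p) for p in parts[2:]]
```

The round trip keeps package atoms intact even though they contain a '/' themselves, which is the main subtlety the message's "replacing some characters with codes" alludes to.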
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24 14:30 ` tvali
@ 2008-11-24 15:14 ` tvali
  2008-11-24 15:15 ` René 'Necoro' Neumann
  1 sibling, 0 replies; 44+ messages in thread
From: tvali @ 2008-11-24 15:14 UTC (permalink / raw)
To: gentoo-portage-dev

There is one clear problem:

1. Some other app opens a portage file.
2. The tree is mounted and indexed.
3. The other app changes the file.
4. The index is out of date.

To prevent this, it should first be suggested that all scripts change the portage tree only through the mount. As a defence against those which don't follow that suggestion, portage should simply not use the altered data - portage should rely entirely on its internal index, and when you change a file without the index being updated, your change is simply lost.

Does this make the portage tree twice as big? I guess not, because:

- Use flags can be indexed and referred to by numbers.
- License, homepage and similar data does not need to be duplicated.

Also, as overlay directories are the suggested mechanism anyway, is it necessary to check *all* files for updates? I think that when someone does something wrong, it's OK if everything goes boom; and if someone has update scripts which don't use overlays and the other suggested ways of doing things, then adding one more thing that breaks is not bad. Hashing those few files isn't a bad idea, and keeping an internal duplicate of the overlay directory is not so bad either - then you need "emerge --commithandmadeupdates" and that's all.

Some things which could be used to speed searches up:

- Dependency searches are saved - so "emerge -p pck1 pck2 pck3" saves data about the deps of those 3 packages.
- The package name list is saved.
- All packages are given an integer ID.
- A list of all words in package descriptions is saved and connected to the packages' internal IDs. This could be used to make a smaller index file.
So when I search for "al", all words containing those characters (like "all") are considered, and the -S search runs only on those packages.
- A hash file of the whole portage tree is saved, to tell whether it changed since the last remount.

2008/11/24 tvali <qtvali@gmail.com>

> [... full quote of the previous message snipped ...]

--
tvali
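The description-word index proposed above (words mapped to internal package IDs, with substring queries scanning only the word list) could look roughly like this - toy data structures standing in for portage's actual metadata access:

```python
from collections import defaultdict

def build_word_index(descriptions):
    """descriptions: {package_id: description string}.
    Map each distinct lower-cased word to the set of package IDs
    whose description contains it."""
    index = defaultdict(set)
    for pid, text in descriptions.items():
        for word in text.lower().split():
            index[word].add(pid)
    return index

def description_search(index, substring):
    """Return IDs of packages with a description word containing
    `substring`.  Only the (comparatively small) word list is scanned,
    never every full description."""
    substring = substring.lower()
    hits = set()
    for word, pids in index.items():
        if substring in word:
            hits |= pids
    return hits
```

Because distinct words repeat heavily across descriptions, the word list is much smaller than the concatenated descriptions, which is the "smaller index file" point made above.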
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24 14:30 ` tvali
  2008-11-24 15:14 ` tvali
@ 2008-11-24 15:15 ` René 'Necoro' Neumann
  2008-11-24 15:18 ` tvali
  1 sibling, 1 reply; 44+ messages in thread
From: René 'Necoro' Neumann @ 2008-11-24 15:15 UTC (permalink / raw)
To: gentoo-portage-dev

tvali schrieb:
> But about filesystem...
>
> [... snip lots of stuff ...]

What you mentioned for the filesystem might be a nice thing (actually I started something like this some time ago [1], though it is now dead ;)), but it does not help with the index/determine-changes problem. It is just another API :).

Perhaps "index after sync" is sufficient for most of the userbase - but especially those who often deal with their own local overlays (like me) do not want to have to re-index manually - especially if re-indexing takes a long time. The best solution would be for portage to determine a) THAT something has changed and b) WHAT has changed, so that it only has to update those parts of the index and does not become annoying for the users (remember the "Generate Metadata" stage (or whatever it was called) in older portage versions, which alone seemed to take longer than the rest of the sync).

Regards,
René

[1] https://launchpad.net/catapultfs
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24 15:15 ` René 'Necoro' Neumann
@ 2008-11-24 15:18 ` tvali
  2008-11-24 17:15 ` tvali
  0 siblings, 1 reply; 44+ messages in thread
From: tvali @ 2008-11-24 15:18 UTC (permalink / raw)
To: gentoo-portage-dev

2008/11/24 René 'Necoro' Neumann <lists@necoro.eu>

> What you mentioned for the filesystem might be a nice thing (actually I
> started something like this some time ago [1], though it is now dead
> ;)), but it does not help in the index/determine changes thing. It is
> just another API :).

My line of thought is that when this FS is mounted, it *is* the portage dir - so with the FS mounted, all changes are noticed, because you make every change through that FS. When you unmount and remount it, some things might go wrong, and that's what I'm thinking about... but that's not a big problem.

> Perhaps the "index after sync" is sufficient for most parts of the
> userbase - but esp. those who often deal with their own local overlays
> (like me) do not want to have to re-index manually - esp. if re-indexing
> takes a long time. The best solution would be to have portage find a)
> THAT something has been changed and b) WHAT has been changed. So that it
> only has to update these parts of the index, and thus not be sth
> enerving for the users (remind the "Generate Metadata" stuff (or
> whatever it was called) in older portage versions, which alone seemed to
> take longer than the rest of the sync progress)
>
> Regards,
> René
>
> [1] https://launchpad.net/catapultfs

--
tvali
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24 15:18 ` tvali
@ 2008-11-24 17:15 ` tvali
  2008-11-30 23:42 ` Emma Strubell
  0 siblings, 1 reply; 44+ messages in thread
From: tvali @ 2008-11-24 17:15 UTC (permalink / raw)
To: gentoo-portage-dev

Let me summarize briefly, since I was fuzzy and Rene didn't catch all of it:

The portage tree has automatically updateable parts, which should not be changed by the user, and an overlay, which will be. Thus, the index of the automatic part only needs updating after "emerge --sync".

The speedup would come from a custom filesystem - call it PortageFS, for example. In the initial version, PortageFS uses the current portage tree and generates additional indexes.

So, when you boot up, you have the portage tree in /usr/portage. At some point, PortageFS is mounted onto that same directory, /usr/portage. It maps the real /usr/portage directory through the /usr/portage mount point and creates some additional folders like /usr/portage/search, which maps file accesses to real searches. /usr/portage/handler would be a file where you write a query and read the result. It would also contain virtual files for checking dependencies and similar - many things you could use from your own scripts.

While it is mounted, every change is noticed and the indexes are updated automagically (sometimes after communication with portage - for example, updates during "emerge --sync" should perhaps not happen automagically, as that makes things slower). While it is not mounted, you can change user files, but may have to run a notification script afterwards to rebuild the indexes.

The indexes are built into the FS. If PortageFS is not mounted, for example after an emergency reboot, portage can still work without indexes, using the real directory instead of the mount point.
* Re: [gentoo-portage-dev] search functionality in emerge
  2008-11-24 17:15 ` tvali
@ 2008-11-30 23:42 ` Emma Strubell
  2008-12-01  7:34 ` [gentoo-portage-dev] " Duncan
  0 siblings, 1 reply; 44+ messages in thread
From: Emma Strubell @ 2008-11-30 23:42 UTC (permalink / raw)
To: gentoo-portage-dev

You guys all have some great ideas, but I don't think I'd have enough time to implement them before my project is due... especially because they appear to be a bit beyond my current programming skills. I would love to devote a lot more time to this project, but I just can't right now because I already have a lot of other things on my plate. I am really interested in contributing to Gentoo and portage in the future, though. I'm thinking this summer I'll have a chance...

Anyway, I'm going to try to keep it simple and just implement a suffix trie, and hope that it provides some measurable speed improvement :] Thanks again for everyone's help, and I'll definitely share the (amateur and minimal, sorry!) results of my project if you're interested.

Emma

On Mon, Nov 24, 2008 at 12:15 PM, tvali <qtvali@gmail.com> wrote:

> [... full quote of the previous message snipped ...]
* [gentoo-portage-dev] Re: search functionality in emerge 2008-11-30 23:42 ` Emma Strubell @ 2008-12-01 7:34 ` Duncan 2008-12-01 10:40 ` Emma Strubell 0 siblings, 1 reply; 44+ messages in thread From: Duncan @ 2008-12-01 7:34 UTC (permalink / raw To: gentoo-portage-dev "Emma Strubell" <emma.strubell@gmail.com> posted 5a8c638a0811301542s4aca92c3ie68ef427913c0523@mail.gmail.com, excerpted below, on Sun, 30 Nov 2008 18:42:11 -0500: > i am really > interested in contributing to Gentoo and portage in the future, though. > I'm thinking this summer I'll have a chance... FWIW, Gentoo usually participates in the Google Summer of Code. Assuming they have it again next year, if you're already considering spending some time on Gentoo code this summer, might as well try to get paid a little something for it. It could/should be a nice resume booster, too. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 7:34 ` [gentoo-portage-dev] " Duncan @ 2008-12-01 10:40 ` Emma Strubell 2008-12-01 17:52 ` Zac Medico 0 siblings, 1 reply; 44+ messages in thread From: Emma Strubell @ 2008-12-01 10:40 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 1565 bytes --] I completely forgot about Google's Summer of Code! Thanks for reminding me. Hopefully I won't forget again by the time summer rolls around, obviously I wouldn't mind getting a little extra money for doing something I'd do for free anyway. On a more related note: What, exactly, does porttree.py do? And am I correct in thinking that my suffix tree(s) should somewhat replace porttree.py? Or, should I be using porttree.py in order to populate my tree? I think I have the suffix tree sufficiently figured out, I'm just trying to determine where, exactly, the tree will fit in to the portage code, and what the best way to populate it (with package names and some corresponding metadata) would be. On Mon, Dec 1, 2008 at 2:34 AM, Duncan <1i5t5.duncan@cox.net> wrote: > "Emma Strubell" <emma.strubell@gmail.com> posted > 5a8c638a0811301542s4aca92c3ie68ef427913c0523@mail.gmail.com, excerpted > below, on Sun, 30 Nov 2008 18:42:11 -0500: > > > i am really > > interested in contributing to Gentoo and portage in the future, though. > > I'm thinking this summer I'll have a chance... > > FWIW, Gentoo usually participates in the Google Summer of Code. Assuming > they have it again next year, if you're already considering spending some > time on Gentoo code this summer, might as well try to get paid a little > something for it. It could/should be a nice resume booster, too. =:^) > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." 
Richard Stallman > > > [-- Attachment #2: Type: text/html, Size: 2164 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge
  2008-12-01 10:40 ` Emma Strubell
@ 2008-12-01 17:52 ` Zac Medico
  2008-12-01 21:25 ` Emma Strubell
  0 siblings, 1 reply; 44+ messages in thread
From: Zac Medico @ 2008-12-01 17:52 UTC (permalink / raw)
To: gentoo-portage-dev

Emma Strubell wrote:
> I completely forgot about Google's Summer of Code! Thanks for reminding me.
> Hopefully I won't forget again by the time summer rolls around, obviously I
> wouldn't mind getting a little extra money for doing something I'd do for
> free anyway.
>
> On a more related note: What, exactly, does porttree.py do? And am I correct
> in thinking that my suffix tree(s) should somewhat replace porttree.py? Or,
> should I be using porttree.py in order to populate my tree?

You should use porttree.py to populate it. Specifically, you should use portdbapi.aux_get() calls to access the package metadata that you'll need, similar to how the code in the existing search class accesses it.

> I think I have
> the suffix tree sufficiently figured out, I'm just trying to determine
> where, exactly, the tree will fit in to the portage code, and what the best
> way to populate it (with package names and some corresponding metadata)
> would be.

There are three possible times at which I imagine a person might want to populate it:

1) Automatically after emerge --sync. This should not be mandatory since it will be somewhat time consuming and some users are very sensitive about --sync time. Note that FEATURES=metadata-transfer is disabled by default in the latest versions of portage, specifically to reduce --sync time.

2) On demand, when emerge --search is invoked. The calling user will need appropriate file system permissions in order to update the search index.

3) On request, by calling a command that is specifically designed to generate the search index. This could be a subcommand of emaint.
For the index file format, it would be simplest to use a python pickle file, but you might choose another format if you'd like the index to be accessible without python and the portage API (probably not necessary).
--
Thanks,
Zac
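Zac's pickle suggestion might look like this in outline - writing the index atomically so an interrupted emerge can't leave a truncated index behind. The file path and index layout are assumptions for illustration, not portage's actual conventions:

```python
import os
import pickle
import tempfile

def save_index(index, path):
    """Pickle `index` to `path` atomically: dump to a temp file in the
    same directory, then rename over the target in one step."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def load_index(path):
    """Return the pickled index, or None if it is missing or unreadable
    (in which case the caller falls back to the slow unindexed search)."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None
```

The None fallback matters for point 2) above: a user without write permission to the index location can still search, just without the speedup.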
* Re: [gentoo-portage-dev] Re: search functionality in emerge
  2008-12-01 17:52 ` Zac Medico
@ 2008-12-01 21:25 ` Emma Strubell
  2008-12-01 21:52 ` Tambet
  0 siblings, 1 reply; 44+ messages in thread
From: Emma Strubell @ 2008-12-01 21:25 UTC (permalink / raw)
To: gentoo-portage-dev

Thanks for the clarification. I was planning on forcing an update of the index as part of emerge --sync, and implementing a command that would update the search index (leaving it up to the user to update after making any manual changes to the portage tree). That way the search index should always be up to date when emerge -s is called. It does make sense for the update upon --sync to be optional, but I guess I don't see why the update should always be SO slow. Of course the first population of the tree will take quite a while, but assuming regular (daily?) --syncs (and therefore updates to the index), subsequent updates shouldn't take very long, since there will only be a few (hundred?) changes to be made to the tree.

And I do plan on pickling the search tree :]

Emma

On Mon, Dec 1, 2008 at 12:52 PM, Zac Medico <zmedico@gentoo.org> wrote:

> [... full quote of the previous message snipped ...]
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 21:25 ` Emma Strubell @ 2008-12-01 21:52 ` Tambet 2008-12-01 22:08 ` Emma Strubell 0 siblings, 1 reply; 44+ messages in thread From: Tambet @ 2008-12-01 21:52 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 4001 bytes --] I would suggest a different way of updates. When you manually change portage tree, you have to make an overlay. Overlay, as it's updated and managed by human being, will be always small (unless someone makes a script, which creates million overlay updates, but I dont think it would be efficient way to do anything). So, when you search, you can search Portage tree with index, which is updated with --sync and then search overlay, which is small and fast to search anyway. Overlay should not have index in such case. If anyone is going to change portage tree by hand, those changes will be lost with next --sync and thus noone should do it anyway - this case should not be considered at all. Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing. 2008/12/1 Emma Strubell <emma.strubell@gmail.com> > Thanks for the clarification. I was planning on forcing an update of the > index as a part of emerge --sync, and implementing a command that would > update the search index (leaving it up to the user to update after making > any manual changes to the portage tree). That way the search index should > always be up-to-date when emerge -s is called. It does make sense for the > update upon --sync to be optional, but I guess I don't see why the update > should always be SO slow. Of course the first population of the tree will > take quite a while, but assuming regular (daily?) --syncs (and therefore > updates to the index), subsequent updates shouldn't take very long, since > there will only be a few (hundred?) changes to be made to the tree. 
> > And I do plan on using a pickling the search tree :] > > Emma > > > On Mon, Dec 1, 2008 at 12:52 PM, Zac Medico <zmedico@gentoo.org> wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Emma Strubell wrote: >> > I completely forgot about Google's Summer of Code! Thanks for reminding >> me. >> > Hopefully I won't forget again by the time summer rolls around, >> obviously I >> > wouldn't mind getting a little extra money for doing something I'd do >> for >> > free anyway. >> > >> > On a more related note: What, exactly, does porttree.py do? And am I >> correct >> > in thinking that my suffix tree(s) should somewhat replace porttree.py? >> Or, >> > should I be using porttree.py in order to populate my tree? >> >> You should use portree.py to populate it. Specifically, you should >> use portdbapi.aux_get() calls to access the package metadata that >> you'll need, similar to how the code in the existing search class >> accesses it. >> >> > I think I have >> > the suffix tree sufficiently figured out, I'm just trying to determine >> > where, exactly, the tree will fit in to the portage code, and what the >> best >> > way to populate it (with package names and some corresponding metadata) >> > would be. >> >> There are there possible times that I imagine a person might want to >> populate it: >> >> 1) Automatically after emerge --sync. This should not be mandatory >> since it will be somewhat time consuming and some users are very >> sensitive about --sync time. Note that FEATURES=metadate-transfer is >> disabled by default in the latest versions of portage, specifically >> to reduce --sync time. >> >> 2) On demand, when emerge --search is invoked. The calling user will >> need appropriate file system permissions in order to update the >> search index. >> >> 3) On request, by calling a command that is specifically designed to >> generate the search index. This could be a subcommand of emaint. 
>> >> For the index file format, it would be simplest to use a python >> pickle file, but you might choose another format if you'd like the >> index to be accessible without python and the portage API (probably >> not necessary). >> - -- >> Thanks, >> Zac >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v2.0.9 (GNU/Linux) >> >> iEYEARECAAYFAkk0JFAACgkQ/ejvha5XGaONDACgixnmCh9Ei6MyUGIZXpiFt7F2 >> gqMAoOhf5H2uZHB7xhjecOcL0G3w/cqR >> =hFNz >> -----END PGP SIGNATURE----- >> >> > [-- Attachment #2: Type: text/html, Size: 4782 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
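Tambet's scheme above - answer the main tree from a prebuilt index, and scan the small overlays directly - can be sketched as follows (function and parameter names are hypothetical, not portage API):

```python
def search_packages(query, main_index, overlay_trees):
    """Substring search over package names: the main tree is answered
    from a prebuilt index (a dict keyed by category/package), while
    each overlay is scanned linearly since overlays stay small."""
    query = query.lower()
    hits = {cp for cp in main_index if query in cp.lower()}
    for tree in overlay_trees:  # each tree: an iterable of package names
        hits.update(cp for cp in tree if query in cp.lower())
    return sorted(hits)
```

The point of the split is that only the big, machine-synced tree justifies index maintenance; the overlays get correct results without any indexing work.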
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 21:52 ` Tambet @ 2008-12-01 22:08 ` Emma Strubell 2008-12-01 22:17 ` René 'Necoro' Neumann 0 siblings, 1 reply; 44+ messages in thread From: Emma Strubell @ 2008-12-01 22:08 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 4379 bytes --] Good point. I may just ignore overlays completely because 1) I don't use them and 2) does anyone really need to search an overlay anyway? aren't any packages added via an overlay added deliberately? On Mon, Dec 1, 2008 at 4:52 PM, Tambet <qtvali@gmail.com> wrote: > I would suggest a different way of updates. When you manually change > portage tree, you have to make an overlay. Overlay, as it's updated and > managed by human being, will be always small (unless someone makes a script, > which creates million overlay updates, but I dont think it would be > efficient way to do anything). So, when you search, you can search Portage > tree with index, which is updated with --sync and then search overlay, which > is small and fast to search anyway. Overlay should not have index in such > case. If anyone is going to change portage tree by hand, those changes will > be lost with next --sync and thus noone should do it anyway - this case > should not be considered at all. > > Tambet - technique evolves to art, art evolves to magic, magic evolves to > just doing. > > > 2008/12/1 Emma Strubell <emma.strubell@gmail.com> > > Thanks for the clarification. I was planning on forcing an update of the >> index as a part of emerge --sync, and implementing a command that would >> update the search index (leaving it up to the user to update after making >> any manual changes to the portage tree). That way the search index should >> always be up-to-date when emerge -s is called. It does make sense for the >> update upon --sync to be optional, but I guess I don't see why the update >> should always be SO slow. 
Of course the first population of the tree will >> take quite a while, but assuming regular (daily?) --syncs (and therefore >> updates to the index), subsequent updates shouldn't take very long, since >> there will only be a few (hundred?) changes to be made to the tree. >> >> And I do plan on using a pickling the search tree :] >> >> Emma >> >> >> On Mon, Dec 1, 2008 at 12:52 PM, Zac Medico <zmedico@gentoo.org> wrote: >> >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>> Emma Strubell wrote: >>> > I completely forgot about Google's Summer of Code! Thanks for reminding >>> me. >>> > Hopefully I won't forget again by the time summer rolls around, >>> obviously I >>> > wouldn't mind getting a little extra money for doing something I'd do >>> for >>> > free anyway. >>> > >>> > On a more related note: What, exactly, does porttree.py do? And am I >>> correct >>> > in thinking that my suffix tree(s) should somewhat replace porttree.py? >>> Or, >>> > should I be using porttree.py in order to populate my tree? >>> >>> You should use portree.py to populate it. Specifically, you should >>> use portdbapi.aux_get() calls to access the package metadata that >>> you'll need, similar to how the code in the existing search class >>> accesses it. >>> >>> > I think I have >>> > the suffix tree sufficiently figured out, I'm just trying to determine >>> > where, exactly, the tree will fit in to the portage code, and what the >>> best >>> > way to populate it (with package names and some corresponding metadata) >>> > would be. >>> >>> There are there possible times that I imagine a person might want to >>> populate it: >>> >>> 1) Automatically after emerge --sync. This should not be mandatory >>> since it will be somewhat time consuming and some users are very >>> sensitive about --sync time. Note that FEATURES=metadate-transfer is >>> disabled by default in the latest versions of portage, specifically >>> to reduce --sync time. 
>>> >>> 2) On demand, when emerge --search is invoked. The calling user will >>> need appropriate file system permissions in order to update the >>> search index. >>> >>> 3) On request, by calling a command that is specifically designed to >>> generate the search index. This could be a subcommand of emaint. >>> >>> For the index file format, it would be simplest to use a python >>> pickle file, but you might choose another format if you'd like the >>> index to be accessible without python and the portage API (probably >>> not necessary). >>> - -- >>> Thanks, >>> Zac >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG v2.0.9 (GNU/Linux) >>> >>> iEYEARECAAYFAkk0JFAACgkQ/ejvha5XGaONDACgixnmCh9Ei6MyUGIZXpiFt7F2 >>> gqMAoOhf5H2uZHB7xhjecOcL0G3w/cqR >>> =hFNz >>> -----END PGP SIGNATURE----- >>> >>> >> > [-- Attachment #2: Type: text/html, Size: 5363 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 22:08 ` Emma Strubell @ 2008-12-01 22:17 ` René 'Necoro' Neumann 2008-12-01 22:47 ` Emma Strubell 0 siblings, 1 reply; 44+ messages in thread From: René 'Necoro' Neumann @ 2008-12-01 22:17 UTC (permalink / raw To: gentoo-portage-dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Emma Strubell schrieb: > 2) does anyone really need to search an overlay anyway? Of course. Take large (semi-)official overlays like sunrise. They can easily be seen as a second portage tree. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S =+lCO -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 22:17 ` René 'Necoro' Neumann @ 2008-12-01 22:47 ` Emma Strubell 2008-12-02 0:20 ` Tambet 0 siblings, 1 reply; 44+ messages in thread From: Emma Strubell @ 2008-12-01 22:47 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 714 bytes --] True, true. Like I said, I don't really use overlays, so excuse my ignorance. On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Emma Strubell schrieb: > > 2) does anyone really need to search an overlay anyway? > > Of course. Take large (semi-)official overlays like sunrise. They can > easily be seen as a second portage tree. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt > 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S > =+lCO > -----END PGP SIGNATURE----- > > [-- Attachment #2: Type: text/html, Size: 1139 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-01 22:47 ` Emma Strubell @ 2008-12-02 0:20 ` Tambet 2008-12-02 2:23 ` Emma Strubell 2008-12-02 10:21 ` Alec Warner 0 siblings, 2 replies; 44+ messages in thread From: Tambet @ 2008-12-02 0:20 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 4626 bytes --] 2008/12/2 Emma Strubell <emma.strubell@gmail.com> > True, true. Like I said, I don't really use overlays, so excuse my > igonrance. > Do you know the order of doing things: Rules of Optimization: - Rule 1: Don't do it. - Rule 2 (for experts only): Don't do it yet. What this actually means - functionality comes first. Readability comes next. Optimization comes last. Unless you are creating a fancy 3D engine for a kung fu game. If you are going to exclude overlays, you are removing functionality - and, indeed, absolutely has-to-be-there functionality, because no one would intuitively expect the search function to search only one subset of packages, however reasonable this subset would be. So, you can't, just can't, add this into the portage base - you could write just another external search package for portage. I looked at this code a bit: Portage's "__init__.py" contains the comment "# search functionality". After this comment, there is a nice and simple search class. It also contains the method "def action_sync(...)", which contains synchronization stuff. Now, the search class will be initialized by setting up 3 databases - porttree, bintree and vartree, whatever those are. Those will be in the self._dbs array and porttree will be in self._portdb. It contains some more methods: _findname(...) will return the result of self._portdb.findname(...) with the same parameters, or None if it does not exist. Other methods will do similar things - map one or another method. execute will do the real search... Now - "for package in self.portdb.cp_all()" is important here ...it currently loops over the whole portage tree. 
All kinds of matching will be done inside. self.portdb obviously points to porttree.py (unless it points to a fake tree). cp_all will take all porttrees and do a simple file search inside. This method should contain the optional index search. self.porttrees = [self.porttree_root] + \ [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()] So, self.porttrees contains the list of trees - the first of them is the root, the others are overlays. Now, what you have to do will not be any harder just because of having overlay search, too. You have to create a method def cp_index(self), which will return a dictionary containing package names as keys. For oroot... will be "self.porttrees[1:]", not "self.porttrees" - this will only search overlays. d = {} will be replaced with d = self.cp_index(). If the index is not there, the old version will be used (thus, you have to make an internal porttrees variable, which contains all trees or all except the first). Other methods used by search are xmatch and aux_get - the first is used several times and the last is used to get the description. You have to cache the results of those specific queries and make them use your cache - as you can see, those parts of portage are already able to use overlays. Thus, you have to put your code at the beginning of those functions - create index_xmatch and index_aux_get methods, then make those functions use them and return their results unless those are None (or something else, in case None is already a legal result) - if they return None, the old code will be run and do its job. If the index is not created, the result is None. In the index_** methods, just check whether the query is one you can answer, and if it is, then answer it. Obviously, the simplest way to create your index is to delete the index, then use those same methods to query for all necessary information - and the fastest way would be to add updating of the index directly into sync, which you could do later. Please also make commands to turn the index on and off (the latter should also delete it to save disk space). 
Default should be off until it's fast, small and reliable. Also notice that if the index is kept on the hard drive, it might be faster if it's compressed (gz, for example) - decompressing takes less time and more processing power than reading it out in full. Good luck! -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Emma Strubell schrieb: >> > 2) does anyone really need to search an overlay anyway? >> >> Of course. Take large (semi-)official overlays like sunrise. They can >> easily be seen as a second portage tree. >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v2.0.9 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >> >> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt >> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S >> =+lCO >> -----END PGP SIGNATURE----- >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> wrote: > > [-- Attachment #2: Type: text/html, Size: 5637 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
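The fallback pattern described above - try the index first, run the old code path on a miss - might look like this sketch; the class and its method names follow the proposal in the message and are hypothetical, with portdb standing in for the real porttree database:

```python
class IndexedSearch:
    """Answer queries from a cached index when possible; a None return
    signals 'not indexed', telling the caller to run the original
    (slow) portage code path instead."""

    def __init__(self, portdb, index=None):
        self._portdb = portdb
        self._index = index  # dict: 'cat/pkg' -> cached metadata, or None

    def cp_all(self):
        if self._index is not None:
            return sorted(self._index)       # fast path: indexed names
        return self._portdb.cp_all()         # old file-system walk

    def index_aux_get(self, cp, keys):
        if self._index is None or cp not in self._index:
            return None                      # fall back to the real aux_get
        entry = self._index[cp]
        if all(k in entry for k in keys):
            return [entry[k] for k in keys]
        return None                          # key not cached: fall back
```

The design keeps the index strictly optional: every miss degrades to the existing behavior, so a stale or disabled index never breaks search.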
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 0:20 ` Tambet @ 2008-12-02 2:23 ` Emma Strubell 2008-12-02 10:21 ` Alec Warner 1 sibling, 0 replies; 44+ messages in thread From: Emma Strubell @ 2008-12-02 2:23 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 5275 bytes --] yes, yes, i know, you're right :] and thanks a bunch for the outline! about the compression, I agree that it would be a good idea, but I don't know how to implement it. not that it would be difficult... I'm guessing there's a gzip module for python that would make it pretty straightforward? I think I'm getting ahead of myself, though. I haven't even implemented the suffix tree yet! Emma On Mon, Dec 1, 2008 at 7:20 PM, Tambet <qtvali@gmail.com> wrote: > 2008/12/2 Emma Strubell <emma.strubell@gmail.com> > >> True, true. Like I said, I don't really use overlays, so excuse my >> igonrance. >> > > Do you know an order of doing things: > > Rules of Optimization: > > - Rule 1: Don't do it. > - Rule 2 (for experts only): Don't do it yet. > > What this actually means - functionality comes first. Readability comes > next. Optimization comes last. Unless you are creating a fancy 3D engine for > kung fu game. > > If you are going to exclude overlays, you are removing functionality - and, > indeed, absolutely has-to-be-there functionality, because noone would > intuitively expect search function to search only one subset of packages, > however reasonable this subset would be. So, you can't, just can't, add this > package into portage base - you could write just another external search > package for portage. > > I looked this code a bit and: > Portage's "__init__.py" contains comment "# search functionality". After > this comment, there is a nice and simple search class. > It also contains method "def action_sync(...)", which contains > synchronization stuff. 
> > Now, search class will be initialized by setting up 3 databases - porttree, > bintree and vartree, whatever those are. Those will be in self._dbs array > and porttree will be in self._portdb. > > It contains some more methods: > _findname(...) will return result of self._portdb.findname(...) with same > parameters or None if it does not exist. > Other methods will do similar things - map one or another method. > execute will do the real search... > Now - "for package in self.portdb.cp_all()" is important here ...it > currently loops over whole portage tree. All kinds of matching will be done > inside. > self.portdb obviously points to porttree.py (unless it points to fake > tree). > cp_all will take all porttrees and do simple file search inside. This > method should contain optional index search. > > self.porttrees = [self.porttree_root] + \ > [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()] > > So, self.porttrees contains list of trees - first of them is root, others > are overlays. > > Now, what you have to do will not be harder just because of having overlay > search, too. > > You have to create method def cp_index(self), which will return dictionary > containing package names as keys. For oroot... will be "self.porttrees[1:]", > not "self.porttrees" - this will only search overlays. d = {} will be > replaced with d = self.cp_index(). If index is not there, old version will > be used (thus, you have to make internal porttrees variable, which contains > all or all except first). > > Other methods used by search are xmatch and aux_get - first used several > times and last one used to get description. You have to cache results of > those specific queries and make them use your cache - as you can see, those > parts of portage are already able to use overlays. 
Thus, you have to put > your code again in beginning of those functions - create index_xmatch and > index_aux_get methods, then make those methods use them and return their > results unless those are None (or something other in case none is already > legal result) - if they return None, old code will be run and do it's job. > If index is not created, result is None. In index_** methods, just check if > query is what you can answer and if it is, then answer it. > > Obviously, the simplest way to create your index is to delete index, then > use those same methods to query for all nessecary information - and fastest > way would be to add updating index directly into sync, which you could do > later. > > Please, also, make those commands to turn index on and off (last one should > also delete it to save disk space). Default should be off until it's fast, > small and reliable. Also notice that if index is kept on hard drive, it > might be faster if it's compressed (gz, for example) - decompressing takes > less time and more processing power than reading it fully out. > > Have luck! > > -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>> Emma Strubell schrieb: >>> > 2) does anyone really need to search an overlay anyway? >>> >>> Of course. Take large (semi-)official overlays like sunrise. They can >>> easily be seen as a second portage tree. >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG v2.0.9 (GNU/Linux) >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S >>> =+lCO >>> -----END PGP SIGNATURE----- >>> >>> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu>wrote: >> >> > [-- Attachment #2: Type: text/html, Size: 6454 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
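There is indeed a gzip module in the standard library, and since gzip.open returns a file-like object, a pickled index can be compressed transparently - a minimal sketch in modern Python (Python 2.5 of the era would need explicit close() calls instead of with-blocks):

```python
import gzip
import pickle

def save_compressed_index(index, path):
    # gzip.open yields a file-like object, so pickle writes to it directly.
    with gzip.open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed_index(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```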
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 0:20 ` Tambet 2008-12-02 2:23 ` Emma Strubell @ 2008-12-02 10:21 ` Alec Warner 2008-12-02 12:42 ` Tambet 2008-12-02 17:42 ` Tambet 1 sibling, 2 replies; 44+ messages in thread From: Alec Warner @ 2008-12-02 10:21 UTC (permalink / raw To: gentoo-portage-dev On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote: > 2008/12/2 Emma Strubell <emma.strubell@gmail.com> >> >> True, true. Like I said, I don't really use overlays, so excuse my >> igonrance. > > Do you know an order of doing things: > > Rules of Optimization: > > Rule 1: Don't do it. > Rule 2 (for experts only): Don't do it yet. > > What this actually means - functionality comes first. Readability comes > next. Optimization comes last. Unless you are creating a fancy 3D engine for > kung fu game. > > If you are going to exclude overlays, you are removing functionality - and, > indeed, absolutely has-to-be-there functionality, because noone would > intuitively expect search function to search only one subset of packages, > however reasonable this subset would be. So, you can't, just can't, add this > package into portage base - you could write just another external search > package for portage. > > I looked this code a bit and: > Portage's "__init__.py" contains comment "# search functionality". After > this comment, there is a nice and simple search class. > It also contains method "def action_sync(...)", which contains > synchronization stuff. > > Now, search class will be initialized by setting up 3 databases - porttree, > bintree and vartree, whatever those are. Those will be in self._dbs array > and porttree will be in self._portdb. > > It contains some more methods: > _findname(...) will return result of self._portdb.findname(...) with same > parameters or None if it does not exist. > Other methods will do similar things - map one or another method. > execute will do the real search... 
> Now - "for package in self.portdb.cp_all()" is important here ...it > currently loops over whole portage tree. All kinds of matching will be done > inside. > self.portdb obviously points to porttree.py (unless it points to fake tree). > cp_all will take all porttrees and do simple file search inside. This method > should contain optional index search. > > self.porttrees = [self.porttree_root] + \ > [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()] > > So, self.porttrees contains list of trees - first of them is root, others > are overlays. > > Now, what you have to do will not be harder just because of having overlay > search, too. > > You have to create method def cp_index(self), which will return dictionary > containing package names as keys. For oroot... will be "self.porttrees[1:]", > not "self.porttrees" - this will only search overlays. d = {} will be > replaced with d = self.cp_index(). If index is not there, old version will > be used (thus, you have to make internal porttrees variable, which contains > all or all except first). > > Other methods used by search are xmatch and aux_get - first used several > times and last one used to get description. You have to cache results of > those specific queries and make them use your cache - as you can see, those > parts of portage are already able to use overlays. Thus, you have to put > your code again in beginning of those functions - create index_xmatch and > index_aux_get methods, then make those methods use them and return their > results unless those are None (or something other in case none is already > legal result) - if they return None, old code will be run and do it's job. > If index is not created, result is None. In index_** methods, just check if > query is what you can answer and if it is, then answer it. 
> > Obviously, the simplest way to create your index is to delete index, then > use those same methods to query for all nessecary information - and fastest > way would be to add updating index directly into sync, which you could do > later. > > Please, also, make those commands to turn index on and off (last one should > also delete it to save disk space). Default should be off until it's fast, > small and reliable. Also notice that if index is kept on hard drive, it > might be faster if it's compressed (gz, for example) - decompressing takes > less time and more processing power than reading it fully out. I'm pretty sure you're mistaken here, unless your index is stored on a floppy or something really slow. A disk read has 2 primary costs. Seek Time: Time for the head to seek to the sector of disk you want. Spin Time: Time for the platter to spin around such that the sector you want is under the read head. Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120 rotations per second, so worst case (you just passed the sector you need) you need to wait 1/120th of a second (about 8ms). Seek Time is per hard drive, but most drives provide average seek times under 10ms. So it takes on average 18ms to get to your data, then you start reading. The index will not be that large (my esearchdb is 2 megs, but let's assume 10MB for this compressed index). I took a 10MB sqlite database and compressed it with gzip (default settings) down to 5 megs. gzip -d on the database takes 300ms, catting the decompressed database takes 88ms (average of 5 runs, drop disk caches between runs). I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle 1.3megs compresses to 390k. 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from disk. Your index would have to be very large or very fragmented on disk (requiring more than one seek) to see a significant gain in file compression (gzip scales linearly). 
In short, don't compress the index ;p > > Have luck! > >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>> Emma Strubell schrieb: >>> > 2) does anyone really need to search an overlay anyway? >>> >>> Of course. Take large (semi-)official overlays like sunrise. They can >>> easily be seen as a second portage tree. >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG v2.0.9 (GNU/Linux) >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S >>> =+lCO >>> -----END PGP SIGNATURE----- >>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> >> wrote: >> > > ^ permalink raw reply [flat|nested] 44+ messages in thread
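Alec's measurement can be reproduced with a small timing harness like the sketch below (paths are placeholders; without dropping the kernel's page cache between runs, as he did, the numbers only reflect the CPU side of the trade-off):

```python
import gzip
import time

def time_plain_read(path):
    """Time reading an uncompressed file, like `cat` in the test above."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, data

def time_gzip_read(path):
    """Time reading and decompressing a gzipped file, like `gzip -d`."""
    start = time.perf_counter()
    with gzip.open(path, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, data
```

Comparing the two durations on the same payload shows whether decompression overhead outweighs the bytes saved on disk for a given index size.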
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 10:21 ` Alec Warner @ 2008-12-02 12:42 ` Tambet 2008-12-02 13:51 ` Tambet 2008-12-02 19:54 ` Alec Warner 2008-12-02 17:42 ` Tambet 1 sibling, 2 replies; 44+ messages in thread From: Tambet @ 2008-12-02 12:42 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 7946 bytes --] About zipping.. The default settings might not really be a good idea - I think "fastest" might be even better. Considering that the portage tree contains the same words again and again (like "applications"), it needs only a pretty small dictionary to compress much smaller. Decompressing will not mean reading from disk, decompressing and writing back to disk, as in your test, probably - try decompressing to a memory drive and you might get better numbers. I have personally used compression in one C++ application, and with optimum settings it made things much faster - those were files where I had, for example, 65536 16-byte integers, which could be zeros and mostly were; I didn't care about creating a better file format, but just compressed the whole thing. I suggest you compress the esearch db, then decompress it to a memory drive and give us those numbers - it might be considerably faster. http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support. Try that module's open and the normal open on the esearch db; also compress with the same lib to get the right kind of file. Anyway - maybe this compression should be added later and be optional. Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing. 2008/12/2 Alec Warner <antarus@gentoo.org> > On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote: > > 2008/12/2 Emma Strubell <emma.strubell@gmail.com> > >> > >> True, true. Like I said, I don't really use overlays, so excuse my > >> igonrance. > > > > Do you know an order of doing things: > > > > Rules of Optimization: > > > > Rule 1: Don't do it. > > Rule 2 (for experts only): Don't do it yet. 
> > > > What this actually means - functionality comes first. Readability comes > > next. Optimization comes last. Unless you are creating a fancy 3D engine > for > > kung fu game. > > > > If you are going to exclude overlays, you are removing functionality - > and, > > indeed, absolutely has-to-be-there functionality, because noone would > > intuitively expect search function to search only one subset of packages, > > however reasonable this subset would be. So, you can't, just can't, add > this > > package into portage base - you could write just another external search > > package for portage. > > > > I looked this code a bit and: > > Portage's "__init__.py" contains comment "# search functionality". After > > this comment, there is a nice and simple search class. > > It also contains method "def action_sync(...)", which contains > > synchronization stuff. > > > > Now, search class will be initialized by setting up 3 databases - > porttree, > > bintree and vartree, whatever those are. Those will be in self._dbs array > > and porttree will be in self._portdb. > > > > It contains some more methods: > > _findname(...) will return result of self._portdb.findname(...) with same > > parameters or None if it does not exist. > > Other methods will do similar things - map one or another method. > > execute will do the real search... > > Now - "for package in self.portdb.cp_all()" is important here ...it > > currently loops over whole portage tree. All kinds of matching will be > done > > inside. > > self.portdb obviously points to porttree.py (unless it points to fake > tree). > > cp_all will take all porttrees and do simple file search inside. This > method > > should contain optional index search. > > > > self.porttrees = [self.porttree_root] + \ > > [os.path.realpath(t) for t in > self.mysettings["PORTDIR_OVERLAY"].split()] > > > > So, self.porttrees contains list of trees - first of them is root, others > > are overlays. 
> > > > Now, what you have to do will not be harder just because of having > overlay > > search, too. > > > > You have to create method def cp_index(self), which will return > dictionary > > containing package names as keys. For oroot... will be > "self.porttrees[1:]", > > not "self.porttrees" - this will only search overlays. d = {} will be > > replaced with d = self.cp_index(). If index is not there, old version > will > > be used (thus, you have to make internal porttrees variable, which > contains > > all or all except first). > > > > Other methods used by search are xmatch and aux_get - first used several > > times and last one used to get description. You have to cache results of > > those specific queries and make them use your cache - as you can see, > those > > parts of portage are already able to use overlays. Thus, you have to put > > your code again in beginning of those functions - create index_xmatch and > > index_aux_get methods, then make those methods use them and return their > > results unless those are None (or something other in case none is already > > legal result) - if they return None, old code will be run and do it's > job. > > If index is not created, result is None. In index_** methods, just check > if > > query is what you can answer and if it is, then answer it. > > > > Obviously, the simplest way to create your index is to delete index, then > > use those same methods to query for all nessecary information - and > fastest > > way would be to add updating index directly into sync, which you could do > > later. > > > > Please, also, make those commands to turn index on and off (last one > should > > also delete it to save disk space). Default should be off until it's > fast, > > small and reliable. Also notice that if index is kept on hard drive, it > > might be faster if it's compressed (gz, for example) - decompressing > takes > > less time and more processing power than reading it fully out. 
> I'm pretty sure you're mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: time for the head to seek to the sector of disk you want.
> Spin Time: time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120
> rotations per second, so worst case (you just passed the sector you
> need) you need to wait 1/120th of a second (about 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
>
> So it takes on average 18ms to get to your data, then you start
> reading. The index will not be that large (my esearchdb is 2 megs,
> but let's assume 10MB for this compressed index).
>
> I took a 10MB sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms; catting the decompressed
> database takes 88ms (average of 5 runs, dropping disk caches between
> runs).
>
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
> disk.
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain from file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
>
> >
> > Have luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
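Alec's methodology (time a plain read against a `gzip -d` of the same data, dropping caches between runs) can be reproduced with a sketch like this; the file name is made up, and the drop_caches step is left commented out because it needs root:

```shell
#!/bin/sh
# Benchmark sketch: plain read vs. decompress-read of the same data.
f=/tmp/index_test.dat

# Build a repetitive, index-like test file (~4 MB).
yes "app-misc/example a short package description" | head -n 100000 > "$f"
gzip -1 -c "$f" > "$f.gz"      # fastest setting, keep the original

# sync && echo 3 > /proc/sys/vm/drop_caches   # root only: drop page cache

time cat "$f" > /dev/null
time gzip -d -c "$f.gz" > /dev/null

# Sanity check: decompression reproduces the original bytes.
gzip -d -c "$f.gz" | cmp -s - "$f" && echo "round-trip ok"
```

Without the drop_caches step both timings come from the page cache, which is why a warm-cache run tends to favor the plain read even more.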
> >>> -----BEGIN PGP SIGNATURE----- > >>> Version: GnuPG v2.0.9 (GNU/Linux) > >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > >>> > >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt > >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S > >>> =+lCO > >>> -----END PGP SIGNATURE----- > >>> > >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> > >> wrote: > >> > > > > > [-- Attachment #2: Type: text/html, Size: 9470 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 12:42 ` Tambet @ 2008-12-02 13:51 ` Tambet 2008-12-02 19:54 ` Alec Warner 1 sibling, 0 replies; 44+ messages in thread From: Tambet @ 2008-12-02 13:51 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 8760 bytes --]

Btw., one way of doing the index update would be:

emerge --sync:
- Delete the index if it exists

emerge --searchindexon
- Turns the index on

emerge --searchindexoff
- Turns the index off
- Deletes the index if it exists

emerge -s or emerge -S
- Does a normal search if the index is off
- If the index is on and does not exist, creates the index on the fly
- If the index is on and does exist, uses it

Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.

2008/12/2 Tambet <qtvali@gmail.com>

> About zipping.. The default settings might not really be a good idea - I
> think that "fastest" might be even better. Considering that the portage
> tree contains the same word again and again (like "applications"), it
> needs a pretty small dictionary to make it much smaller. Decompressing
> will not be reading from disc, decompressing and writing back to disc as
> in your case, probably - try decompressing to a memory drive and you
> might get better numbers.
>
> I have personally used compression in one C++ application and, with
> optimum settings, it made things much faster - those were files where I
> had for example 65536 16-byte integers, which could be zeros and mostly
> were; I didn't care about creating a better file format, but just
> compressed the whole thing.
>
> I suggest you compress the esearch db, then decompress it to a memory
> drive and give us those numbers - it might be considerably faster.
>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> support. Try open from that and a normal open on the esearch db; also
> compress with the same lib to get the right kind of file.
>
> Anyway - maybe this compression should be added later and be optional.
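The module-gzip suggestion above can be tried directly: `gzip.open` is a drop-in for `open` with transparent (de)compression. The file names and payload below are examples only:

```python
# Compare a normal open with gzip.open on the same index-like payload.
import gzip
import os
import tempfile

data = b"app-editors/vim vim is a highly configurable text editor\n" * 5000

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "esearchdb")
packed = plain + ".gz"

with open(plain, "wb") as f:
    f.write(data)
with gzip.open(packed, "wb", compresslevel=1) as f:  # "fastest" setting
    f.write(data)

with open(plain, "rb") as f:
    a = f.read()
with gzip.open(packed, "rb") as f:  # decompresses transparently on read
    b = f.read()

assert a == b == data
print(os.path.getsize(plain), "->", os.path.getsize(packed))
```

Because the payload is as repetitive as Tambet describes, even `compresslevel=1` shrinks it dramatically; real metadata compresses far less.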
[-- Attachment #2: Type: text/html, Size: 10447 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 12:42 ` Tambet 2008-12-02 13:51 ` Tambet @ 2008-12-02 19:54 ` Alec Warner 2008-12-02 21:47 ` Tambet 1 sibling, 1 reply; 44+ messages in thread From: Alec Warner @ 2008-12-02 19:54 UTC (permalink / raw To: gentoo-portage-dev

On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@gmail.com> wrote:
> About zipping.. The default settings might not really be a good idea - I
> think that "fastest" might be even better. Considering that the portage
> tree contains the same word again and again (like "applications"), it
> needs a pretty small dictionary to make it much smaller. Decompressing
> will not be reading from disc, decompressing and writing back to disc as
> in your case, probably - try decompressing to a memory drive and you
> might get better numbers.

I ran gzip -d -c file.gz > /dev/null, which should not write to disk.

I tried again with gzip -1 and it still takes 29ms to decompress (even
with gzip -1) where a bare read takes 26ms. (I have a 2.6GHz X2, which
is probably relevant to gzip decompression speed.)

> I have personally used compression in one C++ application and, with
> optimum settings, it made things much faster - those were files where I
> had for example 65536 16-byte integers, which could be zeros and mostly
> were; I didn't care about creating a better file format, but just
> compressed the whole thing.

I'm not saying compression won't make the index smaller. I'm saying
making the index smaller does not improve performance. If you have a
10 meg file and you make it 1 meg, you do not increase performance,
because you (on average) are not saving enough time reading the
smaller file, since you pay it back decompressing the smaller file
later.

> I suggest you compress the esearch db, then decompress it to a memory
> drive and give us those numbers - it might be considerably faster.
gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes,
on average of 6 trials (caches dropped between trials), 35.1666ms.

cat esearchdb.py > /dev/null (uncompressed) takes, on average of 6
trials, 24ms.

The point is, you use compression when you need to save space (sending
data over the network, or storing large amounts of data, or a lot of
something). The index isn't going to be big (if it is bigger than 20 or
30 meg I'll be surprised), the index isn't going over the network, and
there is only 1 index, not say a million indexes (where compression
might actually be useful for some kind of LRU subset of indexes to meet
disk requirements).

Anyway, this is all moot since, as you stated so well earlier,
optimization comes last, so stop trying to do it now ;)

-Alec

> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> support. Try open from that and a normal open on the esearch db; also
> compress with the same lib to get the right kind of file.
>
> Anyway - maybe this compression should be added later and be optional.

^ permalink raw reply [flat|nested] 44+ messages in thread
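Alec's cat-vs-gzip numbers are hardware and cache dependent, but the shape of the experiment is easy to re-run. A rough harness (file names invented, absolute times not asserted):

```python
# Time a plain read against a decompress-read of the same ~2 MB payload.
# Absolute numbers depend on hardware and the page cache, so treat the
# printed times as illustrative only.
import gzip
import os
import tempfile
import timeit

data = os.urandom(1 << 10) * 2048        # ~2 MB, mildly compressible
tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "db")
packed = plain + ".gz"

with open(plain, "wb") as f:
    f.write(data)
with gzip.open(packed, "wb", compresslevel=1) as f:
    f.write(data)

t_plain = timeit.timeit(lambda: open(plain, "rb").read(), number=20)
t_gzip = timeit.timeit(lambda: gzip.open(packed, "rb").read(), number=20)
print("plain read: %.2f ms/iter" % (t_plain / 20 * 1000))
print("gzip read:  %.2f ms/iter" % (t_gzip / 20 * 1000))

with gzip.open(packed, "rb") as f:
    assert f.read() == data              # round trip is lossless
```

On warm caches this generally reproduces Alec's conclusion: the decompression CPU cost outweighs the bytes saved on a small, already-cached file.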
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 19:54 ` Alec Warner @ 2008-12-02 21:47 ` Tambet 0 siblings, 0 replies; 44+ messages in thread From: Tambet @ 2008-12-02 21:47 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 11413 bytes --]

It might be that your hard drive is not that much slower than memory,
then, but I really doubt it ...or it could mean that reading gzip itself
off the disk is much slower than reading cat - and that one is highly
probable. I mean the file size of gzip. Actually it's elementary logic
that decompressing can be faster than just loading. What I personally
used was *much* faster than without compressing, but that was also a C++
application keeping the zip always in memory, and it was also highly
inefficiently stored data at first.

I suggest a test like this to understand me - make a file and write the
character "a" into it about 10,000,000 times to get it that big, then
try the same thing on that file. I think it's probable that the
resulting file will decompress really fast.

Anyway, you have still made me think that at first, no zip should be
used :) - just because your tests took in several new variables, like
the speed of reading the decompression utility from disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves
to just doing.

2008/12/2 Alec Warner <antarus@gentoo.org>

> On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@gmail.com> wrote:
> > About zipping.. The default settings might not really be a good idea -
> > I think that "fastest" might be even better. Considering that the
> > portage tree contains the same word again and again (like
> > "applications"), it needs a pretty small dictionary to make it much
> > smaller. Decompressing will not be reading from disc, decompressing and
> > writing back to disc as in your case, probably - try decompressing to a
> > memory drive and you might get better numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
[-- Attachment #2: Type: text/html, Size: 14595 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
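Tambet's thought experiment above (a file of about 10,000,000 "a" characters) is easy to check: gzip collapses long runs of identical bytes to a tiny fraction of their size, which is the best case he is describing.

```python
# Best case for compression: 10,000,000 identical bytes.
import gzip

data = b"a" * 10_000_000
packed = gzip.compress(data, compresslevel=1)

print(len(data), "->", len(packed))
assert len(packed) < len(data) // 100  # well under 1% of the original
```

Real index data is far less regular, which is why Alec's measurements only reached about 2x on the sqlite file and roughly 3x on vdb_metadata.pickle.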
* Re: [gentoo-portage-dev] Re: search functionality in emerge 2008-12-02 10:21 ` Alec Warner 2008-12-02 12:42 ` Tambet @ 2008-12-02 17:42 ` Tambet 1 sibling, 0 replies; 44+ messages in thread From: Tambet @ 2008-12-02 17:42 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 8967 bytes --] To remove this fuzz, maybe some USE flags would be possible - for example, a USE flag which makes portage use SQLite. Then its db can be used for indexing, and portage searches could be implemented as SQL queries. SQLite then does what it can to make it fast, and the disk cache will make sure that for simultaneous searches it's kept in memory. Just some thoughts on compression... http://en.wikipedia.org/wiki/Hard_disk 1.6GBit/s http://www.dewassoc.com/performance/memory/memory_speeds.htm 1.6GB/s So the *fastest* hard drives are 8 times slower than the *fastest* memory devices. Anyway, we live in the real world... http://www.amazon.com/Maxtor-L01P200-Internal-7200-Drive/dp/B00007KVHZ up to 133MB/s Therefore, the peak bandwidth for PC-266 DDR is 266 MHz x 8 Bytes = 2.1GB/s. Now, to be fair - if you can read 133MB/s out of a hard drive, that means 15 ms for a 2MB file, ignoring seek times and partitioning. In the same 15 ms, you can read/write about 31.5MB of data in memory, doing different operations with it. If your compression algorithm has a dictionary of repeated words which repeat enough times to make the actual file 0.5MB smaller, you win 7.5ms and lose about 1ms for seeking this thing. Processor usage might be somewhat higher, but you really don't care about processor usage when searching packages, and it's not *that* big. What it really gives us is the ability to make our file format highly inefficient in terms of disk space - for example, what I mentioned before was 65536 64-bit integers (there was a mistake as I said they were 16 bytes; they were 8 bytes long).
Most of them started with at least 32 bits of zeros, which were wiped out by compression lib, but which I needed to do fast operations. It doesnt matter much, how data is stored, but there are repeated patterns if you want to make it fast. I really don't know if pickle has some good file format included or not. For common pickle files, I think that compression is noticeable. Anyway, using database, for example, would be clever idea and making keyword search fully SQL-query specific makes it clearly better and removes chance to compress. Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing. 2008/12/2 Alec Warner <antarus@gentoo.org> > On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote: > > 2008/12/2 Emma Strubell <emma.strubell@gmail.com> > >> > >> True, true. Like I said, I don't really use overlays, so excuse my > >> igonrance. > > > > Do you know an order of doing things: > > > > Rules of Optimization: > > > > Rule 1: Don't do it. > > Rule 2 (for experts only): Don't do it yet. > > > > What this actually means - functionality comes first. Readability comes > > next. Optimization comes last. Unless you are creating a fancy 3D engine > for > > kung fu game. > > > > If you are going to exclude overlays, you are removing functionality - > and, > > indeed, absolutely has-to-be-there functionality, because noone would > > intuitively expect search function to search only one subset of packages, > > however reasonable this subset would be. So, you can't, just can't, add > this > > package into portage base - you could write just another external search > > package for portage. > > > > I looked this code a bit and: > > Portage's "__init__.py" contains comment "# search functionality". After > > this comment, there is a nice and simple search class. > > It also contains method "def action_sync(...)", which contains > > synchronization stuff. 
> > > > Now, search class will be initialized by setting up 3 databases - > porttree, > > bintree and vartree, whatever those are. Those will be in self._dbs array > > and porttree will be in self._portdb. > > > > It contains some more methods: > > _findname(...) will return result of self._portdb.findname(...) with same > > parameters or None if it does not exist. > > Other methods will do similar things - map one or another method. > > execute will do the real search... > > Now - "for package in self.portdb.cp_all()" is important here ...it > > currently loops over whole portage tree. All kinds of matching will be > done > > inside. > > self.portdb obviously points to porttree.py (unless it points to fake > tree). > > cp_all will take all porttrees and do simple file search inside. This > method > > should contain optional index search. > > > > self.porttrees = [self.porttree_root] + \ > > [os.path.realpath(t) for t in > self.mysettings["PORTDIR_OVERLAY"].split()] > > > > So, self.porttrees contains list of trees - first of them is root, others > > are overlays. > > > > Now, what you have to do will not be harder just because of having > overlay > > search, too. > > > > You have to create method def cp_index(self), which will return > dictionary > > containing package names as keys. For oroot... will be > "self.porttrees[1:]", > > not "self.porttrees" - this will only search overlays. d = {} will be > > replaced with d = self.cp_index(). If index is not there, old version > will > > be used (thus, you have to make internal porttrees variable, which > contains > > all or all except first). > > > > Other methods used by search are xmatch and aux_get - first used several > > times and last one used to get description. You have to cache results of > > those specific queries and make them use your cache - as you can see, > those > > parts of portage are already able to use overlays. 
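[Editor's note: the cp_index() fallback Tambet describes above can be sketched roughly like this. PortTreeSketch, its crude category filter, and the index format are invented stand-ins for the real porttree code, not actual Portage APIs.]

```python
import os

class PortTreeSketch:
    def __init__(self, porttrees, index=None):
        self.porttrees = porttrees  # first entry is PORTDIR, the rest overlays
        self._index = index         # prebuilt {"category/package": True} or None

    def cp_index(self):
        # Return the prebuilt index if one exists, else None so callers
        # fall back to the slow directory scan.
        return self._index

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            # Old path: walk every tree on disk (a crude "has a dash"
            # category filter here; the real code checks a category list).
            d = {}
            for tree in self.porttrees:
                for cat in os.listdir(tree):
                    catdir = os.path.join(tree, cat)
                    if "-" not in cat or not os.path.isdir(catdir):
                        continue
                    for pkg in os.listdir(catdir):
                        d[cat + "/" + pkg] = True
        return sorted(d)
```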
Thus, you have to put > > your code again in beginning of those functions - create index_xmatch and > > index_aux_get methods, then make those methods use them and return their > > results unless those are None (or something other in case none is already > > legal result) - if they return None, old code will be run and do it's > job. > > If index is not created, result is None. In index_** methods, just check > if > > query is what you can answer and if it is, then answer it. > > > > Obviously, the simplest way to create your index is to delete index, then > > use those same methods to query for all nessecary information - and > fastest > > way would be to add updating index directly into sync, which you could do > > later. > > > > Please, also, make those commands to turn index on and off (last one > should > > also delete it to save disk space). Default should be off until it's > fast, > > small and reliable. Also notice that if index is kept on hard drive, it > > might be faster if it's compressed (gz, for example) - decompressing > takes > > less time and more processing power than reading it fully out. > > I'm pretty sure your mistaken here, unless your index is stored on a > floppy or something really slow. > > A disk read has 2 primary costs. > > Seek Time: Time for the head to seek to the sector of disk you want. > Spin Time: Time for the platter to spin around such that the sector > you want is under the read head. > > Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120 > rotations per second, so worst case (you just passed the sector you > need) you need to wait 1/120th of a second (or 8ms). > > Seek Time is per hard drive, but most drives provide average seek > times under 10ms. > > So it takes on average 18ms to get to your data, then you start > reading. The index will not be that large (my esearchdb is 2 megs, > but lets assume 10MB for this compressed index). 
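[Editor's note: the try-the-index-first pattern described above - answer from the index, or return None so the old code runs - might look like this sketch. The class, the cache key format, and _slow_xmatch are all hypothetical.]

```python
class IndexedDbapi:
    def __init__(self, index=None):
        # index maps (query-kind, atom) -> cached result; None = no index built
        self._index = index

    def index_xmatch(self, level, atom):
        if self._index is None:
            return None  # no index: caller falls through to the old code
        # Also None for queries the index cannot answer (any missing key).
        return self._index.get((level, atom))

    def xmatch(self, level, atom):
        hit = self.index_xmatch(level, atom)
        if hit is not None:
            return hit
        return self._slow_xmatch(level, atom)  # the original, unindexed path

    def _slow_xmatch(self, level, atom):
        # Stand-in for the existing tree scan.
        return []
```

As the email notes, a sentinel other than None would be needed if None were ever a legal cached result.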
> > I took a 10MB meg sqlite database and compressed it with gzip (default > settings) down to 5 megs. > gzip -d on the database takes 300ms, catting the decompressed data > base takes 88ms (average of 5 runs, drop disk caches between runs). > > I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle > > 1.3megs compresses to 390k. > > 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from > disk. > > Your index would have to be very large or very fragmented on disk > (requiring more than one seek) to see a significant gain in file > compression (gzip scales linearly). > > In short, don't compress the index ;p > > > > > Have luck! > > > >>> -----BEGIN PGP SIGNED MESSAGE----- > >>> Hash: SHA1 > >>> > >>> Emma Strubell schrieb: > >>> > 2) does anyone really need to search an overlay anyway? > >>> > >>> Of course. Take large (semi-)official overlays like sunrise. They can > >>> easily be seen as a second portage tree. > >>> -----BEGIN PGP SIGNATURE----- > >>> Version: GnuPG v2.0.9 (GNU/Linux) > >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > >>> > >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt > >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S > >>> =+lCO > >>> -----END PGP SIGNATURE----- > >>> > >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> > >> wrote: > >> > > > > > [-- Attachment #2: Type: text/html, Size: 10697 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
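[Editor's note: for comparison, Tambet's SQLite suggestion from earlier in the thread would turn a search into a query such as this. The schema and the sample rows are invented; nothing here matches a real Portage database.]

```python
import sqlite3

# Hypothetical index schema standing in for the proposed SQLite db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (cp TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [("app-portage/esearch", "Replacement for emerge --search"),
     ("sys-apps/portage", "Portage package management system")],
)

# `emerge -s` would become a LIKE (or full-text) query; SQLite decides how
# to use its indexes, and the OS page cache keeps repeated searches warm.
rows = conn.execute(
    "SELECT cp FROM packages WHERE cp LIKE ? OR description LIKE ?",
    ("%search%", "%search%"),
).fetchall()
print(rows)  # -> [('app-portage/esearch',)]
```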
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-23 14:33 ` Pacho Ramos 2008-11-23 14:43 ` Emma Strubell @ 2008-11-23 14:56 ` Douglas Anderson 1 sibling, 0 replies; 44+ messages in thread From: Douglas Anderson @ 2008-11-23 14:56 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 3400 bytes --] Emma, It would be great if you could speed search up a bit! As these other guys have pointed out, we do have some indexing tools in Gentoo already. Most users don't understand why that kind of functionality isn't built directly into Portage, but IIRC it has something to do with the fact that these fast search indexes can't be written to by more than one process at the same time, so for example if you had two emerges finishing at the same time, Portage's current flat hash file can handle that, but the faster db-based indexes can't. Anyway, that's the way I, as a curious user, understood the problem. You might be interested in reading this very old forum thread about a previous attempt: http://forums.gentoo.org/viewtopic-t-261580-postdays-0-postorder-asc-start-0.html On Sun, Nov 23, 2008 at 11:33 PM, Pacho Ramos < pacho@condmat1.ciencias.uniovi.es> wrote: > On Sun, 23-11-2008 at 16:01 +0200, tvali wrote: > > Try esearch. > > > > emerge esearch > > esearch ... > > > > 2008/11/23 Emma Strubell <emma.strubell@gmail.com> > > Hi everyone. My name is Emma, and I am completely new to this > > list. I've been using Gentoo since 2004, including Portage of > > course, and before I say anything else I'd like to say thanks > > to everyone for such a kickass package management system!! > > > > Anyway, for my final project in my Data Structures & > > Algorithms class this semester, I would like to modify the > > search functionality in emerge. Something I've always noticed > > about 'emerge -s' or '-S' is that, in general, it takes a very > > long time to perform the searches.
(Although, lately it does > > seem to be running faster, specifically on my laptop as > > opposed to my desktop. Strangely, though, it seems that when I > > do a simple 'emerge -av whatever' on my laptop it takes a very > > long time for emerge to find the package and/or determine the > > dependecies - whatever it's doing behind that spinner. I can > > definitely go into more detail about this if anyone's > > interested. It's really been puzzling me!) So, as my final > > project I've proposed to improve the time it takes to perform > > a search using emerge. My professor suggested that I look into > > implementing indexing. > > > > However, I've started looking at the code, and I must admit > > I'm pretty overwhelmed! I don't know where to start. I was > > wondering if anyone on here could give me a quick overview of > > how the search function currently works, an idea as to what > > could be modified or implemented in order to improve the > > running time of this code, or any tip really as to where I > > should start or what I should start looking at. I'd really > > appreciate any help or advice!! > > > > Thanks a lot, and keep on making my Debian-using professor > > jealous :] > > Emma > > > > > > > > -- > > tvali > > > > Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. > > Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju > > mingi täica pea nagu prügikast... > > I use eix: > emerge eix > > ;-) > > > [-- Attachment #2: Type: text/html, Size: 4891 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-23 12:17 [gentoo-portage-dev] search functionality in emerge Emma Strubell 2008-11-23 14:01 ` tvali @ 2008-11-24 3:12 ` Marius Mauch 2008-11-24 5:01 ` devsk 2009-02-12 19:16 ` [gentoo-portage-dev] " René 'Necoro' Neumann 2 siblings, 1 reply; 44+ messages in thread From: Marius Mauch @ 2008-11-24 3:12 UTC (permalink / raw To: gentoo-portage-dev On Sun, 23 Nov 2008 07:17:40 -0500 "Emma Strubell" <emma.strubell@gmail.com> wrote: > However, I've started looking at the code, and I must admit I'm pretty > overwhelmed! I don't know where to start. I was wondering if anyone > on here could give me a quick overview of how the search function > currently works, an idea as to what could be modified or implemented > in order to improve the running time of this code, or any tip really > as to where I should start or what I should start looking at. I'd > really appreciate any help or advice!! Well, it depends how much effort you want to put into this. The current interface doesn't actually provide a "search" interface, but merely functions to 1) list all package names - dbapi.cp_all() 2) list all package names and versions - dbapi.cpv_all() 3) list all versions for a given package name - dbapi.cp_list() 4) read metadata (like DESCRIPTION) for a given package name and version - dbapi.aux_get() One of the main performance problems of --search is that there is no persistent cache for functions 1, 2 and 3, so if you're "just" interested in performance aspects you might want to look into that. The issue with implementing a persistent cache is that you have to consider both cold and hot filesystem cache cases: Loading an index file with package names and versions might improve the cold-cache case, but slow things down when the filesystem cache is populated. 
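[Editor's note: the persistent cache for function 1 that Marius mentions could be as small as a pickle written at the end of `emerge --sync`. A minimal sketch, with a hypothetical cache path and no locking or staleness checks.]

```python
import pickle

CACHE_PATH = "/var/cache/edb/cp_all.pickle"  # hypothetical location

def write_cp_all_cache(dbapi, path=CACHE_PATH):
    # Run after a sync: dump the full package-name list into one file.
    with open(path, "wb") as f:
        pickle.dump(sorted(dbapi.cp_all()), f)

def cached_cp_all(dbapi, path=CACHE_PATH):
    # Cold-cache win: one sequential read instead of thousands of
    # directory stats. Fall back to the live scan if the cache is absent.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except OSError:
        return sorted(dbapi.cp_all())
```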
As has been mentioned, keeping the index updated is the other major issue, especially as it has to be portable and should require little or no configuration/setup for the user (so no extra daemons or special filesystems running permanently in the background). The obvious solution would be to generate the cache after `emerge --sync` (and other sync implementations) and hope that people don't modify their tree and search for the changes in between (that's what all the external tools do). I don't know if there is actually a way to do online updates while still improving performance and not relying on custom system daemons running in the background. As for --searchdesc, one problem is that dbapi.aux_get() can only operate on a single package-version on each call (though it can read multiple metadata variables). So for description searches the control flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
    # determine the current version of each package, this is
    # another performance issue.
    version = get_current_version(package)
    # read package description from metadata cache
    description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
    # check if the description matches
    if matches(description, searchkey):
        result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package list, the version lookup for *each* package and the actual metadata read. I've already talked about the first, so let's look at the other two. The core problem there is that DESCRIPTION (like all standard metadata variables) is version specific, so to access it you need to determine a version to use, even though in almost all cases the description is the same (or very similar) for all versions. So the proper solution would be to make the description a property of the package name instead of the package version, but that's a _huge_ task you're probably not interested in.
What _might_ work here is to add support for an optional package-name->description cache that can be generated offline and includes those packages where all versions have the same description, and fall back to the current method if the package is not included in the cache. (Don't think about caching the version lookup, that's system dependent and therefore not suitable for caching). Hope it has become clear that while the actual search algorithm might be simple and not very efficient, the real problem lies in getting the data to operate on. That and the somewhat limited dbapi interface. Disclaimer: The stuff below involves extending and redesigning some core portage APIs. This isn't something you can do on a weekend, only work on this if you want to commit yourself to portage development for a long time. The functions listed above are the bare minimum to perform queries on the package repositories, but they're very low-level. That means that whenever you want to select packages by name, description, license, dependencies or other variables you need quite a bit of custom code, more if you want to combine multiple searches, and much more if you want to do it efficient and flexible. See http://dev.gentoo.org/~genone/scripts/metalib.py and http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible, but very inefficient search tool (might not work anymore due to old age). Ideally repository searches could be done without writing any application code using some kind of query language, similar to how SQL works for generic database searches (obviously not that complex). But before thinking about that we'd need a query API that actually a) allows tools to assemble queries without having to worry about implementation details b) run them efficiently without bothering the API user Simple example: Find all package-versions in the sys-apps category that are BSD-licensed. 
Currently that would involve something like:

result = []
for package in dbapi.cp_all():
    if not package.startswith("sys-apps/"):
        continue
    for version in dbapi.cp_list(package):
        license = dbapi.aux_get(version, ["LICENSE"])[0]
        # for simplicity perform an equivalence check, in reality you'd
        # have to account for complex license definitions
        if license == "BSD":
            result.append(version)

Not very friendly to maintain, and not very efficient (we'd only need to iterate over packages in the 'sys-apps' category, but the interface doesn't allow that). And now how it might look with an extensive query interface:

query = AndQuery()
query.add(CategoryQuery("sys-apps", FullStringMatch()))
query.add(MetadataQuery("BSD", FullStringMatch()))
result = repository.selectPackages(query)

Much nicer, don't you think? As said, implementing such a thing would be a huge amount of work, even if just implemented as wrappers on top of the current interface (which would prevent many efficiency improvements), but if you (or anyone else for that matter) are truly interested in this contact me off-list, maybe I can find some of my old design ideas and (incomplete) prototypes to give you a start. Marius ^ permalink raw reply [flat|nested] 44+ messages in thread
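[Editor's note: another of Marius's suggestions above - the optional package-name->description cache covering only packages whose versions all share one DESCRIPTION - might be sketched like this. Both function names are invented, and the fallback naively picks the highest-sorting version.]

```python
def build_description_cache(dbapi):
    # Offline step: cache only packages where every version agrees on
    # DESCRIPTION, exactly as suggested above.
    cache = {}
    for cp in dbapi.cp_all():
        descriptions = {dbapi.aux_get(cpv, ["DESCRIPTION"])[0]
                        for cpv in dbapi.cp_list(cp)}
        if len(descriptions) == 1:
            cache[cp] = descriptions.pop()
    return cache

def get_description(dbapi, cache, cp):
    if cp in cache:
        return cache[cp]
    # Cache miss: fall back to resolving one version and reading its
    # metadata, as --searchdesc does today.
    version = dbapi.cp_list(cp)[-1]
    return dbapi.aux_get(version, ["DESCRIPTION"])[0]
```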
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-24 3:12 ` Marius Mauch @ 2008-11-24 5:01 ` devsk 2008-11-24 6:25 ` Marius Mauch 2008-11-24 6:47 ` [gentoo-portage-dev] " Duncan 0 siblings, 2 replies; 44+ messages in thread From: devsk @ 2008-11-24 5:01 UTC (permalink / raw To: gentoo-portage-dev > not relying on custom system daemonsrunning in the background. Why is a portage daemon such a bad thing? Or hard to do? I would very much like a daemon running on my system which I can configure to sync the portage tree once a week (or month if I am lazy), give me a summary of hot fixes, security fixes in a nice email, push important announcements and of course, sync caches on detecting changes (which should be trivial with notify daemons all over the place) etc. Why is it such a bad thing? Its crazy to think that security updates need to be pulled in Linux. -devsk ----- Original Message ---- From: Marius Mauch <genone@gentoo.org> To: gentoo-portage-dev@lists.gentoo.org Sent: Sunday, November 23, 2008 7:12:57 PM Subject: Re: [gentoo-portage-dev] search functionality in emerge On Sun, 23 Nov 2008 07:17:40 -0500 "Emma Strubell" <emma.strubell@gmail.com> wrote: > However, I've started looking at the code, and I must admit I'm pretty > overwhelmed! I don't know where to start. I was wondering if anyone > on here could give me a quick overview of how the search function > currently works, an idea as to what could be modified or implemented > in order to improve the running time of this code, or any tip really > as to where I should start or what I should start looking at. I'd > really appreciate any help or advice!! Well, it depends how much effort you want to put into this. 
The current interface doesn't actually provide a "search" interface, but merely functions to 1) list all package names - dbapi.cp_all() 2) list all package names and versions - dbapi.cpv_all() 3) list all versions for a given package name - dbapi.cp_list() 4) read metadata (like DESCRIPTION) for a given package name and version - dbapi.aux_get() One of the main performance problems of --search is that there is no persistent cache for functions 1, 2 and 3, so if you're "just" interested in performance aspects you might want to look into that. The issue with implementing a persistent cache is that you have to consider both cold and hot filesystem cache cases: Loading an index file with package names and versions might improve the cold-cache case, but slow things down when the filesystem cache is populated. As has been mentioned, keeping the index updated is the other major issue, especially as it has to be portable and should require little or no configuration/setup for the user (so no extra daemons or special filesystems running permanently in the background). The obvious solution would be to generate the cache after `emerge --sync` (and other sync implementations) and hope that people don't modify their tree and search for the changes in between (that's what all the external tools do). I don't know if there is actually a way to do online updates while still improving performance and not relying on custom system daemons running in the background. As for --searchdesc, one problem is that dbapi.aux_get() can only operate on a single package-version on each call (though it can read multiple metadata variables). So for description searches the control flow is like this (obviously simplified): result = [] # iterate over all packages for package in dbapi.cp_all(): # determine the current version of each package, this is # another performance issue. 
version = get_current_version(package) # read package description from metadata cache description = dbapi.aux_get(version, ["DESCRIPTION"])[0] # check if the description matches if matches(description, searchkey): result.append(package) There you see the three bottlenecks: the lack of a pregenerated package list, the version lookup for *each* package and the actual metadata read. I've already talked about the first, so lets look at the other two. The core problem there is that DESCRIPTION (like all standard metadata variables) is version specific, so to access it you need to determine a version to use, even though in almost all cases the description is the same (or very similar) for all versions. So the proper solution would be to make the description a property of the package name instead of the package version, but that's a _huge_ task you're probably not interested in. What _might_ work here is to add support for an optional package-name->description cache that can be generated offline and includes those packages where all versions have the same description, and fall back to the current method if the package is not included in the cache. (Don't think about caching the version lookup, that's system dependent and therefore not suitable for caching). Hope it has become clear that while the actual search algorithm might be simple and not very efficient, the real problem lies in getting the data to operate on. That and the somewhat limited dbapi interface. Disclaimer: The stuff below involves extending and redesigning some core portage APIs. This isn't something you can do on a weekend, only work on this if you want to commit yourself to portage development for a long time. The functions listed above are the bare minimum to perform queries on the package repositories, but they're very low-level. 
That means that whenever you want to select packages by name, description, license, dependencies or other variables you need quite a bit of custom code, more if you want to combine multiple searches, and much more if you want to do it efficient and flexible. See http://dev.gentoo.org/~genone/scripts/metalib.py and http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible, but very inefficient search tool (might not work anymore due to old age). Ideally repository searches could be done without writing any application code using some kind of query language, similar to how SQL works for generic database searches (obviously not that complex). But before thinking about that we'd need a query API that actually a) allows tools to assemble queries without having to worry about implementation details b) run them efficiently without bothering the API user Simple example: Find all package-versions in the sys-apps category that are BSD-licensed. Currently that would involve something like: result = [] for package is dbapi.cp_all(): if not package.startswith("sys-apps/"): continue for version in dbapi.cp_list(package): license = dbapi.aux_get(version, ["LICENSE"])[0] # for simplicity perform a equivalence check, in reality you'd # have to account for complex license definitions if license == "BSD": result.append(version) Not very friendly to maintain, and not very efficient (we'd only need to iterate over packages in the 'sys-apps' category, but the interface doesn't allow that). And now how it might look with a extensive query interface: query = AndQuery() query.add(CategoryQuery("sys-apps", FullStringMatch())) query.add(MetadataQuery("BSD", FullStringMatch())) result = repository.selectPackages(query) Much nicer, don't you think? 
As said, implementing such a thing would be a huge amount of work, even if just implemented as wrappers on top of the current interface (which would prevent many efficiency improvements), but if you (or anyone else for that matter) are truly interested in this contact me off-list, maybe I can find some of my old design ideas and (incomplete) prototypes to give you a start. Marius ^ permalink raw reply [flat|nested] 44+ messages in thread
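[Editor's note: the AndQuery/CategoryQuery interface Marius sketches above can be made concrete as a toy wrapper over the current dbapi. The class names come from his example; every method body is invented, and none of the efficiency he asks for is attempted.]

```python
class FullStringMatch:
    def matches(self, value, key):
        return value == key

class CategoryQuery:
    def __init__(self, key, matcher):
        self.key, self.matcher = key, matcher

    def accepts(self, dbapi, cpv):
        # The category is the part before the slash in "cat/pkg-version".
        return self.matcher.matches(cpv.split("/")[0], self.key)

class MetadataQuery:
    def __init__(self, key, matcher, variable="LICENSE"):
        self.key, self.matcher, self.variable = key, matcher, variable

    def accepts(self, dbapi, cpv):
        return self.matcher.matches(
            dbapi.aux_get(cpv, [self.variable])[0], self.key)

class AndQuery:
    def __init__(self):
        self.queries = []

    def add(self, query):
        self.queries.append(query)

    def accepts(self, dbapi, cpv):
        return all(q.accepts(dbapi, cpv) for q in self.queries)

def select_packages(dbapi, query):
    # Naive evaluation: scan everything. A real engine would push the
    # category filter down so other categories are never iterated.
    return [cpv for cp in dbapi.cp_all()
            for cpv in dbapi.cp_list(cp)
            if query.accepts(dbapi, cpv)]
```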
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-24 5:01 ` devsk @ 2008-11-24 6:25 ` Marius Mauch 2008-11-24 6:47 ` [gentoo-portage-dev] " Duncan 1 sibling, 0 replies; 44+ messages in thread From: Marius Mauch @ 2008-11-24 6:25 UTC (permalink / raw To: gentoo-portage-dev On Sun, 23 Nov 2008 21:01:40 -0800 (PST) devsk <funtoos@yahoo.com> wrote: > > not relying on custom system daemonsrunning in the background. > > Why is a portage daemon such a bad thing? Or hard to do? I would very > much like a daemon running on my system which I can configure to sync > the portage tree once a week (or month if I am lazy), give me a > summary of hot fixes, security fixes in a nice email, push important > announcements and of course, sync caches on detecting changes (which > should be trivial with notify daemons all over the place) etc. Why is > it such a bad thing? Well, as an opt-in solution it might work (though most of what you described is IMO just stuff for cron, no need to reinvent the wheel). What I was saying is that _relying_ on custom system daemons/filesystems for a _core subsystem_ of portage is the wrong way, simply because it adds a substantial amount of complexity to the whole package management architecture. It's one more thing that can (and will) break, one more layer to take into account for any design decisions, one more component that has to be secured, one more obstacle to overcome when you want to analyze/debug things. And special care must be taken if it requires special kernel support and/or external packages. Do you want to make inotify support mandatory to use portage efficiently? 
(btw, looks like inotify doesn't really work with NFS mounts, which would already make such a daemon completely useless for people using an NFS-shared repository) And finally, if you look at the use cases, a daemon is simply overkill for most cases, as the vast majority of people only use emerge --sync (or wrappers) and maybe layman to change the tree, usually once per day or less often. Do you really want to push another system daemon on users that isn't of use to them? > Its crazy to think that security updates need to be pulled in Linux. That's IMO better handled via an applet (bug #190397 has some code), or just check for updates after a sync (as syncing is the only way for updates to become available at this time). Maybe a message could be added after sync if there are pending GLSAs, now that the glsa support code is in portage. Marius ^ permalink raw reply [flat|nested] 44+ messages in thread
* [gentoo-portage-dev] Re: search functionality in emerge 2008-11-24 5:01 ` devsk 2008-11-24 6:25 ` Marius Mauch @ 2008-11-24 6:47 ` Duncan 1 sibling, 0 replies; 44+ messages in thread From: Duncan @ 2008-11-24 6:47 UTC (permalink / raw To: gentoo-portage-dev devsk <funtoos@yahoo.com> posted 396349.98307.qm@web31708.mail.mud.yahoo.com, excerpted below, on Sun, 23 Nov 2008 21:01:40 -0800: > Why is a portage daemon such a bad thing? Or hard to do? I would very > much like a daemon running on my system which I can configure to sync > the portage tree once a week (or month if I am lazy), give me a summary > of hot fixes, security fixes in a nice email, push important > announcements and of course, sync caches on detecting changes (which > should be trivial with notify daemons all over the place) etc. Why is it > such a bad thing? > > Its crazy to think that security updates need to be pulled in Linux. Well, this is more a user list discussion than a portage development discussion, but... For one thing, it's terribly inefficient to keep a dozen daemons running checking only a single thing each, each week, when we have a cron scheduling daemon, and it's both efficient and The Unix Way (R) to setup a script to do whatever you need it to do, and then have the cron daemon run each of a dozen different scripts once each week, instead of having those dozen different daemons running constantly when they're only active once a week. IOW, it only requires a manual pull if you've not already setup cron to invoke an appropriate script once a week, and that involves only a single constantly running daemon, the cron daemon of your choice. Now, perhaps it can be argued that there should be a package that installs such a pre-made script. For all I know, maybe there is one already. And perhaps it can be argued that said script, if optional, should at least be mentioned in the handbook. I couldn't argue with the logic of either of those. 
But there's no reason to run yet another daemon constantly, when (1) it's not needed constantly, and (2), there's already a perfectly functional way of scheduling something to run when it /is/ needed, complete with optional results mailing, etc, if it's scripted to do that. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 44+ messages in thread
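[Editor's note: Duncan's cron approach amounts to a short script dropped into cron's weekly directory, roughly like the fragment below. The path and the mail step are illustrative; glsa-check ships with app-portage/gentoolkit.]

```shell
#!/bin/sh
# /etc/cron.weekly/portage-sync (illustrative; install executable)
# Sync the tree, then mail root any pending security advisories.
emerge --sync --quiet || exit 1
glsa-check -t all 2>/dev/null | mail -s "pending GLSAs" root
```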
* Re: [gentoo-portage-dev] search functionality in emerge 2008-11-23 12:17 [gentoo-portage-dev] search functionality in emerge Emma Strubell 2008-11-23 14:01 ` tvali 2008-11-24 3:12 ` Marius Mauch @ 2009-02-12 19:16 ` René 'Necoro' Neumann [not found] ` <5a8c638a0902121258s7402d9d7l1ad2b9a8ecf9820d@mail.gmail.com> 2 siblings, 1 reply; 44+ messages in thread From: René 'Necoro' Neumann @ 2009-02-12 19:16 UTC (permalink / raw To: gentoo-portage-dev; +Cc: emma.strubell -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hey, has your project resulted in anything? :) Just curious about possibly nice portage additions ;) Regards, Necoro Emma Strubell wrote: > Hi everyone. My name is Emma, and I am completely new to this list. I've > been using Gentoo since 2004, including Portage of course, and before I say > anything else I'd like to say thanks to everyone for such a kickass package > management system!! > > Anyway, for my final project in my Data Structures & Algorithms class this > semester, I would like to modify the search functionality in emerge. > Something I've always noticed about 'emerge -s' or '-S' is that, in general, > it takes a very long time to perform the searches. (Although, lately it does > seem to be running faster, specifically on my laptop as opposed to my > desktop. Strangely, though, it seems that when I do a simple 'emerge -av > whatever' on my laptop it takes a very long time for emerge to find the > package and/or determine the dependencies - whatever it's doing behind that > spinner. I can definitely go into more detail about this if anyone's > interested. It's really been puzzling me!) So, as my final project I've > proposed to improve the time it takes to perform a search using emerge. > My professor suggested that I look into implementing indexing. > > However, I've started looking at the code, and I must admit I'm pretty > overwhelmed! I don't know where to start. 
I was wondering if anyone on here > could give me a quick overview of how the search function currently works, > an idea as to what could be modified or implemented in order to improve the > running time of this code, or any tip really as to where I should start or > what I should start looking at. I'd really appreciate any help or advice!! > > Thanks a lot, and keep on making my Debian-using professor jealous :] > Emma > -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkmUdZIACgkQ4UOg/zhYFuDRQQCfeVXb6uy+wBSKll4MHq54MiyX VawAn0TWrTBVKuxAPFWpQMvvO3yED5Fs =dBni -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <5a8c638a0902121258s7402d9d7l1ad2b9a8ecf9820d@mail.gmail.com>]
* Fwd: [gentoo-portage-dev] search functionality in emerge [not found] ` <5a8c638a0902121258s7402d9d7l1ad2b9a8ecf9820d@mail.gmail.com> @ 2009-02-12 21:01 ` Emma Strubell 2009-02-12 21:05 ` Mike Auty 2009-02-13 13:37 ` Marijn Schouten (hkBst) 0 siblings, 2 replies; 44+ messages in thread From: Emma Strubell @ 2009-02-12 21:01 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 4386 bytes --] Hi! So, my project did result in... something. Nothing too impressive, though. My implementation of search ended up being about the same speed as the current implementation because of the pickle module. I finished my project at the very last minute before it was due, so I didn't have time to find and implement an alternative pickling/serialization module. There must be a faster python pickler out there, though, any recommendations? After I turned in my project I had final exams and then winter break, and I basically didn't want to look at my code at all during that time. Now that you've brought it up, though, I wouldn't mind working on it, perhaps polishing it (okay, it needs more than polishing) so that it might actually be a nice addition to portage? I'm not doing any coding for any of my classes this semester (except for some assembler) so I definitely wouldn't mind working on this on the side. The reason why it will need (significantly) more work is because I basically had no idea what I was getting into with the regex search. I implemented $ and *, if I remember correctly, and for anything else the search just defaults to the current portage search. I don't know whether implementing regex search with the suffix tree that I used to implement the search would make sense... I'll have to think about it some more. In fact, I have nothing else to do this rainy afternoon :] If I can find an unpickler that can unpickle at a reasonable speed, my search implementation would be significantly faster than the one currently in use. 
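Emma's suffix-tree approach can be illustrated with a simpler cousin, a sorted suffix list: every substring of a package name is a prefix of one of that name's suffixes, so a substring query becomes a binary search over the sorted suffixes. This is only a sketch of the idea, not her actual code — the function names and package atoms below are invented for illustration:

```python
import bisect

def build_index(packages):
    """Return a sorted list of (suffix, package) pairs.

    Every substring of a name is a prefix of one of its suffixes,
    so a substring query becomes a binary search over this list.
    """
    return sorted(
        (name[i:], name)
        for name in packages
        for i in range(len(name))
    )

def search(index, query):
    """Return all packages whose name contains `query`."""
    pos = bisect.bisect_left(index, (query,))
    hits = set()
    # Walk forward while the sorted suffixes still start with the query.
    while pos < len(index) and index[pos][0].startswith(query):
        hits.add(index[pos][1])
        pos += 1
    return sorted(hits)

# Example with made-up package atoms:
pkgs = ["app-editors/vim", "app-editors/emacs", "sys-apps/portage"]
idx = build_index(pkgs)
print(search(idx, "edit"))  # → ['app-editors/emacs', 'app-editors/vim']
print(search(idx, "vim"))   # → ['app-editors/vim']
```

The index itself is what would get pickled to disk; lookups are then O(m log n) in query length m and index size n, which is where the speed-up over a linear scan of all package names comes from.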
I'd show you my code, but I have to admit I'm intimidated by Alec's recent picking apart of Doug's code! For example, I don't even know how to use docstrings... The code probably could be cleaned up a lot in general since I was eventually just trying to get it to work before it was due. Thanks for asking, let me know what you think! (Also, sorry, René, for sending this to you twice.) Emma On Thu, Feb 12, 2009 at 2:16 PM, René 'Necoro' Neumann <lists@necoro.eu> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hey, > > has your project resulted in anything? :) > > Just curious about possibly nice portage additions ;) > > Regards, > Necoro > > Emma Strubell wrote: > > Hi everyone. My name is Emma, and I am completely new to this list. I've > > been using Gentoo since 2004, including Portage of course, and before I > say > > anything else I'd like to say thanks to everyone for such a kickass > package > > management system!! > > > > Anyway, for my final project in my Data Structures & Algorithms class > this > > semester, I would like to modify the search functionality in emerge. > > Something I've always noticed about 'emerge -s' or '-S' is that, in > general, > > it takes a very long time to perform the searches. (Although, lately it > does > > seem to be running faster, specifically on my laptop as opposed to my > > desktop. Strangely, though, it seems that when I do a simple 'emerge -av > > whatever' on my laptop it takes a very long time for emerge to find the > > package and/or determine the dependencies - whatever it's doing behind > that > > spinner. I can definitely go into more detail about this if anyone's > > interested. It's really been puzzling me!) So, as my final project I've > > proposed to improve the time it takes to perform a search using emerge. > My > > professor suggested that I look into implementing indexing. > > > > However, I've started looking at the code, and I must admit I'm pretty > > overwhelmed! I don't know where to start. 
I was wondering if anyone on > here > > could give me a quick overview of how the search function currently > works, > > an idea as to what could be modified or implemented in order to improve > the > > running time of this code, or any tip really as to where I should start > or > > what I should start looking at. I'd really appreciate any help or > advice!! > > > > Thanks a lot, and keep on making my Debian-using professor jealous :] > > Emma > > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iEYEARECAAYFAkmUdZIACgkQ4UOg/zhYFuDRQQCfeVXb6uy+wBSKll4MHq54MiyX > VawAn0TWrTBVKuxAPFWpQMvvO3yED5Fs > =dBni > -----END PGP SIGNATURE----- > [-- Attachment #2: Type: text/html, Size: 5135 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Fwd: [gentoo-portage-dev] search functionality in emerge 2009-02-12 21:01 ` Fwd: " Emma Strubell @ 2009-02-12 21:05 ` Mike Auty 2009-02-12 21:14 ` Emma Strubell 2009-02-13 13:37 ` Marijn Schouten (hkBst) 1 sibling, 1 reply; 44+ messages in thread From: Mike Auty @ 2009-02-12 21:05 UTC (permalink / raw To: gentoo-portage-dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Emma Strubell wrote: > There must be a faster python pickler out there, though, any > recommendations? Right off the bat, you might try cPickle. It should work identically, just faster. Also you can try importing psyco, if it's present it will try semi-compiling some bits and pieces and *might* offer some speed-ups (as in, it won't always, for small projects it might actually slow it down). Mike 5:) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) iEYEARECAAYFAkmUjy8ACgkQu7rWomwgFXrW6wCfS9zTTgqbhiDyaU1opDJO3BM2 VO4AoIaPQ+t27OnTGh7tBEH/mqYntO/v =NzDj -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 44+ messages in thread
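Mike's cPickle suggestion was the standard Python 2 idiom, usually written with an import guard so code still ran where the C module was missing. A minimal sketch of the pattern (the sample data is invented; in Python 3 the C accelerator backs `pickle` automatically, so the fallback branch is what modern code ends up using):

```python
try:
    import cPickle as pickle  # Python 2: the fast C implementation
except ImportError:
    import pickle  # Python 3: pickle uses its C accelerator automatically

# Round-trip a small mapping; binary protocol 2 was the fastest
# protocol available in the Python of the time.
data = {"app-editors/vim": ["7.2"], "sys-apps/portage": ["2.1.6"]}
blob = pickle.dumps(data, 2)
assert pickle.loads(blob) == data
```

If unpickling a large index is still the bottleneck, the usual levers are a higher binary protocol and splitting the index so only the needed shard is loaded per query, rather than swapping serializers.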
* Re: Fwd: [gentoo-portage-dev] search functionality in emerge 2009-02-12 21:05 ` Mike Auty @ 2009-02-12 21:14 ` Emma Strubell 0 siblings, 0 replies; 44+ messages in thread From: Emma Strubell @ 2009-02-12 21:14 UTC (permalink / raw To: gentoo-portage-dev [-- Attachment #1: Type: text/plain, Size: 922 bytes --] Oh, I meant to say, I was indeed using cPickle. That was my first thought as well, that for some reason pickle was being loaded instead of cPickle, but no. On Thu, Feb 12, 2009 at 4:05 PM, Mike Auty <ikelos@gentoo.org> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Emma Strubell wrote: > > There must be a faster python pickler out there, though, any > > recommendations? > > Right off the bat, you might try cPickle. It should work identically, > just faster. Also you can try importing psyco, if it's present it will > try semi-compiling some bits and pieces and *might* offer some speed-ups > (as in, it won't always, for small projects it might actually slow it > down). > > Mike 5:) > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (GNU/Linux) > > iEYEARECAAYFAkmUjy8ACgkQu7rWomwgFXrW6wCfS9zTTgqbhiDyaU1opDJO3BM2 > VO4AoIaPQ+t27OnTGh7tBEH/mqYntO/v > =NzDj > -----END PGP SIGNATURE----- > > [-- Attachment #2: Type: text/html, Size: 1350 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Fwd: [gentoo-portage-dev] search functionality in emerge 2009-02-12 21:01 ` Fwd: " Emma Strubell 2009-02-12 21:05 ` Mike Auty @ 2009-02-13 13:37 ` Marijn Schouten (hkBst) 1 sibling, 0 replies; 44+ messages in thread From: Marijn Schouten (hkBst) @ 2009-02-13 13:37 UTC (permalink / raw To: gentoo-portage-dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Emma Strubell wrote: > Hi! > > If I can find an unpickler that can unpickle at a reasonable speed, my > search implementation would be significantly faster than the one currently > in use. I'd show you my code, but I have to admit I'm intimidated by Alec's > recent picking apart of Doug's code! For example, I don't even know how to > use docstrings... The code probably could be cleaned up a lot in general > since I was eventually just trying to get it to work before it was due. Please don't be intimidated by it. Code review is one of the best methods to improve your skills. We all sucked at programming at one time and perhaps we still suck in anything but our favorite language. But if we are to improve ourselves we need to spend a lot of time reading and coding and still we will not always get it right. Furthermore other people learn from our code review, such as you learnt about docstrings. Here[1] is a quick explanation of them. Have fun, Marijn [1]:http://epydoc.sourceforge.net/docstrings.html - -- Sarcasm puts the iron in irony, cynicism the steel. Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML <http://www.gentoo.org/proj/en/lisp/>, #gentoo-{lisp,ml} on FreeNode -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkmVd5AACgkQp/VmCx0OL2zIFQCgyJYZve1o6DnBBV/HgRV/gWMc 9NkAoLl0M4NX8l+kgWYY3B1dQQtU0/4k =p/Pq -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 44+ messages in thread
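The docstrings Marijn points to are just string literals placed as the first statement of a module, class, or function; `help()` and tools like epydoc read them from the `__doc__` attribute. A minimal illustration with an invented function:

```python
def match_packages(query, packages):
    """Return the packages whose names contain `query`.

    The triple-quoted string literal placed as the first statement
    of a function is its docstring; it is stored on the function's
    __doc__ attribute and rendered by help() and tools like epydoc.
    """
    return [p for p in packages if query in p]

# The docstring is ordinary runtime data:
print(match_packages.__doc__.splitlines()[0])
```

Interactively, `help(match_packages)` renders the same text, which is why documenting public functions this way pays off during code review.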
end of thread, other threads:[~2009-02-13 12:37 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-23 12:17 [gentoo-portage-dev] search functionality in emerge Emma Strubell 2008-11-23 14:01 ` tvali 2008-11-23 14:33 ` Pacho Ramos 2008-11-23 14:43 ` Emma Strubell 2008-11-23 16:56 ` Lucian Poston 2008-11-23 18:49 ` Emma Strubell 2008-11-23 20:00 ` tvali 2008-11-23 21:20 ` Mike Auty 2008-11-23 21:59 ` René 'Necoro' Neumann 2008-11-24 0:53 ` tvali 2008-11-24 9:34 ` René 'Necoro' Neumann 2008-11-24 9:48 ` Fabian Groffen 2008-11-24 14:30 ` tvali 2008-11-24 15:14 ` tvali 2008-11-24 15:15 ` René 'Necoro' Neumann 2008-11-24 15:18 ` tvali 2008-11-24 17:15 ` tvali 2008-11-30 23:42 ` Emma Strubell 2008-12-01 7:34 ` [gentoo-portage-dev] " Duncan 2008-12-01 10:40 ` Emma Strubell 2008-12-01 17:52 ` Zac Medico 2008-12-01 21:25 ` Emma Strubell 2008-12-01 21:52 ` Tambet 2008-12-01 22:08 ` Emma Strubell 2008-12-01 22:17 ` René 'Necoro' Neumann 2008-12-01 22:47 ` Emma Strubell 2008-12-02 0:20 ` Tambet 2008-12-02 2:23 ` Emma Strubell 2008-12-02 10:21 ` Alec Warner 2008-12-02 12:42 ` Tambet 2008-12-02 13:51 ` Tambet 2008-12-02 19:54 ` Alec Warner 2008-12-02 21:47 ` Tambet 2008-12-02 17:42 ` Tambet 2008-11-23 14:56 ` [gentoo-portage-dev] " Douglas Anderson 2008-11-24 3:12 ` Marius Mauch 2008-11-24 5:01 ` devsk 2008-11-24 6:25 ` Marius Mauch 2008-11-24 6:47 ` [gentoo-portage-dev] " Duncan 2009-02-12 19:16 ` [gentoo-portage-dev] " René 'Necoro' Neumann [not found] ` <5a8c638a0902121258s7402d9d7l1ad2b9a8ecf9820d@mail.gmail.com> 2009-02-12 21:01 ` Fwd: " Emma Strubell 2009-02-12 21:05 ` Mike Auty 2009-02-12 21:14 ` Emma Strubell 2009-02-13 13:37 ` Marijn Schouten (hkBst)