On Sun, 16 Jun 2013 19:21:38 +0200
Pacho Ramos wrote:

> On Sun, 16-06-2013 at 10:09 -0700, Brian Dolbec wrote:
> [...]
> > Thank you for considering helping. I have stayed away from the
> > intricate details of package management in the past, but I also do
> > not like how long portage is taking now for dep calculations.
>
> And, cannot that effort be put into enhancing portage instead?

To show you the problems and decisions involved, I'm going to elaborate a little and would like you to ask yourself some questions. Is it possible to enhance the Portage code enough to improve dep calculations within a reasonable amount of time?

Let's take a call graph, showing the number of calls on the arrows and the number of ticks spent in each call in the boxes:

http://i.imgur.com/A93CdNR.png

Which part do you think is problematic? What can we do to get an improvement in time that you can actually benefit from? Which improvements are reasonable to implement? ...? (A rough recipe for producing such a call graph yourself follows a bit further below.)

Ignoring that call graph, you could look at what has recently been introduced to increase the amount of time needed to calculate the dependency graph; you don't have to look far:

http://blogs.gentoo.org/mgorny/2013/05/27/the-pointless-art-of-subslots/

While I don't mean to point at the contents of that blog post, the title is relevant: implementing features like subslots on top of an algorithm that was not written with subslots in mind introduces a lot of inefficiency. And it's not just subslots; newer features keep getting added to the dependency graph calculation, and it gets slower and slower over time. It feels like revdep-rebuild has moved into the dependency calculation!

A combination of the two arguments above ("where do we start?" and "why try to fix something broken by design?") makes me feel that a huge refactoring is needed. Ask yourself another question: what if these systems were written from scratch with subslots in mind? Exactly; if you write a system with the features in mind, you can write it much more efficiently and don't have to deal with historical decisions you made in the past. You can continue writing without having to change half of your code, because you thought about it in advance. But well, this is only true at the start; some EAPIs later, history repeats!

So, once we acknowledge that it is useless to waste effort on fixes that are unlikely to have a huge benefit, or on rewriting something that might as well get replaced some years later, we should instead look for better opportunities. There are multiple options:

1) Spend an awful lot of time refactoring our well-known Portage; this doesn't only involve moving code around, but also nuking the legacy code you can't deal with (see my earlier response) as well as writing new code where it is needed. It may sound easy; it isn't.

2) Write a system from scratch that is certain to be future proof; it is designed with the current and future specifications in mind, not based on specifications that were awesome some time in the past.

3) Spend time learning how pkgcore is implemented, then improve it; pkgcore can somewhat be seen as a mix of (1) and (2).

Then, which option would you pick?

I'm not saying Portage or the team behind it is bad; it is just a bit at the end of its time, and I'm afraid of what the future will do to it.
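(Before going into the options in more detail, here is that rough recipe for producing a call graph like the one linked above. It is only a sketch of the general approach, not the exact procedure behind that image: it assumes emerge_main() accepts an argument list, and that a tool such as gprof2dot plus Graphviz is used afterwards to render the dumped stats into a picture.)

# Sketch: profile a pretended world update in-process and dump the profile
# data; "gprof2dot -f pstats" (with Graphviz) can then render it as a call graph.
import cProfile
import pstats

from _emerge.main import emerge_main  # the same entry point the emerge script uses

profiler = cProfile.Profile()
profiler.enable()
emerge_main(["--pretend", "--update", "--deep", "--newuse", "@world"])
profiler.disable()

profiler.dump_stats("depgraph.pstats")
pstats.Stats("depgraph.pstats").sort_stats("cumulative").print_stats(25)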
For option (1), I've thought slightly further than just generating the dependency graph; two things came to mind that might be interesting _if_ we can somehow get them to work:

A) Multiple threads

I think that, given the way trees work (branches), threads could be a huge benefit. Maybe zmedico can elaborate on this again, but there were some problems that make it not easy to introduce threads; there was something regarding the global interpreter lock, but I can't recall the details and am not that acquainted with Python. (A small self-contained illustration of that point is in the P.S. at the end of this mail.) Besides that, the threads will have to be organized; so properly designing the threads, locks and inter-thread interactions might be an interesting task that could require some effort.

B) Additional caching

Some parts of the dependency graph calculation could benefit from a bit of caching. Implementing caching right might be tricky: too small a cache is useless and too large a cache is overhead.

Let me take one example: resolving the USE flags to decide which packages are dependencies, which happens again and again. For example, let's say you have

>=dev-libs/glib-2.33:2 gnome-keyring? ( >=app-crypt/libsecret-0.5 ) introspection? ( >=dev-libs/gobject-introspection-1 )

then Portage has to go and turn that into maybe something like

>=dev-libs/glib-2.33:2

because the user has neither USE flag set; and it is not only the USE flags the user has set, but also the USE flags in profiles, the default USE flags, REQUIRED_USE and sometimes even other USE flags, as in "use1? ( use2? ( ATOM ) )". Heh, complexity!

So, let's say we want to cache this operation; we could then store the following details together in the cache:

- Checksum of the ebuild.
- The user's USE flags that are relevant to the ebuild.
- Resulting dependency variables.

Instead of having to compute the whole thing, Portage could simply look up the checksum and the USE flags the user has set and immediately get back the right dependency string. (A rough sketch of such a cache is in the P.P.S. at the end of this mail.)

That sounds awesome, but how well does the cache function? To know that, we have to look at cache invalidation.

- How often does the ebuild checksum change? --> Not so much, especially not for stable ebuilds.
- How often do the user's USE flags change for a specific ebuild? --> Not so much; only when the user explicitly changes them or some masking happens in the profile, both of which don't happen too often.

That's really great, but now three sad questions:

- How big does this cache grow? No idea; answering that requires another study that implements half of this.
- How much does this really speed things up? Hard to tell without trying it.
- Erm, what about the USE flags that reverse dependencies force? Oops, no idea is perfect; can we resolve that?! Heh, no idea...

You can see that it is not hard to come up with ideas like (A) and (B); but it is much harder to come up with ideas that actually work, which is why I think we will not see any beneficial improvement to Portage soon, unless we are going to drop features and write alternatives.

Back to the options... For option (2) I made a very small start and might continue with it over the vacation; but before I do that, I'm likely going to evaluate option (3) if other people are going to jump in as well. Perhaps helping along pkgcore can give me the knowledge to better write (2) further in the future, when pkgcore is found to be past its time.

Whatever we do, I hope a good, educated choice is made... Until then, I hope I can continue to reasonably use Portage.
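P.S. To make the global interpreter lock remark under (A) a bit more concrete, here is a small self-contained comparison; nothing in it is Portage code, and busy_work() is just a stand-in for a CPU-bound piece of dependency resolution. Threads don't speed this up in CPython, because only one thread executes Python bytecode at a time, while a process pool does scale:

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy_work(n):
    # stand-in for a CPU-bound piece of dependency resolution
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, jobs=4, n=2000000):
    # run the same CPU-bound job on `jobs` workers and time it
    start = time.time()
    with executor_cls(max_workers=jobs) as pool:
        list(pool.map(busy_work, [n] * jobs))
    return time.time() - start

if __name__ == "__main__":
    print("threads:   %.2fs" % timed(ThreadPoolExecutor))
    print("processes: %.2fs" % timed(ProcessPoolExecutor))

So any threaded depgraph work would either have to be I/O-bound or move the heavy lifting into separate processes, which makes the "organize the threads, locks and interactions" part even more interesting.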
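P.P.S. And a rough sketch of the cache described under (B). None of this is Portage's actual code or API; the names are made up, and it only handles "flag? ( ... )" groups (including nesting like "use1? ( use2? ( ATOM ) )"), ignoring ||-groups and the rest of the dependency syntax:

def _tokenize(dep_string):
    # split "a? ( b )" into ["a?", "(", "b", ")"]
    return dep_string.replace("(", " ( ").replace(")", " ) ").split()

def _reduce(tokens, enabled_use):
    # keep "flag? ( ... )" groups only when the flag is enabled,
    # and "!flag? ( ... )" groups only when it is disabled
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.endswith("?"):
            flag, negate = token[:-1], token.startswith("!")
            if negate:
                flag = flag[1:]
            depth, j = 1, i + 2  # tokens[i + 1] is the opening "("
            while depth:
                depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                j += 1
            if (flag in enabled_use) != negate:
                result.extend(_reduce(tokens[i + 2:j - 1], enabled_use))
            i = j
        else:
            result.append(token)
            i += 1
    return result

_dep_cache = {}

def cached_reduce(ebuild_checksum, enabled_use, dep_string):
    # key the cache on the ebuild checksum plus the relevant enabled USE flags
    key = (ebuild_checksum, frozenset(enabled_use))
    if key not in _dep_cache:
        _dep_cache[key] = " ".join(_reduce(_tokenize(dep_string), enabled_use))
    return _dep_cache[key]

if __name__ == "__main__":
    rdepend = (">=dev-libs/glib-2.33:2 "
               "gnome-keyring? ( >=app-crypt/libsecret-0.5 ) "
               "introspection? ( >=dev-libs/gobject-introspection-1 )")
    # neither USE flag enabled, dummy checksum
    print(cached_reduce("0123abcd", frozenset(), rdepend))

With neither flag enabled this prints just ">=dev-libs/glib-2.33:2", and a second call with the same checksum and flags is answered straight from the cache; the open questions about cache size, invalidation and reverse-dependency-forced flags from above still apply.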
--
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address : TomWij@gentoo.org
GPG Public Key : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D