public inbox for gentoo-soc@lists.gentoo.org
* [gentoo-soc] Questions about Cache Sync idea for 2010 soc
@ 2010-03-08 17:21 Robert R. Russell
  2010-03-09 18:42 ` Zac Medico
  0 siblings, 1 reply; 3+ messages in thread
From: Robert R. Russell @ 2010-03-08 17:21 UTC
  To: gentoo-soc

The cache sync project[1] wants a way to generate portage's cache on the portage tree and/or any chosen overlay and then distribute that cache by some method. Correct?

This project sounds very similar to an idea I have been toying around with for a bit, but I have some questions before I apply for this project.

How well documented is the current cache format portage uses?

What restrictions if any would be placed on extending the current cache format?

How well documented is the ebuild file format?

How much of the ebuild is essential for portage to create a valid cache entry?

How stable and well documented is the format of the cache-essential pieces of an ebuild?

Is there any previous work on this or a project that might overlap with this project? Such as, an attempt at a new parser for portage.

Will there be mandatory discussion between the person doing this project and the person doing the tags support project?

Is improving the performance of the cache and/or search feature a mandatory goal of this project?

Thank you.

[1] http://en.gentoo-wiki.com/wiki/Google_Summer_of_Code_2010_ideas#Cache_sync




* Re: [gentoo-soc] Questions about Cache Sync idea for 2010 soc
  2010-03-08 17:21 [gentoo-soc] Questions about Cache Sync idea for 2010 soc Robert R. Russell
@ 2010-03-09 18:42 ` Zac Medico
  2010-03-09 23:55   ` Robert R. Russell
  0 siblings, 1 reply; 3+ messages in thread
From: Zac Medico @ 2010-03-09 18:42 UTC
  To: gentoo-soc

On 03/08/2010 09:21 AM, Robert R. Russell wrote:
> The cache sync project[1] wants a way to generate portage's cache on the portage tree and/or any chosen overlay and then distribute that cache by some method. Correct?

Well, you can already do that with the egencache program that's
included with portage. I think the gist of the "cache sync" idea is
that you should be able to download the cache for dependency
calculations, and defer the download of the source package until
after the dependency calculation. For doing something like that, a
portage tree is probably not very suitable since the tree can change
rapidly and the cache may invalidate quickly. If the cache and the
source package will be distributed separately, it might be more
practical to make something like a source RPM that contains an
ebuild and eclasses. Many of these source packages could be
distributed in a repository that is independent of the portage tree,
and its cache may be valid for a longer period of time.

> This project sounds very similar to an idea I have been toying around with for a bit, but I have some questions before I apply for this project.
> 
> How well documented is the current cache format portage uses?

It's not very well documented. You might try experimenting with the
egencache program to get a feel for how it works. Cache is generated
by sourcing ebuilds, and it's stored in /var/cache/edb/dep. It's
validated by comparing ebuild and eclass timestamps to those that
are saved in the cache entry. After a complete cache entry is
generated for /var/cache/edb/dep, an incomplete cache entry (lacking
eclass timestamps, since the format hasn't been extended to support
them yet) is written into $PORTDIR/metadata/cache. There is a
discussion about extending the format to include eclass digests here:

 http://archives.gentoo.org/gentoo-dev/msg_cfa80e33ee5fa6f854120ddfb9b468b3.xml
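
To make the timestamp validation described above concrete, here is a
minimal Python sketch. It assumes a simplified in-memory cache entry
with hypothetical keys "ebuild_mtime" and "eclass_mtimes"; it does not
reproduce portage's real on-disk layout or key names, and treating an
entry without eclass timestamps as unverifiable is a choice made for
this sketch only.

```python
from typing import Any, Mapping


def cache_entry_is_valid(entry: Mapping[str, Any],
                         ebuild_mtime: int,
                         eclass_mtimes: Mapping[str, int]) -> bool:
    """Return True if the timestamps saved in the entry match those on disk.

    `entry` is a hypothetical, simplified cache entry (an assumption of
    this sketch, not portage's actual format).
    """
    # The ebuild itself must be unchanged since the entry was generated.
    if entry.get("ebuild_mtime") != ebuild_mtime:
        return False

    saved_eclasses = entry.get("eclass_mtimes")
    if saved_eclasses is None:
        # Incomplete entry (no eclass timestamps): this sketch treats it
        # as impossible to validate.
        return False

    # Every eclass recorded in the entry must have the same timestamp now.
    for name, saved_mtime in saved_eclasses.items():
        if eclass_mtimes.get(name) != saved_mtime:
            return False
    return True
```

A caller would collect the current timestamps (for example with
os.stat) and regenerate the entry whenever this returns False.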

> What restrictions if any would be placed on extending the current cache format?

It has to be backward compatible. If we want to change the format in
a backward incompatible way, for example by combining the whole
cache into a single text file, we'll have to distribute both formats
until users have had time to migrate to a package manager that
supports the new format.
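
One way to picture the transition period is a reader that prefers the
new format and falls back to the old one. The Python sketch below is
only an illustration: the file name "cache.json", the "legacy"
directory, and the use of JSON are all assumptions made for the
example, not anything portage defines.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_entry(cache_root: Path, cpv: str) -> Optional[Dict[str, str]]:
    """Look up one package's metadata, preferring the combined file."""
    # Hypothetical new format: one JSON file holding every entry.
    combined = cache_root / "cache.json"
    if combined.is_file():
        entries = json.loads(combined.read_text())
        if cpv in entries:
            return entries[cpv]

    # Hypothetical old format: one KEY=value text file per package.
    legacy = cache_root / "legacy" / cpv
    if legacy.is_file():
        entry = {}
        for line in legacy.read_text().splitlines():
            key, sep, value = line.partition("=")
            if sep:
                entry[key] = value
        return entry

    # Neither format has the package.
    return None
```

As long as both formats are distributed, an old package manager keeps
reading the per-package files and never needs to know that the
combined file exists.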

> How well documented is the ebuild file format?

It's pretty well documented by PMS. You can get that by installing
app-doc/pms. For something that's much shorter and less
comprehensive, there's `man 5 ebuild`.

> How much of the ebuild is essential for portage to create a valid cache entry?

The whole ebuild and any eclasses that it inherits.

> How stable and well documented is the format of the cache-essential pieces of an ebuild?

It's very stable because it has to be backward compatible. Breaking
compatibility would be a severe problem because dependency
calculations are very slow unless there is a valid/compatible cache
available.

> Is there any previous work on this or a project that might overlap with this project? Such as, an attempt at a new parser for portage.

I know that Mounir Lamouri (volkmar@gentoo) has been thinking about
a new cache format that will use a single file for the whole cache.

> Will there be mandatory discussion between the person doing this project and the person doing the tags support project?

Tags are a separate project.

> Is improving the performance of the cache and/or search feature a mandatory goal of this project?

Well, the cache should probably all go in a single file, and that
will probably improve performance because generally it's faster to
load one big file than a bunch of small files.
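
As a rough illustration of what "all in a single file" could look
like, the Python sketch below merges a directory of per-package
KEY=value files into one JSON file. The directory layout, the JSON
encoding, and the function name are assumptions made for the example;
it does not measure performance and is not an existing portage tool.

```python
import json
from pathlib import Path
from typing import Dict


def combine_cache(cache_dir: Path, out_file: Path) -> int:
    """Merge per-package KEY=value files under cache_dir into one JSON file.

    Returns the number of entries written. out_file should live outside
    cache_dir so it is not picked up as an entry on a later run.
    """
    combined: Dict[str, Dict[str, str]] = {}
    for path in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        entry = {}
        for line in path.read_text().splitlines():
            key, sep, value = line.partition("=")
            if sep:
                entry[key] = value
        # Key each entry by its path relative to the cache root,
        # for example "app-misc/foo-1.0".
        combined[path.relative_to(cache_dir).as_posix()] = entry

    out_file.write_text(json.dumps(combined, sort_keys=True))
    return len(combined)
```

A reader then opens and parses one file instead of opening one file
per package.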

> Thank you.
> 
> [1] http://en.gentoo-wiki.com/wiki/Google_Summer_of_Code_2010_ideas#Cache_sync
> 
-- 
Thanks,
Zac




* Re: [gentoo-soc] Questions about Cache Sync idea for 2010 soc
  2010-03-09 18:42 ` Zac Medico
@ 2010-03-09 23:55   ` Robert R. Russell
  0 siblings, 0 replies; 3+ messages in thread
From: Robert R. Russell @ 2010-03-09 23:55 UTC
  To: gentoo-soc

On Tue, Mar 09, 2010 at 10:42:38AM -0800, Zac Medico wrote:
> On 03/08/2010 09:21 AM, Robert R. Russell wrote:
> > The cache sync project[1] wants a way to generate portage's cache
> > on the portage tree and/or any chosen overlay and then distribute
> > that cache by some method. Correct?
> 
> Well, you can already do that with the egencache program that's
> included with portage. I think the gist of the "cache sync" idea is
> that you should be able to download the cache for dependency
> calculations, and defer the download of the source package until
> after the dependency calculation. For doing something like that, a
> portage tree is probably not very suitable since the tree can change
> rapidly and the cache may invalidate quickly. If the cache and the
> source package will be distributed separately, it might be more
> practical to make something like a source RPM that contains an
> ebuild and eclasses. Many of these source packages could be
> distributed in a repository that is independent of the portage tree,
> and its cache may be valid for a longer period of time.
>

I did not know about the egencache program. So I got the wrong initial
impression of the project's goals, no problem. The goal is much
simpler than my initial impression of it was.

The worst case change I could see with only a partial copy of the
portage tree available locally would be the complete removal of an
ebuild between the last sync time and the attempt to use that ebuild. By
complete removal, I mean the deletion of the ebuild from the tree and
removal of the tar-ball from the Gentoo mirror infrastructure. The
other common problem would be an incomplete or inaccurate manifest of
the ebuild, source tar-balls, and in-tree patches. This problem is
usually eliminated by re-syncing the tree. So the two most likely
sources of problems are already seen in the wild with the full
portage tree and have known workarounds.

Talking about source RPMs, do you mean something like a tar-ball of
the ebuild with its associated patches, eclasses, and other directly
dependent data, but no source code? This ebuild tar-ball is then
fetched after dependency calculation is made and it provides the
instructions for building, downloading, and installing the package
from the source tar-ball. That sounds like a Gentoo style replacement
for source RPMs.
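
A minimal Python sketch of such an ebuild tar-ball, using the standard
tarfile module, might look like the following. The archive layout (the
ebuild at the top level, eclasses under "eclass/", patches under
"files/") and the function name are assumptions made for illustration;
no such format has been defined.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def build_ebuild_tarball(out_path: Path,
                         ebuild: Path,
                         eclasses: Iterable[Path],
                         patches: Iterable[Path]) -> None:
    """Pack an ebuild with its eclasses and patches, but no source code."""
    with tarfile.open(str(out_path), "w:gz") as tar:
        # The ebuild sits at the top level of the archive.
        tar.add(str(ebuild), arcname=ebuild.name)
        # Inherited eclasses go under eclass/ (hypothetical layout).
        for eclass in eclasses:
            tar.add(str(eclass), arcname=f"eclass/{eclass.name}")
        # In-tree patches go under files/ (hypothetical layout).
        for patch in patches:
            tar.add(str(patch), arcname=f"files/{patch.name}")
```

The upstream source tar-ball is deliberately left out, so the archive
stays small enough to fetch right after the dependency calculation.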

> 
> > This project sounds very similar to an idea I have been toying
> > around with for a bit, but I have some questions before I apply
> > for this project.
> > 
> > How well documented is the current cache format portage uses?
> 
> It's not very well documented. You might try experimenting with the
> egencache program to get a feel for how it works. Cache is generated
> by sourcing ebuilds, and it's stored in /var/cache/edb/dep. It's
> validated by comparing ebuild and eclass timestamps to those that
> are saved in the cache entry. After a complete cache entry is
> generated for /var/cache/edb/dep, an incomplete cache entry (lacking
> eclass timestamps, since the format hasn't been extended to support
> them yet) is written into $PORTDIR/metadata/cache. There is a
> discussion about extending the format to include eclass digests here:
> 
>  http://archives.gentoo.org/gentoo-dev/msg_cfa80e33ee5fa6f854120ddfb9b468b3.xml
> 
> > What restrictions if any would be placed on extending the current cache format?
> 
> It has to be backward compatible. If we want to change the format in
> a backward incompatible way, for example by combining the whole
> cache into a single text file, we'll have to distribute both formats
> until users have had time to migrate to a package manager that
> supports the new format.
>

I think that any change to support the ebuild tar-ball format would
require the inclusion of some sort of cryptographic hash of the ebuild
tar-ball into the cache format. Another solution might be distributing
a large pile of public key signatures with the cache and then
validating the signature of the ebuild tar-ball. With the exception of
package manager support, the cryptographic signature method is probably
the least intrusive method. Well, at least at first glance. I might
change my mind on that.
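
The first option, a hash stored alongside the cache entry, could be
checked roughly as in the Python sketch below. SHA-256 and the
function names are assumptions chosen for the example; the signature
variant would need a separate key infrastructure and is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_matches_cache(path: Path, expected_hex: str) -> bool:
    """Compare a downloaded ebuild tar-ball with the hash from the cache.

    `expected_hex` stands for a hypothetical hash field in the cache entry.
    """
    return sha256_of(path) == expected_hex.lower()
```

A package manager would run this check after downloading the ebuild
tar-ball and refuse to continue on a mismatch.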
 
>
> > How well documented is the ebuild file format?
> 
> It's pretty well documented by PMS. You can get that by installing
> app-doc/pms. For something that's much shorter and less
> comprehensive, there's `man 5 ebuild`.
> 
> > How much of the ebuild is essential for portage to create a valid
> > cache entry?
> 
> The whole ebuild and any eclasses that it inherits.
> 
> > How stable and well documented is the format of the
> > cache-essential pieces of an ebuild?
> 
> It's very stable because it has to be backward compatible. Breaking
> compatibility would be a severe problem because dependency
> calculations are very slow unless there is a valid/compatible cache
> available.
>

I think that keeping the slim tree package manager cache format
compatible with the full tree package manager cache format is not
going to be easy, mainly because of the amount of new data needed in
the slim tree variant of the cache.

Stuff like:
1.    Repository -- Is this cache information from the main tree or
      from an overlay?
2.    Hash or signature of the ebuild tar-ball -- How do I validate
      whether the tar-ball I downloaded is OK?
3.    Tags -- If the tags soc project is accepted then they will need
      to be cached for searching as well.
4.    Speed improvements -- Any required changes that can improve the
      performance of searches and the like.
5.    Change tracking -- Any cache format for a slim tree will need to
      be able to update from one revision to another easily and with
      as little bandwidth as reasonably possible.
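
The list above can be sketched as a data structure plus a simple
delta between two cache revisions. Everything in this Python example
is hypothetical (field names, the dictionary keyed by package version,
the delta shape); it covers items 1, 2, 3, and 5 and makes no attempt
at the speed improvements in item 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SlimCacheEntry:
    repository: str                  # item 1: main tree or an overlay
    tarball_sha256: str              # item 2: hash of the ebuild tar-ball
    tags: List[str] = field(default_factory=list)           # item 3
    metadata: Dict[str, str] = field(default_factory=dict)  # existing keys


Cache = Dict[str, SlimCacheEntry]


def diff_revisions(old: Cache, new: Cache) -> Tuple[Cache, List[str]]:
    """Item 5: return (changed or added entries, removed package names)."""
    changed = {cpv: e for cpv, e in new.items() if old.get(cpv) != e}
    removed = sorted(cpv for cpv in old if cpv not in new)
    return changed, removed


def apply_delta(old: Cache, changed: Cache, removed: List[str]) -> Cache:
    """Rebuild the new revision from the old one plus a delta."""
    result = {cpv: e for cpv, e in old.items() if cpv not in removed}
    result.update(changed)
    return result
```

Only the changed entries and the list of removed names would need to
be transferred when updating from one revision to the next.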

> 
> > Is there any previous work on this or a project that might overlap
> > with this project? Such as, an attempt at a new parser for portage.
> 
> I know that Mounir Lamouri (volkmar@gentoo) has been thinking about
> a new cache format that will use a single file for the whole cache.
> 
> > Will there be mandatory discussion between the person doing this
> > project and the person doing the tags support project?
> 
> Tags are a separate project.
> 
> > Is improving the performance of the cache and/or search feature
> > a mandatory goal of this project?
> 
> Well, the cache should probably all go in a single file, and that
> will probably improve performance because generally it's faster to
> load one big file than a bunch of small files.
> 
> > Thank you.
> > 
> > [1] http://en.gentoo-wiki.com/wiki/Google_Summer_of_Code_2010_ideas#Cache_sync
> > 
> -- 
> Thanks,
> Zac
> 

Thank you for the information and I will ponder it for a little bit
and look at some different design angles.


