public inbox for gentoo-amd64@lists.gentoo.org
* [gentoo-amd64] dig package
From: Mark Haney @ 2005-10-25 13:58 UTC
  To: gentoo-amd64

What ebuild is dig part of?  I can't find it anywhere and it's very 
handy to have.

-- 
Interdum feror cupidine partium magnarum Europae vincendarum

Mark Haney
Sr. Systems Administrator
ERC Broadband
(828) 350-2415

-- 
gentoo-amd64@gentoo.org mailing list




* Re: [gentoo-amd64] dig package
From: Ben Skeggs @ 2005-10-25 14:02 UTC
  To: gentoo-amd64

On Tuesday, 25.10.2005, at 09:58 -0400, Mark Haney wrote:
> What ebuild is dig part of?  I can't find it anywhere and it's very 
> handy to have.
net-dns/bind-tools

Ben.

* Re: [gentoo-amd64] dig package
From: Richard Freeman @ 2005-10-25 14:13 UTC
  To: gentoo-amd64

Ben Skeggs wrote:
> On Tuesday, 25.10.2005, at 09:58 -0400, Mark Haney wrote:
> 
>>What ebuild is dig part of?  I can't find it anywhere and it's very 
>>handy to have.
> 
> net-dns/bind-tools
> 

It probably would be nice if somebody could expand the online package
database to include a list of installed files.  If you already have a
package installed you can always use equery to find out what provided a
particular file.  However, if you don't have it installed it isn't
always obvious where you can get it.
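
For an installed package, the lookup amounts to scanning the file lists
Portage keeps for each package. A minimal Python sketch of that idea follows.
It assumes the lists live in /var/db/pkg/<category>/<package>/CONTENTS and
that regular files appear as "obj <path> <md5> <mtime>" lines; find_owner is
a hypothetical helper, not part of equery.

```python
import os


def find_owner(path, vdb_root="/var/db/pkg"):
    """Return the category/package entries whose CONTENTS file lists `path`."""
    owners = []
    if not os.path.isdir(vdb_root):
        return owners  # no package database here, so nothing can be found
    for category in sorted(os.listdir(vdb_root)):
        cat_dir = os.path.join(vdb_root, category)
        if not os.path.isdir(cat_dir):
            continue
        for pkg in sorted(os.listdir(cat_dir)):
            contents = os.path.join(cat_dir, pkg, "CONTENTS")
            if not os.path.isfile(contents):
                continue
            with open(contents, encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    # Assumed layout: "obj <path> <md5> <mtime>"
                    parts = line.rstrip("\n").split(" ")
                    if parts[0] != "obj" or len(parts) < 4:
                        continue
                    # The path may contain spaces; the last two fields are not part of it.
                    if " ".join(parts[1:-2]) == path:
                        owners.append(f"{category}/{pkg}")
                        break
    return owners
```

Calling find_owner("/usr/bin/dig") on a system with bind-tools installed
would be expected to return its entry. Nothing here helps for packages that
are not installed, which is the gap described above.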

I've seen objections raised to the fact that files installed can vary
based on USE settings, etc.  Perhaps a good starting point would be just
to catalog what each package installs on the default x86 profile without
any USE changes.  Sure, it might only be a 99% solution, but it seems
like we're going without simply for the sake of the 1% of packages like
VNC which actually build more files if you have a USE flag set...

Sorry - not really amd64-specific - more like a GLEP.  It just seemed
relevant to the discussion here...

* Re: [gentoo-amd64] dig package
From: John Myers @ 2005-10-25 15:35 UTC
  To: gentoo-amd64

On Tuesday 25 October 2005 07:13, Richard Freeman wrote:
> Ben Skeggs wrote:
> > On Tuesday, 25.10.2005, at 09:58 -0400, Mark Haney wrote:
> >>What ebuild is dig part of?  I can't find it anywhere and it's very
> >>handy to have.
> >
> > net-dns/bind-tools
>
> It probably would be nice if somebody could expand the online package
> database to include a list of installed files.  If you already have a
> package installed you can always use equery to find out what provided a
> particular file.  However, if you don't have it installed it isn't
> always obvious where you can get it.
>
> I've seen objections raised to the fact that files installed can vary
> based on USE settings, etc.  Perhaps a good starting point would be just
> to catalog what each package installs on the default x86 profile without
> any USE changes.  Sure, it might only be a 99% solution, but it seems
> like we're going without simply for the sake of the 1% of packages like
> VNC which actually build more files if you have a USE flag set...
>
> Sorry - not really amd64-specific - more like a GLEP.  It just seemed
> relevant to the discussion here...
I tried writing such a service once. 
<http://article.gmane.org/gmane.linux.gentoo.amd64/4043>

I designed a system that took feedback from consenting users, who sent their 
file lists back to my server, where I was going to do some data crunching. The 
data from just _my_ system was over 60 MB.


* Re: [gentoo-amd64] dig package
From: Richard Freeman @ 2005-10-25 16:23 UTC
  To: gentoo-amd64

John Myers wrote:
> 
> I designed a system where it took feedback from consenting users, sending the 
> file lists back to my server, where I was going to do some data crunching. The 
> data from just _my_ system was over 60 MB.

It sounds like you really only need to index each package a few times at
most.  Sure, the raw data from a user could be 60MB each, but there are
some ways to reduce that significantly:

1.  Don't send in data for anything in the base system install.

2.  As you populate your database, publish a list of indexed packages
via a URL.  Users would exclude any packages you've already indexed.  If
this were a GLEP you could probably put the file in the portage
directory and everybody would get it via rsync.

3.  Start by only indexing each package ONCE.  Don't worry about every
combo of arches, CFLAGS, USE, etc.  That means that most users wouldn't
upload anything at all, and the rest would only send their unique
contributions.

If you get everything working without indexing by USE, you could start
adding that capability in.  Publish in #2 the list of USE flags indexed
for each package, and individuals would only upload packages compiled
with something that wasn't on that list.

Sure, the final database could easily be 100MB or so, but if you just
put it on a website you won't be sending the whole thing.  Just put it
in mysql/postgres and build a php front end (sorry, not a web dev
personally, but it isn't that hard to do from the little I've messed
with it).

Sorry - I don't intend to make it sound like the whole thing can be done
in 5 minutes, and I'm sure you've already poured hours into your effort.
However, I don't see any theoretical issues with it as long as the
design is right.  The important thing is that users are only uploading
diffs against your master repository - and not doing a complete dump of
their entire system.  Otherwise you will get buried in data!
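
A minimal Python sketch of the client-side filter described in steps 1-3,
including the later per-USE extension. The function name, the dictionary
layout, and the package atom "app-misc/example" are all hypothetical; only
the idea of uploading what the published list lacks comes from the steps
above.

```python
def packages_to_upload(installed, published, base_system=frozenset()):
    """Pick the installed packages the central index still lacks.

    installed:   dict of package atom -> frozenset of enabled USE flags
    published:   dict of package atom -> set of USE-flag frozensets already
                 indexed (the list the server would publish)
    base_system: atoms that are never reported
    """
    todo = {}
    for atom, use_flags in installed.items():
        if atom in base_system:
            continue  # step 1: skip the base system install
        if use_flags in published.get(atom, set()):
            continue  # steps 2-3: this package/USE combination is already indexed
        todo[atom] = use_flags
    return todo


# Hypothetical data: one package already indexed, one built with an extra flag.
installed = {
    "net-dns/bind-tools": frozenset(),
    "app-misc/example": frozenset({"server"}),
}
published = {
    "net-dns/bind-tools": {frozenset()},
    "app-misc/example": {frozenset()},
}
print(sorted(packages_to_upload(installed, published)))
```

Only the package whose USE flags differ from what is already indexed is
selected, so most clients would end up sending nothing.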

I must admit that it is easy to just talk about ideas like this - I
really do want to commend you on the work you've undoubtedly already
accomplished!  OSS projects require lots of hard work by many volunteers
and it is all too easy for people like me to just sit back and nitpick
what could be done better...

* Re: [gentoo-amd64] dig package
From: John Myers @ 2005-10-26  5:11 UTC
  To: gentoo-amd64

On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system where it took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most.  Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:
Hm. I forgot to mention that the largest pieces (the file names and the 
md5sums) are only stored once, and then referenced by an integer that is 
small compared to the size of, say, a file name.

Here's how it breaks down:
        table         |  rows   |  size
----------------------+---------+--------
ebuilds               | 994     | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    | 1,007   |  26.7K
extra install data    | 1,007   |  88.2K
file->install mapping | 464,193 |  13.1M

The above data includes some reinstallations and upgrades.
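
The store-once-and-reference layout can be sketched in a few lines of
Python. This is only an illustration of the idea, not the actual schema:
Interner, record_install and the in-memory lists are hypothetical stand-ins
for the filenames table and the file->install mapping.

```python
class Interner:
    """Store each distinct string once and hand out a small integer id for it."""

    def __init__(self):
        self._ids = {}     # string -> id
        self.values = []   # id -> string

    def intern(self, value):
        if value not in self._ids:
            self._ids[value] = len(self.values)
            self.values.append(value)
        return self._ids[value]


filenames = Interner()
file_install_map = []  # (filename_id, install_id) pairs


def record_install(install_id, paths):
    """Map every path of one installation to its interned filename id."""
    for path in paths:
        file_install_map.append((filenames.intern(path), install_id))
```

A reinstall of the same package adds rows to the mapping but no new
filenames, which fits the mapping having more rows than the filenames table.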

> 1.  Don't send in data for anything in the base system install.
>
> 2.  As you populate your database, publish a list of indexed packages
> via a URL.  Users would exclude any packages you've already indexed.  If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3.  Start by only indexing each package ONCE.  Don't worry about every
> combo of arches, CFLAGS, USE, etc.  That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts.

> If you get everything working without indexing by USE, you could start
> adding that capability in.  Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing.  Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That was the intention, maybe with an XML-RPC service for a 
command-line client to use. The data is stored in a MySQL database.
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right.  The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system.  Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations, and they 
all really need to be there for this to be useful.
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished!  OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.


* Re: [gentoo-amd64] dig package
From: Billy Holmes @ 2005-10-26 13:49 UTC
  To: gentoo-amd64

John Myers wrote:
> filenames             | 381,200 |  27.1M

Are you storing the full pathname here?

If you split the directory part of each path out into another table and 
referenced it from the basename, you could probably save lots of space, 
especially since most applications install lots of files in a limited 
number of directories.

* Re: [gentoo-amd64] dig package
From: John Myers @ 2005-10-26 18:46 UTC
  To: gentoo-amd64

On Wednesday 26 October 2005 06:49, Billy Holmes wrote:
> John Myers wrote:
> > filenames             | 381,200 |  27.1M
>
> are you storing the full pathname here?
Yes, but see below.

> if you broke up the path in another table, and referenced that to the
> basename, you could probably save lots of space - especially since most
> applications install lots of files in a limited number of directories.
It would save space in proportion to the number of unique filenames, 
but it wouldn't really matter in practice. Each filename is only ever stored 
once, so over time the largest tables are actually going to be the unique 
files table and the file->install map. It would also require many more 
queries to execute.


* Re: [gentoo-amd64] dig package
From: Billy Holmes @ 2005-10-27 15:45 UTC
  To: gentoo-amd64

John Myers wrote:
>>if you broke up the path in another table, and referenced that to the
>>basename, you could probably save lots of space - especially since most
> It would save a lot of space proportional to the number of unique filenames, 
> but it wouldn't really matter in practice. Each filename is only ever stored 
> once, so over time, the largest tables are actually going to be the unique 
> files table and the file->install map. It would also require many more 
> queries to execute

It will matter greatly. Here is a simple script to show you:

$ /usr/lib/gentoolkit/bin/qpkg -nc -l kdebase-3.4 | wc -c
255050

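$ /usr/lib/gentoolkit/bin/qpkg -nc -l kdebase-3.4 | perl -MFile::Basename -ne '
    next unless m{^/}; s/->.*//;            # absolute paths only; drop symlink targets
    push @{$f{dirname $_}}, basename $_;    # group basenames under their directory
    END { for my $d (sort keys %f) {        # print each directory once,
        print "$d\n"; print "\t$_" for @{$f{$d}} } }   # then its files, tab-indented
' | wc -c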
99327

That's a ~61% savings. You can use a MEDIUMINT to reference the 
pathname, and then use inner joins in your queries.
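
The same comparison can be sketched in Python for anyone who wants to try it
on another file list. The function names are made up for this example, and
grouped_size assumes the layout printed by the Perl one-liner: each
directory on its own line, followed by tab-indented basenames.

```python
import posixpath
from collections import defaultdict


def flat_size(paths):
    """Bytes needed to list every full path, one per line."""
    return sum(len(p) + 1 for p in paths)


def grouped_size(paths):
    """Bytes needed when each directory is written once, then its basenames."""
    by_dir = defaultdict(list)
    for p in paths:
        by_dir[posixpath.dirname(p)].append(posixpath.basename(p))
    total = 0
    for directory, names in by_dir.items():
        total += len(directory) + 1                  # directory line plus newline
        total += sum(1 + len(n) + 1 for n in names)  # tab, basename, newline
    return total


def savings(flat, grouped):
    """Fraction of bytes saved by the grouped layout."""
    return 1 - grouped / flat
```

Feeding the two wc -c figures above into savings() gives the fraction saved
for kdebase-3.4. The more files share a directory, the larger that fraction
becomes.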

* Re: [gentoo-amd64] dig package
From: Richard Freeman @ 2005-10-27 15:58 UTC
  To: gentoo-amd64

Billy Holmes wrote:
>> It would
>> also require many more queries to execute
> 
> That's a ~61% savings. You can use a MEDIUMINT to reference the
> pathname, and then use inner joins in your queries.

You should set up a sf.net project and then we could move discussion off
this list...  :)

I think that splitting the path may be a good idea.  Inner joins really
do not add much of a penalty to your query as long as everything is
indexed.
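
A minimal sketch of the split-path layout with an indexed inner join. It
uses Python's sqlite3 module as a stand-in for the MySQL database mentioned
in the thread, and the table names, column names and helper functions are
hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dirs  (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
    CREATE TABLE files (id INTEGER PRIMARY KEY,
                        dir_id INTEGER NOT NULL REFERENCES dirs(id),
                        name TEXT NOT NULL,
                        package TEXT NOT NULL);
    CREATE INDEX files_name ON files(name);
""")


def add_file(package, directory, name):
    """Store the directory once, then a file row that points at it."""
    conn.execute("INSERT OR IGNORE INTO dirs(path) VALUES (?)", (directory,))
    (dir_id,) = conn.execute(
        "SELECT id FROM dirs WHERE path = ?", (directory,)).fetchone()
    conn.execute(
        "INSERT INTO files(dir_id, name, package) VALUES (?, ?, ?)",
        (dir_id, name, package))


def who_provides(name):
    """Return (package, full path) rows for a basename via an inner join."""
    rows = conn.execute("""
        SELECT files.package, dirs.path || '/' || files.name
        FROM files INNER JOIN dirs ON dirs.id = files.dir_id
        WHERE files.name = ?
        ORDER BY files.package
    """, (name,))
    return rows.fetchall()
```

The lookup by basename uses the files_name index, and the join resolves
dir_id through the primary key of dirs, so the extra table costs one indexed
lookup per matching row.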

However, I'm not convinced it makes a big difference.  So, the DB is
100MB - it isn't like you're sending that over a wire at all - you just
need to store it on the DB server.  Having 2 20MB tables instead of 1
100MB table isn't going to make a big difference in final performance.

The main area to optimize size-wise is what gets sent over the wire -
and that is just a list of packages/flags already indexed so that
clients do not do needless work.  That will be pretty small in any case.

* Re: [gentoo-amd64] dig package
From: Eric Augustus @ 2005-11-06 17:00 UTC
  To: gentoo-amd64

On Tue, Oct 25, 2005 at 10:13:23AM -0400, Richard Freeman wrote:
> Ben Skeggs wrote:
> > On Tuesday, 25.10.2005, at 09:58 -0400, Mark Haney wrote:
> > 
> >>What ebuild is dig part of?  I can't find it anywhere and it's very 
> >>handy to have.
> > 
> > net-dns/bind-tools
> > 
> 
> It probably would be nice if somebody could expand the online package
> database to include a list of installed files.  If you already have a
> package installed you can always use equery to find out what provided a
> particular file.  However, if you don't have it installed it isn't
> always obvious where you can get it.
> 

There's already an online package database for this. Check out:

http://www.rommel.stw.uni-erlangen.de/~fejf/pfs/


-- 
Eric Augustus
shrike@austin.rr.com
-----------------
Your domestic life may be harmonious.

Thread overview: 11+ messages
2005-10-25 13:58 [gentoo-amd64] dig package Mark Haney
2005-10-25 14:02 ` Ben Skeggs
2005-10-25 14:13   ` Richard Freeman
2005-10-25 15:35     ` John Myers
2005-10-25 16:23       ` Richard Freeman
2005-10-26  5:11         ` John Myers
2005-10-26 13:49           ` Billy Holmes
2005-10-26 18:46             ` John Myers
2005-10-27 15:45               ` Billy Holmes
2005-10-27 15:58                 ` Richard Freeman
2005-11-06 17:00     ` Eric Augustus
