[gentoo-dev] Gentoo XML Database

public inbox for gentoo-dev@lists.gentoo.org

* [gentoo-dev] Gentoo XML Database
@ 2003-02-07 17:34 Yannick Koehler
  2003-02-07 18:28 ` Vano D
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yannick Koehler @ 2003-02-07 17:34 UTC
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1999 bytes --]

For the fun of it, I created a small, very custom and untested tool that
reads the Gentoo cache files and writes a valid XML file to stdout.

The schema/DTD was created without much thought.  This may or may not
open the door for people to experiment with a Gentoo-equivalent database.

What's interesting is that the database is generated from a Gentoo system
pretty easily, thanks to the presence of the cache.  One could just as
easily write software that goes directly from ebuilds to an XML db instead
of passing through the cache.

A discussion with carpaski suggested that using the database would not
actually speed up emerge.  Emerge loads the cache into an internal
in-memory database, and Python allows it to keep that in memory between
runs.  That makes it very fast and efficient, since only the required
entries of the database get loaded instead of the whole database.

One benefit I see from the XML db is for side tools.  For example,
searching ebuild descriptions is faster with the XML db, since it is a
single file and software only needs to look for strings that start with
<description>.  One can use grep/regexps for such queries or build an
XML-capable application.
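
(A minimal sketch of such a side tool, in Python, for illustration only.
It assumes each description sits on its own line as
<description>...</description>, which is how xmltest writes it.  The
function name and the sample lines are made up for the example.)

```python
import re
from xml.sax.saxutils import unescape

# Assumption: one <description>...</description> element per line.
DESC_RE = re.compile(r"<description>(.*?)</description>")

def search_descriptions(lines, pattern):
    """Return unescaped descriptions that match pattern (case-insensitive)."""
    wanted = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line in lines:
        m = DESC_RE.search(line)
        if m:
            # Undo the XML escaping before matching against the pattern.
            text = unescape(m.group(1), {"&apos;": "'", "&quot;": '"'})
            if wanted.search(text):
                hits.append(text)
    return hits

# Made-up sample; with a real file, pass open("gentoo.xml", encoding="utf-8").
sample = [
    "      <description>A fast XML parser &amp; toolkit</description>",
    "      <homepage>http://example.org/</homepage>",
    "      <description>KDE libraries</description>",
]
print(search_descriptions(sample, "xml"))
```

Only the description lines are looked at, so the rest of the file never
needs to be parsed.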

I believe more work needs to go into this to figure out a better DTD and
a separation of elements that would make more sense for applications such
as kportage and other GUI tools, which try to load everything at startup
for lack of a persistent daemon keeping things in memory.

test.sh is the bash script that starts the XML output and then does a
recursive ls of /var/cache/edb/dep.  For each file it calls xmltest, a
libxml2 app that simply converts the text it reads from ISO-8859-1 and
then outputs it, escaping special characters as defined in XML 1.0.
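
(For readers who prefer Python, here is a rough sketch of the same
per-file step: decode ISO-8859-1, map each line to an element, escape the
five special characters.  The element names are taken from the tags[]
array in xmltest.c below; the function name and the sample bytes are
assumptions made for the example.)

```python
from xml.sax.saxutils import escape

# Element names in cache-line order, as in the tags[] array of xmltest.c.
TAGS = ["depend", "runtime-depend", "slot", "sources", "restrict",
        "homepage", "license", "description", "keywords", "inherited",
        "uses", "cdepend", "pdepend"]

def cache_entry_to_xml(raw, indent="      "):
    """Convert one cache file (bytes, ISO-8859-1) to a list of XML lines."""
    out = []
    text = raw.decode("iso-8859-1")
    for tag, line in zip(TAGS, text.split("\n")):
        if not line:
            continue  # xmltest also skips empty lines
        # escape() handles &, < and >; the dict adds the two quote characters.
        body = escape(line, {"'": "&apos;", '"': "&quot;"})
        out.append("%s<%s>%s</%s>" % (indent, tag, body, tag))
    return out

# Made-up cache content for illustration.
raw = b">=sys-libs/glibc-2.3\n\n0\n"
for xml_line in cache_entry_to_xml(raw):
    print(xml_line)
```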

I'm including only the source. I use the following compile line:

gcc -I /usr/include/libxml2 -o xmltest xmltest.c -lxml2

To run, 

./test.sh > gentoo.xml

It generates a 9525071-byte file.

-- 

Yannick Koehler

[-- Attachment #2: test.sh --]
[-- Type: application/x-shellscript, Size: 1820 bytes --]

[-- Attachment #3: xmltest.c --]
[-- Type: text/x-csrc, Size: 3220 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/encoding.h>

unsigned char*
convert (unsigned char *in, char *encoding)
{
	unsigned char *out;
        int ret,size,out_size,temp;
        xmlCharEncodingHandlerPtr handler;

        size = (int)strlen(in)+1; 
        out_size = size*2-1; 
        out = malloc((size_t)out_size); 

        if (out) {
                handler = xmlFindCharEncodingHandler(encoding);

                if (!handler) {
                        free(out);
                        out = NULL;
                }
        }
        if (out) {
                temp=size-1;
                ret = handler->input(out, &out_size, in, &temp);
                /* a negative return signals failure; temp must equal the input length */
                if (ret < 0 || temp-size+1) {
                        if (ret < 0) {
                                fprintf(stderr, "conversion wasn't successful.\n");
                        } else {
                                fprintf(stderr, "conversion wasn't successful. converted: %i octets.\n",temp);
                        }
                        free(out);
                        out = NULL;
                } else {
                        out = realloc(out,out_size+1); 
                        out[out_size]=0; /*null terminating out*/

                }
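        } else {
                /* reached when malloc failed or the encoding handler was not found */
                fprintf(stderr, "no mem or unknown encoding\n");
        }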
        return (out);
}	

int
main(int argc, char **argv) {

	FILE *f = NULL;
	char *tags[] = { "depend","runtime-depend","slot","sources","restrict","homepage","license","description","keywords","inherited","uses","cdepend","pdepend", NULL}; 

	if (argc <= 1) {
		fprintf(stderr, "Usage: %s filename\n", argv[0]);
		return(1);
	}

	if ((f = fopen(argv[1], "r"))) {
		int currentLineNumber = 0;
		unsigned char buffer[1000] = { 0 };

		while (fgets(buffer, sizeof(buffer), f)) {
			char * p = strtok(buffer, "\n");
			if (p && p[0]) {
				unsigned char *content, *out;
				char *encoding = "ISO-8859-1";
				int i = 0;

				content = p;

				if (NULL != (out = convert(content, encoding))) {
					printf("      <%s>", tags[currentLineNumber]);
					/* Escape the converted text in out, not the raw input in p */
					for (i = 0; out[i] != '\0'; i++) {
						switch (out[i]) {
							case '&':  fputs("&amp;", stdout);  break;
							case '\'': fputs("&apos;", stdout); break;
							case '"':  fputs("&quot;", stdout); break;
							case '<':  fputs("&lt;", stdout);   break;
							case '>':  fputs("&gt;", stdout);   break;
							default:   putchar(out[i]);         break;
						}
					}
					printf("</%s>\n", tags[currentLineNumber]);
					free(out);
					out = NULL;
				}
			}
			/* Increment and test currentLineNumber */
			if (!tags[++currentLineNumber]) {
				break;
			}
		}
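		fclose(f);
	} else {
		/* report an unreadable file instead of failing silently */
		perror(argv[1]);
		return (1);
	}
	return (0);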
}

[-- Attachment #4: Type: text/plain, Size: 37 bytes --]

--
gentoo-dev@gentoo.org mailing list


* Re: [gentoo-dev] Gentoo XML Database
  2003-02-07 17:34 [gentoo-dev] Gentoo XML Database Yannick Koehler
@ 2003-02-07 18:28 ` Vano D
  2003-02-07 18:41   ` Vano D
  2003-02-08 15:14 ` [gentoo-dev] " Denys Duchier
  2003-02-11 14:56 ` [gentoo-dev] Gentoo XML Database: More Data Yannick Koehler
  2 siblings, 1 reply; 7+ messages in thread
From: Vano D @ 2003-02-07 18:28 UTC
  To: gentoo-dev

One interesting application of a database backend that I can think of is
a "portage server" that serves portage ebuilds and records the cache
information (as in, what is installed with which USE flags) for every
client machine that has an "account" on the db. In effect, none of your
machines would hold any portage-related files apart from the "emergedb"
command and accessory tools. It could be useful for anyone who needs to
deploy a lot of differently configured Gentoo machines really fast. It
could also be useful for administering production Gentoo machines,
freeing them of any "portage bloat" (bloat not in the bad sense, I love
portage ;-)); here the machines would connect to your central portage
server. Very interesting, even though it is of limited use to many
people.

On Fri, 2003-02-07 at 18:34, Yannick Koehler wrote:
> For the fun of it, I created a little tool very custom and untested that will 
> read the the cache files of gentoo and generate on the stdout a valid xml 
> file.
> 
> Now the schema/dtd has been created without any thinking.  This may or not 
> open the door to people to experiment with a gentoo equivalent database.

...

> Discussion with carspaski reveal thought that the use of the database will 
> actually not speed up emerge.  Because emerge loads the cache inside an 
> internal memory database and python allow him to leave that in memory in 
> between runs making it very fast and efficient as only the require entry of 
> the database gets loaded instead of the whole database.
> 
> Some benefit I see from the xml db is for side-tools, for example search 
> description of ebuilds is faster when using xml db as it is a single file and 
> software only look for string that start with <description>.  One can use 
> grep/regexp to do such query or built an xml capable application.
-- 
Vano D <gentoo-dev@europeansoftware.com>


* Re: [gentoo-dev] Gentoo XML Database
  2003-02-07 18:28 ` Vano D
@ 2003-02-07 18:41   ` Vano D
  2003-02-07 18:46     ` Yannick Koehler
  0 siblings, 1 reply; 7+ messages in thread
From: Vano D @ 2003-02-07 18:41 UTC
  To: gentoo-dev

Sorry for double posting.

If that idea is extended, and assuming a big organisation with different
machines with different specs that you want to deploy Gentoo clients to,
you can in effect have a "configuration management center" server to
configure and manage software on all of the Gentoo machines in that
organisation.

Of course, if all machines have the same specs you can still use this
system, but without the need to compile software for each machine.

I think the idea is very interesting and can be useful.

-- 
Vano D <gentoo-dev@europeansoftware.com>


* Re: [gentoo-dev] Gentoo XML Database
  2003-02-07 18:41   ` Vano D
@ 2003-02-07 18:46     ` Yannick Koehler
  2003-02-07 19:10       ` Vano D
  0 siblings, 1 reply; 7+ messages in thread
From: Yannick Koehler @ 2003-02-07 18:46 UTC
  To: gentoo-dev

On February 7, 2003 01:41 pm, Vano D wrote:
> Sorry for double posting.
>
> If that idea is extended and assuming that you have different machines
> with different specs in a big organisation you want to deploy
> gentoo clients to, you can in effect have a
> "configuration management center" server to configure and
> manage software in all of the gentoo machines in that organisation.
>
> Ofcourse if all machines have the same specs you can still use this system
> but without the need to compile software for each machine.
>
> I think the idea is very interesting and can be usefull.

Which brings up a very old idea that I posted again on gentoo last summer:
have a script export all config files into an XML database/tree, develop
utilities to display/present/change this information, and then transform
that information back into the original /etc files.

One could then export the XML and re-import it on another system.  Even
better, you could configure more than just Linux, because the notion of
"users" readily exists on other systems, and applying XSLT to the XML
could help convert it to a similar format for the target platform.
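
(To make the round trip concrete, here is a small Python sketch for one
config file format, passwd-style user entries.  It is only an
illustration: the <users>/<user> element names and both function names
are invented for the example, and the sample entries are made up.)

```python
import xml.etree.ElementTree as ET

# The seven colon-separated fields of a passwd-format line.
FIELDS = ["name", "password", "uid", "gid", "gecos", "home", "shell"]

def passwd_to_xml(text):
    """Build a <users> tree from passwd-format text."""
    root = ET.Element("users")
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        user = ET.SubElement(root, "user")
        for field, value in zip(FIELDS, line.split(":")):
            ET.SubElement(user, field).text = value
    return root

def xml_to_passwd(root):
    """Turn the <users> tree back into passwd-format text."""
    lines = []
    for user in root.findall("user"):
        lines.append(":".join(user.findtext(f, "") or "" for f in FIELDS))
    return "\n".join(lines) + "\n"

# Made-up entries for illustration.
sample = "root:x:0:0:root:/root:/bin/bash\nalice:x:1000:100::/home/alice:/bin/bash\n"
tree = passwd_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
print(xml_to_passwd(tree) == sample)
```

The exported tree could be edited by any XML tool and written back, or
transformed into whatever user format the target platform expects.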

-- 

Yannick Koehler


* Re: [gentoo-dev] Gentoo XML Database
  2003-02-07 18:46     ` Yannick Koehler
@ 2003-02-07 19:10       ` Vano D
  0 siblings, 0 replies; 7+ messages in thread
From: Vano D @ 2003-02-07 19:10 UTC
  To: gentoo-dev

On Fri, 2003-02-07 at 19:46, Yannick Koehler wrote:
> On February 7, 2003 01:41 pm, Vano D wrote:
> > Sorry for double posting.
> >
> > If that idea is extended and assuming that you have different machines
> > with different specs in a big organisation you want to deploy
> > gentoo clients to, you can in effect have a
> > "configuration management center" server to configure and
> > manage software in all of the gentoo machines in that organisation.
> >
> > Ofcourse if all machines have the same specs you can still use this system
> > but without the need to compile software for each machine.
> >
> > I think the idea is very interesting and can be usefull.
> 
> Which brings up a ver old idea that I again posted on gentoo last summer about 
> having a script exporting all config file in an xml database/tree and have 
> utilities developped to display/present/change this information and then make 
> that information transform back into the original /etc files.
> 
> One could then export the xml and re-import it inside another system.  Even 
> better, would be that you could configure more than simply linux because the 
> notion of "users" can easily exists in other system and using xslt on an xml 
> could help converting it to another similar format for the target platform.

It is interesting that this issue came up because I have a friend whose
end of year university project was the management and configuration of
software using tools which interacted with xml templates. Each software
configuration file (such as proftpd's config files or samba's) is
configured via xml with the use of xml schema defining the config files.

You then make "software modules" for each software package you want (in
other words, you write the xml schema for the configuration file(s),
default values, dependencies between directives and values, and a set of
default/secure rules).

He then developed GUI tools to modify the xml parameters locally and
remotely. The whole system also includes dependencies and security.
Originally the whole idea was for security: say, if you define an XYZ
directive in Samba, it won't compromise the system because you also had
an ABC directive somewhere else, etc. So in effect the whole system,
with its dependencies and default/set rules, takes care of security and,
as a side effect, easy configuration.

Just thought I'd let you know about that project since you seem to be
interested in the same topic. I think if Gentoo is used with such a
system and with the ideas discussed in previous posts, you could have
one powerful ((semi)auto) configuration management system with all its
bells and whistles.

Check http://inseguro.org/ (it's all in Spanish, unfortunately). It has
some screenshots of his GUIs for the configuration management, and also
rpm binaries for RedHat. He intends to release the code for everything
when it reaches 1.0.

-- 
Vano D <gentoo-dev@europeansoftware.com>


* [gentoo-dev] Re: Gentoo XML Database
  2003-02-07 17:34 [gentoo-dev] Gentoo XML Database Yannick Koehler
  2003-02-07 18:28 ` Vano D
@ 2003-02-08 15:14 ` Denys Duchier
  2003-02-11 14:56 ` [gentoo-dev] Gentoo XML Database: More Data Yannick Koehler
  2 siblings, 0 replies; 7+ messages in thread
From: Denys Duchier @ 2003-02-08 15:14 UTC
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 9328 bytes --]

I quite liked the idea of reflecting the portage database in XML, but
not much the use of an auxiliary C program.  Besides, there was much
more to be milked from portage.  So here is my take on it.  I wrote it
in Python, and I tried to properly parse the dependency specs (I hope
I got it right, but I have never been able to locate any realistic
documentation for that bizarre syntax which seems to have more in
common with vogon poetry than with a specification language :-)
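
(To give a feel for the syntax being parsed, here is a minimal tokenizer
sketch in present-day Python.  It uses the same token classes as TOKEN_RE
in toxml.py but is a rewrite for illustration, not the attached code; the
dependency strings in the example are made up.)

```python
import re

# Same token classes as TOKEN_RE in toxml.py: "||", comparison operators,
# single-character punctuation, and names/atoms.
TOKEN_RE = re.compile(r"\|\||>=|<=|=<|[()?:~!*<>=]|[#a-zA-Z0-9/.+_-]+")

def tokenize(spec):
    """Split a dependency string into a flat list of tokens."""
    tokens = []
    for word in spec.split():
        pos = 0
        while pos < len(word):
            m = TOKEN_RE.match(word, pos)
            if m is None:
                # Unknown character: report it rather than loop forever.
                raise ValueError("cannot tokenize %r" % word[pos:])
            tokens.append(m.group(0))
            pos = m.end()
    return tokens

print(tokenize(">=sys-libs/glibc-2.3 ssl? ( dev-libs/openssl ) || ( a/b c/d )"))
```

The parser then only has to look at this flat token list: "?" after a
name marks a USE conditional, "||" introduces a choice, and a leading
comparison operator belongs to the package atom that follows it.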

simply invoke:

	python toxml.py > EDB.xml

I attach the file "toxml.py" below:



import string,re,os

######################################################################

class Disj:
    def __init__(self,alts):
        self.alts = alts
    def __str__(self):
        return "<Disj %s>" % str(self.alts)
    def xml(self,indent):
        subindent = indent + '  '
        print "%s<choice>" % indent
        for a in self.alts:
            a.xml(subindent)
        print "%s</choice>" % indent

CMP_NAMES = {
    '>=' : 'ge',
    '<=' : 'le',
    '=<' : 'le',
    '>'  : 'gt',
    '<'  : 'lt',
    '='  : 'eq',
    '!'  : 'ne',
    '~'  : 'newest'
    }

class Pkg:
    def __init__(self,cmp,nam,star):
        self.cmp = cmp
        self.name = nam
        self.newest = star
    def __str__(self):
        cmp = self.cmp or ''
        nam = self.name
        star = ''
        if self.newest: star='*'
        return "<Package %s%s%s>" % (cmp,nam,star)
    def xml(self,indent):
        cmp = self.cmp
        if cmp:
            cmp = " cmp='%s'" % CMP_NAMES[cmp]
        else:
            cmp = ''
        if self.newest:
            newest = " newest='yes'"
        else:
            newest = ''
        print "%s<package name='%s'%s%s/>" % (indent,self.name,cmp,newest)

class Use:
    def __init__(self,var,val):
        self.var = var
        self.val = val
    def __str__(self):
        var = self.var
        val = '!'
        if self.val: val=''
        return "<Use %s%s>" % (val,var)

class Cond:
    def __init__(self,use,yes,no):
        self.use = use
        self.yes = yes
        self.no  = no
    def __str__(self):
        use = str(self.use)
        yes = str(self.yes)
        no  = str(self.no)
        return "<Cond %s yes=%s no=%s>" % (use,yes,no)
    def xml(self,indent):
        var = self.use.var
        subindent = indent + '  '
        subsubindent = subindent + '  '
        yes = self.yes
        no  = self.no
        if not self.use.val:
            yes,no = no,yes
        print "%s<test use='%s'>" % (indent,var)
        if yes:
            print "%s<when value='yes'>" % subindent
            for d in yes:
                d.xml(subsubindent)
            print "%s</when>" % subindent
        if no:
            print "%s<when value='no'>" % subindent
            for d in no:
                d.xml(subsubindent)
            print "%s</when>" % subindent
        print "%s</test>" % indent

######################################################################

TOKEN_RE = re.compile("^([()?:~!*]|\\|\\||>=|<=|=<|<|>|=|[#a-zA-Z0-9/.+_-]+)(.*)$")

def tokenize(s):
    tokens=[]
    for x in string.split(string.strip(s)):
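        while x:
            res = TOKEN_RE.match(x)
            if res is None:
                # Unrecognized character: fail with a clear message
                raise ValueError("cannot tokenize %r in file %s" % (x, FILE))
            tokens.append(res.group(1))
            x = res.group(2)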
    return tokens

FILE = None
TOKS = None

def parse_error():
    TOKS.reverse()
    # String exceptions are deprecated; raise a real exception object instead
    raise ValueError("parse error file="+str(FILE)+" tokens="+str(TOKS))

def parse(s):
    global TOKS
    TOKS = tokenize(s)
    TOKS.reverse()
    deps = []
    while TOKS:
        deps.extend(parse_dep())
    return deps

def parse_dep():
    if not TOKS: parse_error()
    elif TOKS[-1]=='(':
        TOKS.pop()
        deps = []
        while TOKS and TOKS[-1]!=')':
            deps.extend(parse_dep())
        if TOKS and TOKS[-1]==')':
            TOKS.pop()
            return deps
        else: parse_error()
    elif TOKS[-1]=='||':
        TOKS.pop()
        if TOKS and TOKS[-1]=='(':
            return [Disj(parse_dep())]
        else:
            parse_error()
    else:
        use = parse_use()
        if use:
            yes = parse_dep()
            if TOKS and TOKS[-1]==':':
                TOKS.pop()
                no = parse_dep()
            else:
                no = []
            return [Cond(use,yes,no)]
        else:
            return [parse_pkg()]

LETTER_RE = re.compile("[#a-zA-Z0-9]")

def is_name(s):
    return LETTER_RE.match(s)

def parse_use():
    n = len(TOKS)
    if n >= 3 and TOKS[-1]=='!' and is_name(TOKS[-2]) and TOKS[-3]=='?':
        TOKS.pop()
        var = TOKS.pop()
        val = False
        TOKS.pop()
        return Use(var,val)
    elif n >= 2 and is_name(TOKS[-1]) and TOKS[-2]=='?':
        var = TOKS.pop()
        val = True
        TOKS.pop()
        return Use(var,val)
    elif n >= 1 and TOKS[-1]=='?':
        # I don't understand what this is supposed to mean
        TOKS.pop()
        return Use("",True)
    else:
        return None

CMP = ['>=','<=','=<','<','>','=','~','!']
CMP_NEG = {
    '>=' : '<',
    '<=' : '>',
    '=<' : '>',
    '<'  : '>=',
    '>'  : '=<' }
CMP_NEG_KEYS = CMP_NEG.keys()

def parse_pkg():
    if TOKS and (TOKS[-1] in CMP):
        cmp = TOKS.pop()
        if cmp=='!' and TOKS and (TOKS[-1] in CMP_NEG_KEYS):
            cmp = TOKS.pop()
            cmp = CMP_NEG[cmp]
    else:
        cmp = None
    if TOKS and is_name(TOKS[-1]):
        nam = TOKS.pop()
    else:
        parse_error()
    if TOKS and TOKS[-1]=='*':
        star = True
        TOKS.pop()
    else:
        star = False
    return Pkg(cmp,nam,star)


######################################################################

TAGS = [ "depend"	,
         "rdepend"	,
         "slot"		,
         "sources"	,
         "restrict"	,
         "homepage"	,
         "license"	,
         "description"	,
         "keywords"	,
         "inherited"	,
         "uses"		,
         "cdepend"	,
         "pdepend"	]

def do_file(filename):
    #print "do_file(%s)" % filename
    f = open(filename)
    lines = f.readlines()
    f.close()
    table = {}
    for tag,line in zip(TAGS,lines):
        line = string.strip(line)
        if tag=="description" or tag=="homepage" or tag=="slot":
            table[tag] = line
        elif tag=="depend" or tag=="rdepend":
            global FILE
            FILE = filename
            table[tag] = parse(line)
        else:
            table[tag] = string.split(line)
    return table

DIR = "/var/cache/edb/dep"

PACKAGE_REGEX = re.compile("^(.+)-([0-9]+(\\.[0-9]+)*[a-zA-Z]?(_(alpha|beta|pre|rc|p)[0-9]*)?(-r[0-9]+)?)$")

class Database:
    def __init__(self):
        self.table = {}
    def xml(self,indent):
        print "%s<database>" % indent
        subindent = indent + '  '
        for c in self.table.itervalues():
            c.xml(subindent)
        print "%s</database>" % indent

class Category:
    def __init__(self,cat):
        self.name  = cat
        self.table = {}
    def xml(self,indent):
        print "%s<category name='%s'>" % (indent,self.name)
        subindent = indent + '  '
        for p in self.table.itervalues():
            p.xml(subindent)
        print "%s</category>" % indent

class Package:
    def __init__(self,pkg):
        self.name  = pkg
        self.table = {}
    def xml(self,indent):
        print "%s<package name='%s'>" % (indent,self.name)
        subindent = indent + '  '
        for v in self.table.itervalues():
            v.xml(subindent)
        print "%s</package>" % indent

class Version:
    def __init__(self,ver):
        self.version = ver
        self.table   = {}
    def xml(self,indent):
        print "%s<version number='%s'>" % (indent,self.version)
        subindent = indent + '  '
        subsubindent = subindent + '  '
        for tag in TAGS:
            val = self.table.get(tag,None)
            if not val:
                print "%s<%s/>" % (subindent,tag)
            elif tag=="description" or tag=="homepage" or tag=="slot":
                print "%s<%s>%s</%s>" % (subindent,tag,escape(val),tag)
            elif tag=="depend" or tag=="rdepend":
                print "%s<%s>" % (subindent,tag)
                for d in val:
                    d.xml(subsubindent)
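                print "%s</%s>" % (subindent,tag)
            else:
                # Other tags hold whitespace-split word lists; emit them as escaped text
                print "%s<%s>%s</%s>" % (subindent,tag,escape(string.join(val)),tag)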
        print "%s</version>" % indent

def escape(s):
    s = string.replace(s,"&","&amp;")
    s = string.replace(s,"<","&lt;")
    s = string.replace(s,">","&gt;")
    s = string.replace(s,'"',"&quot;")
    s = string.replace(s,"'","&apos;")
    return s

def do_categories():
    DB = Database()
    full_table = DB.table
    for cat in os.listdir(DIR):
        curdir = DIR+"/"+cat
        files=os.listdir(curdir)
        files.sort()
        CAT = Category(cat)
        cat_table = CAT.table
        full_table[cat] = CAT
        cur_pkg = None
        for f in files:
            res = PACKAGE_REGEX.match(f)
            if res is None:
                # Skip cache entries whose names do not look like name-version
                continue
            pkg = res.group(1)
            ver = res.group(2)
            if cat_table.has_key(pkg):
                PKG = cat_table[pkg]
                pkg_table = PKG.table
            else:
                PKG = Package(pkg)
                pkg_table = PKG.table
                cat_table[pkg] = PKG
            VER = Version(ver)
            VER.table = do_file(curdir+'/'+f)
            pkg_table[ver] = VER
    return DB

DB = do_categories()
DB.xml('')
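
(A possible way to consume the result, sketched in present-day Python with
the standard ElementTree module.  The sample document is hand-written in
the shape toxml.py prints, and the two helper names are invented for the
example; real data would come from ET.parse("EDB.xml").getroot().)

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape toxml.py prints.
SAMPLE = """
<database>
  <category name='kde-base'>
    <package name='kdelibs'>
      <version number='3.1'>
        <depend>
          <package name='x11-libs/qt-3.1' cmp='ge'/>
        </depend>
        <description>KDE libraries</description>
      </version>
    </package>
  </category>
</database>
"""

def list_versions(root):
    """Yield (category, package, version, description) tuples."""
    for cat in root.findall("category"):
        for pkg in cat.findall("package"):
            for ver in pkg.findall("version"):
                yield (cat.get("name"), pkg.get("name"),
                       ver.get("number"), ver.findtext("description", ""))

def direct_depends(version):
    """Names of unconditional <package> entries directly under <depend>."""
    return [p.get("name") for p in version.findall("depend/package")]

root = ET.fromstring(SAMPLE)
for row in list_versions(root):
    print(row)
print(direct_depends(root.find("category/package/version")))
```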

[-- Attachment #2: Type: text/plain, Size: 73 bytes --]


Cheers,

-- 
Dr. Denys Duchier
Équipe Calligramme
LORIA, Nancy, FRANCE



* [gentoo-dev] Gentoo XML Database: More Data
  2003-02-07 17:34 [gentoo-dev] Gentoo XML Database Yannick Koehler
  2003-02-07 18:28 ` Vano D
  2003-02-08 15:14 ` [gentoo-dev] " Denys Duchier
@ 2003-02-11 14:56 ` Yannick Koehler
  2 siblings, 0 replies; 7+ messages in thread
From: Yannick Koehler @ 2003-02-11 14:56 UTC
  To: gentoo-dev

On February 7, 2003 12:34 pm, Yannick Koehler wrote:
> Discussion with carspaski reveal thought that the use of the database will
> actually not speed up emerge.  Because emerge loads the cache inside an
> internal memory database and python allow him to leave that in memory in
> between runs making it very fast and efficient as only the require entry of
> the database gets loaded instead of the whole database.

Just a note about that comment I made.  When I said that it wouldn't speed
up emerge, I was referring to specific functions.  For example, if you run
the following for the first time:

emerge kde

Emerge will then fetch the kde ebuild, get the dependencies and fetch all
of them, cache this information in its internal persistent db, and then
execute the operation.

In a DB mode, the database, or its index, needs to be loaded in some way.
It is hard to imagine that the number of I/O operations would actually be
lower than in the current approach described above.

But there are cases where a db would speed up portage, and that's why the
XML file is getting interesting.  It allows importing into a db that knows
about XML, or trying things out with XML/text-related tools.

> Some benefit I see from the xml db is for side-tools, for example search
> description of ebuilds is faster when using xml db as it is a single file
> and software only look for string that start with <description>.  One can
> use grep/regexp to do such query or built an xml capable application.

I have done the following experiment.  When I posted the original mail, I
generated the gentoo.xml file.  The file was actually 4762563 bytes.  I
reported 9525071 bytes, but this was wrong: my script had generated a
doubled file.  I found this by using

grep "version name=\"kdelibs-3.1" gentoo.xml

which output two instances.  After the correction the file was smaller.
Something else that got my attention is that generating a gentoo.xml today
gave me a 4852253-byte file.  Now if you compare the dates/sizes:

2003-02-07 12:34 -> 4762563 bytes
2003-02-11 09:10 -> 4852253 bytes

This is an 89690-byte difference over those four days.  Running emerge
rsync daily costs me more than 1.2 megs a shot.  So, a quick calculation:

11 Feb - 7 Feb = 4 days.
4 * 1.2 megs = 4.8 megs
gentoo.xml growth over the same 4 days = ~90k
4.8 megs - ~90k = ~4.7 megs

Which means that, if gentoo.xml contained all the info required for
calculating the dependencies and fetching only the required ebuilds, which
I'm pretty sure it does, I have wasted about 4.7 megs of bandwidth over
the past 4 days.
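
(The same calculation as a short Python sketch, computing the size
difference directly from the two file sizes and treating it as the total
growth over the four days.  "Megs" are taken as 10**6 bytes here, which is
an assumption, and 1.2 megs per day is the rough figure quoted above.)

```python
MEG = 1000 * 1000           # assumption: 1 meg = 10**6 bytes

size_feb07 = 4762563        # gentoo.xml on 2003-02-07
size_feb11 = 4852253        # gentoo.xml on 2003-02-11
days = 4
rsync_per_day = 1200000     # rough daily "emerge rsync" transfer, in bytes

growth = size_feb11 - size_feb07    # total growth of gentoo.xml
rsync_total = days * rsync_per_day  # total transferred by emerge rsync

print("xml growth over %d days:  %d bytes" % (days, growth))
print("average growth per day:  %d bytes" % (growth // days))
print("rsync total over %d days: %.1f megs" % (days, rsync_total / MEG))
print("difference:              %.2f megs" % ((rsync_total - growth) / MEG))
```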

Now consider that I'm not alone, and that in my case I have a shared
repository both at home and at work: this is a huge waste of bandwidth for
only 4 days...  And those are bytes, not bits...

Other information...

ykoehler@corneille ykoehler $ time grep "version name=" -c gentoo.xml gentoo2.xml
gentoo.xml:7262
gentoo2.xml:7373

real    0m0.107s
user    0m0.010s
sys     0m0.050s

It takes less than 1 second for grep to scan the files and find all
occurrences of version name=".  While it is true that grep doesn't build
data structures in memory or parse the inner part of the version tag, this
actually makes me rethink my original discussion with carpaski, where we
came to the conclusion that the speedup for emerge would be minimal.  I now
think that a proper XML file, generated so as to minimize parsing even
further, has the potential for huge savings in bandwidth, hard disk space
and speed on a local PC.
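
(For comparison, a Python sketch of the same kind of scan: count the lines
containing a fixed string, as grep -c does, and time it.  The sample lines
are a made-up stand-in for gentoo.xml and the function name is invented
for the example.)

```python
import time

def count_matching_lines(lines, needle):
    """Count lines containing needle, like `grep -c` does."""
    return sum(1 for line in lines if needle in line)

# Made-up stand-in for gentoo.xml; with a real file, iterate over
# open("gentoo.xml", encoding="utf-8") instead.
sample = ['<version name="kdelibs-3.1">', "<slot>3.1</slot>", "</version>",
          '<version name="kdebase-3.1">', "</version>"]

start = time.perf_counter()
n = count_matching_lines(sample, 'version name="')
elapsed = time.perf_counter() - start
print("%d matches in %.6f s" % (n, elapsed))
```

Like grep, this reads the file once from start to end and builds no data
structure, so it only gives a lower bound on what a real parser would
cost.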

It also takes less time to generate the XML file than to run emerge rsync
for one day of changes.  For example, I ran emerge rsync this morning:

rsync[15612] (receiver) heap statistics:
  arena:        5362104   (bytes from sbrk)
  ordblks:          551   (chunks not in use)
  smblks:             2
  hblks:              1   (chunks from mmap)
  hblkhd:        258048   (bytes from mmap)
  usmblks:            0
  fsmblks:           40
  uordblks:     4667912   (bytes used)
  fordblks:      694192   (bytes free)
  keepcost:       53464   (bytes in releasable chunk)

Number of files: 36464
Number of files transferred: 14444
Total file size: 29101431 bytes
Total transferred file size: 14445832 bytes
Literal data: 69 bytes
Matched data: 14445763 bytes
File list size: 848703
Total bytes written: 408076
Total bytes read: 1349498

wrote 408076 bytes  read 1349498 bytes  1858.88 bytes/sec
total size is 29101431  speedup is 16.56

>>> Updating Portage cache...

real    16m19.682s
user    0m21.990s
sys     0m40.790s

So 16 min. and 1.3 megs later, I got the database update for today...

corneille dep # time /home/ykoehler/test.sh >/home/ykoehler/gentoo3.xml

real    1m17.284s
user    0m21.780s
sys     0m32.350s

This generated a gentoo.xml file of 4865557 bytes, which, had I rsynced it
from the server, would have translated into a transfer of roughly
(4865557 - 4852253) = 13304 bytes.

I heard that there is already work on getting portage to use a real db
such as berkeley or mysql.  In any case, I think the distribution format
that makes the most sense is XML.  This format can easily be manipulated
using XSLT to fit the needs of many people and can be used with almost any
existing text tool.  Using XSL, it can also be nicely converted into HTML,
and a dependency tree is easily built from there.  Rsync could quickly
figure out which parts changed and update those in a fraction of the time
it takes today to diff the trees.

Hardware info:
P3 800 MHz
Slow IDE hard disk, 5400 rpm
256 megs ram
i810 chipset

So, with those numbers now extracted, I'm about to attempt to move emerge to 
this system on my PC and get you more "real" numbers using that mode.

-- 

Yannick Koehler


end of thread, other threads:[~2003-02-11 15:05 UTC | newest]

Thread overview: 7+ messages
2003-02-07 17:34 [gentoo-dev] Gentoo XML Database Yannick Koehler
2003-02-07 18:28 ` Vano D
2003-02-07 18:41   ` Vano D
2003-02-07 18:46     ` Yannick Koehler
2003-02-07 19:10       ` Vano D
2003-02-08 15:14 ` [gentoo-dev] " Denys Duchier
2003-02-11 14:56 ` [gentoo-dev] Gentoo XML Database: More Data Yannick Koehler
