From: Zac Medico <zmedico@gentoo.org>
To: gentoo-portage-dev@lists.gentoo.org
Cc: Zac Medico <zmedico@gentoo.org>
Subject: [gentoo-portage-dev] [PATCH v2] emirrordist: add --content-db option required for content-hash layout (bug 756778)
Date: Fri, 26 Feb 2021 04:21:50 -0800 [thread overview]
Message-ID: <20210226122150.1112987-1-zmedico@gentoo.org> (raw)
In-Reply-To: <20210225012610.814758-1-zmedico@gentoo.org>
Add a --content-db option which is required for the content-hash
layout because its file listings return content digests instead of
distfile names.
The content db serves to translate content digests to distfiles
names, and distfiles names to content digests. All keys have a
prefix separated by a colon. For digest keys, the prefix is the
hash algorithm name. For filename keys, the prefix is "filename".
The value associated with a digest key is a plain filename. The
value associated with a distfile key is a set of content revisions.
Each content revision is expressed as a dictionary of digests which
is suitable for construction of a DistfileName instance.
Bug: https://bugs.gentoo.org/756778
Signed-off-by: Zac Medico <zmedico@gentoo.org>
---
[PATCH v2] Split out ContentDB class and associate distfile key
with a set of content revisions, where each content revision is
expressed as a dictionary of digests.
lib/portage/_emirrordist/Config.py | 8 +-
lib/portage/_emirrordist/ContentDB.py | 158 +++++++++++++++++++
lib/portage/_emirrordist/DeletionIterator.py | 25 ++-
lib/portage/_emirrordist/DeletionTask.py | 8 +
lib/portage/_emirrordist/FetchTask.py | 5 +-
lib/portage/_emirrordist/main.py | 15 +-
lib/portage/tests/ebuild/test_fetch.py | 14 ++
man/emirrordist.1 | 6 +-
8 files changed, 232 insertions(+), 7 deletions(-)
create mode 100644 lib/portage/_emirrordist/ContentDB.py
diff --git a/lib/portage/_emirrordist/Config.py b/lib/portage/_emirrordist/Config.py
index 4bee4f45e..cfe944040 100644
--- a/lib/portage/_emirrordist/Config.py
+++ b/lib/portage/_emirrordist/Config.py
@@ -1,4 +1,4 @@
-# Copyright 2013-2020 Gentoo Authors
+# Copyright 2013-2021 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2
import copy
@@ -10,6 +10,7 @@ import time
from portage import os
from portage.package.ebuild.fetch import MirrorLayoutConfig
from portage.util import grabdict, grablines
+from .ContentDB import ContentDB
class Config:
def __init__(self, options, portdb, event_loop):
@@ -65,6 +66,11 @@ class Config:
self.distfiles_db = self._open_shelve(
options.distfiles_db, 'distfiles')
+ self.content_db = None
+ if options.content_db is not None:
+ self.content_db = ContentDB(self._open_shelve(
+ options.content_db, 'content'))
+
self.deletion_db = None
if options.deletion_db is not None:
self.deletion_db = self._open_shelve(
diff --git a/lib/portage/_emirrordist/ContentDB.py b/lib/portage/_emirrordist/ContentDB.py
new file mode 100644
index 000000000..60e6ef39d
--- /dev/null
+++ b/lib/portage/_emirrordist/ContentDB.py
@@ -0,0 +1,158 @@
+# Copyright 2021 Gentoo Authors
+# Distributed under the terms of the GNU General Public License v2
+
+import logging
+import operator
+import shelve
+import typing
+
+from portage.package.ebuild.fetch import DistfileName
+
+
+class ContentDB:
+ """
+ The content db serves to translate content digests to distfiles
+ names, and distfiles names to content digests. All keys have a
+ prefix separated by a colon. For digest keys, the prefix is the
+ hash algorithm name. For filename keys, the prefix is "filename".
+
+ The value associated with a digest key is a plain filename. The
+ value associated with a distfile key is a set of content revisions.
+ Each content revision is expressed as a dictionary of digests which
+ is suitable for construction of a DistfileName instance.
+ """
+
+ def __init__(self, shelve_instance: shelve.Shelf):
+ self._shelve = shelve_instance
+
+ def add(self, filename: DistfileName):
+ """
+ Add file name and digests.
+
+ @param filename: file name with digests attribute
+ """
+ distfile_str = str(filename)
+ distfile_key = "filename:{}".format(distfile_str)
+ for k, v in filename.digests.items():
+ if k != "size":
+ self._shelve["{}:{}".format(k, v).lower()] = distfile_str
+ try:
+ content_revisions = self._shelve[distfile_key]
+ except KeyError:
+ content_revisions = set()
+
+ revision_key = tuple(
+ sorted(
+ (
+ (algo.lower(), filename.digests[algo].lower())
+ for algo in filename.digests
+ if algo != "size"
+ ),
+ key=operator.itemgetter(0),
+ )
+ )
+ content_revisions.add(revision_key)
+ self._shelve[distfile_key] = content_revisions
+
+ def remove(self, filename: DistfileName):
+ """
+ Remove a file name from the database.
+
+ @param filename: file name with digests attribute
+ """
+ distfile_key = "filename:{}".format(filename)
+ try:
+ content_revisions = self._shelve[distfile_key]
+ except KeyError:
+ pass
+ else:
+ for revision_key in content_revisions:
+ for k, v in revision_key:
+ try:
+ del self._shelve["{}:{}".format(k, v)]
+ except KeyError:
+ pass
+
+ logging.debug(("drop '%s' from content db") % filename)
+ try:
+ del self._shelve[distfile_key]
+ except KeyError:
+ pass
+
+ def get_filenames_translate(
+ self, filename: typing.Union[str, DistfileName]
+ ) -> typing.Generator[DistfileName, None, None]:
+ """
+ Translate distfiles content digests to distfile names.
+ If filename is already a distfile name, then it will pass
+ through unchanged.
+
+ @param filename: A filename listed by layout get_filenames
+ @return: The distfile name, translated from the corresponding
+ content digest when necessary
+ """
+ if not isinstance(filename, DistfileName):
+ filename = DistfileName(filename)
+ if self._shelve is None:
+ yield filename
+ return
+
+ # Match content digests with zero or more content revisions.
+ matched_revisions = {}
+
+ for k, v in filename.digests.items():
+ digest_item = (k.lower(), v.lower())
+ digest_key = "{}:{}".format(*digest_item)
+ try:
+ distfile_str = self._shelve[digest_key]
+ except KeyError:
+ continue
+
+ matched_revisions.setdefault(distfile_str, set())
+ try:
+ content_revisions = self._shelve["filename:{}".format(distfile_str)]
+ except KeyError:
+ pass
+ else:
+ for revision_key in content_revisions:
+ if (
+ digest_item in revision_key
+ and revision_key not in matched_revisions.get(distfile_str, ())
+ ):
+ matched_revisions[distfile_str].add(revision_key)
+ yield DistfileName(distfile_str, digests=dict(revision_key))
+
+ if not any(matched_revisions.values()):
+ # Since filename matched zero content revisions, allow
+ # it to pass through unchanged (on the path toward deletion).
+ yield filename
+
+ def __len__(self):
+ return len(self._shelve)
+
+ def __contains__(self, k):
+ return k in self._shelve
+
+ def __iter__(self):
+ return self._shelve.__iter__()
+
+ def items(self):
+ return self._shelve.iteritems()
+
+ def __setitem__(self, k, v):
+ self._shelve[k] = v
+
+ def __getitem__(self, k):
+ return self._shelve[k]
+
+ def __delitem__(self, k):
+ del self._shelve[k]
+
+ def get(self, k, *args):
+ return self._shelve.get(k, *args)
+
+ def close(self):
+ self._shelve.close()
+
+ def clear(self):
+ self._shelve.clear()
diff --git a/lib/portage/_emirrordist/DeletionIterator.py b/lib/portage/_emirrordist/DeletionIterator.py
index 08985ed6c..ab4309f9a 100644
--- a/lib/portage/_emirrordist/DeletionIterator.py
+++ b/lib/portage/_emirrordist/DeletionIterator.py
@@ -1,10 +1,12 @@
-# Copyright 2013-2019 Gentoo Authors
+# Copyright 2013-2021 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2
+import itertools
import logging
import stat
from portage import os
+from portage.package.ebuild.fetch import DistfileName
from .DeletionTask import DeletionTask
class DeletionIterator:
@@ -21,8 +23,25 @@ class DeletionIterator:
deletion_delay = self._config.options.deletion_delay
start_time = self._config.start_time
distfiles_set = set()
- for layout in self._config.layouts:
- distfiles_set.update(layout.get_filenames(distdir))
+ distfiles_set.update(
+ (
+ filename
+ if isinstance(filename, DistfileName)
+ else DistfileName(filename)
+ for filename in itertools.chain.from_iterable(
+ layout.get_filenames(distdir) for layout in self._config.layouts
+ )
+ )
+ if self._config.content_db is None
+ else itertools.chain.from_iterable(
+ (
+ self._config.content_db.get_filenames_translate(filename)
+ for filename in itertools.chain.from_iterable(
+ layout.get_filenames(distdir) for layout in self._config.layouts
+ )
+ )
+ )
+ )
for filename in distfiles_set:
# require at least one successful stat()
exceptions = []
diff --git a/lib/portage/_emirrordist/DeletionTask.py b/lib/portage/_emirrordist/DeletionTask.py
index 5eb01d840..73493c5a1 100644
--- a/lib/portage/_emirrordist/DeletionTask.py
+++ b/lib/portage/_emirrordist/DeletionTask.py
@@ -5,6 +5,7 @@ import errno
import logging
from portage import os
+from portage.package.ebuild.fetch import ContentHashLayout
from portage.util._async.FileCopier import FileCopier
from _emerge.CompositeTask import CompositeTask
@@ -99,6 +100,10 @@ class DeletionTask(CompositeTask):
def _delete_links(self):
success = True
for layout in self.config.layouts:
+ if isinstance(layout, ContentHashLayout) and not self.distfile.digests:
+ logging.debug(("_delete_links: '%s' has "
+ "no digests") % self.distfile)
+ continue
distfile_path = os.path.join(
self.config.options.distfiles,
layout.get_path(self.distfile))
@@ -134,6 +139,9 @@ class DeletionTask(CompositeTask):
logging.debug(("drop '%s' from "
"distfiles db") % self.distfile)
+ if self.config.content_db is not None:
+ self.config.content_db.remove(self.distfile)
+
if self.config.deletion_db is not None:
try:
del self.config.deletion_db[self.distfile]
diff --git a/lib/portage/_emirrordist/FetchTask.py b/lib/portage/_emirrordist/FetchTask.py
index 997762082..5a48f91cd 100644
--- a/lib/portage/_emirrordist/FetchTask.py
+++ b/lib/portage/_emirrordist/FetchTask.py
@@ -1,4 +1,4 @@
-# Copyright 2013-2020 Gentoo Authors
+# Copyright 2013-2021 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2
import collections
@@ -47,6 +47,9 @@ class FetchTask(CompositeTask):
# Convert _pkg_str to str in order to prevent pickle problems.
self.config.distfiles_db[self.distfile] = str(self.cpv)
+ if self.config.content_db is not None:
+ self.config.content_db.add(self.distfile)
+
if not self._have_needed_digests():
msg = "incomplete digests: %s" % " ".join(self.digests)
self.scheduler.output(msg, background=self.background,
diff --git a/lib/portage/_emirrordist/main.py b/lib/portage/_emirrordist/main.py
index 8d00a05f5..2200ec715 100644
--- a/lib/portage/_emirrordist/main.py
+++ b/lib/portage/_emirrordist/main.py
@@ -1,4 +1,4 @@
-# Copyright 2013-2020 Gentoo Authors
+# Copyright 2013-2021 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2
import argparse
@@ -7,6 +7,7 @@ import sys
import portage
from portage import os
+from portage.package.ebuild.fetch import ContentHashLayout
from portage.util import normalize_path, _recursive_file_list
from portage.util._async.run_main_scheduler import run_main_scheduler
from portage.util._async.SchedulerInterface import SchedulerInterface
@@ -151,6 +152,12 @@ common_options = (
"distfile belongs to",
"metavar" : "FILE"
},
+ {
+ "longopt" : "--content-db",
+ "help" : "database file used to map content digests to"
+ "distfiles names (required for content-hash layout)",
+ "metavar" : "FILE"
+ },
{
"longopt" : "--recycle-dir",
"help" : "directory for extended retention of files that "
@@ -441,6 +448,12 @@ def emirrordist_main(args):
if not options.mirror:
parser.error('No action specified')
+ if options.delete and config.content_db is None:
+ for layout in config.layouts:
+ if isinstance(layout, ContentHashLayout):
+ parser.error("content-hash layout requires "
+ "--content-db to be specified")
+
returncode = os.EX_OK
if options.mirror:
diff --git a/lib/portage/tests/ebuild/test_fetch.py b/lib/portage/tests/ebuild/test_fetch.py
index d50a4cbfc..881288cdc 100644
--- a/lib/portage/tests/ebuild/test_fetch.py
+++ b/lib/portage/tests/ebuild/test_fetch.py
@@ -172,6 +172,16 @@ class EbuildFetchTestCase(TestCase):
with open(os.path.join(settings['DISTDIR'], 'layout.conf'), 'wt') as f:
f.write(layout_data)
+ if any(isinstance(layout, ContentHashLayout) for layout in layouts):
+ content_db = os.path.join(playground.eprefix, 'var/db/emirrordist/content.db')
+ os.makedirs(os.path.dirname(content_db), exist_ok=True)
+ try:
+ os.unlink(content_db)
+ except OSError:
+ pass
+ else:
+ content_db = None
+
# Demonstrate that fetch preserves a stale file in DISTDIR when no digests are given.
foo_uri = {'foo': ('{scheme}://{host}:{port}/distfiles/foo'.format(scheme=scheme, host=host, port=server.server_port),)}
foo_path = os.path.join(settings['DISTDIR'], 'foo')
@@ -233,9 +243,13 @@ class EbuildFetchTestCase(TestCase):
os.path.join(self.bindir, 'emirrordist'),
'--distfiles', settings['DISTDIR'],
'--config-root', settings['EPREFIX'],
+ '--delete',
'--repositories-configuration', settings.repositories.config_string(),
'--repo', 'test_repo', '--mirror')
+ if content_db is not None:
+ emirrordist_cmd = emirrordist_cmd + ('--content-db', content_db,)
+
env = settings.environ()
env['PYTHONPATH'] = ':'.join(
filter(None, [PORTAGE_PYM_PATH] + os.environ.get('PYTHONPATH', '').split(':')))
diff --git a/man/emirrordist.1 b/man/emirrordist.1
index 45108ef8c..7ad10dfd0 100644
--- a/man/emirrordist.1
+++ b/man/emirrordist.1
@@ -1,4 +1,4 @@
-.TH "EMIRRORDIST" "1" "Dec 2015" "Portage VERSION" "Portage"
+.TH "EMIRRORDIST" "1" "Feb 2021" "Portage VERSION" "Portage"
.SH "NAME"
emirrordist \- a fetch tool for mirroring of package distfiles
.SH SYNOPSIS
@@ -66,6 +66,10 @@ reporting purposes. Opened in append mode.
Log file for scheduled deletions, with tab\-delimited output, for
reporting purposes. Overwritten with each run.
.TP
+\fB\-\-content\-db\fR=\fIFILE\fR
+Database file used to pair content digests with distfiles names
+(required fo content\-hash layout).
+.TP
\fB\-\-delete\fR
Enable deletion of unused distfiles.
.TP
--
2.26.2
next prev parent reply other threads:[~2021-02-26 12:22 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-25 1:26 [gentoo-portage-dev] [PATCH] emirrordist: add --content-db option required for content-hash layout (bug 756778) Zac Medico
2021-02-26 12:21 ` Zac Medico [this message]
2021-02-27 2:05 ` [gentoo-portage-dev] [PATCH v3] " Zac Medico
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210226122150.1112987-1-zmedico@gentoo.org \
--to=zmedico@gentoo.org \
--cc=gentoo-portage-dev@lists.gentoo.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox