public inbox for gentoo-portage-dev@lists.gentoo.org
* [gentoo-portage-dev] [PATCH] Contribute squashdelta syncing module
@ 2015-04-05 10:08 Michał Górny
  2015-04-16 17:38 ` Brian Dolbec
  0 siblings, 1 reply; 5+ messages in thread
From: Michał Górny @ 2015-04-05 10:08 UTC
  To: gentoo-portage-dev; +Cc: Michał Górny

The squashdelta module provides syncing via SquashFS snapshots. For the
initial sync, a complete snapshot is fetched and placed in
/var/cache/portage/squashfs. On subsequent sync operations, deltas are
fetched from the mirror and used to reconstruct the newest snapshot.

The distfile fetching logic is reused to fetch the remote files
and verify their checksums. Additionally, the sha512sum.txt file should
be OpenPGP-verified after fetching but this is currently unimplemented.

After fetching, Portage tries to (re-)mount the SquashFS image at the
repository location.
---
 cnf/repos.conf                                     |   4 +
 pym/portage/sync/modules/squashdelta/README        | 124 +++++++++++++
 pym/portage/sync/modules/squashdelta/__init__.py   |  37 ++++
 .../sync/modules/squashdelta/squashdelta.py        | 192 +++++++++++++++++++++
 4 files changed, 357 insertions(+)
 create mode 100644 pym/portage/sync/modules/squashdelta/README
 create mode 100644 pym/portage/sync/modules/squashdelta/__init__.py
 create mode 100644 pym/portage/sync/modules/squashdelta/squashdelta.py

diff --git a/cnf/repos.conf b/cnf/repos.conf
index 1ca98ca..062fc0d 100644
--- a/cnf/repos.conf
+++ b/cnf/repos.conf
@@ -6,3 +6,7 @@ location = /usr/portage
 sync-type = rsync
 sync-uri = rsync://rsync.gentoo.org/gentoo-portage
 auto-sync = yes
+
+# for daily squashfs snapshots
+#sync-type = squashdelta
+#sync-uri = mirror://gentoo/../snapshots/squashfs
diff --git a/pym/portage/sync/modules/squashdelta/README b/pym/portage/sync/modules/squashdelta/README
new file mode 100644
index 0000000..994ae6d
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/README
@@ -0,0 +1,124 @@
+==================
+ squashdelta-sync
+==================
+
+Introduction
+============
+
+Squashdelta-sync provides the squashfs syncing module for Portage.
+When used as sync-type for the repository, it fetches the complete
+repository snapshot on initial sync, and then uses squashdeltas to
+efficiently update it.
+
+While initially intended for the daily snapshot of the Gentoo
+repository, the module is designed with flexibility in mind. It can be
+used to sync any repository, without enforcing any specific snapshotting
+interval or versioning rules. However, each snapshot version identifier
+must be unique within the repository.
+
+
+Technical hosting details
+=========================
+
+The snapshot hosting needs to provide the following files:
+
+1. the current (newest) full SquashFS snapshot of the repository,
+   and optionally M past snapshots,
+
+2. the deltas from N past snapshots to the current snapshot,
+
+3. a ``sha512sum.txt`` file containing SHA-512 checksums of all hosted
+   files, optionally OpenPGP-signed.
+
+The following naming schemes are used for the snapshots and deltas,
+respectively::
+
+    ${repo_name}-${version}.sqfs
+    ${repo_name}-${old_version}-${new_version}.sqdelta
+
+where:
+
+* ``${repo_name}`` is the repository name (as specified
+  in ``repos.conf``),
+* ``${version}`` specifies the snapshot version,
+* ``${old_version}`` specifies the snapshot version which the delta
+  updates from,
+* ``${new_version}`` specifies the snapshot version which the delta
+  updates to.
+
+Version can be an arbitrary string. It does not need to be incremental,
+however each version must be unique in the repository scope.
+For example, the version can be a date, a revision number or a commit
+hash.
+
+The ``sha512sum.txt`` uses the format used by the GNU coreutils
+``sha512sum`` program. That is, it contains one or more lines consisting
+of hexadecimal SHA-512 checksum followed by whitespace, followed by
+a filename. Lines not matching that format should be ignored.
+
+Optionally, the ``sha512sum.txt`` may be OpenPGP-signed. In that case,
+the file conforms to the ASCII-armored OpenPGP message format, with
+the checksums being stored in the message body.
+
+Additionally, the ``sha512sum.txt`` needs to contain an additional line
+containing the following string::
+
+    Current: ${repo_name}-${version}
+
+This line states the current (newest) snapshot version. If snapshots
+for multiple repositories are provided in the same directory (using
+the same ``sha512sum.txt`` file), this line can occur multiple times
+or list multiple snapshots, whitespace-separated. In order not to
+introduce stray lines in the file, it is recommended to embed this
+information in the OpenPGP comment field.
+
+An example script generating daily deltas for a repository can be found
+in the squashdelta-daily-gen_ repository.
+
+.. _squashdelta-daily-gen: https://bitbucket.org/mgorny/squashdelta-daily-gen
+
+
+Technical syncing details
+=========================
+
+When performing a sync, the script first fetches the ``sha512sum.txt``
+and processes it in order to determine the list of files available
+on the mirror. It should be noted that the script will never use
+a snapshot or delta that is not listed there. If the file is
+OpenPGP-signed, the signature is verified.
+
+The script scans the ``sha512sum.txt`` for a line containing
+the following string (case-insensitive)::
+
+    Current:
+
+The text following this string is split on whitespace, and the resulting
+tokens are parsed as snapshot names. The one matching the current
+repository name is used to determine the current (newest) snapshot
+version.
+
+Afterwards, the script scans the local cache directory for the following
+symlink::
+
+    ${repo_name}-current.sqfs
+
+If the symlink exists, the file it points to is assumed to be
+the current (newest) local snapshot. Otherwise, the script assumes
+initial sync.
+
+On initial sync, the script fetches the newest snapshot from the mirror
+and places it inside the cache directory. The snapshot checksum is
+verified using ``sha512sum.txt`` and the ``${repo_name}-current.sqfs``
+symlink is created.
+
+On update, the script scans the file list for a delta transforming
+the current local snapshot to the newest remote snapshot. If such
+a delta is found, it is fetched, verified and applied to obtain
+the new snapshot. Afterwards, the resulting snapshot checksum is
+verified and the ``${repo_name}-current.sqfs`` symlink is updated.
+
+If no delta matches the version pair, it is assumed that the system is
+outdated beyond available deltas and a new snapshot is fetched instead
+(as on initial sync).
+
+.. vim:ft=rst
diff --git a/pym/portage/sync/modules/squashdelta/__init__.py b/pym/portage/sync/modules/squashdelta/__init__.py
new file mode 100644
index 0000000..1a17dea
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/__init__.py
@@ -0,0 +1,37 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+from portage.sync.config_checks import CheckSyncConfig
+
+
+DEFAULT_CACHE_LOCATION = '/var/cache/portage/squashfs'
+
+
+class CheckSquashDeltaConfig(CheckSyncConfig):
+	def __init__(self, repo, logger):
+		CheckSyncConfig.__init__(self, repo, logger)
+		self.checks.append('check_cache_location')
+
+	def check_cache_location(self):
+		# TODO: make it configurable when Portage is fixed to support
+		# arbitrary config variables
+		pass
+
+
+module_spec = {
+	'name': 'squashdelta',
+	'description': 'Syncing SquashFS images using SquashDeltas',
+	'provides': {
+		'squashdelta-module': {
+			'name': "squashdelta",
+			'class': "SquashDeltaSync",
+			'description': 'Syncing SquashFS images using SquashDeltas',
+			'functions': ['sync', 'new', 'exists'],
+			'func_desc': {
+				'sync': 'Performs the sync of the repository',
+			},
+			'validate_config': CheckSquashDeltaConfig,
+		}
+	}
+}
diff --git a/pym/portage/sync/modules/squashdelta/squashdelta.py b/pym/portage/sync/modules/squashdelta/squashdelta.py
new file mode 100644
index 0000000..a0dfc46
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/squashdelta.py
@@ -0,0 +1,192 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+import errno
+import io
+import logging
+import os
+import os.path
+import re
+
+import portage
+from portage.package.ebuild.fetch import fetch
+from portage.sync.syncbase import SyncBase
+
+from . import DEFAULT_CACHE_LOCATION
+
+
+class SquashDeltaSync(SyncBase):
+	short_desc = "Repository syncing using SquashFS deltas"
+
+	@staticmethod
+	def name():
+		return "SquashDeltaSync"
+
+	def __init__(self):
+		super(SquashDeltaSync, self).__init__(
+				'squashmerge', 'dev-util/squashmerge')
+
+	def sync(self, **kwargs):
+		self._kwargs(kwargs)
+		my_settings = portage.config(clone = self.settings)
+		cache_location = DEFAULT_CACHE_LOCATION
+
+		# override fetching location
+		my_settings['DISTDIR'] = cache_location
+
+		# make sure we append paths correctly
+		base_uri = self.repo.sync_uri
+		if not base_uri.endswith('/'):
+			base_uri += '/'
+
+		def my_fetch(fn, **kwargs):
+			kwargs['try_mirrors'] = 0
+			return fetch([base_uri + fn], my_settings, **kwargs)
+
+		# fetch sha512sum.txt
+		sha512_path = os.path.join(cache_location, 'sha512sum.txt')
+		try:
+			os.unlink(sha512_path)
+		except OSError:
+			pass
+		if not my_fetch('sha512sum.txt'):
+			return (1, False)
+
+		if 'webrsync-gpg' in my_settings.features:
+			# TODO: GPG signature verification
+			pass
+
+		# sha512sum.txt parsing
+		with io.open(sha512_path, 'r', encoding='utf8') as f:
+			data = f.readlines()
+
+		repo_re = re.compile(re.escape(self.repo.name) + '-(.*)$')
+		# current tag
+		current_re = re.compile('current:', re.IGNORECASE)
+		# checksum
+		checksum_re = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)
+
+		def iter_snapshots(lines):
+			for l in lines:
+				m = current_re.search(l)
+				if m:
+					for s in l[m.end():].split():
+						yield s
+
+		def iter_checksums(lines):
+			for l in lines:
+				m = checksum_re.match(l)
+				if m:
+					yield (m.group(2), {
+						'size': None,
+						'SHA512': m.group(1),
+					})
+
+		# look for current indicator
+		for s in iter_snapshots(data):
+			m = repo_re.match(s)
+			if m:
+				new_snapshot = m.group(0) + '.sqfs'
+				new_version = m.group(1)
+				break
+		else:
+			logging.error('Unable to find current snapshot in sha512sum.txt')
+			return (1, False)
+		new_path = os.path.join(cache_location, new_snapshot)
+
+		# get digests
+		my_digests = dict(iter_checksums(data))
+
+		# try to find a local snapshot
+		old_version = None
+		current_path = os.path.join(cache_location,
+				self.repo.name + '-current.sqfs')
+		try:
+			old_snapshot = os.readlink(current_path)
+		except OSError:
+			pass
+		else:
+			m = repo_re.match(old_snapshot)
+			if m and old_snapshot.endswith('.sqfs'):
+				old_version = m.group(1)[:-5]
+				old_path = os.path.join(cache_location, old_snapshot)
+
+		if old_version is not None:
+			if old_version == new_version:
+				logging.info('Snapshot up-to-date, verifying integrity.')
+			else:
+				# attempt to update
+				delta_path = None
+				expected_delta = '%s-%s-%s.sqdelta' % (
+						self.repo.name, old_version, new_version)
+				if expected_delta not in my_digests:
+					logging.warning('No delta for %s->%s, fetching new snapshot.'
+							% (old_version, new_version))
+				else:
+					delta_path = os.path.join(cache_location, expected_delta)
+
+					if not my_fetch(expected_delta, digests = my_digests):
+						return (4, False)
+					if not self.has_bin:
+						return (5, False)
+
+					ret = portage.process.spawn([self.bin_command,
+							old_path, delta_path, new_path], **self.spawn_kwargs)
+					if ret != os.EX_OK:
+						logging.error('Merging the delta failed')
+						return (6, False)
+
+					# pass-through to verification and cleanup
+
+		# fetch full snapshot or verify the one we have
+		if not my_fetch(new_snapshot, digests = my_digests):
+			return (2, False)
+
+		# create/update -current symlink
+		# using external ln for two reasons:
+		# 1. clean --force (unlike python's unlink+symlink)
+		# 2. easy userpriv (otherwise we'd have to lchown())
+		ret = portage.process.spawn(['ln', '-s', '-f', new_snapshot, current_path],
+				**self.spawn_kwargs)
+		if ret != os.EX_OK:
+			logging.error('Unable to set -current symlink')
+			return (3, False)
+
+		# remove old snapshot
+		if old_version is not None and old_version != new_version:
+			try:
+				os.unlink(old_path)
+			except OSError as e:
+				logging.warning('Unable to unlink old snapshot: ' + str(e))
+			if delta_path is not None:
+				try:
+					os.unlink(delta_path)
+				except OSError as e:
+					logging.warning('Unable to unlink old delta: ' + str(e))
+		try:
+			os.unlink(sha512_path)
+		except OSError as e:
+			logging.warning('Unable to unlink sha512sum.txt: ' + str(e))
+
+		mount_cmd = ['mount', current_path, self.repo.location]
+		can_mount = True
+		if os.path.ismount(self.repo.location):
+			# need to umount old snapshot
+			ret = portage.process.spawn(['umount', '-l', self.repo.location])
+			if ret != os.EX_OK:
+				logging.warning('Unable to unmount old SquashFS after update')
+				can_mount = False
+		else:
+			try:
+				os.makedirs(self.repo.location)
+			except OSError as e:
+				if e.errno != errno.EEXIST:
+					raise
+
+		if can_mount:
+			ret = portage.process.spawn(mount_cmd)
+			if ret != os.EX_OK:
+				logging.warning('Unable to (re-)mount SquashFS after update')
+
+		return (0, True)
-- 
2.3.5
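
For readers following the README rules above, here is a minimal
standalone sketch of the sha512sum.txt parsing, assuming the file has
already been read into a list of lines. The function name
parse_sha512sum and its return shape are illustrative assumptions, not
part of the patch; the regular expressions follow the ones in sync().

```python
import re

# Patterns modelled on the patch, written as raw strings.
CURRENT_RE = re.compile(r'current:', re.IGNORECASE)
CHECKSUM_RE = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)


def parse_sha512sum(lines, repo_name):
    """Return (current_version, digests) for repo_name.

    Hypothetical helper: current_version is None when no "Current:"
    token matches repo_name; digests maps filename -> hex SHA-512.
    """
    repo_re = re.compile(re.escape(repo_name) + r'-(.*)$')
    current_version = None
    digests = {}
    for line in lines:
        m = CHECKSUM_RE.match(line)
        if m:
            digests[m.group(2)] = m.group(1)
        m = CURRENT_RE.search(line)
        if m and current_version is None:
            # Tokens after "Current:" are whitespace-separated names.
            for token in line[m.end():].split():
                tm = repo_re.match(token)
                if tm:
                    current_version = tm.group(1)
                    break
    return current_version, digests
```

Lines that match neither pattern are ignored, as the README requires.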




* Re: [gentoo-portage-dev] [PATCH] Contribute squashdelta syncing module
  2015-04-05 10:08 [gentoo-portage-dev] [PATCH] Contribute squashdelta syncing module Michał Górny
@ 2015-04-16 17:38 ` Brian Dolbec
  2015-04-18 17:45   ` Michał Górny
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Brian Dolbec @ 2015-04-16 17:38 UTC
  To: gentoo-portage-dev

On Sun,  5 Apr 2015 12:08:31 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> The squashdelta module provides syncing via SquashFS snapshots. For
> the initial sync, a complete snapshot is fetched and placed in
> /var/cache/portage/squashfs. On subsequent sync operations, deltas are
> fetched from the mirror and used to reconstruct the newest snapshot.
> 
> The distfile fetching logic is reused to fetch the remote files
> and verify their checksums. Additionally, the sha512sum.txt file
> should be OpenPGP-verified after fetching but this is currently
> unimplemented.
> 
> After fetching, Portage tries to (re-)mount the SquashFS in repository
> location.
> ---
>  cnf/repos.conf                                     |   4 +
>  pym/portage/sync/modules/squashdelta/README        | 124 +++++++++++++
>  pym/portage/sync/modules/squashdelta/__init__.py   |  37 ++++
>  .../sync/modules/squashdelta/squashdelta.py        | 192 +++++++++++++++++++++
>  4 files changed, 357 insertions(+)
>  create mode 100644 pym/portage/sync/modules/squashdelta/README
>  create mode 100644 pym/portage/sync/modules/squashdelta/__init__.py
>  create mode 100644 pym/portage/sync/modules/squashdelta/squashdelta.py
> 
> diff --git a/cnf/repos.conf b/cnf/repos.conf
> index 1ca98ca..062fc0d 100644
> --- a/cnf/repos.conf
> +++ b/cnf/repos.conf
> @@ -6,3 +6,7 @@ location = /usr/portage
>  sync-type = rsync
>  sync-uri = rsync://rsync.gentoo.org/gentoo-portage
>  auto-sync = yes
> +
> +# for daily squashfs snapshots
> +#sync-type = squashdelta
> +#sync-uri = mirror://gentoo/../snapshots/squashfs
>

<snip>

> diff --git a/pym/portage/sync/modules/squashdelta/__init__.py b/pym/portage/sync/modules/squashdelta/__init__.py
> new file mode 100644
> index 0000000..1a17dea
> --- /dev/null
> +++ b/pym/portage/sync/modules/squashdelta/__init__.py
> @@ -0,0 +1,37 @@
> +#	vim:fileencoding=utf-8:noet
> +# (c) 2015 Michał Górny <mgorny@gentoo.org>
> +# Distributed under the terms of the GNU General Public License v2
> +
> +from portage.sync.config_checks import CheckSyncConfig
> +
> +
> +DEFAULT_CACHE_LOCATION = '/var/cache/portage/squashfs'
> +
> +
> +class CheckSquashDeltaConfig(CheckSyncConfig):
> +	def __init__(self, repo, logger):
> +		CheckSyncConfig.__init__(self, repo, logger)
> +		self.checks.append('check_cache_location')
> +
> +	def check_cache_location(self):
> +		# TODO: make it configurable when Portage is fixed to support
> +		# arbitrary config variables
> +		pass
> +
> +
> +module_spec = {
> +	'name': 'squashdelta',
> +	'description': 'Syncing SquashFS images using SquashDeltas',
> +	'provides': {
> +		'squashdelta-module': {
> +			'name': "squashdelta",
> +			'class': "SquashDeltaSync",
> +			'description': 'Syncing SquashFS images using SquashDeltas',
> +			'functions': ['sync', 'new', 'exists'],
> +			'func_desc': {
> +				'sync': 'Performs the sync of the repository',
> +			},
> +			'validate_config': CheckSquashDeltaConfig,
> +		}
> +	}
> +}
> diff --git a/pym/portage/sync/modules/squashdelta/squashdelta.py b/pym/portage/sync/modules/squashdelta/squashdelta.py
> new file mode 100644
> index 0000000..a0dfc46
> --- /dev/null
> +++ b/pym/portage/sync/modules/squashdelta/squashdelta.py
> @@ -0,0 +1,192 @@
> +#	vim:fileencoding=utf-8:noet
> +# (c) 2015 Michał Górny <mgorny@gentoo.org>
> +# Distributed under the terms of the GNU General Public License v2
> +
> +import errno
> +import io
> +import logging
> +import os
> +import os.path
> +import re
> +
> +import portage
> +from portage.package.ebuild.fetch import fetch
> +from portage.sync.syncbase import SyncBase
> +
> +from . import DEFAULT_CACHE_LOCATION
> +
> +
> +class SquashDeltaSync(SyncBase):


OK, I see a small mistake.  You are subclassing SyncBase which does not
stub out a new() and you do not define one here.  But you export a new()
in the module-spec above.
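
A quick way to catch this kind of mismatch is to compare the exported
names against the class. The sketch below is only an illustration:
missing_exports and FakeSync are hypothetical names, and FakeSync is a
stand-in for the module class, not Portage's real SyncBase.

```python
def missing_exports(cls, functions):
    """Return the exported names that cls does not provide as callables."""
    return [name for name in functions
            if not callable(getattr(cls, name, None))]


class FakeSync(object):
    # Hypothetical stand-in: provides sync() and exists(), but no new().
    def sync(self, **kwargs):
        return (0, True)

    def exists(self, **kwargs):
        return True


# Same 'functions' list as in the module_spec quoted above.
print(missing_exports(FakeSync, ['sync', 'new', 'exists']))
```

With this stand-in class the only name reported is 'new'.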


> +	short_desc = "Repository syncing using SquashFS deltas"
> +
> +	@staticmethod
> +	def name():
> +		return "SquashDeltaSync"
> +
> +	def __init__(self):
> +		super(SquashDeltaSync, self).__init__(
> +				'squashmerge', 'dev-util/squashmerge')
> +
> +	def sync(self, **kwargs):
> +		self._kwargs(kwargs)
> +		my_settings = portage.config(clone = self.settings)
> +		cache_location = DEFAULT_CACHE_LOCATION
> +
> +		# override fetching location
> +		my_settings['DISTDIR'] = cache_location
> +
> +		# make sure we append paths correctly
> +		base_uri = self.repo.sync_uri
> +		if not base_uri.endswith('/'):
> +			base_uri += '/'
> +
> +		def my_fetch(fn, **kwargs):
> +			kwargs['try_mirrors'] = 0
> +			return fetch([base_uri + fn], my_settings,
> **kwargs) +
> +		# fetch sha512sum.txt
> +		sha512_path = os.path.join(cache_location,
> 'sha512sum.txt')
> +		try:
> +			os.unlink(sha512_path)
> +		except OSError:
> +			pass
> +		if not my_fetch('sha512sum.txt'):
> +			return (1, False)
> +
> +		if 'webrsync-gpg' in my_settings.features:
> +			# TODO: GPG signature verification
> +			pass
> +
> +		# sha512sum.txt parsing
> +		with io.open(sha512_path, 'r', encoding='utf8') as f:
> +			data = f.readlines()
> +
> +		repo_re = re.compile(self.repo.name + '-(.*)$')
> +		# current tag
> +		current_re = re.compile('current:', re.IGNORECASE)
> +		# checksum
> +		checksum_re = re.compile('^([a-f0-9]{128})\s+(.*)$',
> re.IGNORECASE) +
> +		def iter_snapshots(lines):
> +			for l in lines:
> +				m = current_re.search(l)
> +				if m:
> +					for s in l[m.end():].split():
> +						yield s
> +
> +		def iter_checksums(lines):
> +			for l in lines:
> +				m = checksum_re.match(l)
> +				if m:
> +					yield (m.group(2), {
> +						'size': None,
> +						'SHA512': m.group(1),
> +					})
> +
> +		# look for current indicator
> +		for s in iter_snapshots(data):
> +			m = repo_re.match(s)
> +			if m:
> +				new_snapshot = m.group(0) + '.sqfs'
> +				new_version = m.group(1)
> +				break
> +		else:
> +			logging.error('Unable to find current
> snapshot in sha512sum.txt')
> +			return (1, False)
> +		new_path = os.path.join(cache_location, new_snapshot)
> +
> +		# get digests
> +		my_digests = dict(iter_checksums(data))
> +
> +		# try to find a local snapshot
> +		old_version = None
> +		current_path = os.path.join(cache_location,
> +				self.repo.name + '-current.sqfs')
> +		try:
> +			old_snapshot = os.readlink(current_path)
> +		except OSError:
> +			pass
> +		else:
> +			m = repo_re.match(old_snapshot)
> +			if m and old_snapshot.endswith('.sqfs'):
> +				old_version = m.group(1)[:-5]
> +				old_path =
> os.path.join(cache_location, old_snapshot) +
> +		if old_version is not None:
> +			if old_version == new_version:
> +				logging.info('Snapshot up-to-date,
> verifying integrity.')
> +			else:
> +				# attempt to update
> +				delta_path = None
> +				expected_delta = '%s-%s-%s.sqdelta'
> % (
> +						self.repo.name,
> old_version, new_version)
> +				if expected_delta not in my_digests:
> +					logging.warning('No delta
> for %s->%s, fetching new snapshot.'
> +							%
> (old_version, new_version))
> +				else:
> +					delta_path =
> os.path.join(cache_location, expected_delta) +
> +					if not
> my_fetch(expected_delta, digests = my_digests):
> +						return (4, False)
> +					if not self.has_bin:
> +						return (5, False)
> +
> +					ret =
> portage.process.spawn([self.bin_command,
> +							old_path,
> delta_path, new_path], **self.spawn_kwargs)
> +					if ret != os.EX_OK:
> +
> logging.error('Merging the delta failed')
> +						return (6, False)
> +
> +					# pass-through to
> verification and cleanup +
> +		# fetch full snapshot or verify the one we have
> +		if not my_fetch(new_snapshot, digests = my_digests):
> +			return (2, False)
> +
> +		# create/update -current symlink
> +		# using external ln for two reasons:
> +		# 1. clean --force (unlike python's unlink+symlink)
> +		# 2. easy userpriv (otherwise we'd have to lchown())
> +		ret = portage.process.spawn(['ln', '-s', '-f',
> new_snapshot, current_path],
> +				**self.spawn_kwargs)
> +		if ret != os.EX_OK:
> +			logging.error('Unable to set -current
> symlink')
> +			return (3, False)
> +
> +		# remove old snapshot
> +		if old_version is not None and old_version !=
> new_version:
> +			try:
> +				os.unlink(old_path)
> +			except OSError as e:
> +				logging.warning('Unable to unlink
> old snapshot: ' + str(e))
> +			if delta_path is not None:
> +				try:
> +					os.unlink(delta_path)
> +				except OSError as e:
> +					logging.warning('Unable to
> unlink old delta: ' + str(e))
> +		try:
> +			os.unlink(sha512_path)
> +		except OSError as e:
> +			logging.warning('Unable to unlink
> sha512sum.txt: ' + str(e)) +
> +		mount_cmd = ['mount', current_path,
> self.repo.location]
> +		can_mount = True
> +		if os.path.ismount(self.repo.location):
> +			# need to umount old snapshot
> +			ret = portage.process.spawn(['umount', '-l',
> self.repo.location])
> +			if ret != os.EX_OK:
> +				logging.warning('Unable to unmount
> old SquashFS after update')
> +				can_mount = False
> +		else:
> +			try:
> +				os.makedirs(self.repo.location)
> +			except OSError as e:
> +				if e.errno != errno.EEXIST:
> +					raise
> +
> +		if can_mount:
> +			ret = portage.process.spawn(mount_cmd)
> +			if ret != os.EX_OK:
> +				logging.warning('Unable to
> (re-)mount SquashFS after update') +
> +		return (0, True)

Overall the code itself looks decent.  Aside from the small mistake
mentioned inline, my only concern is the sheer size of sync().  It
is 162 lines and embeds 2 private functions. This code could easily be
broken up into several smaller task functions.  That would make both
the main sync() logic and the smaller task sections easier to read.  I
am not a fan of the long-winded functions and scripts present in
portage (this is by no means in the same category as many of those),
but I certainly don't want to let more of that in if I can help it, and
I aim to reduce it while I'm the lead.
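
As a rough illustration of such a split (a sketch only, not the actual
refactoring), the decision between "verify", "apply a delta" and "fetch
a full snapshot" could live in small pure functions. The names
snapshot_version, delta_name and plan_update are assumptions made up
for this example; the file naming follows the README.

```python
def snapshot_version(repo_name, snapshot_name):
    """Extract the version from '<repo>-<version>.sqfs', or return None."""
    prefix = repo_name + '-'
    suffix = '.sqfs'
    if snapshot_name.startswith(prefix) and snapshot_name.endswith(suffix):
        return snapshot_name[len(prefix):-len(suffix)]
    return None


def delta_name(repo_name, old_version, new_version):
    """Build the delta filename for an old -> new version pair."""
    return '%s-%s-%s.sqdelta' % (repo_name, old_version, new_version)


def plan_update(repo_name, old_version, new_version, available):
    """Return (action, filename); action is 'verify', 'delta' or 'full'.

    'available' is the collection of filenames listed in sha512sum.txt.
    """
    new_snapshot = '%s-%s.sqfs' % (repo_name, new_version)
    if old_version is None:
        return ('full', new_snapshot)      # initial sync
    if old_version == new_version:
        return ('verify', new_snapshot)    # already up to date
    delta = delta_name(repo_name, old_version, new_version)
    if delta in available:
        return ('delta', delta)
    return ('full', new_snapshot)          # no delta for this pair
```

sync() would then only fetch, merge and mount according to the
returned action, which keeps the branching easy to test in isolation.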


Ok, so the only data variable you wanted to add to the repos.conf was
the cache location?

I'll work on adding the gkeys integration in the gkeys branch I started
for the gpg verification.  I see no point in porting the code from
emerge-webrsync's bash to python only to be replaced by gkeys in the
very near future.  Please stub out a function & call for it when you
address the above issues.  I'll fill in the code for it.
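
Something along these lines would be enough as a placeholder. This is
a hypothetical sketch: verify_openpgp_signature and check_signature
are made-up names, and the stub deliberately verifies nothing.

```python
import logging


def verify_openpgp_signature(path):
    """Placeholder for OpenPGP verification of sha512sum.txt.

    Hypothetical stub: performs no verification and returns False so
    that callers cannot mistake "not implemented" for "verified".
    """
    logging.warning('OpenPGP verification of %s is not implemented', path)
    return False


def check_signature(features, path, verify=verify_openpgp_signature):
    """Run the verifier only when 'webrsync-gpg' is in FEATURES.

    Returns None when verification was not requested, otherwise the
    verifier's result.
    """
    if 'webrsync-gpg' not in features:
        return None
    return verify(path)
```

Passing the verifier as a parameter lets the real implementation be
dropped in later without touching the call site.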

-- 
Brian Dolbec <dolsen>




* Re: [gentoo-portage-dev] [PATCH] Contribute squashdelta syncing module
  2015-04-16 17:38 ` Brian Dolbec
@ 2015-04-18 17:45   ` Michał Górny
  2015-04-18 18:29   ` [gentoo-portage-dev] [PATCH v2] " Michał Górny
  2015-04-18 18:46   ` [gentoo-portage-dev] [PATCH v3] " Michał Górny
  2 siblings, 0 replies; 5+ messages in thread
From: Michał Górny @ 2015-04-18 17:45 UTC
  To: Brian Dolbec; +Cc: gentoo-portage-dev


On 2015-04-16, at 10:38:22
Brian Dolbec <dolsen@gentoo.org> wrote:

> On Sun,  5 Apr 2015 12:08:31 +0200
> Michał Górny <mgorny@gentoo.org> wrote:
> 
> > The squashdelta module provides syncing via SquashFS snapshots. For
> > the initial sync, a complete snapshot is fetched and placed in
> > /var/cache/portage/squashfs. On subsequent sync operations, deltas are
> > fetched from the mirror and used to reconstruct the newest snapshot.
> > 
> > The distfile fetching logic is reused to fetch the remote files
> > and verify their checksums. Additionally, the sha512sum.txt file
> > should be OpenPGP-verified after fetching but this is currently
> > unimplemented.
> > 
> > After fetching, Portage tries to (re-)mount the SquashFS in repository
> > location.
> > ---
> >  cnf/repos.conf                                     |   4 +
> >  pym/portage/sync/modules/squashdelta/README        | 124 +++++++++++++
> >  pym/portage/sync/modules/squashdelta/__init__.py   |  37 ++++
> >  .../sync/modules/squashdelta/squashdelta.py        | 192 +++++++++++++++++++++
> >  4 files changed, 357 insertions(+)
> >  create mode 100644 pym/portage/sync/modules/squashdelta/README
> >  create mode 100644 pym/portage/sync/modules/squashdelta/__init__.py
> >  create mode 100644 pym/portage/sync/modules/squashdelta/squashdelta.py
> > 
> > diff --git a/cnf/repos.conf b/cnf/repos.conf
> > index 1ca98ca..062fc0d 100644
> > --- a/cnf/repos.conf
> > +++ b/cnf/repos.conf
> > @@ -6,3 +6,7 @@ location = /usr/portage
> >  sync-type = rsync
> >  sync-uri = rsync://rsync.gentoo.org/gentoo-portage
> >  auto-sync = yes
> > +
> > +# for daily squashfs snapshots
> > +#sync-type = squashdelta
> > +#sync-uri = mirror://gentoo/../snapshots/squashfs
> >
> 
> <snip>
> 
> > diff --git a/pym/portage/sync/modules/squashdelta/__init__.py b/pym/portage/sync/modules/squashdelta/__init__.py
> > new file mode 100644
> > index 0000000..1a17dea
> > --- /dev/null
> > +++ b/pym/portage/sync/modules/squashdelta/__init__.py
> > @@ -0,0 +1,37 @@
> > +#	vim:fileencoding=utf-8:noet
> > +# (c) 2015 Michał Górny <mgorny@gentoo.org>
> > +# Distributed under the terms of the GNU General Public License v2
> > +
> > +from portage.sync.config_checks import CheckSyncConfig
> > +
> > +
> > +DEFAULT_CACHE_LOCATION = '/var/cache/portage/squashfs'
> > +
> > +
> > +class CheckSquashDeltaConfig(CheckSyncConfig):
> > +	def __init__(self, repo, logger):
> > +		CheckSyncConfig.__init__(self, repo, logger)
> > +		self.checks.append('check_cache_location')
> > +
> > +	def check_cache_location(self):
> > +		# TODO: make it configurable when Portage is fixed to support
> > +		# arbitrary config variables
> > +		pass
> > +
> > +
> > +module_spec = {
> > +	'name': 'squashdelta',
> > +	'description': 'Syncing SquashFS images using SquashDeltas',
> > +	'provides': {
> > +		'squashdelta-module': {
> > +			'name': "squashdelta",
> > +			'class': "SquashDeltaSync",
> > +			'description': 'Syncing SquashFS images using SquashDeltas',
> > +			'functions': ['sync', 'new', 'exists'],
> > +			'func_desc': {
> > +				'sync': 'Performs the sync of the repository',
> > +			},
> > +			'validate_config': CheckSquashDeltaConfig,
> > +		}
> > +	}
> > +}
> > diff --git a/pym/portage/sync/modules/squashdelta/squashdelta.py b/pym/portage/sync/modules/squashdelta/squashdelta.py
> > new file mode 100644
> > index 0000000..a0dfc46
> > --- /dev/null
> > +++ b/pym/portage/sync/modules/squashdelta/squashdelta.py
> > @@ -0,0 +1,192 @@
> > +#	vim:fileencoding=utf-8:noet
> > +# (c) 2015 Michał Górny <mgorny@gentoo.org>
> > +# Distributed under the terms of the GNU General Public License v2
> > +
> > +import errno
> > +import io
> > +import logging
> > +import os
> > +import os.path
> > +import re
> > +
> > +import portage
> > +from portage.package.ebuild.fetch import fetch
> > +from portage.sync.syncbase import SyncBase
> > +
> > +from . import DEFAULT_CACHE_LOCATION
> > +
> > +
> > +class SquashDeltaSync(SyncBase):
> 
> 
> OK, I see a small mistake.  You are subclassing SyncBase which does not
> stub out a new() and you do not define one here.  But you export a new()
> in the module-spec above.

Fixed. I removed them from [func_desc], and apparently forgot
to do so from [functions].

> > +	short_desc = "Repository syncing using SquashFS deltas"
> > +
> > +	@staticmethod
> > +	def name():
> > +		return "SquashDeltaSync"
> > +
> > +	def __init__(self):
> > +		super(SquashDeltaSync, self).__init__(
> > +				'squashmerge', 'dev-util/squashmerge')
> > +
> > +	def sync(self, **kwargs):
> > +		self._kwargs(kwargs)
> > +		my_settings = portage.config(clone = self.settings)
> > +		cache_location = DEFAULT_CACHE_LOCATION
> > +
> > +		# override fetching location
> > +		my_settings['DISTDIR'] = cache_location
> > +
> > +		# make sure we append paths correctly
> > +		base_uri = self.repo.sync_uri
> > +		if not base_uri.endswith('/'):
> > +			base_uri += '/'
> > +
> > +		def my_fetch(fn, **kwargs):
> > +			kwargs['try_mirrors'] = 0
> > +			return fetch([base_uri + fn], my_settings,
> > **kwargs) +
> > +		# fetch sha512sum.txt
> > +		sha512_path = os.path.join(cache_location,
> > 'sha512sum.txt')
> > +		try:
> > +			os.unlink(sha512_path)
> > +		except OSError:
> > +			pass
> > +		if not my_fetch('sha512sum.txt'):
> > +			return (1, False)
> > +
> > +		if 'webrsync-gpg' in my_settings.features:
> > +			# TODO: GPG signature verification
> > +			pass
> > +
> > +		# sha512sum.txt parsing
> > +		with io.open(sha512_path, 'r', encoding='utf8') as f:
> > +			data = f.readlines()
> > +
> > +		repo_re = re.compile(self.repo.name + '-(.*)$')
> > +		# current tag
> > +		current_re = re.compile('current:', re.IGNORECASE)
> > +		# checksum
> > +		checksum_re = re.compile('^([a-f0-9]{128})\s+(.*)$',
> > re.IGNORECASE) +
> > +		def iter_snapshots(lines):
> > +			for l in lines:
> > +				m = current_re.search(l)
> > +				if m:
> > +					for s in l[m.end():].split():
> > +						yield s
> > +
> > +		def iter_checksums(lines):
> > +			for l in lines:
> > +				m = checksum_re.match(l)
> > +				if m:
> > +					yield (m.group(2), {
> > +						'size': None,
> > +						'SHA512': m.group(1),
> > +					})
> > +
> > +		# look for current indicator
> > +		for s in iter_snapshots(data):
> > +			m = repo_re.match(s)
> > +			if m:
> > +				new_snapshot = m.group(0) + '.sqfs'
> > +				new_version = m.group(1)
> > +				break
> > +		else:
> > +			logging.error('Unable to find current
> > snapshot in sha512sum.txt')
> > +			return (1, False)
> > +		new_path = os.path.join(cache_location, new_snapshot)
> > +
> > +		# get digests
> > +		my_digests = dict(iter_checksums(data))
> > +
> > +		# try to find a local snapshot
> > +		old_version = None
> > +		current_path = os.path.join(cache_location,
> > +				self.repo.name + '-current.sqfs')
> > +		try:
> > +			old_snapshot = os.readlink(current_path)
> > +		except OSError:
> > +			pass
> > +		else:
> > +			m = repo_re.match(old_snapshot)
> > +			if m and old_snapshot.endswith('.sqfs'):
> > +				old_version = m.group(1)[:-5]
> > +				old_path =
> > os.path.join(cache_location, old_snapshot) +
> > +		if old_version is not None:
> > +			if old_version == new_version:
> > +				logging.info('Snapshot up-to-date,
> > verifying integrity.')
> > +			else:
> > +				# attempt to update
> > +				delta_path = None
> > +				expected_delta = '%s-%s-%s.sqdelta'
> > % (
> > +						self.repo.name,
> > old_version, new_version)
> > +				if expected_delta not in my_digests:
> > +					logging.warning('No delta
> > for %s->%s, fetching new snapshot.'
> > +							%
> > (old_version, new_version))
> > +				else:
> > +					delta_path =
> > os.path.join(cache_location, expected_delta) +
> > +					if not
> > my_fetch(expected_delta, digests = my_digests):
> > +						return (4, False)
> > +					if not self.has_bin:
> > +						return (5, False)
> > +
> > +					ret =
> > portage.process.spawn([self.bin_command,
> > +							old_path,
> > delta_path, new_path], **self.spawn_kwargs)
> > +					if ret != os.EX_OK:
> > +
> > logging.error('Merging the delta failed')
> > +						return (6, False)
> > +
> > +					# pass-through to
> > verification and cleanup +
> > +		# fetch full snapshot or verify the one we have
> > +		if not my_fetch(new_snapshot, digests = my_digests):
> > +			return (2, False)
> > +
> > +		# create/update -current symlink
> > +		# using external ln for two reasons:
> > +		# 1. clean --force (unlike python's unlink+symlink)
> > +		# 2. easy userpriv (otherwise we'd have to lchown())
> > +		ret = portage.process.spawn(['ln', '-s', '-f',
> > new_snapshot, current_path],
> > +				**self.spawn_kwargs)
> > +		if ret != os.EX_OK:
> > +			logging.error('Unable to set -current
> > symlink')
> > +			return (3, False)
> > +
> > +		# remove old snapshot
> > +		if old_version is not None and old_version !=
> > new_version:
> > +			try:
> > +				os.unlink(old_path)
> > +			except OSError as e:
> > +				logging.warning('Unable to unlink
> > old snapshot: ' + str(e))
> > +			if delta_path is not None:
> > +				try:
> > +					os.unlink(delta_path)
> > +				except OSError as e:
> > +					logging.warning('Unable to
> > unlink old delta: ' + str(e))
> > +		try:
> > +			os.unlink(sha512_path)
> > +		except OSError as e:
> > +			logging.warning('Unable to unlink
> > sha512sum.txt: ' + str(e)) +
> > +		mount_cmd = ['mount', current_path,
> > self.repo.location]
> > +		can_mount = True
> > +		if os.path.ismount(self.repo.location):
> > +			# need to umount old snapshot
> > +			ret = portage.process.spawn(['umount', '-l',
> > self.repo.location])
> > +			if ret != os.EX_OK:
> > +				logging.warning('Unable to unmount
> > old SquashFS after update')
> > +				can_mount = False
> > +		else:
> > +			try:
> > +				os.makedirs(self.repo.location)
> > +			except OSError as e:
> > +				if e.errno != errno.EEXIST:
> > +					raise
> > +
> > +		if can_mount:
> > +			ret = portage.process.spawn(mount_cmd)
> > +			if ret != os.EX_OK:
> > +				logging.warning('Unable to
> > (re-)mount SquashFS after update') +
> > +		return (0, True)
> 
> Overall the code itself looks decent.  Aside from the small mistake
> mentioned inline, my only concern is the sheer size of the sync().  It
> is 162 lines and embeds 2 private functions. This code could easily be
> broken up into several smaller task functions.  It would make reading
> the main sync() logic easier as well as the smaller task sections.  I
> am not a fan of the long winded functions and scripts present in
> portage (this by no means is in the same category as many of those).
> But I certainly don't want to let more of that in if I can help it. And
> aim to reduce it while I'm the lead.

Will try.

> Ok, so the only data variable you wanted to add to the repos.conf was
> the cache location?

Yes, right now just that, I think. Maybe in the future we'd use a
per-repo OpenPGP verification switch. Something like:

  openpgp-verify = [no|yes|required]

With 'yes' meaning 'verify if signed', and 'required' meaning 'refuse
if not signed'.

> I'll work on adding the gkeys integration in the gkeys branch I started
> for the gpg verification.  I see no point in porting the code from
> emerge-webrsync's bash to python only to be replaced by gkeys in the
> very near future.  Please stub out a function & call for it when you
> address the above issues.  I'll fill in the code for it.

Ok.

-- 
Best regards,
Michał Górny

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 949 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [gentoo-portage-dev] [PATCH v2] Contribute squashdelta syncing module
  2015-04-16 17:38 ` Brian Dolbec
  2015-04-18 17:45   ` Michał Górny
@ 2015-04-18 18:29   ` Michał Górny
  2015-04-18 18:46   ` [gentoo-portage-dev] [PATCH v3] " Michał Górny
  2 siblings, 0 replies; 5+ messages in thread
From: Michał Górny @ 2015-04-18 18:29 UTC (permalink / raw
  To: gentoo-portage-dev; +Cc: Michał Górny

The squashdelta module provides syncing via SquashFS snapshots. For the
initial sync, a complete snapshot is fetched and placed in
/var/cache/portage/squashfs. On subsequent sync operations, deltas are
fetched from the mirror and used to reconstruct the newest snapshot.

The distfile fetching logic is reused to fetch the remote files
and verify their checksums. Additionally, the sha512sum.txt file should
be OpenPGP-verified after fetching but this is currently unimplemented.

After fetching, Portage tries to (re-)mount the SquashFS in repository
location.
---
 cnf/repos.conf                                     |   4 +
 pym/portage/sync/modules/squashdelta/README        | 124 +++++++++++
 pym/portage/sync/modules/squashdelta/__init__.py   |  37 ++++
 .../sync/modules/squashdelta/squashdelta.py        | 231 +++++++++++++++++++++
 4 files changed, 396 insertions(+)
 create mode 100644 pym/portage/sync/modules/squashdelta/README
 create mode 100644 pym/portage/sync/modules/squashdelta/__init__.py
 create mode 100644 pym/portage/sync/modules/squashdelta/squashdelta.py

diff --git a/cnf/repos.conf b/cnf/repos.conf
index 1ca98ca..062fc0d 100644
--- a/cnf/repos.conf
+++ b/cnf/repos.conf
@@ -6,3 +6,7 @@ location = /usr/portage
 sync-type = rsync
 sync-uri = rsync://rsync.gentoo.org/gentoo-portage
 auto-sync = yes
+
+# for daily squashfs snapshots
+#sync-type = squashdelta
+#sync-uri = mirror://gentoo/../snapshots/squashfs
diff --git a/pym/portage/sync/modules/squashdelta/README b/pym/portage/sync/modules/squashdelta/README
new file mode 100644
index 0000000..994ae6d
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/README
@@ -0,0 +1,124 @@
+==================
+ squashdelta-sync
+==================
+
+Introduction
+============
+
+Squashdelta-sync provides the squashfs syncing module for Portage.
+When used as sync-type for the repository, it fetches the complete
+repository snapshot on initial sync, and then uses squashdeltas to
+efficiently update it.
+
+While initially intended for the daily snapshot of the Gentoo
+repository, the module is designed with flexibility in mind. It can be
+used to sync any repository, without enforcing any specific snapshotting
+interval or versioning rules. However, each snapshot version identifier
+must be unique within the scope of the repository.
+
+
+Technical hosting details
+=========================
+
+The snapshot hosting needs to provide the following files:
+
+1. the current (newest) full SquashFS snapshot of the repository,
+   and optionally M past snapshots,
+
+2. the deltas from N past snapshots to the current snapshot,
+
+3. a ``sha512sum.txt`` file containing SHA-512 checksums of all hosted
+   files, optionally OpenPGP-signed.
+
+The following naming schemes are used for the snapshots and deltas,
+respectively::
+
+    ${repo_name}-${version}.sqfs
+    ${repo_name}-${old_version}-${new_version}.sqdelta
+
+where:
+
+* ``${repo_name}`` is the repository name (as specified
+  in ``repos.conf``),
+* ``${version}`` specifies the snapshot version,
+* ``${old_version}`` specifies the snapshot version which the delta
+  updates from,
+* ``${new_version}`` specifies the snapshot version which the delta
+  updates to.
+
+Version can be an arbitrary string. It does not need to be incremental,
+however each version must be unique in the repository scope.
+For example, the version can be a date, a revision number or a commit
+hash.
+
+The ``sha512sum.txt`` uses the format used by the GNU coreutils
+``sha512sum`` program. That is, it contains one or more lines consisting
+of hexadecimal SHA-512 checksum followed by whitespace, followed by
+a filename. Lines not matching that format should be ignored.
+
+Optionally, the ``sha512sum.txt`` may be OpenPGP-signed. In that case,
+the file conforms to the ASCII-armored OpenPGP message format, with
+the checksums being stored in the message body.
+
+Additionally, the ``sha512sum.txt`` needs to contain a line
+containing the following string::
+
+    Current: ${repo_name}-${version}
+
+This line states the current (newest) snapshot version. If snapshots
+for multiple repositories are provided in the same directory (using
+the same ``sha512sum.txt`` file), this line can occur multiple times or
+list multiple snapshots, whitespace-separated. In order not to introduce
+stray lines in the file, it is recommended to embed this information
+in the OpenPGP comment field.
+
+An example script generating daily deltas for a repository can be found
+in the squashdelta-daily-gen_ repository.
+
+.. _squashdelta-daily-gen: https://bitbucket.org/mgorny/squashdelta-daily-gen
+
+
+Technical syncing details
+=========================
+
+When performing a sync, the script first fetches the ``sha512sum.txt``
+and processes it in order to determine the list of files available
+on the mirror. It should be noted that the script will never use
+a snapshot or delta that is not listed there. If the file is
+OpenPGP-signed, the signature is verified.
+
+The script scans the ``sha512sum.txt`` for a line containing
+the following string (case-insensitive)::
+
+    Current:
+
+The text following this string is split on whitespace, and the resulting
+tokens are parsed as snapshot names. The one matching the current
+repository name is used to determine the current (newest) snapshot
+version.
+
+Afterwards, the script scans the local cache directory for the following
+symlink::
+
+    ${repo_name}-current.sqfs
+
+If the symlink exists, the file it points to is assumed to be
+the current (newest) local snapshot. Otherwise, the script assumes
+an initial sync.
+
+On initial sync, the script fetches the newest snapshot from the mirror
+and places it inside the cache directory. The snapshot checksum is
+verified using ``sha512sum.txt`` and the ``${repo_name}-current.sqfs``
+symlink is created.
+
+On update, the script scans the file list for a delta transforming
+the current local snapshot to the newest remote snapshot. If such
+a delta is found, it is fetched, verified and applied to obtain
+the new snapshot. Afterwards, the resulting snapshot checksum is
+verified and the ``${repo_name}-current.sqfs`` symlink is updated.
+
+If no delta matches the version pair, it is assumed that the system is
+outdated beyond available deltas and a new snapshot is fetched instead
+(as on initial sync).
+
+.. vim:ft=rst
diff --git a/pym/portage/sync/modules/squashdelta/__init__.py b/pym/portage/sync/modules/squashdelta/__init__.py
new file mode 100644
index 0000000..680835c
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/__init__.py
@@ -0,0 +1,37 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+from portage.sync.config_checks import CheckSyncConfig
+
+
+DEFAULT_CACHE_LOCATION = '/var/cache/portage/squashfs'
+
+
+class CheckSquashDeltaConfig(CheckSyncConfig):
+	def __init__(self, repo, logger):
+		CheckSyncConfig.__init__(self, repo, logger)
+		self.checks.append('check_cache_location')
+
+	def check_cache_location(self):
+		# TODO: make it configurable when Portage is fixed to support
+		# arbitrary config variables
+		pass
+
+
+module_spec = {
+	'name': 'squashdelta',
+	'description': 'Syncing SquashFS images using SquashDeltas',
+	'provides': {
+		'squashdelta-module': {
+			'name': "squashdelta",
+			'class': "SquashDeltaSync",
+			'description': 'Syncing SquashFS images using SquashDeltas',
+			'functions': ['sync'],
+			'func_desc': {
+				'sync': 'Performs the sync of the repository',
+			},
+			'validate_config': CheckSquashDeltaConfig,
+		}
+	}
+}
diff --git a/pym/portage/sync/modules/squashdelta/squashdelta.py b/pym/portage/sync/modules/squashdelta/squashdelta.py
new file mode 100644
index 0000000..796a5f0
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/squashdelta.py
@@ -0,0 +1,231 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+import errno
+import io
+import logging
+import os
+import os.path
+import re
+
+import portage
+from portage.package.ebuild.fetch import fetch
+from portage.sync.syncbase import SyncBase
+
+from . import DEFAULT_CACHE_LOCATION
+
+
+class SquashDeltaError(Exception):
+	pass
+
+
+class SquashDeltaSync(SyncBase):
+	short_desc = "Repository syncing using SquashFS deltas"
+
+	@staticmethod
+	def name():
+		return "SquashDeltaSync"
+
+	def __init__(self):
+		super(SquashDeltaSync, self).__init__(
+				'squashmerge', 'dev-util/squashmerge')
+		self.repo_re = re.compile(self.repo.name + '-(.*)$')
+
+	def _configure(self):
+		self.my_settings = portage.config(clone = self.settings)
+		self.cache_location = DEFAULT_CACHE_LOCATION
+
+		# override fetching location
+		self.my_settings['DISTDIR'] = self.cache_location
+
+		# make sure we append paths correctly
+		self.base_uri = self.repo.sync_uri
+		if not self.base_uri.endswith('/'):
+			self.base_uri += '/'
+
+	def _fetch(self, fn, **kwargs):
+		# disable implicit mirrors support since it relies on file
+		# being in distfiles/
+		kwargs['try_mirrors'] = 0
+		if not fetch([self.base_uri + fn], self.my_settings, **kwargs):
+			raise SquashDeltaError()
+
+	def _openpgp_verify(self, data):
+		if 'webrsync-gpg' in self.my_settings.features:
+			# TODO: OpenPGP signature verification, return False if it fails
+			pass
+		return True
+
+	def _parse_sha512sum(self, path):
+		# sha512sum.txt parsing
+		with io.open(path, 'r', encoding='utf8') as f:
+			data = f.readlines()
+
+		if not self._openpgp_verify(data):
+			logging.error('OpenPGP verification failed for sha512sum.txt')
+			raise SquashDeltaError()
+
+		# current tag
+		current_re = re.compile('current:', re.IGNORECASE)
+		# checksum
+		checksum_re = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)
+
+		def iter_snapshots(lines):
+			for l in lines:
+				m = current_re.search(l)
+				if m:
+					for s in l[m.end():].split():
+						yield s
+
+		def iter_checksums(lines):
+			for l in lines:
+				m = checksum_re.match(l)
+				if m:
+					yield (m.group(2), {
+						'size': None,
+						'SHA512': m.group(1),
+					})
+
+		return (iter_snapshots(data), dict(iter_checksums(data)))
+
+	def _find_newest_snapshot(self, snapshots):
+		# look for current indicator
+		for s in snapshots:
+			m = self.repo_re.match(s)
+			if m:
+				new_snapshot = m.group(0) + '.sqfs'
+				new_version = m.group(1)
+				break
+		else:
+			logging.error('Unable to find current snapshot in sha512sum.txt')
+			raise SquashDeltaError()
+
+		new_path = os.path.join(self.cache_location, new_snapshot)
+		return (new_snapshot, new_version, new_path)
+
+	def _find_local_snapshot(self, current_path):
+		# try to find a local snapshot
+		try:
+			old_snapshot = os.readlink(current_path)
+		except OSError:
+			return ('', '', '')
+		m = self.repo_re.match(old_snapshot)
+		if not m or not old_snapshot.endswith('.sqfs'):
+			# symlink does not point at one of our snapshots
+			return ('', '', '')
+		old_version = m.group(1)[:-5]
+		old_path = os.path.join(self.cache_location, old_snapshot)
+		return (old_snapshot, old_version, old_path)
+
+	def _try_delta(self, old_version, new_version, old_path, new_path, my_digests):
+		# attempt to update
+		delta_path = None
+		expected_delta = '%s-%s-%s.sqdelta' % (
+				self.repo.name, old_version, new_version)
+		if expected_delta not in my_digests:
+			logging.warning('No delta for %s->%s, fetching new snapshot.'
+					% (old_version, new_version))
+		else:
+			delta_path = os.path.join(self.cache_location, expected_delta)
+
+			# _fetch() raises SquashDeltaError on failure
+			self._fetch(expected_delta, digests = my_digests)
+			if not self.has_bin:
+				raise SquashDeltaError()
+
+			ret = portage.process.spawn([self.bin_command,
+					old_path, delta_path, new_path], **self.spawn_kwargs)
+			if ret != os.EX_OK:
+				logging.error('Merging the delta failed')
+				raise SquashDeltaError()
+		return delta_path
+
+	def _update_symlink(self, new_snapshot, current_path):
+		# using external ln for two reasons:
+		# 1. clean --force (unlike python's unlink+symlink)
+		# 2. easy userpriv (otherwise we'd have to lchown())
+		ret = portage.process.spawn(['ln', '-s', '-f', new_snapshot, current_path],
+				**self.spawn_kwargs)
+		if ret != os.EX_OK:
+			logging.error('Unable to set -current symlink')
+			raise SquashDeltaError()
+
+	def _cleanup(self, path):
+		try:
+			os.unlink(path)
+		except OSError as e:
+			logging.warning('Unable to clean up ' + path + ': ' + str(e))
+
+	def _update_mount(self, current_path):
+		mount_cmd = ['mount', current_path, self.repo.location]
+		can_mount = True
+		if os.path.ismount(self.repo.location):
+			# need to umount old snapshot
+			ret = portage.process.spawn(['umount', '-l', self.repo.location])
+			if ret != os.EX_OK:
+				logging.warning('Unable to unmount old SquashFS after update')
+				can_mount = False
+		else:
+			try:
+				os.makedirs(self.repo.location)
+			except OSError as e:
+				if e.errno != errno.EEXIST:
+					raise
+
+		if can_mount:
+			ret = portage.process.spawn(mount_cmd)
+			if ret != os.EX_OK:
+				logging.warning('Unable to (re-)mount SquashFS after update')
+
+	def sync(self, **kwargs):
+		self._kwargs(kwargs)
+
+		try:
+			self._configure()
+
+			# fetch sha512sum.txt
+			sha512_path = os.path.join(self.cache_location, 'sha512sum.txt')
+			try:
+				os.unlink(sha512_path)
+			except OSError as e:
+				if e.errno != errno.ENOENT:
+					logging.error('Unable to unlink sha512sum.txt')
+					return (1, False)
+			self._fetch('sha512sum.txt')
+
+			snapshots, my_digests = self._parse_sha512sum(sha512_path)
+
+			current_path = os.path.join(self.cache_location,
+					self.repo.name + '-current.sqfs')
+			new_snapshot, new_version, new_path = (
+					self._find_newest_snapshot(snapshots))
+			old_snapshot, old_version, old_path = (
+					self._find_local_snapshot(current_path))
+
+			if old_version:
+				if old_version == new_version:
+					logging.info('Snapshot up-to-date, verifying integrity.')
+				else:
+					delta_path = self._try_delta(old_version, new_version,
+							old_path, new_path, my_digests)
+					# pass-through to verification and cleanup
+
+			# fetch full snapshot or verify the one we have
+			self._fetch(new_snapshot, digests = my_digests)
+
+			# create/update -current symlink
+			self._update_symlink(new_snapshot, current_path)
+
+			# remove old snapshot
+			if old_version and old_version != new_version:
+				self._cleanup(old_path)
+				if delta_path is not None:
+					self._cleanup(delta_path)
+			self._cleanup(sha512_path)
+
+			self._update_mount(current_path)
+
+			return (0, True)
+		except SquashDeltaError:
+			return (1, False)
-- 
2.3.5
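
For reference, the ``sha512sum.txt`` handling described in the README
above can be sketched as a standalone snippet. This is a simplified
illustration under stated assumptions, not the module's code: the
function names parse_sha512sum() and pick_update() are hypothetical,
and repository names are matched by plain prefix rather than by the
module's regular expression.

```python
import re

CURRENT_RE = re.compile(r'current:', re.IGNORECASE)
CHECKSUM_RE = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)


def parse_sha512sum(lines):
    """Return (current snapshot names, {filename: SHA-512 hex digest})."""
    snapshots = []
    digests = {}
    for line in lines:
        m = CHECKSUM_RE.match(line)
        if m:
            digests[m.group(2)] = m.group(1)
            continue
        # any other line may carry the 'Current:' tag, e.g. in a comment
        m = CURRENT_RE.search(line)
        if m:
            snapshots.extend(line[m.end():].split())
    return snapshots, digests


def pick_update(repo_name, old_version, snapshots, digests):
    """Return the delta to fetch if one is listed, else the full snapshot."""
    prefix = repo_name + '-'
    for name in snapshots:
        if name.startswith(prefix):
            new_version = name[len(prefix):]
            break
    else:
        raise LookupError('no current snapshot listed for ' + repo_name)
    delta = '%s-%s-%s.sqdelta' % (repo_name, old_version, new_version)
    if old_version and old_version != new_version and delta in digests:
        return delta
    # no usable delta (or initial sync): fall back to the full snapshot
    return '%s-%s.sqfs' % (repo_name, new_version)
```

Lines that match neither pattern are ignored, which is what allows the
file to be wrapped in an ASCII-armored OpenPGP message.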



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [gentoo-portage-dev] [PATCH v3] Contribute squashdelta syncing module
  2015-04-16 17:38 ` Brian Dolbec
  2015-04-18 17:45   ` Michał Górny
  2015-04-18 18:29   ` [gentoo-portage-dev] [PATCH v2] " Michał Górny
@ 2015-04-18 18:46   ` Michał Górny
  2 siblings, 0 replies; 5+ messages in thread
From: Michał Górny @ 2015-04-18 18:46 UTC (permalink / raw
  To: gentoo-portage-dev; +Cc: Michał Górny

The squashdelta module provides syncing via SquashFS snapshots. For the
initial sync, a complete snapshot is fetched and placed in
/var/cache/portage/squashfs. On subsequent sync operations, deltas are
fetched from the mirror and used to reconstruct the newest snapshot.

The distfile fetching logic is reused to fetch the remote files
and verify their checksums. Additionally, the sha512sum.txt file should
be OpenPGP-verified after fetching but this is currently unimplemented.

After fetching, Portage tries to (re-)mount the SquashFS in repository
location.
---
 cnf/repos.conf                                     |   4 +
 pym/portage/sync/modules/squashdelta/README        | 124 +++++++++++
 pym/portage/sync/modules/squashdelta/__init__.py   |  37 ++++
 .../sync/modules/squashdelta/squashdelta.py        | 231 +++++++++++++++++++++
 4 files changed, 396 insertions(+)
 create mode 100644 pym/portage/sync/modules/squashdelta/README
 create mode 100644 pym/portage/sync/modules/squashdelta/__init__.py
 create mode 100644 pym/portage/sync/modules/squashdelta/squashdelta.py

diff --git a/cnf/repos.conf b/cnf/repos.conf
index 1ca98ca..062fc0d 100644
--- a/cnf/repos.conf
+++ b/cnf/repos.conf
@@ -6,3 +6,7 @@ location = /usr/portage
 sync-type = rsync
 sync-uri = rsync://rsync.gentoo.org/gentoo-portage
 auto-sync = yes
+
+# for daily squashfs snapshots
+#sync-type = squashdelta
+#sync-uri = mirror://gentoo/../snapshots/squashfs
diff --git a/pym/portage/sync/modules/squashdelta/README b/pym/portage/sync/modules/squashdelta/README
new file mode 100644
index 0000000..994ae6d
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/README
@@ -0,0 +1,124 @@
+==================
+ squashdelta-sync
+==================
+
+Introduction
+============
+
+Squashdelta-sync provides the squashfs syncing module for Portage.
+When used as sync-type for the repository, it fetches the complete
+repository snapshot on initial sync, and then uses squashdeltas to
+efficiently update it.
+
+While initially intended for the daily snapshot of the Gentoo
+repository, the module is designed with flexibility in mind. It can be
+used to sync any repository, without enforcing any specific snapshotting
+interval or versioning rules. However, each snapshot version identifier
+must be unique within the scope of the repository.
+
+
+Technical hosting details
+=========================
+
+The snapshot hosting needs to provide the following files:
+
+1. the current (newest) full SquashFS snapshot of the repository,
+   and optionally M past snapshots,
+
+2. the deltas from N past snapshots to the current snapshot,
+
+3. a ``sha512sum.txt`` file containing SHA-512 checksums of all hosted
+   files, optionally OpenPGP-signed.
+
+The following naming schemes are used for the snapshots and deltas,
+respectively::
+
+    ${repo_name}-${version}.sqfs
+    ${repo_name}-${old_version}-${new_version}.sqdelta
+
+where:
+
+* ``${repo_name}`` is the repository name (as specified
+  in ``repos.conf``),
+* ``${version}`` specifies the snapshot version,
+* ``${old_version}`` specifies the snapshot version which the delta
+  updates from,
+* ``${new_version}`` specifies the snapshot version which the delta
+  updates to.
+
+Version can be an arbitrary string. It does not need to be incremental,
+however each version must be unique in the repository scope.
+For example, the version can be a date, a revision number or a commit
+hash.
+
+The ``sha512sum.txt`` uses the format used by the GNU coreutils
+``sha512sum`` program. That is, it contains one or more lines consisting
+of hexadecimal SHA-512 checksum followed by whitespace, followed by
+a filename. Lines not matching that format should be ignored.
+
+Optionally, the ``sha512sum.txt`` may be OpenPGP-signed. In that case,
+the file conforms to the ASCII-armored OpenPGP message format, with
+the checksums being stored in the message body.
+
+Additionally, the ``sha512sum.txt`` needs to contain a line
+containing the following string::
+
+    Current: ${repo_name}-${version}
+
+This line states the current (newest) snapshot version. If snapshots
+for multiple repositories are provided in the same directory (using
+the same ``sha512sum.txt`` file), this line can occur multiple times or
+list multiple snapshots, whitespace-separated. In order not to introduce
+stray lines in the file, it is recommended to embed this information
+in the OpenPGP comment field.
+
+An example script generating daily deltas for a repository can be found
+in the squashdelta-daily-gen_ repository.
+
+.. _squashdelta-daily-gen: https://bitbucket.org/mgorny/squashdelta-daily-gen
+
+
+Technical syncing details
+=========================
+
+When performing a sync, the script first fetches the ``sha512sum.txt``
+and processes it in order to determine the list of files available
+on the mirror. It should be noted that the script will never use
+a snapshot or delta that is not listed there. If the file is
+OpenPGP-signed, the signature is verified.
+
+The script scans the ``sha512sum.txt`` for a line containing
+the following string (case-insensitive)::
+
+    Current:
+
+The text following this string is split on whitespace, and the resulting
+tokens are parsed as snapshot names. The one matching the current
+repository name is used to determine the current (newest) snapshot
+version.
+
+Afterwards, the script scans the local cache directory for the following
+symlink::
+
+    ${repo_name}-current.sqfs
+
+If the symlink exists, the file it points to is assumed to be
+the current (newest) local snapshot. Otherwise, the script assumes
+an initial sync.
+
+On initial sync, the script fetches the newest snapshot from the mirror
+and places it inside the cache directory. The snapshot checksum is
+verified using ``sha512sum.txt`` and the ``${repo_name}-current.sqfs``
+symlink is created.
+
+On update, the script scans the file list for a delta transforming
+the current local snapshot to the newest remote snapshot. If such
+a delta is found, it is fetched, verified and applied to obtain
+the new snapshot. Afterwards, the resulting snapshot checksum is
+verified and the ``${repo_name}-current.sqfs`` symlink is updated.
+
+If no delta matches the version pair, it is assumed that the system is
+outdated beyond available deltas and a new snapshot is fetched instead
+(as on initial sync).
+
+.. vim:ft=rst
diff --git a/pym/portage/sync/modules/squashdelta/__init__.py b/pym/portage/sync/modules/squashdelta/__init__.py
new file mode 100644
index 0000000..680835c
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/__init__.py
@@ -0,0 +1,37 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+from portage.sync.config_checks import CheckSyncConfig
+
+
+DEFAULT_CACHE_LOCATION = '/var/cache/portage/squashfs'
+
+
+class CheckSquashDeltaConfig(CheckSyncConfig):
+	def __init__(self, repo, logger):
+		CheckSyncConfig.__init__(self, repo, logger)
+		self.checks.append('check_cache_location')
+
+	def check_cache_location(self):
+		# TODO: make it configurable when Portage is fixed to support
+		# arbitrary config variables
+		pass
+
+
+module_spec = {
+	'name': 'squashdelta',
+	'description': 'Syncing SquashFS images using SquashDeltas',
+	'provides': {
+		'squashdelta-module': {
+			'name': "squashdelta",
+			'class': "SquashDeltaSync",
+			'description': 'Syncing SquashFS images using SquashDeltas',
+			'functions': ['sync'],
+			'func_desc': {
+				'sync': 'Performs the sync of the repository',
+			},
+			'validate_config': CheckSquashDeltaConfig,
+		}
+	}
+}
diff --git a/pym/portage/sync/modules/squashdelta/squashdelta.py b/pym/portage/sync/modules/squashdelta/squashdelta.py
new file mode 100644
index 0000000..b4911af
--- /dev/null
+++ b/pym/portage/sync/modules/squashdelta/squashdelta.py
@@ -0,0 +1,231 @@
+#	vim:fileencoding=utf-8:noet
+# (c) 2015 Michał Górny <mgorny@gentoo.org>
+# Distributed under the terms of the GNU General Public License v2
+
+import errno
+import io
+import logging
+import os
+import os.path
+import re
+
+import portage
+from portage.package.ebuild.fetch import fetch
+from portage.sync.syncbase import SyncBase
+
+from . import DEFAULT_CACHE_LOCATION
+
+
+class SquashDeltaError(Exception):
+	pass
+
+
+class SquashDeltaSync(SyncBase):
+	short_desc = "Repository syncing using SquashFS deltas"
+
+	@staticmethod
+	def name():
+		return "SquashDeltaSync"
+
+	def __init__(self):
+		super(SquashDeltaSync, self).__init__(
+				'squashmerge', 'dev-util/squashmerge')
+		self.repo_re = re.compile(self.repo.name + '-(.*)$')
+
+	def _configure(self):
+		self.my_settings = portage.config(clone = self.settings)
+		self.cache_location = DEFAULT_CACHE_LOCATION
+
+		# override fetching location
+		self.my_settings['DISTDIR'] = self.cache_location
+
+		# make sure we append paths correctly
+		self.base_uri = self.repo.sync_uri
+		if not self.base_uri.endswith('/'):
+			self.base_uri += '/'
+
+	def _fetch(self, fn, **kwargs):
+		# disable implicit mirrors support since it relies on file
+		# being in distfiles/
+		kwargs['try_mirrors'] = 0
+		if not fetch([self.base_uri + fn], self.my_settings, **kwargs):
+			raise SquashDeltaError()
+
+	def _openpgp_verify(self, data):
+		if 'webrsync-gpg' in self.my_settings.features:
+			# TODO: OpenPGP signature verification, return False if it fails
+			pass
+		return True
+
+	def _parse_sha512sum(self, path):
+		# sha512sum.txt parsing
+		with io.open(path, 'r', encoding='utf8') as f:
+			data = f.readlines()
+
+		if not self._openpgp_verify(data):
+			logging.error('OpenPGP verification failed for sha512sum.txt')
+			raise SquashDeltaError()
+
+		# current tag
+		current_re = re.compile('current:', re.IGNORECASE)
+		# checksum
+		checksum_re = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)
+
+		def iter_snapshots(lines):
+			for l in lines:
+				m = current_re.search(l)
+				if m:
+					for s in l[m.end():].split():
+						yield s
+
+		def iter_checksums(lines):
+			for l in lines:
+				m = checksum_re.match(l)
+				if m:
+					yield (m.group(2), {
+						'size': None,
+						'SHA512': m.group(1),
+					})
+
+		return (iter_snapshots(data), dict(iter_checksums(data)))
+
+	def _find_newest_snapshot(self, snapshots):
+		# look for current indicator
+		for s in snapshots:
+			m = self.repo_re.match(s)
+			if m:
+				new_snapshot = m.group(0) + '.sqfs'
+				new_version = m.group(1)
+				break
+		else:
+			logging.error('Unable to find current snapshot in sha512sum.txt')
+			raise SquashDeltaError()
+
+		new_path = os.path.join(self.cache_location, new_snapshot)
+		return (new_snapshot, new_version, new_path)
+
+	def _find_local_snapshot(self, current_path):
+		# try to find a local snapshot
+		try:
+			old_snapshot = os.readlink(current_path)
+		except OSError:
+			return ('', '', '')
+		m = self.repo_re.match(old_snapshot)
+		if not m or not old_snapshot.endswith('.sqfs'):
+			# symlink does not point at one of our snapshots
+			return ('', '', '')
+		old_version = m.group(1)[:-5]
+		old_path = os.path.join(self.cache_location, old_snapshot)
+		return (old_snapshot, old_version, old_path)
+
+	def _try_delta(self, old_version, new_version, old_path, new_path, my_digests):
+		# attempt to update
+		delta_path = None
+		expected_delta = '%s-%s-%s.sqdelta' % (
+				self.repo.name, old_version, new_version)
+		if expected_delta not in my_digests:
+			logging.warning('No delta for %s->%s, fetching new snapshot.'
+					% (old_version, new_version))
+		else:
+			delta_path = os.path.join(self.cache_location, expected_delta)
+
+			# _fetch() raises SquashDeltaError on failure
+			self._fetch(expected_delta, digests = my_digests)
+			if not self.has_bin:
+				raise SquashDeltaError()
+
+			ret = portage.process.spawn([self.bin_command,
+					old_path, delta_path, new_path], **self.spawn_kwargs)
+			if ret != os.EX_OK:
+				logging.error('Merging the delta failed')
+				raise SquashDeltaError()
+		return delta_path
+
+	def _update_symlink(self, new_snapshot, current_path):
+		# using external ln for two reasons:
+		# 1. clean --force (unlike python's unlink+symlink)
+		# 2. easy userpriv (otherwise we'd have to lchown())
+		ret = portage.process.spawn(['ln', '-s', '-f', new_snapshot, current_path],
+				**self.spawn_kwargs)
+		if ret != os.EX_OK:
+			logging.error('Unable to set -current symlink')
+			raise SquashDeltaError()
+
+	def _cleanup(self, path):
+		try:
+			os.unlink(path)
+		except OSError as e:
+			logging.warning('Unable to clean up ' + path + ': ' + str(e))
+
+	def _update_mount(self, current_path):
+		mount_cmd = ['mount', current_path, self.repo.location]
+		can_mount = True
+		if os.path.ismount(self.repo.location):
+			# need to umount old snapshot
+			ret = portage.process.spawn(['umount', '-l', self.repo.location])
+			if ret != os.EX_OK:
+				logging.error('Unable to unmount old SquashFS after update')
+				raise SquashDeltaError()
+		else:
+			try:
+				os.makedirs(self.repo.location)
+			except OSError as e:
+				if e.errno != errno.EEXIST:
+					raise
+
+		ret = portage.process.spawn(mount_cmd)
+		if ret != os.EX_OK:
+			logging.error('Unable to (re-)mount SquashFS after update')
+			raise SquashDeltaError()
+
+	def sync(self, **kwargs):
+		self._kwargs(kwargs)
+
+		try:
+			self._configure()
+
+			# fetch sha512sum.txt
+			sha512_path = os.path.join(self.cache_location, 'sha512sum.txt')
+			try:
+				os.unlink(sha512_path)
+			except OSError as e:
+				if e.errno != errno.ENOENT:
+					logging.error('Unable to unlink sha512sum.txt')
+					return (1, False)
+			self._fetch('sha512sum.txt')
+
+			snapshots, my_digests = self._parse_sha512sum(sha512_path)
+
+			current_path = os.path.join(self.cache_location,
+					self.repo.name + '-current.sqfs')
+			new_snapshot, new_version, new_path = (
+					self._find_newest_snapshot(snapshots))
+			old_snapshot, old_version, old_path = (
+					self._find_local_snapshot(current_path))
+
+			if old_version:
+				if old_version == new_version:
+					logging.info('Snapshot up-to-date, verifying integrity.')
+				else:
+					delta_path = self._try_delta(old_version, new_version,
+							old_path, new_path, my_digests)
+					# pass-through to verification and cleanup
+
+			# fetch full snapshot or verify the one we have
+			self._fetch(new_snapshot, digests = my_digests)
+
+			# create/update -current symlink
+			self._update_symlink(new_snapshot, current_path)
+
+			# remove old snapshot
+			if old_version and old_version != new_version:
+				self._cleanup(old_path)
+				if delta_path is not None:
+					self._cleanup(delta_path)
+			self._cleanup(sha512_path)
+
+			self._update_mount(current_path)
+
+			return (0, True)
+		except SquashDeltaError:
+			return (1, False)
-- 
2.3.5
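
For readers following the checksum handling in _parse_sha512sum, here is
a minimal standalone sketch of the same idea: match "<128 hex digits>
<file name>" lines and build a digest dict per file. The function name
parse_checksums, the sample file name and the sample contents are
assumptions made for illustration; they are not part of the patch, and
the 'current' indicator line is deliberately left out.

```python
import hashlib
import re

# Same pattern as checksum_re in the patch, written as a raw string so
# that \s is not treated as a string escape.
CHECKSUM_RE = re.compile(r'^([a-f0-9]{128})\s+(.*)$', re.IGNORECASE)


def parse_checksums(lines):
    """Map each file name to a digest dict shaped like the patch's.

    Lines that do not look like sha512sum output are skipped.
    """
    digests = {}
    for line in lines:
        m = CHECKSUM_RE.match(line)
        if m:
            digests[m.group(2)] = {'size': None, 'SHA512': m.group(1)}
    return digests


# Hypothetical file name and contents, for illustration only.
sample_digest = hashlib.sha512(b'example').hexdigest()
sample_lines = [
    '%s  example-20150405.sqfs' % sample_digest,
    'this line is not a checksum and is skipped',
]
parsed = parse_checksums(sample_lines)
```

The sketch returns a plain dict instead of chaining generators, so it
makes no assumption about whether the input can be iterated twice.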




end of thread, other threads:[~2015-04-18 18:46 UTC | newest]

Thread overview: 5+ messages
2015-04-05 10:08 [gentoo-portage-dev] [PATCH] Contribute squashdelta syncing module Michał Górny
2015-04-16 17:38 ` Brian Dolbec
2015-04-18 17:45   ` Michał Górny
2015-04-18 18:29   ` [gentoo-portage-dev] [PATCH v2] " Michał Górny
2015-04-18 18:46   ` [gentoo-portage-dev] [PATCH v3] " Michał Górny
