From: Mark Kubacki <wmark@hurrikane.de>
To: gentoo-portage-dev@lists.gentoo.org
Subject: [gentoo-portage-dev] [PATCH] portage: HTTP if-modified-since and compression
Date: Thu, 2 Aug 2012 01:32:57 +0200
Message-ID: <CAHw5cr+A9zSRizzC+tYVv+gyZcd6Mpp6-ioOC6_RWGK8=AS4xA@mail.gmail.com>
In-Reply-To: <CAHw5crK-ek0EZashNgskiDPMM_bDycEX==KUXPmjWaTNyaOqCw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1284 bytes --]

Hi Portage devs,

The attached patches fix some issues I've noticed as maintainer and
user of Gentoo binhost(s). They're made against master/HEAD and can
easily be backported to 2.1*.

The first patch lets Portage skip downloading a remote index if the
local copy is recent enough, e.g. because the remote index didn't
change between two "emerge" runs. This is done by setting the
"If-Modified-Since" request header. The server responds with HTTP
code 304 and Portage doesn't download a single byte of the (large)
index file.
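
For illustration, a conditional request could look roughly like this in
Python 3. This is a minimal sketch and not the code from the patch: the
names http_date and fetch_if_modified, and the optional opener
parameter, are made up for this example.

```python
import email.utils
import urllib.error
import urllib.request


def http_date(timestamp):
    """Format a Unix timestamp as an HTTP date in GMT."""
    return email.utils.formatdate(int(timestamp), usegmt=True)


def fetch_if_modified(url, local_timestamp, opener=None):
    """Return the open response, or None if the server answers 304."""
    request = urllib.request.Request(url)
    if local_timestamp:
        request.add_header("If-Modified-Since", http_date(local_timestamp))
    # 'opener' is a hypothetical hook so the function can be tried offline.
    opener = opener or urllib.request.build_opener()
    try:
        return opener.open(request)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # not modified: keep using the local copy
        raise
```

A caller would treat None as "keep the cached index" and parse the
response otherwise.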

With the second patch, Portage downloads remote indices (which are
text files, after all) compressed if the remote server supports that.
Although decompression adds a small latency, this saves bandwidth and
transmission time. That is, if the index needs to be fetched at all
(see the first patch).
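
As a rough sketch of the decoding step (Python 3, independent of the
patch; decode_body is a name made up for this example), the body can be
read once and then decompressed according to the Content-Encoding
response header. Reading it only once also means the raw-DEFLATE
fallback works on the same bytes as the first attempt.

```python
import bz2
import gzip
import zlib


def decode_body(data, content_encoding):
    """Decompress an HTTP response body that has already been read in full."""
    if content_encoding == "gzip":
        return gzip.decompress(data)
    if content_encoding == "bzip2":
        return bz2.decompress(data)
    if content_encoding == "deflate":
        try:
            return zlib.decompress(data)  # zlib-wrapped stream (RFC 1950)
        except zlib.error:
            # Some servers send a raw DEFLATE stream without the zlib wrapper.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    return data  # no or unknown encoding: pass through unchanged
```

The matching request would announce the accepted encodings in an
"Accept-Encoding" header, as the patch does.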

An index's TIMESTAMP entry is set before the corresponding file gets
written. If TIMESTAMP and the file's modification time ("mtime")
differ by a second or more, or merely fall into different seconds,
remote index files will be downloaded despite the "If-Modified-Since"
header. This is because the TIMESTAMP of the local copy is compared
with the remote index's "mtime". The third patch fixes that by setting
"mtime" = TIMESTAMP.
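
The idea can be sketched as follows (Python 3; write_index and its
simplified "Key: value" header format are assumptions for this example,
not Portage's actual writer): write the file first, then pin its access
and modification times to the TIMESTAMP stored in its header.

```python
import os


def write_index(path, header, body=""):
    """Write a minimal index and pin the file times to its TIMESTAMP."""
    with open(path, "w") as f:
        for key, value in header.items():
            f.write("%s: %s\n" % (key, value))
        f.write("\n" + body)
    stamp = int(header["TIMESTAMP"])
    # However long the write took, the file now reports TIMESTAMP as mtime.
    os.utime(path, (stamp, stamp))  # (atime, mtime)
```

A web server that derives "Last-Modified" from the file's mtime would
then report the same second that clients find in TIMESTAMP.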

-- 
Mark

[-- Attachment #2: 0001-Use-If-Modified-Since-HTTP-header-and-avoid-download.patch --]
[-- Type: application/octet-stream, Size: 4868 bytes --]

From a52924ce950fd2efbf99959e4dd0452b5feb92da Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 19:49:34 +0200
Subject: [PATCH 1/3] Use If-Modified-Since HTTP-header and avoid downloading
 a remote index if the local copy is recent enough.

---
 pym/portage/dbapi/bintree.py |   24 +++++++++++++++++++++---
 pym/portage/util/_urlopen.py |   33 ++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/pym/portage/dbapi/bintree.py b/pym/portage/dbapi/bintree.py
index 9527b07..16ae8ec 100644
--- a/pym/portage/dbapi/bintree.py
+++ b/pym/portage/dbapi/bintree.py
@@ -54,6 +54,11 @@ if sys.hexversion >= 0x3000000:
 else:
 	_unicode = unicode
 
+class UseCachedCopyOfRemoteIndex(Exception):
+	# If the local copy is recent enough
+	# then fetching the remote index can be skipped.
+	pass
+
 class bindbapi(fakedbapi):
 	_known_keys = frozenset(list(fakedbapi._known_keys) + \
 		["CHOST", "repository", "USE"])
@@ -852,6 +857,7 @@ class binarytree(object):
 				if e.errno != errno.ENOENT:
 					raise
 			local_timestamp = pkgindex.header.get("TIMESTAMP", None)
+			remote_timestamp = None
 			rmt_idx = self._new_pkgindex()
 			proc = None
 			tmp_filename = None
@@ -861,8 +867,13 @@ class binarytree(object):
 				# slash, so join manually...
 				url = base_url.rstrip("/") + "/Packages"
 				try:
-					f = _urlopen(url)
-				except IOError:
+					f = _urlopen(url, if_modified_since=local_timestamp)
+					if hasattr(f, 'headers') and f.headers.get('timestamp', ''):
+						remote_timestamp = f.headers.get('timestamp')
+				except IOError as err:
+					if hasattr(err, 'code') and err.code == 304: # not modified (since local_timestamp)
+						raise UseCachedCopyOfRemoteIndex()
+
 					path = parsed_url.path.rstrip("/") + "/Packages"
 
 					if parsed_url.scheme == 'sftp':
@@ -903,7 +914,8 @@ class binarytree(object):
 					_encodings['repo.content'], errors='replace')
 				try:
 					rmt_idx.readHeader(f_dec)
-					remote_timestamp = rmt_idx.header.get("TIMESTAMP", None)
+					if not remote_timestamp: # in case it had not been read from HTTP header
+						remote_timestamp = rmt_idx.header.get("TIMESTAMP", None)
 					if not remote_timestamp:
 						# no timestamp in the header, something's wrong
 						pkgindex = None
@@ -931,6 +943,12 @@ class binarytree(object):
 						writemsg("\n\n!!! %s\n" % \
 							_("Timed out while closing connection to binhost"),
 							noiselevel=-1)
+			except UseCachedCopyOfRemoteIndex:
+				writemsg_stdout("\n")
+				writemsg_stdout(
+					colorize("GOOD", _("Local copy of remote index is up-to-date and will be used.")) + \
+					"\n")
+				rmt_idx = pkgindex
 			except EnvironmentError as e:
 				writemsg(_("\n\n!!! Error fetching binhost package" \
 					" info from '%s'\n") % _hide_url_passwd(base_url))
diff --git a/pym/portage/util/_urlopen.py b/pym/portage/util/_urlopen.py
index 307624b..a5db411 100644
--- a/pym/portage/util/_urlopen.py
+++ b/pym/portage/util/_urlopen.py
@@ -2,6 +2,9 @@
 # Distributed under the terms of the GNU General Public License v2
 
 import sys
+from datetime import datetime
+from time import mktime
+from email.utils import formatdate, parsedate
 
 try:
 	from urllib.request import urlopen as _urlopen
@@ -14,12 +17,26 @@ except ImportError:
 	import urllib2 as urllib_request
 	from urllib import splituser as urllib_parse_splituser
 
-def urlopen(url):
+# to account for the difference between TIMESTAMP of the index' contents
+#  and the file-'mtime'
+TIMESTAMP_TOLERANCE=5
+
+def urlopen(url, if_modified_since=None):
 	try:
-		return _urlopen(url)
+		request = urllib_request.Request(url)
+		request.add_header('User-Agent', 'Gentoo Portage')
+		if if_modified_since:
+			request.add_header('If-Modified-Since', _timestamp_to_http(if_modified_since))
+		opener = urllib_request.build_opener()
+		hdl = opener.open(request)
+		if hdl.headers.get('last-modified', ''):
+			hdl.headers.addheader('timestamp', _http_to_timestamp(hdl.headers.get('last-modified')))
+		return hdl
 	except SystemExit:
 		raise
-	except Exception:
+	except Exception as e:
+		if hasattr(e, 'code') and e.code == 304: # HTTPError 304: not modified
+			raise
 		if sys.hexversion < 0x3000000:
 			raise
 		parse_result = urllib_parse.urlparse(url)
@@ -40,3 +57,13 @@ def _new_urlopen(url):
 	auth_handler = urllib_request.HTTPBasicAuthHandler(password_manager)
 	opener = urllib_request.build_opener(auth_handler)
 	return opener.open(url)
+
+def _timestamp_to_http(timestamp):
+	dt = datetime.fromtimestamp(float(long(timestamp)+TIMESTAMP_TOLERANCE))
+	stamp = mktime(dt.timetuple())
+	return formatdate(timeval=stamp, localtime=False, usegmt=True)
+
+def _http_to_timestamp(http_datetime_string):
+	tuple = parsedate(http_datetime_string)
+	timestamp = mktime(tuple)
+	return str(long(timestamp))
-- 
1.7.8.6


[-- Attachment #3: 0002-Add-support-for-HTTP-compression-bzip2-gzip-and-defl.patch --]
[-- Type: application/octet-stream, Size: 2588 bytes --]

From 88a289b07642cb200b83b98f03d508dcbfd2ce64 Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 20:36:31 +0200
Subject: [PATCH 2/3] Add support for HTTP compression (bzip2, gzip and
 deflate).

---
 pym/portage/util/_urlopen.py |   32 +++++++++++++++++++++++++++++++-
 1 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/pym/portage/util/_urlopen.py b/pym/portage/util/_urlopen.py
index a5db411..70535c5 100644
--- a/pym/portage/util/_urlopen.py
+++ b/pym/portage/util/_urlopen.py
@@ -5,6 +5,7 @@ import sys
 from datetime import datetime
 from time import mktime
 from email.utils import formatdate, parsedate
+from StringIO import StringIO
 
 try:
 	from urllib.request import urlopen as _urlopen
@@ -27,7 +28,7 @@ def urlopen(url, if_modified_since=None):
 		request.add_header('User-Agent', 'Gentoo Portage')
 		if if_modified_since:
 			request.add_header('If-Modified-Since', _timestamp_to_http(if_modified_since))
-		opener = urllib_request.build_opener()
+		opener = urllib_request.build_opener(CompressedResponseProcessor)
 		hdl = opener.open(request)
 		if hdl.headers.get('last-modified', ''):
 			hdl.headers.addheader('timestamp', _http_to_timestamp(hdl.headers.get('last-modified')))
@@ -67,3 +68,32 @@ def _http_to_timestamp(http_datetime_string):
 	tuple = parsedate(http_datetime_string)
 	timestamp = mktime(tuple)
 	return str(long(timestamp))
+
+class CompressedResponseProcessor(urllib_request.BaseHandler):
+	# Handler for compressed responses.
+
+	def http_request(self, req):
+		req.add_header('Accept-Encoding', 'bzip2,gzip,deflate')
+		return req
+	https_request = http_request
+
+	def http_response(self, req, response):
+		decompressed = None
+		if response.headers.get('content-encoding') == 'bzip2':
+			import bz2
+			decompressed = StringIO(bz2.decompress(response.read()))
+		elif response.headers.get('content-encoding') == 'gzip':
+			from gzip import GzipFile
+			decompressed = GzipFile(fileobj=StringIO(response.read()), mode='r')
+		elif response.headers.get('content-encoding') == 'deflate':
+			import zlib
+			try:
+				decompressed = StringIO(zlib.decompress(response.read()))
+			except zlib.error: # they ignored RFC1950
+				decompressed = StringIO(zlib.decompress(response.read(), -zlib.MAX_WBITS))
+		if decompressed:
+			old_response = response
+			response = urllib_request.addinfourl(decompressed, old_response.headers, old_response.url, old_response.code)
+			response.msg = old_response.msg
+		return response
+	https_response = http_response
-- 
1.7.8.6


[-- Attachment #4: 0003-Fix-index-file-s-mtime-which-can-differ-from-TIMESTA.patch --]
[-- Type: application/octet-stream, Size: 1545 bytes --]

From 2b7ba96c8c6e81541cfba095c113638ac9a847f4 Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 21:12:24 +0200
Subject: [PATCH 3/3] Fix index file's mtime, which can differ from TIMESTAMP.

This enables Portage to reliably query for remote indices with
HTTP-header If-Modified-Since.

Without this patch mtime is greater than TIMESTAMP for large
indices and slow storage, because writing a large file takes
time. If the difference crosses a second boundary (TIMESTAMP
08:00:00, mtime 08:00:01), Portage will always fetch the remote
index, because it appears to have been modified (the server goes
by mtime) after the local copy was made (the client sends the
local copy's TIMESTAMP).
---
 pym/portage/dbapi/bintree.py |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/pym/portage/dbapi/bintree.py b/pym/portage/dbapi/bintree.py
index 16ae8ec..0367503 100644
--- a/pym/portage/dbapi/bintree.py
+++ b/pym/portage/dbapi/bintree.py
@@ -1186,9 +1186,13 @@ class binarytree(object):
 			pkgindex.packages.append(d)
 
 			self._update_pkgindex_header(pkgindex.header)
-			f = atomic_ofstream(os.path.join(self.pkgdir, "Packages"))
+			pkgindex_filename = os.path.join(self.pkgdir, "Packages")
+			f = atomic_ofstream(pkgindex_filename)
 			pkgindex.write(f)
 			f.close()
+			# some seconds might have elapsed since TIMESTAMP
+			atime = mtime = long(pkgindex.header["TIMESTAMP"])
+			os.utime(pkgindex_filename, (atime, mtime))
 		finally:
 			if pkgindex_lock:
 				unlockfile(pkgindex_lock)
-- 
1.7.8.6

