From: Mark Kubacki <wmark@hurrikane.de>
To: gentoo-portage-dev@lists.gentoo.org
Subject: [gentoo-portage-dev] [PATCH] portage: HTTP if-modified-since and compression
Date: Thu, 2 Aug 2012 01:32:57 +0200
Message-ID: <CAHw5cr+A9zSRizzC+tYVv+gyZcd6Mpp6-ioOC6_RWGK8=AS4xA@mail.gmail.com>
In-Reply-To: <CAHw5crK-ek0EZashNgskiDPMM_bDycEX==KUXPmjWaTNyaOqCw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1284 bytes --]
Hi Portage devs,
The attached patches fix some issues I've noticed as maintainer and
user of Gentoo binhost(s). They're made against master/HEAD and can
easily be backported to 2.1*.
The first patch enables Portage to skip downloading a remote index if
the local copy is recent enough, e.g. when the remote index didn't change
between two "emerge" runs. This is done by setting the "If-Modified-Since"
request header. The server then responds with HTTP status 304 and Portage
doesn't download a single byte of the (large) index file.
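To illustrate the idea outside of Portage, here is a minimal Python 3 sketch of such a conditional request. It is not the code from the patch: the function names and the example URL are made up for illustration, and it assumes the local TIMESTAMP is available as epoch seconds.

```python
from email.utils import formatdate
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_conditional_request(url, local_timestamp):
    """Build a GET request that lets the server skip unchanged content.

    `local_timestamp` is assumed to be epoch seconds (int or numeric string),
    like the TIMESTAMP entry of a Packages index.
    """
    request = Request(url)
    if local_timestamp:
        # HTTP dates are always expressed in GMT.
        http_date = formatdate(float(local_timestamp), usegmt=True)
        request.add_header("If-Modified-Since", http_date)
    return request


def fetch_index(url, local_timestamp=None):
    """Return the response body, or None if the server answers 304."""
    try:
        with urlopen(build_conditional_request(url, local_timestamp)) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 304:
            return None  # not modified: keep using the local copy
        raise
```

A caller would treat a None result from the hypothetical fetch_index("http://binhost.example/Packages", local_timestamp) as "use the cached index".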
With the second patch Portage downloads remote indices (which are
text files after all) compressed, if the remote server supports that.
Although decompression introduces a small latency, this saves
bandwidth and transmission time. That is, if the index needs to be
fetched at all (see the first patch).
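The decoding step can be sketched in a few lines of Python 3. This is an illustration only, not the handler from the patch: decode_body is a hypothetical helper, and it assumes the whole response body has already been read into memory once.

```python
import bz2
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    if content_encoding == "bzip2":
        return bz2.decompress(raw)
    if content_encoding == "deflate":
        try:
            # Properly wrapped zlib stream (RFC 1950).
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # Unknown or missing encoding: pass the body through unchanged.
    return raw
```

Keeping the body in a variable means the fallback for raw deflate can retry on the same bytes instead of reading from the connection a second time.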
An index's TIMESTAMP entry is set before the corresponding file gets
written. If TIMESTAMP and the file's modification time ("mtime") differ
by a second or more, or merely fall into different seconds, the remote
index will be downloaded despite the "If-Modified-Since" header. This is
because the TIMESTAMP of the local copy is compared with the remote
index's "mtime". The third patch fixes that by setting "mtime" = TIMESTAMP.
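As a standalone illustration of that last step (a Python 3 sketch, not the Portage code; write_index and its simplified file layout are made up for this example):

```python
import os


def write_index(path, timestamp, body):
    """Write an index file and pin its mtime to the recorded TIMESTAMP.

    `timestamp` is assumed to be integer epoch seconds.
    """
    with open(path, "w") as f:
        f.write("TIMESTAMP: %d\n" % timestamp)
        f.write(body)
    # Writing may have taken a while; reset atime and mtime so that the
    # web server's Last-Modified matches the TIMESTAMP inside the file.
    os.utime(path, (timestamp, timestamp))
```

With this, a client sending its local TIMESTAMP as If-Modified-Since compares like with like, however long the write took.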
--
Mark
[-- Attachment #2: 0001-Use-If-Modified-Since-HTTP-header-and-avoid-download.patch --]
[-- Type: application/octet-stream, Size: 4868 bytes --]
From a52924ce950fd2efbf99959e4dd0452b5feb92da Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 19:49:34 +0200
Subject: [PATCH 1/3] Use If-Modified-Since HTTP-header and avoid downloading
a remote index if the local copy is recent enough.
---
pym/portage/dbapi/bintree.py | 24 +++++++++++++++++++++---
pym/portage/util/_urlopen.py | 33 ++++++++++++++++++++++++++++++---
2 files changed, 51 insertions(+), 6 deletions(-)
diff --git a/pym/portage/dbapi/bintree.py b/pym/portage/dbapi/bintree.py
index 9527b07..16ae8ec 100644
--- a/pym/portage/dbapi/bintree.py
+++ b/pym/portage/dbapi/bintree.py
@@ -54,6 +54,11 @@ if sys.hexversion >= 0x3000000:
else:
_unicode = unicode
+class UseCachedCopyOfRemoteIndex(Exception):
+ # If the local copy is recent enough
+ # then fetching the remote index can be skipped.
+ pass
+
class bindbapi(fakedbapi):
_known_keys = frozenset(list(fakedbapi._known_keys) + \
["CHOST", "repository", "USE"])
@@ -852,6 +857,7 @@ class binarytree(object):
if e.errno != errno.ENOENT:
raise
local_timestamp = pkgindex.header.get("TIMESTAMP", None)
+ remote_timestamp = None
rmt_idx = self._new_pkgindex()
proc = None
tmp_filename = None
@@ -861,8 +867,13 @@ class binarytree(object):
# slash, so join manually...
url = base_url.rstrip("/") + "/Packages"
try:
- f = _urlopen(url)
- except IOError:
+ f = _urlopen(url, if_modified_since=local_timestamp)
+ if hasattr(f, 'headers') and f.headers.get('timestamp', ''):
+ remote_timestamp = f.headers.get('timestamp')
+ except IOError as err:
+ if hasattr(err, 'code') and err.code == 304: # not modified (since local_timestamp)
+ raise UseCachedCopyOfRemoteIndex()
+
path = parsed_url.path.rstrip("/") + "/Packages"
if parsed_url.scheme == 'sftp':
@@ -903,7 +914,8 @@ class binarytree(object):
_encodings['repo.content'], errors='replace')
try:
rmt_idx.readHeader(f_dec)
- remote_timestamp = rmt_idx.header.get("TIMESTAMP", None)
+ if not remote_timestamp: # in case it had not been read from HTTP header
+ remote_timestamp = rmt_idx.header.get("TIMESTAMP", None)
if not remote_timestamp:
# no timestamp in the header, something's wrong
pkgindex = None
@@ -931,6 +943,12 @@ class binarytree(object):
writemsg("\n\n!!! %s\n" % \
_("Timed out while closing connection to binhost"),
noiselevel=-1)
+ except UseCachedCopyOfRemoteIndex:
+ writemsg_stdout("\n")
+ writemsg_stdout(
+ colorize("GOOD", _("Local copy of remote index is up-to-date and will be used.")) + \
+ "\n")
+ rmt_idx = pkgindex
except EnvironmentError as e:
writemsg(_("\n\n!!! Error fetching binhost package" \
" info from '%s'\n") % _hide_url_passwd(base_url))
diff --git a/pym/portage/util/_urlopen.py b/pym/portage/util/_urlopen.py
index 307624b..a5db411 100644
--- a/pym/portage/util/_urlopen.py
+++ b/pym/portage/util/_urlopen.py
@@ -2,6 +2,9 @@
# Distributed under the terms of the GNU General Public License v2
import sys
+from datetime import datetime
+from time import mktime
+from email.utils import formatdate, parsedate
try:
from urllib.request import urlopen as _urlopen
@@ -14,12 +17,26 @@ except ImportError:
import urllib2 as urllib_request
from urllib import splituser as urllib_parse_splituser
-def urlopen(url):
+# to account for the difference between TIMESTAMP of the index' contents
+# and the file-'mtime'
+TIMESTAMP_TOLERANCE=5
+
+def urlopen(url, if_modified_since=None):
try:
- return _urlopen(url)
+ request = urllib_request.Request(url)
+ request.add_header('User-Agent', 'Gentoo Portage')
+ if if_modified_since:
+ request.add_header('If-Modified-Since', _timestamp_to_http(if_modified_since))
+ opener = urllib_request.build_opener()
+ hdl = opener.open(request)
+ if hdl.headers.get('last-modified', ''):
+ hdl.headers.addheader('timestamp', _http_to_timestamp(hdl.headers.get('last-modified')))
+ return hdl
except SystemExit:
raise
- except Exception:
+ except Exception as e:
+ if hasattr(e, 'code') and e.code == 304: # HTTPError 304: not modified
+ raise
if sys.hexversion < 0x3000000:
raise
parse_result = urllib_parse.urlparse(url)
@@ -40,3 +57,13 @@ def _new_urlopen(url):
auth_handler = urllib_request.HTTPBasicAuthHandler(password_manager)
opener = urllib_request.build_opener(auth_handler)
return opener.open(url)
+
+def _timestamp_to_http(timestamp):
+ dt = datetime.fromtimestamp(float(long(timestamp)+TIMESTAMP_TOLERANCE))
+ stamp = mktime(dt.timetuple())
+ return formatdate(timeval=stamp, localtime=False, usegmt=True)
+
+def _http_to_timestamp(http_datetime_string):
+ tuple = parsedate(http_datetime_string)
+ timestamp = mktime(tuple)
+ return str(long(timestamp))
--
1.7.8.6
[-- Attachment #3: 0002-Add-support-for-HTTP-compression-bzip2-gzip-and-defl.patch --]
[-- Type: application/octet-stream, Size: 2588 bytes --]
From 88a289b07642cb200b83b98f03d508dcbfd2ce64 Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 20:36:31 +0200
Subject: [PATCH 2/3] Add support for HTTP compression (bzip2, gzip and
deflate).
---
pym/portage/util/_urlopen.py | 32 +++++++++++++++++++++++++++++++-
1 files changed, 31 insertions(+), 1 deletions(-)
diff --git a/pym/portage/util/_urlopen.py b/pym/portage/util/_urlopen.py
index a5db411..70535c5 100644
--- a/pym/portage/util/_urlopen.py
+++ b/pym/portage/util/_urlopen.py
@@ -5,6 +5,7 @@ import sys
from datetime import datetime
from time import mktime
from email.utils import formatdate, parsedate
+from StringIO import StringIO
try:
from urllib.request import urlopen as _urlopen
@@ -27,7 +28,7 @@ def urlopen(url, if_modified_since=None):
request.add_header('User-Agent', 'Gentoo Portage')
if if_modified_since:
request.add_header('If-Modified-Since', _timestamp_to_http(if_modified_since))
- opener = urllib_request.build_opener()
+ opener = urllib_request.build_opener(CompressedResponseProcessor)
hdl = opener.open(request)
if hdl.headers.get('last-modified', ''):
hdl.headers.addheader('timestamp', _http_to_timestamp(hdl.headers.get('last-modified')))
@@ -67,3 +68,32 @@ def _http_to_timestamp(http_datetime_string):
tuple = parsedate(http_datetime_string)
timestamp = mktime(tuple)
return str(long(timestamp))
+
+class CompressedResponseProcessor(urllib_request.BaseHandler):
+ # Handler for compressed responses.
+
+ def http_request(self, req):
+ req.add_header('Accept-Encoding', 'bzip2,gzip,deflate')
+ return req
+ https_request = http_request
+
+ def http_response(self, req, response):
+ decompressed = None
+ if response.headers.get('content-encoding') == 'bzip2':
+ import bz2
+ decompressed = StringIO(bz2.decompress(response.read()))
+ elif response.headers.get('content-encoding') == 'gzip':
+ from gzip import GzipFile
+ decompressed = GzipFile(fileobj=StringIO(response.read()), mode='r')
+ elif response.headers.get('content-encoding') == 'deflate':
+ import zlib
+ try:
+ decompressed = StringIO(zlib.decompress(response.read()))
+ except zlib.error: # they ignored RFC1950
+ decompressed = StringIO(zlib.decompress(response.read(), -zlib.MAX_WBITS))
+ if decompressed:
+ old_response = response
+ response = urllib_request.addinfourl(decompressed, old_response.headers, old_response.url, old_response.code)
+ response.msg = old_response.msg
+ return response
+ https_response = http_response
--
1.7.8.6
[-- Attachment #4: 0003-Fix-index-file-s-mtime-which-can-differ-from-TIMESTA.patch --]
[-- Type: application/octet-stream, Size: 1545 bytes --]
From 2b7ba96c8c6e81541cfba095c113638ac9a847f4 Mon Sep 17 00:00:00 2001
From: W-Mark Kubacki <wmark@hurrikane.de>
Date: Wed, 1 Aug 2012 21:12:24 +0200
Subject: [PATCH 3/3] Fix index file's mtime, which can differ from TIMESTAMP.
This enables Portage to reliably query for remote indices with
HTTP-header If-Modified-Since.
Without this patch mtime is greater than TIMESTAMP for large
indices and slow storage, because writing a large file takes
time. If the two fall into different seconds (TIMESTAMP 08:00:00,
mtime 08:00:01), then Portage will always fetch the remote index,
because it will appear to have been modified (mtime is used there)
after the copy was made (the local copy's TIMESTAMP is used here).
---
pym/portage/dbapi/bintree.py | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/pym/portage/dbapi/bintree.py b/pym/portage/dbapi/bintree.py
index 16ae8ec..0367503 100644
--- a/pym/portage/dbapi/bintree.py
+++ b/pym/portage/dbapi/bintree.py
@@ -1186,9 +1186,13 @@ class binarytree(object):
pkgindex.packages.append(d)
self._update_pkgindex_header(pkgindex.header)
- f = atomic_ofstream(os.path.join(self.pkgdir, "Packages"))
+ pkgindex_filename = os.path.join(self.pkgdir, "Packages")
+ f = atomic_ofstream(pkgindex_filename)
pkgindex.write(f)
f.close()
+ # some seconds might have elapsed since TIMESTAMP
+ atime = mtime = long(pkgindex.header["TIMESTAMP"])
+ os.utime(pkgindex_filename, (atime, mtime))
finally:
if pkgindex_lock:
unlockfile(pkgindex_lock)
--
1.7.8.6