public inbox for gentoo-portage-dev@lists.gentoo.org
 help / color / mirror / Atom feed
From: Zac Medico <zmedico@gentoo.org>
To: gentoo-portage-dev@lists.gentoo.org
Cc: Zac Medico <zmedico@gentoo.org>
Subject: [gentoo-portage-dev] [PATCH 2/5 v4] Add IndexStreamIterator and MultiIterGroupBy.
Date: Sun,  2 Nov 2014 19:07:39 -0800	[thread overview]
Message-ID: <1414984059-23038-1-git-send-email-zmedico@gentoo.org> (raw)
In-Reply-To: <1414881983-19877-3-git-send-email-zmedico@gentoo.org>

This IndexStreamIterator class can be used together with the
pkg_desc_index_line_read function to read an index file incrementally
as a stream.

The MultiIterGroupBy class can be used to iterate over multiple
IndexStreamIterator instances at once, incrementally grouping results
for a particular package from multiple indices, while limiting the
amount of any given index that must be in memory at once.

Both of these classes are used by the  IndexedPortdb class in the next
patch of this series.

X-Gentoo-Bug: 525718
X-Gentoo-Bug-URL: https://bugs.gentoo.org/show_bug.cgi?id=525718
---
This updated patch uses python's bisect module to optimize the search for
completed groups.

 pym/portage/cache/index/IndexStreamIterator.py | 27 +++++++
 pym/portage/util/iterators/MultiIterGroupBy.py | 97 ++++++++++++++++++++++++++
 pym/portage/util/iterators/__init__.py         |  2 +
 3 files changed, 126 insertions(+)
 create mode 100644 pym/portage/cache/index/IndexStreamIterator.py
 create mode 100644 pym/portage/util/iterators/MultiIterGroupBy.py
 create mode 100644 pym/portage/util/iterators/__init__.py

diff --git a/pym/portage/cache/index/IndexStreamIterator.py b/pym/portage/cache/index/IndexStreamIterator.py
new file mode 100644
index 0000000..972aee1
--- /dev/null
+++ b/pym/portage/cache/index/IndexStreamIterator.py
@@ -0,0 +1,27 @@
+# Copyright 2014 Gentoo Foundation
+# Distributed under the terms of the GNU General Public License v2
+
+class IndexStreamIterator(object):
+
+	def __init__(self, f, parser):
+
+		self.parser = parser
+		self._file = f
+
+	def close(self):
+
+		if self._file is not None:
+			self._file.close()
+			self._file = None
+
+	def __iter__(self):
+
+		try:
+
+			for line in self._file:
+				node = self.parser(line)
+				if node is not None:
+					yield node
+
+		finally:
+			self.close()
diff --git a/pym/portage/util/iterators/MultiIterGroupBy.py b/pym/portage/util/iterators/MultiIterGroupBy.py
new file mode 100644
index 0000000..f5f8278
--- /dev/null
+++ b/pym/portage/util/iterators/MultiIterGroupBy.py
@@ -0,0 +1,97 @@
+# Copyright 2014 Gentoo Foundation
+# Distributed under the terms of the GNU General Public License v2
+
+import bisect
+
+class MultiIterGroupBy(object):
+	"""
+	This class functions similarly to the itertools.groupby function,
+	except that it takes multiple source iterators as input. The source
+	iterators must yield objects in sorted order. A group is yielded as
+	soon as the progress of all iterators reaches a state which
+	guarantees that there can not be any remaining (unseen) elements of
+	the group. This is useful for incremental display of grouped search
+	results.
+	"""
+
+	def __init__(self, iterators, key = None):
+		self._iterators = iterators
+		self._key = key
+
+	def __iter__(self):
+
+		trackers = []
+		for iterator in self._iterators:
+			trackers.append(_IteratorTracker(iterator))
+
+		key_map = {}
+		key_list = []
+		eof = []
+		key_getter = self._key
+		if key_getter is None:
+			key_getter = lambda x: x
+		min_progress = None
+
+		while trackers:
+
+			for tracker in trackers:
+
+				if tracker.current is not None and \
+					tracker.current != min_progress:
+					# The trackers are sorted by progress, so the
+					# remaining trackers are guaranteed to have
+					# sufficient progress.
+					break
+
+				# In order to avoid over-buffering (waste of memory),
+				# only grab a single entry.
+				try:
+					entry = next(tracker.iterator)
+				except StopIteration:
+					eof.append(tracker)
+				else:
+					tracker.current = key_getter(entry)
+					key_group = key_map.get(tracker.current)
+					if key_group is None:
+						key_group = []
+						key_map[tracker.current] = key_group
+						bisect.insort(key_list, tracker.current)
+					key_group.append(entry)
+
+			if eof:
+				for tracker in eof:
+					trackers.remove(tracker)
+				del eof[:]
+
+			if trackers:
+				trackers.sort()
+				min_progress = trackers[0].current
+				# yield if key <= min_progress
+				i = bisect.bisect_right(key_list, min_progress)
+				yield_these = key_list[:i]
+				del key_list[:i]
+			else:
+				yield_these = key_list
+				key_list = []
+
+			if yield_these:
+				for k in yield_these:
+					yield key_map.pop(k)
+
+class _IteratorTracker(object):
+
+	__slots__ = ('current', 'iterator')
+
+	def __init__(self, iterator):
+
+		self.iterator = iterator
+		self.current = None
+
+	def __lt__(self, other):
+		if self.current is None:
+			if other.current is None:
+				return False
+			else:
+				return True
+		return other.current is not None and \
+			self.current < other.current
diff --git a/pym/portage/util/iterators/__init__.py b/pym/portage/util/iterators/__init__.py
new file mode 100644
index 0000000..7cd880e
--- /dev/null
+++ b/pym/portage/util/iterators/__init__.py
@@ -0,0 +1,2 @@
+# Copyright 2014 Gentoo Foundation
+# Distributed under the terms of the GNU General Public License v2
-- 
2.0.4



  parent reply	other threads:[~2014-11-03  3:07 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-18  3:28 [gentoo-portage-dev] [PATCH] emerge --search: use description index Zac Medico
2014-10-18  5:59 ` [gentoo-portage-dev] " Zac Medico
2014-10-19 21:51   ` Zac Medico
2014-10-23  8:55     ` Brian Dolbec
2014-10-23  9:22       ` Zac Medico
2014-11-01  6:15         ` Zac Medico
2014-11-01 22:46 ` [gentoo-portage-dev] Zac Medico
2014-11-01 22:46   ` [gentoo-portage-dev] [PATCH 1/5] Add egencache --update-pkg-desc-index action Zac Medico
2014-11-04  9:03     ` [gentoo-portage-dev] [PATCH 1/5 v2] " Zac Medico
2014-11-01 22:46   ` [gentoo-portage-dev] [PATCH 2/5] Add IndexStreamIterator and MultiIterGroupBy Zac Medico
2014-11-02  0:18     ` Zac Medico
2014-11-02 22:50     ` [gentoo-portage-dev] [PATCH 2/5 v3] " Zac Medico
2014-11-03  3:07     ` Zac Medico [this message]
2014-11-01 22:46   ` [gentoo-portage-dev] [PATCH 3/5] Add IndexedPortdb class Zac Medico
2014-11-04  5:07     ` [gentoo-portage-dev] [PATCH 3/5 v2] " Zac Medico
2014-11-04 20:34       ` [gentoo-portage-dev] [PATCH 3/5 v3] " Zac Medico
2014-11-01 22:46   ` [gentoo-portage-dev] [PATCH 4/5] Add IndexedVardb class Zac Medico
2014-11-05  9:59     ` [gentoo-portage-dev] " Zac Medico
2014-11-07  8:45       ` [gentoo-portage-dev] [PATCH] Log changes between vdb_metadata.pickle updates Zac Medico
2014-11-07 16:51         ` Brian Dolbec
2014-11-07 20:17           ` Zac Medico
2014-11-08  9:16         ` [gentoo-portage-dev] [PATCH v2] " Zac Medico
2014-11-01 22:46   ` [gentoo-portage-dev] [PATCH 5/5] Add emerge --search-index option Zac Medico
2014-11-01 23:04     ` Zac Medico
2014-11-04  5:42       ` [gentoo-portage-dev] [PATCH 5/5 v3] " Zac Medico
2014-11-04  9:10         ` [gentoo-portage-dev] " Zac Medico
2014-11-04 22:09     ` [gentoo-portage-dev] [PATCH 5/5 v4] " Zac Medico
2014-11-03 21:42   ` [gentoo-portage-dev] Brian Dolbec
2014-11-04  9:19     ` [gentoo-portage-dev] Zac Medico

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1414984059-23038-1-git-send-email-zmedico@gentoo.org \
    --to=zmedico@gentoo.org \
    --cc=gentoo-portage-dev@lists.gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox