public inbox for gentoo-portage-dev@lists.gentoo.org
* [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566)
@ 2016-04-04  5:03 Zac Medico
  2016-04-04  8:39 ` Alexander Berntsen
  0 siblings, 1 reply; 5+ messages in thread
From: Zac Medico @ 2016-04-04  5:03 UTC
  To: gentoo-portage-dev; +Cc: Zac Medico

Add --search-fuzzy option, with adjustable similarity ratio cutoff that
defaults to 0.8 (80% similarity).

X-Gentoo-bug: 65566
X-Gentoo-bug-url: https://bugs.gentoo.org/show_bug.cgi?id=65566
---
 man/emerge.1           | 13 ++++++++++++-
 pym/_emerge/actions.py |  6 ++++--
 pym/_emerge/main.py    | 32 +++++++++++++++++++++++++++++++-
 pym/_emerge/search.py  | 25 +++++++++++++++++++++++--
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/man/emerge.1 b/man/emerge.1
index bfa2f73..2727ccb 100644
--- a/man/emerge.1
+++ b/man/emerge.1
@@ -1,4 +1,4 @@
-.TH "EMERGE" "1" "Feb 2016" "Portage VERSION" "Portage"
+.TH "EMERGE" "1" "Apr 2016" "Portage VERSION" "Portage"
 .SH "NAME"
 emerge \- Command\-line interface to the Portage system
 .SH "SYNOPSIS"
@@ -854,6 +854,17 @@ If ebuilds using EAPIs which \fIdo not\fR support \fBHDEPEND\fR are built in
 the same \fBemerge\fR run as those using EAPIs which \fIdo\fR support
 \fBHDEPEND\fR, this option affects only the former.
 .TP
+.BR "\-\-search\-fuzzy [ y | n ]"
+Enable or disable fuzzy search for search actions.
+.TP
+.BR "\-\-search\-fuzzy\-cutoff CUTOFF"
+Set similarity ratio cutoff (a floating-point number between 0 and 1).
+Results with similarity ratios lower than the cutoff are discarded.
+This option has no effect unless the \fB\-\-search\-fuzzy\fR option
+is enabled.
+.br
+Defaults to 0.8 (80% similarity).
+.TP
 .BR "\-\-search\-index < y | n >"
 Enable or disable indexed search for search actions. This option is
 enabled by default. The search index needs to be regenerated by
diff --git a/pym/_emerge/actions.py b/pym/_emerge/actions.py
index 59626ad..caae79a 100644
--- a/pym/_emerge/actions.py
+++ b/pym/_emerge/actions.py
@@ -1,4 +1,4 @@
-# Copyright 1999-2015 Gentoo Foundation
+# Copyright 1999-2016 Gentoo Foundation
 # Distributed under the terms of the GNU General Public License v2
 
 from __future__ import division, print_function, unicode_literals
@@ -1955,7 +1955,9 @@ def action_search(root_config, myopts, myfiles, spinner):
 			spinner, "--searchdesc" in myopts,
 			"--quiet" not in myopts, "--usepkg" in myopts,
 			"--usepkgonly" in myopts,
-			search_index = myopts.get("--search-index", "y") != "n")
+			search_index=myopts.get("--search-index", "y") != "n",
+			fuzzy=myopts.get("--search-fuzzy", False),
+			fuzzy_cutoff=myopts.get("--search-fuzzy-cutoff"))
 		for mysearch in myfiles:
 			try:
 				searchinstance.execute(mysearch)
diff --git a/pym/_emerge/main.py b/pym/_emerge/main.py
index 5dbafee..06c385e 100644
--- a/pym/_emerge/main.py
+++ b/pym/_emerge/main.py
@@ -1,4 +1,4 @@
-# Copyright 1999-2015 Gentoo Foundation
+# Copyright 1999-2016 Gentoo Foundation
 # Distributed under the terms of the GNU General Public License v2
 
 from __future__ import print_function
@@ -156,6 +156,7 @@ def insert_optional_args(args):
 		'--rebuild-if-unbuilt'   : y_or_n,
 		'--rebuilt-binaries'     : y_or_n,
 		'--root-deps'  : ('rdeps',),
+		'--search-fuzzy'         : y_or_n,
 		'--select'               : y_or_n,
 		'--selective'            : y_or_n,
 		"--use-ebuild-visibility": y_or_n,
@@ -647,6 +648,16 @@ def parse_opts(tmpcmdline, silent=False):
 			"choices" :("True", "rdeps")
 		},
 
+		"--search-fuzzy": {
+			"help": "Enable or disable fuzzy search",
+			"choices": true_y_or_n
+		},
+
+		"--search-fuzzy-cutoff": {
+			"help": "Set similarity ratio cutoff (a floating-point number between 0 and 1)",
+			"action": "store"
+		},
+
 		"--search-index": {
 			"help": "Enable or disable indexed search (enabled by default)",
 			"choices": y_or_n
@@ -908,6 +919,11 @@ def parse_opts(tmpcmdline, silent=False):
 	if myoptions.root_deps in true_y:
 		myoptions.root_deps = True
 
+	if myoptions.search_fuzzy in true_y:
+		myoptions.search_fuzzy = True
+	else:
+		myoptions.search_fuzzy = None
+
 	if myoptions.select in true_y:
 		myoptions.select = True
 		myoptions.oneshot = False
@@ -1000,6 +1016,20 @@ def parse_opts(tmpcmdline, silent=False):
 
 		myoptions.rebuilt_binaries_timestamp = rebuilt_binaries_timestamp
 
+	if myoptions.search_fuzzy_cutoff:
+		try:
+			fuzzy_cutoff = float(myoptions.search_fuzzy_cutoff)
+		except ValueError:
+			fuzzy_cutoff = 0.0
+
+		if fuzzy_cutoff <= 0.0:
+			fuzzy_cutoff = None
+			if not silent:
+				parser.error("Invalid --search-fuzzy-cutoff parameter: '%s'\n" % \
+					(myoptions.search_fuzzy_cutoff,))
+
+		myoptions.search_fuzzy_cutoff = fuzzy_cutoff
+
 	if myoptions.use_ebuild_visibility in true_y:
 		myoptions.use_ebuild_visibility = True
 	else:
diff --git a/pym/_emerge/search.py b/pym/_emerge/search.py
index 32d326e..3210854 100644
--- a/pym/_emerge/search.py
+++ b/pym/_emerge/search.py
@@ -1,8 +1,9 @@
-# Copyright 1999-2015 Gentoo Foundation
+# Copyright 1999-2016 Gentoo Foundation
 # Distributed under the terms of the GNU General Public License v2
 
 from __future__ import unicode_literals
 
+import difflib
 import re
 import portage
 from portage import os
@@ -28,7 +29,8 @@ class search(object):
 	# public interface
 	#
 	def __init__(self, root_config, spinner, searchdesc,
-		verbose, usepkg, usepkgonly, search_index=True):
+		verbose, usepkg, usepkgonly, search_index=True,
+		fuzzy=False, fuzzy_cutoff=None):
 		"""Searches the available and installed packages for the supplied search key.
 		The list of available and installed packages is created at object instantiation.
 		This makes successive searches faster."""
@@ -42,6 +44,8 @@ class search(object):
 		self.spinner = None
 		self.root_config = root_config
 		self.setconfig = root_config.setconfig
+		self.fuzzy = fuzzy
+		self.fuzzy_cutoff = 0.8 if fuzzy_cutoff is None else fuzzy_cutoff
 		self.matches = {"pkg" : []}
 		self.mlen = 0
 
@@ -248,11 +252,26 @@ class search(object):
 		if self.searchkey.startswith('@'):
 			match_category = 1
 			self.searchkey = self.searchkey[1:]
+		fuzzy = False
 		if regexsearch:
 			self.searchre=re.compile(self.searchkey,re.I)
 		else:
 			self.searchre=re.compile(re.escape(self.searchkey), re.I)
 
+			# Fuzzy search does not support regular expressions, therefore
+			# it is disabled for regular expression searches.
+			if self.fuzzy:
+				fuzzy = True
+				cutoff = self.fuzzy_cutoff
+				seq_match = difflib.SequenceMatcher()
+				seq_match.set_seq2(self.searchkey.lower())
+
+				def fuzzy_search(match_string):
+					seq_match.set_seq1(match_string.lower())
+					return (seq_match.real_quick_ratio() >= cutoff and
+						seq_match.quick_ratio() >= cutoff and
+						seq_match.ratio() >= cutoff)
+
 		for package in self._cp_all():
 			self._spinner_update()
 
@@ -280,6 +299,8 @@ class search(object):
 					continue
 
 				yield ("desc", package)
+			elif fuzzy and fuzzy_search(match_string):
+				yield ("pkg", package)
 
 		self.sdict = self.setconfig.getSets()
 		for setname in self.sdict:
-- 
2.7.4




* Re: [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566)
  2016-04-04  5:03 [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566) Zac Medico
@ 2016-04-04  8:39 ` Alexander Berntsen
  2016-04-08  6:21   ` Zac Medico
  0 siblings, 1 reply; 5+ messages in thread
From: Alexander Berntsen @ 2016-04-04  8:39 UTC
  To: gentoo-portage-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

This is a great idea!


On 04/04/16 07:03, Zac Medico wrote:
> +.BR "\-\-search\-fuzzy [ y | n ]"
> +Enable or disable fuzzy search for search actions.
This is likely a good place to briefly explain what a "fuzzy search"
is.
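
For readers new to the term, here is a minimal sketch of what "fuzzy" means in this context, assuming the difflib-based matching used in the patch. The helper name `similarity` and the sample package names are illustrative only and do not come from Portage.

```python
import difflib

def similarity(search_key, candidate):
    """Return difflib's similarity ratio (0.0 to 1.0), ignoring case."""
    matcher = difflib.SequenceMatcher(None, candidate.lower(), search_key.lower())
    return matcher.ratio()

# Hypothetical package names, used only for illustration.
candidates = ["firefox", "firejail", "thunderbird"]

# A misspelled search key still matches names that are similar enough.
matches = [name for name in candidates if similarity("firefx", name) >= 0.8]
```

With the 0.8 threshold from the patch, the misspelled key "firefx" still selects "firefox", while the less similar names are dropped.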

Also, I'm not sold on "search-fuzzy" as opposed to "fuzzy-search". Is
there a particular reason for it? Since we don't seem to have a
standardised "verbs mean this, nouns mean this" anyway, I would use
the latter phrase.

You also need to document your note on regexes.

Lastly, you also need to document that a fuzzy search is slower than a
regular search.
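
On the cost: the patch already limits it by calling difflib's cheaper estimates before the full comparison. Per the Python documentation, `real_quick_ratio()` and `quick_ratio()` are upper bounds on `ratio()`, so a candidate that fails either one can be rejected without computing the expensive `ratio()`. A standalone sketch of that cascade follows; the function name `is_similar` is hypothetical, and unlike the patch it creates a new matcher on every call for simplicity.

```python
import difflib

def is_similar(search_key, candidate, cutoff=0.8):
    """Return True if candidate is at least `cutoff` similar to search_key."""
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(search_key.lower())
    matcher.set_seq1(candidate.lower())
    # Cheapest check first; `and` short-circuits, so ratio() only runs
    # when both upper-bound estimates already reach the cutoff.
    return (matcher.real_quick_ratio() >= cutoff and
        matcher.quick_ratio() >= cutoff and
        matcher.ratio() >= cutoff)
```

Even with the early exits, every package name has to be compared against the search key, which is why fuzzy search is expected to be slower than a plain substring match.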

> +.TP
> +.BR "\-\-search\-fuzzy\-cutoff CUTOFF"
> +Set similarity ratio cutoff (a floating-point number between 0 and 1).
> +Results with similarity ratios lower than the cutoff are discarded.
> +This option has no effect unless the \fB\-\-search\-fuzzy\fR option
> +is enabled.
This explanation is a bit heavy to read. And I think that using 0 to 1
isn't very nice. And calling the number "floating point" instead of
decimal is neither very useful nor nice. How about making it a percentage,
and describing it simply as a similarity percentage -- "package names
must be at least N% similar to the search term to appear in search
results". The option could then be called --search-fuzzy-similarity,
or (in keeping with the previous suggestion)
- --fuzzy-search-similarity, or -- wait for it -- something similar. ;)
Of course if you agree with this, you'll have to reverse the code to
represent which results to show, rather than which ones to not show.

You should also document here what happens if there's a mistake in the
input.
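
A minimal sketch of the percentage idea, assuming the option value arrives as a string and that the accepted range is 0 to 100 inclusive. The function name `percent_to_ratio` and the exact bounds are assumptions made for illustration, not part of the patch.

```python
def percent_to_ratio(value):
    """Convert a similarity percentage string to a 0.0-1.0 ratio.

    Raises ValueError for non-numeric or out-of-range input.
    """
    percent = float(value)  # float() raises ValueError for non-numeric input
    if not 0 <= percent <= 100:
        raise ValueError("similarity must be between 0 and 100: %r" % (value,))
    return percent / 100.0
```

The matching code could keep comparing against a 0.0-1.0 ratio internally; only the user-facing value and its documentation would change.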

> +		"--search-fuzzy-cutoff": {
> +			"help": "Set similarity ratio cutoff (a floating-point number between 0 and 1)",
> +			"action": "store"
> +		},
See comments above regarding how to explain what this actually does.

> +	if myoptions.search_fuzzy_cutoff:
> +		try:
> +			fuzzy_cutoff = float(myoptions.search_fuzzy_cutoff)
> +		except ValueError:
> +			fuzzy_cutoff = 0.0
Is this a reasonable fallback? I guess so... but you need to mention
it in the manpage, as mentioned.

> +
> +		if fuzzy_cutoff <= 0.0:
> +			fuzzy_cutoff = None
> +			if not silent:
> +				parser.error("Invalid --search-fuzzy-cutoff parameter: '%s'\n" % \
> +					(myoptions.search_fuzzy_cutoff,))
> +
> +		myoptions.search_fuzzy_cutoff = fuzzy_cutoff
> +
I also don't understand why the first one is just 0.0, but this one
is an error. Why don't both either raise an error or revert to the 0.8
cut-off (or 80% similarity) or 0.0/100?

And this needs to go in the manpage too.

> +		self.fuzzy_cutoff = 0.8 if fuzzy_cutoff is None else fuzzy_cutoff
See above.

> +		fuzzy = False
Here's an interesting discussion: maybe this should be True? After
all, it's True in any modern search engine. What do you think?

> +			# Fuzzy search does not support regular expressions, therefore
> +			# it is disabled for regular expression searches.
Manpage.
- -- 
Alexander
bernalex@gentoo.org
https://secure.plaimi.net/~alexander
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCgAGBQJXAig0AAoJENQqWdRUGk8BOOEQAIEYXkn86ibMiYhN5BBDlsL1
2a6zBOCzygTkpxiBg+8vPsWJcHmzyTO7M6H1x3bUCY/JEfWq0354WdvNMtDM5qZk
zpwIg0uPs/Q4Fo40hozHsc66f+jqZxgmy5rML2mO8cAFZANZdNtuvTkVQYF5zQXz
4CI06tVDwXmYAmg7wIBEpWJ8O+is2F1abzPJcr42tLz5ELYm1IRn4Em8WO5m5klm
mrYWWeesvNS1l2y8kbKCmtpQbSuzLYfFyVfFkSL/p6t16Tiu7edqGJ0HOrq5B5dx
+cwuT+vwbTtA8d/Qo/cifbyuxnNtO8JthhEvemAdCYkDC4DQHDStsKFjA+Za1Sos
r/eSQexXNOQ/oMgksm72aX9rIkfurtn73AhIthKEnzrzou3pVW+H5eHR25vF58EO
qHUJO9/Z8ZkHec3HopxFtYng16i26VlW2pDehdkWGVoZSXomaOyH7x7XQXZoE7B+
4e4vDOMbeIvxyA/j1+H35WBZCu6f9FstOrEptD5FIE6/QM4oAW+CBllUQf5iQVEB
4Rpodu2AvKWgqTTOMLcn9+HK8JgnbMlm6cYLT+YXP7j6OnJFB6yq5/L3dfS5rrEX
sxwrvVTTx2dCbX/RImQoMpEIQFaTfimZgKQDw3rmtv+JfP3OnpdOrN+QJJfHbCgb
4c9suzs/UTBLbtiFQhdO
=XsDv
-----END PGP SIGNATURE-----



* Re: [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566)
  2016-04-04  8:39 ` Alexander Berntsen
@ 2016-04-08  6:21   ` Zac Medico
  2016-04-08 11:33     ` Alexander Berntsen
  0 siblings, 1 reply; 5+ messages in thread
From: Zac Medico @ 2016-04-08  6:21 UTC
  To: gentoo-portage-dev

On 04/04/2016 01:39 AM, Alexander Berntsen wrote:
> This is a great idea!

Yeah, we should have done this sooner. The search index makes our search
function so much nicer, so that gave me some incentive to continue
improving it.

> 
> 
> On 04/04/16 07:03, Zac Medico wrote:
>> +.BR "\-\-search\-fuzzy [ y | n ]"
>> +Enable or disable fuzzy search for search actions.
> This is likely a good place to briefly explain what a "fuzzy search"
> is.

Okay, will do.

> Also, I'm not sold on "search-fuzzy" as opposed to "fuzzy-search". Is
> there a particular reason for it? Since we don't seem to have a
> standardised "verbs mean this, nouns mean this" anyway, I would use
> the latter phrase.

Okay, that will work for me.

> You also need to document your note on regexes.

Will do.

> Lastly, you also need to document that a fuzzy search is slower than a
> regular search.

Will do.

>> +.TP
>> +.BR "\-\-search\-fuzzy\-cutoff CUTOFF"
>> +Set similarity ratio cutoff (a floating-point number between 0 and 1).
>> +Results with similarity ratios lower than the cutoff are discarded.
>> +This option has no effect unless the \fB\-\-search\-fuzzy\fR option
>> +is enabled.
> This explanation is a bit heavy to read. And I think that using 0 to 1
> isn't very nice. And calling the number "floating point" instead of
> decimal is neither very useful nor nice. How about making it a percentage,
> and describing it simply as a similarity percentage -- "package names
> must be at least N% similar to the search term to appear in search
> results". The option could then be called --search-fuzzy-similarity,
> or (in keeping with the previous suggestion)
> --fuzzy-search-similarity, or -- wait for it -- something similar. ;)

Okay, that will work for me.

> Of course if you agree with this, you'll have to reverse the code to
> represent which results to show, rather than which ones to not show.

Reverse? You want it to measure dissimilarity? Not sure what you mean.

> You should also document here what happens if there's a mistake in the
> input.
> 
>> +		"--search-fuzzy-cutoff": {
>> +			"help": "Set similarity ratio cutoff (a floating-point number between 0 and 1)",
>> +			"action": "store"
>> +		},
> See comments above regarding how to explain what this actually does.

Yeah, the N% similar thing.

>> +	if myoptions.search_fuzzy_cutoff:
>> +		try:
>> +			fuzzy_cutoff = float(myoptions.search_fuzzy_cutoff)
>> +		except ValueError:
>> +			fuzzy_cutoff = 0.0
> Is this a reasonable fallback? I guess so... but you need to mention
> it in the manpage, as mentioned.

It's not supposed to be a fallback, but rather a failure path. It
triggers an error message and unsuccessful exit.

>> +
>> +		if fuzzy_cutoff <= 0.0:
>> +			fuzzy_cutoff = None
>> +			if not silent:
>> +				parser.error("Invalid --search-fuzzy-cutoff parameter: '%s'\n" % \
>> +					(myoptions.search_fuzzy_cutoff,))
>> +
>> +		myoptions.search_fuzzy_cutoff = fuzzy_cutoff
>> +
> I also don't understand why the first one is just 0.0, but this one
> is an error. Why don't both either raise an error or revert to the 0.8
> cut-off (or 80% similarity) or 0.0/100?

I just want it to fail if the input is invalid.

> And this needs to go in the manpage too.
> 
>> +		self.fuzzy_cutoff = 0.8 if fuzzy_cutoff is None else fuzzy_cutoff
> See above.
> 
>> +		fuzzy = False
> Here's an interesting discussion: maybe this should be True? After
> all, it's True in any modern search engine. What do you think?

Yeah, I agree.

>> +			# Fuzzy search does not support regular expressions, therefore
>> +			# it is disabled for regular expression searches.
> Manpage.

Will do.
-- 
Thanks,
Zac



* Re: [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566)
  2016-04-08  6:21   ` Zac Medico
@ 2016-04-08 11:33     ` Alexander Berntsen
  2016-07-25  2:58       ` Zac Medico
  0 siblings, 1 reply; 5+ messages in thread
From: Alexander Berntsen @ 2016-04-08 11:33 UTC
  To: gentoo-portage-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 08/04/16 08:21, Zac Medico wrote:
> Reverse? You want it to measure dissimilarity? Not sure what you
> mean.
Sorry, I meant reverse the *docs* to mean "find things that are at
least 50% similar" rather than "cut off things that aren't above the
0.5 threshold". I.e. use an inclusive sentence. I feel that this is
more clear.

> I just want it to fail if the input is invalid.
Yes, I just realised you checked if it were <=, not just <. I think
this is a bad idea. It's easily missed -- I just missed it last time
around. I would suggest making it fail early, rather than setting it to
0.0 which you then set to None. Just set it to None immediately.
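
One possible shape for the fail-early version, sketched outside of `parse_opts` so it stands alone. The helper name `parse_cutoff` is hypothetical, the `silent` handling from the patch is left out, and the upper bound of 1.0 is an assumption (the patch only rejects values less than or equal to 0.0).

```python
import argparse

def parse_cutoff(parser, value):
    """Validate a cutoff string, failing at the first sign of bad input."""
    try:
        cutoff = float(value)
    except ValueError:
        # Mark the value as invalid directly instead of using 0.0 as a sentinel.
        cutoff = None
    # One explicit range check covers both unparsable and out-of-range input.
    if cutoff is None or not 0.0 < cutoff <= 1.0:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("Invalid --search-fuzzy-cutoff parameter: '%s'\n" % (value,))
    return cutoff
```

Here a non-numeric value and an out-of-range value take the same visible path to the error, so the `<=` comparison no longer has to do double duty.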
- -- 
Alexander
bernalex@gentoo.org
https://secure.plaimi.net/~alexander
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCgAGBQJXB5bsAAoJENQqWdRUGk8BSO8QAKXJlxYpetrA8kKbvW/cknoO
wrLPkHsmn7imnfQddcXVHtE+K1GJCRcG1Eu2VD+Pqza3HLRPl20eTm8+iu+maftk
K41+SCNEb8qs9v05P2wcqPxtUlzI4OO9GwJFJkbycRxgCFtCYnM/0B2kXaSKOkHr
KY4cF9CbdzfwIYL2FkmaCJrCBI9ac1sjsnug9yN+wXIYVV6nzpPLPq8QJU9P6sef
XI7na2mMHpK75FHl5fW/yVJfCXuBHmGgryfyEm+uUtvpLWpGceRBbRl4naJljbsf
AVNSocBmPdWGL6PCdfcD5MID8iriIBfTYWLsAoBN1HcKKasSKr1BG+UxT3wGov7n
STbQ7MLVQpDluS3kCgjbVNWUlouOcVhcNdOniC3GEDxzpT9ev7Tk/FilMNNu167N
l28SaGUokLQnf/EuQfQmNJJyHpFIVsxeRs5ODQZDlvb10WHDFMtYCkXZDhrLJmm6
Ej+tFJiuMWfAIejzVkJ0gvZTvg5FzVknvEey9iNokzXnOsngIjaR4gS8KjUUH0k8
EF1348cJ3KwQxbkWifsEuVosDmiSFaF38j73IoYaHOQh06bVPm2gL/zGeGntDGQY
X+RXL5XTJefiKps1jG4e96jYEUWIIlA/fodxkERKXcEmOsvT29v5gEAwcq6YGWXG
McLGDbOpF/n9tuxihQQK
=GgNP
-----END PGP SIGNATURE-----



* Re: [gentoo-portage-dev] [PATCH] emerge: add --search-fuzzy and --search-fuzzy-cutoff options (bug 65566)
  2016-04-08 11:33     ` Alexander Berntsen
@ 2016-07-25  2:58       ` Zac Medico
  0 siblings, 0 replies; 5+ messages in thread
From: Zac Medico @ 2016-07-25  2:58 UTC
  To: gentoo-portage-dev

On 04/08/2016 04:33 AM, Alexander Berntsen wrote:
> On 08/04/16 08:21, Zac Medico wrote:
>> Reverse? You want it to measure dissimilarity? Not sure what you
>> mean.
> Sorry, I meant reverse the *docs* to mean "find things that are at
> least 50% similar" rather than "cut off things that aren't above the
> 0.5 threshold". I.e. use an inclusive sentence. I feel that this is
> more clear.
> 
>> I just want it to fail if the input is invalid.
> Yes, I just realised you checked if it were <=, not just <. I think
> this is a bad idea. It's easily missed -- I just missed it last time
> around. I would suggest making it fail early, rather than setting it to
> 0.0 which you then set to None. Just set it to None immediately.
> 

I've just sent "[PATCH] emerge: add --fuzzy-search and
--search-similarity (bug 65566)" which hopefully accounts for all of the
previous feedback.
-- 
Thanks,
Zac


