From: "Göktürk Yüksek" <gokturk@gentoo.org>
To: gentoo-commits@lists.gentoo.org
Subject: [gentoo-commits] proj/devmanual:master commit in: /, bin/
Date: Thu, 19 Dec 2019 21:02:44 +0000 (UTC) [thread overview]
Message-ID: <1576789082.926e0d0855afa40f5dcbc16b1b7c66187afd7d73.gokturk@gentoo> (raw)
commit: 926e0d0855afa40f5dcbc16b1b7c66187afd7d73
Author: Göktürk Yüksek <gokturk <AT> gentoo <DOT> org>
AuthorDate: Tue Dec 10 02:08:12 2019 +0000
Commit: Göktürk Yüksek <gokturk <AT> gentoo <DOT> org>
CommitDate: Thu Dec 19 20:58:02 2019 +0000
URL: https://gitweb.gentoo.org/proj/devmanual.git/commit/?id=926e0d08
Rewrite the search functionality and extend the coverage
The current script only indexes the first <p> in a text.xml, and
sometimes only partially if the text is interrupted by another tag
such as <c/>.
Modify build_search_documents.py such that:
- It recursively traverses from chapter all the way down to
subsubsection
- Each <p>, <important>, <note>, <warning> is indexed separately
- In the search results, the match entry will have the title in the
form "Chapter[ -> Section[ -> Subsection[ -> Subsubsection]]]"
Modify search.js such that:
- The ref returned for a match is its index into "documents" array,
which makes it possible to retrieve the document in O(1).
Signed-off-by: Göktürk Yüksek <gokturk <AT> gentoo.org>
bin/build_search_documents.py | 112 ++++++++++++++++++++++++++++++++++++------
search.js | 22 ++++-----
2 files changed, 108 insertions(+), 26 deletions(-)
diff --git a/bin/build_search_documents.py b/bin/build_search_documents.py
index 9af2753..3816fdb 100755
--- a/bin/build_search_documents.py
+++ b/bin/build_search_documents.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
# Copyright 2019 Gentoo Authors
# Distributed under the terms of the GNU GPL version 2 or later
import json
@@ -6,19 +6,103 @@ import os.path
import sys
import xml.etree.ElementTree as ET
-files = sys.argv[1:]
-documents = []
-url_root = 'https://devmanual.gentoo.org/'
-for f in files:
- tree = ET.parse(f)
- root = tree.getroot()
- for chapter in root.findall('chapter'):
+def stringify_node(parent: ET.Element) -> str:
+ """Flatten this node and its immediate children to a string.
+
+ Combine the text and tail of this node, and any of its immediate
+ children, if there are any, into a flat string. The tag <d/> is a
+ special case that resolves to the dash ('-') character.
+
+ Keyword arguments:
+ parent -- the node to convert to a string
+
+ """
+ if parent.text:
+ text = parent.text.lstrip()
+ else:
+ text = str()
+
+ for child in parent.getchildren():
+ # The '<d/>' tag is simply a fancier '-' character
+ if child.tag == 'd':
+ text += '-'
+ if child.text:
+ text += child.text.lstrip()
+ if child.tail:
+ text += child.tail.rstrip()
+
+ text += parent.tail.rstrip()
+ return text.replace('\n', ' ')
+
+
+def process_node(documents: list, node: ET.Element, name: str, url: str) -> None:
+ """Recursively process a given node and its children based on tag values.
+
+ For the top level node <chapter>, extract the title and recurse
+ down to the children.
+ For the intermediary nodes with titles, such as <section>, update
+ the search result title and url, and recurse down.
+ For the terminal nodes, such as <p>, convert the contents of the
+ node to a string, and add it to the search documents.
+
+ Keyword arguments:
+ documents -- the search documents array
+ node -- the node to process
+ name -- the title to display for the search term match
+ url -- the url for the search term match in the document
+
+ """
+ if node.tag == 'chapter':
+ name = stringify_node(node.find('title'))
+
+ for child in node:
+ process_node(documents, child, name, url)
+ elif node.tag in ['section', 'subsection', 'subsubsection']:
+ title = stringify_node(node.find('title'))
+ name += ' -> ' + title
+ url = "{url_base}#{anchor}".format(
+ url_base=url.split('#')[0],
+ anchor=title.lower().replace(' ', '-'))
+
+ for child in node:
+ process_node(documents, child, name, url)
+ elif node.tag in ['body', 'guide']:
+ for child in node:
+ process_node(documents, child, name, url)
+ elif node.tag in ['p', 'important', 'note', 'warning']:
+ text = stringify_node(node)
+
+ documents.append({'id': len(documents),
+ 'name': name,
+ 'text': text,
+ 'url': url})
+ else:
+ pass
+
+
+def main(pathnames: list) -> None:
+ """The entry point of the script.
+
+ Keyword arguments:
+ pathnames -- a list of path names to process in sequential order
+ """
+ url_root = 'https://devmanual.gentoo.org/'
+ documents = []
+
+ for path in pathnames:
+ tree = ET.parse(path)
+ root = tree.getroot()
+
try:
- documents.append({"name": chapter.find('title').text,
- "text": chapter.find('body').find('p').text,
- "url": url_root + os.path.dirname(f) + '/'})
- except AttributeError:
- pass
+ url = url_root + os.path.dirname(path) + '/'
+
+ process_node(documents, root, None, url)
+ except:
+ raise
+
+ print('var documents = ' + json.dumps(documents) + ';')
+
-print('var documents = ' + json.dumps(documents) + ';')
+if __name__ in '__main__':
+ main(sys.argv[1:])
diff --git a/search.js b/search.js
index 0b9292f..ab28f87 100644
--- a/search.js
+++ b/search.js
@@ -5,9 +5,9 @@
"use strict";
var search_index = lunr(function () {
- this.ref('name');
+ this.ref('id');
this.field('text');
- this.field('url');
+ this.metadataWhitelist = ['position']
documents.forEach(function (doc) {
this.add(doc);
@@ -23,15 +23,13 @@ search_input.addEventListener("keyup", function(event) {
}
});
-function getContents(docs, article) {
- var contents = { text: "", url: "" };
+function getContents(docs, uid) {
+ var contents = { name: "", text: "", url: "" };
+
+ contents.name = docs[uid].name;
+ contents.text = docs[uid].text;
+ contents.url = docs[uid].url;
- for (var i = 0; i< docs.length; i++) {
- if (docs[i].name == article) {
- contents.text = docs[i].text;
- contents.url = docs[i].url;
- }
- }
return contents;
}
@@ -42,8 +40,8 @@ function search() {
if (results.length > 0) {
$("#searchResults .modal-body").empty();
$.each(results, function(index, result) {
- var title = result.ref;
- var contents = getContents(documents, title);
+ var uid = result.ref;
+ var contents = getContents(documents, uid);
$("#searchResults .modal-body").append(`<article><h5><a href="${contents.url}">
${title}</a></h5><p>${contents.text}</p></article>`);
next reply other threads:[~2019-12-19 21:02 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-19 21:02 Göktürk Yüksek [this message]
-- strict thread matches above, loose matches on Subject: below --
2024-10-29 11:19 [gentoo-commits] proj/devmanual:master commit in: /, bin/ Ulrich Müller
2023-10-05 19:00 Ulrich Müller
2022-03-26 19:12 Ulrich Müller
2020-02-28 6:26 Ulrich Müller
2020-01-22 18:24 Ulrich Müller
2019-12-16 6:45 Ulrich Müller
2019-12-14 10:46 Ulrich Müller
2019-03-22 18:51 Brian Evans
2019-03-22 13:27 Brian Evans
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1576789082.926e0d0855afa40f5dcbc16b1b7c66187afd7d73.gokturk@gentoo \
--to=gokturk@gentoo.org \
--cc=gentoo-commits@lists.gentoo.org \
--cc=gentoo-dev@lists.gentoo.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox