* [gentoo-user] [OT sphinx] Any users of sphinx here
From: Harry Putnam @ 2010-06-04 22:52 UTC
To: gentoo-user

I've been looking for a Perl-based search tool that uses some kind of
indexing to index and render searchable my home library of software
manuals and the like. Quite a few HTML pages are involved, maybe
15-16,000.

Webglimpse is something I've worked with before and know a bit about,
but I thought I might like to see what else is available.

Googling led to a tool called Sphinx that apparently is coupled with
a database tool like MySQL. It is advertised as the kind of search
tool I'm after and has a Perl front-end also available in portage
(dev-perl/Sphinx-Search).

The trouble is I haven't been able to figure out the first thing about
using it. The overview and introduction, like a lot of such
documents, fail to give a really basic idea of what the tool does.

They call it a `full text search engine', but never really say what
that means.

There are 12-15 FEATURES listed, and none appear to describe sensibly
what they really do.

The FAQ is a string of questions about using SQL.. really.

So far I haven't found a good statement of what the darn thing really
does or how to aim it at data.

The manual is probably great if you already know a lot about using
Sphinx, but it is very thin for my case.

I've not even been able to get a rough idea of how to aim the darn
thing at the desired (local LAN) web site.

Or, to show how thin it really is or how dumb I really am, I've been
unable to tell if it can even do what I want to do.

I've posted on a Sphinx list on gmane... but it appears to be only
moderately active and I haven't gotten any replies...

I hoped someone here might be familiar with Sphinx and willing to coach
me a bit, or at least let me know if it can even do what I want to do.

Also, I'd be interested to know about any other Perl-based search tools
involving indexing and some kind of versatile search query
capability.. like regular expressions.

* [gentoo-user] Re: [OT sphinx] Any users of sphinx here
From: Hans de Graaff @ 2010-06-05 11:54 UTC
To: gentoo-user

On Fri, 04 Jun 2010 17:52:05 -0500, Harry Putnam wrote:

> Googling led to a tool called Sphinx that apparently is coupled with a
> database tool like MySQL. It is advertised as the kind of search tool
> I'm after and has a Perl front-end also available in portage
> (dev-perl/Sphinx-Search).
>
> They call it a `full text search engine', but never really say what
> that means.

It means that you can dump a lot of text "documents" into it (based on
HTML pages, database records, actual documents, etc.). Sphinx
efficiently indexes all the text in them and then allows you to
retrieve it again, supporting things that are useful for searching in
text, such as stemming.

It can use MySQL, but MySQL isn't needed to use it.

It should be able to help you with the task you want to solve, although
I'm not familiar with the capabilities of the Sphinx-Search front-end/
binding.

Kind regards,

Hans
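
Hans's one-paragraph description of a full text search engine (index all the text once, then look terms up, with stemming so related word forms match) can be illustrated with a toy inverted index. This is only a sketch in Python, not how Sphinx is implemented: `crude_stem`, `build_index`, `lookup`, and the sample documents are all made up for illustration, and the suffix stripping is far cruder than a real stemmer.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Rough stand-in for a real stemmer: lowercase and strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[A-Za-z0-9_]+", text):
            index[crude_stem(word)].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted ids of documents containing any form of the word."""
    return sorted(index.get(crude_stem(word), set()))


# Hypothetical sample "documents", keyed by id.
docs = {
    1: "Indexing hashes in Perl",
    2: "A hash maps keys to values",
    3: "Regular expressions and searching",
}
index = build_index(docs)
print(lookup(index, "hashes"))     # documents mentioning hash or hashes
print(lookup(index, "searching"))  # documents mentioning search or searching
```

The point of the index is that a lookup touches one dictionary entry instead of rereading every document, which is where the speed advantage over a plain grep comes from.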

* Re: [gentoo-user] [OT sphinx] Any users of sphinx here
From: Brandon Vargo @ 2010-06-06 4:11 UTC
To: gentoo-user

On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
> I've been looking for a Perl-based search tool that uses some kind of
> indexing to index and render searchable my home library of software
> manuals and the like. Quite a few HTML pages are involved, maybe
> 15-16,000.
>
> Webglimpse is something I've worked with before and know a bit about,
> but I thought I might like to see what else is available.
>
> Googling led to a tool called Sphinx that apparently is coupled with
> a database tool like MySQL. It is advertised as the kind of search
> tool I'm after and has a Perl front-end also available in portage
> (dev-perl/Sphinx-Search).
>
> The trouble is I haven't been able to figure out the first thing about
> using it. The overview and introduction, like a lot of such
> documents, fail to give a really basic idea of what the tool does.
>
> They call it a `full text search engine', but never really say what
> that means.
>
> There are 12-15 FEATURES listed, and none appear to describe sensibly
> what they really do.
>
> The FAQ is a string of questions about using SQL.. really.
>
> So far I haven't found a good statement of what the darn thing really
> does or how to aim it at data.
>
> The manual is probably great if you already know a lot about using
> Sphinx, but it is very thin for my case.
>
> I've not even been able to get a rough idea of how to aim the darn
> thing at the desired (local LAN) web site.
>
> Or, to show how thin it really is or how dumb I really am, I've been
> unable to tell if it can even do what I want to do.
>
> I've posted on a Sphinx list on gmane... but it appears to be only
> moderately active and I haven't gotten any replies...
>
> I hoped someone here might be familiar with Sphinx and willing to
> coach me a bit, or at least let me know if it can even do what I want
> to do.
>
> Also, I'd be interested to know about any other Perl-based search
> tools involving indexing and some kind of versatile search query
> capability.. like regular expressions.

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically what Sphinx does is let you search
databases. You specify one or more SQL sources of data and associated
queries, and Sphinx provides an API (or an emulated SQL server) that
makes searching easy. Sphinx is for full text database searching; it
does not index files or websites directly. (Note that this is not
strictly true; it can search XML files directly, but you still specify
XML attributes instead of database columns, etc., so it is treating the
XML as a data store and not as a generic document.) I recall reading
that Craigslist uses Sphinx to search their database of listings.

As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it
news_catalog -- that will index this data. news_catalog will be
associated with an SQL query that will allow Sphinx to access the data
it needs to index. Let's use "SELECT id, author, category, text FROM
catalog" as our query. Note that catalog is a table or view in your
database, though this query can also use complex joins, etc., as long
as the database supports it.

Via the Sphinx API, I can say I want to search for "Europe | America"
and it will return a list of news articles containing the terms Europe,
America, or both, as a pipe is the OR operator. It actually returns a
list of ids which correspond to the id I specified in my query; a
unique key is always the first column in the query. My application is
responsible for fetching the actual data from the original database
using that id and presenting the data in a useful way to the user.

Extended query syntax allows for other boolean operators, searching
specific fields, strict order, exact match, field start/end, etc. The
documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current
reference manual.

If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use
in your application; it does not provide a user interface.

As an alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly, in addition to PDF,
Word, Excel, PowerPoint, etc. with the help of external converters. It
also includes a user interface. After glancing at webglimpse, with
which I am not familiar, it looks like it does something similar to
ht://Dig.

Regards,

Brandon Vargo
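
The division of labour Brandon describes (a source query feeds the indexer, a search returns ids only, and the application fetches the rows itself) can be sketched without Sphinx at all. The following Python example uses the standard sqlite3 module and a toy in-memory index as stand-ins; it does not use the Sphinx API, and the helper names (`build_index`, `search_any`, `fetch_posts`) and the sample rows are hypothetical. Only the table name, the SELECT statement, and the "Europe | America" query come from the message above.

```python
import re
import sqlite3

# Stand-in for the news database; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "world", "Markets in Europe rallied today"),
        (2, "bob", "world", "America and Europe sign trade deal"),
        (3, "carol", "sports", "Local team wins the cup"),
    ],
)


def build_index(conn):
    """Index step: run the source query and map each word to the ids containing it."""
    index = {}
    for row_id, author, category, text in conn.execute(
        "SELECT id, author, category, text FROM catalog"
    ):
        for word in re.findall(r"\w+", f"{author} {category} {text}".lower()):
            index.setdefault(word, set()).add(row_id)
    return index


def search_any(index, query):
    """Search step: 'a | b' means OR; the result is a list of ids, not documents."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


def fetch_posts(conn, ids):
    """Application step: use the ids to pull the real rows from the database."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, author, text FROM catalog WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()


index = build_index(conn)
ids = search_any(index, "Europe | America")
posts = fetch_posts(conn, ids)
for post in posts:
    print(post)
```

Keeping the unique id as the first selected column is what lets the search layer stay ignorant of how the rows are stored or displayed.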

* [gentoo-user] Re: [OT sphinx] Any users of sphinx here
From: Harry Putnam @ 2010-06-06 20:37 UTC
To: gentoo-user

Brandon Vargo <brandon.vargo@gmail.com> writes:

> As an example of how it works, suppose I am making a news website and
> have a bunch of news posts, each of which has an author, category, and

Thank you Brandon for such a nice thorough answer... Yeah, looks like
I'm barking up the wrong tree.

I know about htdig.. Not much though. As far as I remember it didn't
have much in the way of search interface... something like Google.
Whereas webglimpse has a rich set of search terms, including some
regular expressions and regular-expression-like operators... all the
same tools as glimpse (and agrep). So many, in fact, that it can be a
bit daunting to try to become proficient with them.

Maybe you can enlighten me about htdig... it's been years since I tried
htdig.

Even webglimpse fails though when it comes to trying to search for
snippets of code like Perl or C etc. Nobody wants the sloth and CPU
overhead of serious regular expression searching, and that may be the
only (good) way to search for things like /,{,$,(,[,!,@ etc etc, like
one would need to find types of code snippets. Also I guess it
would be pretty hard to build an index with that in mind.

I keep thinking some good developer will come out with a tool aimed at
websites like might be found on a home LAN (in scope)... where regular
expression searching wouldn't be so far out.

Or maybe there just is no herd of people who are competent in regular
expression searching, and hence no audience for such a tool.

* Re: [gentoo-user] Re: [OT sphinx] Any users of sphinx here
From: Brandon Vargo @ 2010-06-08 5:07 UTC
To: gentoo-user

On Sun, 2010-06-06 at 15:37 -0500, Harry Putnam wrote:
> Brandon Vargo <brandon.vargo@gmail.com> writes:
>
> > As an example of how it works, suppose I am making a news website and
> > have a bunch of news posts, each of which has an author, category, and
>
> Thank you Brandon for such a nice thorough answer... Yeah, looks like
> I'm barking up the wrong tree.
>
> I know about htdig.. Not much though. As far as I remember it didn't
> have much in the way of search interface... something like Google.
> Whereas webglimpse has a rich set of search terms, including some
> regular expressions and regular-expression-like operators... all the
> same tools as glimpse (and agrep). So many, in fact, that it can be a
> bit daunting to try to become proficient with them.
>
> Maybe you can enlighten me about htdig... it's been years since I
> tried htdig.

Sorry, it has been a while since I have used it as well.

> Even webglimpse fails though when it comes to trying to search for
> snippets of code like Perl or C etc. Nobody wants the sloth and CPU
> overhead of serious regular expression searching, and that may be the
> only (good) way to search for things like /,{,$,(,[,!,@ etc etc, like
> one would need to find types of code snippets. Also I guess it
> would be pretty hard to build an index with that in mind.

Certainly it is a hard problem to index for arbitrary regular
expressions. Even Google's code search [1] is not terribly good at it.
However, I also do not think it is something most people will want to
do. When I go to find code that I have written, I do not remember
variable names, lines of code, etc. that I can match with a regular
expression. Thus, that kind of search is pointless for me. I remember
what the code does, the project for which I wrote the code, and
approximately where the code is located within the project. I remember
function calls for libraries that I probably used. If I cannot find
what I am looking for, I use grep on the name of a function call I
remember, or I have a ctags file containing all the information I need
about function definitions.

I suggest, for code, you just organize whatever you have in a sane
directory structure. Or, even better, you can put your code in a
central place using a version control system (SVN, git, hg, CVS, etc.),
where it is organized in a way that makes sense to you. After all, it
sounds like this is for your personal use, so use something that makes
you happy. Personally, I have a series of git repositories that I use
to keep track of my code and some of my documents.

> I keep thinking some good developer will come out with a tool aimed at
> websites like might be found on a home LAN (in scope)... where regular
> expression searching wouldn't be so far out.
>
> Or maybe there just is no herd of people who are competent in regular
> expression searching, and hence no audience for such a tool.

I do not think the problem is a lack of people with knowledge of
regular expressions, but rather the lack of a need for such a product.
Many people, at least those I know, do not think "Oh, I want to search
for xyz; I'll write a regular expression to search for what I want
across all my data." Instead, they have a directory structure of
organized documents that makes finding that particular document or
series of documents on xyz easy. When that fails, there are the find
and locate commands for terminal users, which support regex searching
in filenames, desktop search tools such as Beagle [2], and of course
grep.

Certainly it would be really nice to have a search tool that would
produce results for "show me all the code on this computer used for
validating HTTP POST requests in Python for a submitted HTML form,
preferably using Django." If you find one, let me know, as I would love
to try it. In the meantime,
`grep -RE 'form|POST' projects/python/django/project_xyz` works fairly
well once I figure out that what I want is probably in that directory.
(grep -E, or egrep, supports extended regular expressions; -R is
recursive.) Or, I just go search through the documentation, if
available.

Maybe someone here can suggest something better for code searching. For
everything else, use Beagle/something similar or a web-based search
engine you can install locally if you really want to be able to search
through your documents. Maybe there is something better for that too; I
do not know. I still use directories and git repositories in said
directories, where appropriate, as it is more efficient for me. Of
course your mileage may vary.

[1]: http://www.google.com/codesearch
[2]: http://beagle-project.org/

Regards,

Brandon Vargo
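
For readers who would rather script the `grep -RE 'form|POST'` approach than pipe grep output around, here is a rough Python equivalent: walk a directory tree and report every line that matches a regular expression. It is a sketch, not a grep replacement; `grep_recursive` is a made-up name, Python's `re` syntax differs in places from POSIX extended regular expressions, and the demo directory and file are created in a temporary location purely for illustration.

```python
import os
import re
import tempfile


def grep_recursive(pattern, root):
    """Yield (path, line_number, line) for every line under root matching pattern."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary or oddly encoded files from aborting the walk
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and carry on


# Demo on a throwaway tree standing in for a project directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "project_xyz"))
    demo_file = os.path.join(root, "project_xyz", "views.py")
    with open(demo_file, "w", encoding="utf-8") as handle:
        handle.write("def submit(request):\n    if request.method == 'POST':\n        pass\n")
    hits = [(os.path.basename(p), n) for p, n, _ in grep_recursive(r"form|POST", root)]

print(hits)
```

Like grep, this rereads every file on every search, so it scales with the size of the tree rather than with the number of hits.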

* [gentoo-user] Re: [OT sphinx] Any users of sphinx here
From: Harry Putnam @ 2010-06-12 22:18 UTC
To: gentoo-user

Brandon Vargo <brandon.vargo@gmail.com> writes:

> do. When I go to find code that I have written, I do not remember
> variable names, lines of code, etc. that I can match with a regular
> expression. Thus, that kind of search is pointless for me. I remember
> what the code does, the project for which I wrote the code, and
> approximately where the code is located within the project. I remember
> function calls for libraries that I probably used. If I cannot find
> what I am looking for, I use grep on the name of a function call I
> remember, or I have a ctags file containing all the information I need
> about function definitions.

Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub function names, because who knows what I might call stuff in any
particular script.

For example... I once was shown how to compile an element of @ARGV as a
regular expression in Perl, in one step:

    my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I
remember at a moment's notice how to write it.

I used `grep -r' or `egrep -r' as you've mentioned; now I use my own
Perl script (recently written [since posting the original query]) that
uses regex and File::Find, where the user feeds the regex and the
approximate location to begin the search on the cmd line. In my case
that would be an NFS share, /projects/reader/perl, which is kept in my
ENV as $perlp.

So:

    script.pl 'qr/.*?@' $perlp

will find a number of examples of using that particular technique.

What prompted my query here was looking for a way to search several
thousand HTML pages that are a collection of Perl books on CD. These
are 2 of the O'Reilly Perl CD bookshelves. (I spent $150 for the first
one, and I think the second was a little cheaper; it was years ago.)

The books on CD have built-in search tools, but those only work on a
Windows OS and aren't up to much anyway.

I've since copied the data from the CDs onto an OpenSolaris ZFS server
and access it through NFS.

I was attempting to use `webglimpse'
(http://webglimpse.net/download.php) for the task, hence the interest
in indexing.

But I suspect a technique I read about, but have forgotten how to code,
would be best searched for using regular expressions. This would be
long after I've forgotten which section or even which book I read
about it in.

The tool I've written can be made to strip HTML if necessary and can be
made to include (by regex) only certain kinds of filenames, but it uses
no index, so consequently it is pretty slow... but still very useful,
and fully Perl-regex capable.

It returns up to 4 lines per hit: 2 lines of context above the line
with the hit, the hit itself, and 1 line below (where possible), along
with the page number and the absolute filename where the hit was found.

Here is an example search being timed:

-------        ---------       ---=---       ---------      --------
(I purposely picked something that would be found many times)

    time ./pgrep3 /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

(So above we are searching a collection from the O'Reilly CD books for
the term `hash'..)

(Just one example of the thousands of lines returned)

    [...]
    /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
    135 dereferencing with : [104]4.8.2. Dereferencing
    136 modulus operator : [105]4.5.3. Arithmetic Operators
    137 prototype symbol (hash) : [106]4.7.5. Prototypes
    138 %= (assignment) operator : [107]4.5.6. Assignment Operators
    ---
    [...]
    Total files searched: 522
    Total lines searched: 431689

    real    1m48.344s
    user    1m25.234s
    sys     0m14.336s
-------        ---------       ---=---       ---------      --------

Almost 2 minutes to search 431689 lines.

So it is slow, maybe even very slow by comparison to tools using an
indexed search. I don't really mind the sloth, but of course it would
not scale much beyond the scope I'm using it for.

I do like the precision search capability and plenty of context.

All of the above is also possible with grep, egrep... and friends too,
of course, but only with quite a lot more cmdline manipulation and
piping.

I'm currently working on using something like this basic search script
to return URLs linking to the page and lines found, and working the
whole thing into something that can be carried out with a web browser.
Something pretty similar to webglimpse, I guess, but without the
benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like, and boolean
query capability.
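
The behaviour Harry describes for his unindexed search script (optionally strip HTML, match a regex line by line, and print each hit with two lines of context above and one below) is easy to prototype. The sketch below is in Python rather than Perl and is not his pgrep3 script, whose source is not shown in the thread; `search_with_context`, `TAG_RE`, and the sample page are hypothetical, and the tag stripping is deliberately naive.

```python
import re

# Naive tag stripper for illustration; a real HTML parser is more robust.
TAG_RE = re.compile(r"<[^>]+>")


def search_with_context(pattern, lines, before=2, after=1, strip_html=True):
    """Return one block per hit: a list of (line_number, text) pairs around the hit."""
    regex = re.compile(pattern)
    if strip_html:
        lines = [TAG_RE.sub("", line) for line in lines]
    blocks = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - before)            # clamp at the top of the file
            stop = min(len(lines), i + after + 1)  # clamp at the bottom
            blocks.append([(n + 1, lines[n]) for n in range(start, stop)])
    return blocks


# Made-up page content loosely modelled on the index listing quoted above.
page = [
    "<li>dereferencing with : 4.8.2. Dereferencing</li>",
    "<li>modulus operator : 4.5.3. Arithmetic Operators</li>",
    "<li>prototype symbol (hash) : 4.7.5. Prototypes</li>",
    "<li>%= (assignment) operator : 4.5.6. Assignment Operators</li>",
]
blocks = search_with_context(r"\bhash\b", page)
for block in blocks:
    for number, text in block:
        print(number, text)
    print("---")
```

Because every line of every file is scanned on each search, the running time grows with the size of the collection, which matches the roughly two minutes reported for 431689 lines; an index trades that flexibility for speed.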

* [gentoo-user] Re: [OT sphinx] Any users of sphinx here
From: Harry Putnam @ 2010-06-12 23:05 UTC
To: gentoo-user

Brandon Vargo <brandon.vargo@gmail.com> writes:

> [1]: http://www.google.com/codesearch
> [2]: http://beagle-project.org/

Acckk, I forgot to thank you for the URLs you posted.. thanks