Subject: Re: [gentoo-user] [OT sphinx] Any users of sphinx here
From: Brandon Vargo
To: gentoo-user@lists.gentoo.org
In-Reply-To: <87ljauxvfu.fsf@newsguy.com>
References: <87ljauxvfu.fsf@newsguy.com>
Date: Sat, 05 Jun 2010 22:11:58 -0600
Message-Id: <1275797518.900.60.camel@venus>

On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
> I've been looking for a perl based search tool that uses some kind of
> indexing to index and render searchable my home library of software
> manual and the like. Quite a few html pages involved, maybe 15-16,000.
>
> Webglimpse is something I've worked with before and know a bit about
> but thought I might like to see what else is available.
>
> Googling led to a tool called Sphinx that apparently is coupled with
> a data base tool like mysql. It is advertised as the kind of search
> tool I'm after and has a perl front-end also available in portage
> (dev-perl/Sphinx-Search).
>
> The trouble is I haven't been able to figure out the first thing about
> using it. The overview, and Introduction, like a lot of such
> documents fails to give a really basic idea of what the tool does.
>
> They call it a `full text search engine', but never really say what
> that means.
>
> There are 12-15 FEATURES listed, and none appear to describe sensibly
> what they really do.
>
> The faq is a string of questions about using sql.. really.
>
> So far I haven't found a good statement of what the darn thing really
> does or how to aim it at data.
>
> The manual is probably great if you already know a lot about using
> sphinx but very thin for my case.
>
> I've not even been able to get a rough idea of how to aim the darn
> thing at the desired (Local lan) web site.
>
> Or, to show how thin it really is or how dumb I really am, I've been
> unable to tell if it can even do what I want to do.
>
> I've posted on a sphinx list on gmane... but it appears to be only
> moderately active and haven't gotten any replies...
>
> I hoped some one here may be familiar with sphinx and willing to coach
> me a bit or at least let me know if it can even do what I want to do.
>
> Also any other perl based search tools involving indexing and some
> kind of versatile search query capability.. like regular expressions
> I'd be interested to know about.

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically, what Sphinx does is let you search
databases. You specify one or more SQL sources of data and their
associated queries, and Sphinx provides an API (or an emulated SQL
server) that makes searching easy. Sphinx is for full text database
searching; it does not index files or websites directly.
(Note that this is not strictly true: it can search XML files directly,
but you still specify XML attributes instead of database columns, etc.,
so it is treating the XML as a data store and not as a generic
document.) I recall reading that Craigslist uses Sphinx to search their
database of listings.

As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it news_catalog --
that will index this data. news_catalog will be associated with an SQL
query that gives Sphinx access to the data it needs to index. Let's use
"SELECT id, author, category, text FROM catalog" as our query. Note that
catalog is a table or view in your database, though this query can also
use complex joins, etc., as long as the database supports them.

Via the Sphinx API, I can say I want to search for "Europe | America"
and it will return a list of news articles containing the terms Europe,
America, or both, since the pipe is the OR operator. What it actually
returns is a list of ids which correspond to the id I specified in my
query; the first column in the query must always be a unique key. My
application is responsible for fetching the actual data from the
original database using those ids and presenting the data in a useful
way to the user. Extended query syntax allows for other boolean
operators, searching specific fields, strict order, exact match, field
start/end, etc. The documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current reference
manual.

If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use in
your application; it does not provide a user interface.
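
To make the news_catalog flow above a bit more concrete, here is a
minimal sketch in Python. It does not use Sphinx or its API at all: a toy
in-memory word index stands in for the Sphinx index, sqlite3 stands in
for the real database, and the sample rows and function names
(build_index, search_any, fetch_rows) are made up for illustration. It
only shows the shape of the process: index the rows returned by the
source query, get back ids for "Europe | America", then have the
application fetch the real rows by id.

```python
import re
import sqlite3
from collections import defaultdict

# Toy stand-in for the database behind the news site. The table and column
# names follow the example query; the rows are invented for illustration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE catalog (id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
)
db.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "world", "Trade talks open in Europe"),
        (2, "bob", "world", "Storm season begins in America"),
        (3, "carol", "sports", "Local team wins the cup"),
        (4, "dave", "world", "Europe and America sign a treaty"),
    ],
)


def build_index(conn, query):
    """Run the source query and map each lowercased word to a set of row ids."""
    index = defaultdict(set)
    for row in conn.execute(query):
        doc_id, fields = row[0], row[1:]  # first column is the unique key
        for field in fields:
            for word in re.findall(r"\w+", field.lower()):
                index[word].add(doc_id)
    return index


def search_any(index, query):
    """Return sorted ids matching any '|'-separated term (OR only, nothing else)."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


def fetch_rows(conn, ids):
    """The application's job: turn the returned ids back into full rows."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT id, author, category, text FROM catalog "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


index = build_index(db, "SELECT id, author, category, text FROM catalog")
matching_ids = search_any(index, "Europe | America")
articles = fetch_rows(db, matching_ids)
for article in articles:
    print(article)
```

The search step hands back only ids; everything the user sees comes from
the second query against the original table. It also shows why a pile of
HTML files is an awkward fit: everything has to land in a table first.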
As an alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly in addition to PDF,
Word, Excel, PowerPoint, etc. with the help of external converters. It
also includes a user interface. I am not familiar with webglimpse, but
from a quick glance it looks like it does something similar to ht://Dig.

Regards,

Brandon Vargo