From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([208.92.234.80] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-user+bounces-111573-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1OL7ER-0003cK-DR
	for garchives@archives.gentoo.org; Sun, 06 Jun 2010 04:13:07 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 6DEF3E088B;
	Sun,  6 Jun 2010 04:12:04 +0000 (UTC)
Received: from inception.Mines.EDU (inception.Mines.EDU [138.67.130.4])
	by pigeon.gentoo.org (Postfix) with ESMTP id 50633E088B
	for <gentoo-user@lists.gentoo.org>; Sun,  6 Jun 2010 04:12:04 +0000 (UTC)
Received: from [192.168.125.12] (c-76-25-183-245.hsd1.co.comcast.net [76.25.183.245])
	(authenticated bits=0)
	by inception.Mines.EDU (8.13.1/8.13.1) with ESMTP id o564C3la031209
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <gentoo-user@lists.gentoo.org>; Sat, 5 Jun 2010 22:12:03 -0600
Subject: Re: [gentoo-user] [OT sphinx] Any users of sphinx here
From: Brandon Vargo <brandon.vargo@gmail.com>
To: gentoo-user@lists.gentoo.org
In-Reply-To: <87ljauxvfu.fsf@newsguy.com>
References: <87ljauxvfu.fsf@newsguy.com>
Content-Type: text/plain
Date: Sat, 05 Jun 2010 22:11:58 -0600
Message-Id: <1275797518.900.60.camel@venus>
Precedence: bulk
List-Post: <mailto:gentoo-user@lists.gentoo.org>
List-Help: <mailto:gentoo-user+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-user+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-user+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-user.gentoo.org>
X-BeenThere: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org
Mime-Version: 1.0
X-Mailer: Evolution 2.26.3 
Content-Transfer-Encoding: 7bit
X-Archives-Salt: 40955947-0884-4f02-9519-2362a00006d1
X-Archives-Hash: 49775c7283016e7478ceb620192289f9

On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
> I've been looking for a perl based search tool that uses some kind of
> indexing to index and render searchable my home library of software
> manual and the like.  Quite a few html pages involved, maybe 15-16,000.
> 
> Webglimpse is something I've worked with before and know a bit about
> but thought I might like to see what else is available.
> 
> Googling led to a tool called Sphinx that apparently is coupled with
> a data base tool like mysql.  It is advertised as the kind of search
> tool I'm after and has a perl front-end also available in portage 
> (dev-perl/Sphinx-Search).
> 
> The trouble is I haven't been able to figure out the first thing about
> using it.  The overview, and Introduction, like a lot of such
> documents fails to give a really basic idea of what the tool does.
> 
> They call it a `full text search engine', but never really say what
> that means.
> 
> There are 12-15 FEATURES listed, and none appear to describe sensibly
> what they really do.
> 
> The faq is a string of questions about using sql.. really.
> 
> So far I haven't found a good statement of what the darn thing really
> does or how to aim it at data.
> 
> The manual is probably great if you already know a lot about using
> sphinx but very thin for my case.
> 
> I've not even been able to get a rough idea of how to aim the darn
> thing at the desired (Local lan) web site.
> 
> Or, to show how thin it really is or how dumb I really am, I've been
> unable to tell if it can even do what I want to do.
> 
> I've posted on a sphinx list on gmane... but it appears to be only
> moderately active and haven't gotten any replies... 
> 
> I hoped some one here may be familiar with sphinx and willing to coach
> me a bit or at least let me know if it can even do what I want to do.
> 
> Also any other perl based search tools involving indexing and some
> kind of versatile search query capability.. like regular expressions
> I'd be interested to know about.

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically what Sphinx does is let you search
databases. You specify one or more SQL sources of data and associated
queries, and Sphinx provides an API (or an emulated SQL server) that
makes searching easy. Sphinx is for full text database searching; it
does not index files or websites directly. (Note that this is not
actually true; it can search XML files directly, but you still specify
XML attributes instead of database columns, etc, so it is treating the
XML as a data store and not as a generic document.) I recall reading
that Craigslist uses Sphinx to search their database of listings.

As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it news_catalog --
that will index this data. news_catalog will be associated with an SQL
query that will allow Sphinx to access the data it needs to index. Let's
use "SELECT id, author, category, text FROM catalog" as our query. Note
that catalog is a table or view in your database, though this query can
also use complex joins, etc, as long as the database supports it. Via
the Sphinx API, I can say I want to search for "Europe | America" and it
will return a list of news articles containing the terms Europe,
America, or both, since the pipe is the OR operator. What it actually
returns is a list of ids corresponding to the id column in my query; the
unique key must always be the first column selected. My application is
responsible for fetching the actual data from the original database
using that id and presenting the data in a useful way to the user.
Extended query syntax allows for other boolean operators, searching
specific fields, strict order, exact match, field start/end, etc. The
documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current reference
manual.
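
To make the flow above concrete, here is a minimal Python sketch. It does
not use Sphinx or its API at all: a toy in-memory inverted index stands in
for the search daemon, and SQLite stands in for the real database. The
sample rows and the helper names build_index and search_or are made up for
illustration. The point is only the division of labour: the search step
hands back ids, and the application fetches the rows itself.

```python
import sqlite3

# Stand-in for the application's database. Table and column names follow
# the query in the text: SELECT id, author, category, text FROM catalog
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE catalog "
    "(id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
)
db.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "world", "Elections across Europe this week"),
        (2, "bob", "sports", "Local team wins the cup"),
        (3, "carol", "world", "Trade talks between Europe and America"),
        (4, "dave", "business", "Markets in America rally"),
    ],
)


def build_index(conn):
    """Toy inverted index (NOT Sphinx): lowercase word -> set of row ids."""
    index = {}
    rows = conn.execute("SELECT id, author, category, text FROM catalog")
    for doc_id, author, category, text in rows:
        for word in f"{author} {category} {text}".lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_or(index, query):
    """Return the sorted ids matching any term of an 'a | b' style query."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


index = build_index(db)
matching_ids = search_or(index, "Europe | America")

# The search step yields only ids; the application looks up the rows.
placeholders = ",".join("?" * len(matching_ids))
rows = db.execute(
    f"SELECT id, author, text FROM catalog "
    f"WHERE id IN ({placeholders}) ORDER BY id",
    matching_ids,
).fetchall()
for row in rows:
    print(row)
```

With real Sphinx the index is built by its own indexer from the configured
SQL query and the search is served by its daemon, but the last step (look
up the returned ids in your own database) remains your application's job.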

If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use in
your application; it does not provide a user interface. As an
alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly in addition to PDF,
Word, Excel, PowerPoint, etc with the help of external converters. It
also includes a user interface. After glancing at webglimpse, with which
I am not familiar, it looks like it does something similar to ht://Dig.
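
For a sense of what "load your data into a database" would involve, here is
a rough Python sketch using only the standard library. The table name pages,
the helper names, and the sample page are all assumptions made for
illustration; a real loader would walk your directory of HTML files instead
of using an inline string, and would target whatever database you point
Sphinx at rather than SQLite.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping what is inside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to a single string of its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def load_pages(conn, pages):
    """Store (path, html) pairs as rows with an id, the path and plain text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(id INTEGER PRIMARY KEY, path TEXT, text TEXT)"
    )
    for path, html in pages:
        conn.execute(
            "INSERT INTO pages (path, text) VALUES (?, ?)",
            (path, html_to_text(html)),
        )
    conn.commit()


# Hypothetical sample page standing in for one file from the manual library.
sample_pages = [
    (
        "manuals/tar.html",
        "<html><head><style>p{}</style></head>"
        "<body><h1>tar</h1><p>Create an archive.</p></body></html>",
    ),
]

db = sqlite3.connect(":memory:")
load_pages(db, sample_pages)
print(db.execute("SELECT id, path, text FROM pages").fetchall())
```

That is an extra conversion step to write and keep in sync with the files,
which is why a tool that reads HTML directly is likely the easier route.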

Regards,

Brandon Vargo