When I was a child in the ’70s I was promised three things: an empathetic computer with a conversational voice bearing a Canadian accent, a ride on a Pan American World Airways space shuttle, and a good selection of open source enterprise search software. These things were to be ubiquitous by 2001. And no, I was never promised a flying car. Heck, that’s just silly. Armed with expectation, I invested in a computer science degree and then waited for the juggernaut of open source enterprise search (OSES, to coin an acronym) software to sweep me into a world of linguistic bliss. And I waited. . . . And waited. Waiting is not as lucrative as one might hope, so after a while I got a job with a commercial enterprise search vendor. I slogged thousands of miles helping hard-working North Americans find stuff on their internal networks. I tell myself that this was rewarding, which dampens the cognitive dissonance that the past 10 years of my life have been a complete failure of potential.
If only I had chosen a career in databases, operating systems, or even front-office applications, I would be luxuriating in the likes of Linux, MySQL, Tomcat, and even OpenOffice. It seems like the great open source solutions to common computing problems have been around for a long time.
Take databases: Relational database management systems have been in common use since the 1980s, and we have quite a few open source options like MySQL, which has been around for at least a decade. Tomcat is equally well regarded as a web application server, and OpenOffice is making a concerted effort to unseat the seemingly unmovable Microsoft Office as king of the spreadsheets.
A few years ago, Doug Cutting gave the open source world a wonderful gift. Not to get all gushy about it, but it really is a nice piece of software: Lucene is an open source search library that is fast, extensible, scalable, and easy to embed. It was originally written in Java, and ports are available for all of the major programming languages. It is highly thought of and embedded in dozens of commercial and open source web and desktop applications. But before you get too excited, there are a few things you should know.
Lucene does some things very well, but even indexing an HTML document from a file system requires you to write code. If you’re building the next killer blog software or email client then you can use Lucene as is. Enterprise search, however, requires a lot more software.
Ultimately, Lucene is just a library. All of you librarians are out there saying, "Hey, wait a gosh-darn minute. Just a library? Libraries are the backbone of modern civilization." I also hear you mutter "what a poltroon" in that passive-aggressive librarian tone of yours. Don’t tell me I’m hearing things. And yes, I do know what "poltroon" means.
Solr is a subproject of Lucene. Apparently, it follows the Web 2.0 hipstr trend of eliminating those redundant schwas from our vocabulary. if ur <30 u know what I mean lol. jk.
The Solr developers define it better than I do, so here goes: "Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat."
In other words, Solr adds infrastructure on top of Lucene that makes it much easier to implement in an enterprise context. This is pretty much the way it works in the commercial world, too. For example, Verity had a K2 layer on top of the VDK libraries. Autonomy has an IDOL layer on top of the DRE core engine. Some other commercial engines actually use Lucene under the covers.
Solr communicates in the buzzword-compliant REST (Representational State Transfer) style over plain HTTP. It is implemented as a J2EE application (WAR file), which takes input from a standard HTTP query string and produces results in a standard XML document. This is a simple approach that is getting good traction in the Web 2.0 community.
For example, to run a query against Solr, you set up a URL with parameters and send it to the Solr server. The Solr server in turn returns an XML document, which you can then parse with any programming language known to man.
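To make that round trip concrete, here is a minimal Python sketch. The host, port, and /solr/select path are assumptions based on a default local install, and the sample response is hand-written for illustration rather than captured from a live server, so treat it as a sketch and not as a definitive client.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed location of a local Solr instance; adjust for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(query, start=0, rows=10):
    """Encode the search parameters into a plain HTTP query string."""
    params = {"q": query, "start": start, "rows": rows}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parse_results(xml_text):
    """Return (total hit count, list of documents as dicts).

    Only single-valued fields are handled; multi-valued fields would
    need extra handling that this sketch leaves out.
    """
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        # Each child element carries its field name in the "name" attribute.
        docs.append({field.get("name"): field.text for field in doc})
    return int(result.get("numFound")), docs


# Hand-written sample shaped like a Solr XML response (illustrative only).
SAMPLE_RESPONSE = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc-42</str>
      <str name="title">Toilet plunger safety manual</str>
    </doc>
  </result>
</response>"""

print(build_query_url("toilet plunger"))
print(parse_results(SAMPLE_RESPONSE))
```

In a real application you would fetch the URL with your HTTP client of choice and pass the response body to parse_results.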
If It Was Easy, Everybody Would Do It
As with any enterprise product, the search tool is only part of the equation. An enterprise search tool is not very helpful unless you can extract data from the source repositories and get it into the engine in the first place.
A director at an enormous search company once told me, "It’s all about connectors." Indexing and searching are relatively easy. The hard part is getting access to every organization’s odd mix of document repositories, crawling those repositories, preserving access control information, and filtering the documents into a standard set of metadata.
No matter how vanilla you think your back-office structure is, it is unique. No two organizations have precisely the same mix of content sources. It amazes me how often a client asks me something like, "How do your other customers index their custom Lotus Notes/Oracle hybrid web application on OS/2 with IBM BookMaster source files?" I generally respond with a blank stare.
"Don’t you see?" the well-intentioned client stammers on, "We can just do what they do!" I nod with the conviction of someone who just signed a statement-of-work.
"See how we’ve made your life easier? We must be your best customer ever."
"Yes, yes you are. Thank you for that." I say this with no hint of sarcasm, because I’m a professional that way.
Not incidentally, this is where the enterprise search vendors make the big bucks. Mapping all of those different content repository sources into a single unified full-text index is hard work. Don’t even get me started on security.
Making the Connection
So, if you want to use Solr, you need to find a way to get your data into it. Adding data to Solr is easy … once you have it in the right format. For example, indexing information from a database is as simple as rendering the output of an SQL query into XML, something all modern relational databases can do. Even so, writing a crawler or connector is hard work for all but the most trivial applications. Crawlers have to know how to efficiently find relevant information in the repository, retrieve it, extract metadata and content, and send it to the indexer. In addition, crawlers have to know how to do incremental updates from the source repository, either by keeping their own database of documents that they have seen before or by keeping some kind of high-water mark.
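As a rough illustration of both ideas, the Python sketch below pulls only the rows changed since a high-water mark and renders them as an XML update message of the add/doc/field shape that Solr accepts. The articles table, its columns, and the date values are all made up for the example, and a real connector would also have to persist the mark and post the XML to the server.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_add_xml(rows, field_names):
    """Render database rows as a Solr-style <add> update message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(field_names, row):
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def fetch_changed(conn, high_water_mark):
    """Return rows modified after the mark, plus the new mark."""
    rows = conn.execute(
        "SELECT id, title, modified FROM articles "
        "WHERE modified > ? ORDER BY modified",
        (high_water_mark,),
    ).fetchall()
    # The newest timestamp seen becomes the mark for the next run.
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark


# Hypothetical source table, built in memory so the example is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT, modified TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "Old memo", "2007-01-05"), (2, "New memo", "2007-09-12")],
)

rows, mark = fetch_changed(conn, "2007-06-01")
print(rows_to_add_xml(rows, ["id", "title", "modified"]))
print("next high-water mark:", mark)
```

The mark here is an ISO-formatted date string, which happens to sort correctly as text. Deleted rows are a separate problem that a high-water mark alone does not solve.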
Most enterprise search vendors supply crawlers that use a modular pipeline approach. There is a component that interacts directly with the source repository API. In the case of a web crawler, this component makes HTTP requests and retrieves webpages. It then parses each page for the links it contains and adds them to the list of pages to download and index.
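The fetch-parse-enqueue loop of a web crawler can be sketched in a few lines of Python. This is a toy under stated assumptions: the fetch function and the three-page intranet are stand-ins so the example runs without a network, and a real crawler would add politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; `fetch` maps a URL to HTML text or None."""
    frontier = deque([start_url])
    seen = {start_url}
    visited = []
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link, nothing to index
        visited.append(url)  # a real crawler would send the page to the indexer here
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Fake intranet (hypothetical URLs) standing in for real HTTP requests.
PAGES = {
    "http://intranet.example/": '<a href="/hr.html">HR</a> <a href="it.html">IT</a>',
    "http://intranet.example/hr.html": '<a href="/">Home</a> <a href="/gone.html">Old</a>',
    "http://intranet.example/it.html": '<a href="/hr.html">HR</a>',
}

print(crawl("http://intranet.example/", PAGES.get))
```

The seen set is what keeps the crawler from looping forever when pages link back to each other.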
File system crawlers are much simpler—usually recursively walking a directory structure and not bothering to parse the files for links.
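By comparison, a file system crawler really can be this short. The sketch below walks a directory tree and yields the files worth indexing; the extension filter and the throwaway sample tree are assumptions for illustration only.

```python
import os
import tempfile


def walk_files(root, extensions=(".txt", ".html")):
    """Yield (path, size in bytes) for matching files under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                yield path, os.path.getsize(path)


def demo_tree_listing():
    """Build a small temporary tree and list it with walk_files."""
    sample_files = [
        ("readme.txt", "hello"),
        ("hr/policies/leave.html", "<p>leave</p>"),
        ("hr/logo.png", "x"),  # skipped: not an indexable extension
    ]
    with tempfile.TemporaryDirectory() as root:
        for rel_path, text in sample_files:
            full_path = os.path.join(root, *rel_path.split("/"))
            os.makedirs(os.path.dirname(full_path), exist_ok=True)
            with open(full_path, "w") as handle:
                handle.write(text)
        return [
            (os.path.relpath(path, root).replace(os.sep, "/"), size)
            for path, size in walk_files(root)
        ]


print(demo_tree_listing())
```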
Document management system crawlers, on the other hand, use a variety of techniques, the most common of which is to use the native API of the document management system and retrieve documents according to rules set forth in the crawler configuration specific to that repository. Another technique is to create a simple web front-end page that contains a site map and templates, which render the document management repository as a set of simple webpages. These webpages can then be crawled by the HTTP module.
Finally, databases are the most flexible of all data sources. Since good normalized database design is an impedance mismatch with good full-text index design, your well-normalized database must be denormalized into a flat table of "documents."
There are three general techniques for database indexing:
- Build SQL or stored procedures on the database side, which denormalize the tables into a single "documents" table.
- Use database connectivity from the source language (e.g., JDBC for Java) to retrieve the individual component tables and do the "join" in your program code.
- Build HTML or XML templates hosted on a website, which renders the database information as "pages" that can be crawled by the HTTP crawler.
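The first technique can be sketched with an in-memory SQLite database. The products and tags tables are invented for the example; the point is only that a join plus an aggregate collapses several normalized rows into one flat "document" per item.

```python
import sqlite3

# Hypothetical normalized schema: one product row, many tag rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (product_id INTEGER, tag TEXT);
INSERT INTO products VALUES (1, 'Plunger'), (2, 'Wrench');
INSERT INTO tags VALUES (1, 'bathroom'), (1, 'plumbing'), (2, 'plumbing');
""")

# Denormalize on the database side: fold each product's tags into one column.
FLATTEN_SQL = """
SELECT p.id, p.name, GROUP_CONCAT(t.tag, ' ') AS tags
FROM products p LEFT JOIN tags t ON t.product_id = p.id
GROUP BY p.id, p.name
ORDER BY p.id
"""


def flat_documents(connection):
    """Return one dict per product, ready to hand to an indexer."""
    cursor = connection.execute(FLATTEN_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


for document in flat_documents(conn):
    print(document)
```

SQLite does not promise any particular order for the concatenated tags, which is harmless for full-text indexing but worth knowing if you compare output between runs.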
In a way, it is kind of nice that Solr does not impose its own method of crawling data. In my experience, crawlers are often the weakest part of a commercial full-text search system. They are the redheaded stepchildren of companies that build fancy commercial systems. It is an unglamorous job to write these things, and all the cool kids want to work on the kernel.
Although there is no built-in crawler for Solr, there is a loose integration between the Nutch crawler and Solr. Currently, this integration is implemented as a patch for the Nutch code base. In order to use it, you must build Nutch from scratch and apply patches. Open source Java geeks see this as no big deal, so if you are one of those, right on! To me, it’s kind of a pain in the neck, and it’s one side of the double-edged sword that is open source software.
Bells, Whistles, and Spangles
Solr is a big step toward true enterprise search, but it’s not quite there yet. If you are expecting a complete out-of-the-box experience for indexing and searching your enterprise, then look elsewhere. While Lucene, Nutch, and Solr are quite sophisticated and are not much more difficult to install than equivalent closed source software, you sometimes may have to dive into the arcane world of tools like Subversion, Ant, Maven, and Eclipse to make your own build of the software. This can be daunting even for grizzled veterans since there are a lot of different ways to do it, and all the open source guys (yes, they are mostly guys) have their own idea of what constitutes a good build environment.
One thing you can do is index a huge number of documents into Solr and search them. The theoretical limit is the size of a 32-bit signed integer, about 2 billion documents. That’s a lot, even for commercial engines. Of course, if you are really thinking of indexing that many documents, you should consider Nutch rather than Solr. Nutch is more suited to web-scale applications than enterprise-scale applications.
While I’ve not personally tried to index and search that many documents, the confident chatter on the message boards is that 30 million documents is a walk in the park for Solr. You probably don’t have 30 million documents to index, but it is comforting to know that you could.
In the Schema Things
As I mentioned before, Lucene is a low-level library. All information in Lucene is stored as text even if the values could be interpreted as integers, floating points, dates, or even custom data types. It lacks any kind of schema. This means you can’t do arithmetic comparisons, numeric range searches, or date-parameter searches, which can seriously limit the effectiveness of search in the enterprise.
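A few lines of plain Python (nothing Lucene-specific here) show why treating numbers as text is a problem: text is compared character by character, so range tests and sort order stop agreeing with arithmetic.

```python
prices_as_text = ["9", "10", "100", "25"]

# Text comparison goes character by character, so "10" sorts before "9".
text_order = sorted(prices_as_text)
numeric_order = sorted(prices_as_text, key=int)


def in_text_range(value, low, high):
    """Range test on raw strings, the way an untyped index would see them."""
    return low <= value <= high


def in_numeric_range(value, low, high):
    """The same test once the values are interpreted as integers."""
    return int(low) <= int(value) <= int(high)


print(text_order)
print(numeric_order)
print(in_text_range("25", "5", "50"), in_numeric_range("25", "5", "50"))

# One common workaround is zero-padding to a fixed width so that
# text order and numeric order agree.
print(in_text_range("25".zfill(4), "5".zfill(4), "50".zfill(4)))
```

Padding works, but every application has to remember to do it at both index time and query time, which is exactly the sort of bookkeeping a schema takes off your hands.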
Solr enhances Lucene by adding sophisticated schema support. Not only does this schema support common data types, but you can also define your own custom data types, as well as dynamic fields whose names match a specific pattern. For example, you can designate all fields whose names match a pattern such as *_s to be treated as string fields. This gives you some flexibility if you don’t know the types of all of the fields that are being added to the index beforehand, yet you still want to represent different fields in different ways.
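The idea behind dynamic fields can be mimicked in Python with simple wildcard matching. This is a sketch of the concept, not Solr’s implementation: the suffix patterns, type names, and lookup order below are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in the spirit of dynamic fields: pattern -> type name.
DYNAMIC_FIELD_RULES = [
    ("*_s", "string"),
    ("*_i", "integer"),
    ("*_dt", "date"),
]

# Fields declared by name take precedence over pattern rules.
EXPLICIT_FIELDS = {"id": "string", "title": "text"}


def resolve_field_type(field_name):
    """Return the type for a field, or None if nothing matches."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, type_name in DYNAMIC_FIELD_RULES:
        if fnmatchcase(field_name, pattern):
            return type_name
    return None


for name in ["title", "author_s", "page_count_i", "published_dt", "mystery"]:
    print(name, "->", resolve_field_type(name))
```

A connector can then send fields it has never seen before, and as long as it follows the naming convention the index knows how to treat them.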
Configuration in Solr is handled by a schema configuration XML file, which allows you to specify advanced Lucene analyzers for each field type. You can also specify things like stemming, lemmatization, synonyms, stop-word lists, and sounds-like filters in the configuration file.
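Conceptually, an analyzer is a chain of small text filters applied in order. The Python sketch below imitates such a chain with a stop-word list, a synonym map, and a deliberately naive stemmer; the word lists and the stemming rule are made up for the example and are far cruder than real Lucene analyzers.

```python
import re

# Illustrative word lists; a real deployment would load these from files.
STOP_WORDS = {"the", "a", "of", "for"}
SYNONYMS = {"loo": "toilet", "wc": "toilet"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def map_synonyms(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]


def naive_stem(tokens):
    """Strip a plural 's' from longer words; a stand-in for a real stemmer."""
    stemmed = []
    for token in tokens:
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        stemmed.append(token)
    return stemmed


# The order of the filters matters, just as it does in a configured chain.
ANALYZER_CHAIN = [remove_stop_words, map_synonyms, naive_stem]


def analyze(text):
    tokens = tokenize(text)
    for step in ANALYZER_CHAIN:
        tokens = step(tokens)
    return tokens


print(analyze("The Plungers for the loo"))
```

Running the same chain over both documents and queries is what lets a search for "loo plunger" find a page about "toilet plungers".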
Solr supports "keyword in context" results highlighting, advanced query caching, index replication, and integration with the Luke index analyzer toolkit. It has a nifty feature called "copy fields" which allows you to treat the same incoming data in different ways for different purposes. You will find all of these features in the best commercial software packages in varying mixes. It is surprising to see such a complete set of features this early in the release cycle of an open source software project.
Licensed to Drive
Technology isn’t the only difference between open source and vendor-based solutions. Remember: open source is not public domain. One of the practical problems for any organization considering the use of open source software is the particular requirements of a given open source license. The Apache Software License and the BSD license, for example, are fairly liberal. Other licenses, like the GNU General Public License (GPL), have some pretty significant strings attached, and some dual-licensed products charge fees for commercial use that rival those of commercial licenses. Solr and Lucene are under the Apache Software License version 2.0. At this point, my utter lack of lawyerly credentials compels me to point you to the Apache website for the actual text of the license (www.apache.org/licenses/LICENSE-2.0).
Any software that you consider using for any significant project should be vetted through your legal and purchasing departments. So be careful. Don’t assume that open source necessarily means free. That being said, there is something magical about the open source community getting together to build cool software without the benefit of endless marketing meetings and concerns about the bottom line. Some of the most talented software engineers in the world are involved in writing open source software. The Apache Software Foundation, which hosts Lucene, Solr, and Nutch, is particularly well-regarded in the open source community. The Apache web server is still the dominant server on the web, with more than 50% market share as of September 2007. It has held that spot since 1996.
Who to Sue
Say you are a giant company. You spend hundreds of thousands of clams on a proprietary search solution from a commercial vendor. You have all kinds of contracts, and your lawyers have met with their lawyers. Maybe your CTO has been golfing with their sales manager.
Unfortunately, the software doesn’t quite live up to your expectations. They told you it was easy to set up. It isn’t. It crashes all the time, and you can’t for the life of you sort out why a search for "toilet plunger" returns your corporate homepage.
You really need to talk to the surly guy with the neck beard who wrote the software. Sadly, a phalanx of script-reading gatekeepers keeps asking you for your "site ID" and telling you to try a reinstall and reboot.
It probably won’t get to the point of a lawsuit—after all, your CEO hasn’t yet berated their CEO on the phone—but at least you have the threat of legal action and revenue reversals to get the vendor’s attention.
Not so with open source software. There is nobody to sue. There are no invoices to withhold. Which is all a big risk. Luckily, there are some features of open source software that mitigate this risk. It is "open source" after all. You have complete access to the source code. If the software doesn’t do what you want it to do, you can fix it. More precisely, you can hire someone to fix it. Since it is open source, there are probably quite a few fervent hackers who would be happy to add a patch or fix that crash bug for a reasonable and customary fee.
Try wresting a peek at the source code to a commercial search application to figure out exactly why toilet plunger accuracy is so elusive. I’ll wait ...
So what about support? For active open source projects like Lucene, Solr, and Nutch, there are very active support mailing lists and fora. Often, you can post a question in the morning and one or more of the committers or contributors to the project will reply by the afternoon, if not immediately. This level of support is superior to what you often get with an annual maintenance contract for commercial software, and it is provided free of charge by people who love and develop the software.
For example, I subscribe to the Solr-dev mailing list. Every day I get dozens of questions and answers posted to my inbox. There are 1,717 unread messages in my "Solr Subscription" email folder. Yet even if (or possibly because) I’m not the one responding, very few of the questions go unanswered. The list archive on the Nabble forum website (www.nabble.com) has thousands more questions. The community is helpful, friendly, and responsive to politely worded requests for answers. Since ostensibly nobody is being paid to monitor and respond to questions, it is amazing that this process works at all.
Elixir or Poison?
Presenting two alternatives as the only possible options to describe Solr, when in reality there exist one or more other options, is what we in the logical fallacy business call a false dichotomy. Solr is neither an elixir nor a poison. It’s more like a hamburger: satisfying and nutritious, and you can live on nothing but hamburgers (for a while at least). But in order to get full value from hamburgers, you really need some broccoli; maybe a bun; some lettuce, tomatoes, onions; and a tall glass of unsweetened iced tea. Surprisingly, you don’t need fries at all.
Of course not everybody loves hamburgers. If you are a vegan, for example, you probably want nothing to do with a hamburger, no matter how juicy and delicious it may be. Sadly, this exquisite analogy breaks down rather quickly. For example, hamburgers really taste best in summer, right off the grill, whereas, despite its sunny name, Solr is a computer program. Forget I ever mentioned it.
How about this: Solr is like a giant badger ...
Oh, never mind.
Flawed analogies aside, you have many options when choosing an enterprise search platform. Open source software can provide a compelling choice for many organizations, maybe even yours.
About the Author
GEORGE EVERITT (email@example.com) is currently president of Applied Relevance LLC, an enterprise search consulting firm. He began his career in the information management sector as a senior consultant for Verity, Inc. and later with Autonomy, Inc. During his time with these firms, his clients included well-known organizations in many disparate verticals including pharmaceutical, law enforcement, government and defense, publishing, and financial services. Everitt lives in the Tampa Bay, Fla., area with his wife and two sons.