Enterprise Search Center

RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES

March 05, 2008

Table of Contents

Open Source Search: Elixir or Poison?

Coveo and NavigationArts Announce Strategic Partnership

Medio Systems Research Panel Tracks Shift In Mobile Search

Text analytics for patient safety

Ektron gets to the point

Able WCM on demand

Thomson Scientific Announces Alliance With Collexis, New Data on Thomson Innovation

Northern Light Launches MI Analyst 2.0

Open Source Search: Elixir or Poison?

When I was a child in the ‘70s I was promised three things: an empathetic computer with a conversational voice bearing a Canadian accent, a ride on a Pan American World Airways space shuttle, and a good selection of open source enterprise search software. These things were to be ubiquitous by 2001. And no, I was never promised a flying car. Heck, that’s just silly. Armed with expectation, I invested in a computer science degree and then waited for the juggernaut of open source enterprise search (OSES to coin an acronym) software to sweep me into a world of linguistic bliss. And I waited. . . . And waited. Waiting is not as lucrative as one might hope, so after a while I got a job with a commercial enterprise search vendor. I slogged thousands of miles helping hard-working North Americans find stuff on their internal networks. I tell myself that this was rewarding, which dampens the cognitive dissonance that the past 10 years of my life have been a complete failure of potential.

If only I had chosen a career in database, operating system, or even front-office applications, I would be luxuriating in the likes of Linux, MySQL, Tomcat, and even OpenOffice. It seems like the great open source solutions to common computing problems have been around for a long time.

Take databases: Relational database management systems have been in common use since the 1980s, and we have quite a few open source options like MySQL, which has been around for at least a decade. Tomcat is equally well regarded as a web application server, and OpenOffice is making a concerted effort to unseat the seemingly unmovable Microsoft Office as king of the spreadsheets.

Solr Nrgy

A few years ago, Doug Cutting gave the open source world a wonderful gift. Not to get all gushy about it, but it really is a nice piece of software: Lucene is an open source search library that is fast, extensible, scalable, and easy to embed. It was originally written in Java, and there are ports available for all of the major programming languages. It is highly thought of and embedded in dozens of commercial and open source web and desktop applications. But before you get too excited, there are a few things you should know.

Lucene does some things very well, but even indexing an HTML document from a file system requires you to write code. If you’re building the next killer blog software or email client then you can use Lucene as is. Enterprise search, however, requires a lot more software.

Ultimately, Lucene is just a library. All of you librarians are out there saying, "Hey, wait a gosh-darn minute. Just a library? Libraries are the backbone of modern civilization." I also hear you mutter "what a poltroon" in that passive-aggressive librarian tone of yours. Don’t tell me I’m hearing things. And yes, I do know what "poltroon" means.

Solr is a subproject of Lucene. Apparently, it follows the Web 2.0 hipstr trend of eliminating those redundant schwas from our vocabulary. if ur <30 u know what I mean lol. jk.

The Solr developers define it better than I do, so here goes: "Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat."

In other words, Solr adds infrastructure on top of Lucene that makes it much easier to implement in an enterprise context. This is pretty much the way it works in the commercial world, too. For example, Verity had a K2 layer on top of the VDK libraries. Autonomy has an IDOL layer on top of the DRE core engine. Some other commercial engines actually use Lucene under the covers.

Solr uses the buzzword-compliant REST (Representational State Transfer) as its communications protocol. It is implemented as a J2EE application (war file), which takes input from a standard HTTP query string and produces results in a standard XML document. This is a simple protocol that is getting good traction in the Web 2.0 community.

For example, to run a query against Solr, you set up a URL with parameters and send it to the Solr server. The Solr server in turn returns an XML document, which you can then parse with any programming language known to man.
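
To make that round trip concrete, here is a minimal sketch in Python. The host, port, and path in SOLR_SELECT_URL, the helper names, and the sample response are assumptions for illustration, not something taken from a particular installation. The sample mimics the general shape of a Solr XML response and is parsed locally, so nothing here touches the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical host, port, and path; adjust to your own Solr installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(query, rows=10):
    """Return a query URL with the parameters encoded in the query string."""
    return SOLR_SELECT_URL + "?" + urlencode({"q": query, "rows": rows})


def parse_results(xml_text):
    """Turn a Solr-style XML response into a list of {field: value} dicts.

    Only simple single-valued fields are handled in this sketch.
    """
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        docs.append({field.get("name"): field.text for field in doc})
    return docs


# Made-up response for illustration; a real one would come from the server,
# for example via urllib.request.urlopen(build_query_url("toilet plunger")).
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc-42</str>
      <str name="title">Plunger maintenance guide</str>
    </doc>
  </result>
</response>"""

url = build_query_url("toilet plunger")
results = parse_results(SAMPLE_RESPONSE)
print(url)
print(results)
```

The same two steps (encode parameters into a URL, parse the XML that comes back) carry over to any language with an HTTP client and an XML parser.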

If It Was Easy, Everybody Would Do It

As with any enterprise product, the search tool is only part of the equation. An enterprise search tool is not very helpful unless you can extract data from the source repositories and get it into the engine in the first place.

A director at an enormous search company once told me, "It’s all about connectors." Indexing and searching are relatively easy. The hard part is getting access to every organization’s odd mix of document repositories, crawling those repositories, preserving access control information, and filtering the documents into a standard set of metadata.

No matter how vanilla you think your back-office structure is, it is unique. No two organizations have precisely the same mix of content sources. It amazes me how often a client asks me something like, "How do your other customers index their custom Lotus Notes/Oracle hybrid web application on OS/2 with IBM BookMaster source files?" I generally respond with a blank stare.

"Don’t you see?" the well-intentioned client stammers on, "We can just do what they do!" I nod with the conviction of someone who just signed a statement-of-work.

"See how we’ve made your life easier? We must be your best customer ever."

"Yes, yes you are. Thank you for that." I say this with no hint of sarcasm, because I’m a professional that way.

Not incidentally, this is where the enterprise search vendors make the big bucks. Mapping all of those different content repository sources into a single unified full-text index is hard work. Don’t even get me started on security.

Making the Connection

So, if you want to use Solr, you need to find a way to get your data into it. Adding data to Solr is easy … once you have it in the right format. For example, indexing information from a database is as simple as rendering the output of an SQL query into XML, something all modern relational databases can do. Even so, writing a crawler or connector is hard work for all but the most trivial applications. Crawlers have to know how to efficiently find relevant information in the repository, retrieve it, extract metadata and content, and send it to the indexer. In addition, crawlers have to know how to do incremental updates from the source repository, either by keeping their own database of documents that they have seen before or by keeping some kind of high-water mark.
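
As a rough illustration of the database case, the sketch below renders the rows of an SQL query as an XML "add" message of the kind Solr accepts for updates. It does the rendering in client code rather than inside the database, and it uses an in-memory SQLite database with a made-up "memos" table; the table, column names, and the helper rows_to_add_xml are all assumptions for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_add_xml(cursor):
    """Render every row of an executed query as a <doc> inside one <add>."""
    column_names = [col[0] for col in cursor.description]
    add = ET.Element("add")
    for row in cursor:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(column_names, row):
            if value is None:
                continue  # skip NULL columns rather than index empty fields
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# In-memory database with a hypothetical "memos" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memos (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("INSERT INTO memos VALUES (1, 'Q3 plan', 'Ship the search box.')")
cursor = conn.execute("SELECT id, title, body FROM memos ORDER BY id")
add_xml = rows_to_add_xml(cursor)
print(add_xml)
```

The resulting string would then be sent to the Solr update handler over HTTP. The field names must match fields defined in your Solr schema, which is one of the places where the "easy" part stops being easy.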

Most enterprise search vendors supply crawlers that use a modular pipeline approach. There is a component that interacts directly with the source repository API. In the case of a web crawler, this component makes HTTP requests and retrieves webpages. It then parses the webpage for links contained within the webpage and adds them to the list of pages to download and index.
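
The fetch, parse, and enqueue loop of a web crawler can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's crawler: the fetch function is passed in as a parameter, and a two-page fake site (with hypothetical intranet.example URLs) stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # a real pipeline would hand this to the indexer
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return pages


# A fake two-page site stands in for real HTTP requests.
FAKE_SITE = {
    "http://intranet.example/": '<a href="/hr.html">HR</a> <a href="/">Home</a>',
    "http://intranet.example/hr.html": "<p>Vacation policy</p>",
}
pages = crawl("http://intranet.example/", FAKE_SITE.get)
print(list(pages))
```

A production crawler adds everything this leaves out: politeness delays, robots.txt handling, authentication, and restricting the crawl to the hosts you actually mean to index.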

File system crawlers are much simpler—usually recursively walking a directory structure and not bothering to parse the files for links.
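
A recursive directory walk is also a convenient place to show the high-water mark idea for incremental updates. The sketch below, with hypothetical helper names, remembers the newest modification time seen on one pass and picks up only newer files on the next. It builds a throwaway directory so it can run anywhere.

```python
import os
import tempfile


def crawl_files(root, high_water_mark=0.0):
    """Yield (path, mtime) for files modified after the high-water mark."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            mtime = os.path.getmtime(path)
            if mtime > high_water_mark:
                yield path, mtime


def incremental_pass(root, high_water_mark=0.0):
    """Return the changed paths and the new high-water mark."""
    changed = list(crawl_files(root, high_water_mark))
    new_mark = max([mtime for _path, mtime in changed], default=high_water_mark)
    return sorted(path for path, _mtime in changed), new_mark


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "reports"))
    for name in ("a.txt", os.path.join("reports", "b.txt")):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("hello")
    first_paths, mark = incremental_pass(root)      # full crawl
    second_paths, _ = incremental_pass(root, mark)  # nothing has changed since

print(len(first_paths), second_paths)
```

A timestamp-only mark has known blind spots: it does not notice deleted files, and a file written in the same clock tick as the mark can be missed. That is one reason some crawlers keep their own database of documents instead.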

Document management system crawlers, on the other hand, use a variety of techniques, the most common of which is to use the native API of the document management system and retrieve documents according to rules set forth in the crawler configuration specific to that repository. Another technique is to create a simple web front-end page that contains a site map and templates, which render the document management repository as a set of simple webpages. These webpages can then be crawled by the HTTP module.

Finally, databases are the most flexible of all data sources. Since good normalized database design is an impedance mismatch with good full-text index design, your well-normalized database must be denormalized into a flat table of "documents."

There are three general techniques for database indexing:

Build SQL or stored procedures on the database side, which denormalize the tables into a single "documents" table.
Use database connectivity from the source language (e.g., JDBC for Java) to retrieve the individual component tables and do the "join" in your program code.
Build HTML or XML templates hosted on a website, which render the database information as "pages" that can be crawled by the HTTP crawler.
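
The second technique, doing the "join" in program code, might look something like the following sketch. It uses Python's built-in sqlite3 module in place of JDBC, and the "articles" and "tags" tables are invented for the example. Each article ends up as one flat document with a multi-valued tags field.

```python
import sqlite3
from collections import defaultdict

# Hypothetical normalized schema: one article has many tags.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (article_id INTEGER, tag TEXT);
    INSERT INTO articles VALUES (1, 'Open source search'), (2, 'Crawler design');
    INSERT INTO tags VALUES (1, 'solr'), (1, 'lucene'), (2, 'nutch');
""")

# Fetch the child table on its own and group its rows by parent key.
tags_by_article = defaultdict(list)
for article_id, tag in conn.execute(
    "SELECT article_id, tag FROM tags ORDER BY rowid"
):
    tags_by_article[article_id].append(tag)

# Then do the "join" in program code: one flat document per article.
documents = [
    {"id": article_id, "title": title, "tags": tags_by_article.get(article_id, [])}
    for article_id, title in conn.execute(
        "SELECT id, title FROM articles ORDER BY id"
    )
]
print(documents)
```

Whether this beats a join on the database side depends on table sizes and on who is allowed to write stored procedures in your shop. The point is only that the index wants one record per "document," however you assemble it.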

In a way, it is kind of nice that Solr does not impose its own method of crawling data. In my experience, crawlers are often the weakest part of a commercial full-text search system. They are the redheaded stepchildren of companies that build fancy commercial systems. It is an unglamorous job to write these things, and all the cool kids want to work on the kernel.

Although there is no built-in crawler for Solr, there is a loose integration between the Nutch crawler and Solr. Currently, this integration is implemented as a patch for the Nutch code base. In order to use it, you must build Nutch from scratch and apply patches. In the open source community this is seen as no big deal, so if you are an open source Java geek, right on! To me, it’s kind of a pain in the neck, and it’s one side of the double-edged sword that is open source software.

Bells, Whistles, and Spangles

Solr is a big step toward true enterprise search, but it’s not quite there yet. If you are expecting a complete out-of-box experience for indexing and searching your enterprise, then look elsewhere. While Lucene, Nutch, and Solr are quite sophisticated and are not much more difficult to install than equivalent closed source software, you sometimes may have to dive into the arcane world of tools like Subversion, Ant, Maven, and Eclipse to make your own build of the software. This can be daunting even for grizzled veterans since there are a lot of different ways to do it, and all the open source guys (yes, they are mostly guys) have their own idea of what constitutes a good build environment.

One thing you can do is index a huge number of documents into Solr and search them. The theoretical limit is the largest value of a signed 32-bit integer, about 2 billion documents. That’s a lot, even for commercial engines. Of course, if you are really thinking of indexing that many documents, you should consider Nutch rather than Solr. Nutch is more suited to web-scale applications than enterprise-scale applications.

While I’ve not personally tried to index and search that many documents, the confident chatter on the message boards is that 30 million documents is a walk in the park for Solr. You probably don’t have 30 million documents to index, but it is comforting to know that you could.

In the Schema Things

As I mentioned before, Lucene is a low-level library. All information in Lucene is stored as text even if the values could be interpreted as integers, floating points, dates, or even custom data types. It lacks any kind of schema. This means you can’t do arithmetic comparisons, numeric range searches, or date-parameter searches, which can seriously limit the effectiveness of search in the enterprise.
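
To see why text-only storage gets in the way of numeric work, consider this small Python illustration. It says nothing about how Lucene or Solr behave internally; it only shows what happens when numbers are compared as plain strings. The price values and helper names are made up.

```python
prices_as_text = ["9", "10", "100", "25"]

# Text comparison goes character by character, so "10" sorts before "9".
text_order = sorted(prices_as_text)

# Interpreting the same values as integers restores the expected order.
numeric_order = sorted(prices_as_text, key=int)


def in_range_text(value, low, high):
    """Range test on raw strings (lexicographic comparison)."""
    return low <= value <= high


def in_range_numeric(value, low, high):
    """Range test after converting every operand to an integer."""
    return int(low) <= int(value) <= int(high)


# "Find prices between 5 and 50" gives very different answers.
text_hits = [p for p in prices_as_text if in_range_text(p, "5", "50")]
numeric_hits = [p for p in prices_as_text if in_range_numeric(p, "5", "50")]

print(text_order, numeric_order)
print(text_hits, numeric_hits)
```

The string-based range query finds nothing at all, while the typed version finds 9, 10, and 25. A schema that knows a field is numeric is what lets a search engine give the second answer.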

Solr enhances Lucene by adding sophisticated schema support. Not only does this schema support common data types, but you can also define your own custom data types, as well as dynamic fields whose names match a specific pattern. For example, you can designate all fields whose names match the pattern *s to be treated as string fields. This gives you some flexibility if you don’t know the types of all of the fields that are being added to the index beforehand, yet you still want to represent different fields in different ways.
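
The dynamic field idea can be imitated with ordinary wildcard matching. The sketch below is a Python stand-in, not Solr's schema format: the explicit field names, the patterns (*_s, *_i, *_dt), and the type names are all assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicit fields are checked first, then the
# dynamic patterns in the order listed.
EXPLICIT_FIELDS = {"id": "string", "published": "date"}
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "integer"), ("*_dt", "date")]


def field_type(name):
    """Return the type for a field name, or None if nothing matches."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return type_name
    return None


print(field_type("author_s"))      # matched by the *_s pattern
print(field_type("page_count_i"))  # matched by the *_i pattern
print(field_type("mystery"))       # no explicit field, no pattern
```

A connector can then emit fields it has never seen before, such as a new "department_s" column, and have them indexed sensibly without anyone editing the schema first.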

Configuration in Solr is handled by a schema configuration XML file, which allows you to specify advanced Lucene analyzers for each field type. You can also specify things like stemming, lemmatization, synonyms, stop-word lists, and sounds-like filters in the configuration file.
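
Conceptually, a per-field-type analyzer is a tokenizer followed by a chain of filters. The toy Python pipeline below shows the shape of that idea. It is not Lucene's analyzer API, and its stop-word list, synonym table, and plural-stripping "stemmer" are deliberately crude assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "for"}
SYNONYMS = {"tv": ["television"]}


def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def strip_plural(tokens):
    """Crude stand-in for stemming: drop a trailing 's' from longer words."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def add_synonyms(tokens):
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded


# Each hypothetical field type maps to a tokenizer plus a filter chain.
FIELD_TYPES = {
    "text": (tokenize, [lowercase, remove_stop_words, strip_plural, add_synonyms]),
    "string": (lambda text: [text], []),  # kept whole, exactly as supplied
}


def analyze(field_type, text):
    tokenizer, filters = FIELD_TYPES[field_type]
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("text", "Manuals for the TV"))
print(analyze("string", "Manuals for the TV"))
```

The "text" type reduces the phrase to manual, tv, and television, while the "string" type keeps it as one untouched value. Choosing between those treatments field by field is what the schema configuration is for.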

Solr supports "keyword in context" results highlighting, advanced query caching, index replication, and integration with the Luke index analyzer toolkit. It has a nifty feature called "copy fields" which allows you to treat the same incoming data in different ways for different purposes. You will find all of these features in the best commercial software packages in varying mixes. It is surprising to see such a complete set of features this early in the release cycle of an open source software project.

Licensed to Drive

Technology isn’t the only difference between open source and vendor-based solutions. Remember: open source is not public domain. One of the practical problems for any organization considering the use of open source software is the particular requirements of a given open source license. The Apache Software License, for example, is fairly liberal, as is the BSD license. Other licenses, like the GNU General Public License (GPL), have some pretty significant strings attached, chiefly the requirement that derivative works you distribute carry the same license. Some GPL products are also sold under separate commercial licenses, with fees that can rival those of proprietary software. Solr and Lucene are under the Apache Software License version 2.0. At this point, my utter lack of lawyerly credentials compels me to point you to the Apache website for the actual text of the license (www.apache.org/licenses/LICENSE-2.0).

Any software that you consider using for any significant project should be vetted through your legal and purchasing departments. So be careful. Don’t assume that open source necessarily means free. That being said, there is something magical about the open source community getting together to build cool software without the benefit of endless marketing meetings and concerns about the bottom line. Some of the most talented software engineers in the world are involved in writing open source software. The Apache Software Foundation, which hosts Lucene, Solr, and Nutch, is particularly well-regarded in the open source community. The Apache web server is still the dominant server on the web, with more than 50% market share as of September 2007. It has held that spot since 1996.

Who to Sue

Say you are a giant company. You spend hundreds of thousands of clams on a proprietary search solution from a commercial vendor. You have all kinds of contracts, and your lawyers have met with their lawyers. Maybe your CTO has been golfing with their sales manager.

Unfortunately, the software doesn’t quite live up to your expectations. They told you it was easy to set up. It isn’t. It crashes all the time, and you can’t for the life of you sort out why a search for "toilet plunger" returns your corporate homepage.

You really need to talk to the surly guy with the neck beard who wrote the software. Sadly, a phalanx of script-reading gatekeepers keeps asking you for your "site ID" and telling you to try a reinstall and reboot.

It probably won’t get to the point of a lawsuit—after all, your CEO hasn’t yet berated their CEO on the phone—but at least you have the threat of legal action and revenue reversals to get the vendor’s attention.

Not so with open source software. There is nobody to sue. There are no invoices to withhold. Which is all a big risk. Luckily, there are some features of open source software that mitigate this risk. It is "open source" after all. You have complete access to the source code. If the software doesn’t do what you want it to do, you can fix it. More precisely, you can hire someone to fix it. Since it is open source, there are probably quite a few fervent hackers who would be happy to add a patch or fix that crash bug for a reasonable and customary fee.

Try wresting a peek at the source code to a commercial search application to figure out exactly why toilet plunger accuracy is so elusive. I’ll wait ...

So what about support? For active open source projects like Lucene, Solr, and Nutch, there are very active support mailing lists and fora. Often, you can post a question in the morning and one or more of the committers or contributors to the project will reply by the afternoon, if not immediately. This level of support is superior to what you often get with an annual maintenance contract for commercial software, and it is provided free of charge by people who love and develop the software.

For example, I subscribe to the Solr-dev mailing list. Every day I get dozens of questions and answers posted to my inbox. There are 1,717 unread messages in my "Solr Subscription" email folder. Yet even if (or possibly because) I’m not the one responding, very few of the questions go unanswered. The list archive on the Nabble forum website (www.nabble.com) has thousands more questions. The community is helpful, friendly, and responsive to politely worded requests for answers. Since ostensibly nobody is being paid to monitor and respond to questions, it is amazing that this process works at all.

Elixir or Poison?

Presenting two alternatives as the only possible options to describe Solr, when in reality one or more other options exist, is what we in the logical fallacy business call a false dichotomy. Solr is neither an elixir nor a poison. It’s more like a hamburger: satisfying and nutritious, and you can live on nothing but hamburgers (for a while at least). But in order to get full value from hamburgers, you really need some broccoli; maybe a bun; some lettuce, tomatoes, onions; and a tall glass of unsweetened iced tea. Surprisingly, you don’t need fries at all.

Of course not everybody loves hamburgers. If you are a vegan for example, you probably want nothing to do with a hamburger, no matter how juicy and delicious it may be. Sadly, this exquisite analogy breaks down rather quickly. For example, hamburgers really taste best in summer, right off the grill, whereas, despite its sunny name, Solr is a computer program. Forget I ever mentioned it.

How about this: Solr is like a giant badger ...

Oh, never mind.

Flawed analogies aside, you have many options when choosing an enterprise search platform. Open source software can provide a compelling choice for many organizations, maybe even yours.

About the Author

GEORGE EVERITT (geveritt@appliedrelevance.com) is currently president of Applied Relevance LLC, an enterprise search consulting firm. He began his career in the information management sector as a senior consultant for Verity, Inc. and later with Autonomy, Inc. During his time with these firms, his clients included well-known organizations in many disparate verticals including pharmaceutical, law enforcement, government and defense, publishing, and financial services. Everitt lives in the Tampa Bay, Fla., area with his wife and two sons.

Coveo and NavigationArts Announce Strategic Partnership

Coveo Solutions, Inc., and NavigationArts today announced a strategic partnership between the companies. Through this partnership, Coveo, a global provider of secure, enterprise search solutions, and NavigationArts, a web consultancy specializing in user-centered design, content management, and development, will collaborate to deliver search solutions.

(www.coveo.com, www.navigationarts.com)

Medio Systems Research Panel Tracks Shift In Mobile Search

Medio Systems, Inc., a provider of mobile search and advertising solutions, announced that its 2H07 research demonstrates the emerging use of mobile search to find information from the mobile web. Medio's research illustrates the evolution of mobile search away from downloadable content towards information on the mobile internet. As a result of this growing use of mobile web search from the handset, Medio is also tracking a healthy adoption of mobile search among users as well as the predominance of certain ad-related functions that are well-suited to the mobile interface.

The majority of mobile search queries have traditionally been performed in relation to downloading mobile content. Information from Medio Systems' research suggests that downloadable content is still the most popular query type in mobile search, but that the prevalence of this type of search has shrunk by just over 10% to 60% of all searches since July 2007. This finding corresponds with the associated growth in the Web/WAP category which has seen the greatest usage increase of 43% in the same period.

(www.medio.com)

Text analytics for patient safety

The National Center for Patient Safety (NCPS) of the U.S. Dept. of Veterans Affairs (VA) is using text analytics software to help analyze patient safety reports received from 153 hospitals operated by the VA.

The NCPS has deployed the PolyAnalyst data and text mining solution from Megaputer Intelligence to find common patterns, emerging trends and root causes in the safety reports, according to a news release from Megaputer Intelligence.

Megaputer says the NCPS’s goal is to reduce and prevent inadvertent harm to patients as a result of care. VA analysts try to learn from the reports details of close calls, also known as "near misses," which occur at a higher frequency rate than actual adverse events. Through use of the reports, the analysts try to identify and fix problems to improve safety and quality of care.

According to Megaputer, the narrative is the most important part of the reports, which is why the VA needed a text mining system capable of solving text clustering and categorization tasks to detect patterns and present results in a user friendly way. The aim of the technology is to overcome the problems of manual analysis, which include slow processing, low accuracy and potential bias, according to Megaputer.

Ektron gets to the point

Ektron has built a SharePoint Connector for Ektron CMS400.Net, allowing Ektron customers to employ SharePoint's collaborative capabilities to create documents. These documents can then be delivered to a public facing Web site, corporate intranet or extranet, enabled with all the latest search, navigation, Web 2.0 and social networking functionality provided by CMS400.Net, Ektron says.

The Connector is integrated fully into the menu structure of SharePoint and allows SharePoint users to distribute documents to their corporate intranet or public facing Web site, through a simple wizard, without leaving the SharePoint environment.

Ektron claims CMS400.Net adds value to SharePoint-created documents including enterprise search and taxonomy technology; "social bookmarking" favorites to personalize how users interact with the assets; and Web 2.0 functionality, such as content ratings and discussion boards, which allow users to provide feedback on documents.

Able WCM on demand

Clickability has unveiled its new On Demand WCM Platform and three new product packages tailored specifically for the company’s media and publishing customers, Fortune 500 enterprises and SMB clients.

The company claims its platform is the only end-to-end solution that enables non-technical users to create, manage, publish, deliver, measure and adapt Web sites easily and efficiently. Further, the company says, publishers and enterprise marketers can harness the real-time power of the Web, and are freed from relying on slow, costly and anti-green on-premise software.

Clickability's on demand WCM platform combines software as a service (SaaS) with infrastructure as a service (IaaS), so Clickability platform users are no longer dependent on resource and budget-strapped IT departments. Agile companies can now eliminate the need for costly hardware and other overhead, and instead leverage Clickability’s multi-tenancy, patent-pending IaaS solution. With just-in-time scalability, the IaaS solution "spike proofs" companies from brownouts during peak Web site usage, resulting in fast, dependable and consistent Web page delivery and performance—all the time, anytime, Clickability reports.

While all four components of the Clickability platform are best in class, the company says, the platform’s real power is unleashed with the combination and seamless integration of the components in parallel, resulting in reduced costs, increased revenues and more valuable brands.

The Clickability On Demand WCM Platform includes:

Infrastructure as a Service. The Clickability platform is backed by a comprehensive on demand infrastructure that includes hosting, security, data storage, service-level agreements (SLAs) and disaster recovery.

Implementation and Support as a Service. Clickability’s on-boarding process is supported by a legendary client services and customer support organization that has delivered more than 400 successful Web site deployments. Clickability’s branded Implementation Practice ensures that the platform is configured to customer needs, often moving from discovery to launch in a few short weeks.

Software as a Service. The SaaS component of the platform covers the entire Web content life cycle, and includes Content Management, Analytics, Email Newsletters, Site Search, Ad Server, Polls and Surveys, RSS/XML Syndication, Multilingual Support and Social Media.

Innovation as a Service. Based on the Clickability Platform Innovation Model (which includes shared customer best practices, benchmarking, code libraries and solution databases), customers can easily and quickly innovate on top of the platform.

The Clickability On Demand WCM Platform is available immediately in three editions: Express, Professional and Enterprise. A single code base across all editions creates a cost-effective upgrade path that scales with an organization’s growth, says the company.

Thomson Scientific Announces Alliance With Collexis, New Data on Thomson Innovation

Thomson Scientific, part of The Thomson Corporation and provider of information solutions to the worldwide research and business communities, and Collexis Holdings Inc., a developer of high definition search and knowledge discovery software, announced plans to join together Collexis' Knowledge Dashboard with Thomson Scientific's Web of Science to create a custom data mining solution for the research community. Called the Thomson Collexis Dashboard, it is intended to provide knowledge discovery for the academic and government R&D communities.

Thomson also announced the addition of Derwent World Patents Index and scientific literature, including Web of Science and Inspec, to Thomson Innovation, its new intellectual property research and analysis solution. In addition to English translations of Japanese full text patent data, Thomson Innovation now includes editorially enhanced English-language patent abstracts for China, coupled with additional Asian coverage.

(www.thomson.com, www.collexis.com)

Northern Light Launches MI Analyst 2.0

Northern Light has launched its second major release of MI Analyst, an automated "meaning extraction" application designed specifically for market intelligence, market research, and product research. MI Analyst 2.0 adds many new "facets" (categories of terms) by which the software can analyze search results, automatically extracting meaning from internal and research documents, licensed secondary research, news stories and web sources. Joining the previously released facets (Companies, Venture-Funded Companies, IT Technologies, IT Markets), new and expanded facets include Government Agencies, Industries, Business Issues and Strategic Scenarios. Also new in MI Analyst 2.0 is a facility to improve the value of search results based on the proximity of specified terms or phrases to each other and to any of the terms in any of the facets in MI Analyst. With the 2.0 release, MI Analyst expands beyond its roots in the IT sector to pharmaceutical industry research. New facets relevant to pharmaceuticals include Human Anatomy, Diseases, Drugs, Cells, Cell Receptors, Proteins, Genes, Enzymes, Pharmaceutical Markets, Life Sciences Scenarios and Research Strategies and Therapeutic Approaches. MI Analyst is immediately available from Northern Light as an added-value option for SinglePoint enterprise market research portals, and as an integrated capability within Analyst Direct(TM), Northern Light's subscription-based market research search engine.

(www.northernlight.com)
