Most organizations purchase a search system based on a set of goals. They want to save the time that staff spend searching through mountains of unstructured "stuff" (full text that is neither tagged, organized, nor fielded), and they want to preserve the knowledge (dare we say wisdom) of retiring staff members who hold the most information about what the organization does. Unfortunately, many organizations fall prey to some common fallacies:
1. Full-text search is sufficient for good results.
2. It’s possible to get good search results without structure/metadata/well-formed data.
3. Most search engines automatically know how to make the most of available structure.
4. Taxonomy modules within search engines support a full-featured taxonomy/thesaurus.
What can happen at this point is that the goals become secondary to the software purchase. The software is bought and then implementation is attempted. This is backward. The overall intellectual property (IP) strategy should not be based on the selection of a search solution. Rather, careful consideration of the uses, content, context, and structure of the organization’s current data can make a big difference in the selection of a search tool, one that will really work.
The Final Piece of the Big Puzzle
The search software should be the last thing you purchase in an information solution investment, not the first. Most companies over-rely on and over-invest in search software as the solution to their information management problems. They treat search software as the centerpiece of a strategy when it should play a much smaller role. In fact, search technology, in its current state, is creating problems rather than solving them.
In the last few years, since search has been widely available within companies for their employees, the amount of time spent searching has increased to the point where staff now spend more time searching than on any other job function: up to 15 hours per week. More time is spent on searching than on thinking about and analyzing the information staff members finally retrieve. Researchers and information professionals are not the only ones affected; those in administration, accounting, human resources, and other departments face the same problems. This costs large companies hundreds of millions of dollars per year. Desperately seeking a better solution, the average large organization has not one but four search software systems and is dissatisfied with all of them.
So what should happen? Ask yourself the following:
- What kind of information are searchers looking for?
- How is it organized now?
- What should the answers look like?
- Where is the search software to be used?
- Is your new search engine for internal use, a public website, or both?
- Is the data really unstructured? Does it have a title? A date? (MS Word files, for example, contain elements of structured metadata that have not been captured.)
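As an illustration of the last question, the following is a minimal sketch of pulling a title, author, and creation date out of a Word file. It assumes the newer .docx format (a zip package), not the older binary .doc format, and it assumes the core properties sit at the conventional path docProps/core.xml inside the package. The function name read_core_properties is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumption: the conventional location of the core properties part.
CORE_PROPERTIES_PATH = "docProps/core.xml"


def read_core_properties(docx_file):
    """Return title, creator, and created date from a .docx file.

    docx_file may be a path or a binary file-like object.
    Any element that is absent comes back as None.
    """
    empty = {"title": None, "creator": None, "created": None}
    with zipfile.ZipFile(docx_file) as package:
        try:
            xml_bytes = package.read(CORE_PROPERTIES_PATH)
        except KeyError:
            # The package has no core properties part at the assumed path.
            return empty

    root = ET.fromstring(xml_bytes)

    def text_of(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "title": text_of("dc:title"),
        "creator": text_of("dc:creator"),
        "created": text_of("dcterms:created"),
    }
```

Run over a sample folder of documents, a function like this gives a quick picture of how much structure is already present in supposedly unstructured files.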
If you want the general public coming to your website to have the same experience it gets at most sites, then you can do it all with almost any search engine on the market. For a basic search solution that meets low-level expectations, consider MySQL, Postgres, or Lucene, which are free. They are also under the hood of many for-fee systems on the market.
However, inside the enterprise, you certainly want users to have an above-average experience, with rich, accurate, and fast search. This requires a robust metadata and taxonomic strategy deployed with the appropriate tools.
Get Organized
Most search engines these days have some categorization and taxonomy capabilities. The reason for this is simple: Search engines don’t work very well without them. Most search software vendors have resorted to kludging on some sort of categorization functionality or rules system to get to an acceptable accuracy rate. Most don’t really believe in taxonomies or thesauri or metadata. They are steeped in the theory that these add-ons are unnecessary. They are seduced by the thought that human eyes never need to touch the data to make it searchable. However, since it doesn’t work that way in real life, they begrudgingly provide a pseudo taxonomy system with minimal features.
Some of the biggest names in the search software industry have a long and colorful history of publicly denouncing thesauri as worthless and dead. Now they claim to have supported them all along. Search is a complex undertaking with many variables in implementation and many constituencies to serve! If you want your colleagues to be able to develop new products from your treasure house of intellectual property, then you will also need a robust taxonomic strategy deployed with the appropriate tools.
Unfortunately, information technology (IT) departments purchase new search software with little consideration of how to make it actually work and, more importantly, how it is going to work with your content. The search software alone will not provide the results hoped for unless the data is "well-formed." To further aggravate the situation, many of the search software companies convince the buyer in IT that the software will work with minimal training and "automatically" tag the data. No need, they say, to add that to the budget. In fact, it takes up to 1 hour per term searched to build the training sets. Most organizations need about 5,000 to 6,000 terms to cover their intellectual holdings.
At the upper end, that means you have 6,000 staff hours (about 3 years of staff time) invested before the search can begin to work. This substantial investment in time and money is conveniently not included in the purchase estimate. No wonder many vendors make most of their money on the "associated services" and not on software sales.
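The arithmetic behind that estimate can be checked with a short sketch. The hours-per-term and term counts come from the figures above. The 2,000 working hours per staff year is an assumption made here (roughly 50 weeks of 40 hours), and the function name is illustrative.

```python
# Figures quoted in the article.
HOURS_PER_TERM = 1            # upper bound: "up to 1 hour per term"
TERMS_LOW, TERMS_HIGH = 5_000, 6_000

# Assumption, not from the article: about 50 weeks x 40 hours.
HOURS_PER_STAFF_YEAR = 2_000


def training_effort(terms, hours_per_term=HOURS_PER_TERM,
                    hours_per_year=HOURS_PER_STAFF_YEAR):
    """Return (total staff hours, staff years) needed to build training sets."""
    hours = terms * hours_per_term
    return hours, hours / hours_per_year


for terms in (TERMS_LOW, TERMS_HIGH):
    hours, years = training_effort(terms)
    print(f"{terms} terms: {hours} staff hours, about {years} staff years")
```

Under these assumptions the range works out to between 2.5 and 3 staff years, before any software cost is counted.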
Search software alone will not accomplish organizational retrieval goals. Consider the issue of persistence of retrieved documents among search results. Most search engines will automatically place a document into one or more categories with about 50% accuracy. That means half of the data is correct for the query and half of it is wrong. What a waste of time. Take the document out of a particular search environment and the original categorization is lost. When changes are made to the search system, everything gets moved to new category folders. This lack of persistence and precision frustrates users. Researchers want precision (exactly what they want), recall (all of what they want), and relevance (the data matches their request). This can also be expressed in hits (exactly what humans would choose), misses (stuff they wanted that was not retrieved), and noise (stuff the computer suggests that they didn’t want). Searchers want to feel confident they got all the relevant data and not a lot of extraneous junk.
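Hits, misses, and noise map directly onto the standard information-retrieval formulas for precision and recall. The sketch below assumes you have a set of documents the engine returned and a set that a human judged relevant for the same query. The document IDs and the function name are hypothetical.

```python
def evaluate(retrieved, relevant):
    """Compare engine results against human relevance judgments.

    precision = hits / (hits + noise)   -> share of results that were wanted
    recall    = hits / (hits + misses)  -> share of wanted items that were found
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant      # returned and wanted
    misses = relevant - retrieved    # wanted but not returned
    noise = retrieved - relevant     # returned but not wanted

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {
        "hits": len(hits),
        "misses": len(misses),
        "noise": len(noise),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical query: the engine returned four documents, three were wanted.
result = evaluate(retrieved={"d1", "d2", "d3", "d4"},
                  relevant={"d1", "d2", "d5"})
print(result)
```

In this made-up case two of the four results are noise, so precision is 0.5, which matches the "half right, half wrong" situation described above, while one relevant document is missed entirely.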
Get Your Content in Shape
Under the hood of a search engine is a set of algorithms or logic statements. The kinds of search algorithms used to build and implement a search software system vary widely. There are Bayesian engines, inference, vector, ranking, natural language processing and its parts (semantic, syntactic, phraseological, morphological, grammatical, common sense, etc.), co-occurrence, clustering, sequel rules, neural networks, etc. How does one create content to work with so many different search system options? The basic concepts to support well-formed content while ensuring that the end user will be able to find your data easily, quickly, and accurately have not changed. Good search depends on well-formatted, well-formed data.
Look at your data. Decide how it is currently organized and how you would like it to be organized. What elements do you want to be able to search on? Create a sample of your content identifying those elements. Then investigate which search engines will work with your data. Don’t try to squeeze your data into a tool that is not made for it. Understand the critical elements that result in search success, without being confused by inaccurate claims and being misled into believing fallacies.
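One way to start is to audit a small sample for the elements you want to search on. The sketch below is only an illustration: the element names (title, author, date, subject_terms, body) and the sample records are hypothetical and should be replaced with whatever your own content calls for.

```python
from collections import Counter

# Hypothetical elements an organization might want to search on.
WANTED_ELEMENTS = ["title", "author", "date", "subject_terms", "body"]


def audit_sample(records, wanted=WANTED_ELEMENTS):
    """For each wanted element, return the fraction of records
    that carry a non-empty value for it."""
    counts = Counter()
    for record in records:
        for element in wanted:
            if record.get(element):   # missing or empty values do not count
                counts[element] += 1
    total = len(records)
    return {e: (counts[e] / total if total else 0.0) for e in wanted}


# Two made-up records standing in for a content sample.
sample = [
    {"title": "Q3 safety review", "author": "R. Lee",
     "date": "2005-10-02", "body": "..."},
    {"title": "", "date": "2004-01-17",
     "subject_terms": ["valves"], "body": "..."},
]

coverage = audit_sample(sample)
for element, share in coverage.items():
    print(f"{element}: present in {share:.0%} of sample records")
```

Elements with low coverage show where the content needs work before any search engine, however capable, can make use of them.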
About the Author
MARJORIE M.K. HLAVA is president, chairman, and founder of Access Innovations, Inc. (www.accessinn.com). She is past president of NFAIS and the American Society for Information Science and Technology. Hlava has done extensive research and given numerous presentations domestically and internationally on thesaurus development, taxonomy creation, natural language processing, machine translations, and machine-aided indexing.