EnterpriseSearchCenter.com Home
 
RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES
March 18, 2009

Table of Contents

Fight Search Fallacies with Well-Formed Content
Endeca and Gabriels partner
Autonomy DAM for MOSS
DeepDyvers Updates DeepDyve Search Engine
Datawatch Releases Datawatch ES Version 4.6
Gabriels and Endeca Partner
Hearing (customer) voices
Tsavo Media Launches Twithority
WordLogic Announces Text Solution Now Available on Windows Mobile

Fight Search Fallacies with Well-Formed Content


Most organizations purchase a search system based on a set of goals. They want to save the time that staff spends searching through mountains of unstructured "stuff" (meaning full text that is not tagged, organized, or fielded), and they want to preserve the knowledge (dare we say wisdom) of retiring staff members who hold the most information about what the organization does. Unfortunately, many organizations fall prey to some common fallacies:

1. Full-text search is sufficient for good results.

2. It’s possible to get good search results without structure/metadata/well-formed data.

3. Most search engines automatically know how to make the most of available structure.

4. Taxonomy modules within search engines support a full-featured taxonomy/thesaurus.

What can happen at this point is that the goals become secondary to the software purchase. The software is bought and then implementation is attempted. This is backward. The overall intellectual property (IP) strategy should not be based on the selection of a search solution. Rather, careful consideration of the uses, content, context, and structure of the organization’s current data can make a big difference in selecting a search tool that will really work.

The Final Piece of the Big Puzzle

The search software should be the last thing you purchase in an information solution investment, not the first. Most companies over-rely on and over-invest in search software as the solution to their information management problems. They treat search software as the centerpiece of a strategy when it should play a much smaller role. In fact, search technology, in its current state, is creating problems rather than solving them.

In the last few years, as search has become widely available to employees within companies, the time spent searching has grown to the point where staff now spend more time searching than on any other job function: up to 15 hours per week. More time is spent searching than thinking about and analyzing the information staff members finally retrieve. Researchers and information professionals are not the only ones affected; those in administration, accounting, human resources, and other departments face the same problems. This costs large companies hundreds of millions of dollars per year. Desperately seeking a better solution, the average large organization has not one but four search software systems and is dissatisfied with all of them.

So what should happen? Ask yourself the following:

  • What kind of information are searchers looking for?
  • How is it organized now?
  • What should the answers look like?
  • Where is the search software to be used?
  • Is your new search engine for internal use, a public website, or both?
  • Is the data really unstructured? Does it have a title? A date? (MS Word files, for example, contain elements of structured metadata that have not been captured.)
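
As a rough illustration of the last question: a newer .docx file is a ZIP package whose docProps/core.xml part typically carries a title, an author, and a creation date. The sketch below reads those fields with the Python standard library. The function name and the returned keys are illustrative assumptions, and older binary .doc files would need a different approach.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_docx_core_properties(path_or_file):
    """Return title, creator, and created date from a .docx package, if present.

    Hypothetical helper for illustration; returns an empty dict when the
    package has no docProps/core.xml part.
    """
    with zipfile.ZipFile(path_or_file) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}

    root = ET.fromstring(xml_bytes)
    fields = {
        "title": "dc:title",
        "creator": "dc:creator",
        "created": "dcterms:created",
    }
    found = {}
    for key, tag in fields.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[key] = element.text
    return found
```

Fields recovered this way could be mapped to searchable elements before any search tool is chosen.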

If you want the general public coming to your website to have the same experience it gets at most sites, then almost any search engine on the market will do. For a basic search solution that meets low-level expectations, consider MySQL, Postgres, or Lucene, which are free. They are also under the hood of many for-fee systems on the market.

However, inside the enterprise, you certainly want users to have an above-average experience, with rich, accurate, and fast search. This requires a robust metadata and taxonomic strategy deployed with the appropriate tools.

 

 

Get Organized

Most search engines these days have some categorization and taxonomy capabilities. The reason for this is simple: Search engines don’t work very well without them. Most search software vendors have resorted to kludging on some sort of categorization functionality or rules system to get to an acceptable accuracy rate. Most don’t really believe in taxonomies or thesauri or metadata. They are steeped in the theoretical basis that these add-ons are unnecessary. They are seduced by the thought that human eyes never need to touch the data to make it searchable. However, since it doesn’t work that way in real life, they begrudgingly provide a pseudo taxonomy system with minimal features.

Some of the biggest names in the search software industry have a long and colorful history of publicly denouncing thesauri as worthless and dead. Now they claim to have supported them all along. Search is a complex undertaking with many variables in implementation and many constituencies to serve. If you want your colleagues to be able to develop new products from your treasure house of intellectual property, then you will also need a robust taxonomic strategy deployed with the appropriate tools.

Unfortunately, information technology (IT) departments purchase new search software with little consideration of how to make it actually work and, more importantly, how it is going to work with your content. The search software alone will not provide the results hoped for unless the data is "well-formed." To further aggravate the situation, many of the search software companies convince the buyer in IT that the software will work with minimal training and "automatically" tag the data. No need, they say, to add that to the budget. In fact, it takes up to 1 hour per term searched to build the training sets. Most organizations need about 5,000 to 6,000 terms to cover their intellectual holdings.

That means you have 6,000 staff hours (3 years of staff time) invested before the search can begin to work. This substantial investment in time and money is conveniently not included in the purchase estimate. No wonder many vendors make most of their money on the "associated services" and not on software sales.
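
The arithmetic behind that estimate can be checked in a few lines. The term count and the hour per term come from the text above. The figure of 2,000 working hours per staff year is an assumption (roughly 50 weeks of 40 hours), and the function name is illustrative.

```python
def training_effort(terms, hours_per_term=1.0, hours_per_staff_year=2000):
    """Estimate the effort needed to build training sets for a term list.

    hours_per_staff_year is an assumed figure, not one given in the article.
    Returns (total staff hours, equivalent staff years).
    """
    total_hours = terms * hours_per_term
    return total_hours, total_hours / hours_per_staff_year


hours, years = training_effort(6000)
print(f"{hours:.0f} staff hours, about {years:.1f} staff years")
# prints "6000 staff hours, about 3.0 staff years"
```

At the lower end of the range, 5,000 terms would work out to 2.5 staff years under the same assumptions.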

Search software alone will not accomplish organizational retrieval goals. Consider the issue of persistence of retrieved documents among search results. Most search engines will automatically place a document into one or more categories with about 50% accuracy. That means half of the data is correct for the query and half of it is wrong. What a waste of time. Take the document out of a particular search environment and the original categorization is lost. When changes are made to the search system, everything gets moved to new category folders. This lack of persistence and precision frustrates users. Researchers want precision (exactly what they want), recall (all of what they want), and relevance (the data matches their request). This can also be expressed in hits (exactly what humans would choose), misses (stuff they wanted that was not retrieved), and noise (stuff the computer suggests that they didn’t want). Searchers want to feel confident they got all the relevant data and not a lot of extraneous junk.
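
The hits, misses, and noise described above map directly onto the standard precision and recall ratios. The following is a minimal sketch; the counts in the example are made up for illustration, and the function name is an assumption.

```python
def precision_recall(hits, misses, noise):
    """Compute precision and recall from hit, miss, and noise counts.

    hits   - retrieved items the searcher wanted
    misses - wanted items that were not retrieved
    noise  - retrieved items the searcher did not want
    """
    retrieved = hits + noise   # everything the engine returned
    wanted = hits + misses     # everything the searcher was after
    precision = hits / retrieved if retrieved else 0.0
    recall = hits / wanted if wanted else 0.0
    return precision, recall


# Illustrative counts: half of what comes back is noise,
# and a third of the wanted documents are never retrieved.
precision, recall = precision_recall(hits=50, misses=25, noise=50)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers matters because a system can raise one at the expense of the other: returning everything maximizes recall while burying the searcher in noise.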

Get Your Content in Shape

Under the hood of a search engine is a set of algorithms or logic statements. The kinds of search algorithms used to build and implement a search software system vary widely. There are Bayesian engines, inference, vector, ranking, natural language processing and its parts (semantic, syntactic, phraseological, morphological, grammatical, common sense, etc.), co-occurrence, clustering, sequel rules, neural networks, etc. How does one create content to work with so many different search system options? The basic concepts to support well-formed content while ensuring that the end user will be able to find your data easily, quickly, and accurately have not changed. Good search depends on well-formatted, well-formed data.

Look at your data. Decide how it is currently organized and how you would like it to be organized. What elements do you want to be able to search on? Create a sample of your content identifying those elements. Then investigate which search engines will work with your data. Don’t try to squeeze your data into a tool that is not made for it. Understand the critical elements that result in search success, without being confused by inaccurate claims and being misled into believing fallacies.

About the Author

MARJORIE M.K. HLAVA is president, chairman, and founder of Access Innovations, Inc. (www.accessinn.com). She is past president of NFAIS and the American Society for Information Science and Technology. Hlava has done extensive research and given numerous presentations domestically and internationally on thesaurus development, taxonomy creation, natural language processing, machine translations, and machine-aided indexing.

 

Back to Contents...

Endeca and Gabriels partner

Endeca and Gabriels Technology Solutions, a private-label e-commerce technology provider, have announced new search products for media that offer increased site activity, higher lead generation, enhanced user experience and increased advertising revenue, the companies say.

Gabriels’ portal search technology is based on the Endeca Information Access Platform designed to combine search simplicity with the analytical power of business intelligence. Endeca says that by using its search capabilities, Gabriels is able to provide a superior experience to site visitors, while offering site owners the tools needed to create contextual advertising and premium placement opportunities.

Gabriels’ Endeca-powered technology provides vertical portals for more than 300 media properties.

Autonomy DAM for MOSS

Autonomy has debuted ControlPoint for Multimedia, which has been specifically designed to enable global customers using Microsoft SharePoint Server (MOSS) to exploit the full value of their rich media assets.

Autonomy explains that ControlPoint for Multimedia uses an advanced conceptual approach to analyze and retrieve the rich media content within MOSS, without relying solely on the manual tagging of metadata. In an intuitive interface, users are presented with the most relevant content, both textual and non-textual, that is conceptually related to the selected rich media file.

The company emphasizes that for knowledge enrichment, the ability to de-duplicate, cluster and retrieve rich media content in an automated fashion increases employee productivity while reducing cost, as well as helping businesses comply with the exacting demands of new e-discovery regulations.

Autonomy reports ControlPoint for Multimedia seamlessly integrates with MOSS to deliver:

  • encoding and indexing of rich media content, automatically creating insightful metadata and the ability to manage, repurpose and archive digital assets;
  • audio, video, e-mail and chat to be indexed, searched and cross-referenced alongside other data formats;
  • automatic clustering functionality for identifying duplicate and near-duplicate rich media files;
  • accuracy by understanding words in context and retrieval of information according to its meaning, thereby distinguishing homophones;
  • use of both phonetic and conceptual approaches to offer unique high-level functions, including automatic categorization;
  • analytical tools for scene analysis, speaker and audio segmentation, speaker recognition and classification, and audio/text synchronization;
  • workflow for managing rich media content; and
  • support of a wide range of languages, including those that require multibyte encoding capabilities, such as Arabic, Mandarin Chinese and Russian, as well as single-byte European languages such as English, French and Spanish.

DeepDyvers Updates DeepDyve Search Engine

DeepDyvers announced that the DeepDyve search engine has been updated. Enhancements include a simplified user interface; the ability to refine a query or add filters from a drop-down menu directly in the search bar; a "Details" button that shows an abstract of every document along with the best-matching portion of its text; and options to share results by email and on Digg, MySpace, Facebook, Twitter, and other channels.
 
(www.deepdyve.com)

Datawatch Releases Datawatch ES Version 4.6

Datawatch Corporation, a provider of Enterprise Information Management (EIM), announced the general availability of Datawatch ES, Version 4.6, the latest version of the company’s enterprise Business Intelligence (BI) and enterprise reporting solution. Datawatch ES 4.6 offers new features that allow enterprises to extract and analyze information from business documents. Datawatch ES 4.6 features compatibility with Monarch V10 and provides the ability to export data to Excel with extended support for Excel 2007, including .xlsx files containing up to one million rows of data, embedded instructions for pivot tables, and Excel formulas. Also included are additional enhancements to improve performance when processing large documents, increased granularity when extracting data from multi-subject document pages, new data filters and sorting capabilities, and improved administrative functionality.

(www.datawatch.com)

Gabriels and Endeca Partner

Gabriels Technology Solutions, an e-commerce technology provider, and Endeca Technologies, Inc., a search and information access software company, announced the introduction of search products for new media. Gabriels’ portal search technology is based on the Endeca Information Access Platform, which is designed to offer search with the analytical power of business intelligence. Through Endeca Search, Guided Navigation, and Content Spotlighting capabilities, Gabriels offers site owners the tools to create contextual advertising and placement opportunities. Gabriels’ Endeca-powered technology provides vertical portals for over 300 media properties, including such organizations as Scripps Networks, Network Communications Inc., The New York Times Company, Hearst Newspapers, Freedom Communications, Cox Newspapers, Scripps Newspapers, and Lee Newspapers.
 
(www.endeca.com)

Hearing (customer) voices

Enherent Corp., an IT consulting services and solutions provider, has formed a partnership with Attensity to deliver solutions leveraging Attensity's text analytics software.

The companies say their complementary capabilities allow effective management of brand reputation and facilitate innovation. Enherent reports it will combine Attensity's First Person Intelligence Platform with its vocabularies, analytics and subject matter expertise to manage brand risk and provide customer insight.

Enherent delivers analytics, collaboration, enterprise content management and infrastructure solutions to enterprise and mid-market organizations. It says it helps clients create, contribute, understand and transform structured and unstructured data into actionable intelligence to enhance decision-making and innovation that create competitive advantage.

Tsavo Media Launches Twithority

Tsavo Media, a company focused on the delivery and monetization of niche content to digital consumers, introduced Twithority, "Twitter Search by Authority," a new search engine for the microblogging service Twitter. Twithority returns Twitter search results, looking back as many as 1,000 results. It sequences results by rank (with the highest-ranking users first) and by time (with the most recent tweets first), within the top 10,000 Twitter users. Tsavo will integrate Twithority into its Daymix network of consumer content sites. Twithority will serve as an informal metric for Tsavo’s content sites, along the lines of Google’s zeitgeist feature.

(www.tsavo.com, www.twithority.com)

WordLogic Announces Text Solution Now Available on Windows Mobile

WordLogic Corporation, a technology company developing methods of text and information entry, announced that its text solution is available for Windows Mobile compatible handheld devices. WordLogic's handheld version of its software can now be used on all pocket PCs, smartphones, and portable media centers. Upon typing a letter into the WordLogic system, users receive a list of five completion candidates at a time, based on more than 50,000 dictionary entries, which can be selected and inserted into the text. The handheld version of the WordLogic predictive text software features a configurable keyboard Graphical User Interface (GUI) for touch-screen smartphones, new WordChunking technology, and customizable dictionaries. WordLogic's keyboard works with all Microsoft Windows Mobile 5/6 compatible programs, including Microsoft Office Mobile, instant messaging programs, web browsers, Google search, and Skype Mobile. WordLogic Handheld is available in seven languages: English, German, French, Dutch, Italian, Spanish, and Portuguese.

(www.wordlogic.com)
