Enterprise Search Center

RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES

October 18, 2006

Table of Contents

FEATURED CONTENT: Taxonomies, Metadata, and Search

SearchInform Technologies Introduces SearchInform 3.0

Endeca Announces Endeca Information Access Platform 5.0

ClipBlast! Indexes Video for Websites; Introduces Free Search Box

Alacra Adds Google Powered Vertical Search Engine

Serials Solutions Launches Central Search Results Clustering

WebSideStory Launches Ajax-Enabled Site Search Solution

TEMIS Announces Luxid Solution

EMC and Microsoft Form New ECM Alliance

Search Extender for Google Desktop

Vivisimo Upgrades Clusty.Com and Leverages Site for Search Experiments

Skyarc Releases Intrablog Package

Attensa Introduces Free RSS Reader

FEATURED CONTENT: Taxonomies, Metadata, and Search

The information age poses all kinds of challenges, the most fundamental of which is how to find things—so let's explore how a taxonomy, through application of metadata, can help users find what they need.

Browsing and Searching

When people know what they are looking for (or think they do), they typically enter a search term and then browse through a result set. But if they aren't sure what they want or have no particular goal in mind, they'll usually look through navigational links and labels and use them as clues about where interesting information might lie. These methods aren't mutually exclusive; some users will move back and forth between both approaches when one isn't fulfilling their needs successfully.

The Google Effect

There is a great deal of confusion surrounding the value of metadata and taxonomic terms in organizing documents. In many organizations, there is a line of thought that "if we just get a really good search engine," then the problem of people not being able to locate information in the context of their work will go away. People will be able to just enter a search term in a Google-like interface, and the precise information they are looking for will appear. This is a typical argument against the process of formally building metadata structures and standards and developing a well thought-out taxonomy.

The answer to this line of thinking is that although algorithms are getting better, it is not yet possible for machines to infer intent. They can count words, look for patterns, derive categories, cluster results, extract entities, compare word occurrences, and apply complex rules and statistical analyses. But they cannot tell what you want to do. They don't know the context of your work task. They cannot determine what is important to you.

One could argue that no one can determine a user's intent, even taxonomists and metadata architects. That may be true, but if we know something about who users are and understand how they do their jobs, then we can start to make some assumptions about the information that we think they want.We might also begin to understand both the specific language and terminology that searchers use and their mental models of their world and work tasks.

What is the significance of knowing all this? Well, the more we know about a user's world, the more precise our assumptions will be about the types of artifacts that user will look for in day-to-day work tasks. If I am a salesperson and am doing some cold calling, I might first look for calling scripts, some articles on the market or customer needs, or perhaps a follow-up presentation or white paper I can send my prospects after I get them on the phone.

If I am a consultant trying to install the latest version of engineering design software, I will want to look for technical bulletins, bug fixes, common installation problems, previous engagements' lessons learned, customer site histories, specific configuration documents, and so on.

These process steps or work tasks help describe artifacts and the context in which they are used.These descriptions and contexts become the raw material for the taxonomy. They can also be the basis for metadata fields that are applied to documents.

For example, perhaps my work process looks like this:

Scope project
Write proposal
Deliver project
Capture lessons learned
Close project

In the first step, I need to find the following:

Fee worksheets
Prior projects
Example solutions
Scoping worksheets

These artifacts should ideally be labeled, so they can be more easily retrieved.

I might label content in an application according to the process step, so that when I am scoping a project, I can search according to any documents that are appropriate for the scoping phase.

Click Here to Get Your Free PDF of the Full Article

Now imagine that I serve a number of markets: pharmaceutical, financial services, aerospace, automotive, and high-technology. It may also make sense to allow retrieval of documents in the first process step that are related to my client's industry.

We could also say that there are documents for different audiences, perhaps for technical versus non-technical. Or documents may be distinguished as internal or external, partners or customers, and so on.

Each of these perspectives represents a different "facet" of the content. Metadata is applied by deriving a list of terms for each facet and a using a combination of these terms to describe the exact context of the content. By applying metadata to content in this way, and then letting users select the appropriate terms that describe their tasks, we are in effect letting users describe their intent—they are telling us who they are, what they are attempting to accomplish, and what is important to them.

This is called faceted navigation or faceted search and is really just search on metadata—the old "advanced search" that no one ever used. But now we can "fool" people into thinking that they are navigating instead of searching. This is done through clever user interfaces, like those by companies such as Endeca and Siderean, but this can also be accomplished through "stored searches"—queries that are preconfigured for a particular task.To users, this looks just like navigation: they simply click on a link, the search is executed behind the scenes, and a set of results is presented.

This type of search can help precisely distinguish content with fine shades of meaning, especially when there are large numbers of complex documents in a repository. An "outsourcing strategy" can vary widely across industry and process and contain many types of documents and deliverables. Broad searches using ambiguous terms will not zero in on "best practices for telecommunications call center outsourcing strategy for the insurance industry, using service firms located in India" by searching on "deliverables." However, searching on the metadata facets of "industry," "process," "locale," and "best practices" will yield more appropriate results.

The Tagging Process

Does that mean we have to add meta-tags to everything? When I describe faceted search, many people say "our users won't tag content" or "that is too expensive." Do we have to tag all of our content?

The answer is no, not all content, because not all content is not of equal value.

You should concentrate on what is important for users in their context, what needs to be easily and quickly accessible. Information required for work tasks (e.g., worksheets) or that is reviewed for timeliness and appropriateness (e.g., best practices) is considered highvalue content, since its findability and use directly impact business objectives. Unfiltered information that is less directly involved in specific work processes is generally less valuable to daily operations and should be less of a concern.

This is not to say that unfiltered information has no value. On the contrary, emails and discussions can be rich sources of tacit knowledge, often valuable in complex or novel situations.However, this type of content is generally unstructured and therefore more difficult to organize. And since the information these documents contain is tied less directly to specific work processes, the cost associated with applying a formal tagging structure to this type of information is harder to justify. Content that is already structured or tied to a structured process tends to derive greater value from controlled organization.

So you don't have to formally tag all your content. But if you do, keep in mind that tagging large volumes of content can be a time-consuming and costly process. You will likely want to tag in phases, prioritizing categories of content based on their relative value to users. It is not unusual to run out of budget or have progress postponed during the course of a tagging project, so you want to be sure you've really focused your efforts where they matter most.

Decide what content has the greatest value and prioritize there. You can also set up a "prioritization matrix" that assigns values to various attributes of your project and attempts to place a score on one focus area versus another.

Table 1 shows the subjective evaluation of various project factors, and Table 2 assigns a numerical value to each grade for each factor. In this example, we are looking for a high score in order to decide where to focus.

Table 1. Subjective evaluation of various project factors

Repository
Processes Supported
Relative Impact Ownership SMES Identified Size of Repository Value of Documents Currency of Content
Technical library Self service, engineering, field consulting High Yes Large High High
Methodologies Field consulting Medium No Small High Low
Proposals Sales consulting High No Small High Medium
Customer support Customer support Low Yes Medium High Medium

Table 2. Numerical assignment of project factor value

Repository Processes Supported (Many=3) Relative Impact (High=3)
Ownership /SMES Identified (Yes=3)
Size of Repository (Small=3) Value of Documents (High=3) Currency of Conent (Current=3) Total
Technical Library
3
3
3
1
3
3
16
Methodologies
1
2
1
3
3
1
11
Proposals
2
3
1
3
3
2
14
Customer Support
1
1
3
2
3
2
12

Value of Social Tagging

Although you may not be ready to invest in formally tagging all your content, there is a less structured and expensive approach to adding metadata to content. A social tagging approach gives users the ability to add keyword metadata to content items and is not controlled by any taxonomy or term list. User-generated tagging has some benefits: it is useful for identifying emerging knowledge and terminology, it can take into account multiple perspectives, and it certainly costs less than controlled tagging! However, with social tagging you lose many of the benefits of a controlled approach.

Terms may be ambiguous or overly broad, there may be many variants, and terms alone lack the context provided by a structure.

Of course, one approach does not preclude the other. Not all tools lend themselves well to leveraging metadata standards or a taxonomy. Collaborative tools tend to be less structured (along with their content) and focus on knowledge creation rather than access; thus, these tools respond better to an unstructured tagging approach. Tools that support controlled processes tend to focus more on knowledge access and require a more structured and rigorous approach to organization.

You can even use social tagging as a supplement to a controlled vocabulary in some contexts, using it both to raise awareness about how users think about finding information and as a source for tracking vocabulary changes. There are many hybrid approaches possible; it is up to you to decide what suits your context and budget best.

The point is that it is not an "either/or" situation. Tagging with controlled metadata is important when a process requires more structured information. One would not use social tagging for validated processes for FDA drug submissions. Those require very precise editing, vetting, and control processes. Less structured information can do with a less formal process. The goal is to determine both the relative value of content and how users think of that content in the context of their work and then to determine how formal the process of applying metadata needs to be.

SETH EARLEY is founder and senior consultant for Earley & Associates. Email him at seth@earley.com.

CLICK HERE FOR YOUR FREE PDF

Back to Contents...

SearchInform Technologies Introduces SearchInform 3.0

SearchInform Technologies Inc. has introduced a new version of SearchInform, a program of full text search and search for documents similar in their content, featuring an enhanced indexing and information search algorithms as well as a new request caching system. Features in SearchInform 3.0, include: performance at a higher speed; phrase search with due consideration to stemming and thesaurus; new SoftInform Search Technology of search for similar documents; high indexing speed (from 15 to 30 Gb/hour), index size of 15-25% from the actual size of the text data; query caching system; support of over 60 text formats, Outlook and TheBat electronic messages, mp3 and avi tags, and logs of MSN and ICQ instant messaging programs; correct work with archives; and universal data sources (indexing of DBMS).

(www.searchinform.com)

Back to Contents...

Endeca Announces Endeca Information Access Platform 5.0

Endeca, an information access company, has unveiled the Endeca Information Access Platform (IAP) 5.0, a new release that combines search with business intelligence. The Endeca IAP 5.0 brings together all the tools and techniques people need to speed information discovery and analysis--search, Guided Navigation, ad hoc analysis, visualization, and text analytics--into a single, interactive user experience. It also introduces a new and enhanced capabilities, including Relationship Discovery, extended enterprise integration, and business control features to help organizations deploy new classes of information access applications.

Features of the Endeca IAP 5.0 include: new relationship discovery with interactive text analytics capabilities that complement the Endeca IAP's search, Guided Navigation, analysis and context-sensitive presentation capabilities--all integrated in a single user experience and all powered by Endeca's MDEX Engine technology; Entity Discovery, Term Discovery, and Cluster Discovery capabilities reveal relationships in enterprise content (records, documents, web pages, text files, etc.) based on shared people, places, concepts, tags, and other entities hidden in free form text files; relationships can be presented in information access applications as Guided Navigation refinements, highlighted links, related terms and/or new document clusters; new, valid relationships are continually exposed and recalculated based on the context of the user; new workflow capabilities; a web-based relevance ranking evaluator gives Endeca application owners the ability to understand the impact of different relevancy ranking strategies and respond to changing business needs; new granular permissions delivers enhanced security and gives Endeca application owners new abilities to manage who has control over specific areas of information access applications; a new SOA interface; new reporting and visualization integration; extended integration with critical enterprise applications; and extended platform support. The Endeca IAP 5.0 will be generally available in November 2006.

(www.endeca.com)

Back to Contents...

ClipBlast! Indexes Video for Websites; Introduces Free Search Box

ClipBlast!, a video search engine, has announced that it will index video for websites, and introduced a free, downloadable search box, designed to enable website owners to quickly implement video site search and monetize video content. Website owners can add ClipBlast!'s search box to their sites, by copying and pasting HTML code. ClipBlast's technology gives users the ability to search for video clips from within a single site, video blog (vlog) or across the entire web. For site owners, ClipBlast! offers back-end video search technology that organizes video libraries, which enables content to be monetized. The company's technology crawls the web in search of video, then categorizes video files, web pages, and feeds so that the most relevant clips can be served up in real-time, on demand.

(www.clipblast.com)

Back to Contents...

Alacra Adds Google Powered Vertical Search Engine

Alacra, Inc., a provider of online business information solutions, has announced the deployment of Google Search Appliances to create the Alacra Compliance Web, a vertical search engine designed to facilitate the account vetting process for financial institutions. The Alacra Compliance Web is a part of Alacra Compliance, which helps banks and broker/dealers employ a comprehensive and documented process for Customer Identification Programs (CIP), Know-Your-Customer (KYC) efforts, and Enhanced Due Diligence (EDD) to meet Patriot Act, BSA, and FSA regulatory requirements.

The Alacra Compliance Web is a continuously updated index of more than 500 global regulatory websites including stock exchanges, government regulatory agencies and insurance regulators. Specific features include: Google Quality and Ranking for each query; a compliance- specific taxonomy, including sites hand-selected and classified by Alacra Content Analysts; dynamic page summaries designed to allow users to judge the relevance of results with dynamically generated snippets showing your query, in bold text, within the context of the page; and multi-lingual capabilities. As part of the launch, Alacra has released a searchable free trial version of the Alacra Compliance Web.

(www.alacra.com)

Back to Contents...

Serials Solutions Launches Central Search Results Clustering

Serials Solutions has launched a new Results Clustering feature for its Central Search federated search service. Serials Solutions' web-hosted model requires no local software loading. Results Clustering integrates the Vivisimo Clustering Engine with Serials Solutions Central Search. Serials Solutions is offering the feature free to all Central Search subscribers. No software or hardware installation is required. As with all Central Search user interface features, results clustering is customizable.

(www.serialssolutions.com; www.vivisimo.com; www.il.proquest.com)

Back to Contents...

WebSideStory Launches Ajax-Enabled Site Search Solution

WebSideStory, Inc., a provider of digital marketing and analytics solutions, has announced its Ajax-enabled site search solution. This patent-pending capability, called Active Browsing, is an extended service of WebSideStory Search and enables any ecommerce site to integrate one of Web 2.0's enabling technologies, Ajax (Asynchronous JavaScript and XML), into product search results.

This is designed to enhance speed and interactivity, and provide an improved customer shopping experience in which shoppers are able to engage in "clickless browsing"--the ability to preview additional product facets and "more like this" categories by rolling their mouse over specific search results.

Active Browsing works by transforming site search into an interactive application that accesses server data optimally, and allows for user interface innovations such as "bubbles" that overlay the page to provide more information and navigation choices. This enables visitors to speed through product search results and related facets--color, size, gender, etc.--without having to reload the page every time.

(www.websidestory.com)

Back to Contents...

TEMIS Announces Luxid Solution

TEMIS, a provider of Text Analytics applications, has announced that Luxid, its recently unveiled Information Intelligence solution, is designed from the ground up on UIMA. The UIMA standard provides a standard framework, with interoperability to allow for the combination of multiple technologies. Luxid becomes fully compliant with the IBM OmniFind enterprise search and discovery platform, enabling the search experience to be enhanced with semantics. It also enables Luxid to use any third-party UIMA-compliant annotators.

Luxid gives immediate access to non-obvious information and delivers knowledge from vast and unstructured data. Luxid has been structured into three stackable software applications: Luxid Annotation Factory performs the extraction of information from text. Powers the ability to reliably identify entities and relationships. It offers an interface to manage the full document annotation pipeline; Luxid Information Mart is a platform that federates heterogeneous sources, and enriches the harvested document leveraging Luxid Annotation Factory in order to build a knowledge base; and Luxid Information Analytics is a web-based and feature-rich portal enabling information discovery on top of Luxid Information Mart. Its User Interface provides access to advanced search and filtering, document navigation, time analysis, cross-tab views, information mapping, and clustering. The Knowledge Browser turns any document set into a navigable knowledge graph, displaying extracted entities semantic relationships. To enable collaborative discoveries, Luxid Information Analytics also offers multiple-view dashboards through user-defined and shared Centers of Interest.

(www.temis.com)

Back to Contents...

EMC and Microsoft Form New ECM Alliance

EMC Corporation and Microsoft Corp. have announced a new enterprise content management (ECM) alliance aimed at helping to enable organizations to become People-Ready with their compliance, regulatory, and other critical business data. This alliance is designed to enable information workers to take advantage of the Microsoft tools and applications they use every day to access and contribute to the critical business processes available in their ECM infrastructure.

As part of this alliance, EMC will introduce a set of new content and archiving products designed to enable tighter integration between the EMC Documentum ECM platform and Microsoft solutions and platform technologies. EMC will bring to market new solutions that integrate the EMC Documentum platform with multiple Microsoft solutions and platform technologies including Microsoft Office SharePoint Server 2007, the 2007 Microsoft Office system, SQL Server 2005, and enterprise search solutions. With this new alliance SharePoint users can take advantage of the advanced ECM capabilities of the Documentum platform. Information workers will be able to access the Documentum platform natively from within Microsoft Office SharePoint Server 2007 and the Microsoft Office system.

(www.emc.com; www.microsoft.com)

Back to Contents...

Search Extender for Google Desktop

Inxight is offering a free download of its Search Extender for Google Desktop, which will enable users to take advantage of Inxight's entity extraction and natural-language processing to enhance the Google Desktop search experience. Search Extender for Google Desktop is a .Net application that helps users find documents faster and locate hidden information in document sets, e-mail and presentations.

Inxight claims Search Extender gives users better summaries; allows them to see filter search result sets by the relevant people, companies, products and other information contained within them; and makes it easier to navigate through long documents. Inxight reports that search results from Google Desktop are automatically clustered on-the-fly, adding that keyword-in-context summaries are enhanced by an automated "Top Mentions" list, revealing the most important people, places, companies and other information contained within the full text of a document. Also, Inxight says, an automated document index makes it easy to see what key information is contained within documents, such as one mention of a potential new competitor in a large PDF file, a certain version of a presentation, or a relevant phone number within a long e-mail thread.

To download the free Inxight Search Extender for Google Desktop, visit here.

Back to Contents...

Vivisimo Upgrades Clusty.Com and Leverages Site for Search Experiments

Vivisimo, a provider of search software has marked the second birthday of its consumer search site Clusty.com with a redesign and new features, including a font size selector, a "more" tab, and the ability to search within results. Clusty.com now features a "Labs" page that hosts new search applications developed by Vivisimo's engineers. At the Clusty Labs page, consumers can try out Clusty Cloud, a customizable visualization tool that lets users see an immediate overview of any topic on the web, and do specialized searches of a Benjamin Franklin portal or of Shakespeare's plays and sonnets. The company plans to use the Labs page as a testing ground for improvements in the overall search experience for its end users. New technologies developed and tested in the labs will also shape future versions of Velocity, the company's enterprise search platform.
(www.clusty.com; www.vivisimo.com)

Back to Contents...

Skyarc Releases Intrablog Package

Skyarc System has announced its release of SKYARC Enterprise IntraBlog, an intrablog package utilizing Six Apart Ltd.'s Movable Type Enterprise. The SKYARC Enterprise package contains MTE functions such as blog, template, data API, and LDAP. In addition, it includes WiSE search functions for finding temp files in Excel, Word, or PDF and looking up articles and sentences. Users can customize their own page, and add HTML and RSS into their own page using SkyFEED. SKYARC Enterprise IntraBlog will be available from the end of October. A server license costs around $101,000 (800,000 yen) with $114,000 (900,000 yen) per 100 users.

(www.skyarc.com)

Back to Contents...

Attensa Introduces Free RSS Reader

Attensa, Inc., a developer of RSS software for business, has introduced Attensa for Outlook 2.0, a RSS reader designed specifically for enterprise use.

Attensa for Outlook 2.0 brings new usability features and additional functionality including: secure connectivity to Internal and External Web Feeds and AttentionStream reporting with the Attensa Feed Server, so when Attensa for Outlook is connected to the Attensa Feed Server articles read and deleted are treated consistently whether they are read in Outlook, the Attensa AJAX web reader, or on mobile devices; the ability to let users listen and watch audio and video content in Outlook, directly in the River of News; a new Desktop Alert toaster allows users to track breaking business information whether they are working in Outlook or not. The Desktop Alert appears when user-selected priority feeds are updated with new articles; and Attensa's AttentionStream technology uses predictive ranking to observe and analyze how users read and process RSS articles and subscriptions. Through this analysis of AttentionStream data, including the time and frequency that feeds are accessed and articles read, deleted, and ignored, Attensa for Outlook 2.0 displays feeds in a prioritized list based on the likelihood that they will be of interest to the reader.

Other productivity features include: the ability to automatically find, preview, and add RSS feeds; automated search tools; the ability to create, import, and export custom reading lists; the ability to keep feeds and articles organized with tagging and categories; the ability to republish to blogs and forward articles; and advanced compatibility with the Microsoft RSS Platform. Attensa for Outlook for Outlook 2.0 can be downloaded at no charge immediately from the Attensa the website.

(www.attensa.com)

Back to Contents...

[Newsletters] [Home]