EnterpriseSearchCenter.com
 
RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES
April 14, 2010

Table of Contents

PROJECT LEFTY: More Bang for the Search Query
Hot Neuron releases Clustify 2.2
Digital Reef releases ECA application
Precisely targeted content delivery
Access Innovations Joins Cloud
Karsa Announces Fulltext Search Manager 2.0
ISIS Papyrus Becomes Newest OASIS Foundational Sponsor
Rogers Digital Media Taps Brightcove and Endeca for Relaunch of Citytv Video Theatre

PROJECT LEFTY: More Bang for the Search Query

In the aggregate, libraries spend vast amounts of money on electronic databases. In the aggregate, they are not utilized as extensively or efficiently as we librarians would like them to be. Traditional federated search has been the gold standard of article discovery for years. The expectations for speed and apparent relevancy brought forth by "web scale discovery tools" (in Marshall Breeding’s phrase) such as Google Scholar and Serials Solutions Summon in recent years have made traditional federated search tools seem clunky and unwieldy. This latest generation of search tools highlights the unpleasant side effects of brokering searches among multiple targets and integrating the results: The old method is slow, and while the results may well be worth waiting for, library users don’t want to wait. Still, the newer tools have their faults. They may search "everything" the library offers, but it is often unclear exactly which databases and journals are being searched, and the relevancy rankings are arbitrary and largely out of the library’s and the user’s control.

What is lacking is fine-tuning of searches based on who the patron is, what they are researching, and what level of academic investigation is appropriate. Article discovery tools must, on a query-by-query level, search the right databases (that is, databases specifically relevant to that particular search query) at the right level of academic inquiry (that is, the databases are appropriate to the academic level of the user in the subject domain they are searching), and use the right query (that is, domain- or database-specific vocabulary). Thus, I propose Project Lefty (three rights, of course, make a left).

Project Lefty is a search system that, at a minimum, adds a layer on top of traditional federated search tools that will make the wait for results more worthwhile for researchers. At best, Project Lefty improves search queries and relevance rankings for web-scale discovery tools to make the results themselves more relevant to the researcher’s specific query. Project Lefty has three components, each directed at a particular right.

Determining the Right Database

Picking the correct database or databases for a particular user’s specific query is a challenge perhaps best met with the traditional reference interview. In Project Lefty, we accomplish this in an automated way, through a two-step process:

1. The first is to understand the contents of the databases themselves. Vendors already provide descriptions (both narrative and keyword) of their products. Librarians have added additional metadata. These collective descriptions can be improved by adding abstracts of the sources indexed by the database.

2. The second part is to map the user’s query to a well-defined set of databases. There are several possibilities for doing this. In one method, we could use historical searches and targets to predict the future. The University of Michigan (UM) Library already has a "database finder" that maps a user’s query against historical uses of library-provided databases for that query. The databases that patrons frequently use to find results for that query are likely candidates for future searches. (See, for example, the left-hand navigation column on the articles search results for "middle east": www.lib.umich.edu/article/General%20Interest/middle%20east.) Alternately, we could perform the query against a general interest article database (such as FirstSearch or Google Scholar) and perform an analysis of the first results returned to determine subject clusters. These subject clusters would then be mapped to narrower databases in the library’s collection.
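To make the second, clustering-based approach concrete, here is a minimal Python sketch. The general_index client, its search() signature, the record fields, and the SUBJECT_TO_DATABASES mapping are all illustrative assumptions, not a description of the UM implementation.

    from collections import Counter

    # Hypothetical, librarian-maintained map from subject terms to narrower databases.
    SUBJECT_TO_DATABASES = {
        "military history": ["Historical Abstracts"],
        "international relations": ["Worldwide Political Science Abstracts"],
        "public health": ["PubMed"],
    }

    def candidate_databases(query, general_index, top_n=25, max_databases=5):
        """Cluster the first results from a general-interest index by subject,
        then map those subjects to narrower databases in the library's collection."""
        results = general_index.search(query, limit=top_n)   # assumed client API
        subject_counts = Counter()
        for record in results:                               # records assumed to be dicts
            for subject in record.get("subjects", []):       # e.g., vendor-supplied headings
                subject_counts[subject.lower()] += 1
        databases = []
        for subject, _ in subject_counts.most_common():
            for db in SUBJECT_TO_DATABASES.get(subject, []):
                if db not in databases:
                    databases.append(db)
        return databases[:max_databases]

The returned list would then determine which targets the federated search layer actually queries for that request.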

Determining the Right Academic Level

Figuring out if the researcher is likely seeking introductory, overview, or advanced information in response to a given question is the next step. We have a two-part strategy to figure out what level of information is most likely appropriate:

First, we need to make a best guess at the presumed academic level of the specific query. We can do this by pulling together a variety of "environmental information" about the person asking the question (assuming the user has authenticated), including the following:

1.  For students, determining the courses in which the student is enrolled by accessing the registrar’s information. If the query fits a course, we can assume that course’s academic level. For example, a search for "government" when the student is enrolled in Political Science 101 would imply a basic query. A student enrolled in several higher-level political science courses would imply a higher level of inquiry.

2.  For faculty, inquiries in their subject domain (their academic department) would be assumed to be at the highest possible level. For queries out of their subject domain, a lower level of academic inquiry could be inferred.

3.  For people who are not authenticated, or about whom we can make no inference, we don’t infer an academic level.
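A minimal sketch of this inference step follows. The user and registrar records, the courses_for() lookup, and the 1-to-4 level scale are hypothetical stand-ins for whatever the campus systems actually provide.

    # Rough sketch only: the registrar lookup, course records, and level scale
    # (1 = introductory .. 4 = advanced) are hypothetical stand-ins.
    def presumed_level(user, query_subject, registrar=None):
        """Return a presumed academic level for this query, or None if no inference can be made."""
        if user is None or not user.get("authenticated"):
            return None                          # unauthenticated: make no inference
        if user.get("role") == "faculty":
            # In-domain faculty queries are assumed to be at the highest level.
            return 4 if query_subject in user.get("departments", []) else 2
        if user.get("role") == "student" and registrar is not None:
            levels = [course["level"]            # e.g., Political Science 101 -> 1
                      for course in registrar.courses_for(user["id"])
                      if course["subject"] == query_subject]
            return max(levels) if levels else None
        return None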

Second, we need to determine the specificity/academic level of the databases in the subject domain and then give higher relevance to items coming from the appropriately selected databases. It is not that we exclude databases with the "wrong" presumed academic level, but we give sources with the "right" academic level a relevance bump. We determine the relevance increase or decrease through any of a range of methods; a sketch of this re-ranking step follows the list below. The specific method chosen depends on what bibliographic data we have about the articles being returned from the search tool:

1. Running sample user queries (from historical query logs) against the databases to find which databases are most general for that query and which are the most specific

2.  Observing a user's interaction with the system to tune the default behavior

3.  Assigning broad levels (introductory to intermediate; in-depth; etc.) to specific journals beforehand and matching the user's presumed academic level to citations in the result list

4.  Over time, observing user behavior to learn which level of database a particular individual generally goes to (someone who goes to "high level" or very specific databases routinely has those weighted more strongly; someone who routinely goes to the more general databases has those weighted more strongly)

5.  Recency--given a query in a subject domain, similar queries over time (by the user or by all users) influence sorting of results for future searches
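Here is the re-ranking sketch mentioned above, independent of which of the five methods supplied the level metadata. The per-database levels and the 0.15 boost constant are illustrative assumptions, not part of the proposal itself.

    # Databases at the "wrong" level are not excluded; they simply get less of a boost.
    DATABASE_LEVEL = {"Historical Abstracts": 4, "General OneFile": 1}   # assumed metadata

    def rerank(results, user_level, boost=0.15):
        """results: list of dicts with 'score' and 'database' keys (assumed shape)."""
        if user_level is None:
            return sorted(results, key=lambda r: r["score"], reverse=True)
        def adjusted(r):
            db_level = DATABASE_LEVEL.get(r["database"])
            if db_level is None:
                return r["score"]
            # The boost shrinks as the database level moves away from the user's level.
            return r["score"] * (1 + boost / (1 + abs(db_level - user_level)))
        return sorted(results, key=adjusted, reverse=True)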

Determining the Right Query

By substituting vocabulary that better fits the subject domain and database, we enhance the user's query to be better targeted. This can be achieved through search query analysis: determining which similar queries provide objectively better results. Expanding keywords appropriately requires understanding the subject domain of the query. Several approaches for this process include the following (a rough sketch of the second approach appears after the list):

1. Using full-text materials available through the HathiTrust and other sources, combined with Library of Congress (LC) call numbers and/or LC subject headings assigned to those items, to assemble large bodies of text from current publications by subject area and to develop maps of search terms to the subject domains in which they most frequently occur

2. Using Google Scholar results to find best-fit articles for a given query, and then using those results to generate a better-targeted query by identifying common keywords in Google Scholar’s first few results
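A rough sketch of the second approach, in the same spirit: the scholarly_index client is an assumed stand-in (Google Scholar itself offers no public search API), and the stopword list and term-length cutoff are arbitrary illustrative choices.

    from collections import Counter
    import re

    STOPWORDS = {"with", "from", "that", "this", "their", "which", "into", "between"}

    def expand_query(query, scholarly_index, top_n=10, extra_terms=3):
        """Append the most common non-trivial keywords found in the first few
        results of a general scholarly index to produce a better-targeted query."""
        results = scholarly_index.search(query, limit=top_n)   # assumed client API
        counts = Counter()
        for record in results:                                 # records assumed to be dicts
            text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
            for word in re.findall(r"[a-z]{4,}", text):        # words of 4+ letters
                if word not in STOPWORDS and word not in query.lower():
                    counts[word] += 1
        expansion = [word for word, _ in counts.most_common(extra_terms)]
        return (query + " " + " ".join(expansion)) if expansion else query

The expanded query would then supplement the user's original terms when the specialized databases are searched.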

Outcomes

In the system just described, a first-year student with no declared major who is taking a history class and a full professor in the history department would both enter the same search query, "Boer war." The student and the professor would get very different sets of results based on their different presumed levels of academic investigation.

The student would see general texts nearer the top of his or her results list, while the professor would see more scholarly academic papers closer to the top. Because the professor frequently searches this topic and ends up at articles from two particular scholarly journals, articles from those journals are right at the top of his or her results list. (A librarian assisting either patron could perform a search as that individual so as to see the same results list.)

So what does this achieve? Without having to do anything at all different—beyond authenticating—the user gets a results set that is more likely to contain highly relevant content and more likely to be directly relevant to the particular query. A customized relevance ranking is no more arbitrary than the one-size-fits-all approach taken by existing technologies. And this proposed tool can be used on top of an existing, or not-yet-developed, "old-school" or "newfangled" cross-database search platform. Sitting on top of the available technology, it improves results quietly, leading the researcher to better articles than they would find in an unassisted search.

About the Author

Ken Varnum is the web systems manager at the University of Michigan Library. He has been working with digital information technologies as a librarian for more than 15 years. His blog about libraries and technology is at http://rss4lib.com, and he can be reached by email at varnum@umich.edu.

Author's Note: Two colleagues at the University of Michigan Library have been instrumental in developing this concept and working toward implementation. Albert Bertram (lead developer, library web systems) provided constructive feedback and helped define the technical parameters of this project. Judy Yu (federated search developer, library web systems) is developing a pilot of the tool described herein.

About the Contest

The Federated Search Blog (http://federatedsearchblog.com) held its second annual contest to increase awareness of and interest in federated search. The blog asked participants to describe the most impressive federated search application they’ve ever seen or imagined. Blog and contest sponsor Deep Web Technologies awarded cash prizes to the top three winners: Ken Varnum, Hope Leman, and David Walker. Industry experts Abe Lederman, Todd Miller, Helen Mitchell, Richard Tong, and Walt Warnick judged the submissions. In addition to receiving a $1,000 cash prize, top winner Ken Varnum participated in a panel discussion at the Computers in Libraries conference, and his winning essay is published here.

The judges selected Hope Leman to receive the second-place prize for her essay, "Not So Wild a Dream: The Science 2.0 Federated Search Dream Machine." Hope is a research information technologist for Samaritan Health Services in Oregon, where she is helping to develop a service to help scientists and public health researchers find professional conferences and places to submit their research papers. Hope’s essay shares her dream of creating a federated search engine to help scientists with two key aspects of research: finding the current state of research on a topic and finding calls for papers and presentations.

David Walker received third place in the contest. David, library web services manager at California State University, produced a video titled Using Metasearch to Create a Journal Table of Contents Alerting Service. The video describes the work his library is doing to connect researchers to journal articles. The challenge is that while many publishers have alerting services to notify subscribers of new content, procedures for accessing the services vary greatly between publishers. Additionally, these publisher-provided services typically provide links to content that a researcher may not have permission to access due to authentication and location issues. David explains how combining a number of existing technologies overcomes these hurdles.

The blog received a number of other innovative submissions. Charles Knight, search editor for the News Web, won honorable mention for proposing that federated search be used to mash up geographic data that is then projected onto a globe-shaped screen. Other submissions included applying artificial intelligence to search, developing common standards for publishers to follow to simplify the search and aggregation process, and chucking federated search altogether in favor of "small town librarians."

Learn more about the winning contest entries at http://federatedsearchblog.com/category/contest-winners-2009.

Back to Contents...

Hot Neuron releases Clustify 2.2

Hot Neuron has announced the release of Version 2.2 of its Clustify document clustering software, featuring user-adjustable word weights and other improvements.

Clustify groups related documents into clusters and labels each cluster with a few words to tell what it is about, allowing the user to explore the document set and efficiently and consistently categorize documents. Version 2.2 gives users the ability to adjust the weighting of words used for clustering in order to encourage clusters to form around words that are of special interest. The new version also features an improved near-duplicate detection algorithm, more flexible export of results to other e-discovery tools and more useful cluster sorting.

Clustify can generate concept-based clusters, or it can require documents in the same cluster to contain identical passages of text to detect near-duplicates (i.e. different revisions of the same document). It also has an automatic categorization capability to reduce the amount of manual labor necessary for categorization when new documents are added to a dataset. It can handle millions of documents on a typical desktop computer.
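Clustify's internals are not public, but the effect of the user-adjustable word weights introduced in Version 2.2 can be illustrated generically: up-weighting a term of special interest makes documents sharing that term look more similar, encouraging clusters to form around it. The Python function below is a standalone illustration of that idea, not Hot Neuron's code.

    import math

    def weighted_cosine(doc_a, doc_b, word_weights):
        """Cosine similarity of two term-frequency dicts, with user-supplied word weights
        (terms absent from word_weights default to a weight of 1.0)."""
        def weighted(doc):
            return {t: f * word_weights.get(t, 1.0) for t, f in doc.items()}
        a, b = weighted(doc_a), weighted(doc_b)
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Example: boosting "patent" pulls these two documents closer together.
    weighted_cosine({"patent": 3, "filing": 1}, {"patent": 2, "deadline": 1}, {"patent": 5.0})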

Back to Contents...

Digital Reef releases ECA application

Digital Reef has unveiled a new early case assessment (ECA) application with indexing and analysis speeds of more than 10 terabytes per day. Introduced with the Digital Reef Virtual Governance Warehouse 3.0, the new e-discovery application indexes and analyzes full content across e-mails, documents, repositories and more than 400 file types using industry standard servers and storage. With the Virtual Governance Warehouse, organizations can simultaneously support additional information governance projects driven by legal and regulatory mandates and internal IT policies, says the company.

The Digital Reef ECA application’s features include:

federated search—search an unlimited number of storage and content repositories in a single query;

content reporting—make decisions based on the economics of potential data volumes and document types;

search builder—intelligent and automated search to easily find relevant information;

customizable de-duplication strategy—reduces review time and costs using de-duplication strategy specific to a case;

customizable data view—easily organize sets of documents into batches, collections or views without duplicating data;

search results reporting—immediately test the responsiveness of keywords and queries, including both hit counts and unique counts;

custodian analytics—tracks custodian identities, documents and conversations automatically and provides instant visibility into the data population and review state of all custodian documents;

batching—analyze and tag content in a private space without the need to copy files or re-import;

connector management—configure, secure and monitor connectivity between the Virtual Governance Warehouse and content storage systems;

data area management—allows IT to define, manage and secure access into content storage systems;

index sharing—eliminates the need to index large volumes of documents multiple times;

Documentum connector—discover, analyze and govern content across systems, including Documentum;

export to relational databases—generate SQL-ready schema for joining virtual governance warehouse with structured business intelligence information;

LDAP and Active Directory integration—integrates with existing user and group management infrastructure, simplifying application management; and

policy-based access control—enables IT and application administrators to define rich security policies for all system and application level resources.

Back to Contents...

Precisely targeted content delivery

Author-it has launched Author-it Aspect, a Web-based application that dynamically delivers content to users based on their profiles. When the content is completed, it is published to the Author-it Aspect Server.

Author-it Aspect dynamically renders the content and displays the appropriate variants specifically for each user based on their profile settings. In this way, Author-it Aspect displays only the most relevant information for the particular user based on their specific needs.

Aspect works seamlessly with other Author-it family products and Microsoft SharePoint, providing an end-to-end solution from authoring, right through to content delivery.

Back to Contents...

Access Innovations Joins Cloud

The Data Harmony suite of products will now be available through SaaS and cloud computing technology, parent company Access Innovations announced. This model eliminates the need for companies to purchase the software and acquire appropriate support equipment and personnel up front. Access Innovations offers 90-day free trials of their Data Harmony software through remote access models. The cloud and SaaS versions are hosted entirely on the internet, freeing up server space and speed.

(www.accessinn.com)

Back to Contents...

Karsa Announces Fulltext Search Manager 2.0

Czech company Karsa Technologies released version 2.0 of its Fulltext Search Manager, including new search optimization tools and a revamped user interface. The desktop application is designed to increase accuracy and efficiency in full text searches. The new version includes new filters, a search debugger, and a prefix of stored procedures.

(www.karsa.eu)

Back to Contents...

ISIS Papyrus Becomes Newest OASIS Foundational Sponsor

The international open standards consortium, OASIS, welcomed process and CMS provider ISIS Papyrus as its newest Foundational Sponsor. The company joins IBM, Microsoft, Oracle, and Primeton in supporting the mission of OASIS.

Founded in 1988, ISIS Papyrus supports 2,000 corporate and government customers in more than 50 countries with integrated enterprise software for process and content management. The Papyrus architecture acts as the front-end for the business user to auto-discover processes and case work, enable business rules, and manage and optimize personalized correspondence.

In related news, OASIS also announced that Robin Cover, managing editor of the Cover Pages, will be speaking at the ISIS Papyrus Open House and User Conference 2010 in the Dallas area on May 24. Recipient of the XML Cup, Cover maintains a comprehensive online repository on structured information standards.

(www.isis-papyrus.com, www.oasis-open.org)

Back to Contents...

Rogers Digital Media Taps Brightcove and Endeca for Relaunch of Citytv Video Theatre

Rogers Digital Media and Brightcove Inc., an online video platform developer, announced they have partnered to deliver a newly re-designed Citytv video theatre featuring a variety of locally produced Citytv content and exclusive full-length episodes of popular television shows, including 30 Rock, The Bachelor, and The Biggest Loser. The new Citytv video initiative also utilizes search technology from Endeca Technologies to improve the site's overall search and navigation capabilities, as well as the speed at which content is delivered.

In re-launching on the Brightcove platform, Citytv now provides viewers throughout Canada with exclusive access to HD-quality, full-length episodes of popular U.S. and Canadian television programs. With Brightcove, Citytv is able to deliver the highest quality user experience possible, as well as to extend the reach of its video content through the platform's advanced social sharing capabilities. Endeca's search technology also allows Citytv to deliver advanced search and navigation capabilities across all of their video content.

(www.citytv.com, www.endeca.com, www.brightcove.com)

Back to Contents...
 