Enterprise Search Center

RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES

April 15, 2009

Table of Contents

Pictures Worth a Thousand . . . Finding Nontextual Assets Within the Enterprise

Getting to the point

Google Announces Google Voice

Visual Evidence/E-Discovery Releases VeREVIEW 3.0

MarketResearch.com Launches Profound 2.0

dtSearch Corp. Announces New Product Line

Sherpa Software Launches New Version of Solution for Microsoft Exchange Users

Dialog and QUOSA Team Up

iStockphoto Adds 30,000 Audio Tracks

BA-Insight Offers New International Operations

Vivisimo Releases New Multimedia Outreach, Stan

Search goes to the mountaintop

Noodle V6.7.10 Released

Minnesota Uses Vivisimo’s Velocity Search Platform

Pictures Worth a Thousand . . . Finding Nontextual Assets Within the Enterprise

As if finding text-based content wasn’t already challenging enough, the increasing number of graphics, audio, video, and other nontextual assets within organizations presents yet another enterprise search conundrum. Here we’ll explore the intersection of enterprise search and digital asset management (DAM), as well as various ways of approaching the challenge of finding nontextual assets.

Digital asset management—the technology and discipline that refers to the management of (mostly) nontextual assets—is one of the fastest growing areas of content technology. DAM technology originally focused on anything that wasn’t purely text (often called "rich media"). Rich media can also refer to the information contained in the media itself: for example, a video containing visual and oral information. Because much of that information isn’t text, software may have difficulty finding the content without human assistance.

Rich media poses many findability challenges:

It can be large and unwieldy. Some video files are terabytes in size. Size affects all aspects of working with the media: storage, processing, movement, and distribution, as well as search and retrieval.

Because rich media is visual or linear and time based, it introduces new challenges for how it’s identified, presented, searched for, manipulated, and segmented. A search tool can’t read and analyze it in the same way it can a text document.

Rich media tends to lend itself to the creation of derivative works. Think of a brochure that contains several images or a 1-hour news broadcast, 1 minute of which may need to be found and used later.

Because rich media is nontextual, it requires additional textual information (metadata) describing the media to accompany it either directly (as part of the file), within a repository, or in both places.

Though we talk a lot about the importance of metadata and search, the reality is that most (though not all) search technology does a decent job of analyzing text documents and identifying relevant keywords and metadata, facilitating findability later on. The picture is much less pretty when it comes to the retrieval of nontextual assets. Metadata is essential to finding nontextual assets: without it, access to your rich media is nearly impossible or terribly unwieldy.

Unlocking the Metadata

DAM technologies can do a lot to facilitate the creation of metadata around digital assets, including extracting metadata from assets as they’re ingested into the system. This is what makes the difference between a run-of-the-mill rich media file and a true digital asset, which can be found, monetized, and reused.

If something enters the system totally bereft of metadata, information about the object will be extracted, either from the object directly (via rules created beforehand) or from the person who manually uploaded the object. DAM offerings vary greatly in sophistication with regard to metadata handling, but the point is, nothing gets into a DAM system without some kind of metadata association or extraction taking place.

The information could be embedded in the file as attributes in various standard formats, including IPTC, EXIF, and XMP; as Microsoft Office file implicit metadata, such as author, date, and modified by; or as (in the case of video) scene changes, keyframes, time codes, closed-caption text, or possibly through speech-to-text conversion that indexes text back to points in the video.

Some DAM systems use third-party search engines to perform this task: for example, Autonomy, Microsoft’s FAST Search & Transfer solution (which has a "multimedia miner"), or an open source search engine such as Lucene (heavily adapted). You need to be sure that if your metadata includes information stored in standard formats, you can extract and store it in a format where the search tool can index it. More generally, you need to understand which information in the file the enterprise search tool or DAM system can extract for a given file type and which it cannot.
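To make the extract-and-store step concrete, here is a minimal sketch in Python. It assumes the asset carries an embedded XMP packet with Dublin Core fields and that a byte scan for the `x:xmpmeta` element is good enough to locate it. The function name and the choice of fields are illustrative assumptions, not the behavior of any particular DAM or search product.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the XMP packet sits in the file as a plain XML island.
XMP_PATTERN = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_xmp_fields(data: bytes) -> dict:
    """Return Dublin Core values from an embedded XMP packet as plain lists.

    The result is a flat dict of strings that a search tool could index.
    An empty dict means no packet was found.
    """
    match = XMP_PATTERN.search(data)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    fields = {}
    for name in ("title", "creator", "subject", "description"):
        values = []
        for element in root.iter(f"{{{DC}}}{name}"):
            # XMP stores values as rdf:li items inside Alt, Bag, or Seq.
            for item in element.iter(f"{{{RDF}}}li"):
                if item.text and item.text.strip():
                    values.append(item.text.strip())
        if values:
            fields[name] = values
    return fields


# Hypothetical asset: stand-in image bytes wrapped around an XMP packet.
sample = (
    b"\xff\xd8fake-image-bytes"
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">Soda can</rdf:li>'
    b"</rdf:Alt></dc:title>"
    b"<dc:subject><rdf:Bag><rdf:li>beverage</rdf:li><rdf:li>logo</rdf:li>"
    b"</rdf:Bag></dc:subject>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b"more-image-bytes\xff\xd9"
)
print(extract_xmp_fields(sample))
```

A production system would more likely rely on a dedicated metadata library, which also handles EXIF and IPTC blocks that this sketch ignores.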

What it boils down to is that, in most cases, rich media search is simply about anything that can be converted to text, using audio-to-text conversion or optical character recognition (OCR). Images that can’t be OCR’d are still extremely troublesome.

Reading Images and Video

Let’s once again use a video asset as an example. Because a video asset is rich in visual and audio content, video ingestion and automated metadata extraction are complicated and involved. Video ingestion typically includes the following automated steps:

Extraction of file information, as would occur with any file
Extraction of keyframes and their associated time codes
Extraction of closed-caption text, if it exists, and indexing that text back to the time code and frame in which it occurs
Optional conversion of speech into text, specifically into a text transcript that indexes the text back to the particular frame in which the spoken words occur, thereby enabling full-text search to locate sections of relevant video
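The last two steps, indexing caption or transcript text back to time codes, can be sketched as a small inverted index. This is an illustration only: the segment format (start time in seconds plus text), the function names, and the all-words-must-match rule are assumptions, not a description of how any DAM product stores its index.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(segments):
    """Map each word to the start times of the segments that contain it.

    `segments` is an iterable of (start_seconds, text) pairs, such as
    closed-caption lines or speech-to-text output.
    """
    index = defaultdict(list)
    for start, text in segments:
        for word in set(WORD.findall(text.lower())):
            index[word].append(start)
    return index


def find_timecodes(index, query):
    """Return sorted start times of segments containing every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Hypothetical caption segments from a news broadcast.
segments = [
    (0.0, "Good evening and welcome"),
    (12.5, "The budget vote is tonight"),
    (47.0, "More on the budget after the break"),
]
index = build_index(segments)
print(find_timecodes(index, "budget"))
```

With an index like this, a hit returns a point in the video rather than the whole file, which is what lets an editor jump to the relevant minute of a 1-hour broadcast.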

Despite the advances in creating metadata through audio/phonetic recognition, these approaches continue to be relatively rare and don’t always provide reliable results. Anyone who’s watched the closed-captioned text of a live news broadcast while on a treadmill at the gym knows that gaffes are common. Systems tend to fail with things like homonym resolution and brand names ("Xanax") that aren’t in dictionaries and don’t follow standard pronunciation rules.

As with any speech recognition software, accuracy improves when the software is trained against a particular voice. So, if the system has been trained on Jon Stewart’s voice, you can search The Daily Show archives for spoken occurrences of a particular word and get much more accurate results than you would if you searched for the same word spoken by someone else on his or another show.

For graphics and images, some tools can "read" images and look for similarities to other images (a form of pattern recognition). Interwoven’s MediaBin has long offered image similarity search, which allows users to search for assets that look like a selected asset (e.g., find everything that has the Coca-Cola logo or looks like a soda can). You can use a file picker to designate an image and then select a target folder to search; MediaBin will try to find assets that look like the designated image.
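One simple way to approximate "looks like" matching is an average hash: reduce each image to a bit pattern of brighter-than-average pixels and compare the patterns. The sketch below is a generic illustration of that technique, not MediaBin's method. It assumes images have already been downscaled to small grayscale grids of equal size, and the function names and distance threshold are hypothetical.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_similar(target, candidates, max_distance=1):
    """Return names of candidate images within max_distance of the target."""
    target_hash = average_hash(target)
    matches = []
    for name, pixels in candidates.items():
        distance = hamming_distance(target_hash, average_hash(pixels))
        if distance <= max_distance:
            matches.append((distance, name))
    return [name for _, name in sorted(matches)]


# Hypothetical 2x2 grayscale grids standing in for downscaled images.
logo = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]   # same pattern, lighter exposure
inverted = [[245, 55], [35, 225]]   # opposite pattern
print(find_similar(logo, {"brighter": brighter, "inverted": inverted}))
```

Because the hash depends only on which pixels sit above the mean, a uniformly lighter copy still matches, while an image with a different layout does not.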

Why Bother?

Despite the challenges involved, the benefits of improving rich media findability are numerous and, in some cases, necessary. For many brand managers and broadcasters, it’s often the associated metadata that indicates what rights someone has to reuse a certain asset. Video speech-to-text and subsequent search can also be used for call centers, either to search archived calls (for training) or to do searches "on-the-fly" (as the customer talks, the search engine already starts looking for answers).

In government and law enforcement, image recognition has long been used for criminal investigations, to identify everything from fingerprint patterns to facial characteristics. In such cases, the findability of nontextual assets is just as important as, if not more important than, that of textual ones. Perhaps the need to identify and find nontextual assets isn’t as vital for your organization. Still, it’s better to proactively strategize now about how it might be vital in the future.

About the Author

THERESA REGLI is principal with vendor-independent analyst firm CMS Watch, covering digital asset management and enterprise search. She is co-author of "The Digital & Media Asset Management Report 2008." CMS Watch analysts Kas Thomas and Adriaan Bloem contributed to this article.

Getting to the point

Concept Searching has released Version 4 of its flagship product, conceptClassifier for SharePoint. The company says features include a new installer that enables installation in a SharePoint environment in less than 20 minutes and requires no programmatic support; all functionality can be turned on or off using standard Microsoft SharePoint controls. Full integration with Microsoft Content Types and greater support for multiple taxonomies are also included in this release.

Content Types can be used to enforce metadata governance, adhere to policies, and drive workflows in line with business processes. Included in the new release is the ability to assign taxonomies to specific Content Types. Documents that correspond to the selected Content Types will be classified; documents that do not correspond to a Content Type, or that lack some of the metadata elements a Content Type specifies, will not be classified.

This capability allows different taxonomies to be assigned to different Content Types; for example, assign the HR taxonomy to all Content Types of type "HR," including any Content Types derived from "HR" and assign the Finance taxonomy to all Content Types of type "Finance," including any Content Types derived from "Finance." The configuration can be performed using a wizard that runs inside SharePoint. The taxonomies will be available for these documents regardless of their location.
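The inheritance rule described here, where a taxonomy assigned to "HR" also applies to Content Types derived from "HR", can be sketched as a walk up the parent chain. This is a conceptual illustration only. The data structures and function name are hypothetical and are not part of conceptClassifier or the SharePoint API.

```python
def resolve_taxonomy(content_type, parents, assignments):
    """Return the taxonomy for a content type, or None if none applies.

    `parents` maps each content type to the type it derives from;
    `assignments` maps content types to taxonomy names. The nearest
    ancestor with an assignment wins.
    """
    seen = set()
    current = content_type
    while current is not None and current not in seen:
        if current in assignments:
            return assignments[current]
        seen.add(current)  # guard against accidental cycles
        current = parents.get(current)
    return None


# Hypothetical content type hierarchy and taxonomy assignments.
parents = {
    "HR Policy": "HR",
    "HR": "Document",
    "Invoice": "Finance",
    "Finance": "Document",
    "Memo": "Document",
}
assignments = {"HR": "HR taxonomy", "Finance": "Finance taxonomy"}

print(resolve_taxonomy("HR Policy", parents, assignments))
```

A document whose content type resolves to no taxonomy, such as "Memo" in this sketch, would simply go unclassified.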

Google Announces Google Voice

Google has announced plans to offer a free unified phone service. The "Google Voice" service will offer users a universal phone number capable of routing calls through to mobile devices, landlines, and desk phones. Google Voice allows users to listen to voicemail messages and choose to have new voicemails automatically transcribed. Users can then see transcriptions in their inbox and search their voicemails. Transcriptions are also included in email and SMS notifications. When somebody sends an SMS to a user's Google Voice number, that SMS will be relayed to forwarding cell phones and stored in the user's inbox. Users can reply from a computer or from any mobile phone, and the conversation will be saved in their inbox. Users can then read through the conversation thread and search for past messages.

(www.google.com)

Visual Evidence/E-Discovery Releases VeREVIEW 3.0

Visual Evidence/E-Discovery, LLC, a provider of e-discovery solutions and services, announced the release of VeREVIEW 3.0, a web-based review tool for electronically stored information (ESI). Advanced searching and reporting capabilities have been incorporated into this new version. Web 2.0 and Texis Thunderstone technology have also been integrated into VeREVIEW 3.0. VeREVIEW 3.0's Boolean full-text search feature allows for concept searching with a built-in thesaurus for customized search capability. Uni-character technology (UTF-8) provides full multilingual support to VE's upgraded review platform. Single Pane technology gives reviewers the option to view and code documents in TIFF, HTML, native file, or text formats without having to exit the designated review window. Clients can also use VeREVIEW 3.0 as their case-management system.

(www.vevidence.com)

MarketResearch.com Launches Profound 2.0

MarketResearch.com, Inc. released the latest version of the Profound market research database with more content and a new search engine. Profound provides searching capabilities and helps users find accurate and targeted research. MarketResearch.com acquired Profound 18 months ago from Thomson Business Intelligence. MarketResearch.com, Inc. has a collection of continuously updated market intelligence and offers business professionals industry-specific insights on publishers and market research.

(www.marketresearch.com)

dtSearch Corp. Announces New Product Line

dtSearch Corp., a supplier of enterprise and developer text retrieval software, announced extensions to its 64-bit developer product line. The new release covers both dtSearch's enterprise and developer products, including native 64-bit versions. For the developer products, the new release provides expanded sample code for use with Microsoft's most recent Visual Studio version. For the enterprise products, the new release updates the user interface, providing a greater selection of "look and feel" options for users. Version 7.6 includes dtSearch Desktop with Spider, which searches files on a PC. dtSearch Network with Spider searches across a network. Both search and display, with highlighted hits, a variety of file types, including email messages along with the full text of email attachments. dtSearch Web with Spider publishes a large volume of searchable data to an IIS Internet or Intranet site. dtSearch Publish enables users to easily publish instantly searchable document collections or website content to portable media (CDs, DVDs, external hard drives, etc.).

(www.dtsearch.com)

Sherpa Software Launches New Version of Solution for Microsoft Exchange Users

Sherpa Software, an e-discovery and archiving software company, announced the release of version 3.5 of Sherpa Software Archive Attender, its email management solution. The latest version of Archive Attender includes a new console with a graphical interface. Other new features include expanded stub management, archive limits, new search and indexing utilities, and improvements to web-based archive search functionality. Sherpa Software’s Archive Attender is an administrator-driven solution that assists companies in addressing e-discovery, storage, information security, and compliance requirements related to regulations such as SOX, the new amendments to the Federal Rules of Civil Procedure (FRCP), and others.

(www.sherpasoftware.com)

Dialog and QUOSA Team Up

Dialog, a provider of online information services for professional searchers, and QUOSA will enable shared customers to explore full text and search results from Dialog's Datastar platform using QUOSA's document management tools. The integration with Datastar is the first step in a larger relationship that will include Dialog platform products later this year. Under the agreement, Dialog's Datastar customers who also subscribe to QUOSA will be able to use QUOSA's Information Manager for article retrieval and full-text search on the corpus of articles.

(www.dialog.com)

iStockphoto Adds 30,000 Audio Tracks

iStockphoto, a stock multimedia destination, announced its new audio collection. iStock's Standard Audio collection includes over 11,000 royalty-free, user-generated sound effects and music tracks. iStock also introduced a new Pump Audio collection of over 18,000 single-production music tracks. iStock now offers stock imagery, video footage, vector illustrations, Flash files, and audio for purchase under a single payment model, on one site. Music and sound effects tracks can be used in a variety of creative projects, from Web video to TV shows, movies, commercials, public performances, and even games.

(www.istockphoto.com)

BA-Insight Offers New International Operations

BA-Insight, an enterprise information access company, announced it has opened two offices in Europe, in London and Copenhagen. The new locations were opened to address an increase in global demand for its Microsoft-based enterprise search products.

(www.ba-insight.net)

Vivisimo Releases New Multimedia Outreach, Stan

Vivisimo, a provider of enterprise search, introduced Stan, a new multimedia outreach. Through a series of videos, blogs and other social media, Stan illustrates the use of enterprise search. The Meet Stan campaign includes a website (meetstan.com), videos, and social networking sites including Facebook and Twitter. Stan’s videos, available on meetstan.com and YouTube, describe different aspects of enterprise search. After an introductory video, the initial series will focus on the three critical dimensions of search—discovery, personalization and collaboration. Stan also has a blog on the website where he discusses his search and collaboration dilemmas and how enterprise search can help.

(www.vivisimo.com)

Search goes to the mountaintop

Endeca has introduced the McKinley release of the Information Access Platform. Named after North America’s highest peak, the release is based upon what Endeca calls a fundamentally new architecture for building standards-based search applications. The new platform is built for rapid development and maintenance of search applications that offer Endeca’s Guided Navigation user experience across the full range of structured and unstructured enterprise data. Endeca has also announced two search-based solutions, the Endeca Commerce Suite and the Endeca Publishing Suite.

Sue Feldman, research VP, Content Technologies Group, IDC, says, "Endeca's new McKinley platform demonstrates the migration of search applications to the next wave of high performance search. Publishers will like the more automatically generated facets that enable browsing on the fly. Online commerce sites will be quick to utilize the parametric searches and the tools for merchandising. This is a good example of a scalable search plus browse platform that can unify access to multiple sources of both data and content. Partners will like the speed with which they can deploy new applications. The system is designed with usability in mind as well. These are the key ingredients that a modern search platform should offer."

Endeca claims the McKinley platform replaces traditional search engines that make building search applications too costly or difficult, and complements traditional enterprise applications by pulling together large volumes of diverse, changing information to support decision-making. The new platform adds more than 100 new features since its previous release, including a redesigned engine that delivers new capabilities and sets the foundation for continuing advancements in simplicity, speed and scale.

Highlights of the McKinley release include:

New MDEX Engine technology for bigger, faster search applications with lower total cost of ownership. Endeca redesigned the core data storage architecture of the engine, leveraging innovative approaches for exploiting 64-bit memory architectures to access massive data volumes at unprecedented speeds. The redesigned engine is optimized to unlock the scale and performance made possible by the disruptive computing power available from today’s rapidly evolving multicore CPU platforms.

Simplicity. Partners are building new plug-in features and applications. The MDEX Engine technology is now extensible through an open, XQuery-based Web services stack, and ships with a WS-I compliant SOAP service for easy enterprise interoperability. These new capabilities allow for easy development of entirely new search applications as well as plug-in UI features, called cartridges.

Speed. Interactive response on diverse, changing data with less hardware. The new Continuous Query feature processes queries and data updates concurrently with no downtime, and Rapid Updates simplify and accelerate updates by allowing near real-time updating of the underlying engine. Together, these offer rich functionality while reducing total cost of ownership.

Scale. Tens of millions of records per box. The new MDEX Engine technology can now handle more data on the same hardware--up to a 100 percent increase, depending on the characteristics of the information. This means sub-second response on millions of documents and tens of millions of records on a single box.

Noodle V6.7.10 Released

Vialect announced the release of Noodle V6.7.10. This latest release focuses on allowing the user to categorize, search for, and organize content. Enhancements include: items within Noodle may now have tags assigned to them, tag portlets can now be added to portlet pages to show new and popular tags, the search results page has now been improved to provide more information on search results, local-installation Noodle sites will now have direct CSS access, and the new link manager interface gives users more flexibility for adding links and also for comments and ratings.

(www.vialect.com)

Minnesota Uses Vivisimo’s Velocity Search Platform

Minnesota is working with Vivisimo, a provider of enterprise search, in an effort to improve usability and findability in state web sites and to develop a new web hosting service for state agencies. The Minnesota Office of Enterprise Technology’s web site will utilize the integration of Vivisimo’s Velocity Search Platform into a new content management system. The project will show the capabilities of Velocity as well as the content management system, with the ultimate goal of indexing state content, providing federated search of external web sites, and making the information available through a search box on public-facing sites and on intranets and extranets for state employees and the public. The new site powered by Velocity will allow residents to find and access license applications and renewals online.

(www.vivisimo.com)
