EnterpriseSearchCenter.com

RESOURCES FOR EVALUATING ENTERPRISE SEARCH TECHNOLOGIES
April 27, 2011

Table of Contents

Antique Search Appliance Roadshow
Enterprise Search Summit Expands to Europe
Web content personalization
Getting a fix on SharePoint governance
Q-Sensei—a new enterprise search engine
Automotive software gets boost from MarkLogic
ZyLAB tackles sound files: E-discovery audio search

Antique Search Appliance Roadshow

About 50 years ago I fell in love with chemistry. I could think of no more interesting subject and spent 3 very enjoyable years at the University of Southampton. It was a time when the department was home to some exceedingly able researchers, and when I crept into the research meetings I heard phrases such as, "We are beginning to think ..." and "It's starting to look as though ..." as new techniques were being developed just down the corridor in the undergraduate labs.

As an information scientist, and not as a chemist, I've done a lot of work in the pharmaceutical sector, but it is only over the last few years, working with the e-delivery team at the Royal Society of Chemistry, that I have really come back to this beloved subject ... and found it almost unrecognizable! Techniques that were emerging in the 1970s are now commonplace and, indeed, almost antique.

There is a tendency in enterprise search, especially among IT managers, to assume that the current generation of search applications is now so powerful that they will never need to buy another one. If only life were that simple. The reality is that enterprise applications are already running out of power as the volume of information continues to increase. There are still so many challenging problems in search that it is hard to know where to start. Federated search might be a good place. Any enterprise collection of content is made up of multiple repositories, and users cannot be expected (though they often are!) to know in which repository the information they need resides.
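
The mechanics of federation are conceptually simple, even if the ranking problems are not. As a minimal sketch (the repositories, their contents and the scoring below are invented placeholders, not any vendor's API), a federated layer fans one query out to every repository, normalizes the scores and merges the results into a single ranked list:

  # Minimal federated-search sketch: fan one query out to several repositories
  # and merge the results. Repositories and scores are invented placeholders.
  from concurrent.futures import ThreadPoolExecutor

  class Repository:
      """Stand-in for a real back end (file share, intranet, document store)."""
      def __init__(self, name, docs):
          self.name, self.docs = name, docs
      def search(self, query):
          return [{"title": t, "score": s} for t, s in self.docs if query in t.lower()]

  def federated_search(query, repositories):
      with ThreadPoolExecutor(max_workers=len(repositories)) as pool:
          result_sets = list(pool.map(lambda r: r.search(query), repositories))
      merged = []
      for repo, results in zip(repositories, result_sets):
          top = max((r["score"] for r in results), default=1.0) or 1.0
          for r in results:
              # Normalize per-repository scores to 0..1 so they can be compared.
              merged.append((r["score"] / top, repo.name, r["title"]))
      return sorted(merged, reverse=True)

  repos = [Repository("intranet", [("Pension policy", 3.2), ("Travel policy", 2.1)]),
           Repository("records", [("Policy archive 2009", 7.5)])]
  print(federated_search("policy", repos))

The hard part in practice is the score normalization: each repository ranks on its own scale, and simple per-source scaling of the kind shown here is only a starting point.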

The challenges are difficult enough when dealing with just text, without adding in business intelligence and other database applications, plus access to external business and technical information resources. Users do not complain, because they do not know what is possible. Even more important, they do not know what will be possible in the near future as new search algorithms from new players in the market offer better solutions.

Another significant problem in search is that of synonyms and related terms. Of course, in theory, you could build vast directories, but the effort to keep them current would be enormous, and the latency of the lookup would also be significant. Coming to a desktop near you before long will be search tools based on topic modeling, which use Bayesian statistics and machine learning to infer the relationships between topics in a document. What is fascinating about this technique is that it dates from the development of latent semantic indexing in the late 1980s, which was then refined to give us probabilistic latent semantic indexing (PLSI) a decade later. PLSI is the core technology used by Recommind. Now the buzz is about latent Dirichlet allocation (LDA), which itself has formed the basis for correlated topic models (CTM) and dynamic topic models (DTM). Incidentally, Dirichlet was a German mathematician who died more than 150 years ago, which is an interesting reflection on the longevity of mathematical techniques.
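
To make the idea concrete, here is a toy illustration of LDA using the open source gensim library; the documents and parameter choices are invented for the example, and production systems of course train on far larger collections:

  # Toy LDA example with gensim: infer topics from a handful of documents.
  # The documents and the num_topics value are illustrative only.
  from gensim import corpora, models

  docs = [
      "enterprise search federated repositories indexing",
      "bayesian topic models machine learning inference",
      "search relevance ranking recall precision evaluation",
  ]
  texts = [doc.split() for doc in docs]

  dictionary = corpora.Dictionary(texts)                 # map terms to integer ids
  corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

  lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
  for topic_id, words in lda.print_topics():
      print(topic_id, words)

Each "topic" comes out as a weighted list of terms, which is exactly the kind of structure that lets a search engine relate documents that share no vocabulary at all.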

I think that's enough tech-speak for now. The point I am making is this: There is a significant amount of research being undertaken to find relevant information in very large collections of documents. Much of this research is being funded and implemented by national security agencies, which might inhibit the speed with which it becomes commercially available, but a check through Google Scholar will show that research teams at Google and Microsoft are doing a lot of work in this area.

Another search challenge is in assessing search engine performance and, in particular, search recall. Precision is a measure of the percentage of retrieved documents that are relevant, and that is relatively easy to determine. Recall is a measure of the percentage of all relevant documents that are retrieved, and that in theory requires knowledge of how many relevant documents there are in a collection. One approach is to use a test collection, but that is not a real-world option. There is now a lot of interest in using crowd-sourcing techniques to assess recall, in particular the Mechanical Turk service developed by Amazon.com (www.mturk.com). Again, this technique is still in the experimental stage, but it could be of significant value both to search vendors seeking to improve search performance and to organizations wishing to compare search applications.
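
For readers who want the definitions pinned down, a small worked example with invented document sets shows the two measures, and why recall is the harder one to know:

  # Worked example of precision and recall over invented document sets.
  retrieved = {"d1", "d2", "d3", "d4"}          # what the engine returned
  relevant  = {"d2", "d3", "d7", "d8", "d9"}    # what is actually relevant (rarely known!)

  true_positives = retrieved & relevant
  precision = len(true_positives) / len(retrieved)   # 2/4 = 0.50
  recall    = len(true_positives) / len(relevant)    # 2/5 = 0.40

  print(f"precision={precision:.2f} recall={recall:.2f}")

Precision only needs a judgment on what was returned; recall needs the full relevant set, which is precisely what crowd-sourced assessment tries to approximate.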

To get some indication of the range of current information retrieval research, go to the ACM SIGIR site at www.sigir.org. Much of this research could be commercially available in the next 3 to 5 years. How could you make use of it, assuming that your current search vendor is able to take advantage of these significant advances in search effectiveness? Certainly a member of your search support team should be tracking and evaluating information retrieval research. In 10 years' time I'm certain that today's search technology will look very antiquated.

Back to Contents...

Enterprise Search Summit Expands to Europe

Join us in London this October for two days of plenary and panel sessions, technical and implementation tracks, and case studies from corporate, public sector and not-for-profit organisations, all supported by a range of networking opportunities to promote debate and dialogue and help you learn from your peers.

Topics covered include:

  • multilingual search
  • open source search applications
  • federated search
  • search centres of excellence
  • search business case development
  • mobile search
  • SharePoint search
  • technology trends
  • enterprise search analytics
  • search based applications

Enterprise Search Europe, 24-25 October 2011, Hilton London Olympia.

Back to Contents...

Web content personalization

EPiServer has unveiled a new version of its platform that combines personalization across content, commerce, communication and community. The new version is said to permit interactive marketers to deliver targeted and personalized content on Web sites, commerce sites and online communities based on users’ demographic information and online behavior.

The company adds that unique and relevant content can be served to visitor groups based on virtual roles, such as potential or current customers, new or returning visitors, press, influencers/analysts, job seekers and others created and customized by the marketer. Demographic and behavioral data, such as location (geo-IP), number of visits, pages viewed, referrers and search terms, are used to identify visitors and assign them to the predefined roles.
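
A rule of this kind is easy to picture in code. The sketch below is purely illustrative: the role names, signals and thresholds are invented, and it does not show EPiServer's actual implementation.

  # Illustrative visitor-group assignment from simple demographic and
  # behavioral signals. Role names and thresholds are invented examples.
  def assign_roles(visitor):
      roles = []
      roles.append("new visitor" if visitor.get("visits", 0) <= 1 else "returning visitor")
      if visitor.get("country") == "SE":
          roles.append("Swedish market")
      if "pricing" in visitor.get("pages_viewed", []):
          roles.append("potential customer")
      if "press release" in visitor.get("search_terms", []):
          roles.append("press")
      return roles

  print(assign_roles({"visits": 3, "country": "SE",
                      "pages_viewed": ["home", "pricing"],
                      "search_terms": []}))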

Back to Contents...

Getting a fix on SharePoint governance

Axceler has released ControlPoint 4.2, its SharePoint administration solution, which includes interactive analysis and reporting, enhanced policy enforcement, and new permission and security capabilities.

Axceler elaborates:

  • new interactive analysis and reporting capabilities based on Microsoft Silverlight, including the ability to manipulate report data, create custom views and generate graphical and tabbed reports;
  • comprehensive analysis of SharePoint 2010 Managed Metadata usage in lists/libraries;
  • additional ControlPoint policies that intercept user actions, such as creating content, so managers can block them before they happen, a critical part of a solid SharePoint governance plan; and
  • enhancements for permissions and security including support for claims-based authentication.

Back to Contents...

Q-Sensei—a new enterprise search engine

Q-Sensei has introduced its Enterprise Search Platform, which has been engineered to give businesses a real-time view of all their data, no matter the source or format, in one simple interface.

The platform analyzes and processes both structured and unstructured data from any source, be it databases, document servers, SharePoint, CRM or even Internet-based information or social media feeds such as Twitter and Facebook. Q-Sensei says its search capability gives business users a secure 360-degree view of all relevant data for business analytics, statistical analyses, as well as media and market-trend tracking.

Back to Contents...

Automotive software gets boost from MarkLogic

Mitchell 1, a division of Snap-on, wanted a technology that would process information from multiple sources in different formats while keeping it searchable. The solution would become the database behind Mitchell 1’s OnDemand5 software suite, which provides comprehensive data for vehicle models from 1983 to the present and is used by automotive repair shops of all sizes.

The company realized that its relational architecture was too slow and needed to be updated to offer speed and flexibility, as well as the ability to manage the huge amount of unstructured data that Mitchell 1 encounters each day. The company chose MarkLogic to provide a simplified architecture to meet its goals and appropriate performance standards.

Mark Zecca, senior director of IT for Mitchell 1, says, “Originally all of this information used to be kept in books. Initially we digitized it into a relational database system as our unstructured information grew, but the RDBMS couldn’t keep up. That’s why we turned to MarkLogic for the next evolution of our product. The relational database solution we used couldn’t offer the granularity in search that our customers wanted, and the speed just wasn’t there.”

MarkLogic reports that its technology provides Mitchell 1 with a searchable database that delivers service, repair and diagnostic information to technicians in seconds. It can also integrate content from different suppliers, with different data types. As a purpose-built database for unstructured information, the solution can load all data “as is,” make it available quickly and allow Mitchell 1 to query the data and trust that the results are accurate, according to MarkLogic.
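
As a rough illustration of the "load as is, then query" pattern, the sketch below uses MarkLogic's documented REST endpoints for document insert and search; the host, port, credentials and document content are placeholders, and Mitchell 1's actual integration is naturally far more involved:

  # Rough sketch: load a document "as is" and query it via MarkLogic's
  # REST API (/v1/documents and /v1/search). Host, port, credentials and
  # the document content are placeholders.
  import requests
  from requests.auth import HTTPDigestAuth

  BASE = "http://localhost:8000"
  AUTH = HTTPDigestAuth("rest-writer", "password")

  # Load a repair procedure document without up-front schema design.
  doc = {"model": "1998 example sedan", "system": "brakes",
         "procedure": "Replace front brake pads ..."}
  requests.put(f"{BASE}/v1/documents", params={"uri": "/repair/brakes-001.json"},
               json=doc, auth=AUTH).raise_for_status()

  # Full-text query across everything that has been loaded.
  resp = requests.get(f"{BASE}/v1/search",
                      params={"q": "brake pads", "format": "json"}, auth=AUTH)
  print(resp.json()["total"])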

Back to Contents...

ZyLAB tackles sound files: E-discovery audio search

ZyLAB has unveiled its Audio Search Bundle, a desktop software product engineered to identify relevant audio clips from multimedia files and from business tools such as fixed-line telephone, VoIP, mobile and specialist platforms such as Skype or MSN Live. It is designed for technical and non-technical users involved in legal disputes, forensics, law enforcement and lawful data interception to search, review and analyze audio data with the same ease as more traditional forms of electronically stored information (ESI).

ZyLAB says Audio Search Bundle transforms audio recordings into a phonetic representation of the way in which words are pronounced, so that investigators can search for dictionary terms as well as proper names, company names or brands without the need to “re-ingest” the data.
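
The phonetic idea itself is easy to demonstrate. The toy below uses a simplified Soundex-style encoding, a classic phonetic algorithm chosen only for illustration and not ZyLAB's actual technology, to show how differently spelled names can match on pronunciation:

  # Toy phonetic matching with a simplified Soundex-style code.
  # Illustrates the general idea only; this is not ZyLAB's algorithm.
  CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
           **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}

  def soundex(word):
      word = word.lower()
      first = word[0].upper()
      digits = [CODES.get(ch, "") for ch in word]
      out, prev = [], digits[0]
      for d in digits[1:]:
          if d and d != prev:      # drop repeated codes and vowel positions
              out.append(d)
          prev = d
      return (first + "".join(out) + "000")[:4]

  # "Smith" and "Smyth" encode identically, so a phonetic index matches both.
  print(soundex("Smith"), soundex("Smyth"))  # S530 S530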

With the ZyLAB Audio Search Bundle, forensic investigators and attorneys can identify and collect audio recordings from various sources with far greater efficiency and effectiveness than was ever possible with manual processing. The software supports multiple search techniques simultaneously, such as Boolean and wildcard, leading to greater accuracy and relevance of results. The fast, iterative search helps to reduce the size of the data set and the costs for review.

The ZyLAB Audio Search Bundle supports all industry-standard audio formats, including G711, GSM 6.10, MP3 and WMA, as well as the audio component of video files. The bundle is available with the ZyLAB eDiscovery & Production System, which is fully aligned with the Electronic Discovery Reference Model (EDRM), or with any other ZyLAB system.

Back to Contents...
 