IBM’s Approach To Enterprise Search

Developing effective search tools for enterprise-level businesses is not the same as developing search for web-based documents. Websites and other web-related documents contain an inherent structure due to the nature of web links.

However, because corporate documents have no such natural structure and tend to be unique and unrelated to one another, indexing large volumes of business documents can be an arduous task. It’s this concept that drives Arthur Ciccolo, one of the chief developers of IBM’s Unstructured Information Management Architecture (UIMA) project.

While enterprise search may be considered a niche topic, many of the developments coming from UIMA, if applied to web-based search, could have incredible ramifications. However, this is not IBM’s goal. During a phone interview, Ciccolo stated in no uncertain terms that IBM’s goal for their search technology is the enterprise level, not web search.

To define UIMA and its function, IBM offers this:

Unstructured information represents the largest, most current and fastest growing source of information available to businesses and governments. An Unstructured Information Management (UIM) application may be generally characterized as a software system that analyzes large volumes of unstructured information (text, audio, video, images, etc.) to discover, organize and deliver relevant knowledge to the client or application end-user.

To accomplish this task, Ciccolo and his team are putting their efforts into developing different framework structures to perform text analysis, semantic comprehension, and natural language support. By doing so, IBM’s UIMA utilities can better perform the tasks of indexing and comprehending the different types of enterprise business documents.

To understand why enterprise search can be such a complicated undertaking, you must first understand the different types of unstructured data that have to be indexed. With web-based documents, the focus is much narrower because document types are more limited (html, pdf, fla, etc.) and they usually contain links, which lend themselves to easier indexing.

These off-page attributes provide the structure and make web indexing less complicated than the unstructured environment of business documents. While web documents are more confined, enterprise documents run the gamut from word processing documents to video and sound files, and the off-page attributes that provide structure are absent.

The capabilities of UIMA, some of which are still in the developmental stages, attempt to address these concerns. For instance, the focal point of UIM is text analytics and the semantics contained within the text. Once the contents of the text being indexed are understood, developing natural language search capabilities (“What is the formula for product X?”) becomes an attainable goal.

However, to understand the difficulties involved in developing a natural language search feature that works, consider this: the reason Microsoft will be releasing Longhorn without the search feature has to do with the challenge of developing a new file structure that supports natural search queries.

With UIMA, once a document is indexed, searching for it should be easier than it would be if it were web-indexed. Ciccolo indicated that once an item is indexed, their technology automatically generates editable metadata, which makes discovery much simpler. By integrating UIM technology with WebSphere Information Integrator OmniFind, IBM’s intranet search utility, they hope to usher in an exciting era of enterprise-related search.
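To make the idea of automatically generated, editable metadata concrete, here is a minimal Python sketch. It is not IBM’s API or UIMA’s actual method: the names DocumentIndex, IndexedDocument, generate_metadata, and find_by_keyword are hypothetical, and the “analysis” is a crude word-length filter standing in for real text analytics.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    """A document plus the metadata attached to it at indexing time."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def generate_metadata(text):
    """Hypothetical stand-in for text analysis: count the words and
    keep the longer ones (7+ characters) as crude keywords."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return {
        "word_count": len(words),
        "keywords": sorted({w for w in words if len(w) >= 7}),
    }


class DocumentIndex:
    """Toy in-memory index; assumed structure, not UIMA's."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        # Metadata is generated automatically when the document is indexed.
        doc = IndexedDocument(doc_id, text, generate_metadata(text))
        self.docs[doc_id] = doc
        return doc

    def edit_metadata(self, doc_id, key, value):
        # An administrator can overwrite any generated field afterwards.
        self.docs[doc_id].metadata[key] = value

    def find_by_keyword(self, keyword):
        return [d.doc_id for d in self.docs.values()
                if keyword.lower() in d.metadata.get("keywords", [])]


index = DocumentIndex()
index.add("memo-1", "Quarterly revenue forecast for the adhesive division.")
# Hand-edit the generated keywords to tighten what the document matches.
index.edit_metadata("memo-1", "keywords", ["revenue", "forecast", "adhesive"])
matches = index.find_by_keyword("adhesive")
```

In this sketch, discovery runs against the metadata rather than the raw text, so editing a document’s keywords changes which queries find it.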

Beyond UIMA’s current abilities, the potential future developments are quite impressive and, if adopted, could cause huge ripples in the search technology status quo. While the list of possible developments is fairly long, two of them could have far-reaching ramifications.

The first area of interest has to do with video search. Ciccolo and his team are developing video search methods that could revolutionize the whole concept. Most video indexing today is done by spidering the closed-caption text contained within the film. Ciccolo’s vision, however, is to analyze the picture itself and extract whatever relevant content it contains. This data would then be indexed and have metadata generated (which would be editable using IBM’s service), making retrieval methods even better.

The other area of interest has to do with providing the ability to perform trans-lingual queries. What drives this development is the following concept:

– User A enters a query in English
– UIMA translates the query into the target language
– UIMA then searches the target-language documents
– UIMA returns the search results in the language that initiated the query (in this case, English)
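The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the word-for-word dictionaries, the Spanish sample documents, and the function names translate and cross_lingual_search are all hypothetical, and nothing here reflects how UIMA actually translates or ranks results.

```python
# Hypothetical word-for-word dictionaries standing in for real translation.
EN_TO_ES = {"formula": "fórmula", "product": "producto"}
ES_TO_EN = {es: en for en, es in EN_TO_ES.items()}

# Hypothetical target-language (Spanish) document collection.
DOCS_ES = {
    "doc1": "fórmula del producto x",
    "doc2": "informe de ventas",
}


def translate(text, table):
    """Translate word by word; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.lower().split())


def cross_lingual_search(query_en):
    # Step 2: translate the English query into the target language.
    query_es = translate(query_en, EN_TO_ES)
    terms = set(query_es.split())
    # Step 3: search the target-language documents for any shared term.
    hits = [doc_id for doc_id, text in DOCS_ES.items()
            if terms & set(text.split())]
    # Step 4: return each hit translated back into the query language.
    return [(doc_id, translate(DOCS_ES[doc_id], ES_TO_EN)) for doc_id in hits]


# Step 1: the user enters a query in English.
results = cross_lingual_search("product formula")
```

The point of the design is that the user never sees the target language: the query goes in as English and the matching document text comes back as English, with translation happening on both sides of the search.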

Ciccolo indicated that during the testing phases, he discovered search results were actually more relevant after the translation took place, meaning UIMA performs the translation after the query is entered. If something similar were adopted by search as a whole, it could alter the entire landscape of possibilities.

Other features and ideas of interest include the ability for administrators to write their own search algorithms, which can be implemented on top of the existing framework. Speaking of admins, Ciccolo made sure to point out that UIMA was developed with IT departments in mind. The technology is purposely made to be easy to install and to tweak, which makes adapting the search technology to fit your business much easier.

Another area that the team is focusing on is the medical industry, where the ability to catalog and differentiate between the mountains of journals and documents overrunning medical institutions would be most welcome. Currently, IBM has an agreement with the Mayo Clinic to implement and test UIMA. If the medical field as a whole were to adopt such technology, finding patient information, journals on particular illnesses, and pharmaceutical information would be much easier. This in turn would likely improve the medical industry’s ability to treat and care for patients, as well as share information.

To explain the goal of the UIMA project, Ciccolo offers these thoughts: “IBM’s goal for UIMA is that it becomes widely accepted as a new class of middleware for analytics and that it enables the next generation of search: semantic search.”

For much more information about the project and other areas of IBM’s approach to enterprise search, please visit the following areas:

UIMA Homepage
The research proposal
About the authors
Information about OmniFind

While the subject can be tricky to navigate, I would recommend reading the journals and documentation of Art Ciccolo and his team.

Chris Richardson is a search engine writer and editor for Murdok. Visit Murdok for the latest search news.
