IBM releases search software to open source

The DARPA-backed UIMA (Unstructured Information Management Architecture) eases searching. WebSphere Information Integrator OmniFind Edition will first carry IBM's efforts to commercialize UIMA.

IBM this week said it would make its new Unstructured Information Management Architecture (UIMA) available as open source to application developers who want to probe text-based documents using more than keyword searching. The UIMA software, which was initially developed with funding from the U.S. Defense Advanced Research Projects Agency (DARPA), is said to uncover latent meanings and important contextual relationships in documents.

Besides its obvious use in surveillance to uncover hidden patterns and possibly identify nefarious activity, the software could be used by commercial organizations to comb through e-mails and technician repair notes to identify emerging product quality and safety problems.

With the latest version of its WebSphere Information Integrator OmniFind Edition, IBM claims the first commercially available software based on the UIMA for processing content. The move can be seen as part of an ongoing effort by the company to use new search technology to open the gates to unstructured corporate data, which, in sheer quantity, vastly surpasses structured or relational data.

UIMA-compliant text analytic components can use WebSphere Information Integrator OmniFind Edition to clarify the meaning of terms and garner useful business information, suggested Nelson Mattos, vice president of Information Integration, IBM. He described UIMA as a framework composed of software components with well-defined interfaces. These components can serve to identify the language of documents, find words and roots of words as traditional keyword-based engines do, identify parts of speech, extract concepts (or "entities") and recognize relationships.
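The idea of components with well-defined interfaces can be illustrated with a minimal Python sketch. This is not the UIMA API: the Document class, the annotator functions, and run_pipeline are hypothetical names chosen for illustration, and the heuristics are deliberately naive. It only shows how independent components might each add one layer of metadata to a shared document.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Document:
    """Hypothetical shared container: raw text plus metadata added by components."""
    text: str
    annotations: Dict[str, Any] = field(default_factory=dict)


# The "well-defined interface" assumed here: a component takes a Document
# and records its findings in doc.annotations.
Annotator = Callable[[Document], None]


def tokenize(doc: Document) -> None:
    # Find words, as a keyword-based engine would.
    doc.annotations["tokens"] = re.findall(r"[A-Za-z']+", doc.text)


def detect_language(doc: Document) -> None:
    # Toy heuristic (an assumption): a few common English words signal English.
    english_hints = {"the", "and", "of", "is"}
    words = {t.lower() for t in doc.annotations["tokens"]}
    doc.annotations["language"] = "en" if words & english_hints else "unknown"


def extract_entities(doc: Document) -> None:
    # Toy heuristic (an assumption): capitalized tokens after the first word.
    tokens = doc.annotations["tokens"]
    doc.annotations["entities"] = [
        t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()
    ]


def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    # Each component runs in order and builds on earlier annotations.
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    "The engineer at IBM praised the OmniFind release.",
    [tokenize, detect_language, extract_entities],
)
print(doc.annotations["language"], doc.annotations["entities"])
```

Because every component shares the same call signature in this sketch, a domain-specific annotator could be added to the list without changing the others.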

The components can allow software developers to "plug in industry expertise that will help extract valuable metadata about documents," Mattos said. That helps address the age-old problem of semantics: A term such as "rock" can stand for music, stones or motion, depending on the context that surrounds it. [A short-hand definition for 'metadata,' as used here, is: Data about data.]
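The "rock" example can be made concrete with a small Python sketch of context-based disambiguation. The cue-word lists and the disambiguate function are assumptions made for illustration; they do not represent how UIMA or OmniFind resolves word senses. The sketch simply counts how many cue words for each sense appear alongside the ambiguous term.

```python
import re

# Hypothetical cue words for each sense of "rock" (illustrative only).
SENSE_CUES = {
    "music": {"band", "guitar", "concert", "album"},
    "stone": {"granite", "quarry", "mineral", "boulder"},
    "motion": {"cradle", "chair", "sway", "boat"},
}


def disambiguate(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words present, the context gives no basis for a choice.
    return best if scores[best] > 0 else "unknown"


print(disambiguate("The band played rock at the concert"))
print(disambiguate("The quarry produced granite rock"))
print(disambiguate("Rock the cradle gently"))
```

In this sketch, industry expertise would correspond to supplying better cue lists for a given domain.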

Open-source standards are key to the promise of this form of semantically clever software. In the past, natural language processing faced hurdles, as domain experts were enlisted full-time to keep the engines up to date. PhDs were often required, as was the constant tweaking of components.

"OmniFind can find and understand the semantic meaning of words," said IBM's Mattos. He described UIMA and OmniFind as instances where sophisticated text analysis was going "mainstream."

"Because the system can 'understand' the facts in a document, we can dramatically improve the relevance of results," Mattos said. "That saves time for workers."

He also noted that because components can probe for underlying meaning, the system could, for example, analyze millions of call center records to uncover trends in maintenance contract pricing issues.

Asserts Mattos: "It is a tremendous breakthrough in the search space."
