Unstructured data like e-mail far outpaces structured data in organizations, and IT departments increasingly have to search that unstructured data. Such efforts got a shot in the arm earlier this month when IBM announced its intent to open source its Unstructured Information Management Architecture, or UIMA. Vendors as well as end users stand to benefit from the technology, according to Big Blue's partners in the project.
Unstructured data lives up to its name. It does not reside in a corporate database, it does not use templates and the information users need is often buried in complex, free-form text. By combining text analytics tools with a common interface, UIMA strives to simplify the task of parsing unstructured data and making it structured. It uses those components to identify the language of documents, find words and roots of words, identify parts of speech, extract concepts and recognize relationships between words.
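The idea of chaining analytics components behind a common interface can be illustrated with a minimal Python sketch. This is not UIMA's actual API. The names Document, tokenize, guess_language, find_roots and run_pipeline are hypothetical, and the language check and root finder are deliberately crude stand-ins for real analytics components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical shared container: every component reads from and writes to it.
@dataclass
class Document:
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)

# The "common interface" in this sketch: a component is any callable
# that takes a Document and adds annotations to it.
Annotator = Callable[[Document], None]

# Tiny illustrative word list, not a real language model.
ENGLISH_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(doc: Document) -> None:
    """Find words in the raw text."""
    doc.annotations["tokens"] = re.findall(r"[A-Za-z']+", doc.text)

def guess_language(doc: Document) -> None:
    """Toy language identifier: any English stopword counts as English."""
    tokens = [t.lower() for t in doc.annotations["tokens"]]
    is_english = any(t in ENGLISH_STOPWORDS for t in tokens)
    doc.annotations["language"] = "en" if is_english else "unknown"

def find_roots(doc: Document) -> None:
    """Crude suffix stripping as a stand-in for real root finding."""
    roots: List[str] = []
    for token in doc.annotations["tokens"]:
        word = token.lower()
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        roots.append(word)
    doc.annotations["roots"] = roots

def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    """Run each component in order over the same document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc

doc = run_pipeline(
    "The vendors are parsing documents",
    [tokenize, guess_language, find_roots],
)
print(doc.annotations)
```

Because each component only depends on the shared Document, a vendor could in principle swap in a better root finder or add a part-of-speech step without touching the rest of the chain, which is the kind of interoperability the partners describe.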
By open sourcing the UIMA framework, IBM shows its support for a standard way to search, analyze and present data. This means small firms like ClearForest, a Waltham, Mass.-based text analytics tool provider, can focus on their core search technology instead of infrastructure and interoperability.
"What IBM can bring to the table is a standardization for this type of interface. They can muster the partner support and create and forge partnerships," said Jay Henderson, director of product marketing for ClearForest. "It validates the space that we're in to have a big vendor like IBM say that not only is this base emerging but it's important for companies."
Along with ClearForest, IBM's UIMA partners include Attensity, another text analytics firm, business news provider Factiva and business intelligence software maker Cognos.
Business intelligence tools produce both reports and forward-looking analyses. They have traditionally tapped into corporate databases, meaning that "the world of unstructured data was unusable," said Rupert Bonham-Carter, senior director of IBM alliances at Cognos. The tools could query unstructured text, he said, but in "no practical way."
Text analytics tools look at the relationship between an object and natural language, Bonham-Carter said. For example, they separate an e-mail into its sender, its subject and its body. In the process, the unstructured data gains context and becomes structured.
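The e-mail example can be sketched in a few lines of Python using the standard library's email module. The sample message, the addresses and the function name structure_email are made up for illustration, and the sketch assumes a simple single-part text message rather than anything a commercial tool would have to handle.

```python
from email import message_from_string

# Made-up sample message for illustration only.
RAW_EMAIL = """From: alice@example.com
To: bob@example.com
Subject: Q3 forecast

Sales in the northeast region look strong.
"""

def structure_email(raw: str) -> dict:
    """Split a raw single-part e-mail into sender, subject and body fields."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        # For a non-multipart message, get_payload() returns the body text.
        "body": msg.get_payload().strip(),
    }

record = structure_email(RAW_EMAIL)
print(record)
```

Once the message is a record with named fields, a reporting tool can filter or group on sender and subject the same way it would on columns in a database.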
"There is a lot of corporate knowledge that companies have trapped inside their unstructured data…that would be valuable to making decisions," Bonham-Carter said. With UIMA, he added, "There is a completely unlimited number of ways that people can deconstruct that text. Anything that makes unstructured data more structured is good for us because it's good for our customers."
Greg Gerdy, vice president of channel marketing at Factiva, said applying UIMA to the company's news-searching tools gives users a chance to look for patterns and trends among hundreds of news sources. A corporate communications division, for example, could research its own media placement and compare that to the attention competitors get.
"Because text mining is a relatively new commercial discipline, we're excited because this presents an opportunity to increase awareness and grow the market," Gerdy said.
With the latest version of its WebSphere Information Integrator OmniFind Edition, IBM claims the first commercially available software based on UIMA for content processing.