CAREER
Towards a Text-Centric Database Management System
Panagiotis G. Ipeirotis
Unstructured text data is ubiquitous and, not surprisingly, many users and applications rely on textual data for a variety of tasks. The current paradigm for handling text data, popularized by search engines, is essentially a keyword "lookup" operation, followed by a sophisticated ranking of the results. There is very limited support for "structured" queries, no support for queries that need to combine information from multiple sources, and no support for queries that involve a semantic element. The proposed research aims to design and build algorithms and systems that allow users to ask complicated questions over unstructured text and get concrete answers, thus enabling users to spend less time searching for information and more time analyzing and understanding the results. However, making the leap from small-scale information extraction applications to sophisticated, web-scale query processing requires fundamental advances in a number of areas. The proposed research is based on two interrelated research streams:
- Cost- and Quality-based Query Optimization for Text-Centric Tasks (SQoUT Project): The availability of hundreds or thousands of information extraction systems allows the execution of many complex queries over the web. With billions of available documents, it is crucial to find efficient methods for optimizing the execution of such queries and to have a query optimizer that automatically chooses the best execution plan for a given task, returning the desired answer in the fastest way possible. Of course, extracting structured information from unstructured text is inherently a noisy process, and the returned results do not have perfect "precision" and "recall" (i.e., they are neither fully correct nor complete). The proposed research will focus on making the quality of the results an integral part of the query optimization process. The goal is to enable users to specify the desired result quality and to have the optimizer automatically choose the appropriate extraction system(s), the appropriate configuration for each system, and the appropriate execution plan for the given task.
- Economic-aware Query Processing over Sentiment Data (EconoMining Project): An important application of information extraction systems is opinion extraction. Deriving the semantic orientation and strength of opinions is an important research topic that has attracted significant attention over the last few years. Current approaches ignore the context in which an opinion is evaluated and have trouble estimating the opinion strength. The proposed research will take into consideration the economic context for evaluating the effect of an opinion. The proposed approach will combine established techniques from econometrics with text mining algorithms to identify the "economic value of text" and assign a "dollar value" to each piece of text, quantifying sentiment objectively, in a context-aware manner.
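To make the quality-aware optimization goal of the SQoUT stream concrete, the following is a minimal sketch, not the project's actual optimizer. It assumes each candidate execution plan already comes with estimates of its execution time, precision, and recall, and it simply picks the fastest plan that meets the user's quality targets. The `Plan` class, the `choose_plan` function, the plan names, and all numbers are hypothetical and used for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """A hypothetical execution plan with estimated cost and output quality."""
    name: str
    est_time_sec: float   # estimated execution time in seconds
    est_precision: float  # estimated fraction of returned tuples that are correct
    est_recall: float     # estimated fraction of correct tuples that are returned


def choose_plan(
    plans: Sequence[Plan], min_precision: float, min_recall: float
) -> Optional[Plan]:
    """Return the fastest plan meeting both quality targets, or None if none does."""
    feasible = [
        p for p in plans
        if p.est_precision >= min_precision and p.est_recall >= min_recall
    ]
    return min(feasible, key=lambda p: p.est_time_sec, default=None)


# Illustrative candidates; the estimates are made up, not measured.
plans = [
    Plan("scan_all_documents", 3600.0, 0.80, 0.95),
    Plan("keyword_filtered_scan", 600.0, 0.85, 0.60),
    Plan("classifier_filtered_scan", 900.0, 0.90, 0.75),
]

best = choose_plan(plans, min_precision=0.85, min_recall=0.70)
print(best.name if best else "no plan meets the quality targets")
```

In this sketch the quality requirement acts as a filter and execution time as the objective. A real optimizer would also have to estimate precision and recall for each plan and configuration, which is the harder part of the problem described above.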
These projects are supported by the NSF CAREER award IIS-0643846.