Unsupervised Information Extraction by Text Segmentation
Publisher
Universidade Federal do Amazonas
Abstract
In this work we propose, implement and evaluate a new unsupervised approach for
the problem of Information Extraction by Text Segmentation (IETS). Our approach
relies on information available in pre-existing data to learn how to associate segments
in the input string with attributes of a given domain, using a very effective
set of content-based features. The effectiveness of the content-based features is also
exploited to learn structure-based features directly from test data, with no previous
human-driven training, a feature unique to our approach. Based on our approach,
we have produced a number of results to address the IETS problem in an unsupervised
fashion. In particular, we have developed, implemented and evaluated distinct IETS
methods, namely ONDUX, JUDIE and iForm. ONDUX (On Demand Unsupervised
Information Extraction) is an unsupervised probabilistic approach for IETS that
relies on content-based features to bootstrap the learning of structure-based features.
Structure-based features are exploited to disambiguate the extraction of certain
attributes through a reinforcement step, which relies on sequencing and positioning
of attribute values directly learned on-demand from the input texts. JUDIE (Joint
Unsupervised Structure Discovery and Information Extraction) aims at automatically
extracting several semi-structured data records that appear as continuous text
with no explicit delimiters between them. In comparison with other IETS
methods, including ONDUX, JUDIE faces a considerably harder task: extracting
information while simultaneously uncovering the underlying structure of
the implicit records containing it. In spite of that, it achieves results comparable to
the state-of-the-art methods. iForm applies our approach to the task of Web form
filling. It aims at extracting segments from a data-rich text given as input and associating
these segments with fields from a target Web form. The extraction process
relies on content-based features learned from data that was previously submitted to
the Web form. All of these methods were evaluated on different experimental
datasets, which we used to perform a large set of experiments in order to validate
our approach and methods. These experiments indicate that our proposed approach
yields high quality results when compared to state-of-the-art approaches and that
it is able to properly support IETS methods in a number of real applications.
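
To make the idea of content-based features concrete, the following is a minimal illustrative sketch, not the method implemented in the thesis. It assumes the input has already been split into candidate segments and that a small knowledge base of known attribute values is available. It labels each segment with the attribute whose known vocabulary overlaps it most. The names KNOWLEDGE_BASE, build_vocabulary, score and label_segments, as well as the sample data, are hypothetical. The structure-based reinforcement step described for ONDUX is not modeled here.

```python
# Hypothetical knowledge base: attribute -> values seen in pre-existing data.
KNOWLEDGE_BASE = {
    "name": ["Pizza Palace", "Golden Dragon", "Blue Cafe"],
    "street": ["Main Street", "Oak Avenue", "Market Street"],
    "city": ["Manaus", "Seattle", "Boston"],
}


def build_vocabulary(kb):
    """Collect the set of lowercase tokens known for each attribute."""
    return {
        attr: {tok for value in values for tok in value.lower().split()}
        for attr, values in kb.items()
    }


def score(segment, known_tokens):
    """Fraction of the segment's tokens that occur in the attribute's vocabulary."""
    tokens = segment.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in known_tokens) / len(tokens)


def label_segments(segments, kb):
    """Assign each segment the best-scoring attribute, or None if nothing matches."""
    vocab = build_vocabulary(kb)
    labeled = []
    for seg in segments:
        scores = {attr: score(seg, toks) for attr, toks in vocab.items()}
        best = max(scores, key=scores.get)
        labeled.append((seg, best if scores[best] > 0 else None))
    return labeled


if __name__ == "__main__":
    for segment, attribute in label_segments(
        ["Golden Cafe", "Oak Street", "Manaus"], KNOWLEDGE_BASE
    ):
        print(segment, "->", attribute)
```

Token overlap is only a simple stand-in for the content-based features mentioned in the abstract; a segment with no known tokens is left unlabeled rather than forced onto an attribute.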
Citation
VILARINHO, Eli Cortez Custódio. Extração de informação não-supervisionada por segmentação de texto. 2012. 173 f. Tese (Doutorado em Informática) - Universidade Federal do Amazonas, Manaus, 2012.
