Unsupervised Web data extraction using a format-independent approach (original title: Extração não supervisionada de dados da web utilizando abordagem independente de formato)
Publisher
Universidade Federal do Amazonas
Abstract
In this thesis we propose a new method for extracting data from data-rich Web pages
that uses only the textual content of these pages. Our method, called FIEX (Format
Independent Web Data Extraction), is based on information extraction techniques
for text segmentation, and can extract data from Web pages where state-of-the-art
methods based on data alignment techniques fail because of inconsistencies between
the logical structure of the pages and the conceptual structure of the data they
represent. Unlike methods previously proposed in the literature, FIEX is able to
extract data using only the textual content of Web pages in challenging scenarios,
such as severe cases of compound textual elements, in which several values of interest
for extraction are represented by a single HTML element. To extract data from Web
pages, FIEX relies on techniques for noise elimination through information
redundancy and on an information extraction method for text segmentation known
in the literature as ONDUX (On-Demand Unsupervised Learning for Information Extraction).
In our experiments, we used several collections of Web pages from different product
domains and e-commerce stores, with the goal of extracting data from product descriptions.
We chose this type of Web page because of the large amount of data it presents
in severe cases of compound textual elements. The results obtained in our experiments
across several product domains and e-commerce stores validate the hypothesis that
extraction based only on textual features is possible and effective.
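The idea of eliminating noise through information redundancy can be illustrated with a minimal sketch. This is not the FIEX implementation, whose details the abstract does not give. It only assumes that text lines repeated across many pages of the same site (menus, banners, footers) are noise, while lines specific to one page are candidate product data. The function name `remove_redundant_lines`, the `max_page_fraction` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def remove_redundant_lines(pages, max_page_fraction=0.5):
    """Drop text lines that occur on more than max_page_fraction of the pages.

    Hypothetical helper: a simple redundancy heuristic, not the FIEX algorithm.
    """
    # Count, for each normalized line, how many pages contain it at least once.
    page_counts = Counter()
    for text in pages:
        unique_lines = {
            line.strip().lower() for line in text.splitlines() if line.strip()
        }
        page_counts.update(unique_lines)

    limit = max_page_fraction * len(pages)
    cleaned = []
    for text in pages:
        # Keep only lines that are rare across the collection.
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and page_counts[line.strip().lower()] <= limit
        ]
        cleaned.append(kept)
    return cleaned


# Invented sample data: the textual content of three product pages from one store.
pages = [
    "Free shipping on all orders\nAcme X100 laptop 8GB RAM 256GB SSD\nContact us",
    "Free shipping on all orders\nZeta P2 phone 64GB black\nContact us",
    "Free shipping on all orders\nOrion K7 camera 24MP\nContact us",
]
cleaned = remove_redundant_lines(pages)
```

In this sketch each remaining line is still a compound textual element (for example, a product name, memory size, and storage size in one string). Splitting it into attribute values would be the job of a text segmentation method such as ONDUX, which is not reproduced here.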
Citation
PORTO, André Luiz Lopes. Extração não supervisionada de dados da web utilizando abordagem independente de formato. 2015. 77 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus, 2015.
