Extração não supervisionada de dados da web utilizando abordagem independente de formato (Unsupervised web data extraction using a format-independent approach)

Porto, André Luiz Lopes

Files

Dissertação - André Luiz Lopes Porto.pdf (14,17 MB)

Date

2015-11-17

Authors

Porto, André Luiz Lopes

Publisher

Universidade Federal do Amazonas

Abstract

In this thesis we propose a new method for extracting data from data-rich Web pages that uses only the textual content of these pages. Our method, called FIEX (Format Independent Web Data Extraction), is based on information extraction techniques for text segmentation and can extract data from Web pages where state-of-the-art methods based on data alignment techniques fail because of inconsistencies between the logical structure of the pages and the conceptual structure of the data they represent. Unlike methods previously proposed in the literature, FIEX is able to extract data using only the textual content of a Web page in challenging scenarios such as severe cases of compound textual elements, in which several values of interest for extraction are represented by a single HTML element. To extract data from Web pages, FIEX relies on noise elimination through information redundancy and on an information extraction method for text segmentation known in the literature as ONDUX (On-Demand Unsupervised Learning for Information Extraction). In our experiments, we used several collections of Web pages from different product categories and e-commerce stores, with the goal of extracting data from product descriptions. We chose this type of Web page because a large amount of the data in these pages occurs in severe cases of compound textual elements. Based on the results obtained in our experiments across various product categories and e-commerce stores, we validate the hypothesis that extraction based only on textual features is possible and effective.

Keywords

Extração de dados, Comércio Eletrônico, Descrições de Produtos, Alinhamento de dados, Data Extraction, E-commerce, Product Description, Data Alignment

Citation

PORTO, André Luiz Lopes. Extração não supervisionada de dados da web utilizando abordagem independente de formato. 2015. 77 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus, 2015.

URI

http://tede.ufam.edu.br/handle/tede/5113

Collections

Mestrado em Informática (Master's in Informatics)

Rights and licensing

Open Access
