Geração semi-automática de extratores de dados da web considerando contextos fracos (Semi-automatic generation of Web data extractors considering weak contexts)

Oliveira, Daniel Pereira de

Files

Daniel Pereira de Oliveira.pdf (1.87 MB)

Date

2006-03-03

Authors

Oliveira, Daniel Pereira de

Publisher

Universidade Federal do Amazonas

Abstract

Nowadays, the Internet has become the largest information repository available. However, most of this information is represented in textual format and requires human intervention to be used effectively. On the other hand, a large set of Web pages is in fact composed of collections of implicit data objects, for instance on-line catalogs, digital libraries, and e-commerce Web sites in general. Extracting the contents of these pages and identifying the structure of the available data objects allows for more sophisticated forms of processing than hyperlink browsing and keyword-based searching. The task of extracting data from Web pages is usually performed by specialized programs called wrappers. In the present work we propose and evaluate a new approach to the wrapper development problem. In this approach, the user is only responsible for providing examples of the atomic items that constitute the objects of interest. Based on these examples, our method automatically generates expressions for extracting other atomic items similar to those given as examples and infers a plausible and meaningful structure to organize them. Our method for generating extraction expressions uses techniques inherited from solutions to the multiple string alignment problem. The method is able to produce good extraction expressions that can be easily encoded as regular expressions. Inferring a meaningful structure for the objects whose atomic values were extracted is the task of the HotCycles algorithm, which was previously proposed and which we have revised and extended in this work. The algorithm assembles an adjacency graph for these atomic values and performs a structural analysis over this graph, looking for patterns that resemble structural constructs such as tuples and lists. From such constructs, a complex object type can be assigned to the extracted data.
The experiments carried out using 21 collections of real Web pages have demonstrated the feasibility of our extraction method, which reached 94% effectiveness using no more than 10 examples for each attribute. The HotCycles algorithm was able to infer a meaningful structure for the objects present in all collections used. Its effectiveness, combined with our atom extraction method, reached 97% of structures correctly inferred, also using no more than 10 examples per attribute. The combination of these two methods has proven highly feasible. The high number of correctly inferred structures, together with the high precision and recall values of the extraction process, demonstrates that this new approach is indeed a promising one.
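
To illustrate the idea of turning a few user-provided examples into an extraction expression encoded as a regular expression, here is a minimal sketch. It is not the dissertation's method: instead of multiple string alignment it uses a much simpler stand-in, the longest common suffix of the left contexts and the longest common prefix of the right contexts. The function names, the example triples, and the sample page are all hypothetical and exist only for this illustration.

```python
import os
import re


def common_prefix(strings):
    # Longest character-wise prefix shared by all strings.
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    # Longest shared suffix, computed as the common prefix of the reversed strings.
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def build_expression(examples):
    # examples: list of (left_context, value, right_context) triples (hypothetical format).
    # Simplification: only the text shared by ALL examples is kept as delimiters,
    # which is far cruder than aligning the contexts against each other.
    left = common_suffix([left for left, _, _ in examples])
    right = common_prefix([right for _, _, right in examples])
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))


# Two examples of the same attribute (a book title) with their surrounding text.
examples = [
    ("<li><b>", "Dune", "</b> by Frank Herbert</li>"),
    ("<li><b>", "Emma", "</b> by Jane Austen</li>"),
]

# Invented sample page containing one title that was never given as an example.
page = (
    "<ul><li><b>Dune</b> by Frank Herbert</li>"
    "<li><b>Emma</b> by Jane Austen</li>"
    "<li><b>Ulysses</b> by James Joyce</li></ul>"
)

expression = build_expression(examples)
titles = expression.findall(page)
print(titles)  # ['Dune', 'Emma', 'Ulysses']
```

In this sketch the shared delimiters are `<li><b>` on the left and `</b> by ` on the right, so the generated expression also extracts the title that was not among the examples. When the examples share little or no surrounding text, this naive approach yields delimiters too short to be useful, which is one reason an alignment-based technique is a more robust choice.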

Keywords

Geração semi-automática, Extratores de dados, Contextos fracos, Semi-automatic generation, Data extractors, Weak contexts

Citation

OLIVEIRA, Daniel Pereira de. Geração semi-automática de extratores de dados da web considerando contextos fracos. 2006. 136 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus, 2006.

URI

http://tede.ufam.edu.br/handle/tede/2936

Collections

Mestrado em Informática (Master's in Informatics)

Rights and licensing

Open Access
