Geração semi-automática de extratores de dados da web considerando contextos fracos (Semi-automatic generation of Web data extractors considering weak contexts)
Publisher
Universidade Federal do Amazonas
Abstract
Today, the Internet has become the largest information repository available. However, most of this information is represented in textual format and requires human intervention to be used effectively. At the same time, a large set of Web pages are in fact collections of implicit data objects, for instance on-line catalogs, digital libraries and e-commerce Web sites in general. Extracting the contents of these pages and identifying the structure of the data objects they contain allows for more sophisticated forms of processing than hyperlink browsing and keyword-based searching. The task of extracting data from Web pages is usually executed by specialized programs called wrappers. In the present work we propose and evaluate a new approach to the wrapper development problem. In this approach, the user is only responsible for providing examples of the atomic items that constitute the objects of interest. Based on these examples, our method automatically generates expressions for extracting other atomic items similar to those given as examples and infers a plausible and meaningful structure to organize them. Our method for generating extraction expressions uses techniques inherited from solutions to the multiple string alignment problem. The method is able to produce good extraction expressions that can be easily encoded as regular expressions. Inferring a meaningful structure for the objects whose atomic values were extracted is the task of the HotCycles algorithm, which was previously proposed and which we have revised and extended in this work. The algorithm assembles an adjacency graph for these atomic values and executes a structural analysis over this graph, looking for patterns that resemble structural constructs such as tuples and lists. From such constructs, a complex object type can be assigned to the extracted data. Experiments carried out on 21 collections of real Web pages demonstrated the feasibility of our extraction method, which reached 94% effectiveness using no more than 10 examples for each attribute. The HotCycles algorithm was able to infer a meaningful structure for the objects present in all collections used. Combined with our atom extraction method, it inferred 97% of the structures correctly, also using no more than 10 examples per attribute. The combination of these two methods has proven to be highly feasible. The high number of correctly inferred structures, together with the high precision and recall values of the extraction process, demonstrates that this new approach is indeed a promising one.
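To make the idea of deriving an extraction expression from user-supplied examples more concrete, the following is a minimal sketch, not the method of the dissertation. It replaces multiple string alignment with a much simpler stand-in: the longest common suffix of the text preceding each example and the longest common prefix of the text following it. The function names, the sample page and the example values are all hypothetical and chosen only for illustration.

```python
import re
from os.path import commonprefix  # character-wise common prefix of a list of strings


def common_suffix(strings):
    """Longest suffix shared by all strings (common prefix of the reversed strings)."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def build_extraction_regex(page, examples):
    """Build a regular expression from a few example values found in `page`.

    Simplifying assumption: the context shared by all examples is an exact
    common suffix (on the left) and an exact common prefix (on the right).
    This is a stand-in for multiple string alignment, not an implementation of it.
    """
    lefts, rights = [], []
    for value in examples:
        start = page.index(value)           # first occurrence of the example value
        lefts.append(page[:start])          # everything before the example
        rights.append(page[start + len(value):])  # everything after it
    left_context = common_suffix(lefts)
    right_context = commonprefix(rights)
    # Capture the shortest text between the two generalized contexts.
    return re.compile(re.escape(left_context) + r"(.+?)" + re.escape(right_context))


# Hypothetical page fragment and two user-provided examples.
page = ("<li><b>Dune</b> $9</li>"
        "<li><b>Emma</b> $7</li>"
        "<li><b>Ulysses</b> $12</li>")
pattern = build_extraction_regex(page, ["Dune", "Emma"])
titles = pattern.findall(page)
print(titles)
```

With the two examples above, the generalized contexts are `<li><b>` on the left and `</b> $` on the right, so the resulting expression also extracts the third title, which was never given as an example. An exact common prefix or suffix breaks as soon as the contexts differ by a single character, which is the kind of variation an alignment-based approach is meant to tolerate.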
Citation
OLIVEIRA, Daniel Pereira de. Geração semi-automática de extratores de dados da web considerando contextos fracos. 2006. 136 f. Dissertation (Master's in Informatics) - Universidade Federal do Amazonas, Manaus, 2006.
