Extração automática de dados de páginas HTML utilizando alinhamento em dois níveis

Carregando...
Imagem de Miniatura

Título da Revista

ISSN da Revista

Título de Volume

Editor

Universidade Federal do Amazonas

Resumo

There is a huge amount of information in the World Wide Web in pages composed by similar objects. E-commerce Web sites and on-line catalogs, in general, are examples of such data repositories. Although this information usually occurs in semi-structured texts, it is designed to be interpreted and used by humans and not processed by machines. The identification of these objects inWeb pages is performed by external applications called extractors or wrappers. In this work we propose and evaluate an automatic approach to the problem of generating wrappers capable of extracting and structuring data records and the values of their attributes. It uses the Tree Alignment Algorithm to find in the Web page examples of objects of interest. Then, our method generates regular expressions for extracting objects similar to the examples given using the Multiple Sequence Alignment Algorithm. In a final step, the method decomposes the objects in sequences of text using the regular expression and common formats and delimiters, in order to identify the value of the attributes of the data records. Experiments using a collection composed by 128 Web pages from different domains have demonstrated the feasibility of our extraction method. It is evaluated regarding the identification of blocks of HTML source code that contain data records and regarding record extraction and the value of its attributes. It reached a precision of 83% and a recall of 80% when extracting the value of attributes. These values mean a gain in precision of 43.37% and in recall of 68.75% when compared to similar proposals.

Descrição

Citação

PEDRALHO, André de Souza. Extração automática de dados de páginas HTML utilizando alinhamento em dois níveis. 2011. 78 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus, 2011.

Avaliação

Revisão

Suplementado Por

Referenciado Por