Identificando o Tópico de Páginas Web

Lima, Márcia Sampaio

Files

DISSERTACAO MARCIA.pdf (775.86 KB)

Date

2009-04-24

Authors

Lima, Márcia Sampaio

Publisher

Universidade Federal do Amazonas

Abstract

Textual and structural sources of evidence extracted from web pages are frequently used to improve the results of Information Retrieval (IR) systems. The main topic of a web page is a textual source of evidence with wide applicability in IR systems: it can be used to improve ranking, page classification, and filtering, among other applications. In this work, we study, develop, and evaluate a method that identifies the main topic of a web page by combining different sources of evidence. We define the main topic of a web page as a set of at most five distinct keywords related to the main subject of the page. The proposed method is divided into four phases: (1) identification of the keywords that describe the web page content, using multiple sources of evidence; (2) use of a genetic algorithm to combine the sources of evidence; (3) selection of the three best keywords of the page; and (4) use of a web directory to identify the main topic of the page. The experimental results show that: (1) the best source of evidence for describing the keywords of a web page is the content link; (2) the proposed method is effective at identifying the main topic of a web page, scoring 0.9129 on a scale from zero to one; and (3) the proposed method is also effective at automatically classifying web pages within the Google directory, reaching a precision of 88% ± 0.11 in the classification task.
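Phases (2) and (3) of the abstract can be illustrated with a toy sketch. It assumes that each candidate keyword has one score per source of evidence and that the genetic algorithm searches for the weights of a linear combination of those scores. The source names, the scores, the ground-truth set, the fitness function, and the genetic operators below are all hypothetical choices made for illustration; the abstract does not specify how the dissertation defines them.

```python
import random

# Hypothetical sources of evidence and per-source keyword scores for one page.
# None of these names or numbers come from the dissertation.
SOURCES = ["title", "anchor_text", "body"]
CANDIDATES = {
    "python":   [0.9, 0.8, 0.4],
    "tutorial": [0.7, 0.9, 0.3],
    "download": [0.1, 0.2, 0.9],
    "login":    [0.0, 0.1, 0.8],
    "language": [0.6, 0.7, 0.5],
}
RELEVANT = {"python", "tutorial", "language"}  # assumed ground truth


def top_keywords(weights, k=3):
    """Rank candidates by the weighted sum of their per-source scores."""
    def combined(word):
        return sum(w * s for w, s in zip(weights, CANDIDATES[word]))
    return sorted(CANDIDATES, key=combined, reverse=True)[:k]


def fitness(weights):
    """Fraction of the top-3 keywords found in the assumed ground truth."""
    return len(set(top_keywords(weights)) & RELEVANT) / 3


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve one weight per source with a very small genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in SOURCES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(SOURCES))  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))         # mutate a single weight
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children                  # parents survive unchanged
    return max(pop, key=fitness)


best = evolve()
print(top_keywords(best), fitness(best))
```

Because the parents are carried over unchanged, the best fitness never decreases between generations. A realistic setup would evaluate the weights over many labeled pages rather than a single one.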

Keywords

Topic of web page, Genetic algorithm, Multiple sources of evidence, Web directories

Citation

LIMA, Márcia Sampaio. Identificando o Tópico de Páginas Web. 2009. 73 leaves. Dissertation (Master's in Informatics) - Universidade Federal do Amazonas, Manaus, 2009.

URI

http://tede.ufam.edu.br/handle/tede/2957

Collections

Master's in Informatics

Rights and licensing

Open Access
