Uso de região de interesse para tratamento de desbalanceamento de Bases de Dados de monitoramento de tráfego de redes de acesso geradas por adesão voluntária

Silva, Juliana Castro da

Files

Primary: Dissertacao_JulianaSilva_PPGI (3.16 MB)

Carta de Autorização de Encaminhamento.pdf (171.83 KB)

Date

2022-11-30

Authors

Silva, Juliana Castro da

Publisher

Universidade Federal do Amazonas

Abstract

An unbalanced dataset is characterized by a significant difference in size among groups of data: the majority group, which holds most of the samples, and the minority group, which holds only a few. This pattern has been observed in datasets from different domains, e.g., finance, weather, and medical diagnostics. More recently, datasets collected through crowdsourcing have been found to show the same pattern, owing to the social and economic profile of the volunteers who contribute. In general, collecting data is costly and time-consuming, which severely restricts extending the collection period or repeating the process to acquire more data or improve its quality. Moreover, the characteristics that matter most for learning tend to lie in the minority group, where they are poorly represented. In this context, data representativeness is a key issue when using these datasets to train Machine Learning models, for instance, to solve classification and prediction problems with significant accuracy. Strategies for the unbalanced dataset problem have therefore been proposed using either an algorithmic approach, which changes the learning algorithm, or a data-driven approach, which changes the probability distribution of the data. Oversampling is a data-driven approach: it changes the data distribution by sampling and patching the minority group. This sampling relies on the concept of a neighborhood, established by measuring the similarity among samples of the minority group, for instance, using Euclidean distance. SMOTER (SMOTE for Regression) implements neighborhood-based oversampling and has been widely adopted due to its simplicity and acceptable accuracy. However, neighborhood-based approaches suffer from the inlay-regions problem: they ignore the existence of inlay minority regions, which leads them to sample data with inappropriate values.
To overcome this problem, the concept of a region of interest is defined and used to guide the sampling. Radial-Based Oversampling (RBO) is driven by this concept: it applies a radial-based kernel function to characterize the regions of interest and guide the sampling. In this work, we present a novel method for unbalanced datasets, named RBO-QS, which overcomes the identified drawbacks of the RBO method. Numerical studies show that the proposed method performs the sampling efficiently and accurately. The quality of the data samples was evaluated under different criteria, including the training of regression models. The dataset used in the experimental studies was collected over six years and contains more than 12 million sensing entries from video streaming sessions.
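
As an illustration of the neighborhood-based oversampling that the abstract attributes to SMOTER, the following is a minimal sketch only, not the dissertation's code and not the RBO-QS method. It assumes each minority sample is a list of numeric features with one numeric target. The names `euclidean` and `oversample_minority` and the parameters `n_new`, `k`, and `seed` are hypothetical and chosen for this example. A synthetic sample is placed on the segment between a minority sample and one of its nearest minority neighbors, and its target is a distance-weighted average of the two seed targets.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def oversample_minority(features, targets, n_new, k=3, seed=0):
    """Generate n_new synthetic (features, target) pairs from minority samples.

    Illustrative sketch: needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the first seed.
        i = rng.randrange(len(features))
        # Its k nearest minority neighbors by Euclidean distance (itself excluded).
        neighbors = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: euclidean(features[i], features[j]),
        )[:k]
        j = rng.choice(neighbors)
        # Interpolate a new point on the segment between the two seeds.
        t = rng.random()
        new_x = [a + t * (b - a) for a, b in zip(features[i], features[j])]
        # Target: average of the seed targets, weighting the closer seed more.
        d_i = euclidean(new_x, features[i])
        d_j = euclidean(new_x, features[j])
        if d_i + d_j == 0:
            new_y = (targets[i] + targets[j]) / 2
        else:
            new_y = (d_j * targets[i] + d_i * targets[j]) / (d_i + d_j)
        synthetic.append((new_x, new_y))
    return synthetic
```

Because every synthetic point is drawn between two existing minority samples, a sketch like this has no notion of what lies between them, which is the kind of limitation the abstract describes as the inlay-regions problem.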

Citation

SILVA, Juliana Castro da. Uso de região de interesse para tratamento de desbalanceamento de Bases de Dados de monitoramento de tráfego de redes de acesso geradas por adesão voluntária. 2022. 78 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus (AM), 2022.

URI

https://tede.ufam.edu.br/handle/tede/9313

Collections

Mestrado em Informática

Rights and licensing

Open Access
