Representation, classification and interpretation of dengue virus protein sequences

Abstract

The dengue virus causes a very common infection in some countries of Latin America and the Western Pacific, triggering symptoms such as fever, headache, nausea, vomiting and muscle pain. The infection can be divided into three levels: dengue fever, hemorrhagic fever and shock syndrome, the last two being associated with fatalities. The causes that lead hosts to develop severe infection are not completely known. However, the proteins encoded by the dengue virus genetic material are a potential source of information: for example, they carry characteristics that allow the virus to be differentiated into serotypes and genotype subclasses, and they also contain phylogenetic information. It is therefore reasonable to assume that these structures have characteristics capable of improving the understanding of severe dengue. The challenge of working with proteins is the difficulty of capturing the characteristics of interest, since they occur as patterns in small functional regions scattered along the sequence. Representing proteins in structures where patterns can be easily accessed therefore becomes a viable alternative for handling this type of data. In this research, we propose a methodology to identify patterns in dengue proteins associated with severe dengue in human hosts. The method is based on representing dengue proteins as codon co-occurrence matrices. The Random Forests (RF) and Convolutional Neural Network (CNN) algorithms are used to classify matrices labeled as classic or severe dengue. Subsequently, the classifiers are interpreted with the SHAP values method, which shows which co-occurrences increase the probability of severe dengue in a sample. The interpretation results are grouped into importance plots that highlight the codon co-occurrence patterns associated with severe dengue. We classify each dengue protein independently.
Experiments using RF achieved AUC values ranging from 0.70 to 0.83. The best results were obtained when classifying the protein E matrices: across 25 results (five experiments with five cross-validation folds each), the AUC reached 0.83 ± 0.02 with a 95% confidence interval. The Levene, Shapiro-Wilk, ANOVA and Tukey statistical tests were used to test whether the mean metrics computed over the 25 results differed between proteins. The results for protein E were found to be statistically different from those of the other proteins, providing evidence that protein E best characterizes severe dengue. Through the proposed method, we obtained new evidence on the development of severe dengue, directly associating it with frequent codon co-occurrence patterns. Our method revealed high co-occurrences in protein E that may be associated with the onset of severe dengue in the host. In addition, in more granular explorations, we observed groups of co-occurrences that increase the likelihood of severe dengue for the four different serotypes. These results may play an important role in proposing new treatments, as well as being the subject of debate on new theories regarding the development of severe dengue in human hosts.
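
The codon co-occurrence representation mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not define how co-occurrences are counted, so this sketch assumes that a co-occurrence is an ordered pair of in-frame codons separated by a fixed offset in the coding sequence of a protein. The function name `codon_cooccurrence_matrix`, the `offset` parameter, the codon ordering and the toy sequence are all hypothetical and are not taken from the dissertation.

```python
from itertools import product

# All 64 codons in a fixed order; the ordering is an arbitrary choice for this sketch.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_cooccurrence_matrix(coding_sequence, offset=1):
    """Count how often codon i is followed by codon j `offset` codons later.

    Assumption: a co-occurrence is an ordered pair of in-frame codons
    separated by `offset` positions; the source does not define it.
    """
    seq = coding_sequence.upper()
    # Split into in-frame codons, dropping any incomplete trailing codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    matrix = [[0] * 64 for _ in range(64)]
    for first, second in zip(codons, codons[offset:]):
        # Skip pairs that contain ambiguous bases such as N.
        if first in CODON_INDEX and second in CODON_INDEX:
            matrix[CODON_INDEX[first]][CODON_INDEX[second]] += 1
    return matrix


# Toy sequence for illustration only (not real dengue data).
toy = "ATGGCCATGGCCAAA"
m = codon_cooccurrence_matrix(toy)
```

Under these assumptions, each protein sequence yields one 64 x 64 count matrix, which could then be flattened or treated as an image and passed, together with a classic/severe label, to a classifier such as RF or a CNN.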

Citation

SOUZA, Leonardo Rodrigues de. Representação, classificação e interpretação de sequências proteicas do vírus da dengue. 2021. 96 f. Dissertação (Mestrado em Informática) - Universidade Federal do Amazonas, Manaus, 2021.

Creative Commons License

Except where otherwise noted, the license of this item is described as Open Access