Identifying Putative Virulence Factors with Highly Accurate Machine Learning Models

Jack Clark (University of Leicester, UK)

13:15 - 13:30 Tuesday 14 April Morning


Abstract

Disease caused by opportunistic bacterial pathogens is a leading cause of premature death worldwide. A wealth of genomic data is now available thanks to the advent of low-cost, high-throughput sequencing technologies, and this has presented new opportunities to investigate the genetic basis for switches between virulence and commensality in opportunistic bacterial species. However, the era of ‘big data’ has emerged alongside challenges in handling such large datasets to find biologically relevant signals among the noise. Using a simple random forest classifier, we are able to predict disease state from genome sequence alone, with 92% validation accuracy, in a highly diverse collection of 5751 meningococcal isolates deposited in the PubMLST database. When the model is applied to the same dataset with randomly shuffled disease states, no association is found, suggesting that it is able to reliably distinguish between signal and noise. A list of top genes ranked by their contributions to model accuracy includes a number of known meningococcal virulence factors, such as the maf loci, involved in adhesion, and genes involved in lipopolysaccharide biosynthesis. Also present are multiple hypothetical and uncharacterised genes of unknown function. Work is ongoing to apply the model to noncoding sequences to explore the significance of these relatively understudied regions to disease. The simplicity and scalability of this approach, combined with minimal requirements for input data, make it a useful tool for rapidly identifying previously unknown putative virulence elements as high-priority candidates for characterisation studies.
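
The workflow described in the abstract (train a random forest, compare against a shuffled-label control, then rank genes by importance) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn, a synthetic binary gene presence/absence matrix in place of the PubMLST isolates, a made-up rule linking three genes to disease, and impurity-based feature importances as the ranking measure. The abstract does not state how the genomes were encoded or how gene contributions were computed, so every name and parameter below is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: rows are isolates, columns are genes,
# 1 = gene present, 0 = gene absent. Sizes are arbitrary.
n_isolates, n_genes = 600, 50
X = rng.integers(0, 2, size=(n_isolates, n_genes))

# Hypothetical ground truth: an isolate is "disease" (1) when at least two of
# genes 0, 1 and 2 are present; 5% of labels are flipped to mimic noise.
y = (X[:, :3].sum(axis=1) >= 2).astype(int)
flip = rng.random(n_isolates) < 0.05
y = np.where(flip, 1 - y, y)


def validation_accuracy(features, labels, seed=0):
    """Fit a random forest on 75% of the isolates; score it on the held-out 25%."""
    X_train, X_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_val, y_val), model


# Model trained on the real genotype-to-label pairing.
real_acc, real_model = validation_accuracy(X, y)

# Negative control: shuffling the labels breaks any link between genotype and
# disease state, so accuracy should fall to roughly chance level.
shuffled_acc, _ = validation_accuracy(X, rng.permutation(y))

# Rank genes by impurity-based importance, highest first.
top_genes = np.argsort(real_model.feature_importances_)[::-1][:5]

print(f"validation accuracy, real labels:     {real_acc:.2f}")
print(f"validation accuracy, shuffled labels: {shuffled_acc:.2f}")
print("top-ranked gene indices:", top_genes.tolist())
```

In this toy setting the three causal genes should appear at the top of the ranking, and the gap between the two accuracies plays the role of the signal-versus-noise check described above. With real data, a permutation-based importance measure on held-out isolates may be a more robust choice than impurity-based importances.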
