Extreme-scale Automated Re-analysis of Sequencing Experiments via Advanced LLMs—The Microbiome as a Test Case

Noam Shental (The Open University of Israel, Israel)

10:30 - 10:45 Wednesday 15 April Morning

Abstract

In a 2023 paper, we introduced dbBact, a comprehensive bacterial knowledgebase (Amir et al., Nucleic Acids Research). Over seven years, we carefully read more than 1,100 16S rRNA papers, downloaded their raw sequencing data, re-analyzed it, and uploaded our findings to dbBact, resulting in 1.5 million associations between genotypes and phenotypes. Our goal was to show that a uniform analysis of raw sequencing data from diverse microbial environments could build a comprehensive body of knowledge that would otherwise be impossible to obtain. Although dbBact allows us to generate pan-microbiome hypotheses, it currently covers only 1% to 3% of the 16S rRNA literature. This talk discusses our ongoing efforts to harness state-of-the-art large language models (LLMs) and scalable bioinformatics pipelines to re-analyze and integrate 16S rRNA sequencing data from the entire published microbiome research landscape, estimated at 30,000 to 100,000 papers. We will explain our process and compare the manual curation in dbBact with our automated system across those 1,100 studies. This comparison reveals some challenges that LLMs face when interpreting certain studies, but it also highlights their significant advantages in providing comprehensive, accurate insights on par with those of a microbiome expert. We believe that expanding dbBact’s approach by a factor of 30 to 100 could transform microbiome research and serve as a model for automated re-analysis in other biomedical fields.
