An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval

Posted on July 12, 2024   by Ron Caspi, biocurator, SRI International

Ron Caspi takes us behind the scenes of their latest publication 'An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval' published in Access Microbiology.

Image credit: iStock/Supatman

My name is Ron Caspi. I received my PhD in Marine Microbiology from the Scripps Institution of Oceanography back in 1996. After a postdoc and a few years in biotech I joined Peter Karp’s team at SRI International. This small group, consisting of computer scientists and biologists, develops the BioCyc web portal, which combines >20,000 Pathway/Genome Databases (mostly for bacteria) with a set of powerful bioinformatic tools designed to be used by biologists (no programming skills required). My role in the group is that of a biocurator. I read a LOT of papers and summarise metabolic information into the MetaCyc database, the largest curated database of experimentally elucidated metabolic pathways and enzymes, which are gathered from all domains of life.  I am also a member of the IUBMB and IUPAC nomenclature committees, as well as the Enzyme Commission, which classifies enzymes.

Over the years I have curated thousands of pathways and enzymes in the MetaCyc database, as well as in many organism-specific databases. Curation is a time-consuming process, so when ChatGPT started gaining attention, I was excited about the possibility of using it to enhance my curation. Alas, when I actually tried it, I found that I was often receiving incorrect information. When I mentioned this to my group, my director suggested conducting a more controlled experiment to evaluate ChatGPT's performance for my specific purposes. So, as I continued my curation work and new questions emerged, I occasionally consulted ChatGPT, evaluated its responses, and kept records of my interactions. When Google introduced its own chatbot, Bard (now Gemini), I asked it the same questions that I had asked ChatGPT. At some point I felt that I had enough information to share my observations with others.

In the paper, ‘An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval’, I presented my results. Each chatbot was presented with the same set of eight questions:

  1. Which enzymes produce 4-methylumbelliferyl glucoside?
  2. What type of quinone is found in Staphylococcus?
  3. What is the function of MRT4 in yeast?
  4. What are the three NifS family proteins found in E. coli?
  5. Does E. coli contain any HD-GYP domain-containing proteins?
  6. Who named the curli protein of E. coli?
  7. What is the RbcX protein?
  8. Can you give an example of a cyanobacterial enzyme that contains ubiquinone in its name?

Note that the last question is a bit tricky, since cyanobacteria do not produce ubiquinone. However, annotation pipelines occasionally mislabel cyanobacterial proteins with names that mention ubiquinone, and I was trying to identify such cases.

The performance of the chatbots was evaluated using a simple system: a fully correct answer received 3 points; a mostly correct answer received 2 points; a mostly incorrect answer received 1 point; and a completely incorrect answer got no points at all.
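
To make the arithmetic concrete, here is a minimal Python sketch of that rubric. It is purely illustrative and not the code used in the paper; the per-question ratings below are placeholders, not the published results.

    # Rubric from the paper: 3, 2, 1 or 0 points per answer.
    RUBRIC = {
        "fully correct": 3,
        "mostly correct": 2,
        "mostly incorrect": 1,
        "completely incorrect": 0,
    }

    def total_score(ratings):
        """Sum the rubric points over one chatbot's answers."""
        return sum(RUBRIC[r] for r in ratings)

    # Hypothetical ratings for the eight questions, for illustration only.
    example_ratings = [
        "mostly correct", "completely incorrect", "fully correct",
        "mostly incorrect", "completely incorrect", "mostly correct",
        "mostly incorrect", "completely incorrect",
    ]

    max_score = 3 * len(example_ratings)  # 8 questions x 3 points = 24
    print(total_score(example_ratings), "out of", max_score)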

The result: out of a maximum possible score of 24, ChatGPT scored 5 and Bard scored 13. Not great. The problems included missing information that is readily available on Google or PubMed, incorrect information, and sometimes a mixture of correct and incorrect information, which makes it difficult for the user to know what can be trusted and what cannot. The tools were also inconsistent, providing different answers to the same question when I contested the validity of an answer. Since the answers provided by these tools cannot currently be trusted, the time a user would need to spend verifying the information would not be significantly less than the time it would take to research the topic by other means.

Perhaps the results are not that surprising, since even ChatGPT agrees. When I put the question to ChatGPT itself, it said that “for specific and up-to-date scientific information, established scientific journals, databases, and subject-matter experts remain the preferred avenues for trustworthy data”…