
Deep fact checking is ignored, but deep learning is praised

The rewards of deep learning are high. Train and evaluate a Transformer (a state-of-the-art deep learning architecture) on a dataset of 22.2 million enzymes, use it to make predictions for 450 enzymes of unknown function, and you can publish in Nature Communications, a highly regarded journal. Altmetric (a rating system for online articles) will score your paper in the top 5%, and it will receive 22,000 views.

If you instead spend the time to go through someone else’s work and find hundreds of errors, you can upload a preprint to bioRxiv, and it will receive only a fraction of the citations and views of the original. This is exactly what happened with these two papers:

  • Nature Communications paper (Kim et al.)
  • bioRxiv preprint (de Crecy et al.)

A Tale of Two Altmetric Scores

These two papers on enzyme function prediction make for an interesting case study on the limitations of AI in biology and the harms of current publishing incentives. I will go over some of the details, but I encourage you to read them for yourself. The contrast is a stark example of how difficult it can be to assess the legitimacy of AI results if you don’t have deep domain expertise.

The Problem of Determining the Function of Enzymes

Because enzymes catalyze reactions in living organisms, they are vital for making things happen. Enzyme Commission (EC) numbers provide a hierarchical system of classification for thousands of enzyme functions. Can you predict the EC number from a sequence of amino acids (the building blocks of all proteins, including enzymes)? This problem seems to be a perfect fit for machine learning: it has clearly defined inputs and outputs, and a rich dataset is available, with more than 22 million enzymes listed in the UniProt online database, along with their EC numbers.
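To make that framing concrete, here is a minimal sketch (not from either paper) of the supervised setup: the input is an amino acid sequence, the output is a hierarchical EC label such as 2.7.1.1, and the training data are (sequence, EC number) pairs of the kind UniProt provides. The records and helper functions below are hypothetical.

```python
# Minimal sketch of the supervised framing: amino acid sequence -> EC number.
# The example records are made up; real training data would come from UniProt.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq: str, max_len: int = 512) -> list[int]:
    """Map an amino acid string to integer tokens (0 reserved for padding)."""
    tokens = [AA_TO_INDEX[aa] + 1 for aa in seq[:max_len] if aa in AA_TO_INDEX]
    return tokens + [0] * (max_len - len(tokens))

def parse_ec(ec: str) -> tuple[str, ...]:
    """EC numbers are hierarchical: class . subclass . sub-subclass . serial."""
    return tuple(ec.split("."))          # e.g. "2.7.1.1" -> ("2", "7", "1", "1")

# Hypothetical training pairs (sequence fragment, EC number).
training_data = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "2.7.1.1"),   # a kinase-like label
    ("MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ", "3.1.3.16"),  # a phosphatase-like label
]

X = [encode_sequence(seq) for seq, _ in training_data]
y = [parse_ec(ec) for _, ec in training_data]
print(y)  # [('2', '7', '1', '1'), ('3', '1', '3', '16')]
```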

An Approach with a Transformer AI Model

In the Nature Communications paper, a deep learning transformer model was used to predict the functions of previously uncharacterized enzymes, and on the surface it looked like a good article. The authors adopted a well-respected neural network architecture adapted from BERT (two transformer encoders, followed by two convolutional layers and a linear layer). They confirmed that the model attended to biologically significant regions of the sequences, suggesting that it had learned something interpretable. They used a standard training/validation/test split on a dataset with millions of entries. The researchers then applied their model to a set of enzymes for which no “ground truth” existed, making 450 new predictions. They randomly selected three of these novel predictions to test in vitro and confirmed that they were accurate.

A transformer model, shown on the left, was used to predict Enzyme Commission numbers for uncharacterized enzymes in E. coli. Three of these were tested in vitro (Fig 1a and Fig 4 from Kim, et al.)
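As a rough illustration only (this is not the authors’ code, and the layer sizes and label count here are invented), a model of the general shape described above, transformer encoder layers followed by convolutional layers and a linear classifier, might look something like this in PyTorch:

```python
import torch
import torch.nn as nn

class ECNumberClassifier(nn.Module):
    """Sketch of a sequence -> EC number classifier; hyperparameters are illustrative."""

    def __init__(self, vocab_size: int = 21, d_model: int = 128,
                 n_classes: int = 5000, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        # Two transformer encoder layers, as in the architecture described above.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Two convolutional layers over the sequence dimension.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Linear layer mapping pooled features to EC number classes.
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                 # (batch, seq, d_model)
        x = self.encoder(x)                    # (batch, seq, d_model)
        x = self.conv(x.transpose(1, 2))       # (batch, d_model, seq)
        x = x.mean(dim=-1)                     # pool over sequence positions
        return self.classifier(x)              # (batch, n_classes) logits

# Example forward pass with a batch of two padded token sequences.
model = ECNumberClassifier()
dummy_tokens = torch.randint(1, 21, (2, 512))
logits = model(dummy_tokens)
print(logits.shape)  # torch.Size([2, 5000])
```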

The Errors

The Transformer model in the Nature Communications paper made hundreds of “novel predictions” that are almost certainly wrong. The paper followed a standard method of evaluating performance on a held-out test set, and on that test set it performed quite well (although later investigation suggests there may have been data leakage). The results for enzymes where no ground truth existed, however, were a mess.
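Data leakage in this setting often comes from near-identical sequences (for example, close homologs of the same enzyme) landing on both sides of a random train/test split, so the test set quietly rewards memorization rather than generalization. Below is a hedged sketch of the difference, assuming sequences have already been grouped into similarity clusters (for instance with a clustering tool such as MMseqs2 or CD-HIT); the records and split fraction are hypothetical.

```python
import random

# Hypothetical records: (sequence_id, similarity_cluster_id, ec_number).
# In practice the cluster ids would come from sequence-identity clustering.
records = [
    ("P00001", "cluster_A", "2.7.1.1"),
    ("P00002", "cluster_A", "2.7.1.1"),   # near-duplicate of P00001
    ("P00003", "cluster_B", "3.1.3.16"),
    ("P00004", "cluster_C", "1.1.1.1"),
]

def random_split(rows, test_frac=0.25, seed=0):
    """Naive split: near-duplicates can land in both train and test (leakage)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_frac))
    return rows[n_test:], rows[:n_test]

def cluster_split(rows, test_frac=0.25, seed=0):
    """Leakage-aware split: whole similarity clusters go to either train or test."""
    clusters = sorted({cluster for _, cluster, _ in rows})
    random.Random(seed).shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    train = [r for r in rows if r[1] not in test_clusters]
    test = [r for r in rows if r[1] in test_clusters]
    return train, test

train, test = cluster_split(records)
assert not ({c for _, c, _ in train} & {c for _, c, _ in test})  # no shared clusters
```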

The gene YjhQ in E. coli, for example, was predicted to be a mycothiol synthase, yet mycothiol isn’t synthesized by E. coli at all! And the gene yciO, which evolved from an ancestor of the gene TsaC, had been shown in vivo a decade earlier to have a different function from TsaC, but the Nature Communications paper concluded that the two share the same function.

Of the 450 “novel results” given in the Nature Communications paper, 135 were not novel at all; they were already listed in the online database UniProt. Another 148 showed an unreasonably high level of repetition: the same enzyme function appeared up to 12 times among the genes of E. coli, which is biologically implausible.

Most of the “novel” results from the transformer paper were either not novel, unusually repetitious, or incorrect paralogs (Fig 5 from de Crecy, et al.)

The Microbiology Detective

How did these mistakes come to light? Dr. de Crecy Lagard had studied yciO extensively over a ten-year period, and she knew from her years of work in the lab that the deep learning prediction that yciO has the same function as another gene, TsaC, was incorrect. Her previous research had shown that TsaC is essential for E. coli despite the presence of yciO in the same genome; if the two genes had the same function, yciO should have been able to compensate. Moreover, the YciO activity reported by Kim et al. is four orders of magnitude (i.e., 10,000 times) weaker than that of TsaC. All of this indicates that YciO does not serve the same function as TsaC.

Two enzymes with a common evolutionary ancestor, but different functions (Fig 7 from de Crecy, et al.)

YciO and TsaC do have structural similarities, and YciO evolved from an ancestor of TsaC. Decades of research on protein and enzyme evolution have shown that new functions often evolve via duplication of an existing gene, followed by diversification of its function. This poses a common pitfall in determining enzyme function, because a gene will retain many similarities to the gene it was duplicated from and then diversified away from.

Thus, structural similarity is only one type of evidence for assigning enzyme function. It is also crucial to look at other types of evidence, such as the neighborhood context of the genes, substrate docking, gene co-occurrence in metabolic pathways, and other features of the enzymes.

It is important to look at multiple types of evidence when classifying enzyme function (Fig 2 from de Crecy, et al.)
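Purely as an illustration of this point (not a method from either paper; the evidence categories echo the figure above and the decision rule is hypothetical), one could imagine tracking several independent lines of evidence for each candidate annotation and trusting only assignments supported by more than structural similarity alone:

```python
from dataclasses import dataclass, fields

@dataclass
class FunctionEvidence:
    """Hypothetical checklist of independent evidence types for one candidate annotation."""
    structural_similarity: bool = False   # e.g., fold / active-site similarity
    gene_neighborhood: bool = False       # nearby genes in the same pathway
    substrate_docking: bool = False       # plausible substrate binding
    pathway_cooccurrence: bool = False    # co-occurs with pathway partners across genomes
    in_vivo_support: bool = False         # genetic or biochemical confirmation

def assessment(ev: FunctionEvidence) -> str:
    supporting = [f.name for f in fields(ev) if getattr(ev, f.name)]
    if ev.in_vivo_support:
        return f"well supported ({', '.join(supporting)})"
    if len(supporting) >= 2:
        return f"plausible, needs validation ({', '.join(supporting)})"
    return f"weak: structural similarity alone is not enough ({', '.join(supporting) or 'none'})"

# A prediction backed only by structural similarity, like yciO -> TsaC-like function:
print(assessment(FunctionEvidence(structural_similarity=True)))
```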

Hundreds of Likely Incorrect Results

After spotting this error, de Crecy Lagard and her coauthors decided to look more closely at all of the enzymes reported as new results in the Kim et al. paper. They found that 135 of the results were already in the online database that had been used to create the training set. A further 148 results showed a high level of repetition, with the same highly specific functions appearing up to 12 different times. Biases, data imbalance, a lack of relevant features, or architectural limitations can all cause models to “force” the most common labels from the training data onto new inputs.
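Both of these sanity checks, whether a “novel” prediction is already annotated in the source database, and whether the same specific function is assigned implausibly often within a single genome, are easy to automate. Here is a hedged sketch using assumed data structures (the dictionaries and threshold below are hypothetical, not taken from either paper):

```python
from collections import Counter

# Hypothetical predictions: gene -> predicted EC number (illustrative values).
predictions = {
    "yciO": "2.7.7.87",
    "yjhQ": "2.3.1.189",
    "geneX": "2.7.7.87",
    "geneY": "2.7.7.87",
}

# Hypothetical existing annotations from the training database (e.g., UniProt).
known_annotations = {
    "geneX": "2.7.7.87",   # already annotated -> not actually a novel result
}

# Check 1: "novel" predictions that are already in the database.
already_known = {g: ec for g, ec in predictions.items()
                 if known_annotations.get(g) == ec}

# Check 2: the same specific function assigned suspiciously often in one genome.
REPETITION_THRESHOLD = 2   # illustrative; the preprint flags functions repeated up to 12 times
function_counts = Counter(predictions.values())
over_repeated = {ec: n for ec, n in function_counts.items()
                 if n > REPETITION_THRESHOLD}

print("Already annotated:", already_known)      # {'geneX': '2.7.7.87'}
print("Suspiciously repeated:", over_repeated)  # {'2.7.7.87': 3}
```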

Other incorrect examples were found via literature searches or biological context. The gene YjhQ, for example, was predicted to be a mycothiol synthase, even though mycothiol cannot be synthesized in E. coli. YrhB was predicted to synthesize a compound already known to be synthesized by QueD; however, an E. coli mutant lacking functional QueD is unable to synthesize the compound even though yrhB is still present, showing that YrhB does not perform this function.

Rethinking Enzyme Classification and “True Unknowns”

Identifying the function of enzymes is actually two separate problems that are often conflated:

  • Propagating known function labels to enzymes within the same functional family.
  • Discovering truly unknown functions.

According to the authors of the second article, “supervised ML models cannot be used to predict true unknowns.” Making matters worse, many erroneous functions of different types have already been entered into important online databases like UniProt, and this incorrect data can be propagated further when it is used to train prediction models. This is a growing problem.
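The point about true unknowns follows directly from how a supervised classifier works: its output layer can only choose among the labels present in its training data, so a genuinely new function can never appear in its predictions. A minimal toy sketch (not from either paper) makes this concrete:

```python
# A supervised classifier can only ever emit labels it was trained on,
# so a truly novel enzyme function is outside its output space by construction.

TRAINING_LABELS = ["2.7.1.1", "3.1.3.16", "1.1.1.1"]   # toy label set

def predict(sequence: str) -> str:
    """Stand-in for any trained classifier: returns the argmax over known labels."""
    scores = {label: hash((sequence, label)) % 100 for label in TRAINING_LABELS}  # fake scores
    return max(scores, key=scores.get)

# Even for an enzyme whose real function is absent from the training set,
# the model is forced to pick one of the known labels.
novel_enzyme = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"
print(predict(novel_enzyme) in TRAINING_LABELS)   # always True
```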

Domain expertise is needed

It is not news that AI work is rewarded and promoted more than work that integrates deep domain understanding and closely examines the underlying data. The aptly titled paper “Everyone Wants to Do the Model Work, Not the Data Work” studied dozens of machine learning practitioners working on high-stakes AI projects and found that inadequate application-domain expertise was one of a few key causes of catastrophic failures.

These articles also serve as a good reminder of how difficult (or even impossible) it is to evaluate AI claims outside of our own areas of expertise. I am not an expert in the enzyme functions of E. coli. For most deep learning papers I have read, no domain expert has gone through the results with an ultra-fine-tooth comb to inspect the quality of the output. How many other papers that seem impressive would not hold up to such scrutiny? It is also worth noting that the work of checking hundreds of enzyme predictions is not as glamorous as the work of creating the AI model that generated them. How can we better encourage this type of error-checking research?

In a time of funding cuts, I believe we should do the opposite and invest even more in a range of scientific and biomedical research, approached from different angles. We need to resist an incentive system that is disproportionately focused on flashy AI solutions at the expense of quality results.

I look forward to your responses.



