Cancer Microbiome Reanalysis

A bioRxiv preprint contends that a 2020 paper linking bacteria to cancer was based on flawed data analysis

Colorized scanning electron micrograph of Group A Streptococcus (Streptococcus pyogenes) bacteria (yellow) and a human neutrophil (blue). Credit: NIAID

Introduction

In a bioRxiv preprint posted in July, which Derek Lowe compares to a ‘cruise missile’ aimed at the data analyses employed in a 2020 paper, Gihawi et al. identify two major data analysis flaws, either of which would have invalidated the 2020 paper’s findings.

Poore et al. claimed in 2020 that a large-scale analysis of DNA and RNA samples from 32 cancer types in the Cancer Genome Atlas (TCGA) identified microbial signatures that were highly predictive of cancer type and might even serve as a cancer diagnostic tool.

First Data Analysis Flaw

However, when Gihawi et al. re-analyzed the (next-generation) sequencing data, they found that the draft bacterial genome databases were contaminated with human sequences. Because human reads can match those contaminated entries, they were counted as microbial, which resulted in “false positives that were inflated by many orders of magnitude”. Further, the failure to filter out human or common vector sequences meant that those false positives were retained.
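To make the filtering step concrete, here is a minimal sketch of screening reads against a human reference before any microbial classification. It is not the method used in either paper: real pipelines align reads to a full human reference with a dedicated aligner, and the k-mer overlap test, the function names, the thresholds, and the toy sequences below are all assumptions made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_human_reads(reads, human_reference, k=8, max_shared_fraction=0.5):
    """Split reads into (kept, removed) by k-mer overlap with a human reference.

    Hypothetical helper: a read is removed when more than max_shared_fraction
    of its k-mers also occur in the human reference.
    """
    human_kmers = kmers(human_reference, k)
    kept, removed = [], []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            kept.append(read)  # read shorter than k: cannot be judged here
            continue
        shared = len(read_kmers & human_kmers) / len(read_kmers)
        (removed if shared > max_shared_fraction else kept).append(read)
    return kept, removed


# Toy sequences, invented for illustration only.
human_reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACA"
reads = [
    "TGCATGCCGATAGCTAGG",    # exact substring of the toy human reference
    "GGGGCCCCAAAATTTTGGCC",  # shares no 8-mers with the toy human reference
]

kept, removed = filter_human_reads(reads, human_reference)
print("kept:", kept)
print("removed:", removed)
```

Only the reads in `kept` would go on to be matched against a microbial database. If the human-derived read were left in and the database itself contained human contamination, it could be reported as a bacterial hit.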

Second Data Analysis Flaw

The second flaw arose from the use of (Voom-SNM) normalized data in which samples were artificially tagged by cancer type. This resulted in highly accurate classifiers whose signal was not reflective of the raw data.

It is common to use normalization methods to mitigate batch effects in large-scale studies. However, the normalization process applied by Poore et al. artificially tagged cancer samples based on their types. These ‘tags’, when used to train machine learning models, produced near-perfect (95–100%) accuracy.

As an example, the raw (sequencing) read counts for Hepandensovirus in the 79 Adrenocortical Carcinoma (ACC) samples were 0. After normalization, 90% of the ACC samples were assigned the (tagged) value of 3.078874655 by Poore et al. Out of 17,625 total samples, only 77 additional samples were assigned values lower than or equal to 3.078874655. This enabled the classifier to accurately identify ACC samples, despite the original raw reads being 0.
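One way to audit for this kind of artifact is to look only at samples whose raw count is 0 and ask whether their normalized values still differ by cancer type. The sketch below is an assumption-laden illustration, not the preprint's code: the function names and tolerance are hypothetical, the data are synthetic, and only the value 3.078874655 is taken from the example above.

```python
from collections import defaultdict
from statistics import median


def zero_count_medians(raw_counts, normalized_values, labels):
    """Median normalized value per label, using only samples with raw count 0."""
    groups = defaultdict(list)
    for count, value, label in zip(raw_counts, normalized_values, labels):
        if count == 0:
            groups[label].append(value)
    return {label: median(values) for label, values in groups.items()}


def is_label_dependent(raw_counts, normalized_values, labels, tolerance=1e-6):
    """Flag a feature whose zero-count samples get label-specific values."""
    medians = zero_count_medians(raw_counts, normalized_values, labels)
    if len(medians) < 2:
        return False
    return max(medians.values()) - min(medians.values()) > tolerance


# Synthetic data for one feature: every raw count is 0.
labels = ["ACC"] * 4 + ["OTHER"] * 4
raw_counts = [0] * 8
# The ACC value comes from the article; the OTHER values are invented.
normalized_values = [3.078874655] * 4 + [4.2, 4.5, 4.1, 4.8]

medians = zero_count_medians(raw_counts, normalized_values, labels)
flagged = is_label_dependent(raw_counts, normalized_values, labels)
print(medians, flagged)
```

Identical raw input (here, zero reads) should not map to output that depends on the label. A feature flagged this way carries information that the normalization put there, which a classifier will happily exploit.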

Bioinformatics Analysis Blind Spots

Putting aside the biological link between microbes and cancer, this preprint provides a glimpse into a few blind spots that may arise from such bioinformatics analysis:

  • contaminated draft genome databases
  • inaccurately normalized data used to train machine learning models
  • use of processed data (aligned sequences in BAM format) instead of raw sequence data
  • lack of additional controls or tests
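As one illustration of the kind of additional control the last point refers to, the same classifier can be scored on raw and on normalized data, with a large gap treated as a red flag. This is a toy sketch under stated assumptions: the one-feature nearest-centroid classifier, the function name, and the synthetic values are all hypothetical and far simpler than the models used in the original study.

```python
from statistics import mean


def loo_nearest_centroid_accuracy(values, labels):
    """Leave-one-out accuracy of a one-feature nearest-centroid classifier."""
    correct = 0
    for i, (value, label) in enumerate(zip(values, labels)):
        # Build class centroids from every sample except the held-out one.
        by_class = {}
        for j, (other_value, other_label) in enumerate(zip(values, labels)):
            if j != i:
                by_class.setdefault(other_label, []).append(other_value)
        centroids = {name: mean(vals) for name, vals in by_class.items()}
        # Ties resolve to the alphabetically first label.
        predicted = min(sorted(centroids), key=lambda name: abs(value - centroids[name]))
        correct += predicted == label
    return correct / len(values)


# Synthetic single feature: raw counts are all 0, normalized values are not.
labels = ["ACC"] * 4 + ["OTHER"] * 4
raw = [0] * 8
normalized = [3.078874655] * 4 + [4.2, 4.5, 4.1, 4.8]  # OTHER values invented

raw_accuracy = loo_nearest_centroid_accuracy(raw, labels)
normalized_accuracy = loo_nearest_centroid_accuracy(normalized, labels)
print("raw:", raw_accuracy, "normalized:", normalized_accuracy)
```

On this toy data the raw feature cannot separate the two groups at all, while the normalized feature separates them perfectly. A gap like that says the signal came from the processing step, not from the sequencing reads.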

Indeed, the preprint authors’ conclusions are befitting of a written cruise missile:

“Our conclusion after re-analysis is that the near-perfect association between microbes and cancer types reported in the study is, simply put, a fiction.”

For help or a second opinion (audit) on your bioinformatics analysis, reach out at blindspotbio.

Updated Jul. 19, 2024: the 2020 paper has been officially retracted.

Learn More

Also published on blindspotbio blog.


blindspotbio: research&bioinformatics as a service

Highlighting publicly available datasets and blind spots found in omics data analysis. Visit https://blindspotbio.com for research & bioinformatics services.