11.05.2026

Dark Proteome Unlocked: HLA-Compass AI Finds 17,082 Cancer Targets

What Is the Dark Proteome?Alitheabio

Dark proteome research has just reached a breakthrough milestone. Alithea Bio used HLA-Compass AI to identify 17,082 non-canonical peptides — including 11,553 tumour-specific targets — in just 7 hours. Here is what this means for TCR drugs and cancer vaccines.

dark proteome cancer targets identified by HLA-Compass AI

What Is the Dark Proteome?

The dark proteome refers to the vast set of proteins and peptides encoded by non-canonical regions of the human genome — sequences that traditional reference databases have long considered “non-coding” or biologically irrelevant. These include upstream open reading frames (uORFs), downstream ORFs (dORFs), long non-coding RNA-derived ORFs (lncRNA-ORFs), and other non-canonical open reading frames (ncORFs) that fall outside the approximately 19,500 canonical protein-coding genes.

For decades, these dark proteome regions were largely ignored in drug discovery. But a growing body of evidence — and now a landmark study published in Nature — confirms that the dark proteome is not empty. It is, in fact, teeming with biologically active molecules that could reshape cancer immunotherapy.

The Nature Paper That Changed Dark Proteome Research

In May 2026, the TransCODE Consortium published “Expanding the Human Proteome with Microproteins and Peptideins” in Nature — one of the most significant dark proteome papers in recent years.

The study, led by researchers including Eric Deutsch, Steven Carr, Jenn Abelin, Pouya Faridi, Nicola Ternette, John Prensner, and Sebastiaan van Heesch, analysed over 7,264 non-canonical open reading frames (ncORFs) across 95,520 proteomics experiments. Their core finding: approximately 25% of those ncORFs produce detectable peptides — confirming translation of sequences that have never appeared in standard protein databases.

The team identified around 1,785 previously unrecognised microproteins, introducing the new conceptual category of peptideins — microproteins under 50 amino acids with unusual properties, some of which are essential for cell survival and implicated in cancer biology. The human proteome, the paper concludes, is nearly 10% larger than previously recognised.

As co-lead Sebastiaan van Heesch noted, thousands of overlooked genetic sequences contribute to a new class of protein-like molecules that had been missed until now. All data were shared open-source to accelerate global research.

External reference: Nature — “Expanding the human proteome with microproteins and peptideins”

How Alithea Bio Searched the Dark Proteome

After reading the Nature paper, Alithea Bio’s scientists — Alisa Fuchs, PhD and Armanas Povilionis — immediately asked a more specific question: What happens when you search the world’s largest HLA peptide database against the full dark proteome catalogue?

Rather than relying on the 7,264-sequence initial catalogue from the Nature study, the team went further. They used a revised, expanded catalogue from Chothani 2026, comprising 28,359 non-canonical ORF sequences — covering lncRNAs, uORFs, dORFs, downstream overlapping ORFs, and more.

Using HLA-Compass AI’s fully automated reanalysis pipeline, the team searched thousands of real-world immunopeptidomics samples from the HLA-Compass database — one of the largest collections of healthy tissue, cancer cell line, and tumour tissue HLA peptidomics data in the world.

The results were ready in just 7 hours.

Dark Proteome Results: 17,082 Non-Canonical Peptides in 7 Hours

The dark proteome numbers are transformative:

MetricResult
Non-canonical peptides identified17,082 unique peptides
Tumour-specific peptides11,553 (not found in any healthy tissue)
ORF types coveredlncRNAs, uORFs, dORFs, and more
Catalogue usedChothani 2026 — 28,359 sequences
Time to results7 hours
PlatformHLA-Compass AI automated reanalysis pipeline

The most striking finding: 11,553 of those dark proteome peptides are tumour-specific — meaning they appear in cancer tissue but are absent from any healthy organ in the database. This represents a largely unexplored reservoir for TCR therapy development, cancer vaccine design, and neoantigen discovery.

Why the Dark Proteome Matters for TCR Drugs and Cancer Vaccines

The discovery of thousands of tumour-specific, non-canonical dark proteome peptides has direct implications for multiple drug development programmes:

TCR Therapy Development

T-cell receptor (TCR) therapies require high-confidence tumour-specific targets that are absent from healthy tissue. Off-target expression in vital organs is a primary safety concern. Dark proteome cancer targets — peptides derived from lncRNAs, uORFs, and dORFs that appear only in tumour tissue — represent a new class of candidates with a naturally favourable safety profile. HLA-Compass AI’s absolute quantification data allows researchers to verify both tumour presence and healthy tissue absence at scale, before committing to expensive experimental validation.

Cancer Vaccine Design

Personalised cancer vaccines depend on identifying neoantigens and tumour-associated antigens that the immune system can be trained to recognise. Non-canonical dark proteome peptides are particularly attractive vaccine targets because they are derived from sequences with no normal protein counterpart — reducing the risk of central tolerance and autoimmunity. The 11,553 tumour-specific peptides identified by Alithea Bio’s pipeline expand the universe of vaccine candidates dramatically.

Neoantigen Discovery

Traditional neoantigen pipelines focus on somatic mutations in canonical coding sequences. The dark proteome opens a parallel pathway: non-canonical sequences that are transcribed and translated specifically in cancer cells, independently of mutation status. This matters especially for tumour types with low mutational burden, where canonical neoantigen identification is difficult.

Neoantigens | NeoZOOM | HLA-Compass | Off-Target Toxicity Prediction

What Are lncRNAs, uORFs, and dORFs in the Dark Proteome?

The dark proteome comprises several categories of non-canonical open reading frames:

lncRNA-ORFs — Open reading frames embedded within long non-coding RNAs. These transcripts were historically assumed to have regulatory roles without producing proteins. Evidence now shows many are translated, producing small proteins with functional relevance in cancer biology.

uORFs (Upstream Open Reading Frames) — Short ORFs located in the 5′ untranslated region upstream of a canonical coding sequence. They can act as translational regulators and, when expressed aberrantly in cancer, produce peptides that may be presented on HLA molecules.

dORFs (Downstream Open Reading Frames) — ORFs located downstream of or overlapping with the main coding sequence. Like uORFs, they are translated in specific cellular contexts and may generate tumour-associated peptides invisible to standard proteomics approaches.

Together, these non-canonical sequences constitute the dark proteome — a parallel protein landscape that has been systematically overlooked until now.

How HLA-Compass AI Navigates the Dark Proteome

HLA-Compass AI is Alithea Bio’s SaaS platform — the world’s largest quantitative HLA peptide database, combined with an AI-powered analysis engine purpose-built for dark proteome discovery. Its key capabilities include:

  • Database depth: 1.4 million unique peptides, 10,000 cancer-specific epitopes, derived from over 17 million identifications across 4,160 high-quality HLA peptidomics samples
  • HLA allele mapping: All peptides mapped to cognate HLA alleles via Alithea’s proprietary binding predictors
  • Automated reanalysis: The fully automated pipeline can accept any protein sequence database and return results the next day
  • Quantification: Relative quantification across thousands of samples, enabling direct comparison of tumour vs. healthy tissue presentation
  • Off-target tools: Integrated tools to identify off-target risks from candidate peptide lists

The pipeline’s ability to complete a full dark proteome reanalysis against 28,359 sequences in just 7 hours is a direct result of this AI-powered infrastructure.

Validate Your Own Dark Proteome Catalogue Overnight

One of the most powerful outcomes of this work is that HLA-Compass AI’s pipeline can be applied to any dark proteome sequence database — not only Alithea’s own catalogue.

If your team has developed a proprietary catalogue of non-canonical ORF sequences, you no longer need to wait months for experimental mass spectrometry validation. Submit your catalogue to Alithea Bio’s pipeline and receive results against thousands of real-world healthy and cancer samples — delivered the next day.

  • Identify which dark proteome sequences produce HLA-presented peptides in relevant cancer types
  • Screen for healthy tissue expression before progressing candidates
  • Prioritise targets with the highest tumour-specificity before committing experimental resources

To learn more, reach out via the contact page, or meet the team at PEGS Boston and the TCR T Cell Congress.

Frequently Asked Questions About the Dark Proteome

What is the dark proteome?

The dark proteome is the collection of proteins and peptides encoded by non-canonical regions of the genome — including lncRNAs, uORFs, and dORFs — absent from standard protein reference databases. A 2026 Nature study confirmed that approximately 25% of 7,264 such regions produce detectable proteins, expanding the known human proteome by nearly 10%.

What are peptideins?

Peptideins are a newly defined class of dark proteome microproteins — under 50 amino acids — encoded by non-canonical open reading frames. Formally defined in the 2026 TransCODE Consortium paper in Nature, many are now being investigated as drug targets and disease biomarkers.

How many tumour-specific dark proteome peptides did Alithea Bio identify?

Alithea Bio identified 11,553 tumour-specific dark proteome peptides — present in cancer tissue but absent from all healthy tissue samples in the HLA-Compass database — out of 17,082 non-canonical peptides detected in total.

How quickly can Alithea Bio screen a dark proteome catalogue?

Using HLA-Compass AI’s automated pipeline, Alithea Bio can search any dark proteome sequence database against thousands of real-world immunopeptidomics samples and deliver results the next day.

Can I submit my own sequence catalogue for analysis?

Yes. Alithea Bio has optimised its pipeline to accept any protein sequence database. Contact the team via alithea-bio.com/contact to discuss your project.

What is HLA-Compass AI?

HLA-Compass AI is Alithea Bio’s SaaS platform for navigating the human immunopeptidome and dark proteome. It combines the world’s largest quantitative HLA peptide database with AI-powered tools for peptide-HLA binding prediction, off-target toxicity assessment, and automated database reanalysis.

Connect with Alithea Bio

Meet the team at PEGS Boston and the TCR T Cell Congress — or get in touch to discuss how HLA-Compass AI can accelerate your dark proteome programme.

Related reading: HLA Presented Peptides | Neoantigens | Immunopeptidomics Technology | Proteogenomics

Read more

Ovarian Cancer Awareness: Alithea Bio’s Commitment on WOCD 2026

Read
Joining the European Immunotherapy Community at CIMT 2026 alitheabio

Alithea Bio to Present HLA-Compass AI Breakthroughs at CIMT 2026 in Mainz

Read

5 Key Advancements in Cervical Carcinoma Research via the HLA-Compass Database

Read