Database Summary & Statistics#

This page provides a comprehensive overview of the Iggytop database created by the create_anndata.py pipeline. It reports details on the distribution of data sources, species, MHC information, and receptor quality.

This report summarizes the processed and deduplicated AIRR data used for downstream analysis and knowledge graph construction. To reproduce it, run create_anndata.py to generate the data.

The data for this report are downloaded from the GitHub release at https://github.com/biocypher/iggytop/releases/download/data-2026.05.27.120918/ (files metadata.json, deduplicated_anndata.h5ad, and merged_anndata.h5ad) and cached locally.
SHA256 hash of the downloaded metadata.json: e15c02466c190eb0ba365fb3ffd6d232e9d4646f196a5e1f5fe12d2ea7b9e001
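
To confirm that a local copy of metadata.json matches the release, its SHA256 digest can be compared with the value reported above. The following is a minimal sketch using only the Python standard library; the helper name sha256_of_file and the local file path are hypothetical, and the same digest could instead be passed as known_hash to pooch.retrieve.

```python
import hashlib

# SHA256 digest reported for metadata.json in this report.
EXPECTED_SHA256 = "e15c02466c190eb0ba365fb3ffd6d232e9d4646f196a5e1f5fe12d2ea7b9e001"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path, expected=EXPECTED_SHA256):
    """True if the file at `path` (hypothetical local copy) has the expected digest."""
    return sha256_of_file(path) == expected
```

A mismatch would suggest that the file was changed or truncated after the release was published.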

Executive Summary#

A snapshot of the current database size and content after cross-database deduplication.

The database version used here is: data-2026.05.27.120918.

Total Records

556,866

Deduplicated Records

315,699

Reduction by 43.3%

Unique Epitopes

8,571

Publications

4,083

TCR Records

308,601

BCR Records

7,098
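
The headline figures above can be cross-checked with a few lines of arithmetic. This is only a sketch that re-enters the reported numbers by hand; the variable names are illustrative and are not taken from create_anndata.py.

```python
# Figures copied from the summary above.
total_records = 556_866
deduplicated_records = 315_699
tcr_records = 308_601
bcr_records = 7_098

# Fraction of records removed by cross-database deduplication.
reduction = (total_records - deduplicated_records) / total_records

# TCR and BCR records should add up to the deduplicated total.
receptor_sum = tcr_records + bcr_records

print(f"Reduction: {reduction:.1%}")
print(f"TCR + BCR = {receptor_sum:,}")
```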

Note: If you find this report to be outdated or encounter any issues with the data, please open an issue on GitHub.

Source and Organism Distributions#

We track which source databases each record originates from and which species are represented. Many entries appear in several databases at once.

../_images/5462093ffbf63a93bf74037667955ca40c57b19d584129977a892e3e7c698447.svg ../_images/6f2abd5d3a97dc4996e82846288b04fed4eafd50129f818835db530e4cdce5ba.svg ../_images/0d60ffda70b0757d7ae7a9e7dfc6e6f7962928928bb67cc8a9f0e4e4ec9da08c.svg ../_images/4c3611635b8ec1738a31797df98a45d527418828e2220ee6be1d45548c110aba.svg
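
Because one record can come from several databases, per-database totals need to count a record once for every database it appears in. The sketch below assumes that combined origins are stored as pipe-joined labels (the form "CEDAR|IEDB" appears in this report's output); the function name and the example labels are made up for illustration.

```python
from collections import Counter


def count_by_database(source_labels):
    """Count records per database, splitting pipe-joined labels such as 'CEDAR|IEDB'."""
    counts = Counter()
    for label in source_labels:
        # A set guards against a database being listed twice within one label.
        databases = {part.strip() for part in label.split("|") if part.strip()}
        counts.update(databases)
    return counts


# Made-up example labels, not real data.
example_labels = ["IEDB", "CEDAR|IEDB", "IEDB", "CEDAR|IEDB"]
per_database = count_by_database(example_labels)
print(dict(per_database))
```

Counted this way, the per-database totals can exceed the number of records, which is expected when sources overlap.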

MHC Context Availability#

Availability of MHC class and gene information varies significantly by database origin. We visualize the proportion of records containing MHC Class I or Class II information.

Missing MHC class (unknown I/II): 8,630. Missing MHC gene (MHC_gene_1): Class I = 0, Class II = 0.

../_images/10f79ea29c0bdcae36879bec8cc48ecaf4a991a3b9999a760cd5924152be5c38.svg
Source counts for 'HLA class I' entries:

Source        Count
IEDB          87,078
CEDAR|IEDB    35
../_images/4cf64fc19a58daeab3bde5b0483f7e016ef9474c7bf8fcc8d5aac5e50af11695.svg

TCR Chain Configuration (QC)#

Using the scirpy.tl.chain_qc tool, we categorize each record by its receptor chain configuration. A high proportion of “Productive” pairs (Alpha+Beta) means that more records carry both chains and are therefore better suited for structural modeling.

../_images/482ccaa8bc362c576e367caeb5d250584a73fb02e244545bd2c0535294965893.svg
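
To illustrate the idea behind the chain configuration categories, here is a simplified, pure-Python sketch that labels a record by how many VJ and VDJ chains it carries. It is not the scirpy implementation: the category names loosely follow scirpy's chain pairing labels, while the counting rule, the fallback label "other", and the example records are assumptions made for this example.

```python
from collections import Counter

# (number of VJ chains, number of VDJ chains) -> label.
# Simplified assumption; the real tool also considers loci and other flags.
PAIRING_LABELS = {
    (0, 0): "no IR",
    (1, 0): "orphan VJ",
    (0, 1): "orphan VDJ",
    (1, 1): "single pair",
    (2, 1): "extra VJ",
    (1, 2): "extra VDJ",
    (2, 2): "two full chains",
}


def classify_chain_pairing(vj_chains, vdj_chains):
    """Label a record from its lists of VJ and VDJ junction sequences."""
    key = (len(vj_chains), len(vdj_chains))
    return PAIRING_LABELS.get(key, "other")


# Made-up records: (VJ junctions, VDJ junctions).
example_records = [
    (["CAVRDSNYQLIW"], ["CASSLGQGAYEQYF"]),
    ([], ["CASSPGTGGNEQFF"]),
    (["CAVRDSNYQLIW"], []),
]
pairing_counts = Counter(classify_chain_pairing(vj, vdj) for vj, vdj in example_records)
print(dict(pairing_counts))
```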

Sequence Distributions and Coverage#

We compare the length distributions of CDR3 (junction_aa) and epitope sequences across sources as a check of biological consistency. Consistent distributions across sources suggest cross-database compatibility.

../_images/6b2baff3963623469aeebf18d62e5eec89f5e1b084d5d6aef66d87abfb110c64.svg ../_images/f81fc2278751f5475b960d43e49a120f9f0f90dabdf1bea7ab283c0acfb36498.svg
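
A per-source length distribution of this kind can be tabulated with a nested counter. The sketch below is an illustration on made-up records; the dictionary layout and the "source" key are assumptions, and only the field name junction_aa is taken from this report.

```python
from collections import Counter, defaultdict


def length_distribution_by_source(records, field):
    """Map each source to a Counter of sequence lengths for the given field."""
    distribution = defaultdict(Counter)
    for record in records:
        sequence = record.get(field)
        if sequence:  # skip records where the field is missing or empty
            distribution[record["source"]][len(sequence)] += 1
    return distribution


# Made-up example records, not real data.
example_records = [
    {"source": "IEDB", "junction_aa": "CASSLGQGAYEQYF"},
    {"source": "IEDB", "junction_aa": "CASSPGTGGNEQFF"},
    {"source": "CEDAR", "junction_aa": "CASSIRSSYEQYF"},
    {"source": "CEDAR", "junction_aa": None},
]
cdr3_lengths = length_distribution_by_source(example_records, "junction_aa")
for source, counts in cdr3_lengths.items():
    print(source, dict(counts))
```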

Junction Sequence Logos by Receptor Type#

Because junction_aa sequences vary in length, logos are computed on fixed windows of 8 amino acids taken from the left (start) and from the right (end) of each sequence. For each receptor class (TCR, BCR), we show four logos in a 2x2 grid: VJ-left, VJ-right, VDJ-left, and VDJ-right.

../_images/cbf3f71d07d2d756d64de5146181fa58b608c45da98c634997aa454de5c5c05a.png ../_images/b3fb3fa26bfb9518596f9ac8607a46edb13ed0260df9450e637b78000b7fe63c.png
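
The fixed-window approach described above can be sketched as a per-position frequency table, which is the input a sequence logo is drawn from. This is a minimal illustration and not the code behind the figures; in particular, skipping sequences shorter than the window is an assumption, and the function name and example sequences are made up.

```python
from collections import Counter

WINDOW = 8  # window length stated in the text


def window_frequencies(sequences, side, window=WINDOW):
    """Per-position amino-acid frequencies for the left or right window."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    columns = [Counter() for _ in range(window)]
    used = 0
    for sequence in sequences:
        if len(sequence) < window:
            continue  # assumption: sequences shorter than the window are skipped
        chunk = sequence[:window] if side == "left" else sequence[-window:]
        for position, residue in enumerate(chunk):
            columns[position][residue] += 1
        used += 1
    if used == 0:
        return []
    return [{aa: n / used for aa, n in column.items()} for column in columns]


# Made-up junction sequences; the last one is too short and is skipped.
example_junctions = ["CASSLGQGAYEQYF", "CASSPGTGGNEQFF", "CAS"]
left_profile = window_frequencies(example_junctions, "left")
right_profile = window_frequencies(example_junctions, "right")
```

Anchoring one window at each end keeps the conserved start and end of the junction aligned regardless of how long the variable middle region is.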

V/J Gene Presence

Chain   Gene   Unique genes   Records
VDJ_1   J      103            264,369
VDJ_1   V      760            272,607
VJ_1    J      193            119,271
VJ_1    V      713            127,590

Records with complete annotations for both chains
(V gene + J gene + CDR3 amino-acid sequence for VJ_1 and VDJ_1):

81,930 / 315,699 (25.95%)
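
The completeness criterion above (V gene, J gene, and CDR3 amino-acid sequence present for both VJ_1 and VDJ_1) can be expressed as a simple predicate. The sketch below assumes a hypothetical nested-dictionary record layout with AIRR-style field names (v_call, j_call, junction_aa); it is not the code used to produce the figure, and the example records are made up.

```python
REQUIRED_FIELDS = ("v_call", "j_call", "junction_aa")
CHAINS = ("VJ_1", "VDJ_1")


def is_complete(record):
    """True if both chains have a V gene, a J gene, and a CDR3 sequence."""
    return all(
        (record.get(chain) or {}).get(field)
        for chain in CHAINS
        for field in REQUIRED_FIELDS
    )


# Made-up records in the hypothetical layout.
example_records = [
    {
        "VJ_1": {"v_call": "TRAV12-2", "j_call": "TRAJ24", "junction_aa": "CAVRDSNYQLIW"},
        "VDJ_1": {"v_call": "TRBV6-5", "j_call": "TRBJ2-7", "junction_aa": "CASSLGQGAYEQYF"},
    },
    {
        "VJ_1": None,
        "VDJ_1": {"v_call": "TRBV6-5", "j_call": "TRBJ2-7", "junction_aa": "CASSPGTGGNEQFF"},
    },
]
n_complete = sum(is_complete(record) for record in example_records)

# Percentage reported above, recomputed from the two counts.
reported_share = round(100 * 81_930 / 315_699, 2)
print(n_complete, reported_share)
```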