Database Summary & Statistics#
This page provides a comprehensive overview of the Iggytop database created by the create_anndata.py pipeline. It reports details on the distribution of data sources, species, MHC information, and receptor quality.
This report summarizes the processed and deduplicated AIRR data used for downsteam analysis and knowledge graph construction. If you want to reporodice this, please run create_anndata.py to generate the data.
Downloading data from 'https://github.com/biocypher/iggytop/releases/download/data-2026.05.27.120918/metadata.json' to file '/home/docs/.cache/scirpy/iggytop/4499f0c466215a3145f40d888b30ee26-metadata.json'.
SHA256 hash of downloaded file: e15c02466c190eb0ba365fb3ffd6d232e9d4646f196a5e1f5fe12d2ea7b9e001
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.
Downloading file 'deduplicated_anndata.h5ad' from 'https://github.com/biocypher/iggytop/releases/download/data-2026.05.27.120918/deduplicated_anndata.h5ad' to '/home/docs/.cache/scirpy/iggytop/2026.05.27.120918'.
/home/docs/checkouts/readthedocs.org/user_builds/iggytop/envs/fruelingsputz/lib/python3.13/site-packages/anndata/utils.py:362: ExperimentalFeatureWarning: Support for Awkward Arrays is currently experimental. Behavior may change in the future. Please report any issues you may encounter!
warnings.warn(msg, category, stacklevel=stacklevel)
Downloading file 'merged_anndata.h5ad' from 'https://github.com/biocypher/iggytop/releases/download/data-2026.05.27.120918/merged_anndata.h5ad' to '/home/docs/.cache/scirpy/iggytop/2026.05.27.120918'.
Executive Summary#
A snapshot of the current database size and content after cross-database deduplication.
The database version used here is: data-2026.05.27.120918.
Total Records
556,866
Deduplicated Records
315,699
Reduction by 43.3%Unique Epitopes
8,571
Publications
4,083
TCR Records
308,601
BCR Records
7,098
Note: If you find this report to be outdated or encounter any issues with the data, please open an issue on GitHub.
Source and Organism Distributions#
We track where the data originates and the species diversity. Many entries are found in multiple databases simultaneously.
MHC Context Availability#
Availability of MHC class and gene information varies significantly by database origin. We visualize the proportion of records containing MHC Class I or Class II information.
Missing MHC class (unknown I/II): 8,630. Missing MHC gene (MHC_gene_1): Class I = 0, Class II = 0.
Source counts for 'HLA class I' entries:
source
IEDB 87078
CEDAR|IEDB 35
Name: count, dtype: int64
TCR Chain Configuration (QC)#
Using the scirpy.tl.chain_qc tool, we categorize the structural quality of the receptors. A high proportion of “Productive” pairs (Alpha+Beta) indicates better biological quality for structural modeling.
Sequence Distributions and Coverage#
CDR3 (junction_aa) and Epitope lengths for biological consistency validation. Consistent distributions across sources suggest cross-database compatibility.
Junction Sequence Logos by Receptor Type#
Because junction_aa sequences vary in length, logos are computed on fixed windows of 8 amino acids from the left (start) and from the right (end). For each receptor class (TCR, BCR), we show four logos side by side in a 2x2 layout: VJ-left, VJ-right, VDJ-left, and VDJ-right.
V/J Gene Presence
| Chain | Gene | Unique genes | Records | |
|---|---|---|---|---|
| 0 | VDJ_1 | J | 103 | 264369 |
| 1 | VDJ_1 | V | 760 | 272607 |
| 2 | VJ_1 | J | 193 | 119271 |
| 3 | VJ_1 | V | 713 | 127590 |
Records with complete annotations for both chains
(V gene + J gene + CDR3 amino-acid sequence for VJ_1 and VDJ_1):
81,930 / 315,699 (25.95%)