Knowledge Graph Data Structure#

IggyTop is built on top of BioCypher. It provides a set of adapters and an ontology to generate knowledge graphs from TCR-epitope datasets.

The outputs aim to be compatible with the AIRR standards.

We aim to be integrated into scirpy, which offers a standardized way of analyzing T cell receptor (TCR) or B cell receptor (BCR) repertoires.

Data Generation Process#

The generation process relies on the create_knowledge_graph.py script. The pipeline follows these key steps:

1. Source Data Harmonization#

Similar to the tabular data structure, IggyTop leverages BioCypher adapters to read and harmonize data:

  • Mapping: Source formats are mapped to internal registry keys.

  • Gene Normalization: V(D)J genes are aligned with IMGT standards.

  • Sequence Processing: Harmonization of CDR3 and epitope sequences.
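The three steps above can be sketched in a few lines of Python. This is only an illustration: the column names, the internal keys, and the helper functions are hypothetical, and the allele-stripping rule is a simplified stand-in for real IMGT normalization, not the logic used by the IggyTop adapters.

```python
# Hypothetical column mapping for one source; real adapters define their own.
COLUMN_MAP = {
    "cdr3.beta": "trb_cdr3",
    "v.beta": "trb_v_gene",
    "antigen.epitope": "epitope_seq",
}

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(seq):
    """Trim and uppercase a sequence; return None if empty or non-standard."""
    if not seq:
        return None
    seq = seq.strip().upper()
    return seq if seq and set(seq) <= STANDARD_AA else None


def normalize_v_gene(gene):
    """Simplified stand-in for IMGT normalization: trim, uppercase, drop allele."""
    if not gene:
        return None
    return gene.strip().upper().split("*")[0] or None


def harmonize_record(raw):
    """Rename known columns to internal keys and clean each value."""
    out = {}
    for column, value in raw.items():
        key = COLUMN_MAP.get(column)
        if key is None:
            continue  # ignore columns this sketch does not map
        if key.endswith("_v_gene"):
            out[key] = normalize_v_gene(value)
        else:
            out[key] = clean_sequence(value)
    return out
```

For example, `harmonize_record({"cdr3.beta": " cassftdtqyf ", "v.beta": "TRBV6-2*01"})` yields a record with an uppercase CDR3 and the V gene without its allele suffix.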

2. Graph Construction#

Instead of just stacking tables, the pipeline uses the BioCypher framework to:

  • Instantiate Nodes for sequences (TRA, TRB, IGH, IGL) and epitopes based on the Ontology.

  • Create Edges representing the associations between receptors and epitopes.

  • Export the resulting graph to various formats (Neo4j, NetworkX, GraphML).
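As a rough sketch of the node and edge instantiation, the snippet below yields plain tuples in the shape BioCypher adapters commonly produce: (id, label, properties) for nodes and (id, source, target, label, properties) for edges. The record layout and the function names are hypothetical, and linking both chains to the epitope is an assumption made for illustration.

```python
# Hypothetical harmonized record that already carries node IDs.
record = {
    "tra_id": "tra:CAVTTDSWGKLQF:TRAV12-2",
    "trb_id": "trb:CASRPGLAGGRPEQYF:TRBV6-5",
    "epitope_id": "epitope:iedb:37257",
}


def build_nodes(record):
    """Yield (node_id, label, properties) tuples for one paired record."""
    yield (record["tra_id"], "tra sequence", {})
    yield (record["trb_id"], "trb sequence", {})
    yield (record["epitope_id"], "epitope", {})


def build_edges(record):
    """Yield (edge_id, source_id, target_id, label, properties) tuples."""
    tcr_to_epitope = "t cell receptor sequence to epitope association"
    pairs = [
        (record["tra_id"], record["trb_id"],
         "alpha sequence to beta sequence association"),
        (record["tra_id"], record["epitope_id"], tcr_to_epitope),
        # Assumption: the beta chain is linked to the epitope as well.
        (record["trb_id"], record["epitope_id"], tcr_to_epitope),
    ]
    for source, target, label in pairs:
        yield (f"{source}-{target}", source, target, label, {})


nodes = list(build_nodes(record))
edges = list(build_edges(record))
```

Generators like these keep memory use low for large sources, since tuples are produced one at a time rather than collected up front.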

3. Processing Options#

Users can customize the graph generation using several flags in create_knowledge_graph.py:

  • Receptor Types: Specify which receptors to include (e.g., --receptors TCR BCR).

  • Adapter Selection: Choose specific databases to include in the graph.

  • 10X Data Filtering: To address concerns about the confidence of some large-scale datasets, users can pass the --filter-10x flag to exclude data originating from the 10X Genomics dataset. This removes records in the source databases that stem from this dataset. The flag is also set for the released dataset (deduplicated_anndata.h5ad).

    Note: The ITRAP dataset also contains records derived from the 10X Genomics dataset. These are the pairs (5k out of 60k) that passed the ITRAP QC filtering and are therefore considered high quality, so --filter-10x does not remove them. If you want to exclude 10X data completely, consider excluding ITRAP from the pipeline as well.
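The filtering behaviour described above, including the ITRAP exception, can be pictured with a small sketch. The `adapter` and `from_10x` fields and the `apply_10x_filter` function are hypothetical; how the pipeline actually identifies 10X-derived records is not shown here.

```python
def apply_10x_filter(records):
    """Drop records flagged as 10X-derived unless they come from ITRAP."""
    return [
        r for r in records
        if not r.get("from_10x", False) or r.get("adapter") == "ITRAP"
    ]


# Made-up records for illustration only.
records = [
    {"id": 1, "adapter": "VDJDB", "from_10x": True},
    {"id": 2, "adapter": "VDJDB", "from_10x": False},
    {"id": 3, "adapter": "ITRAP", "from_10x": True},
]
kept = apply_10x_filter(records)
```

In this sketch record 1 is removed, while the ITRAP record stays even though it is 10X-derived. Removing it as well would require leaving ITRAP out of the adapter selection.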

Design Choices#

Ontology#

BioCypher uses the Biolink ontology and allows custom modifications, which are made through configuration files. The ontology used for IggyTop is defined in config/schema_config.yaml, which specifies the node and edge types and their hierarchy.

entity
├── association
│   ├── alpha sequence to beta sequence association
│   ├── b cell receptor sequence to epitope association
│   ├── heavy sequence to light sequence association
│   └── t cell receptor sequence to epitope association
└── named thing
    └── biological entity
        └── polypeptide
            ├── epitope
            └── immune receptor sequence
                ├── b cell receptor sequence
                │   ├── igh sequence
                │   └── igl sequence
                └── t cell receptor sequence
                    ├── tra sequence
                    └── trb sequence
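To make the hierarchy concrete, the tree above can be written as a child-to-parent map and walked upwards. This is an illustrative Python structure only; the actual definition lives in config/schema_config.yaml, and the `PARENT` map and `ancestors` helper are hypothetical names.

```python
# Child -> parent map transcribed from the tree above.
PARENT = {
    "association": "entity",
    "alpha sequence to beta sequence association": "association",
    "b cell receptor sequence to epitope association": "association",
    "heavy sequence to light sequence association": "association",
    "t cell receptor sequence to epitope association": "association",
    "named thing": "entity",
    "biological entity": "named thing",
    "polypeptide": "biological entity",
    "epitope": "polypeptide",
    "immune receptor sequence": "polypeptide",
    "b cell receptor sequence": "immune receptor sequence",
    "igh sequence": "b cell receptor sequence",
    "igl sequence": "b cell receptor sequence",
    "t cell receptor sequence": "immune receptor sequence",
    "tra sequence": "t cell receptor sequence",
    "trb sequence": "t cell receptor sequence",
}


def ancestors(term):
    """Return the chain of parents from a term up to the root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain
```

A lookup such as `ancestors("trb sequence")` shows why a query for any immune receptor sequence can also match TRB nodes: every TRB node inherits from that parent type.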

Node and Edge Types#

Nodes#

  • tra sequence

  • trb sequence

  • igh sequence

  • igl sequence

  • epitope

Edges#

  • alpha sequence to beta sequence association

  • heavy sequence to light sequence association

  • t cell receptor sequence to epitope association

  • b cell receptor sequence to epitope association

Uniqueness#

Immune receptor sequences are represented as nodes whose ID combines their type (tra, trb, igh, igl), the CDR3 sequence, and, if available, the V gene, separated by colons (see base-adapter).

Example node ID: trb:CASSFTDTQYF:TRBV6-2

Epitopes are represented as nodes whose ID starts with their type (epitope), followed by iedb: and the IEDB IRI if available, or by seq: and the amino acid sequence otherwise; see harmonize_sequences(). The IRIs are retrieved using the IEDB Database Query API; see get_iedb_ids_batch().

Example node IDs: epitope:iedb:37257, epitope:seq:SLSNRLYYL

Edges link two nodes; an edge ID is the source node ID and the target node ID joined by a hyphen.

Example edges: tra:CAVTTDSWGKLQF:TRAV12-2-trb:CASRPGLAGGRPEQYF:TRBV6-5, tra:CAVTTDSWGKLQF:TRAV12-2-epitope:iedb:37257
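The ID scheme can be reproduced with a few string helpers. The function names below are hypothetical and are not the ones used in the adapters; omitting the V gene segment when no gene is available is an assumption.

```python
def receptor_node_id(chain, cdr3, v_gene=None):
    """Build '<type>:<CDR3>[:<V gene>]' for a receptor sequence node."""
    parts = [chain, cdr3]
    if v_gene:  # assumption: the segment is left out when no V gene is known
        parts.append(v_gene)
    return ":".join(parts)


def epitope_node_id(sequence, iedb_id=None):
    """Prefer the IEDB identifier; fall back to the amino acid sequence."""
    if iedb_id is not None:
        return f"epitope:iedb:{iedb_id}"
    return f"epitope:seq:{sequence}"


def edge_id(source_id, target_id):
    """Join the source and target node IDs with a hyphen."""
    return f"{source_id}-{target_id}"
```

Because gene names such as TRAV12-2 contain hyphens themselves, an edge ID cannot be split back into its two node IDs by splitting on the hyphen alone; keep the source and target IDs alongside the edge if you need them separately.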

Output Formats and Availability#

The knowledge graph can be exported in several ways:

  1. Neo4j: Optimized for graph database queries. Check out the Docker guide in the README.

  2. NetworkX / GraphML: Useful for Python-based graph analysis and visualization in tools like Cytoscape.

  3. AIRR JSON: Although the data are natively a graph, the output can be converted back to the tabular AIRR format.

Bimonthly Releases#

Knowledge graph exports (e.g., in GraphML) are not yet provided in bimonthly releases.

Creating Your Own Graph#

You can run the graph generation locally to create custom subsets or use specific versions of the data:

python create_knowledge_graph.py --adapters VDJDB MCPAS --filter-10x

Note that some parameters are defined in config/biocypher_config.yaml. Edit this file for more control (e.g., to define the output type).

Assumptions#

During graph construction, redundant data (e.g., pairs reported in multiple databases) can be dropped; however, some information is lost in the process. See this issue.
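One way such a loss can arise is sketched below, assuming a first-wins rule keyed on the edge ID. This is not necessarily how the pipeline deduplicates; the `deduplicate` function, the `source_db` field, and the records are made up for illustration.

```python
def deduplicate(edges):
    """Keep the first edge seen for each edge ID; later duplicates are dropped."""
    unique = {}
    for edge in edges:
        unique.setdefault(edge["id"], edge)
    return list(unique.values())


# Made-up records: the same pair reported by two databases.
pair_id = "tra:CAVTTDSWGKLQF:TRAV12-2-epitope:iedb:37257"
edges = [
    {"id": pair_id, "source_db": "VDJDB"},
    {"id": pair_id, "source_db": "MCPAS"},
]
deduped = deduplicate(edges)
```

Under this rule only one edge survives, and the fact that a second database reported the same pair is no longer visible in the graph.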