iggytop.adapters.utils

This module contains utility functions for harmonizing data for iggytop.

Functions

aggregate_unique_joined(series[, separator])

Helper function to aggregate unique values into a joined string.

deduplicate_and_aggregate(adata, ...[, ...])

Deduplicates AnnData based on subset_cols and aggregates values in agg_cols.

get_file_checksum(file_path)

Calculates the SHA256 checksum of a file.

get_iedb_ids_batch(bc, epitopes[, chunk_size])

Retrieve IEDB IDs for multiple epitopes using batched requests.

get_mhc_class(allele)

Find MHC class information from the MHC gene (allele) names.

get_pmids_batch(bc, reference_urls[, chunk_size])

Retrieve PubMed IDs for multiple IEDB reference IDs using batched requests.

get_previous_release_metadata([repo_name])

Fetches metadata from the latest GitHub release of iggytop.

get_tissue_source(tissue)

Standardize tissue source information while staying close to the original values.

harmonize_sequences(bc, table)

Preprocesses CDR3 sequences, epitope sequences, and gene names in a harmonized way.

normalize_table_strings(table)

Normalize table values to strings or None only.

save_airr_cells_csv(airr_cells, directory)

Convert list of AirrCell objects to CSV format and save as compressed file.

save_airr_cells_json(airrcells, directory[, ...])

Save a list of AirrCell objects to a compressed JSON file with auto-generated filename.

iggytop.adapters.utils._get_epitope_data(bc, epitopes, base_url, match_type='exact')

Get epitope data.

Parameters:
  • bc (BioCypher) – BioCypher instance for the download

  • epitopes (list[str]) – List of epitope sequences to query

  • base_url (str) – Base URL for the API endpoint

  • match_type (str) – Type of matching to perform (“exact” or “substring”)

Return type:

list[dict]

Returns:

List of epitope data dictionaries

iggytop.adapters.utils._get_reference_data(bc, reference_ids, base_url)

Get reference data for PubMed ID mapping.

Parameters:
  • bc (BioCypher) – BioCypher instance for the download

  • reference_ids (list[int]) – List of IEDB reference IDs to query

  • base_url (str) – Base URL for the API endpoint

Return type:

list[dict]

Returns:

List of reference data dictionaries

iggytop.adapters.utils._is_ig_locus(locus)

Return True when a chain locus corresponds to BCR/IG chains.

Return type:

bool

iggytop.adapters.utils._process_cdr3_sequence(seq, is_igh=False)

Return type:

str | None

iggytop.adapters.utils._process_cdr3_with_j_gene(cdr3, species, j_symbol, is_igh)

Standardize CDR3 with tidytcells, but tolerate malformed J symbols.

Return type:

str | None

iggytop.adapters.utils._process_epitope_sequence(seq)

Remove flanking residues in epitope sequences.

Return type:

str | None

iggytop.adapters.utils._process_gene(gene, species, is_ig=False)

Return type:

str | None

iggytop.adapters.utils._process_mhc(gene, species, is_ig=False)

Return type:

str | None

iggytop.adapters.utils._set_up_config(output_format, cache_dir)

iggytop.adapters.utils._set_up_schema(cache_dir)

iggytop.adapters.utils.aggregate_unique_joined(series, separator='|')

Helper function to aggregate unique values into a joined string. Warns if any ‘nan’ strings are found.
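
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: it works on a plain Python iterable rather than a pandas Series to stay dependency-free, and the name `join_unique`, the first-occurrence ordering, and the skipping of None values are all assumptions.

```python
import warnings


def join_unique(values, separator="|"):
    """Join the distinct non-missing values of an iterable into one string.

    Hypothetical stand-in for aggregate_unique_joined; ordering and
    None handling are assumptions made for this sketch.
    """
    # Drop missing entries; everything else is compared as a string.
    present = [str(v) for v in values if v is not None]
    if "nan" in present:
        # A literal 'nan' string usually means a missing value was
        # stringified somewhere upstream.
        warnings.warn("Found literal 'nan' strings while aggregating values.")
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return separator.join(dict.fromkeys(present))


print(join_unique(["TRBV7-9", "TRBV7-9", None, "TRBV5-1"]))  # TRBV7-9|TRBV5-1
```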

iggytop.adapters.utils.deduplicate_and_aggregate(adata, subset_cols, agg_cols, separator='|')

Deduplicates AnnData based on subset_cols and aggregates values in agg_cols. Uses scirpy airr_context to access TCR-specific columns if needed.

iggytop.adapters.utils.get_file_checksum(file_path)

Calculates the SHA256 checksum of a file.

Return type:

str | None
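
A SHA256 file checksum of this kind is typically computed by streaming the file through `hashlib` in blocks. The sketch below shows that general technique. The name `sha256_of_file`, the `block_size` parameter, and returning None for a missing file are assumptions chosen to match the documented `str | None` return type, not confirmed behavior of the library.

```python
import hashlib
from pathlib import Path


def sha256_of_file(file_path, block_size=65536):
    """Return the hex SHA256 digest of a file, or None if it does not exist.

    Hypothetical stand-in for get_file_checksum.
    """
    path = Path(file_path)
    if not path.is_file():
        # Assumption: a missing file yields None rather than an exception.
        return None
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size blocks so large files never sit fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing the digest of a freshly built file against the one recorded for an earlier build is a cheap way to detect whether its content changed.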

iggytop.adapters.utils.get_iedb_ids_batch(bc, epitopes, chunk_size=150)

Retrieve IEDB IDs for multiple epitopes using batched requests.

First tries exact matches, then falls back to substring matches for unmatched epitopes.

Parameters:
  • bc (BioCypher) – BioCypher instance for the download

  • epitopes (list[str]) – List of epitope sequences to query

  • chunk_size (int) – Size of chunks to break epitopes into (to avoid URL length limits)

Return type:

dict[str, int]

Returns:

Dictionary mapping epitope sequences to their IEDB IDs (0 if not found)
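
The chunked, two-pass lookup described above (exact matches first, then substring matches for whatever is still unmatched) can be sketched without any network access. In this illustration the two lookup callables are hypothetical placeholders for the actual API requests, and `chunked` and `map_epitopes_to_ids` are names invented for the example.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def map_epitopes_to_ids(epitopes, lookup_exact, lookup_substring, chunk_size=150):
    """Map each epitope to an ID, using 0 for epitopes that are never matched.

    lookup_exact and lookup_substring are hypothetical callables that take a
    list of epitopes and return a dict of the matches they found.
    """
    ids = dict.fromkeys(epitopes, 0)
    for lookup in (lookup_exact, lookup_substring):
        # Only epitopes still unmatched are sent to the next lookup.
        pending = [epitope for epitope, found in ids.items() if found == 0]
        for chunk in chunked(pending, chunk_size):
            for epitope, iedb_id in lookup(chunk).items():
                if epitope in ids:
                    ids[epitope] = iedb_id
    return ids
```

Keeping each request to a bounded number of epitopes is what keeps the request URL under length limits, which is the stated purpose of `chunk_size`.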

iggytop.adapters.utils.get_mhc_class(allele)

Find MHC class information from the MHC gene (allele) names.

Parameters:

allele (str | None) – MHC allele name.

Return type:

str | None

Returns:

MHC class (I or II) or None if not found.
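
For human alleles, the class can usually be read off the gene prefix: HLA-A, -B, -C, -E, -F and -G are class I genes, while HLA-DR, -DQ, -DP, -DM and -DO are class II. The sketch below applies only that rule. The name `guess_mhc_class`, the restriction to HLA names, and the exact return strings are assumptions, and the library function may handle more species and naming variants.

```python
import re

# Class II genes all start with "D" followed by M, O, P, Q or R.
_CLASS_II = re.compile(r"^HLA-D[MOPQR]")
# Class I genes are a single letter; the lookahead rejects longer gene names.
_CLASS_I = re.compile(r"^HLA-[ABCEFG](?![A-Z])")


def guess_mhc_class(allele):
    """Return "I", "II", or None for a human HLA allele name.

    Hypothetical stand-in for get_mhc_class, limited to HLA nomenclature.
    """
    if not allele:
        return None
    name = allele.strip().upper()
    if _CLASS_II.match(name):
        return "II"
    if _CLASS_I.match(name):
        return "I"
    return None


print(guess_mhc_class("HLA-A*02:01"))     # I
print(guess_mhc_class("HLA-DRB1*15:01"))  # II
```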

iggytop.adapters.utils.get_pmids_batch(bc, reference_urls, chunk_size=150)

Retrieve PubMed IDs for multiple IEDB reference IDs using batched requests.

Parameters:
  • bc (BioCypher) – BioCypher instance for the download

  • reference_urls (list[int]) – List of IEDB reference URLs with IDs to query

  • chunk_size (int) – Size of chunks to break reference IDs into (to avoid URL length limits)

Return type:

dict[int, str]

Returns:

Dictionary mapping IEDB reference IDs to their PubMed IDs (None if not found)

iggytop.adapters.utils.get_previous_release_metadata(repo_name='iggytop/iggytop')

Fetches metadata from the latest GitHub release of iggytop.

Return type:

dict | None

iggytop.adapters.utils.get_tissue_source(tissue)

Standardize tissue source information while staying close to the original values. Could be improved.

Parameters:

tissue (str | None) – Original tissue source from database.

Return type:

str | None

Returns:

Standardized tissue name or original value.

iggytop.adapters.utils.harmonize_sequences(bc, table)

Preprocesses CDR3 sequences, epitope sequences, and gene names in a harmonized way. The following steps are performed:

  1. Clean epitope sequences (remove flanking residues)

  2. Add IEDB IRI and corresponding antigen information (species and antigen name) where missing

  3. Harmonize species terms for antigen species and receptor chain species

  4. Normalize VDJ-gene names to IMGT standards

  5. Clean CDR3 sequences (normalizes junction_aas)

  6. Convert MHC gene names to IMGT (for human)

Return type:

DataFrame

iggytop.adapters.utils.normalize_table_strings(table)

Normalize table values to strings or None only.

Return type:

DataFrame

iggytop.adapters.utils.save_airr_cells_csv(airr_cells, directory)

Convert list of AirrCell objects to CSV format and save as compressed file.

Parameters:
  • airr_cells (List) – List of AirrCell objects

  • directory (str) – Directory path where to save the CSV file (e.g., “../data”)

Return type:

None

iggytop.adapters.utils.save_airr_cells_json(airrcells, directory, filename=None, metadata=None)

Save a list of AirrCell objects to a compressed JSON file with auto-generated filename.

Parameters:
  • airrcells (List[AirrCell]) – List of AirrCell objects to save.

  • directory (str) – Directory path where to save the JSON file (e.g., “../data”).

  • filename (str, optional) – Filename without extension.

  • metadata (dict, optional) – Metadata to include in the JSON file.

Return type:

None
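
Writing records plus optional metadata to a gzip-compressed JSON file can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the name `save_records_json_gz`, the use of plain dicts instead of AirrCell objects, the date-based default filename, the `metadata`/`cells` layout, and returning the written path (the documented function returns None).

```python
import gzip
import json
from datetime import date
from pathlib import Path


def save_records_json_gz(records, directory, filename=None, metadata=None):
    """Write records and metadata to <directory>/<filename>.json.gz.

    Hypothetical stand-in for save_airr_cells_json using plain dicts.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: fall back to a date-stamped name when none is given.
    stem = filename or f"airr_cells_{date.today():%Y%m%d}"
    path = out_dir / f"{stem}.json.gz"
    payload = {"metadata": metadata or {}, "cells": list(records)}
    # Text mode lets json.dump write straight into the compressed stream.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path
```

The file can be read back with `gzip.open(path, "rt", encoding="utf-8")` followed by `json.load`.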