DePTH

DePTH provides neural network models for sequence-based TCR and HLA association prediction.

View the Project on GitHub Sun-lab/DePTH

DePTH Tutorial

DePTH 1.0 Authors: Si Liu (sliu3@fredhutch.org), Philip Bradley (pbradley@fredhutch.org), Wei Sun (wsun@fredhutch.org), Fred Hutchinson Cancer Center

DePTH 2.0 Authors: Fumin Li (lifm6@uw.edu), University of Washington, Si Liu (sliu3@fredhutch.org), Wei Sun (wsun@fredhutch.org), Fred Hutchinson Cancer Center

Maintainer: Si Liu (sliu3@fredhutch.org)

Latest revision: 06/20/2025

Introduction

A T cell relies on its T cell receptor (TCR) to recognize foreign antigens presented by a human leukocyte antigen (HLA), which is the human version of the major histocompatibility complex (MHC). Knowing the association between TCR and HLA can help with inferring past infection and with studying T cell-related treatments for certain diseases. We developed DePTH, a Python package that provides neural network models to predict the association between a given TCR and HLA based on their amino acid sequences.

The figure above shows the structure of the DePTH model. The HLA is translated into its pseudo sequence (the sequence of amino acids at certain important positions). The TCR comes in the format of beta chain V gene and CDR3, and the V gene is translated into CDR1, CDR2, and CDR2.5 amino acid sequences. There are two separate dense layers, one for HLA and one for TCR. The outputs of these two layers are concatenated and passed through one or two dense layers before reaching the final prediction.

The direct output given by our models is a score between 0 and 1, with a higher score corresponding to a higher predicted chance of association.
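
The data flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the DePTH implementation: the layer sizes, the ReLU and sigmoid activations, the padding lengths, and the random (untrained) weights are all made up, the HLA string is a toy placeholder rather than a real pseudo sequence, and the TCR is represented by its CDR3 alone for brevity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq, length):
    """Flatten a sequence into a one-hot vector, zero-padded to `length` positions."""
    vec = []
    for i in range(length):
        row = [0.0] * len(AMINO_ACIDS)
        if i < len(seq):
            row[AMINO_ACIDS.index(seq[i])] = 1.0
        vec.extend(row)
    return vec


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """One fully connected layer: activation(W x + b)."""
    weights, bias = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def predict_score(tcr_vec, hla_vec, layers):
    tcr_hidden = dense(tcr_vec, layers["tcr"], relu)   # dense layer for the TCR
    hla_hidden = dense(hla_vec, layers["hla"], relu)   # dense layer for the HLA
    merged = tcr_hidden + hla_hidden                   # list concatenation
    hidden = dense(merged, layers["hidden"], relu)     # shared dense layer
    return dense(hidden, layers["out"], sigmoid)[0]    # score between 0 and 1


rng = random.Random(0)
tcr_vec = one_hot("CASSEGQKETQYF", 20)   # CDR3 only (assumed padding length 20)
hla_vec = one_hot("ACDEFGHIKL", 10)      # toy placeholder, not a real pseudo sequence
layers = {
    "tcr": init_layer(8, len(tcr_vec), rng),
    "hla": init_layer(8, len(hla_vec), rng),
    "hidden": init_layer(4, 16, rng),
    "out": init_layer(1, 4, rng),
}
score = predict_score(tcr_vec, hla_vec, layers)
print(f"untrained score: {score:.3f}")
```

Because the weights are random, the printed score carries no biological meaning. The sketch only shows how two separately processed inputs end up as one value between 0 and 1.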

Installation

Create a conda virtual environment (DePTH requires Python >= 3.9):

conda create -n DePTH_env python=3.9

Activate the conda environment:

conda activate DePTH_env

Install the DePTH package in the activated conda environment:

python -m pip install DePTH

Once the installation is finished, you can try the following command:

DePTH -h

The following console output should show up:

usage: DePTH [-h] {train,predict,cv} ...

DePTH: a neural network model for sequence-based TCR and HLA association prediction

positional arguments:
  {train,predict,cv}  whether to train or predict
    train             train a DePTH model
    predict           load a trained DePTH model to make prediction
    cv                cross-validation for a specific hyperparameter setting

optional arguments:
  -h, --help          show this help message and exit

Running the command lines below will show more information on the input arguments for each option. The arguments are explained in more detail in the sections below.

DePTH predict -h
DePTH train -h
DePTH cv -h

Get prediction from DePTH 2.0 default models

DePTH default models were updated to DePTH 2.0 on Dec. 27, 2024. The updated default models were trained on data sets richer than those used for DePTH 1.0. DePTH 1.0 default models are kept as legacy and can be called by command lines specified at the end of this tutorial.

By default, DePTH provides two sets of models, one for TCR-HLA pairs involving HLA-I alleles and one for those involving HLA-II alleles. Each set contains 20 models trained with different sets of random seeds. For each TCR-HLA pair in the test file, the prediction score is the average of the scores given by the 20 models.
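
The averaging step can be illustrated with a few lines of Python. This is a minimal sketch: the per-model scores below are hypothetical, only three models are shown instead of 20, and the function name ensemble_score is not part of DePTH.

```python
from statistics import mean


def ensemble_score(per_model_scores):
    """Average the scores that the individual models give to one TCR-HLA pair."""
    return mean(per_model_scores)


# Hypothetical scores from three models for two pairs (the default sets use 20 models).
scores_by_pair = {
    ("TRBV9*01,CASSEGQKETQYF", "HLA-A*03:01"): [0.25, 0.5, 0.75],
    ("TRBV27*01,CASSSGTSGNNEQFF", "HLA-B*27:05"): [1.0, 0.5, 0.75],
}
averaged = {pair: ensemble_score(scores) for pair, scores in scores_by_pair.items()}
for (tcr, hla), score in averaged.items():
    print(tcr, hla, score)
```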

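The required inputs are the file of test pairs (--test_file), the HLA class (--hla_class, either HLA_I or HLA_II), and the output directory (--output_dir).

Format requirement on the file of test pairs: a CSV file with the header line tcr,hla_allele. Each tcr entry is a double-quoted string holding the beta chain V gene and the CDR3 amino acid sequence separated by a comma, and each hla_allele entry is an allele name such as HLA-A*03:01. An example file is shown further below.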

Example command line for HLA_I:

DePTH predict --test_file test_HLA_I_pairs.csv \
              --hla_class HLA_I \
              --output_dir test_HLA_I_output

Example command line for HLA_II:

DePTH predict --test_file test_HLA_II_pairs.csv \
              --hla_class HLA_II \
              --output_dir test_HLA_II_output

The output file named “predicted_scores.csv” will be created in the folder specified for output_dir. This file follows the format of the file of test pairs, with an additional column providing the prediction scores.

For example, in the case of HLA_I, suppose the first few lines of the file of test pairs are:

tcr,hla_allele
"TRBV9*01,CASSEGQKETQYF",HLA-A*03:01
"TRBV5-1*01,CASSLVGVTDTQYF",HLA-B*07:02
"TRBV27*01,CASSSGTSGNNEQFF",HLA-B*27:05
"TRBV7-9*01,CASSLGSSYEQYF",HLA-A*24:02
"TRBV5-1*01,CASSLATEGDTQYF",HLA-B*08:01
"TRBV5-8*01,CASSLGRENSPLHF",HLA-B*08:01

The first few lines of the output file test_HLA_I_output/predicted_scores.csv will be:

tcr,hla_allele,score
"TRBV9*01,CASSEGQKETQYF",HLA-A*03:01,0.27827388704754413
"TRBV5-1*01,CASSLVGVTDTQYF",HLA-B*07:02,0.1244928405387327
"TRBV27*01,CASSSGTSGNNEQFF",HLA-B*27:05,0.9632433295249939
"TRBV7-9*01,CASSLGSSYEQYF",HLA-A*24:02,0.03911237970751245
"TRBV5-1*01,CASSLATEGDTQYF",HLA-B*08:01,0.3890836928039789
"TRBV5-8*01,CASSLGRENSPLHF",HLA-B*08:01,0.2891890250146389

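The output file can be read with Python's standard csv module, which handles the quoted tcr column. The sketch below parses two rows copied from the example output above and splits each tcr entry into its V gene and CDR3. The function name read_scores and the returned dictionary keys are illustrative choices, not part of DePTH.

```python
import csv
import io

# Two rows copied from the example output above.
example_output = '''tcr,hla_allele,score
"TRBV9*01,CASSEGQKETQYF",HLA-A*03:01,0.27827388704754413
"TRBV27*01,CASSSGTSGNNEQFF",HLA-B*27:05,0.9632433295249939
'''


def read_scores(handle):
    """Parse a predicted_scores.csv-style file into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle):
        v_gene, cdr3 = row["tcr"].split(",")  # the tcr field is "V gene,CDR3"
        rows.append({
            "v_gene": v_gene,
            "cdr3": cdr3,
            "hla_allele": row["hla_allele"],
            "score": float(row["score"]),
        })
    return rows


rows = read_scores(io.StringIO(example_output))
top = max(rows, key=lambda r: r["score"])
print(top["v_gene"], top["cdr3"], top["hla_allele"], top["score"])
```

For a real run, the same function can be given an open file instead, for example open("test_HLA_I_output/predicted_scores.csv", newline="").
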
Train new models

Alternatively, users can train new models, either on the training and validation data files provided for HLA-I and HLA-II (the example below uses data/HLA_I_all_match/train_valid) or on new data files.

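The required inputs are the HLA class (--hla_class), the directory holding the training and validation data files (--data_dir), and the directory where the trained model is saved (--model_dir).

DePTH allows hyperparameters in multiple aspects: the encoding method (--enc_method), the learning rate (--lr), the number of dense layers (--n_dense), the number of units in each dense layer (--n_units_str), whether to use dropout (--dropout_flag), and the dropout probability (--p_dropout).

DePTH sets three random seeds for training: --rseed, --np_seed, and --tf_seed.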

Example command line for training a new model (HLA-I):

DePTH train --hla_class HLA_I \
            --data_dir data/HLA_I_all_match/train_valid \
            --model_dir saved_models/HLA_I/HLA_I_model_5779_7821_6367 \
            --enc_method one_hot \
            --lr 0.0001 \
            --n_dense 2 \
            --n_units_str [64,16] \
            --dropout_flag True \
            --p_dropout 0.2 \
            --rseed 5779 \
            --np_seed 7821 \
            --tf_seed 6367
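
When several models are trained with different seeds, the command lines can be generated rather than typed by hand. The sketch below is one possible way to do that, assuming the model directory naming pattern from the example above. The helper build_train_command is hypothetical, and the second seed triple is made up for illustration.

```python
import shlex


def build_train_command(rseed, np_seed, tf_seed):
    """Return the argument list for one HLA-I training run with the given seeds."""
    model_dir = f"saved_models/HLA_I/HLA_I_model_{rseed}_{np_seed}_{tf_seed}"
    return [
        "DePTH", "train",
        "--hla_class", "HLA_I",
        "--data_dir", "data/HLA_I_all_match/train_valid",
        "--model_dir", model_dir,
        "--enc_method", "one_hot",
        "--lr", "0.0001",
        "--n_dense", "2",
        "--n_units_str", "[64,16]",
        "--dropout_flag", "True",
        "--p_dropout", "0.2",
        "--rseed", str(rseed),
        "--np_seed", str(np_seed),
        "--tf_seed", str(tf_seed),
    ]


# The first triple is the one used in the example above; the second is made up.
seed_triples = [(5779, 7821, 6367), (1234, 5678, 9012)]
commands = [build_train_command(*seeds) for seeds in seed_triples]
for cmd in commands:
    print(shlex.join(cmd))  # shell-safe text that can be pasted into a terminal
```

Each list can also be passed directly to subprocess.run(cmd, check=True) in an environment where DePTH is installed.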

Possible issue with Mac and solutions:

When running the command lines above on Mac, zsh may treat the square brackets as a file name pattern and show the error message:

zsh: no matches found: [64,16]

There are two solutions. One is to put “\” before each of the brackets in the value specified for n_units_str:

DePTH train --hla_class HLA_I \
            --data_dir data/HLA_I_all_match/train_valid \
            --model_dir saved_models/HLA_I/HLA_I_model_5779_7821_6367 \
            --enc_method one_hot \
            --lr 0.0001 \
            --n_dense 2 \
            --n_units_str \[64,16\] \
            --dropout_flag True \
            --p_dropout 0.2 \
            --rseed 5779 \
            --np_seed 7821 \
            --tf_seed 6367

The other is to put the command in a .sh file, for example one named train_HLA_I.sh, with the following content:

#!/bin/bash

DePTH train --hla_class HLA_I \
            --data_dir data/HLA_I_all_match/train_valid \
            --model_dir saved_models/HLA_I/HLA_I_model_5779_7821_6367 \
            --enc_method one_hot \
            --lr 0.0001 \
            --n_dense 2 \
            --n_units_str [64,16] \
            --dropout_flag True \
            --p_dropout 0.2 \
            --rseed 5779 \
            --np_seed 7821 \
            --tf_seed 6367

and then do:

chmod +x train_HLA_I.sh
./train_HLA_I.sh
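
A third option, which relies on standard shell quoting rather than anything specific to DePTH, is to wrap the value in single quotes: --n_units_str '[64,16]'. The Python sketch below uses the standard shlex module, which follows POSIX shell word-splitting rules, to show that the escaped form and the quoted form deliver the same argument. The command is shortened for illustration, and shlex does not model zsh file name matching, so it demonstrates only the equivalence of the two spellings.

```python
import shlex

# Backslash-escaped brackets, as in the first solution above.
escaped = shlex.split(r"DePTH train --n_units_str \[64,16\]")

# Single-quoted value, the alternative spelling.
quoted = shlex.split("DePTH train --n_units_str '[64,16]'")

print(escaped)
print(quoted)
print(escaped == quoted)
```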

Cross-validation

To help with choosing hyperparameter settings, DePTH offers an option for 5-fold cross-validation under given hyperparameters.

In each fold, the training and validation TCR-HLA pairs are combined and randomly split again, such that the number of new training positive pairs equals that of the original training positive pairs, and the number of new training negative pairs equals that of the original training negative pairs. Each fold has a validation AUC, and the final output of cross-validation is the average of the five validation AUCs.
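
The re-splitting step can be sketched as follows. This is an illustration under assumptions, not the DePTH code: it assumes positive and negative pairs are shuffled and split separately, which is one straightforward way to keep both training counts unchanged, and the toy pair names and the function resplit are made up.

```python
import random


def resplit(train, valid, rng):
    """Combine train and valid, then split again with the original training counts.

    Each item is a (pair, label) tuple with label 1 for positive and 0 for negative.
    """
    new_train, new_valid = [], []
    for label in (1, 0):
        n_train = sum(1 for _, y in train if y == label)  # original training count
        pool = [item for item in train + valid if item[1] == label]
        rng.shuffle(pool)
        new_train.extend(pool[:n_train])
        new_valid.extend(pool[n_train:])
    return new_train, new_valid


# Toy data: 6 positive and 12 negative training pairs, 2 and 4 for validation.
train = [(f"pos{i}", 1) for i in range(6)] + [(f"neg{i}", 0) for i in range(12)]
valid = [(f"pos{i}", 1) for i in range(6, 8)] + [(f"neg{i}", 0) for i in range(12, 16)]

rng = random.Random(5779)
folds = [resplit(train, valid, rng) for _ in range(5)]
for new_train, new_valid in folds:
    n_pos = sum(y for _, y in new_train)
    print(n_pos, len(new_train) - n_pos, len(new_valid))
```

In an actual cross-validation run, a model would be trained on each new training set and evaluated on the matching new validation set, and the five validation AUCs would then be averaged.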

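The required inputs are the HLA class (--hla_class), the directory holding the training and validation data files (--data_dir), and the directory where the average validation AUC is written (--average_valid_dir).

The usage of the hyperparameter options is the same as that listed for training a new model.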

Example command line for cross-validation under certain hyperparameters for HLA-I:

DePTH cv --hla_class HLA_I \
         --data_dir data/HLA_I_all_match/train_valid \
         --average_valid_dir cv_average_valid/HLA_I/one_hot_5779_7821_6367 \
         --enc_method one_hot \
         --lr 0.0001 \
         --n_dense 2 \
         --n_units_str [64,16] \
         --dropout_flag True \
         --p_dropout 0.2 \
         --rseed 5779 \
         --np_seed 7821 \
         --tf_seed 6367

On Mac, the potential error message

zsh: no matches found: [64,16]

can be resolved in the same ways as in the case of training a new model.

The output will be a CSV file named average_validation_auc_roc.csv under the directory specified for average_valid_dir, with content in a format similar to the one below:

average_valid_auc
0.7897524952888488

The exact value may change across different platforms and different TensorFlow versions.
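
To compare hyperparameter settings, the value can be read back from each output directory. The sketch below assumes the file name and column header shown above. The helper read_average_auc is hypothetical, and the temporary directory only stands in for a real average_valid_dir.

```python
import csv
import tempfile
from pathlib import Path


def read_average_auc(average_valid_dir):
    """Read the single average validation AUC written by a cross-validation run."""
    path = Path(average_valid_dir) / "average_validation_auc_roc.csv"
    with path.open(newline="") as handle:
        row = next(csv.DictReader(handle))
    return float(row["average_valid_auc"])


# Stand-in for a real output directory, filled with the example content above.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "average_validation_auc_roc.csv"
    out_file.write_text("average_valid_auc\n0.7897524952888488\n")
    auc = read_average_auc(tmp)

print(auc)
```

With several output directories, one per hyperparameter setting, max(dirs, key=read_average_auc) would pick the setting with the highest average validation AUC.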

Get prediction from a new model

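Once a model is trained, it can be loaded to make predictions on test data.

The required inputs are the file of test pairs (--test_file), the HLA class (--hla_class), the output directory (--output_dir), the option --default_model False, the directory of the trained model (--model_dir), and the encoding method (--enc_method).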

Example command line for getting prediction scores from a new model (HLA-I):

DePTH predict --test_file test_HLA_I_pairs.csv \
              --hla_class HLA_I \
              --output_dir results/HLA_I/HLA_I_5779_7821_6367 \
              --default_model False \
              --model_dir saved_models/HLA_I/HLA_I_model_5779_7821_6367 \
              --enc_method one_hot

The output file of prediction scores follows the same format as in the case of getting predictions from the default models, except that the scores come from a single model instead of an average over 20 models.

Get prediction from legacy DePTH 1.0 default models

DePTH 1.0 default models are kept as legacy. They can be loaded to make predictions by additionally specifying the option:

--default_model legacy

under the same file format requirements as illustrated for the case of using DePTH 2.0 default models to make predictions. The full example command lines are as shown below:

Example command line for HLA_I:

DePTH predict --test_file test_HLA_I_pairs.csv \
              --hla_class HLA_I \
              --default_model legacy \
              --output_dir test_HLA_I_output

Example command line for HLA_II:

DePTH predict --test_file test_HLA_II_pairs.csv \
              --hla_class HLA_II \
              --default_model legacy \
              --output_dir test_HLA_II_output