BigMHC is a deep learning tool for predicting MHC-I (neo)epitope presentation and immunogenicity.
See the article (cited at the bottom of this document) for more information.
All data used in this research can be freely downloaded here.
```
git clone https://github.com/karchinlab/bigmhc.git
```

The repository is about 5 GB, so installation typically takes about 3 minutes, depending on internet speed.
Execution is OS agnostic and does not require GPUs.
Training models with large batch sizes (e.g. 32768) requires significant GPU memory (about 94 GB total). Transfer learning requires minimal GPU memory and can be reasonably conducted on a CPU.
All methods were tested on Debian 11 using Linux 5.10.0-19-amd64, AMD EPYC 7443P, and four RTX 3090 GPUs.
Software dependencies are listed below (the versions used in the paper are parenthesized).
- scipy (1.7.3)
- scikit-learn (1.0.2)
- matplotlib (3.5.3)
- seaborn (0.12.1)
- py3dmol (2.0.0.post2)
- logomaker (0.8)
- openpyxl (3.1.1)
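If you want to reproduce the paper's environment with pip, the pinned versions above can be collected into a requirements file like the following (this covers only the packages listed here; any other dependencies the scripts import are not pinned in this list):

```
scipy==1.7.3
scikit-learn==1.0.2
matplotlib==3.5.3
seaborn==0.12.1
py3Dmol==2.0.0.post2
logomaker==0.8
openpyxl==3.1.1
```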
There are two executable Python scripts in `src`: `predict.py` and `train.py`.

- `predict.py` is used for making predictions using BigMHC EL and BigMHC IM
- `train.py` allows you to train or retrain (transfer learning) BigMHC on new data
Both scripts, which can be run from any directory, offer help text:

```
python predict.py --help
python train.py --help
```
From within the `src` dir, you can execute the below examples:

```
python predict.py -i=../data/example1.csv -m=el -t=2 -d="cpu"
python predict.py -i=../data/example2.csv -m=el -a=HLA-A*02:02 -p=0 -c=0 -d="cpu"
```
Predictions will be written to `example1.csv.prd` and `example2.csv.prd` in the `data` folder. Execution takes a few seconds. Compare your output with `example1.csv.cmp` and `example2.csv.cmp`, respectively.
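If you want to inspect a `.prd` file programmatically, a small helper like the one below works. This is a sketch, not part of BigMHC: `top_predictions` is a hypothetical name, and the score column name must be read from the `.prd` header, since the exact column names depend on the model used.

```python
import csv

def top_predictions(path, score_col, n=5):
    """Return the n rows of a .prd CSV with the highest scores.

    score_col must match a column name in the .prd header;
    check the file for the exact name written by the model.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Sort descending by the numeric score and keep the top n rows.
    return sorted(rows, key=lambda r: float(r[score_col]), reverse=True)[:n]
```

For example, after running the first example above, `top_predictions("../data/example1.csv.prd", score_col=...)` (with the column name taken from the header) returns the strongest predicted binders.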
BigMHC only supports MHC-I. To handle different MHC naming schemes, BigMHC performs fuzzy string matching to find the nearest MHC by name. For example, HLA-A*02:01, A*02:01, HLAA0201, and A0201 are all considered valid and equivalent allele names. Additionally, synonymous substitutions and noncoding fields are handled, so HLA-A*02:01:01 is mapped to HLA-A*02:01.
We do not validate allele names. BigMHC will make predictions even if given nonsense or MHC-II input, as it will find the nearest valid MHC name to the provided invalid allele name. The list of alleles used in our multiple sequence alignment, to which input is mapped, can be found in the pseudosequences data file.
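The kind of name normalization described above can be illustrated with a short sketch. This is not BigMHC's actual matching code (which maps input to the alleles in the pseudosequences data file); the regex and canonical form here are assumptions for illustration only:

```python
import re

def normalize_allele(name: str) -> str:
    """Illustrative normalization of an MHC-I allele name.

    Strips punctuation, drops the HLA prefix, keeps the gene plus
    the two fields that define the protein, and rebuilds a
    canonical name. Sketch only; not BigMHC's implementation.
    """
    s = re.sub(r"[^A-Za-z0-9]", "", name.upper())   # "HLA-A*02:01:01" -> "HLAA020101"
    s = s[3:] if s.startswith("HLA") else s         # drop prefix -> "A020101"
    m = re.match(r"([A-Z]+)(\d{2})(\d{2})", s)      # gene, 2-digit group, 2-digit protein
    if not m:
        raise ValueError(f"unrecognized allele name: {name!r}")
    gene, group, protein = m.groups()               # trailing noncoding fields are ignored
    return f"HLA-{gene}*{group}:{protein}"

# All of these map to the same canonical name:
for raw in ["HLA-A*02:01", "A*02:01", "HLAA0201", "A0201", "HLA-A*02:01:01"]:
    assert normalize_allele(raw) == "HLA-A*02:01"
```

Note that, like BigMHC itself, this sketch does not reject nonsense input; it simply canonicalizes whatever matches the pattern.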
- `-i` or `--input` : input CSV file
  - Columns are zero-indexed
  - Must have a column of peptides
  - Can also have a column of MHC-I allele names
- `-m` or `--model` : BigMHC model to load
  - `el` or `bigmhc_el` to load BigMHC EL
  - `im` or `bigmhc_im` to load BigMHC IM
  - Can be a path to a BigMHC model directory
  - Optional for `train.py` (if a model dir is specified, then transfer learn)
- `-t` or `--tgtcol` : column index of target values
  - Elements in this column are considered ground truth values
- `-o` or `--out` : output directory
  - Directory to save model parameters for each epoch
  - Optional for transfer learning (defaults to the `model` arg)
- `-a` or `--allele` : allele name or allele column
  - If `allele` is a column index, then a single MHC-I allele name must be present in each row
  - If `allele` is an allele name, then that allele is used for every peptide in `input`
- `-p` or `--pepcol` : peptide column
  - The index of the `input` column containing one peptide sequence per row
- `-c` or `--hdrcnt` : header count
  - Skip the first `hdrcnt` rows before consuming `input`
- `-o` or `--out` : output file or directory
  - If using `predict.py`, save CSV data to this file
    - Defaults to `input`.prd
  - If using `train.py`, save the retrained BigMHC model to this directory
    - If transfer learning, defaults to the base model dir
- `-z` or `--saveatt` : boolean indicating whether to save attention values
  - Only available for `predict.py`
  - Use `1` for true and `0` for false
- `-d` or `--devices` : devices on which to run BigMHC
  - Set to `all` to utilize all GPUs
  - To use a subset of available GPUs, provide a comma-separated list of GPU device indices
  - Set to `cpu` to run on CPU (not recommended for large datasets)
- `-v` or `--verbose` : toggle verbose printing
  - Use `1` for true and `0` for false
- `-j` or `--jobs` : number of workers for parallel data loading
  - These workers are persistent throughout script execution
- `-f` or `--prefetch` : number of batches to prefetch per data loader worker
  - Increasing this number can help prevent GPUs from waiting on the CPU, but increases memory usage
- `-b` or `--maxbat` : maximum batch size
  - Turn this down if running out of memory
  - If using `predict.py`, defaults to a value estimated to fully occupy the device with the least memory
  - If using `train.py`, defaults to `32`
- `-s` or `--pseudoseqs` : CSV file mapping MHC to one-hot encoding
- `-l` or `--lr` : AdamW optimizer learning rate
  - Only available for `train.py`
- `-e` or `--epochs` : number of epochs for transfer learning
  - Only available for `train.py`
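Putting the column arguments together, a hypothetical `input` file with an allele column, a peptide column, and a binary target column might look like the fragment below. These rows and this column order are invented for illustration; see `data/example1.csv` for the actual format used by the examples.

```
HLA-A*02:01,SIINFEKL,1
HLA-B*07:02,TPRVTGGGAM,0
```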
```bibtex
@Article{Albert2023,
author={Albert, Benjamin Alexander and Yang, Yunxiao and Shao, Xiaoshan M. and Singh, Dipika and Smith, Kellie N. and Anagnostou, Valsamo and Karchin, Rachel},
title={Deep neural networks predict class I major histocompatibility complex epitope presentation and transfer learn neoepitope immunogenicity},
journal={Nature Machine Intelligence},
year={2023},
month={Jul},
day={20},
issn={2522-5839},
doi={10.1038/s42256-023-00694-6},
url={https://doi.org/10.1038/s42256-023-00694-6}
}
```
See the LICENSE file for licensing details.