# DPLM-2 Structure Tokenizer
This repository contains the structure tokenizer used by DPLM-2, a multimodal diffusion protein language model for joint protein sequence and structure modeling. The tokenizer converts protein backbone/atom coordinates into discrete structure tokens and can decode structure tokens back into protein structures. DPLM-2 uses these tokens to support sequence-structure co-generation, forward folding, inverse folding, and motif scaffolding.
For the official implementation, installation instructions, DPLM-2 generation scripts, and evaluation utilities, see the [bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Model Details
- Checkpoint: `airkingbd/struct_tokenizer`
- Files: `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- Model class: `byprot.models.structok.structok_lfq.VQModel`
- Tokenizer type: LFQ-based discrete protein structure tokenizer
- Codebook size: 8,192 structure tokens (2^13; see the sketch after this list)
- Codebook embedding dimension: 13
- Encoder: GVP-based structure encoder
- Decoder: ESMFold-style structure decoder with decoder input dimension 128
- License: Apache-2.0
- Paper: *DPLM-2: A Multimodal Diffusion Protein Language Model*
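The codebook size and embedding dimension are linked: in a lookup-free quantizer (LFQ), each latent dimension is binarized, so a 13-dimensional code yields 2^13 = 8,192 possible tokens. The sketch below illustrates this indexing idea under the standard sign-based binarization used by LFQ; it is not taken from the DPLM-2 code.

```python
import torch

# Illustrative sketch of lookup-free quantization (LFQ) indexing, ASSUMING the
# standard sign-based scheme: each of the 13 latent dimensions is binarized by
# sign, and the resulting 13 bits form the token id, giving 2^13 = 8,192 tokens.

def lfq_token_index(z: torch.Tensor) -> torch.Tensor:
    """Map continuous latents of shape (..., 13) to integer token ids."""
    bits = (z > 0).long()                          # sign -> {0, 1} per dimension
    place_values = 2 ** torch.arange(z.shape[-1])  # binary place values
    return (bits * place_values).sum(dim=-1)       # token id in [0, 8191]

z = torch.randn(4, 13)        # four example latent vectors
print(lfq_token_index(z))     # e.g. tensor([5121,  734, 8000,   42])
```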
## Quick Start
Install the official DPLM codebase and dependencies:
```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```
Load the released structure tokenizer:
```python
from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```
The helper downloads this repository from Hugging Face, reads `config.yaml`, constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
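Once loaded, the tokenizer can encode coordinates into discrete token ids and decode them back into a structure. The round trip below is a sketch only: the method names (`tokenize`, `detokenize`), argument names, and tensor shapes are assumptions for illustration, so consult the official scripts (`tokenize_pdb.py`, `generate_dplm2.py`) for the exact API.

```python
import torch

# Sketch of a tokenize/detokenize round trip. Method names and signatures
# below are ASSUMPTIONS; the authoritative usage lives in the official
# repository scripts.
coords = torch.randn(1, 128, 37, 3).cuda()  # hypothetical atom coordinates (B, L, 37, 3)
res_mask = torch.ones(1, 128).cuda()        # 1 = residue present

struct_ids = struct_tokenizer.tokenize(coords, res_mask)     # discrete structure tokens
decoded = struct_tokenizer.detokenize(struct_ids, res_mask)  # structure rebuilt from tokens
```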
## Tokenize PDB Structures
The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for converting PDB files into structure-token FASTA files:
```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```
The script processes `*.pdb` files in the input folder and writes:
- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures
The structure sequences can be used as DPLM-2 structure-conditioning inputs.
For example, pass the generated structure-token FASTA file to
`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
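For downstream scripting, the emitted FASTA files can be read with any standard parser. Below is a minimal, dependency-free sketch; the record headers it returns depend on what `tokenize_pdb.py` wrote, so inspect your own output files.

```python
# Minimal FASTA reader for the tokenizer outputs (struct_seq.fasta and
# aa_seq.fasta). Plain-Python sketch; header naming follows whatever
# tokenize_pdb.py emitted for your inputs.
def read_fasta(path):
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

struct_seqs = read_fasta("/path/to/output/tokenized_protein/struct_seq.fasta")
print(f"{len(struct_seqs)} tokenized structures")
```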
## Use with DPLM-2
DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
For example:
```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```
The DPLM-2 configs point to this repository with:
```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```
## Citation
If you use this tokenizer, please cite the DPLM and DPLM-2 papers:
```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```
## Acknowledgements
DPLM builds on and acknowledges prior work and resources including ByProt, ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See the official repository for complete acknowledgements and implementation details.