DPLM-2 Structure Tokenizer

This repository contains the structure tokenizer used by DPLM-2, a multimodal diffusion protein language model for joint protein sequence and structure modeling. The tokenizer encodes protein backbone atom coordinates into discrete structure tokens and decodes structure tokens back into 3D protein structures. DPLM-2 uses these tokens to support sequence-structure co-generation, forward folding, inverse folding, and motif scaffolding.

For the official implementation, installation instructions, DPLM-2 generation scripts, and evaluation utilities, see the bytedance/dplm repository (https://github.com/bytedance/dplm).

Model Details

  • Checkpoint: airkingbd/struct_tokenizer
  • Files: config.yaml, dplm2_struct_tokenizer.ckpt
  • Model class: byprot.models.structok.structok_lfq.VQModel
  • Tokenizer type: LFQ-based discrete protein structure tokenizer
  • Codebook size: 8,192 structure tokens (2^13; see the LFQ sketch below)
  • Codebook embedding dimension: 13
  • Encoder: GVP-based structure encoder
  • Decoder: ESMFold-style structure decoder (decoder input dimension 128)
  • License: Apache-2.0
  • Paper: DPLM-2: A Multimodal Diffusion Protein Language Model
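
Because the tokenizer is LFQ-based (lookup-free quantization), the codebook is implicit: each of the 13 latent dimensions is binarized by its sign, and the resulting 13-bit string directly indexes one of 2^13 = 8,192 tokens, with no nearest-neighbor codebook lookup. The snippet below is a minimal illustrative sketch of this indexing scheme, not the repository's VQModel code:

import torch

def lfq_token_ids(z: torch.Tensor) -> torch.Tensor:
    # z: continuous encoder output of shape (..., 13).
    bits = (z > 0).long()                                      # sign of each latent dim -> one bit
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)   # 1, 2, 4, ..., 4096
    return (bits * powers).sum(dim=-1)                         # 13 bits -> token id in [0, 8191]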

Quick Start

Install the official DPLM codebase and dependencies:

git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh

Load the released structure tokenizer:

from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()

The helper downloads this repository from Hugging Face, reads config.yaml, constructs the VQModel, and loads the dplm2_struct_tokenizer.ckpt weights.
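
For reference, that download-and-load logic corresponds roughly to the sketch below; the actual byprot helper may differ in details such as the checkpoint's state-dict layout:

from huggingface_hub import snapshot_download
from omegaconf import OmegaConf
import torch

local_dir = snapshot_download("airkingbd/struct_tokenizer")    # fetches config.yaml and the .ckpt
cfg = OmegaConf.load(f"{local_dir}/config.yaml")               # model hyperparameters
ckpt = torch.load(f"{local_dir}/dplm2_struct_tokenizer.ckpt", map_location="cpu")
# byprot then instantiates VQModel from cfg and loads the checkpoint weights into it.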

Tokenize PDB Structures

The official repository provides src/byprot/utils/protein/tokenize_pdb.py for converting PDB files into structure-token FASTA files:

python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein

The script processes every *.pdb file in the input folder and writes two FASTA files:

  • struct_seq.fasta: tokenized structure sequences
  • aa_seq.fasta: amino-acid sequences extracted from the same structures

The structure sequences can be used as DPLM-2 structure-conditioning inputs. For example, pass the generated structure-token FASTA file to generate_dplm2.py --task inverse_folding --input_fasta_path ....
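
For example (paths are placeholders; see the official repository for the full set of generation flags):

python generate_dplm2.py \
    --task inverse_folding \
    --input_fasta_path /path/to/output/tokenized_protein/struct_seq.fasta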

Use with DPLM-2

DPLM-2 checkpoints load this tokenizer through their struct_tokenizer property. For example:

from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
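
Structure tokens produced by DPLM-2 can then be decoded back into backbone coordinates through this tokenizer's decoder. The method name in the sketch below (detokenize) is an assumption based on typical usage in the official generation scripts and may not match the exact API; consult bytedance/dplm for the authoritative calls:

import torch

# struct_tokens: LongTensor of shape (batch, length) with token ids in [0, 8191].
# NOTE: the method name below is assumed, not verified against the byprot API.
with torch.no_grad():
    decoder_out = struct_tokenizer.detokenize(struct_tokens)  # tokens -> predicted coordinates
# An ESMFold-style helper would then serialize the decoder output to PDB text.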

The DPLM-2 configs point to this repository with:

struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer

Citation

If you use this tokenizer, please cite the following DPLM papers:

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See the official repository for complete acknowledgements and implementation details.
