# DPLM-2 Structure Tokenizer
This repository contains the structure tokenizer used by DPLM-2, a multimodal diffusion protein language model for joint protein sequence and structure modeling. The tokenizer converts protein backbone/atom coordinates into discrete structure tokens and can decode structure tokens back into protein structures. DPLM-2 uses these tokens to support sequence-structure co-generation, forward folding, inverse folding, and motif scaffolding.
For the official implementation, installation instructions, DPLM-2 generation scripts, and evaluation utilities, see the [bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Model Details
- Checkpoint: `airkingbd/struct_tokenizer`
- Files: `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- Model class: `byprot.models.structok.structok_lfq.VQModel`
- Tokenizer type: LFQ-based discrete protein structure tokenizer
- Codebook size: 8,192 structure tokens (2^13; see the sketch after this list)
- Codebook embedding dimension: 13
- Encoder: GVP-based structure encoder
- Decoder: ESMFold-style structure decoder with decoder input dimension 128
- License: Apache-2.0
- Paper: *DPLM-2: A Multimodal Diffusion Protein Language Model*
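The codebook size and embedding dimension are linked: in a lookup-free quantizer (LFQ), each latent dimension is binarized, so a 13-dimensional code yields 2^13 = 8,192 possible tokens. The sketch below illustrates this indexing idea under the standard sign-based binarization used by LFQ; it is not taken from the DPLM-2 code.

```python
import torch

# Illustrative sketch of lookup-free quantization (LFQ) indexing, ASSUMING the
# standard sign-based scheme: each of the 13 latent dimensions is binarized by
# sign, and the resulting 13 bits form the token id, giving 2^13 = 8,192 tokens.

def lfq_token_index(z: torch.Tensor) -> torch.Tensor:
    """Map continuous latents of shape (..., 13) to integer token ids."""
    bits = (z > 0).long()                          # sign -> {0, 1} per dimension
    place_values = 2 ** torch.arange(z.shape[-1])  # binary place values
    return (bits * place_values).sum(dim=-1)       # token id in [0, 8191]

z = torch.randn(4, 13)        # four example latent vectors
print(lfq_token_index(z))     # e.g. tensor([5121,  734, 8000,   42])
```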
## Quick Start
Install the official DPLM codebase and dependencies:
```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```
Load the released structure tokenizer:
```python
from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```
The helper downloads this repository from Hugging Face, reads `config.yaml`, constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
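Once loaded, the tokenizer can encode coordinates into discrete token ids and decode them back into a structure. The round trip below is a sketch only: the method names (`tokenize`, `detokenize`), argument names, and tensor shapes are assumptions for illustration, so consult the official scripts (`tokenize_pdb.py`, `generate_dplm2.py`) for the exact API.

```python
import torch

# Sketch of a tokenize/detokenize round trip. Method names and signatures
# below are ASSUMPTIONS; the authoritative usage lives in the official
# repository scripts.
coords = torch.randn(1, 128, 37, 3).cuda()  # hypothetical atom coordinates (B, L, 37, 3)
res_mask = torch.ones(1, 128).cuda()        # 1 = residue present

struct_ids = struct_tokenizer.tokenize(coords, res_mask)     # discrete structure tokens
decoded = struct_tokenizer.detokenize(struct_ids, res_mask)  # structure rebuilt from tokens
```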
## Tokenize PDB Structures
The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for converting PDB files into structure-token FASTA files:
```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```
The script processes `*.pdb` files in the input folder and writes:
- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures
The structure sequences can be used as DPLM-2 structure-conditioning inputs.
For example, pass the generated structure-token FASTA file to
`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
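For downstream scripting, the emitted FASTA files can be read with any standard parser. Below is a minimal, dependency-free sketch; the record headers it returns depend on what `tokenize_pdb.py` wrote, so inspect your own output files.

```python
# Minimal FASTA reader for the tokenizer outputs (struct_seq.fasta and
# aa_seq.fasta). Plain-Python sketch; header naming follows whatever
# tokenize_pdb.py emitted for your inputs.
def read_fasta(path):
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

struct_seqs = read_fasta("/path/to/output/tokenized_protein/struct_seq.fasta")
print(f"{len(struct_seqs)} tokenized structures")
```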
## Use with DPLM-2
DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
For example:
```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```
The DPLM-2 configs point to this repository with:
```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```
## Citation
If you use this tokenizer, please cite the DPLM and DPLM-2 papers:
```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```
## Acknowledgements
DPLM builds on and acknowledges prior work and resources including ByProt, ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See the official repository for complete acknowledgements and implementation details.