Instructions for using Swindl/GLUS-S with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Swindl/GLUS-S with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Swindl/GLUS-S")

# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("Swindl/GLUS-S")
model = AutoModelForCausalLM.from_pretrained("Swindl/GLUS-S")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Swindl/GLUS-S with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Swindl/GLUS-S"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Swindl/GLUS-S",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/Swindl/GLUS-S
```
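Once the server is running, it can also be called from Python instead of curl. Below is a minimal sketch using only the standard library; it assumes the default vLLM port 8000, and the helper names are ours, not part of vLLM:

```python
import json
import urllib.request

def build_completion_request(prompt, model="Swindl/GLUS-S",
                             max_tokens=512, temperature=0.5):
    """Build the JSON payload for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, base_url="http://localhost:8000"):
    """POST the payload to a running vLLM server and return the first completion text."""
    payload = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

The official `openai` Python client works equally well here by pointing its `base_url` at the local server.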
- SGLang
How to use Swindl/GLUS-S with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Swindl/GLUS-S" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Swindl/GLUS-S",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Swindl/GLUS-S" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Swindl/GLUS-S",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use Swindl/GLUS-S with Docker Model Runner:
```shell
docker model run hf.co/Swindl/GLUS-S
```
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
[Project Page] [arXiv] [GitHub]
Overview
RefVOS in complex scenarios places high demands on a model's video understanding and fine-grained localization capabilities. Recently, numerous models leveraging MLLM-based comprehension and reasoning abilities have been proposed to address this challenge. GLUS advances further along this methodological path.
GLUS is principled. It uses global-local reasoning to combine holistic video understanding with detailed frame-level understanding, unleashing the potential of fine-grained segmentation in complex scenarios.
GLUS is powerful. It unifies a memory bank, object contrastive learning, and key frame selection to tackle mask inconsistency and object obfuscation, achieving state-of-the-art performance on complex-scenario RefVOS tasks.
GLUS is simple. It elegantly integrates the approach for complex-scenario RefVOS within a single MLLM framework, eliminating the need for additional independent modules.
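The global-local idea can be illustrated with a simple frame-sampling sketch: a sparse set of frames spanning the whole clip provides global context, while a dense window around a frame of interest provides local detail. The function below is our illustration of the concept, not the paper's exact sampling scheme:

```python
def global_local_frames(num_frames, num_global=4, local_center=None, local_window=2):
    """Pick sparse 'global' frame indices across the whole clip plus a dense
    'local' window around a frame of interest (illustrative sketch only)."""
    # Sparse, evenly spaced global frames covering the full video.
    step = max(1, num_frames // num_global)
    global_idx = list(range(0, num_frames, step))[:num_global]
    # Dense, contiguous local frames around the center frame.
    if local_center is None:
        local_center = num_frames // 2
    lo = max(0, local_center - local_window)
    hi = min(num_frames, local_center + local_window + 1)
    local_idx = list(range(lo, hi))
    return global_idx, local_idx

g, l = global_local_frames(32, num_global=4, local_center=10, local_window=2)
# g spans the clip sparsely; l is a contiguous window around frame 10.
```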
Installation
```shell
git clone git@github.com:GLUS-video/GLUS.git && cd GLUS
pip install -r requirements.txt
pip install ./model/segment-anything-2
pip install flash-attn==2.6.2 --no-build-isolation
```
Model Zoo
For convenience, we also provide a checkpoint of GLUS trained without object contrastive learning.
| Model | Training Datasets | Methods | Download | MeViS J&F | Ref-YouTube-VOS J&F |
|---|---|---|---|---|---|
| GLUS-S (partial) | MeViS, Ref-YouTube-VOS | GLU + MB | HuggingFace, ModelScope | 49.5 | 65.2 |
| GLUS-S | MeViS, Ref-YouTube-VOS | GLU + MB + OC + KFS | HuggingFace, ModelScope | 50.3 | 66.6 |
| GLUS-A | + Ref-DAVIS17, ReVOS, LVVIS | GLU + MB | HuggingFace, ModelScope | 51.3 | 67.3 |
Notes: "GLU": global-local unification, "MB": end-to-end memory bank, "OC": object contrastive loss, "KFS": key frame selection. GLUS-S refers to the model trained on a subset of existing RefVOS datasets (MeViS and Ref-YouTube-VOS), while GLUS-A denotes the model trained on the full set of available datasets.
We recommend downloading and storing the pretrained weights at GLUS_ROOT/checkpoints.
Training and Validation
1. Data Preparation
Please prepare the datasets following the directory structure below. We recommend setting DATASET_ROOT to GLUS_ROOT/data.
- RefVOS Datasets: MeViS, Refer-YouTube-VOS, Ref-DAVIS17.
- Reasoning VOS Datasets: ReVOS, ReasonVOS.
- Open-Vocabulary Video Instance Segmentation Dataset: LV-VIS.
Dataset Structure
```
DATASET_ROOT
├── mevis
│   ├── train
│   │   ├── JPEGImages
│   │   ├── mask_dict.json
│   │   └── meta_expressions.json
│   ├── valid
│   │   ├── JPEGImages
│   │   └── meta_expressions.json
│   └── valid_u
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Refer-YouTube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── Annotations
│   └── valid
│       └── JPEGImages
├── DAVIS17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── Annotations
│   └── valid
│       ├── JPEGImages
│       └── Annotations
├── LVVIS
│   ├── train
│   │   └── JPEGImages
│   ├── mask_dict.json
│   └── meta_expressions.json
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
└── ReasonVOS
    ├── JPEGImages
    ├── Annotations
    └── meta_expressions.json
```
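Before training, it can save time to sanity-check that the datasets are laid out as expected. A small sketch that verifies a representative subset of the tree above (the helper and the path list are ours, not part of the repository):

```python
import os

# A representative subset of expected paths under DATASET_ROOT,
# following the dataset structure above.
EXPECTED = [
    "mevis/train/JPEGImages",
    "mevis/train/mask_dict.json",
    "mevis/train/meta_expressions.json",
    "mevis/valid_u/JPEGImages",
    "Refer-YouTube-VOS/train/JPEGImages",
    "ReVOS/JPEGImages",
]

def missing_paths(dataset_root, expected=EXPECTED):
    """Return the expected paths that do not exist under dataset_root."""
    return [p for p in expected if not os.path.exists(os.path.join(dataset_root, p))]

# Example: print anything missing before launching training.
# for p in missing_paths("data"):
#     print("missing:", p)
```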
2. Model Weights Preparation
Follow the guidance below to prepare the pretrained weights of LISA and SAM-2 for training GLUS:
- Download the pretrained weights of LISA from LISA-7B-v1.
- Download the pretrained weights of SAM-2 from sam2_hiera_large.
Then organize them in the following structure:
```
WEIGHTS_ROOT
├── LISA-7B-v1
└── sam2_hiera_large.pt
```
We recommend setting WEIGHTS_ROOT to GLUS_ROOT/checkpoints.
3. Training
Set the paths in the scripts and then run scripts/train_glus_s.sh or scripts/train_glus_a.sh. The scripts automatically start training and convert the saved checkpoint into Hugging Face format when training finishes.
Key Frame Selection
For the usage of key frame selection, please refer to the KFS_README.
4. Evaluation
Set the paths, val_set, and set_name in scripts/inference.sh, and then run it. It first detects the available GPUs and then runs inference in parallel, one shard per GPU.
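The parallel inference described above amounts to partitioning the validation videos across the detected GPUs so each shard can be processed independently. A minimal sketch of such a partition (the helper is ours, not the script's actual code):

```python
def partition_videos(videos, num_gpus):
    """Split a list of videos into near-equal shards, one per GPU,
    by dealing them out round-robin."""
    shards = [[] for _ in range(num_gpus)]
    for i, video in enumerate(videos):
        shards[i % num_gpus].append(video)
    return shards

# Example: 7 videos across 3 GPUs -> shards of size 3, 2, 2.
shards = partition_videos([f"video_{i}" for i in range(7)], num_gpus=3)
```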
Evaluation with Key Frame Selection
Set the arguments use_kf and kf_path in scripts/inference_kf.sh, and then run it. We provide our JSON files for MeViS and Ref-YouTube-VOS for GLUS-S on Google Drive.
After all masks are generated, run the corresponding evaluation Python file in utils. You may need to set the ground-truth mask path, predicted mask path, and expressions JSON file path. Please refer to the evaluation files for help on the arguments.
An example:
```shell
python utils/eval_mevis.py \
  --mevis_exp_path="$GLUS_ROOT/data/mevis/valid_u/meta_expressions.json" \
  --mevis_mask_path="$GLUS_ROOT/data/mevis/valid_u/mask_dict.json" \
  --mevis_pred_path="$GLUS_ROOT/generated"
```
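The J score reported by these evaluation scripts is the region similarity, i.e. the intersection-over-union between the predicted and ground-truth masks (F additionally measures boundary accuracy). A minimal sketch of J on binary masks, for illustration only and not the repository's exact evaluation code:

```python
import numpy as np

def region_similarity_j(pred, gt):
    """Region similarity J: IoU between a binary predicted mask and ground truth."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    inter = np.logical_and(pred, gt).sum()
    return inter / union
```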
In particular, to evaluate performance on the Refer-YouTube-VOS valid or MeViS valid benchmarks, you need to submit the predicted mask results following the guidance at the MeViS evaluation server or the Ref-YouTube-VOS evaluation server.
Inference and Demo
Please refer to demo.ipynb to run inference on your own videos and referring expressions.
For more examples, please refer to our Project Page.
Citation
If you find this work useful in your research, please consider citing:
@inproceedings{lin2025glus,
title={GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation},
author={Lin, Lang and Yu, Xueyang and Pang, Ziqi and Wang, Yu-Xiong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
Acknowledgement
We thank the contributors to the following open-source projects. Our project would be impossible without the inspiration from these excellent researchers.