# ViX-Ray — Fine-tuned Medical Vision-Language Models
Fine-tuned weights for Vietnamese chest X-ray report generation across 3 clinical tasks and 6 model architectures.

Best overall performance: Qwen2-VL-7B (⭐ in the tables below) across all 3 tasks.
## Tasks

| # | Task | Description |
|---|---|---|
| 1 | `finding` | Generate radiology findings from a chest X-ray image |
| 2 | `impression` | Generate the clinical impression (final diagnosis) from a chest X-ray image |
| 3 | `multi` | Multi-turn dialogue — findings → impression via conversation history |
## Models

| Key | Base model | Size |
|---|---|---|
| `Intern` | InternVL2.5-1B | 1B |
| `Vintern` | Vintern-1B-v3.5 | 1B |
| `Qwen2B` | Qwen2-VL-2B-Instruct | 2B |
| `Qwen7B` | Qwen2-VL-7B-Instruct ⭐ | 7B |
| `MiniCPM` | MiniCPM-V-2_6 | 8B |
| `LaVy` | LaVy-Instruct | 7B |
## Quick Start

### 1. Install

```bash
pip install huggingface_hub transformers torch torchvision pillow
```

For Qwen models, also install:

```bash
pip install qwen-vl-utils
```

For Intern / Vintern models, also install:

```bash
pip install decord
```
For MiniCPM, pin versions (these pins predate Qwen2-VL support in `transformers`, so keep the MiniCPM stack in a separate environment if you also run the Qwen models):

```bash
pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99 decord
```
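Before downloading anything, you can confirm the base environment imports cleanly (a hypothetical sanity check, not part of the release):

```python
# Print the version of each core dependency, or fail loudly if one is missing
import importlib

for pkg in ("huggingface_hub", "transformers", "torch", "torchvision", "PIL"):
    mod = importlib.import_module(pkg)
    print(pkg, getattr(mod, "__version__", "unknown"))
```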
### 2. Download a model zip

```bash
# task : finding | impression | multi
# model : Intern | Vintern | Qwen2B | Qwen7B | MiniCPM | LaVy
huggingface-cli download presencesw/ViX-Ray <task>/<Model>.zip \
    --repo-type model --local-dir ./
```

Example — download the best model for `finding`:

```bash
huggingface-cli download presencesw/ViX-Ray finding/Qwen7B.zip \
    --repo-type model --local-dir ./
```

Download all models at once:

```bash
huggingface-cli download presencesw/ViX-Ray \
    --repo-type model --local-dir ./vix_ray_models
```
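The same downloads also work from Python via `huggingface_hub` (a minimal sketch; the file names follow the zip paths in the table below):

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Single zip, equivalent to the CLI example above
zip_path = hf_hub_download(
    repo_id="presencesw/ViX-Ray",
    filename="finding/Qwen7B.zip",
    local_dir="./",
)

# Entire repository, equivalent to the download-all example
snapshot_download(repo_id="presencesw/ViX-Ray", local_dir="./vix_ray_models")
```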
### 3. Unzip

```bash
unzip <task>/<Model>.zip -d ./models/<task>/
# result: ./models/<task>/<Model>/
```

Or in Python:

```python
import zipfile

with zipfile.ZipFile("<task>/<Model>.zip") as zf:
    zf.extractall("./models/<task>/")
```
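For scripted setups, the download and unzip steps can be combined (a sketch; the `fetch_model` helper is illustrative, not part of this repo):

```python
import zipfile
from pathlib import Path

from huggingface_hub import hf_hub_download

def fetch_model(task: str, model: str, root: str = "./models") -> Path:
    """Download <task>/<model>.zip from the Hub and extract it under <root>/<task>/."""
    zip_path = hf_hub_download(repo_id="presencesw/ViX-Ray",
                               filename=f"{task}/{model}.zip")
    target = Path(root) / task
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target / model

model_path = fetch_model("finding", "Qwen7B")  # ./models/finding/Qwen7B
```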
### 4. Load & infer

Set `model_path = "./models/<task>/<Model>"`, then use the snippet for your model family.
#### Qwen2-VL (Qwen2B / Qwen7B)

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "./models/<task>/<Model>"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            # "Describe this chest X-ray image."
            {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the generated report is decoded
generated_ids_trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
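If GPU memory is tight, the Qwen2-VL processor accepts `min_pixels` / `max_pixels` bounds on the visual token budget; a sketch with the illustrative values from the upstream Qwen2-VL docs:

```python
# Bound the number of visual tokens per image to trade resolution for memory
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path, min_pixels=min_pixels, max_pixels=max_pixels
)
```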
#### InternVL / Vintern (Intern / Vintern)

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

model_path = "./models/<task>/<Model>"
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # set False if flash-attn is not installed
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# ImageNet normalization, single 448x448 tile
MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])
pixel_values = transform(Image.open("your_image.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()

response = model.chat(tokenizer, pixel_values, "<image>\nMô tả hình ảnh X-quang ngực này.",
                      dict(max_new_tokens=512, do_sample=True))
print(response)
```
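For the `multi` task, the InternVL `chat` interface can carry the conversation itself; a sketch, assuming the fine-tuned checkpoints keep the upstream `history` / `return_history` arguments:

```python
gen_cfg = dict(max_new_tokens=512, do_sample=True)

# Turn 1: findings (start a fresh conversation)
findings, history = model.chat(tokenizer, pixel_values,
                               "<image>\nMô tả hình ảnh X-quang ngực này.",
                               gen_cfg, history=None, return_history=True)

# Turn 2: impression, conditioned on the first turn ("What is the diagnosis?")
impression, history = model.chat(tokenizer, pixel_values, "Kết luận bệnh gì?",
                                 gen_cfg, history=history, return_history=True)
```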
#### MiniCPM-V

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "./models/<task>/<Model>"
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    attn_implementation="sdpa", torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("your_image.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Mô tả hình ảnh X-quang ngực này."]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```
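MiniCPM-V handles multi-turn chat by appending the previous answer to `msgs`, following the upstream MiniCPM-V-2_6 usage (a sketch):

```python
# Turn 1: findings
findings = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)

# Turn 2: impression, with the first answer appended as an assistant turn
msgs.append({"role": "assistant", "content": [findings]})
msgs.append({"role": "user", "content": ["Kết luận bệnh gì?"]})
impression = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
```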
#### LaVy

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "./models/<task>/<Model>"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

inputs = processor(
    images=Image.open("your_image.jpg").convert("RGB"),
    text="Mô tả hình ảnh X-quang ngực này.",
    return_tensors="pt"
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
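Decoding above is greedy by default; for report generation you may want to make the strategy explicit (standard `generate` kwargs; the values are illustrative):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,  # deterministic output for reproducible reports
    num_beams=3,      # illustrative: small beam search can improve fluency
)
```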
## Multi-turn (Task 3)

For the `multi` task, pass conversation history between turns:

```python
# Turn 1 — findings
response1 = ...  # run inference as above

# Turn 2 — impression: append the assistant turn, then ask "What is the diagnosis?"
messages.append({"role": "assistant", "content": [{"type": "text", "text": response1}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]})
response2 = ...  # run inference again with the updated messages
```
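A concrete two-turn sketch for the Qwen2-VL family, reusing `model` and `processor` from step 4 (the `qwen_generate` helper is illustrative, not part of the repo):

```python
from qwen_vl_utils import process_vision_info

def qwen_generate(messages):
    """Run one generation step for the current conversation state."""
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=512)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

messages = [{"role": "user", "content": [
    {"type": "image", "image": "your_image.jpg"},
    {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."},
]}]
findings = qwen_generate(messages)  # turn 1: findings
messages.append({"role": "assistant", "content": [{"type": "text", "text": findings}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]})
impression = qwen_generate(messages)  # turn 2: impression
```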
See `readme/<task>_<Model>.md` for the full per-model multi-turn example.
## Full Model Table
| Task | Model | Base | Zip path |
|---|---|---|---|
| finding | Intern | InternVL2.5-1B | finding/Intern.zip |
| finding | Vintern | Vintern-1B-v3.5 | finding/Vintern.zip |
| finding | Qwen2B | Qwen2-VL-2B | finding/Qwen2B.zip |
| finding | Qwen7B ⭐ | Qwen2-VL-7B | finding/Qwen7B.zip |
| finding | MiniCPM | MiniCPM-V-2_6 | finding/MiniCPM.zip |
| finding | LaVy | LaVy-Instruct | finding/LaVy.zip |
| impression | Intern | InternVL2.5-1B | impression/Intern.zip |
| impression | Vintern | Vintern-1B-v3.5 | impression/Vintern.zip |
| impression | Qwen2B | Qwen2-VL-2B | impression/Qwen2B.zip |
| impression | Qwen7B ⭐ | Qwen2-VL-7B | impression/Qwen7B.zip |
| impression | MiniCPM | MiniCPM-V-2_6 | impression/MiniCPM.zip |
| impression | LaVy | LaVy-Instruct | impression/LaVy.zip |
| multi | Intern | InternVL2.5-1B | multi/Intern.zip |
| multi | Vintern | Vintern-1B-v3.5 | multi/Vintern.zip |
| multi | Qwen2B | Qwen2-VL-2B | multi/Qwen2B.zip |
| multi | Qwen7B ⭐ | Qwen2-VL-7B | multi/Qwen7B.zip |
| multi | MiniCPM | MiniCPM-V-2_6 | multi/MiniCPM.zip |
| multi | LaVy | LaVy-Instruct | multi/LaVy.zip |
Per-model details (installation, full inference code) are in `readme/<task>_<Model>.md`.
## Citation

If you use these models or the ViX-Ray dataset in your research, please cite:

```bibtex
@article{nguyen2026vix,
  title={ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models},
  author={Nguyen, Duy Vu Minh and Truong, Chinh Thanh and Tran, Phuc Hoang and Le, Hung Tuan and Dat, Nguyen Van-Thanh and Pham, Trung Hieu and Van Nguyen, Kiet},
  journal={arXiv preprint arXiv:2603.15513},
  year={2026}
}
```