Instructions for using ReCAP-Agent/ReCAP-32B with libraries and local apps.

Libraries

How to use ReCAP-Agent/ReCAP-32B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ReCAP-Agent/ReCAP-32B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so it is visible when run as a script, not only in a notebook
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ReCAP-Agent/ReCAP-32B")
model = AutoModelForImageTextToText.from_pretrained("ReCAP-Agent/ReCAP-32B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
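
Because the base model is a "Thinking" variant, the decoded string will likely contain the model's reasoning followed by its final answer. The sketch below separates the two. It assumes the reasoning is closed by a `</think>` tag, as in the Qwen3 thinking convention; that tag and the helper name `split_thinking` are assumptions, not something this model card documents. With a small `max_new_tokens`, generation may stop before the tag appears, in which case the helper returns the whole string as the answer.

```python
def split_thinking(decoded: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumption: the reasoning section ends with `end_tag`. If the tag is
    missing, the whole string is returned as the answer.
    """
    head, sep, tail = decoded.rpartition(end_tag)
    if not sep:
        # No closing tag found: treat everything as the answer.
        return "", decoded.strip()
    # Drop an opening tag if the decoded text happens to include one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

For example, `split_thinking(processor.decode(...))` can be used in place of printing the raw decoded string.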

Local Apps

vLLM

How to use ReCAP-Agent/ReCAP-32B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ReCAP-Agent/ReCAP-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ReCAP-Agent/ReCAP-32B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
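
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `post_chat` are hypothetical, and the payload simply mirrors the JSON body of the curl call above. It assumes the server started by `vllm serve` is listening on `http://localhost:8000`.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the chat-completions body used in the curl example."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 300.0) -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires a running server):
# payload = build_chat_payload(
#     "ReCAP-Agent/ReCAP-32B",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before it is sent.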

Use Docker

docker model run hf.co/ReCAP-Agent/ReCAP-32B

SGLang

How to use ReCAP-Agent/ReCAP-32B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ReCAP-Agent/ReCAP-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ReCAP-Agent/ReCAP-32B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ReCAP-Agent/ReCAP-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ReCAP-Agent/ReCAP-32B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
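
Both servers return an OpenAI-style chat-completion object. As a rough sketch, the helper below pulls the reply text and the finish reason out of an already-parsed response; `extract_reply` is a hypothetical name, and the field layout (`choices[0].message.content`, `finish_reason`) is assumed to follow the OpenAI chat-completions schema that these endpoints aim to be compatible with.

```python
def extract_reply(response: dict) -> tuple[str, str]:
    """Return (content, finish_reason) from a parsed chat-completion response.

    Raises ValueError if the response has no choices, for example when the
    server returned an error object instead of a completion.
    """
    choices = response.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {response!r}")
    first = choices[0]
    content = first.get("message", {}).get("content") or ""
    finish_reason = first.get("finish_reason") or ""
    return content, finish_reason
```

A `finish_reason` of `"length"` indicates the reply was cut off by the token limit, which is worth checking before trusting a short answer.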

Docker Model Runner
How to use ReCAP-Agent/ReCAP-32B with Docker Model Runner:
```
docker model run hf.co/ReCAP-Agent/ReCAP-32B
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

ReCAP-32B

ReCAP-32B is a vision-language model fine-tuned from Qwen/Qwen3-VL-32B-Thinking, designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.

This model is introduced in “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”.

🚀 Overview

ReCAP-32B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.

It operates end-to-end:

Input: raw screenshots
Output: reasoning + executable GUI actions (click, type, drag)

✨ Key Features

Unified agent: Handles both CAPTCHA and general GUI tasks
Reasoning-action modeling: Learns both decisions and execution
Self-correction: Improves robustness by learning from failures
Efficient interaction: Generates multiple actions per step

🧠 Capabilities

Supports diverse CAPTCHA types:

Text / OCR
Icon selection & matching
Image grid reasoning
Slider / drag tasks
Multi-step interaction challenges

Core skills:

Visual understanding
Spatial reasoning
Continuous control
Multi-step planning

📊 Performance

~81.0% success rate on a synthetic CAPTCHA benchmark
Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
Maintains strong performance on general GUI benchmarks

🔒 Ethical Considerations

This model is released for research purposes only.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.


Safetensors

Model size: 33B params
Tensor type: BF16
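
As a back-of-the-envelope check on hardware requirements, the weight footprint follows from the figures above: BF16 stores each parameter in 2 bytes, so 33B parameters need roughly 66 GB for weights alone. The sketch below does that arithmetic; `weight_memory_gib` is a hypothetical helper, and the estimate deliberately ignores activations, the KV cache, and framework overhead, all of which add to the real requirement.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# 33B parameters in BF16 (2 bytes each)
print(f"{weight_memory_gib(33e9):.1f} GiB")  # about 61.5 GiB
```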

Model tree for ReCAP-Agent/ReCAP-32B

Base model: Qwen/Qwen3-VL-32B-Thinking
Finetuned (9): this model

Collection including ReCAP-Agent/ReCAP-32B

ReCAP Agent (collection): ReCAP is a framework for training and evaluating CAPTCHA-capable GUI agents using dynamic tasks, benchmarks, and unified evaluation. 3 items, updated Mar 23.