TESS-Computer/tess-agentnet
TESS is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural language instruction, it predicts either a mouse action (click coordinates and click type) or a keyboard action (typing text or pressing shortcuts).
```python
import torch
from PIL import Image

# Clone the TESS repo first:
#   git clone https://github.com/husseinlezzaik/TESS.git
#   cd TESS/model
from test_checkpoint import load_model, predict

# Load model
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")
print(result)
# Mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```
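The predicted `xy` coordinates are normalized to the image, so they need to be scaled to the target display before clicking. A minimal sketch (the `to_pixels` helper and the 1920x1080 screen size are illustrative, not part of the TESS API):

```python
def to_pixels(xy, screen_w, screen_h):
    """Map normalized (0-1) model coordinates to integer screen pixels."""
    x, y = xy
    return round(x * screen_w), round(y * screen_h)

# Example: the mouse action above on a 1920x1080 display
px, py = to_pixels([0.45, 0.32], 1920, 1080)
print(px, py)  # 864 346
```

The resulting pixel pair can then be passed to whatever automation backend drives the mouse.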
Mouse actions:

```python
{
    'action_type': 'mouse',
    'xy': [x, y],                # normalized coordinates in [0, 1]
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

Keyboard actions:

```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```
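A consumer of these predictions branches on `action_type`. A small illustrative dispatcher (not part of the TESS API) that renders either schema as a readable string:

```python
def describe_action(result):
    """Render a predicted action dict (mouse or keyboard schema) as text."""
    if result["action_type"] == "mouse":
        x, y = result["xy"]
        return f"{result['click_type']} at ({x:.2f}, {y:.2f})"
    if result["action_type"] == "keyboard":
        return f"{result['action']}: {result['value']}"
    raise ValueError(f"unknown action_type: {result['action_type']!r}")

print(describe_action({"action_type": "mouse", "xy": [0.45, 0.32],
                       "click_type": "LEFT_CLICK"}))
# LEFT_CLICK at (0.45, 0.32)
print(describe_action({"action_type": "keyboard", "action": "type",
                       "value": "hello world"}))
# type: hello world
```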
```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                     │
                                   ┌─────────────────┴────────────────┐
                                   │                                  │
                             Mouse Branch                     Keyboard Branch
                          (XY + Click heads)              (VLM text generation)
```
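The routing above can be sketched in PyTorch. The hidden size, number of click types, and exact head layout here are assumptions for illustration, not the actual TESS implementation:

```python
import torch
import torch.nn as nn

class RouterSketch(nn.Module):
    """Illustrative shared-MLP + router over mouse/keyboard branches."""

    def __init__(self, hidden=512, num_click_types=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.router = nn.Linear(hidden, 2)                  # 0 = mouse, 1 = keyboard
        self.xy_head = nn.Linear(hidden, 2)                 # normalized (x, y)
        self.click_head = nn.Linear(hidden, num_click_types)

    def forward(self, feats):
        h = self.shared(feats)
        branch = self.router(h).argmax(-1)                  # which branch to take
        xy = torch.sigmoid(self.xy_head(h))                 # squashed into [0, 1]
        click = self.click_head(h).argmax(-1)               # click-type index
        return branch, xy, click

m = RouterSketch()
branch, xy, click = m(torch.randn(1, 512))                  # one VLM feature vector
```

In the real model the keyboard branch is the VLM's own text generation rather than a classification head; only the mouse branch needs dedicated XY and click heads.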
License: Apache 2.0
```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```
Base model: HuggingFaceTB/SmolLM2-360M