Instructions to use cckevinn/SeeClick with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use cckevinn/SeeClick with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cckevinn/SeeClick", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("cckevinn/SeeClick", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use cckevinn/SeeClick with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "cckevinn/SeeClick"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "cckevinn/SeeClick",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/cckevinn/SeeClick
```
- SGLang
How to use cckevinn/SeeClick with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "cckevinn/SeeClick" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "cckevinn/SeeClick",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "cckevinn/SeeClick" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "cckevinn/SeeClick",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use cckevinn/SeeClick with Docker Model Runner:
```shell
docker model run hf.co/cckevinn/SeeClick
```
Clarification about the different models
Hey Kanzhi Cheng,
I am a researcher trying to use your models for UI automation tasks. I wanted to first say that your model is really helpful and impressive.
Can you please help me to clarify the different models found in your account?
- What is their difference (base/mind2web/aitw/miniwob)?
- What was each of them trained on (pre-training only? pre-training plus training on the dataset in its name?)
- Are the models improvements over each other, or was each trained on different data?
- Were all of them fine-tuned on ScreenSpot?
Thank you
Hi, sorry for the slightly late reply.
- SeeClick (base) is a model obtained through our proposed GUI grounding pre-training, giving it general GUI localization capability. The mind2web, aitw, and miniwob models are based on SeeClick-base, fine-tuned on three different downstream tasks (web, Android, and simplified web), respectively, enabling them to perform tasks within GUI environments.
- SeeClick-base is pre-training only; the mind2web/aitw/miniwob models start from SeeClick-base and are fine-tuned on their respective downstream tasks.
- No model was fine-tuned on ScreenSpot, nor should it be. This is because ScreenSpot serves as an evaluation benchmark for testing zero-shot GUI grounding capabilities. Our paper reports the results of SeeClick-base on ScreenSpot.
For more details, please check our paper and GitHub: https://huggingface.co/papers/2401.10935.
If you have any other questions, don't hesitate to ask.
Thank you very much!
This is very helpful.
So just to clarify, the mind2web/aitw/miniwob models are the versions of the base model trained on the respective training datasets of each one?
Moreover, I wanted to ask whether it is possible for the SeeClick model to output more than one candidate?
Thanks again!
the mind2web/aitw/miniwob models are the versions of the base model trained on the respective training datasets of each one?
Yes. In fact, these models are uploaded so that one can reproduce the agent-task results from the paper, or apply them directly to similar GUI agent scenarios.
output more than 1 candidate?
The current model can only generate multiple candidates through decoding methods such as beam search; for those, you can refer to the original Qwen-VL repo. But I think continually fine-tuning SeeClick with a small amount of data may give it the ability to generate multiple candidates.
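To illustrate the beam-search suggestion, here is a toy, self-contained sketch of how beam search keeps several candidate sequences alive and returns the top ones. The token names and probabilities are made up purely for illustration; this is not the actual Qwen-VL decoder. With Transformers models, the equivalent is typically passing `num_beams` and `num_return_sequences` to `model.generate`.

```python
from math import log

# Toy next-token distributions: given the last token, the probability of each
# possible next token. These numbers are invented for this sketch.
probs = {
    "<s>": {"click": 0.6, "type": 0.4},
    "click": {"(10,20)": 0.7, "(30,40)": 0.3},
    "type": {"hello": 0.9, "bye": 0.1},
    "(10,20)": {"</s>": 1.0},
    "(30,40)": {"</s>": 1.0},
    "hello": {"</s>": 1.0},
    "bye": {"</s>": 1.0},
}

def beam_search(start, num_beams=3, max_len=4):
    """Keep the `num_beams` highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # list of (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1]
            if last == "</s>":           # finished sequences carry over unchanged
                candidates.append((tokens, score))
                continue
            for tok, p in probs[last].items():
                candidates.append((tokens + [tok], score + log(p)))
        # Prune to the best `num_beams` hypotheses before the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

for tokens, score in beam_search("<s>"):
    print(" ".join(tokens[1:]), round(score, 3))
```

With `num_beams=3` this returns three distinct completions ranked by log-probability, which is the sense in which beam search yields "more than one candidate" from a single prompt.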