eoe
2 followers · 81 following
AI & ML interests
None yet
Recent Activity
reacted to anakin87's post with ❤️ about 16 hours ago
How does LLM training with RL Environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against the ground truth
- the reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env ❌⭕ It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

What happens at training? We use Group Relative Policy Optimization with a tic-tac-toe env. No critic model is needed; the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
Repeat

For a deep dive, check out https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
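A minimal sketch of the group-relative advantage computation described in the post above, under a toy verifiable-reward setup. The names `toy_policy`, `verifiable_reward`, and the example question are hypothetical stand-ins for the model's sampler and the environment's answer checker; this is not the linked course's implementation.

```python
# Minimal GRPO-style advantage sketch with a verifiable reward.
# All names here (toy_policy, verifiable_reward, the example question)
# are illustrative stand-ins, not part of any real library or the linked course.
import random

def verifiable_reward(answer: str, ground_truth: str) -> float:
    """Deterministic reward: 1.0 if the answer matches the ground truth, else 0.0."""
    return 1.0 if answer.strip() == ground_truth else 0.0

def toy_policy(question: str) -> str:
    """Stand-in for LLM sampling: returns one candidate answer at random."""
    return random.choice(["4", "5", "four"])

def grpo_advantages(question: str, ground_truth: str, group_size: int = 8):
    # 1) Rollout generation: sample N completions for the same prompt.
    rollouts = [toy_policy(question) for _ in range(group_size)]
    # 2) Score each rollout with the deterministic, verifiable reward.
    rewards = [verifiable_reward(r, ground_truth) for r in rollouts]
    # 3) The group mean is the baseline -- no critic model needed.
    baseline = sum(rewards) / len(rewards)
    # 4) Advantage = rollout score minus the group mean.
    advantages = [r - baseline for r in rewards]
    # 5) A real trainer would now update the policy to favor rollouts
    #    with positive advantage (e.g. via a clipped policy-gradient loss).
    return list(zip(rollouts, rewards, advantages))

if __name__ == "__main__":
    for rollout, reward, adv in grpo_advantages("What is 2 + 2?", "4"):
        print(f"answer={rollout!r:8} reward={reward:.1f} advantage={adv:+.2f}")
```

Many GRPO implementations also divide each advantage by the group's reward standard deviation to normalize its scale, and a game environment like tic-tac-toe would replace the single answer check with per-episode rewards; the linked course covers the full training loop.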
liked a model 8 days ago
DunnBC22/vit-base-patch16-224-in21k_Human_Activity_Recognition
commented on a paper 3 months ago
Shallow-π: Knowledge Distillation for Flow-based VLAs
Organizations
None yet
models
3
eoe/Qwen2-0.5B-Instruct-Q4_0-GGUF
Text Generation • 0.5B • Updated Nov 4, 2024 • 7
eoe/Qwen2-0.5B-Instruct-Q2_K-GGUF
Text Generation • 0.5B • Updated Nov 4, 2024 • 2
eoe/mobilenetv2
Updated Jul 23, 2023
datasets
0
None public yet