Sleeping Agents Lost-in-Thought Benchmark 🧠 Run a benchmark to see how reasoning steps affect retrieval accuracy
Running Agents Master Key Capability Demo: Show expected accuracy boost for a math problem via steering
Running Agents Agentic World Model Explorer: Explore world model levels, laws, and rollouts interactively
Runtime error Agents COMPASS-Inspired Semantic Sampling for Sudanese Arabic Dialect Understanding 🎯
Sleeping Agents CoT Spatial Reasoning Degradation 🧠 Show how step-by-step prompts affect visual puzzle answers
Running Agents CoT Spatial Reasoning Degradation: Generate spatial puzzles and compare direct vs CoT reasoning
Sleeping Agents Weak Supervision Reasoning Explorer 🔬 Explore reasoning performance under weak supervision
Sleeping Agents Sudanese Arabic Navigable RAG Demo 🧠 Compare Sudanese Arabic phrase retrieval methods
Sleeping Agents Interleaved Retrieval-Reasoning Benchmark: Compare Standard vs Interleaved RAG with simulated benchmarks
Running Agents Agent Architecture Visualizer: Simulate and visualize AI agent loops with permissions
Sleeping Agents TESSY Reasoning Demo - Sudanese Arabic 🧠 Analyze Sudanese Arabic samples with standard vs TESSY reasoning
Paused Agents Sudanese Arabic SWE-AGILE Reasoning Benchmark 🧠 Run Sudanese Arabic reasoning benchmark with context strategies