Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference Paper • 2505.13770 • Published Mar 4 • 1
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science Paper • 2603.19005 • Published Mar 19 • 6
Can Agentic AI Match the Performance of Human Data Scientists? Paper • 2512.20959 • Published Dec 24, 2025 • 1
Article Jupyter Agents: training LLMs to reason with notebooks +1 baptistecolle, hannayukhymenko, lvwerra • Sep 10, 2025 • 64
Article DABStep: Data Agent Benchmark for Multi-step Reasoning +5 eggie5, martinigoyanes, frisokingma, andreumora, lvwerra, thomwolf, m-ric • Feb 4, 2025 • 128
An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems Paper • 2505.18397 • Published May 23, 2025 • 1
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science Paper • 2506.13992 • Published May 25, 2025 • 1