view article Article How to Train Your LLM Web Agent: A Statistical Diagnosis By ppEmiliano • Jul 8 • 14
view article Article DABStep: Data Agent Benchmark for Multi-step Reasoning By eggie5 and 5 others • Feb 4 • 102
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks Paper • 2407.05291 • Published Jul 7, 2024 • 2
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content Paper • 2406.11811 • Published Jun 17, 2024 • 16