TurkingBench: A Challenge Benchmark for Web Agents - NAACL 2025
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
This conference talk introduces TurkingBench, a benchmark for evaluating how well multi-modal AI models can perform complex web-based tasks. Discover how researchers from Johns Hopkins University's Center for Language & Speech Processing built the benchmark from natural HTML pages originally designed for crowdsourcing workers, rather than from artificially synthesized web pages. Learn about the benchmark's composition of 32.2K instances across 158 tasks, and its evaluation framework, which connects chatbot responses to specific web-page actions such as modifying text boxes and selecting radio buttons. Explore the performance of cutting-edge models including GPT-4 and InternVL, which outperform random chance but still leave significant room for improvement in web-based agents.
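To make the action-grounding idea concrete, here is a minimal Python sketch of how a model's free-text response might be mapped onto page actions and scored against gold actions. The WebAction class, the "element_id: kind: value" line format, and the score function are illustrative assumptions for this sketch, not TurkingBench's actual schema or evaluation code.

```python
# Hypothetical sketch (not TurkingBench's real API): parse a chatbot's
# response into concrete web-page actions and score them against gold.
from dataclasses import dataclass


@dataclass(frozen=True)
class WebAction:
    """One modification to a form element on the task's HTML page."""
    element_id: str  # id of the <input>/<textarea> being modified
    kind: str        # "text" for text boxes, "radio" for radio buttons
    value: str       # text to enter, or the radio option to select


def parse_model_response(response: str) -> list[WebAction]:
    """Parse lines like 'q1: radio: yes' into WebAction objects.

    The 'element_id: kind: value' line format is an assumption made
    for this illustration only.
    """
    actions = []
    for line in response.strip().splitlines():
        parts = [p.strip() for p in line.split(":", 2)]
        if len(parts) == 3 and parts[1] in ("text", "radio"):
            actions.append(WebAction(*parts))
    return actions


def score(predicted: list[WebAction], gold: list[WebAction]) -> float:
    """Fraction of gold actions the model reproduced exactly."""
    gold_set = set(gold)
    hits = sum(action in gold_set for action in predicted)
    return hits / len(gold_set) if gold_set else 0.0


if __name__ == "__main__":
    gold = [WebAction("q1", "radio", "yes"), WebAction("q2", "text", "Paris")]
    pred = parse_model_response("q1: radio: yes\nq2: text: London")
    print(score(pred, gold))  # 0.5: one of the two gold actions matches
```

Exact-match scoring per action is a simplification; the benchmark's actual metrics may weight or aggregate actions differently.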
Syllabus
TurkingBench: A Challenge Benchmark for Web Agents - NAACL 2025
Taught by
Center for Language & Speech Processing (CLSP), JHU