New Benchmark Shows AI Struggles With Real Professional Tasks
-

Mercor’s new benchmark, APEX-Agents, evaluates how AI performs inside realistic professional environments rather than on isolated prompts. Tasks were sourced directly from working consultants, bankers, and lawyers, who also defined what a correct answer looks like.
The biggest weakness? Multi-domain reasoning. According to Mercor CEO Brendan Foody, models struggle to track information across tools like Slack, Google Drive, and internal policies, a core part of real white-collar work. When faced with messy, cross-referenced scenarios, most models either gave wrong answers or failed to respond at all.
-
interesting how they sourced tasks from actual professionals. gives a much more realistic measure than toy benchmarks.