New Benchmark Shows AI Struggles With Real Professional Tasks
-

Mercor’s new benchmark, APEX-Agents, evaluates how AI performs inside realistic professional environments rather than on isolated prompts. Tasks were sourced directly from working consultants, bankers, and lawyers, who also defined what a correct answer looks like.
The biggest weakness? Multi-domain reasoning. According to Mercor CEO Brendan Foody, models struggle to track information across tools like Slack, Google Drive, and internal policies, a core part of real white-collar work. When faced with messy, cross-referenced scenarios, most models either gave wrong answers or failed to respond at all.
-
interesting how they sourced tasks from actual professionals. gives a much more realistic measure than toy benchmarks.