
New Benchmark Shows AI Struggles With Real Professional Tasks

  • madmax wrote (#1):

    Mercor’s new benchmark, called APEX-Agents, evaluates how AI performs inside realistic professional environments rather than isolated prompts. Tasks were sourced directly from real consultants, bankers and lawyers, who also defined what a correct answer looks like.

    The biggest weakness? Multi-domain reasoning. According to Mercor CEO Brendan Foody, models struggle to track information across tools and sources like Slack, Google Drive and internal policies, which is a core part of real white-collar work. When faced with messy, cross-referenced scenarios, most models either gave wrong answers or failed to respond at all.

    • tradelikepro wrote (#2):

      Interesting how they sourced tasks from actual professionals. That gives a much more realistic measure than toy benchmarks.
