The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical ...
AI life science benchmark LifeSciBench, published June 17 by OpenAI with 173 PhD scientists, shows frontier models clear only ...