Benchmarking Hierarchy Diagram

News

New 1.5B router model achieves 93% accuracy without costly retraining

Katanemo Labs' new LLM routing framework aligns with human preferences and adapts to new models without retraining.

This benchmark used Reddit’s AITA to test how much AI models suck up ...

The new benchmark, called Elephant, makes it easier to spot when AI models are being overly sycophantic—but there’s no current fix.

SiliconANGLE · 1mon

LMArena raises $100M at $600M valuation to expand AI benchmarking ...

LMArena, the company behind artificial intelligence testing service Chatbot Arena, has raised $100 million in initial funding, marking one of the largest seed rounds in the AI sector to date ...

MIT Technology Review · 2mon

How to build a better AI benchmark | MIT Technology Review

But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

TechCrunch · 3mon

AI benchmarking platform Chatbot Arena forms a new company

Chatbot Arena, the crowdsourced AI benchmarking project, is forming a company called Arena Intelligence Inc., reports Bloomberg.

insideHPC · 3mon

Alice & Bob Selected by DARPA for the Quantum Benchmarking Initiative

Boston and Paris – April 3, 2025 – Fault-tolerant quantum computing company Alice & Bob today announces its selection as a performer in the U.S. Defense Advanced Research Projects Agency’s (DARPA) ...

CIO · 4mon

LLM benchmarking: How to find the right AI model - CIO

Benchmarks can be used to put large language models to the test. Read on for some tips on how to do it right.

TechCrunch · 4mon

Anthropic used Pokémon to benchmark its newest AI model

Anthropic used Pokémon to benchmark its newest AI model. Yes, really. In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic ...

Streaming Media Magazine · 4mon

Meta's David Ronca Talks Benchmarking and Deploying AV1 in the Android ...

Ronca goes on to explain that in the course of benchmarking AV1 integration into Android, Meta has developed VCAT (Video Codec Acid Tests), a new tool for benchmarking hardware and software decoders ...

Health Affairs · 6mon

Improving CMS Financial Benchmarking: Lessons Learned By The Innovation ...

The Innovation Center is committed to an ongoing cycle of designing, refining, and testing new benchmarking methodologies, particularly as we learn from ongoing model tests. This Forefront ...
