News
Thank you for your valuable contribution to the research community. It's truly benchmarks like yours that drive the improvement of LLMs! I would appreciate learning more about your methodology for ...
I saw this in a RISC-V video (the latest one from ExplainingComputers; someone linked it in the RISC-V thread). Similar to Kraken, it's a browser-based ray tracer written in JavaScript ...
We present a multi-way parallel corpus of Math Word Problems (MWPs) in nine languages, including six low-resource languages. To date, this is the largest multilingual MWP dataset available. We utilize ...
ChatGPT 4.1 is now rolling out, and it's a significant leap from GPT 4o, but it fails to beat the benchmark set by Google's most powerful model, Gemini.
Every benchmark has its merit, and ARC-AGI is a promising step in that broader conversation.
OpenAI launches BrowseComp, a highly challenging benchmark to measure AI web search capabilities. This article, originally posted in Japanese at 13:41 on Apr 11, 2025, may contain some machine ...
A visualization website for comparing the performance of various LLMs across different context window sizes based on the Fiction.LiveBench benchmark.
Similarweb (NYSE:SMWB), a leading digital market intelligence company, and HypeAuditor, an influencer marketing platform for brands and agencies, today relea ...