News

Thank you for your valuable contribution to the research community. It's truly benchmarks like yours that drive the improvement of LLMs! I would appreciate learning more about your methodology for ...
I saw this in a RISC-V video (the latest one from ExplainingComputers; someone linked it in the RISC-V thread). Similar to Kraken, it's a browser-based ray tracer written in JavaScript ...
We present a multi-way parallel corpus of Math Word Problems (MWPs) in nine languages, including six low-resource languages. To date, this is the largest multilingual MWP dataset available. We utilize ...
ChatGPT 4.1 is now rolling out, and it's a significant leap from GPT-4o, but it fails to beat the benchmark set by Google's most powerful model, Gemini.
Every benchmark has its merit, and ARC-AGI is a promising step in that broader conversation.
OpenAI launches BrowseComp, a highly challenging benchmark to measure AI web search capabilities. This article, originally posted in Japanese at 13:41 on Apr 11, 2025, may contain some machine ...
A visualization website for comparing the performance of various LLMs across different context window sizes based on the Fiction.LiveBench benchmark.
Similarweb (NYSE:SMWB), a leading digital market intelligence company, and HypeAuditor, an influencer marketing platform for brands and agencies, today relea ...