AI Benchmarks for Code

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Kimi K2.7-Code claims 30% fewer thinking tokens and a drop-in API swap path, but independent benchmarks show kernel ...

Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks

The persistent memory system addresses a real and widely felt pain point in agentic development workflows — one that ...

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark ...

NewsBytes

Xiaomi says its new AI coding model beats Claude Code

Xiaomi has launched MiMo Code v0.1.0, an open-source AI coding tool that reportedly outperforms Anthropic's Claude Code on ...

10d

Gomboc AI Publishes First Open Benchmark for AI Code Remediation

15 cloud scenarios. 43 merge-ready fixes. 100% loop closure. 12 minutes and $17 to author once; seconds and zero-cost ...

21h

Xiaomi’s latest AI coding tool claims to outperform Claude Code on complex tasks

The open-source AI coding assistant is designed for long-running software projects and, according to Xiaomi's own benchmarks ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people across the organization in examining workflows and other processes to ensure they derive the maximum value from their efforts while eliminating waste, of ...

How Anthropic’s Fable 5 Beat ChatGPT 5.5 by 20% in Coding Benchmarks

Anthropic has launched Claude Fable 5, a Mythos-class AI model that outperforms GPT 5.5 in coding and vision tasks despite ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

29d

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

22h on MSN

The AI PC era has a benchmarking problem

I like numbers. Data feels sure. What could be better than measurable progress, a way of quantifying the world to stop ...
