Cohere open-sources a coding agent that runs on a single H100

Read full story on VentureBeat

Summary

<p>Engineering teams building agentic coding pipelines now have a concrete open-source alternative to managed models like <a href="https://venturebeat.com/technology/anthropic-brings-mythos-to-the-masses-with-claude-fable-5-its-most-powerful-generally-available-model-ever">Claude Fable 5</a>, one that runs on a single H100. The tradeoff: Cohere&#x27;s North Mini Code, which launched Tuesday, generated three times the output tokens of comparable models in independent testing, a verbosity cost that compounds in high-volume production workloads.</p><p>The new open-source model is a 30 billion parameter mixture-of-experts (MoE) model with 3 billion parameters active per token, built for agentic software engineering, including sub-agent orchestration, architecture mapping, code review and terminal work. The model supports a 256,000 token context window with a 64,000 token maximum generation length, and is available on <a href="https://huggingface.co/CohereLabs/North-Mini-Code-1.0">Hugging Face</a> under an Apache 2.0 license.</p><h2>What North Mini Code can do</h2><p>North Mini Code targets the full agentic coding stack. Here is what the model is built to do.</p><p><b>Software engineering.</b> Cohere built North Mini Code specifically for agentic software engineering rather than adapting it from a general-purpose base. It has integrated tool-use capabilities and supports interleaved thinking, which Cohere says improves performance across multi-step agentic work.</p><p><b>Architecture mapping and code review.</b> North Mini Code can analyze and map system architecture, surface dependencies and perform code review across large codebases. With a 256,000 token context window, it can hold substantial multi-file projects in a single context pass.</p><p><b>Terminal-based agentic tasks.</b> The model is trained for terminal environments, handling shell interactions, package scripts and command-line tooling. 
Cohere benchmarked it on Terminal-Bench v2, which tests agents in real terminal environments rather than synthetic code generation tasks.</p><h2>How it was built</h2><p>North Mini Code is a sparse mixture-of-experts model with 128 experts, of which 8 activate per token. Because only about 3 billion of its 30 billion parameters are active per token, per-token inference compute is closer to that of a 3 billion parameter model, although all 30 billion parameters still have to fit in memory. Nick Frosst, co-founder of Cohere, <a href="https://x.com/cohere/status/2064378058329526556">demoed it running on a Mac Studio</a> via MLX using around 20 gigabytes of RAM, the same machine he uses for his own local coding work.</p><p>Cohere trained the model through two stages of supervised fine-tuning followed by reinforcement learning with verifiable rewards across more than 70,000 verifiable tasks spanning approximately 5,000 repositories; the tasks were deduplicated against SWE-Bench.</p><p>Rather than optimizing against a single agent scaffold, Cohere trained across three. SWE-Agent uses a rich CLI with specialized commands. Mini-SWE-Agent uses a single bash tool with raw shell output. OpenCode uses individually typed tools returning structured JSON. Cohere reports a 10 percentage point gain on OpenCode evaluation from the multi-harness approach while maintaining SWE-Agent performance.</p><h2>Where it fits</h2><p>North Mini Code enters a market that now includes Mistral Devstral Small 2, GitHub Copilot, Cursor and Claude Fable 5, each with distinct cost and deployment tradeoffs.</p><p>Cohere&#x27;s primary benchmark comparison is against <a href="https://venturebeat.com/ai/mistral-launches-powerful-devstral-2-coding-model-including-open-source">Mistral Devstral Small 2</a>, a 24 billion parameter dense model. In its own tests on identical hardware configurations, Cohere reports 2.8x higher output throughput and a 30% inter-token latency advantage over Devstral Small 2. 
Cohere also claims, in its <a href="https://huggingface.co/blog/CohereLabs/introducing-north-mini-code">Hugging Face technical post</a>, that North Mini Code outperforms open-source models with up to four times its parameter count on its reported benchmarks, including models at 120 billion parameters.</p><p><a href="https://artificialanalysis.ai/models/north-mini-code">Artificial Analysis</a> independently ranks it eighth of 127 comparable open-weight models on output speed at 210 tokens per second, with a time to first token of 0.25 seconds against a class median of 1.95 seconds. It places 18th of 127 on the Artificial Analysis Intelligence Index. One flag from the same data: the model generated 75 million output tokens to complete the Intelligence Index, three times the class median of 25 million. In high-volume agentic pipelines, that verbosity compounds into inference cost and latency.</p><p>&quot;Suddenly people are thinking like hey, am I getting enough economic value out of the tokens from a model?&quot; Frosst said during the launch video. &quot;Local deployment is one way of empowering people and making AI really something that works for them.&quot;</p><p>GitHub Copilot, Cursor and Claude Code operate on per-usage or subscription pricing with no on-premises option. Anthropic&#x27;s Claude Fable 5, which Anthropic bills as its most powerful generally available model, is priced at $50 per million output tokens. For Frosst, North Mini Code is the polar opposite of Fable.</p><p>&quot;Its small, cost effective, apache 2.0, and locally deployable. This is the way LLMs should go. 
small, open source, transparent and sovereign, vs large, expensive, proprietary and hegemonic,&quot; Frosst wrote in a <a href="https://x.com/nickfrosst/status/2064396337404096809?s=20">post on X</a>.</p><h2>What this means for enterprises</h2><p>For teams building production agentic coding pipelines, North Mini Code&#x27;s release clarifies a set of decisions that have been forming for months.</p><p><b>Purpose-built agentic training is now a baseline to evaluate against.</b> The distinction between models fine-tuned for code and models trained specifically for agentic workflows, with verified tool calls and multi-harness robustness, is now a material factor in pipeline decisions. Any model vendor claiming agentic coding capability should be able to answer whether its training used verifiable agentic tasks or was adapted from a general-purpose base.</p><p><b>Verbosity is a hidden pipeline cost that headline benchmark rankings do not surface.</b> Artificial Analysis measured North Mini Code generating 75 million output tokens on its Intelligence Index, three times the class median. That verbosity compounds across inference cost and latency in high-volume pipelines. Measuring output tokens per task and throughput on a team&#x27;s actual workload is the evaluation step the benchmark rankings skip.</p><p><b>The frontier pricing split is now a real architectural decision.</b> Fable 5 at $50 per million output tokens and North Mini Code on a single H100 represent a genuine tradeoff: self-hosting offers cost control and data residency but brings infrastructure overhead, while a managed model removes that overhead in exchange for per-token pricing. Teams running high-volume agentic coding pipelines should model both cost paths against their actual workload before committing to either.</p>
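The advice to model both cost paths can be sketched as a back-of-envelope calculation. This is an illustrative sketch, not a Cohere or Artificial Analysis tool. Its defaults reuse only the published figures quoted in this story (0.25 seconds to first token, 210 tokens per second, the 75 million versus 25 million token ratio and the $50 per million output token Fable 5 price). Workload size, output tokens per task, aggregate self-hosted throughput and the GPU hourly rate are hypothetical inputs a team would replace with its own measurements, and a self-hosted H100 may not match the Artificial Analysis speed figures.

```python
# Hypothetical back-of-envelope model, not a vendor tool. Defaults are the
# published figures quoted in the article; everything else is caller input.

TTFT_S = 0.25                   # Artificial Analysis time to first token
TOKENS_PER_S = 210.0            # Artificial Analysis output speed
VERBOSITY_RATIO = 75 / 25       # 75M output tokens vs. 25M class median
FABLE5_USD_PER_M_OUTPUT = 50.0  # reported Claude Fable 5 output price


def response_latency_s(output_tokens: float, ttft_s: float = TTFT_S,
                       tokens_per_s: float = TOKENS_PER_S) -> float:
    """Single-request latency: time to first token plus decode time."""
    return ttft_s + output_tokens / tokens_per_s


def managed_cost_usd(tasks: int, output_tokens_per_task: float,
                     usd_per_million: float = FABLE5_USD_PER_M_OUTPUT) -> float:
    """Output-token bill under per-million-token API pricing (input tokens ignored)."""
    return tasks * output_tokens_per_task / 1_000_000 * usd_per_million


def self_hosted_cost_usd(tasks: int, output_tokens_per_task: float,
                         aggregate_tokens_per_s: float, gpu_usd_per_hour: float) -> float:
    """GPU-hours spent generating output, times an hourly GPU rate.

    Assumes the GPU is saturated while generating and ignores prefill, idle
    time and operations staff, so it understates the real cost.
    """
    gpu_hours = tasks * output_tokens_per_task / aggregate_tokens_per_s / 3600
    return gpu_hours * gpu_usd_per_hour


if __name__ == "__main__":
    base_tokens = 2_000  # hypothetical output tokens per task
    for label, tokens in [("baseline", base_tokens),
                          ("3x verbose", base_tokens * VERBOSITY_RATIO)]:
        print(f"{label}: {response_latency_s(tokens):.1f} s per response")
```

Because both cost functions scale linearly with output tokens per task, a model that emits three times as many tokens needs a per-token price three times lower, or aggregate throughput three times higher, just to break even on the same work under this simplified model.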

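The sparse mixture-of-experts design described under "How it was built" (128 experts, 8 active per token) can be illustrated with generic top-k gating. This is a toy sketch, not North Mini Code's router: the article does not say whether Cohere applies the softmax before or after selecting experts, or whether it uses shared experts or load-balancing terms, and the scalar "experts" below are hypothetical stand-ins for feed-forward blocks. The sketch also includes a weights-only memory estimate showing why all 30 billion parameters must stay resident: about 60 GB at 16-bit precision, within an 80 GB H100 before KV cache, so the roughly 20 GB Mac Studio figure is consistent with a build quantized to around 4 to 5 bits per weight. That last point is an inference; the article does not state the precision used.

```python
from collections.abc import Callable
import math
import random

NUM_EXPERTS = 128  # expert count reported for North Mini Code
TOP_K = 8          # experts reported active per token


def top_k_route(router_logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Generic top-k gating: keep the k highest-scoring experts, softmax over them.

    Assumption: softmax is applied after selection; real routers vary.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    top = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x: float, router_logits: list[float],
              experts: list[Callable[[float], float]], k: int = TOP_K) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone (no KV cache, activations or runtime), in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy experts: expert j scales its input by j.
    experts = [lambda x, j=j: j * x for j in range(NUM_EXPERTS)]
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
    routes = top_k_route(logits)
    print("active experts:", [i for i, _ in routes])
    print("layer output:", moe_layer(1.0, logits, experts))
    print(f"30B weights at 16-bit: {weight_memory_gb(30e9, 16):.0f} GB")
    print(f"30B weights at 4-bit:  {weight_memory_gb(30e9, 4):.0f} GB")
```

Only 8 of the 128 expert functions are called per input, which is the source of the compute savings; the memory estimate is unaffected by routing because any expert may be selected for the next token.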