Cohere launched its first speech recognition product, entering a competitive market with a model it says tops every open- and closed-source rival on a widely used accuracy benchmark — including OpenAI's Whisper, currently one of the most widely deployed transcription tools in enterprise software.
The model, called Cohere Transcribe, is available today as a free, open-source download on Hugging Face under an Apache 2.0 license. It can be run locally, accessed via Cohere's API or deployed at scale through the company's managed cloud platform.
| Average Word Error Rate | Languages Supported | Model Size |
|---|---|---|
| 5.42% | 14 (incl. English, Arabic, Japanese) | 2B parameters |
Table of Contents
- How It Performs Against the Competition
- What's Under the Hood
- How to Access Cohere Transcribe
- What Comes Next
How It Performs Against the Competition
Cohere Transcribe currently ranks first on the Hugging Face Open ASR Leaderboard, a standardized benchmark that measures word error rate (WER) across a variety of real-world audio datasets. WER is the number of word-level errors in a transcript (substituted, missing or extra words) expressed as a percentage of the words actually spoken. Lower scores indicate higher accuracy.
| Model | Avg. WER | Multi-speaker WER (AMI) | Diverse-accent WER (VoxPopuli) |
|---|---|---|---|
| Cohere Transcribe | 5.42% | 8.13% | 5.87% |
| Zoom Scribe v1 | 5.47% | 10.03% | 5.37% |
| IBM Granite 4.0 1B Speech | 5.52% | 8.44% | 5.84% |
| NVIDIA Canary Qwen 2.5B | 5.63% | 10.19% | 5.66% |
| ElevenLabs Scribe v2 | 5.83% | 11.86% | 6.80% |
| OpenAI Whisper Large v3 | 7.44% | 15.95% | 9.54% |
Source: Hugging Face Open ASR Leaderboard, as of March 26, 2026. AMI tests multi-speaker, boardroom-style audio; VoxPopuli tests diverse accents.
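For readers who want to see how a WER figure like those in the table is derived, here is a minimal Python sketch. It assumes the standard definition (substitutions, deletions and insertions divided by the number of reference words) and uses a plain word-level edit distance. The function name `word_error_rate` and the sample sentences are illustrative, not taken from Cohere or the leaderboard, and real benchmarks typically apply fuller text normalization than the simple lowercasing shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five spoken words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(f"{wer:.1%}")  # prints 20.0%
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words, which is one reason the metric is reported as an error rate and not as an accuracy percentage.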
The margin over OpenAI's Whisper, long the default choice for developers building transcription into products, is notable. Cohere Transcribe's 5.42% average WER compares with Whisper Large v3's 7.44%, a gap of 2.02 percentage points, or roughly 27% fewer errors in relative terms.
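The relative figure can be checked with a few lines of Python. This sketch uses only the two average WER values from the leaderboard table; the helper name `relative_wer_reduction` is hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new model eliminates."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Average WER values (in percent) from the leaderboard table.
whisper_large_v3 = 7.44
cohere_transcribe = 5.42

reduction = relative_wer_reduction(whisper_large_v3, cohere_transcribe)
print(f"Absolute gap: {whisper_large_v3 - cohere_transcribe:.2f} points")
print(f"Relative reduction: {reduction:.1%}")  # prints Relative reduction: 27.2%
```

The distinction matters when reading vendor claims: the absolute gap is about two percentage points, while the relative reduction describes how many of Whisper's errors would disappear at Cohere Transcribe's error rate.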
Cohere also ran human evaluations alongside the automated benchmarks, in which trained reviewers compared transcripts across real-world audio for accuracy, coherence and usability. In English-language pairwise comparisons, Cohere Transcribe was preferred over Whisper Large v3 in 64% of comparisons, and over ElevenLabs Scribe v2 in 51% of comparisons. Results were closer in some non-English languages — German and Spanish preference scores hovered around 50%, while Japanese showed a strong advantage at 66–70% against tested rivals.
What's Under the Hood
"The speed is exceptional, turning minutes of audio into usable transcripts in seconds, and it immediately unlocks new possibilities for real-time products and workflows. In our testing, the model handled everyday speech very well and delivered strong, reliable transcription quality."
- Paige Dickie, Vice President, Radical Ventures
Cohere's model uses a conformer-based encoder-decoder architecture: a large Conformer encoder extracts acoustic features from audio, while a lightweight Transformer decoder converts those features into text. The company says it was trained from scratch — not fine-tuned from an existing model — with a deliberate focus on minimizing word error rate under production conditions.
At 2 billion parameters, it is large enough to achieve state-of-the-art accuracy while remaining practical for real-world GPU deployment or local use. Supported languages include:
- European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
- Asia-Pacific: Chinese, Japanese, Korean, Vietnamese
- Middle East / North Africa: Arabic
How to Access Cohere Transcribe
Cohere Transcribe is available through three routes:
- Open source: Download directly from Hugging Face and run locally, including on edge hardware.
- API access: Free experimentation via Cohere's API dashboard, subject to rate limits. See the documentation for integration details.
- Model Vault: Production deployment through Cohere's managed private-cloud platform, with dedicated infrastructure, no rate limits and per-hour pricing.
What Comes Next
Cohere describes the launch as its "zero to one" in enterprise speech — a starting point rather than a finished product. The company says it is working toward deeper integration of Transcribe with North, its AI agent orchestration platform, with the goal of expanding from transcription into broader speech intelligence capabilities such as real-time customer support and speech analytics.
For a company that has largely competed on the strength of its large language models and retrieval tools, the move into speech signals a push to cover more of the enterprise AI stack — and a direct challenge to incumbents like OpenAI and ElevenLabs on a modality that is increasingly central to automated business workflows.