OpenAI Whisper changed audio transcription. It approaches human accuracy on clean audio, supports 99 languages (accuracy varies by language), copes well with accents and background noise, and costs $0.006 per minute through OpenAI's API. Transcription services that charge $1+ per minute are now hard to justify for most workloads.
But Whisper is a model, not a tool. There are at least a dozen ways to actually get a transcript out of it, and the right one depends on your volume, latency needs, and how much engineering time you want to spend.
TL;DR by use case
- Single transcript, one-off: paste into MacWhisper or use OpenAI's web playground. 2-minute job.
- Weekly podcast workflow: a tool with built-in Whisper like ClipForge, Riverside, or Descript handles transcription + downstream work in one pass.
- Building a product: OpenAI Whisper API ($0.006/min) is the right starting point. Move to self-hosted only if you cross ~$500/mo or have privacy constraints.
- Real-time captions: skip Whisper. Use Deepgram or AssemblyAI streaming. Whisper's strength is batch quality, not latency.
- Privacy-critical (medical, legal): self-hosted Whisper Large-v3 on a single H100 box. The open-source weights match the API for English and lag slightly on rare languages.
1. The OpenAI Whisper API — the default
If you're building anything, start here. The API costs $0.006/minute, has zero infrastructure cost, and ships with model upgrades you don't manage. ClipForge uses this directly for podcast and audio ingestion — total transcription cost per user per month is typically under $0.50 even at heavy usage.
Limits to know: a 25MB file size cap (split larger files into chunks), no live streaming, and you're handing audio to OpenAI (which matters for some regulated industries).
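A minimal sketch of working within the size cap, assuming the official `openai` Python package (v1 or later) with `OPENAI_API_KEY` set and `ffmpeg` on the PATH. The helper names, the decimal reading of the 25MB cap, and the 600-second chunk length are illustrative choices, not anything the API prescribes.

```python
import subprocess
from pathlib import Path

# Assumption: the 25MB cap is read conservatively as 25,000,000 bytes.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def needs_chunking(path: Path, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """True when the file is larger than the upload cap."""
    return path.stat().st_size > limit


def ffmpeg_split_command(src: Path, out_dir: Path, seconds: int = 600) -> list[str]:
    """Build an ffmpeg command that cuts src into fixed-length pieces.

    Uses ffmpeg's segment muxer with stream copy, so nothing is re-encoded.
    """
    pattern = out_dir / f"{src.stem}_%03d{src.suffix}"
    return [
        "ffmpeg", "-i", str(src),
        "-f", "segment", "-segment_time", str(seconds),
        "-c", "copy", str(pattern),
    ]


def transcribe_file(path: Path) -> str:
    """Send one file to the hosted Whisper model and return its text."""
    from openai import OpenAI  # third-party; imported lazily

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with path.open("rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text


def transcribe(path: Path, work_dir: Path) -> str:
    """Transcribe directly when small enough, otherwise split first."""
    if not needs_chunking(path):
        return transcribe_file(path)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_split_command(path, work_dir), check=True)
    chunks = sorted(work_dir.glob(f"{path.stem}_*{path.suffix}"))
    return " ".join(transcribe_file(chunk) for chunk in chunks)
```

Fixed-length cuts can land mid-word, so a production pipeline would probably split on silence or overlap the chunks slightly. This sketch skips that for brevity.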
2. MacWhisper — best desktop app
MacWhisper is a $59 one-time Mac app that runs Whisper locally. No API costs, no cloud, and transcripts never leave your machine. Quality is excellent (it uses Whisper Large-v3 by default), and on Apple Silicon it is often faster than a round trip to the API.
Best for: lawyers, doctors, journalists, and anyone doing one-off transcription where privacy beats throughput.
3. ClipForge / Descript / Riverside — Whisper inside a workflow
These tools wrap Whisper inside a larger product. You upload audio, you get back not just a transcript but the downstream artifacts: clips, captions, repurposed content, editable transcripts. The transcription itself is invisible — you're paying for the workflow.
Pick by what comes after: ClipForge if you're repurposing the audio into other content, Descript if you want a full audio/video editor, Riverside if you're recording the podcast in the same tool.
4. Self-hosted Whisper — the cost-control option
If your API bill is heading past ~$500/mo (roughly 1,400 hours of audio at $0.006/min), self-hosting Whisper on a GPU box can become cheaper than the API. The open-source weights (tiny, base, small, medium, large-v3) are available on Hugging Face. For English, Large-v3 matches the API; on rare languages it lags by a few percent WER (word error rate).
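For a sense of what running the weights yourself involves, here is a minimal sketch that assumes the open-source `openai-whisper` Python package (one of several ways to run the model) and `ffmpeg` on the PATH. The function names and the timestamp format are illustrative.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_segments(segments) -> str:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into lines."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def transcribe_locally(audio_path: str, model_name: str = "large-v3") -> str:
    """Run Whisper on this machine and return a timestamped transcript."""
    import whisper  # third-party: pip install openai-whisper

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(audio_path)
    return format_segments(result["segments"])
```

Swapping `model_name` for a smaller size such as `"small"` trades accuracy for speed and memory, which is a reasonable way to try the setup before committing GPU budget.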
Engineering cost is real: GPU provisioning, queue management, retries, monitoring. Skip this until your API bill genuinely justifies it.
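To check whether the bill justifies it, the break-even arithmetic is short. The sketch below uses the $0.006/min API price from this article. The self-hosted figure is a hypothetical input you would replace with your own GPU, storage, and engineering-time estimate.

```python
API_PRICE_PER_MINUTE = 0.006  # USD per audio minute, the API price quoted above


def api_monthly_cost(hours_of_audio: float) -> float:
    """Monthly API bill in USD for a given number of audio hours."""
    return hours_of_audio * 60 * API_PRICE_PER_MINUTE


def break_even_hours(self_hosted_monthly_cost: float) -> float:
    """Audio hours per month at which the API bill equals the self-hosted cost."""
    return self_hosted_monthly_cost / (60 * API_PRICE_PER_MINUTE)


if __name__ == "__main__":
    budget = 500.0  # hypothetical all-in monthly cost of self-hosting, in USD
    print(f"API cost for 100 h/month: ${api_monthly_cost(100):.2f}")
    print(f"Break-even at ${budget:.0f}/month: {break_even_hours(budget):.0f} h")
```

At this price, 100 hours of audio costs about $36 through the API, and a $500 monthly self-hosting budget only pays for itself beyond roughly 1,389 hours of audio per month.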
5. Deepgram / AssemblyAI — when you actually need real-time
Whisper is a batch model. If your product needs live captioning, courtroom-style transcription, or low-latency voice agents, Deepgram and AssemblyAI offer streaming APIs that typically return results within a few hundred milliseconds. They cost more than Whisper but solve a fundamentally different problem.
Common mistakes
- Using cheap STT models (Google Speech-to-Text default, generic open-source models) for podcast transcription. At a word error rate of 8–15%, roughly one word in ten is wrong, and every downstream AI generation step that builds on the transcript inherits those errors.
- Self-hosting Whisper before you have meaningful volume. Your time is worth more than the API savings.
- Forgetting to chunk audio over 25MB. The OpenAI API will reject the request — use ffmpeg to split into 10-minute chunks before upload.
- Skipping speaker diarization. Whisper doesn't separate speakers natively. Pair it with pyannote-audio (open-source) or a tool that builds diarization on top.
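One way to pair the two outputs is to label each transcript segment with the speaker whose turn overlaps it most. The sketch below assumes you already have Whisper-style segments (dicts with `start`, `end`, `text`) and a list of speaker turns from your diarizer. The `Turn` type and function names are hypothetical, and converting pyannote-audio's output into `Turn` objects is left out.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarized speaker turn, in seconds (hypothetical container)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown: str = "UNKNOWN"):
    """Return copies of the segments with a 'speaker' key added.

    Each segment gets the speaker of the turn it overlaps most;
    segments that overlap no turn are labeled with `unknown`.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Segment-level assignment is coarse: when two people talk within one segment, only the dominant speaker is kept. Word-level timestamps would allow finer splits at the cost of more bookkeeping.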

