Transcribe with Vosk

Name: Vosk
Author: Alpha Cephei

Works with publicly available audio & video. DRM-protected content is not supported.

Upgrade for Enhanced

Private transcript

Chat with transcript

Unlock with Pro →

MP3, WAV, M4A, FLAC, MP4, MKV, MOV, WebM — up to 2GB

Batch upload multiple files with Pro

Real-time speech to text. AI auto-corrects as you speak — accuracy improves with longer speech.

10 free min/day, 600 min free with signup, no credit card, encrypted.

WER: 12.0%
Languages: 20
Speed: 100.0x real-time
License: Apache 2.0

About Vosk

Vosk is an offline speech recognition toolkit that works without an internet connection. It supports 20+ languages with compact models that can run on mobile devices, Raspberry Pi, and other platforms. It is built on the Kaldi and Zipformer architectures.

Languages Supported by Vosk

English

Spanish

French

German

Chinese (Mandarin)

Japanese

Korean

Portuguese

Arabic

Hindi

Russian

Italian

Dutch

Turkish

Polish

Swedish

Indonesian

Vietnamese

Czech

Greek

Model Info

Provider: Alpha Cephei
Architecture: -
License: Apache 2.0
Updated: Mar 2026

Related Models

3.2% WER

4.2% WER

5.1% WER

3.5% WER

7.8% WER

Frequently Asked Questions

What is Vosk?

Vosk is a speech-to-text model by Alpha Cephei. STT.ai hosts Vosk on our GPU infrastructure so you can use it without provisioning your own hardware: upload audio or video and pick Vosk from the model picker.

How accurate is Vosk?

On standard benchmarks, Vosk achieves a Word Error Rate (WER) of around 12.0%. Real-world accuracy depends on audio quality, accent, and language; for noisy or accented recordings, expect WER to be a few percentage points higher.
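For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that calculation. The function name and the simple lowercase-and-split normalization are assumptions for illustration; benchmark tooling usually normalizes text more carefully, so its numbers may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word in a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 12.0% therefore corresponds to roughly 12 word-level errors per 100 reference words.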

Is Vosk free to use?

Vosk runs on STT.ai's free tier, which gives every visitor 600 minutes/month at no cost. Paid plans add longer per-file limits, private transcripts, and priority queueing.

What license does Vosk use?

Vosk is released under Apache 2.0, a permissive open-source license. You can self-host Vosk on your own hardware or use our hosted version; both are commercially usable.

How many languages does Vosk support?

Vosk supports 20 languages. Auto-detection picks the right language for most audio; you can also specify it manually for a small accuracy lift.

How fast is Vosk?

Vosk processes audio at about 100.0x real-time on our GPUs. A 1-hour audio file finishes in under 1 minute; longer files are queued, and you are notified by email when they are done.
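The speed figure can be turned into a rough estimate: processing time is the audio duration divided by the real-time factor. The helper below is a minimal sketch with a hypothetical function name. It covers compute time only and ignores upload, queueing, and any other overhead.

```python
def estimated_processing_seconds(audio_seconds: float,
                                 speed_factor: float = 100.0) -> float:
    """Estimate compute time for a file at a given real-time speed factor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    # Compute time only: upload and queueing are not modelled here.
    return audio_seconds / speed_factor


if __name__ == "__main__":
    one_hour = 60 * 60
    print(estimated_processing_seconds(one_hour))  # 36.0 seconds at 100x
```

At 100x real-time, one hour of audio works out to about 36 seconds of processing, which is consistent with finishing in under a minute.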

How big is the Vosk model?

Vosk has 50M parameters. Larger models tend to be more accurate but slower; because STT.ai hosts Vosk on GPU, the parameter count doesn't affect your client-side performance.

What audio formats can Vosk transcribe?

Vosk accepts every format STT.ai supports, including MP3, WAV, M4A, FLAC, OGG, MP4, MKV, MOV, WebM, and AVI. Output is available as TXT, SRT, VTT, DOCX, JSON, or PDF.
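To make the subtitle output concrete, here is a minimal sketch of how timed segments map onto the SRT format: a running cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text. The `(start, end, text)` tuple shape and both function names are assumptions for illustration, not STT.ai's actual export code or JSON schema.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text) - assumed shape


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello there."), (1.25, 3.0, "Welcome back.")]))
```

VTT is structurally similar but uses a `WEBVTT` header and a period instead of a comma before the milliseconds.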

Does Vosk detect multiple speakers?

Yes. Speaker diarization runs alongside Vosk for every transcription: each speaker is labeled, and you can rename them in the editor afterwards.

Is my data private when using Vosk?

Yes. Vosk runs in our managed environment. By default, audio is processed and then deleted, and it is never used for training without explicit opt-in. Pro plans add client-side encryption for transcripts at rest.

How does Vosk compare to other STT models?

Use the compare-stt tool to run Vosk against any other supported model on the same audio; you'll see WER, segment count, speaker labels, and confidence scores side by side. The Vosk vs Whisper Large V3 comparison is the most commonly run.

Can I use Vosk via the API?

Yes. Specify "vosk" as the model parameter on the /v1/transcribe endpoint. The Python and Node.js SDKs include Vosk examples, and the free API tier includes 100 minutes/month.
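As a rough illustration of what such a call might look like without the SDK, the sketch below builds a multipart POST request using only the Python standard library. Only the `/v1/transcribe` path and the `"vosk"` model value come from this page. The host (`api.example.com` is a placeholder), the Bearer-token header, and the `file` and `model` field names are all assumptions, so check the API reference or the official SDK examples before relying on any of them.

```python
import urllib.request
import uuid

API_BASE = "https://api.example.com"  # placeholder host, NOT the real API address


def build_transcribe_request(audio_bytes: bytes, filename: str,
                             api_key: str, model: str = "vosk"):
    """Build (but do not send) a multipart request for /v1/transcribe.

    Field names and the auth scheme are assumptions for illustration.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Assumed form field carrying the model name.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # Assumed form field carrying the audio file.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{API_BASE}/v1/transcribe",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_transcribe_request(b"RIFF....", "sample.wav", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request), once the real
    # host and field names are confirmed.
```

The request is built separately from being sent so the payload can be inspected first; in practice the SDKs mentioned above are likely the simpler route.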

Can I run Vosk on my own server?

Yes. Because Vosk is Apache 2.0-licensed, you can self-host it. STT.ai's open-source page lists the project repo and weights. Most production teams use our hosted version to skip GPU procurement, model swaps, and ops.
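For a sense of what self-hosting involves, here is a minimal sketch using the open-source `vosk` Python package (`pip install vosk`) to transcribe a WAV file offline. It assumes you have already downloaded and unpacked a model directory and that the input is 16-bit mono PCM; the paths and both function names are hypothetical. It does not reproduce STT.ai's hosted pipeline, which adds diarization and other processing.

```python
import json
import wave


def join_results(json_results):
    """Join the "text" field of each Vosk JSON result into one transcript."""
    texts = (json.loads(result).get("text", "") for result in json_results)
    return " ".join(text for text in texts if text)


def transcribe_wav(wav_path: str, model_dir: str) -> str:
    """Transcribe a 16-bit mono PCM WAV file with a local Vosk model."""
    # Imported lazily so the helper above works without vosk installed.
    from vosk import KaldiRecognizer, Model

    with wave.open(wav_path, "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono PCM WAV file")

        recognizer = KaldiRecognizer(Model(model_dir), wav_file.getframerate())
        results = []
        while True:
            chunk = wav_file.readframes(4000)
            if not chunk:
                break
            # AcceptWaveform returns true when an utterance has ended.
            if recognizer.AcceptWaveform(chunk):
                results.append(recognizer.Result())
        results.append(recognizer.FinalResult())
    return join_results(results)


if __name__ == "__main__":
    # Hypothetical paths: point these at your own file and model directory.
    print(transcribe_wav("sample.wav", "path/to/vosk-model"))
```

Other audio formats would first need converting to mono PCM WAV, for example with ffmpeg.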

What is Vosk?

How accurate is Vosk?

Is Vosk free to use?

What license does Vosk use?

How many languages does Vosk support?

How fast is Vosk?

How big is the Vosk model?

What audio formats can Vosk transcribe?

Does Vosk detect multiple speakers?

Is my data private when using Vosk?

How does Vosk compare to other STT models?

Can I use Vosk via the API?

Can I run Vosk on my own server?