Transcribe with SenseVoice

Name: SenseVoice
Author: FunAudioLLM

Works with publicly available audio & video. DRM-protected content is not supported.

Upgrade for Enhanced

Private transcript

Chat with transcript

Unlock with Pro →

Upload formats: MP3, WAV, M4A, FLAC, MP4, MKV, MOV, WebM (up to 2GB).

Batch upload multiple files with Pro

Real-time speech to text. AI auto-corrects as you speak — accuracy improves with longer speech.

10 free min/day, 600 min free with signup, no credit card, encrypted.

WER: 5.5%
Languages: 50+
Speed: 50.0x real-time
License: MIT
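As a rough illustration of how the WER and speed figures above are usually read, the sketch below computes word error rate as word-level edit distance divided by the number of reference words, and converts a real-time speed factor into an estimated processing time. The function names are hypothetical, the lowercase-and-split normalization is an assumption (benchmark scoring usually normalizes text more carefully), and the time estimate ignores upload and queueing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated compute time for audio processed at speed_factor x real-time."""
    return audio_seconds / speed_factor


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
print(processing_seconds(3600, 50.0))  # 72.0
```

At 50.0x real-time, one hour of audio works out to 3600 / 50 = 72 seconds of compute.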

About SenseVoice

SenseVoice is a speech foundation model from FunAudioLLM that goes beyond transcription. It supports 50+ languages and includes capabilities for emotion recognition, audio event detection, and inverse text normalization in a single model.
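To make the "single model" point concrete, here is a minimal sketch of reading language, emotion, and audio-event labels out of one transcription string. It assumes the raw model output is prefixed with four tags of the form `<|lang|><|EMOTION|><|Event|><|withitn|>`, as in the upstream project's published examples; the exact tag set, the sample string, and the `parse_tagged` helper are assumptions for illustration, not something this page documents.

```python
import re
from typing import NamedTuple

# Assumed layout: four leading <|...|> tags, then the transcript text.
_TAGGED = re.compile(
    r"^<\|([^|]*)\|><\|([^|]*)\|><\|([^|]*)\|><\|([^|]*)\|>(.*)$", re.S
)


class Parsed(NamedTuple):
    language: str   # e.g. "en"
    emotion: str    # e.g. "HAPPY", "NEUTRAL"
    event: str      # e.g. "Speech", "Laughter"
    itn: str        # "withitn" or "woitn" (inverse text normalization on/off)
    text: str


def parse_tagged(raw: str) -> Parsed:
    """Split an assumed tagged output string into its labels and text."""
    match = _TAGGED.match(raw.strip())
    if match is None:
        raise ValueError("output does not start with four <|...|> tags")
    lang, emotion, event, itn, text = match.groups()
    return Parsed(lang, emotion, event, itn, text.strip())


# Hypothetical sample string, not real model output.
result = parse_tagged("<|en|><|HAPPY|><|Speech|><|withitn|>Hello there.")
print(result.language, result.emotion, result.event, result.text)
```

If the hosted service already returns these labels as structured fields, this parsing step would not be needed.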

Languages Supported by SenseVoice

English

Spanish

French

German

Chinese (Mandarin)

Japanese

Korean

Portuguese

Arabic

Hindi

Russian

Italian

Dutch

Turkish

Polish

Swedish

Indonesian

Thai

Vietnamese

Czech

Greek

Romanian

Hungarian

Hebrew

Danish

Finnish

Norwegian

Ukrainian

Malay

Bengali

Model Info

Provider: FunAudioLLM
Architecture: -
License: MIT
Updated: Mar 2026

Frequently Asked Questions

What is SenseVoice?

SenseVoice is a speech-to-text model by FunAudioLLM. STT.ai hosts SenseVoice on our GPU infrastructure so you can use it without provisioning your own hardware: upload audio or video and pick SenseVoice from the model picker.

How accurate is SenseVoice?

On standard benchmarks, SenseVoice achieves around 5.5% Word Error Rate (WER). Real-world accuracy depends on audio quality, accent, and language; for noisy or accented recordings, expect WER a few percentage points higher.

Is SenseVoice free to use?

SenseVoice runs on STT.ai's free tier, where every visitor gets 600 minutes/month at no cost. Paid plans add longer per-file limits, private transcripts, and priority queueing.

What license does SenseVoice use?

SenseVoice is released under MIT, a permissive open-source license. You can self-host SenseVoice on your own hardware or use our hosted version; both are commercially usable.

How many languages does SenseVoice support?

SenseVoice supports more than 50 languages. Auto-detection picks the right language for most audio; you can also specify it manually for a small accuracy lift.

How fast is SenseVoice?

SenseVoice processes audio at about 50.0x real-time on our GPUs. A 1-hour audio file finishes in a little over a minute (3600 s / 50 = 72 s); longer files are queued, and you are notified by email when they are done.

How big is the SenseVoice model?

SenseVoice has 234M parameters. Larger models tend to be more accurate but slower; STT.ai hosts SenseVoice on GPU, so the parameter count doesn't affect your client-side performance.

What audio formats can SenseVoice transcribe?

SenseVoice accepts every format STT.ai supports, including MP3, WAV, M4A, FLAC, OGG, MP4, MKV, MOV, WebM, and AVI. Output is available as TXT, SRT, VTT, DOCX, JSON, or PDF.

Does SenseVoice detect multiple speakers?

Yes. Speaker diarization runs alongside SenseVoice for every transcription; each speaker is labeled, and you can rename them in the editor afterwards.

Is my data private when using SenseVoice?

Yes. SenseVoice runs in our managed environment. By default, audio is deleted after processing and is never used for training without explicit opt-in. Pro plans add client-side encryption for transcripts at rest.

How does SenseVoice compare to other STT models?

Use the compare-stt tool to run SenseVoice against any other supported model on the same audio. You'll see WER, segment count, speaker labels, and confidence scores side by side. The SenseVoice vs Whisper Large V3 comparison is the most commonly run.

Can I use SenseVoice via the API?

Yes. Specify "sensevoice" as the model parameter on the /v1/transcribe endpoint. The Python and Node.js SDKs include SenseVoice examples. The free API tier includes 100 minutes/month.

Can I run SenseVoice on my own server?

Yes. Because SenseVoice is MIT-licensed, you can self-host it. STT.ai's open-source page lists the project repo and weights. Most production teams use our hosted version to skip GPU procurement, model swaps, and ops.
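For the API route mentioned in the FAQ above, the following is a minimal sketch of what a request to /v1/transcribe with the model set to "sensevoice" might look like. Only the endpoint path and the model value come from this page. The host, the Bearer-token header, and the multipart field names `file` and `model` are assumptions chosen for illustration, so check the official API reference or SDK examples before relying on any of them. The sketch only builds the request and does not send it.

```python
import urllib.request
import uuid

# ASSUMPTIONS (not stated on this page): the host, Bearer-token auth, and the
# multipart field names "file" and "model" are placeholders for illustration.
BASE_URL = "https://api.example.invalid"


def build_transcribe_request(audio: bytes, filename: str, api_key: str,
                             model: str = "sensevoice") -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /v1/transcribe."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + closing
    return urllib.request.Request(
        BASE_URL + "/v1/transcribe",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real audio file.
req = build_transcribe_request(b"RIFF....", "meeting.wav", api_key="YOUR_KEY")
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)`. The response shape is not described on this page, so it is left out here.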