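Speaker Detection and Diarization

Automatically identify and label the different speakers in audio and video transcripts. Know exactly who said what.

Works with publicly available audio and video. DRM-protected content is not supported.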

Supported upload formats: MP3, WAV, M4A, FLAC, MP4, MKV, MOV, and WebM, up to 2 GB per file.

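What is speaker diarization?

Speaker diarization is the process of partitioning an audio stream into segments according to speaker identity. Put simply, it answers the question "who spoke when?" This is essential for multi-speaker recordings like meetings, interviews, podcasts, conference calls, and legal proceedings, where knowing who said what is just as important as what was said.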

STT.ai uses advanced neural speaker diarization models that can detect and label speakers in real time. The system creates speaker embeddings -- numerical representations of each voice's unique characteristics -- and clusters them to distinguish between different people. This works even when speakers have similar voices or frequently interrupt each other.

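How speaker detection works

1. Voice activity detection

The system first identifies which segments of the audio contain speech, separating them from silence, music, and background noise.

2. Speaker embeddings

Each speech segment is converted into a speaker embedding: a compact vector that captures the distinctive characteristics of the speaker's voice.

3. Clustering and labeling

The embeddings are clustered so that segments from the same speaker are grouped together, and each cluster is then assigned a label (Speaker 1, Speaker 2, and so on).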
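
The three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not STT.ai's implementation: the energy threshold, the hand-made two-dimensional "embeddings", and the greedy cosine-similarity clustering are all chosen only to keep the example self-contained. Real systems use neural embeddings and more robust clustering.

```python
import math

def detect_speech(frame_energies, threshold=0.1):
    """Step 1 (toy VAD): return (start, end) frame ranges whose energy exceeds the threshold."""
    segments, start = [], None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # speech ends (end index is exclusive)
            start = None
    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_speakers(embeddings, min_similarity=0.8):
    """Step 3: greedy clustering. Each embedding joins the most similar
    existing centroid, or starts a new cluster if none is similar enough."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, min_similarity
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            n = counts[best]
            # Update the running mean of the matched cluster.
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels

# Step 1: made-up per-frame energies; zeros stand for silence.
energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.8, 0.0, 0.4, 0.0]
segments = detect_speech(energies)

# Step 2 stand-in: hypothetical embeddings, one per detected segment.
segment_embeddings = [(0.9, 0.1), (0.1, 0.95), (0.85, 0.2)]

labels = cluster_speakers(segment_embeddings)
for (start, end), label in zip(segments, labels):
    print(f"frames {start}-{end}: {label}")
```

The first and third segments have nearly parallel embeddings, so they land in the same cluster, while the second starts a new one. The `min_similarity` value controls how readily two voices are merged.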

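Use cases for speaker detection

Meeting transcription
Automatically label every participant in a meeting recording. Produce meeting minutes that clearly attribute who said what.
Podcast transcription
Distinguish hosts from guests in podcast episodes. Create show notes with correct speaker attribution.
Interview transcription
Separate the interviewer's questions from the interviewee's answers for research, journalism, and hiring documentation.
Legal and compliance
Create official records of court proceedings, hearings, and compliance calls with clear speaker identification.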

Speaker detection on STT.ai

Speaker detection is available on all paid plans. When you transcribe audio or video with speaker detection enabled, the transcript will include speaker labels inline with the text. You can also export speaker-labeled transcripts in all supported formats including SRT, VTT, DOCX, JSON, and PDF.

Speaker 1 [00:00:01]: Welcome to the meeting, everyone. Let's start with the quarterly review.
Speaker 2 [00:00:05]: Thanks. I have the numbers ready. Revenue is up 23% quarter over quarter.
Speaker 1 [00:00:12]: That's great news. Can you walk us through the breakdown?
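
For post-processing, a transcript in the layout of the sample above can be parsed into structured turns. The sketch below assumes each turn begins with `Speaker N [HH:MM:SS]:`, as in the sample. The `parse_turns` helper is hypothetical, and the actual layout may differ between export formats.

```python
import re

# Matches a turn header such as "Speaker 2 [00:00:05]: "
TURN = re.compile(r"(Speaker \d+) \[(\d{2}):(\d{2}):(\d{2})\]: ")

def parse_turns(transcript):
    """Split a transcript into (speaker, start_seconds, text) tuples."""
    matches = list(TURN.finditer(transcript))
    turns = []
    for i, match in enumerate(matches):
        # A turn's text runs until the next header, or to the end of the transcript.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        hours, minutes, seconds = (int(g) for g in match.groups()[1:])
        start_seconds = hours * 3600 + minutes * 60 + seconds
        turns.append((match.group(1), start_seconds, transcript[match.end():end].strip()))
    return turns

sample = (
    "Speaker 1 [00:00:01]: Welcome to the meeting, everyone. "
    "Speaker 2 [00:00:05]: Thanks. I have the numbers ready. "
    "Speaker 1 [00:00:12]: That's great news."
)
turns = parse_turns(sample)
for speaker, start, text in turns:
    print(f"{start:>4}s  {speaker}: {text}")
```

Because the text of each turn is taken up to the next header, the same function works whether the turns sit on one line or on separate lines.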

The system can detect up to 20 distinct speakers in a single recording. For best results, ensure each speaker has at least a few seconds of solo speech. Overlapping speech is handled but may reduce accuracy in heavily cross-talked segments.

Try speaker detection now

Upload a multi-speaker recording and have the speakers labeled automatically.


Frequently asked questions

Speaker detection runs in your browser: paste a URL, upload a file, or record from your mic. STT.ai picks the AI model and typically returns the transcript in under 5 minutes. Export as TXT, SRT, VTT, DOCX, JSON, or PDF.

Speaker detection is free to try: a free STT.ai account includes 600 minutes/month, usable for speaker detection the same as any other workflow. Paid plans starting at $5/month unlock longer files, private transcripts, and priority queueing.

Speaker detection runs on the same AI models as the rest of STT.ai. The best models reach 95-97% word accuracy on clean speech (3-5% Word Error Rate on benchmarks). Switch models on the fly if the first pass is below your target.
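
Word Error Rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words, so a 3-5% WER corresponds to roughly 95-97% word accuracy. Below is a minimal sketch of the standard calculation. The `wer` function name and the example sentences are illustrative, and this is not STT.ai's benchmarking code.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "welcome to the meeting everyone let's start the quarterly review"
hypothesis = "welcome to the meeting everyone let's start the quarterly revue"
error = wer(reference, hypothesis)
print(f"WER: {error:.0%}, word accuracy: {1 - error:.0%}")
```

Here one of ten reference words is wrong, giving a WER of 10%. WER measures transcription quality only; it does not score whether each word was attributed to the right speaker.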

Speaker detection can run on any of STT.ai's 10+ models, including STT.ai Enhanced (most accurate), Whisper Large V3 (99 languages), NVIDIA Canary (#1 WER on supported languages), Whisper Turbo (fast), and Moonshine (lightweight).

Transcripts double as subtitles: every transcript exports as SRT or VTT, which work with YouTube, Vimeo, TikTok, VLC, and other major video players. The burn-subtitles tool overlays them onto video as hardsubs.
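
To illustrate what a speaker-labeled subtitle file contains, the sketch below formats speaker turns as SRT cues. The `turns` structure and the `to_srt` helper are assumptions for illustration; only the SRT cue layout itself (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard. STT.ai's own exports may present speaker labels differently.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(turns):
    """Build SRT text from (speaker, start_seconds, end_seconds, text) tuples."""
    cues = []
    for index, (speaker, start, end, text) in enumerate(turns, start=1):
        cues.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{speaker}: {text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)

# Hypothetical turns; the end times are made up for the example.
turns = [
    ("Speaker 1", 1.0, 5.0, "Welcome to the meeting, everyone."),
    ("Speaker 2", 5.0, 12.0, "Thanks. I have the numbers ready."),
]
srt = to_srt(turns)
print(srt)
```

Prefixing the cue text with the speaker label is one common convention. WebVTT offers voice tags (`<v Speaker 1>`) for the same purpose.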

Speakers are labeled automatically: speaker diarization labels each voice (Speaker 1, Speaker 2, ...), and you can rename them in the built-in editor. This works across all models and languages.

Most speaker detection jobs finish in under 5 minutes. A 1-hour audio file typically completes in 2-3 minutes with the fastest models. Speed depends on the chosen model and current GPU load.

Speaker detection accepts 20+ formats, including MP3, WAV, M4A, FLAC, OGG, MP4, MKV, MOV, WebM, and AVI. Output to TXT, SRT, VTT, DOCX, JSON, or PDF.

Audio files submitted to speaker detection are processed and then deleted by default. Pro plans add client-side encryption, so even if STT.ai's database is breached, your transcripts are unreadable without your key. Data is never used for model training without explicit opt-in.

STT.ai offers a REST API with Python and Node.js SDKs, plus an MCP server for Claude and Cursor, all usable for speaker detection workflows. The free API tier includes 100 minutes/month.

Every transcript opens in the built-in editor, where you can correct words, rename speakers, adjust timestamps, and add notes. All changes save automatically.

Every transcript gets a unique shareable URL. Export to DOCX or PDF for email. Pro plans add password-protected and permanent links, which are useful for client work.

STT.ai supports URLs from 1,300+ platforms, including YouTube, Vimeo, TikTok, SoundCloud, Zoom, Google Meet, and podcast hosts. URL transcription works only with publicly available content; DRM-protected sources can't be transcribed.