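Speaker Detection and Diarization
Automatically identify and label the different speakers in audio and video transcripts. Know exactly who said what.
What is speaker diarization?
Speaker diarization is the process of splitting an audio stream into segments according to speaker identity. Put simply, it answers the question "who spoke when?" This is essential for multi-speaker recordings such as meetings, interviews, podcasts, conference calls, and legal proceedings, where knowing who said what is just as important as what was said.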
STT.ai uses advanced neural speaker diarization models that can detect and label speakers in real time. The system creates speaker embeddings -- numerical representations of each voice's unique characteristics -- and clusters them to distinguish between different people. This works even when speakers have similar voices or frequently interrupt each other.
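How speaker detection works
1. Voice activity detection
The system first identifies which parts of the audio contain speech, separating them from silence, music, and background noise.
2. Speaker embeddings
Each speech segment is converted into a speaker embedding, a compact vector that captures the distinctive characteristics of the speaker's voice.
3. Clustering and labeling
The embeddings are clustered so that segments from the same speaker are grouped together, and each cluster is then assigned a label (Speaker 1, Speaker 2, and so on).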
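The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not STT.ai's implementation: the energy threshold stands in for a real voice activity detector, the embeddings are hand-written 2-D vectors rather than the output of a neural model, and the function names (`detect_speech`, `label_speakers`) and both threshold values are hypothetical.

```python
import math

def detect_speech(frame_energies, threshold=0.1):
    """Step 1 (toy VAD): return (start, end) frame ranges whose energy exceeds the threshold."""
    segments, start = [], None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # speech ends (end index is exclusive)
            start = None
    if start is not None:                  # audio ended while speech was active
        segments.append((start, len(frame_energies)))
    return segments

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def label_speakers(embeddings, threshold=0.8):
    """Step 3 (greedy clustering): assign each embedding to the most similar
    existing cluster, or open a new cluster if none is similar enough."""
    cluster_sums = []  # per-cluster sum of member vectors (its direction acts as the centroid)
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, total in enumerate(cluster_sums):
            sim = cosine(emb, total)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            cluster_sums.append(list(emb))
            best = len(cluster_sums) - 1
        else:
            cluster_sums[best] = [t + x for t, x in zip(cluster_sums[best], emb)]
        labels.append(f"Speaker {best + 1}")
    return labels

# Step 1: find speech in a made-up sequence of per-frame energies.
speech_segments = detect_speech([0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.8, 0.9, 0.05, 0.4])

# Step 2 (assumed): one embedding per speech segment, written by hand here.
toy_embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]

# Step 3: group the segments by voice and label them.
speaker_labels = label_speakers(toy_embeddings)
```

In this example the first and third segments end up with the same label because their vectors point in nearly the same direction. Production systems typically replace each stage with a trained model and a more robust clustering method, but the flow of data is the same.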
Use cases for speaker detection
Speaker detection on STT.ai
Speaker detection is available on all paid plans. When you transcribe audio or video with speaker detection enabled, the transcript will include speaker labels inline with the text. You can also export speaker-labeled transcripts in all supported formats including SRT, VTT, DOCX, JSON, and PDF.
The system can detect up to 20 distinct speakers in a single recording. For best results, make sure each speaker has at least a few seconds of solo speech. Overlapping speech is handled, but accuracy may drop in segments with heavy crosstalk.
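To give a sense of what a speaker-labeled subtitle export can look like, here is a minimal sketch that renders labeled segments as SRT cues. The segment structure (`start`, `end`, `speaker`, `text`) and the "Speaker N: text" line layout are assumptions made for illustration; they are not STT.ai's documented export schema, and the actual files may be laid out differently.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments):
    """Render segments as numbered SRT cues, prefixing each line with its speaker label."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        cues.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues

# Hypothetical diarized segments (times in seconds).
example_segments = [
    {"start": 0.0, "end": 2.5, "speaker": "Speaker 1", "text": "Shall we get started?"},
    {"start": 2.5, "end": 4.0, "speaker": "Speaker 2", "text": "Yes, go ahead."},
]
srt_text = to_srt(example_segments)
```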