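Speaker Detection and Diarization

Automatically identify and label the different speakers in audio and video transcripts, so you know exactly who said what.

Works with publicly available audio and video. DRM-protected content is not supported.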

Uploads can be MP3, WAV, M4A, FLAC, MP4, MKV, MOV, or WebM files of up to 2 GB.

What Is Speaker Diarization?

Speaker diarization is the process of splitting an audio stream into segments according to speaker identity. Put simply, it answers the question "who spoke when?" This is essential for multi-speaker recordings such as meetings, interviews, podcasts, conference calls, and legal proceedings, where knowing who said what is just as important as what was said.

STT.ai uses neural speaker diarization models that can detect and label speakers in real time. The system creates speaker embeddings (numerical representations of each voice's unique characteristics) and clusters them to distinguish between different people. This works even when speakers have similar voices or frequently interrupt each other.

How Speaker Detection Works

1. Voice Activity Detection

The system first identifies which segments of the audio contain speech and which are silence, music, or background noise.
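
As a rough illustration of this first step, the sketch below flags speech with a simple energy threshold. This is a stand-in technique, not STT.ai's method (the page describes neural models), and an energy threshold cannot tell speech from music or loud noise. The function names, the 20 ms frame size, and the 0.05 threshold are all assumptions chosen for the example.

```python
import math


def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.05):
    """Return (start_sec, end_sec) spans whose frame energy reaches `threshold`.

    Assumption: samples are floats in [-1.0, 1.0]; the threshold is illustrative.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    spans = []
    span_start = None
    for i, energy in enumerate(energies):
        t = i * frame_len / sample_rate  # start time of this frame in seconds
        if energy >= threshold and span_start is None:
            span_start = t  # a loud frame opens a new span
        elif energy < threshold and span_start is not None:
            spans.append((span_start, t))  # a quiet frame closes the span
            span_start = None
    if span_start is not None:  # audio ended while a span was still open
        spans.append((span_start, len(energies) * frame_len / sample_rate))
    return spans


# Synthetic signal at 1 kHz: 0.1 s of silence, 0.1 s of "speech", 0.1 s of silence.
signal = [0.0] * 100 + [0.5, -0.5] * 50 + [0.0] * 100
speech_spans = detect_speech(signal, sample_rate=1000)
```

Only the spans returned here would be passed on to the embedding step; everything else is discarded as non-speech.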

2. Speaker Embeddings

Each speech segment is converted into a speaker embedding: a compact vector that captures the speaker's distinctive voice characteristics.
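
To make the idea of comparing embeddings concrete, here is a minimal sketch that measures how alike two vectors are with cosine similarity. The 4-dimensional vectors are made-up values for illustration; real embeddings come from a trained model and typically have hundreds of dimensions, and cosine similarity is an assumed (though common) choice of metric, not one the page specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: seg_a and seg_b stand for the same voice, seg_c for another.
seg_a = [0.9, 0.1, 0.0, 0.2]
seg_b = [0.8, 0.2, 0.1, 0.2]
seg_c = [0.0, 0.9, 0.7, 0.1]

same_voice = cosine_similarity(seg_a, seg_b)   # close to 1
other_voice = cosine_similarity(seg_a, seg_c)  # close to 0
```

Segments from the same speaker should score close to 1, while segments from different speakers score noticeably lower.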

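3. Clustering and Labeling

The embeddings are clustered so that segments from the same speaker are grouped together, and each cluster is then assigned a label (Speaker 1, Speaker 2, and so on).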
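
The clustering and labeling step can be sketched with a simple greedy scheme: each embedding joins the most similar existing cluster, or starts a new one if nothing is similar enough. This is one possible approach under stated assumptions, not the algorithm STT.ai uses; the 0.8 similarity threshold, the function names, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def label_speakers(embeddings, threshold=0.8):
    """Greedily cluster embeddings and return one 'Speaker N' label per embedding.

    Assumption: `threshold` is an illustrative cut-off, not a tuned value.
    """
    centroids = []  # running mean embedding of each cluster
    counts = []     # number of segments assigned to each cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # Nothing is similar enough: treat this as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Fold the new segment into the cluster's running mean.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels


# Toy embeddings for four segments, alternating between two voices.
segments = [
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.9, 0.7, 0.1],
    [0.8, 0.2, 0.1, 0.2],
    [0.1, 0.8, 0.8, 0.0],
]
labels = label_speakers(segments)
```

Labels are assigned in order of first appearance, which is why the first voice heard becomes Speaker 1. A greedy pass like this is order-dependent; production systems generally use more robust clustering.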

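Speaker Detection Use Cases

Meeting Transcription

Automatically label each participant in a meeting recording. Produce minutes that clearly show who said what.

Podcast Transcription

Distinguish hosts from guests in podcast episodes. Write show notes with accurate speaker attribution.

Interview Transcription

Separate interviewer and interviewee responses for research, journalism, and hiring documentation.

Legal and Compliance

Produce official records of depositions, hearings, and compliance calls with clear speaker identification.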

Speaker Detection on STT.ai

Speaker detection is available on all paid plans. When you transcribe audio or video with speaker detection enabled, the transcript will include speaker labels inline with the text. You can also export speaker-labeled transcripts in all supported formats including SRT, VTT, DOCX, JSON, and PDF.

Speaker 1 [00:00:01]: Welcome to the meeting, everyone. Let's start with the quarterly review.
Speaker 2 [00:00:05]: Thanks. I have the numbers ready. Revenue is up 23% quarter over quarter.
Speaker 1 [00:00:12]: That's great news. Can you walk us through the breakdown?
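
For downstream processing, a labeled transcript like the sample above can be turned into structured records. The sketch below assumes one turn per line in the form `Speaker N [HH:MM:SS]: text`, inferred from the sample only; the actual export formats (for example JSON) may be structured differently, and `parse_transcript` is a hypothetical helper, not part of any STT.ai SDK.

```python
import re

# Assumed line layout, inferred from the sample transcript: "<speaker> [HH:MM:SS]: <text>"
TURN_RE = re.compile(
    r"^(?P<speaker>.+?) \[(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\]: (?P<text>.*)$"
)


def parse_transcript(transcript):
    """Parse speaker-labeled lines into dicts with speaker, start (seconds), and text."""
    turns = []
    for line in transcript.splitlines():
        match = TURN_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not a labeled turn
        start = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
        turns.append({"speaker": match["speaker"], "start": start, "text": match["text"]})
    return turns


SAMPLE = """\
Speaker 1 [00:00:01]: Welcome to the meeting, everyone. Let's start with the quarterly review.
Speaker 2 [00:00:05]: Thanks. I have the numbers ready. Revenue is up 23% quarter over quarter.
Speaker 1 [00:00:12]: That's great news. Can you walk us through the breakdown?
"""

turns = parse_transcript(SAMPLE)
speakers = sorted({turn["speaker"] for turn in turns})
```

With the turns in this form, it is straightforward to filter by speaker or to count how many distinct speakers appear in a recording.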

The system can detect up to 20 distinct speakers in a single recording. For best results, ensure each speaker has at least a few seconds of solo speech. Overlapping speech is handled but may reduce accuracy in heavily cross-talked segments.

Try Speaker Detection Now

Upload a multi-speaker recording and watch the speakers get labeled automatically.

Frequently Asked Questions

How does speaker detection work on STT.ai?

Speaker detection runs in your browser: paste a URL, upload a file, or record from your mic. STT.ai picks the AI model and returns the transcript, usually in under 5 minutes. Export as TXT, SRT, VTT, DOCX, JSON, or PDF.

Is speaker detection free?

Yes. Every visitor gets 600 free minutes/month on STT.ai, which can be used for speaker detection like any other workflow. Paid plans start at $5/month and unlock longer files, private transcripts, and priority queueing.

How accurate is speaker detection?

Speaker detection runs on the same AI models as the rest of STT.ai. Our best models reach 95-97% accuracy on clean speech (3-5% Word Error Rate on benchmarks). You can switch models if the first pass is below your target.

What AI models can I use for speaker detection?

Speaker detection can run on any of STT.ai's 10+ models, including STT.ai Enhanced (most accurate), Whisper Large V3 (99 languages), NVIDIA Canary (#1 WER on supported languages), Whisper Turbo (fast), and Moonshine (lightweight).

Can I get subtitles from speaker detection?

Yes. Every transcript exports as SRT or VTT, which work with YouTube, Vimeo, TikTok, VLC, and every major video player. The burn-subtitles tool overlays them onto video as hardsubs.

Does speaker detection label each speaker separately?

Yes. Speaker diarization automatically labels each voice (Speaker 1, Speaker 2, ...) and you can rename them in the built-in editor. It works across all models and languages.

How long does speaker detection take?

Most speaker detection jobs finish in under 5 minutes. A 1-hour audio file typically completes in 2-3 minutes with our fastest models. Speed depends on the chosen model and current GPU load.

What input formats does speaker detection support?

Speaker detection accepts 20+ formats, including MP3, WAV, M4A, FLAC, OGG, MP4, MKV, MOV, WebM, and AVI. Output is available as TXT, SRT, VTT, DOCX, JSON, or PDF.

Is my audio private when I use speaker detection?

Yes. Audio files submitted to speaker detection are processed and deleted by default. Pro plans add client-side encryption, so even if STT.ai's database is breached, your transcripts are unreadable without your key. Data is never used for model training without explicit opt-in.

Is there a speaker detection API?

Yes. STT.ai offers a REST API with Python and Node.js SDKs, plus an MCP server for Claude and Cursor, all usable for speaker detection workflows. The free API tier includes 100 minutes/month.

Can I edit a speaker detection transcript afterward?

Yes. Every transcript opens in the built-in editor, where you can correct words, rename speakers, adjust timestamps, and add notes. All changes save automatically.

How do I share what speaker detection produces?

Every transcript gets a unique shareable URL. Export to DOCX or PDF for email. Pro plans add password-protected and permanent links, which are useful for client work.

Which platforms can STT.ai transcribe from?

STT.ai handles 1,300+ platforms including YouTube, Vimeo, TikTok, SoundCloud, Zoom, Google Meet, and podcast hosts. URL transcription works with publicly available content only; DRM-protected sources can't be transcribed.
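
The accuracy answer in the FAQ is quoted as Word Error Rate (WER). As a rough guide to what that number means, the sketch below computes WER the standard way: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's output, divided by the number of reference words. How STT.ai normalizes text before scoring is not stated, so the plain whitespace split here is an assumption, as is the function name.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace with no case or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = cur[j - 1] + 1  # extra word in the hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate(
    "welcome to the meeting everyone",
    "welcome to a meeting everyone",
)
```

One wrong word out of five gives a WER of 0.2, or 20%. A 3-5% WER therefore corresponds to roughly 3 to 5 word errors per 100 reference words. WER measures the words only; it says nothing about whether each word was attributed to the right speaker.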