Speaker Detection & Diarization

Automatically identify and label different speakers in your audio and video transcriptions. Know exactly who said what.

Works with publicly available audio and video. DRM-protected content is not supported.

What is Speaker Diarization?

Speaker diarization is the process of partitioning an audio stream into segments according to the identity of the speaker. In simpler terms, it answers the question "who spoke when?" This is essential for multi-speaker recordings like meetings, interviews, podcasts, conference calls, and legal proceedings where knowing who said what is just as important as what was said.

STT.ai uses advanced neural speaker diarization models that can detect and label speakers in real time. The system creates speaker embeddings -- numerical representations of each voice's unique characteristics -- and clusters them to distinguish between different people. This works even when speakers have similar voices or frequently interrupt each other.

How Speaker Detection Works

1. Voice Activity Detection

The system first identifies which segments of audio contain speech versus silence, music, or background noise.

2. Speaker Embedding

Each speech segment is converted into a speaker embedding -- a compact vector that captures the unique vocal characteristics of the speaker.

3. Clustering & Labeling

Embeddings are clustered to group segments from the same speaker together, then each cluster is assigned a label (Speaker 1, Speaker 2, etc.).
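The three steps above can be sketched in a few lines of Python. This is a toy illustration, not STT.ai's implementation: the energy threshold, the hand-written 3-dimensional embeddings, the `min_similarity` value, and the function names are all assumptions made for the example. Production systems use neural voice activity detection and learned embeddings with many more dimensions.

```python
import math

def detect_speech(frame_energies, threshold=0.1):
    """Step 1 (toy VAD): keep the indices of frames louder than a threshold."""
    return [i for i, energy in enumerate(frame_energies) if energy > threshold]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_embeddings(embeddings, min_similarity=0.8):
    """Step 3 (greedy clustering): assign each embedding to the most similar
    cluster centroid, or open a new cluster if none is similar enough."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, min_similarity
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched cluster.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels

# Step 1: hypothetical per-frame energies (low values are silence).
frame_energies = [0.02, 0.45, 0.50, 0.03, 0.40, 0.38]
speech_frames = detect_speech(frame_energies)

# Step 2 stand-in: made-up embeddings, one per speech frame. A real system
# would compute these with a neural speaker-embedding model.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

labels = cluster_embeddings(toy_embeddings)
print(speech_frames)
print(labels)
```

Greedy centroid matching is used here only because it is short and needs no external libraries. Real diarization pipelines typically use more robust clustering, since a greedy pass depends on the order in which segments arrive.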

Use Cases for Speaker Detection

Meeting Transcription

Automatically label each participant in meeting recordings. Generate minutes with clear attribution of who said what.

Podcast Transcription

Distinguish between host and guests in podcast episodes. Create show notes with proper speaker attribution.

Interview Transcription

Separate interviewer and interviewee responses for research, journalism, and hiring documentation.

Legal & Compliance

Create official records of depositions, hearings, and compliance calls with clear speaker identification.

Speaker Detection on STT.ai

Speaker detection is available on all paid plans. When you transcribe audio or video with speaker detection enabled, the transcript will include speaker labels inline with the text. You can also export speaker-labeled transcripts in all supported formats including SRT, VTT, DOCX, JSON, and PDF.

Speaker 1 [00:00:01]: Welcome to the meeting, everyone. Let's start with the quarterly review.
Speaker 2 [00:00:05]: Thanks. I have the numbers ready. Revenue is up 23% quarter over quarter.
Speaker 1 [00:00:12]: That's great news. Can you walk us through the breakdown?
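If you want to post-process a transcript in this inline format, a small parser is enough. The sketch below assumes the exact `Speaker N [HH:MM:SS]: text` layout shown in the sample. The layout of an actual export may differ, and `parse_turns` is a hypothetical helper, not part of any STT.ai SDK.

```python
import re

# One turn: "Speaker N [HH:MM:SS]: text", ending where the next turn begins
# or at the end of the input. Works for turns on one line or on separate lines.
TURN_PATTERN = re.compile(
    r"(Speaker \d+) \[(\d{2}):(\d{2}):(\d{2})\]: (.*?)(?=\s*Speaker \d+ \[|\Z)",
    re.DOTALL,
)

def parse_turns(transcript):
    """Return a list of dicts with speaker label, start time in seconds, and text."""
    turns = []
    for match in TURN_PATTERN.finditer(transcript):
        speaker, hh, mm, ss, text = match.groups()
        start = int(hh) * 3600 + int(mm) * 60 + int(ss)
        turns.append({"speaker": speaker, "start": start, "text": text.strip()})
    return turns

sample = (
    "Speaker 1 [00:00:01]: Welcome to the meeting, everyone. "
    "Speaker 2 [00:00:05]: Thanks. I have the numbers ready. "
    "Speaker 1 [00:00:12]: That's great news."
)

turns = parse_turns(sample)
for turn in turns:
    print(turn["start"], turn["speaker"], turn["text"])
```

Note that a turn whose text itself contains something like "Speaker 3 [" would be split incorrectly. For anything beyond quick scripts, a structured export such as JSON is the safer input.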

The system can detect up to 20 distinct speakers in a single recording. For best results, ensure each speaker has at least a few seconds of solo speech. Overlapping speech is handled but may reduce accuracy in heavily cross-talked segments.
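To make the SRT export mentioned in this section concrete, here is a minimal sketch of how speaker-labeled turns map onto SRT cues. The `Speaker N:` prefix inside each cue and the end times are assumptions for illustration: the sample transcript only shows start times, and the exact layout of STT.ai's own SRT files is not documented here.

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(turns):
    """Build SRT text from (speaker, start_seconds, end_seconds, text) tuples."""
    cues = []
    for index, (speaker, start, end, text) in enumerate(turns, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)

# Hypothetical turns; each end time is assumed to be the next turn's start.
turns = [
    ("Speaker 1", 1.0, 5.0, "Welcome to the meeting, everyone."),
    ("Speaker 2", 5.0, 12.0, "Thanks. I have the numbers ready."),
]

srt_text = to_srt(turns)
print(srt_text)
```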

Try speaker detection now

Upload a multi-speaker recording and see speakers automatically labeled.

Start Transcribing Free

Frequently Asked Questions

How does speaker detection work on STT.ai?

Speaker detection runs in your browser: paste a URL, upload a file, or record from your mic. STT.ai picks the AI model and returns the transcript in under 5 minutes. Export as TXT, SRT, VTT, DOCX, JSON, or PDF.

Is speaker detection free?

Yes. Every visitor gets 600 free minutes/month on STT.ai, usable for speaker detection the same as any other workflow. Paid plans starting at $5/month unlock longer files, private transcripts, and priority queueing.

How accurate is speaker detection?

Speaker detection runs on the same AI models as the rest of STT.ai. Our best models reach 95-97% accuracy on clean speech (3-5% Word Error Rate on benchmarks). Switch models on the fly if the first pass is below your target.

Which AI models can I use for speaker detection?

Speaker detection can run on any of STT.ai's 10+ models: STT.ai Enhanced (most accurate), Whisper Large V3 (99 languages), NVIDIA Canary (#1 WER on supported languages), Whisper Turbo (fast), Moonshine (lightweight), and more.

Can I get subtitles from speaker detection?

Every transcript exports as SRT or VTT, which work with YouTube, Vimeo, TikTok, VLC, and every major video player. The burned-in subtitles tool overlays them onto the video as hardsubs.

Does speaker detection identify different speakers?

Speaker diarization automatically labels each voice (Speaker 1, Speaker 2, ...), and you can rename the speakers in the built-in editor.

How long does speaker detection take?

Most speaker detection jobs finish in under 5 minutes. A 1-hour audio file typically finishes in 2-3 minutes with our fastest models. Speed depends on the chosen model and current GPU load.

Which input formats does speaker detection support?

Speaker detection accepts 20+ formats: MP3, WAV, M4A, FLAC, OGG, MP4, MKV, MOV, WebM, AVI, and more. Output to TXT, SRT, VTT, DOCX, JSON, or PDF.

Is my audio private when I use speaker detection?

Yes. Audio files submitted to speaker detection are processed and deleted by default. Pro plans add client-side encryption: even if STT.ai's database is breached, your transcripts are unreadable without your key. Data is never used for model training without explicit opt-in.

Is there a speaker detection API?

Yes. STT.ai offers a REST API with Python and Node.js SDKs, plus an MCP server for Claude and Cursor, all usable for speaker detection workflows. The free API tier includes 100 minutes/month.

Can I edit a speaker detection transcript afterwards?

Yes. Every transcript opens in the built-in editor, where you can fix words, rename speakers, adjust timestamps, and add notes. All changes are saved automatically.

How do I share what speaker detection produces?

Every transcript gets a unique shareable URL. Export to DOCX or PDF for email. Pro plans add password protection and permanent links, which is useful for client work.

Which platforms does speaker detection work with?

STT.ai handles 1,300+ platforms including YouTube, Vimeo, TikTok, SoundCloud, Zoom, Google Meet, podcast hosts, and more. URL transcription works with publicly available content only; DRM-protected sources can't be transcribed.
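The accuracy answer in the FAQ quotes a Word Error Rate (WER) of 3-5%. WER is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below shows that calculation. It normalizes by lowercasing only, which is an assumption: benchmarks apply their own normalization rules, so results from this function are not directly comparable to published figures.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "revenue is up twenty three percent quarter over quarter"
hypothesis = "revenue is up twenty three percent quarter of quarter"

# One substituted word out of nine reference words.
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")
```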