Arabic TTS Dataset Creator
Automated pipeline that downloads YouTube videos, transcribes Arabic speech, and produces aligned audio segments for training TTS models. Built for the VoiceGuard Arabic extension.
Pipeline
9-step automated pipeline from raw YouTube audio to TTS-ready dataset.
YouTube Download
Tool: yt-dlp
Downloads audio from a list of YouTube URLs. Supports batch processing across multiple videos.
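A minimal sketch of this step using the yt-dlp Python API. The output directory, filename template, and option choices here are illustrative assumptions, not the pipeline's exact configuration:

```python
def build_opts(out_dir):
    """Build yt-dlp options for audio-only batch downloads."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",  # requires ffmpeg on PATH
            "preferredcodec": "wav",
        }],
        "ignoreerrors": True,  # keep going if one video in the batch fails
    }

def download_batch(urls, out_dir="raw_audio"):
    # Imported here so the sketch stays importable without yt-dlp installed.
    from yt_dlp import YoutubeDL
    with YoutubeDL(build_opts(out_dir)) as ydl:
        ydl.download(urls)
```

`ignoreerrors` matters for batch runs: a single private or deleted video should not abort the whole download list.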
Audio Preprocessing
Tool: ffmpeg + librosa
Converts audio to 16kHz mono WAV format and normalizes volume for consistent input quality.
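A sketch of the preprocessing step. The peak-normalization target (0.95) is an assumed value; loading is shown with `librosa.load`, which resamples and downmixes in one call:

```python
import numpy as np

# Equivalent ffmpeg command for the resample/downmix step
# (input filename is a placeholder):
#   ffmpeg -i input.m4a -ac 1 -ar 16000 output.wav

def peak_normalize(y, peak=0.95):
    """Scale a waveform so its maximum absolute sample equals `peak`."""
    m = np.max(np.abs(y))
    return y if m == 0 else y * (peak / m)

def load_16k_mono(path):
    # Imported here so the sketch runs without librosa installed.
    import librosa
    y, sr = librosa.load(path, sr=16000, mono=True)  # resample + downmix
    return peak_normalize(y), sr
```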
Speaker Diarization
Tool: pyannote/speaker-diarization-3.1 (HuggingFace)
Identifies different speakers in the audio so that single-speaker segments can be isolated for TTS training.
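A sketch of the diarization step. The `single_speaker_turns` filter is one plausible way to isolate clean turns (drop any turn that overlaps a different speaker); the actual selection logic in the pipeline may differ:

```python
def single_speaker_turns(turns):
    """Keep turns that do not overlap a turn from a different speaker.

    `turns` is a list of (start, end, speaker) tuples, as produced by
    iterating a pyannote annotation.
    """
    kept = []
    for i, (s, e, spk) in enumerate(turns):
        clash = any(
            o_spk != spk and o_s < e and s < o_e
            for j, (o_s, o_e, o_spk) in enumerate(turns) if j != i
        )
        if not clash:
            kept.append((s, e, spk))
    return kept

def diarize(wav_path, hf_token):
    # Imported here so the sketch runs without pyannote installed.
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
    annotation = pipeline(wav_path)
    turns = [(t.start, t.end, spk)
             for t, _, spk in annotation.itertracks(yield_label=True)]
    return single_speaker_turns(turns)
```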
Transcription & Word-Level Alignment
Tool: WhisperX large-v2 (language: Arabic)
Transcribes Arabic speech and produces precise word-level timestamps. Runs on GPU when available and falls back to CPU; supports multi-GPU parallelism for large batches.
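The GPU fallback and multi-GPU batching described above can be sketched as follows. The round-robin `shard` helper is an assumption about how files are distributed across workers; the WhisperX calls follow that library's usual API:

```python
def shard(files, n_workers):
    """Round-robin assignment of audio files to workers (one per GPU)."""
    return [files[i::n_workers] for i in range(n_workers)]

def transcribe_aligned(audio_path, device=None):
    # Imported here so the sketch runs without whisperx/torch installed.
    import torch
    import whisperx
    if device is None:
        # GPU when available, CPU fallback.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisperx.load_model("large-v2", device, language="ar")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio)
    # Second pass adds word-level timestamps via forced alignment.
    align_model, meta = whisperx.load_align_model(language_code="ar",
                                                  device=device)
    return whisperx.align(result["segments"], align_model, meta,
                          audio, device)
```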
Voice Activity Detection
Tool: WebRTC VAD (aggressiveness: 2)
Filters out silence, background noise, and music segments. A configurable music_threshold (default: 0.5) controls how aggressively non-speech is removed.
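A sketch of the VAD step. How per-frame flags are combined against music_threshold is an assumption here (keep a segment when its speech-frame fraction reaches the threshold); the frame slicing follows WebRTC VAD's requirement of 10/20/30 ms frames of 16-bit mono PCM:

```python
def is_speech_segment(flags, music_threshold=0.5):
    """Decide whether a segment is speech from per-frame VAD flags.

    Keep rule (an assumption): the fraction of speech frames must
    reach music_threshold.
    """
    return bool(flags) and sum(flags) / len(flags) >= music_threshold

def frame_flags(pcm16, sample_rate=16000, frame_ms=30, aggressiveness=2):
    # Imported here so the sketch runs without webrtcvad installed.
    import webrtcvad
    vad = webrtcvad.Vad(aggressiveness)
    n = int(sample_rate * frame_ms / 1000) * 2  # bytes per 16-bit mono frame
    frames = [pcm16[i:i + n] for i in range(0, len(pcm16) - n + 1, n)]
    return [vad.is_speech(f, sample_rate) for f in frames]
```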
Segment Splitting & Overlap Detection
Tool: Custom logic
Splits long audio into TTS-sized chunks. Handles oversized segments with a force-split, detects overlapping speech from multiple speakers, and tracks statistics for each category.
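The force-split can be sketched as a greedy cut over word-level timestamps, so chunks always break at word boundaries. The 12-second budget is an assumed default, not the pipeline's actual value:

```python
def split_words(words, max_dur=12.0):
    """Greedy split of word-level timestamps into chunks under max_dur seconds.

    `words` is a list of (word, start, end) tuples. A chunk is force-closed
    as soon as adding the next word would exceed the duration budget.
    """
    chunks, current = [], []
    for word, start, end in words:
        if current and end - current[0][1] > max_dur:
            chunks.append(current)  # close the chunk at a word boundary
            current = []
        current.append((word, start, end))
    if current:
        chunks.append(current)
    return chunks
```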
Quality Control — WER Scoring
Tool: WhisperX re-transcription + jiwer
Each segment is re-transcribed and compared to the original transcript using Word Error Rate (WER). Segments above wer_threshold (default: 0.3) are moved to manual_review/ instead of being discarded.
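WER is the word-level edit distance between reference and hypothesis, divided by the reference length; a dependency-free equivalent of what jiwer computes is shown below. The destination for accepted segments ("segments/") is a hypothetical name, only manual_review/ is named by the pipeline:

```python
def wer(reference, hypothesis):
    """Word Error Rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def route(score, wer_threshold=0.3):
    """High-WER segments go to manual review rather than being discarded."""
    return "manual_review/" if score > wer_threshold else "segments/"
```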
Arabic Text Normalization
Tool: Custom normalization
Converts numerals to Arabic words using a full dictionary, removes punctuation, and normalizes Arabic script to produce clean TTS labels.
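A minimal sketch of the normalization. The numeral table here covers three digits for illustration only (the real pipeline uses a full dictionary including compound numbers), and the alef unification is one common normalization choice, not necessarily the pipeline's exact rule set:

```python
import re

# Illustrative numeral table; the real pipeline's dictionary is far larger.
NUM_WORDS = {"1": "واحد", "2": "اثنان", "3": "ثلاثة"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tanween ... sukun
PUNCT = re.compile(r"[^\w\s]")               # anything not a word char/space

def normalize(text):
    """Digits -> Arabic words, strip diacritics/punctuation, unify alef."""
    text = "".join(NUM_WORDS.get(ch, ch) for ch in text)
    text = DIACRITICS.sub("", text)
    text = text.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
    text = PUNCT.sub("", text)
    return " ".join(text.split())
```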
Dataset Export
Output: CSV + WAV segments
Outputs aligned (audio segment, transcript) pairs ready for TTS fine-tuning, organized across 5 directories.
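The export step can be sketched as below. The metadata.csv filename and column layout are assumptions; adapt them to whatever manifest format your TTS trainer expects:

```python
import csv
import os

def export(pairs, out_dir):
    """Write (wav_filename, transcript) rows to metadata.csv in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "metadata.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "transcript"])  # assumed header
        writer.writerows(pairs)
    return path
```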
Output Structure
arabic_dataset/