
Arabic TTS Dataset Creator

Automated pipeline that downloads YouTube videos, transcribes Arabic speech, and produces aligned audio segments for training TTS models. Built for the VoiceGuard Arabic extension.

Python · WhisperX · PyAnnote · Arabic NLP · TTS
View on GitHub

Pipeline

A nine-step automated pipeline from raw YouTube audio to a TTS-ready dataset.

1. YouTube Download

Tool: yt-dlp

Downloads audio from a list of YouTube URLs. Supports batch processing across multiple videos.
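The download step can be sketched with yt-dlp's Python API. The option names below are real yt-dlp options; `build_ydl_opts` and `download_batch` are hypothetical helper names for this example, and the exact options the pipeline uses are an assumption.

```python
def build_ydl_opts(out_dir: str) -> dict:
    """yt-dlp options for extracting best-quality audio as WAV into out_dir."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",  # requires ffmpeg on PATH
            "preferredcodec": "wav",
        }],
        "ignoreerrors": True,  # skip failed videos so a batch keeps going
    }

def download_batch(urls: list[str], out_dir: str) -> None:
    # Imported here so the helper above stays usable without yt-dlp installed.
    import yt_dlp
    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        ydl.download(urls)
```

`ignoreerrors` matters for batch runs: one removed or region-locked video should not abort the whole list.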

2. Audio Preprocessing

Tool: ffmpeg + librosa

Converts audio to 16 kHz mono WAV and normalizes volume for consistent input quality.
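A minimal sketch of the normalization half of this step, using numpy only. In the pipeline the audio is loaded via librosa (which resamples to 16 kHz mono) after the ffmpeg conversion pass; `peak_normalize` is a hypothetical helper name, and peak normalization to 0.95 is an assumption about the exact scheme used.

```python
import numpy as np

TARGET_SR = 16000  # 16 kHz mono, the standard rate for Whisper-family models

def peak_normalize(samples: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its largest absolute sample equals `peak`."""
    max_abs = np.max(np.abs(samples))
    if max_abs == 0:
        return samples  # all-silence input: nothing to scale
    return samples * (peak / max_abs)
```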

3. Speaker Diarization

Tool: pyannote/speaker-diarization-3.1 (HuggingFace)

Identifies different speakers in the audio so single-speaker segments can be isolated for TTS training.
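pyannote's diarization output is a timeline of speaker turns; for TTS training, only regions where a single speaker is active are useful. The overlap check below is a dependency-free sketch of that filtering, with turns represented as `(start, end, speaker)` tuples; `single_speaker_regions` is a hypothetical helper, not the pipeline's actual function.

```python
def single_speaker_regions(turns, speaker):
    """Return this speaker's turns that overlap no other speaker's turn.

    turns: list of (start_sec, end_sec, speaker_label) tuples.
    """
    others = [t for t in turns if t[2] != speaker]
    clean = []
    for start, end, spk in turns:
        if spk != speaker:
            continue
        # Two intervals overlap iff each starts before the other ends.
        overlaps = any(s < end and start < e for s, e, _ in others)
        if not overlaps:
            clean.append((start, end))
    return clean
```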

4. Transcription & Word-Level Alignment

Tool: WhisperX large-v2 (language: Arabic)

Transcribes Arabic speech and produces precise word-level timestamps. Runs on GPU when available, falls back to CPU. Supports multi-GPU parallelism for large batches.
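WhisperX's aligned result is a dict of segments, each carrying a `"words"` list of `{"word", "start", "end"}` entries. The helper below (a hypothetical name for illustration) flattens that structure into plain `(word, start, end)` tuples for the downstream segmenter; the field names match WhisperX's documented output format.

```python
def flatten_words(aligned_result: dict) -> list[tuple[str, float, float]]:
    """Flatten WhisperX aligned output into (word, start_sec, end_sec) tuples."""
    words = []
    for seg in aligned_result.get("segments", []):
        for w in seg.get("words", []):
            # Some tokens (e.g. punctuation-only entries) carry no timestamps; skip them.
            if "start" in w and "end" in w:
                words.append((w["word"], w["start"], w["end"]))
    return words
```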

5. Voice Activity Detection

Tool: WebRTC VAD (aggressiveness: 2)

Filters out silence, background noise, and music segments. Configurable music_threshold (default: 0.5) controls how aggressively non-speech is removed.
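WebRTC VAD classifies short fixed-size frames as speech or non-speech; a segment survives only if enough of its frames are speech. The sketch below shows one plausible reading of the thresholding (an assumption about how `music_threshold` is applied); in the pipeline the per-frame flags would come from `webrtcvad.Vad(2).is_speech()`, and `keep_segment` is a hypothetical helper.

```python
def keep_segment(frame_flags: list[bool], music_threshold: float = 0.5) -> bool:
    """Keep a segment if the fraction of speech-classified frames meets the threshold.

    frame_flags: one bool per 10/20/30 ms frame, True = speech.
    """
    if not frame_flags:
        return False  # empty segment: nothing to keep
    speech_ratio = sum(frame_flags) / len(frame_flags)
    return speech_ratio >= music_threshold
```

Raising `music_threshold` discards more mixed speech/music segments; lowering it keeps borderline material.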

6. Segment Splitting & Overlap Detection

Tool: Custom logic

Splits long audio into TTS-sized chunks. Handles oversized segments with force-split, detects overlapping speech from multiple speakers, and tracks stats for each category.
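The force-split behavior can be sketched as cutting any over-long segment into equal chunks that each fit the length cap. `force_split` and the 12-second default are illustrative assumptions; the real splitter would also snap cut points to word boundaries from the alignment step.

```python
import math

def force_split(start: float, end: float, max_len: float = 12.0):
    """Split [start, end] (seconds) into equal-length chunks, each <= max_len."""
    duration = end - start
    if duration <= max_len:
        return [(start, end)]  # already TTS-sized
    n_chunks = math.ceil(duration / max_len)
    step = duration / n_chunks
    return [(start + i * step, start + (i + 1) * step) for i in range(n_chunks)]
```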

7. Quality Control: WER Scoring

Tool: WhisperX re-transcription + jiwer

Each segment is re-transcribed and compared to the original transcript using Word Error Rate. Segments above wer_threshold (default: 0.3) are moved to manual_review/ instead of being discarded.
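The QC routing reduces to one comparison per segment. The pipeline computes WER with jiwer; here a plain word-level edit distance stands in so the example is dependency-free, and `route_segment` is a hypothetical helper showing the threshold logic.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(len(hyp) > 0)
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,               # deletion
                           row[j - 1] + 1,                # insertion
                           prev_row[j - 1] + (r != h)))   # substitution
        prev_row = row
    return prev_row[-1] / len(ref)

def route_segment(original: str, retranscribed: str, wer_threshold: float = 0.3) -> str:
    """Return the output directory a segment belongs in, given its WER score."""
    return "segments" if wer(original, retranscribed) <= wer_threshold else "manual_review"
```

Routing failures to `manual_review/` rather than deleting them preserves rare but recoverable material, which matters for a lower-resource language like Arabic.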

8. Arabic Text Normalization

Tool: Custom normalization

Converts numerals to Arabic words via a full numeral-to-word dictionary, removes punctuation, and normalizes Arabic script to produce clean TTS labels.
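A sketch of the normalization rules, with a deliberately small single-digit dictionary (the pipeline uses a full one that handles multi-digit numbers). The specific script substitutions shown, unifying hamza-carrying alef forms, mapping alef maqsura to yaa, and stripping tashkeel diacritics, are common TTS-label normalizations and are assumptions about this pipeline's exact rule set.

```python
import re

# Single digits only; the pipeline's full dictionary covers larger numbers.
DIGIT_WORDS = {"0": "صفر", "1": "واحد", "2": "اثنان", "3": "ثلاثة", "4": "أربعة",
               "5": "خمسة", "6": "ستة", "7": "سبعة", "8": "ثمانية", "9": "تسعة"}

def normalize_arabic(text: str) -> str:
    # Map Eastern Arabic digits to Western, then digits to words.
    for east, west in zip("٠١٢٣٤٥٦٧٨٩", "0123456789"):
        text = text.replace(east, west)
    text = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Unify hamza-carrying alef forms and alef maqsura.
    text = re.sub("[أإآ]", "ا", text).replace("ى", "ي")
    # Strip tashkeel (diacritics) and punctuation.
    text = re.sub("[\u064B-\u0652]", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())
```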

9. Dataset Export

Output: CSV + WAV segments

Outputs aligned (audio segment, transcript) pairs ready for TTS model fine-tuning, organized across 5 directories.
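The export step can be sketched as writing a manifest CSV next to the WAV segments. The column names and comma-delimited layout below are assumptions; common TTS fine-tuning recipes (e.g. LJSpeech-style) expect a similar filename-plus-transcript manifest.

```python
import csv
from pathlib import Path

def export_manifest(pairs, out_dir: str) -> Path:
    """Write (wav_filename, transcript) pairs to metadata.csv under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = out / "metadata.csv"
    with manifest.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "transcript"])  # hypothetical header names
        writer.writerows(pairs)
    return manifest
```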

Output Structure

arabic_dataset/

├── raw_audio/       # downloaded YouTube audio
├── diarization/     # pyannote speaker diarization results
├── alignment/       # WhisperX word-level alignments
├── segments/        # final clean audio + transcript pairs
└── manual_review/   # segments that failed the WER threshold (held for human review)