
VoiceGuard

Deepfake Audio Detection System

97%

Accuracy

97%

AUC

5

Models in Ensemble

Python · PyTorch · AWS · React · Arabic NLP

Overview

VoiceGuard is a deepfake audio detection project I built to identify whether a speech recording is real or AI-generated. As voice cloning tools get better, synthetic speech is becoming harder to catch and easier to misuse. To improve detection, I used four different audio representations (spectrograms) instead of relying on just one, then combined predictions from five models in an ensemble. This setup reached 97% accuracy and 97% AUC.
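The ensemble step can be sketched in a few lines. This is a minimal numpy sketch, assuming each of the five models outputs a probability that a clip is synthetic and that the ensemble simply averages them; the scores and the threshold are illustrative, not the actual VoiceGuard configuration:

```python
import numpy as np

def ensemble_predict(probabilities, threshold=0.5):
    """Average per-model 'synthetic' probabilities and apply a decision threshold.

    probabilities: array of shape (n_models,), each model's score in [0, 1].
    Returns (is_fake, averaged_score).
    """
    avg = float(np.mean(probabilities))
    return avg >= threshold, avg

# Hypothetical scores from five detectors for one audio clip
scores = np.array([0.91, 0.88, 0.95, 0.79, 0.93])
is_fake, confidence = ensemble_predict(scores)
print(is_fake, round(confidence, 3))  # True 0.892
```

Averaging is the simplest fusion rule; weighted averaging or majority voting are common variants when some models are more reliable than others.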

Why Four Spectrograms?

Traditional deepfake detection systems analyze audio through a single spectrogram — capturing only a limited view of the signal. VoiceGuard uses four simultaneously, each revealing different patterns and artifacts that synthetic speech leaves behind.


MFCC

Spectral envelope (timbre) & vocal tract characteristics

Captures the overall shape of the speech spectrum related to the vocal tract. Useful for spotting unnatural timbral patterns or over-smoothed characteristics that can appear in synthetic or cloned voices.
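To illustrate the multi-representation idea, the sketch below computes two of the views for a synthetic test tone: a plain magnitude spectrogram and MFCC-style cepstral coefficients (log-mel energies followed by a DCT). It uses only numpy and scipy; real pipelines typically rely on a dedicated audio library such as librosa, and the frame and filterbank parameters here are illustrative, not VoiceGuard's actual settings:

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        for b in range(l, c):
            if c > l:
                fbank[i - 1, b] = (b - l) / (c - l)
        for b in range(c, r):
            if r > c:
                fbank[i - 1, b] = (r - b) / (r - c)
    return fbank

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)  # 1 s test tone standing in for speech

# View 1: magnitude spectrogram (time-frequency energy)
f, frames, Sxx = spectrogram(x, fs=sr, nperseg=512, noverlap=256)

# View 2: MFCCs — project onto mel filters, take log, then a DCT
fbank = mel_filterbank(26, 512, sr)
mel_energy = fbank @ Sxx                                  # (26, n_frames)
mfcc = dct(np.log(mel_energy + 1e-10), axis=0, norm="ortho")[:13]

print(Sxx.shape, mfcc.shape)  # each view becomes one input "channel"
```

In a multi-view detector, each representation computed this way is fed to its own model (or its own input channel), and the predictions are fused downstream.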

System Architecture

Deployment Architecture

Chosen for this project: predictable cost, full control, and roughly $0.07/hr on a t3.medium instance.
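A quick back-of-envelope for that choice, assuming the quoted ~$0.07/hr on-demand rate (actual AWS pricing varies by region and over time):

```python
hourly_rate = 0.07           # assumed t3.medium on-demand $/hr
hours_per_month = 24 * 30    # ~720 hours if the instance runs continuously
monthly_cost = hourly_rate * hours_per_month
print(f"~${monthly_cost:.2f}/month")  # ~$50.40/month
```

Stopping the instance outside demo hours scales that figure down linearly, which is the main lever for keeping a fixed-instance deployment cheap.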

Beyond English: Arabic Deepfake Detection

Extending VoiceGuard to detect Arabic deepfakes by building a first-of-its-kind dataset.

1

Collect Arabic Audio

Gather copyright-free Arabic speech recordings from public sources

2

Record Volunteer Speech

Record clean speech from volunteers across multiple Arabic dialects, starting with Saudi dialects.

3

Fine-tune TTS Models

Fine-tune four TTS models: F1-TTS, XTTS, FishAudio, and Qwen3

4

Generate Synthetic Audio

Generate Arabic deepfake audio, starting with Saudi dialects and expanding to other dialects

5

Train Arabic Detector

Train detection model on combined real + synthetic Arabic dataset
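The final step above boils down to assembling a labeled manifest that pairs real recordings with TTS-generated clips. A minimal sketch, in which the directory layout, CSV schema, and labels are hypothetical rather than the project's actual format:

```python
import csv
from pathlib import Path

def build_manifest(real_dir, fake_dir, out_csv):
    """Write a (filepath, label, source) CSV mixing real recordings (label 0)
    with synthetic clips (label 1) for detector training."""
    rows = []
    for path in sorted(Path(real_dir).glob("*.wav")):
        rows.append((str(path), 0, "real"))
    for path in sorted(Path(fake_dir).glob("*.wav")):
        rows.append((str(path), 1, "synthetic"))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label", "source"])
        writer.writerows(rows)
    return len(rows)
```

In practice the manifest would also carry dialect and speaker tags so that train/test splits can be made speaker-disjoint, which avoids the detector memorizing voices instead of artifacts.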

The dataset is currently being expanded to include dialects from 11 Arab countries.

Results

F1 Score

97%

AUC Score

97%

Inference Time

< 30s per file

Benchmark

Outperforms results reported in several peer-reviewed papers

Robustness

Works across voices & recording quality

Research paper in progress.