
VoiceGuard

Deepfake Audio Detection System

97%

Accuracy

97%

AUC

5

Models in Ensemble

Python · PyTorch · AWS · React · Arabic NLP

Overview

VoiceGuard is a deepfake audio detection project I built to identify whether a speech recording is real or AI-generated. As voice cloning tools get better, synthetic speech is becoming harder to catch and easier to misuse. To improve detection, I used four different audio representations (spectrograms) instead of relying on just one, then combined predictions from five models in an ensemble. This setup reached 97% accuracy and 97% AUC.
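The ensemble step can be sketched in a few lines. This is a minimal numpy sketch, assuming each of the five models outputs a probability that a clip is synthetic and that the ensemble simply averages them; the scores and the threshold are illustrative, not the actual VoiceGuard configuration:

```python
import numpy as np

def ensemble_predict(probabilities, threshold=0.5):
    """Average per-model 'synthetic' probabilities and apply a decision threshold.

    probabilities: array of shape (n_models,), each model's score in [0, 1].
    Returns (is_fake, averaged_score).
    """
    avg = float(np.mean(probabilities))
    return avg >= threshold, avg

# Hypothetical scores from five detectors for one audio clip
scores = np.array([0.91, 0.88, 0.95, 0.79, 0.93])
is_fake, confidence = ensemble_predict(scores)
print(is_fake, round(confidence, 3))  # True 0.892
```

Averaging is the simplest fusion rule; weighted averaging or majority voting are common variants when some models are more reliable than others.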

Why Four Spectrograms?

Traditional deepfake detection systems analyze audio through a single spectrogram — capturing only a limited view of the signal. VoiceGuard uses four simultaneously, each revealing different patterns and artifacts that synthetic speech leaves behind.


MFCC

Spectral envelope (timbre) & vocal tract characteristics

Captures the overall shape of the speech spectrum related to the vocal tract. Useful for spotting unnatural timbral patterns or over-smoothed characteristics that can appear in synthetic or cloned voices.
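To illustrate the multi-representation idea, the sketch below computes two of the views for a synthetic test tone: a plain magnitude spectrogram and MFCC-style cepstral coefficients (log-mel energies followed by a DCT). It uses only numpy and scipy; real pipelines typically rely on a dedicated audio library such as librosa, and the frame and filterbank parameters here are illustrative, not VoiceGuard's actual settings:

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        for b in range(l, c):
            if c > l:
                fbank[i - 1, b] = (b - l) / (c - l)
        for b in range(c, r):
            if r > c:
                fbank[i - 1, b] = (r - b) / (r - c)
    return fbank

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)  # 1 s test tone standing in for speech

# View 1: magnitude spectrogram (time-frequency energy)
f, frames, Sxx = spectrogram(x, fs=sr, nperseg=512, noverlap=256)

# View 2: MFCCs — project onto mel filters, take log, then a DCT
fbank = mel_filterbank(26, 512, sr)
mel_energy = fbank @ Sxx                                  # (26, n_frames)
mfcc = dct(np.log(mel_energy + 1e-10), axis=0, norm="ortho")[:13]

print(Sxx.shape, mfcc.shape)  # each view becomes one input "channel"
```

In a multi-view detector, each representation computed this way is fed to its own model (or its own input channel), and the predictions are fused downstream.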

System Architecture

Deployment Architecture

Chosen for this project: predictable cost, full control, and roughly $0.07/hr on a t3.medium instance.
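A quick back-of-envelope for that choice, assuming the quoted ~$0.07/hr on-demand rate (actual AWS pricing varies by region and over time):

```python
hourly_rate = 0.07           # assumed t3.medium on-demand $/hr
hours_per_month = 24 * 30    # ~720 hours if the instance runs continuously
monthly_cost = hourly_rate * hours_per_month
print(f"~${monthly_cost:.2f}/month")  # ~$50.40/month
```

Stopping the instance outside demo hours scales that figure down linearly, which is the main lever for keeping a fixed-instance deployment cheap.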

Beyond English: Arabic Deepfake Detection

Extending VoiceGuard to detect Arabic deepfakes by building a first-of-its-kind dataset.

1

Collect Arabic Audio

Gather copyright-free Arabic speech recordings from public sources

2

Record Volunteer Speech

Record clean speech from volunteers across multiple Arabic dialects, starting with Saudi dialects.

3

Fine-tune TTS Models

Fine-tune four TTS models: F1-TTS, XTTS, FishAudio, and Qwen3

4

Generate Synthetic Audio

Generate Arabic deepfake audio, starting with Saudi dialects and expanding to other dialects

5

Train Arabic Detector

Train detection model on combined real + synthetic Arabic dataset
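The final step above boils down to assembling a labeled manifest that pairs real recordings with TTS-generated clips. A minimal sketch, in which the directory layout, CSV schema, and labels are hypothetical rather than the project's actual format:

```python
import csv
from pathlib import Path

def build_manifest(real_dir, fake_dir, out_csv):
    """Write a (filepath, label, source) CSV mixing real recordings (label 0)
    with synthetic clips (label 1) for detector training."""
    rows = []
    for path in sorted(Path(real_dir).glob("*.wav")):
        rows.append((str(path), 0, "real"))
    for path in sorted(Path(fake_dir).glob("*.wav")):
        rows.append((str(path), 1, "synthetic"))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label", "source"])
        writer.writerows(rows)
    return len(rows)
```

In practice the manifest would also carry dialect and speaker tags so that train/test splits can be made speaker-disjoint, which avoids the detector memorizing voices instead of artifacts.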

The dataset is currently being expanded to include dialects from 11 Arab countries.

Results

F1 Score

97%

AUC Score

97%

Inference Time

< 30s per file

Benchmark

Outperforms results reported in several peer-reviewed papers

Robustness

Works across voices & recording quality

Research paper in progress.