Brewing ideas, coding intelligence

🥈 2nd Place

DAL Sheep Classification 2025

2nd Place - Kaggle Competition

97%

Public LB

98%

Private LB

4

Model Ensemble

PyTorch · Computer Vision · Transfer Learning · Ensemble

Challenge Overview

A Kaggle AI competition where the goal was to classify sheep breeds from images using computer vision. The dataset was relatively small, which made overfitting a real concern, and my early models kept getting stuck around 85% accuracy. The biggest part of the work was figuring out which model architectures, training strategies, and augmentations actually helped move past that plateau.

Goal

Classify sheep breeds from images

Evaluation

Accuracy on public and private leaderboard

Dataset

Small dataset requiring careful regularization

Key Challenge

Breaking through a stubborn ~85% accuracy plateau

Model Journey

Each step below shows what I changed and how it affected performance.

83%

ConvNeXt-Large

Initial Attempt

Started with ConvNeXt-Large as the backbone and got 83% on the public leaderboard, which felt disappointing at first. Funnily enough, this initial model scored 94% on the private leaderboard, so I started off better than I thought :)

83% → 92%

CoAtNet

Architecture Switch + Bug Fix

Switched to CoAtNet, which combines convolution and attention mechanisms. It initially scored the same 83%, but after finding and fixing a few training bugs, accuracy jumped to 92%.

Escaped plateau

+ CosineAnnealing

Scheduler Breakthrough

Added CosineAnnealingWarmRestarts as the learning rate scheduler. This was the key breakthrough: the warm restarts helped the model escape local minima that were keeping it stuck.

92% → 95%

+ Gradual Unfreezing

Fine-tuning Strategy

Implemented gradual unfreezing: epochs 0–10 trained only the classification head with the backbone frozen, then epoch 11+ unfroze everything. This careful approach yielded a +3% accuracy boost.

97% / 98%

Full Ensemble

Final Ensemble

Combined 4 diverse models (CoAtNet-3, EVA02-Base, ViT-Base, ConvNeXt-Small) with weighted soft voting. Each model captures different features. The diversity across the models helped push the score to 97% on the public leaderboard and 98% on the private leaderboard.

Final Ensemble Architecture

35%

CoAtNet-3

coatnet_3_rw_224.sw_in12k

Best individual cross-validation score. Combines convolution for local features with self-attention for global context.

30%

EVA02-Base

eva02_base_patch14_224.mim_in22k

Excellent at capturing fine-grained texture differences between breeds. Pre-trained on ImageNet-22k with masked image modeling.

20%

ViT-Base

vit_base_patch16_224.augreg_in21k

Pure attention architecture provides a different inductive bias. Captures global spatial relationships that CNNs can miss.

15%

ConvNeXt-Small

convnext_small.fb_in22k

Smaller model with good regularization properties. Adds diversity to the ensemble without overfitting to training patterns.

Weighted Soft Voting

Weighted average of softmax outputs

Final Prediction
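The weighted soft-voting step above can be sketched in a few lines. This is a minimal illustration, not the competition code: the model names mirror the ensemble members, the weights are the ones listed above, and the per-model logits in the usage example are made-up numbers.

```python
import math

# Ensemble weights from the architecture above.
WEIGHTS = {
    "coatnet_3": 0.35,
    "eva02_base": 0.30,
    "vit_base": 0.20,
    "convnext_small": 0.15,
}

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_soft_vote(logits_per_model):
    """Weighted average of per-model softmax outputs -> predicted class index."""
    n_classes = len(next(iter(logits_per_model.values())))
    avg = [0.0] * n_classes
    for name, logits in logits_per_model.items():
        probs = softmax(logits)
        for c in range(n_classes):
            avg[c] += WEIGHTS[name] * probs[c]
    return max(range(n_classes), key=lambda c: avg[c])
```

Soft voting (averaging probabilities) rather than hard voting (majority over argmax) lets a confident model outweigh two unsure ones, which is where a diverse ensemble earns its keep.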

Training Strategy

Data Augmentation

MixUp (α=0.2) + CutMix (α=0.8), randomly applied with 50% probability

Rotation, Color Jitter, Random Erasing, TTA (8 samples)
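The MixUp/CutMix coin flip can be sketched roughly as below. This is an illustrative implementation, not the competition code; the α values and the 50% application probability come from the writeup, while the 50/50 split between MixUp and CutMix and the helper name are assumptions.

```python
import torch

def mixup_cutmix(images, labels, alpha_mix=0.2, alpha_cut=0.8, p=0.5):
    """With probability p, apply MixUp or CutMix (chosen 50/50) to a batch.

    Returns (images, labels_a, labels_b, lam); the training loss is then
    lam * CE(out, labels_a) + (1 - lam) * CE(out, labels_b).
    """
    if torch.rand(1).item() > p:
        return images, labels, labels, 1.0
    perm = torch.randperm(images.size(0))
    if torch.rand(1).item() < 0.5:
        # MixUp: blend whole images.
        lam = torch.distributions.Beta(alpha_mix, alpha_mix).sample().item()
        images = lam * images + (1 - lam) * images[perm]
    else:
        # CutMix: paste a rectangular patch from the permuted batch.
        lam = torch.distributions.Beta(alpha_cut, alpha_cut).sample().item()
        h, w = images.shape[-2:]
        rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
        y = torch.randint(0, h - rh + 1, (1,)).item()
        x = torch.randint(0, w - rw + 1, (1,)).item()
        images[:, :, y:y + rh, x:x + rw] = images[perm, :, y:y + rh, x:x + rw]
        lam = 1 - rh * rw / (h * w)  # recompute from the actual patch area
    return images, labels, labels[perm], lam
```

Both augmentations fight overfitting on a small dataset by forcing the model to commit to soft, blended targets instead of memorizing individual images.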

Gradual Unfreezing

Epochs 0–10: frozen backbone, train head only

Epochs 11+: unfreeze everything at once

+3% accuracy boost
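The two-phase schedule above can be sketched as follows. This is a minimal illustration under the assumption that the classifier's parameters are named with a `head` prefix (as in timm models); the helper name is mine.

```python
import torch.nn as nn

def set_backbone_trainable(model: nn.Module, trainable: bool, head_name: str = "head"):
    """Freeze or unfreeze every parameter except the classification head."""
    for name, param in model.named_parameters():
        if not name.startswith(head_name):
            param.requires_grad = trainable

# In the training loop:
#   set_backbone_trainable(model, False)     # epochs 0-10: head only
#   if epoch == 11:
#       set_backbone_trainable(model, True)  # epoch 11+: unfreeze everything
```

Freezing the backbone first lets the randomly initialized head settle without large, noisy gradients wrecking the pretrained features; only then is the whole network fine-tuned.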

With vs Without Gradual Unfreezing

Figure: validation accuracy per epoch for a standard run (the entire network, input image → conv 1–3 → conv 4–6 → conv 7–9 → head, trained from epoch 0) versus the gradual run (head only at first, backbone unfrozen at the marked epoch).


CosineAnnealingWarmRestarts

Learning rate scheduler that periodically restarts LR to escape local minima. This was the key breakthrough that pushed accuracy from 83% to 92%.

Key breakthrough: 83% → 92%

Learning Rate Schedule
Figure: learning rate over epochs 0–50 with warm restarts (T₀=5, T_mult=2, η_max=1e-3; LR axis from 1e-6 to 1e-3).
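The schedule follows the standard warm-restart formula: the LR decays from η_max toward η_min along a cosine over each cycle, then snaps back to η_max, with cycle lengths growing by T_mult. A small self-contained sketch (η_min=1e-6 is my assumption, read off the bottom of the chart; in PyTorch the equivalent is `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=2, eta_min=1e-6)`):

```python
import math

def cosine_warm_restarts(epoch, t0=5, t_mult=2, eta_max=1e-3, eta_min=1e-6):
    """LR at a given (possibly fractional) epoch under warm restarts."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:   # walk past completed cycles
        t_cur -= t_i
        t_i *= t_mult     # each cycle is t_mult times longer
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

With T₀=5 and T_mult=2, restarts land at epochs 5, 15, and 35; each jump back to η_max is what kicks the model out of the basin it had settled into.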

Training Config

Batch Size: 16
Epochs: 20
Base LR: 1e-5
Cross Validation: 5-Fold Stratified
Loss: 60% CE + 40% Focal
GPU: RTX 6000 Ada
Image Size: 224×224 (384×384 for ConvNeXt)
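The 60% CE + 40% focal blend from the config can be sketched as below. This is an illustration, not the competition code: the focal exponent γ=2.0 is my assumption (the writeup does not state it), and the focal term is derived from the per-sample cross-entropy.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, ce_weight=0.6, focal_weight=0.4, gamma=2.0):
    """60/40 blend of cross-entropy and focal loss over a batch."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                  # probability of the true class
    focal = ((1 - pt) ** gamma) * ce     # down-weights easy examples
    return (ce_weight * ce + focal_weight * focal).mean()
```

The focal term keeps easy, already-correct examples from dominating the gradient, which helps when some breeds are much easier to recognize than others.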

What Didn't Work

A big part of the competition was testing ideas quickly and dropping the ones that didn’t help.

Pseudo-labeling

Tried using model predictions on unlabeled data as pseudo-labels for additional training. It produced no measurable improvement, so I dropped it.

Object detection first

Tried isolating the sheep with an object detection model before classification. The gains were too small to justify the extra complexity and training time.