Scaling ML inference for real-time recommendations

May 15, 2025

Real-time recommendation systems are deceptively complex. What appears as an instant product suggestion to a user involves orchestrating multiple models, data pipelines, and caching layers—all while maintaining sub-100ms response times.

When a major e-commerce client approached us with deteriorating recommendation performance, we discovered their system was buckling under its own sophistication.

The Challenge

The existing system was processing 50,000 requests per second across 2M+ products, with each recommendation requiring:

  • User embedding lookup (20ms)
  • Item similarity computation (45ms)
  • Ranking model inference (30ms)
  • Business rule filtering (15ms)

Total latency: 110ms average, with p99 exceeding 300ms.

Infrastructure costs had grown to $180k/month, with 80% spent on GPU instances running barely optimized models.

The Solution

Rather than throwing more hardware at the problem, we implemented a three-tier optimization strategy:

1. Intelligent Caching Architecture

We introduced a hierarchical caching system that exploited recommendation patterns:

# User behavior clustering for cache warming
user_clusters = cluster_users_by_behavior(historical_interactions)
cache_warm_popular_items_by_cluster(user_clusters)

# Real-time cache with smart eviction
@cache_with_ttl(ttl_seconds=300)
def get_recommendations(user_id, context):
    return generate_recommendations(user_id, context)

This reduced cold starts by 85% and cut average response time to 65ms.
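For illustration, here is a minimal, stdlib-only sketch of what a decorator like cache_with_ttl could look like. This is an assumption about the shape of the implementation rather than the client's actual code; a production version would typically sit in front of a shared store such as Redis and evict entries under memory pressure as well as on expiry.

import functools
import time

def cache_with_ttl(ttl_seconds):
    """Memoize a function in-process; entries expire after ttl_seconds."""
    def decorator(fn):
        store = {}  # cache key -> (expires_at, value)

        @functools.wraps(fn)
        def wrapper(*args):
            # Assumes arguments are hashable (e.g. user_id plus a context key)
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]                        # fresh entry: serve from cache
            value = fn(*args)                        # miss or expired: recompute
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

The 300-second TTL trades a little freshness for hit rate, while the cluster-based warming fills the store before peak traffic instead of waiting for the first unlucky request.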

2. Model Distillation Pipeline

We compressed the ensemble of 5 heavyweight models into a single distilled model:

  • Teacher ensemble: 5 models, 2.3GB total
  • Distilled student: Single model, 140MB
  • Performance retention: 97.2% of original accuracy
  • Inference speedup: 12x

The distillation process preserved the ensemble's decision boundaries while dramatically reducing computational overhead.
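As a rough illustration of the technique (not the client's actual pipeline), a distillation training step in PyTorch might look like the sketch below: the student is trained to match the teacher ensemble's softened output distribution via KL divergence, blended with the ordinary hard-label loss. The temperature, the loss weighting, and the simplification of treating ranking as classification over candidate items are all assumptions made for the example.

import torch
import torch.nn.functional as F

def distillation_step(student, teachers, batch, optimizer, T=3.0, alpha=0.5):
    """One training step: blend KL-to-teacher loss with the hard-label loss."""
    features, labels = batch

    with torch.no_grad():
        # Average the teacher ensemble's logits to form the soft targets
        teacher_logits = torch.stack([t(features) for t in teachers]).mean(dim=0)

    student_logits = student(features)

    # Soft loss: match the teachers' distribution at temperature T
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard loss: standard cross-entropy against ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()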

3. Adaptive Batching

Instead of processing requests individually, we implemented dynamic batching that adapts to traffic patterns:

def adaptive_batch_processor():
    # Pick a batch size based on current traffic (measured QPS)
    batch_size = calculate_optimal_batch_size(current_qps)
    # Wait at most 10ms for the batch to fill before running inference
    batch = collect_requests(max_size=batch_size, max_wait_ms=10)
    return process_batch_inference(batch)

This improved GPU utilization from 23% to 78% while maintaining latency requirements.
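To make the collect_requests call above concrete, here is one way it could be implemented with only the standard library: request handlers enqueue incoming work, and the batching worker drains up to max_size items or stops after max_wait_ms, whichever comes first. The queue-based design is an assumption for illustration; the actual policy behind calculate_optimal_batch_size is not shown.

import queue
import time

request_queue = queue.Queue()   # request handlers enqueue work items here

def collect_requests(max_size, max_wait_ms):
    """Block until max_size requests are gathered or max_wait_ms has elapsed."""
    batch = [request_queue.get()]                  # always wait for at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                  # time budget spent: ship what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break                                  # queue drained before the deadline
    return batch

Capping the wait at 10ms keeps worst-case queueing delay small relative to the latency budget, while larger batches during traffic spikes keep the GPU busy.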

Results

After 6 weeks of implementation and optimization:

  • Latency: 110ms → 22ms (80% reduction)
  • Infrastructure costs: $180k/month → $72k/month (60% reduction)
  • Throughput: 50k RPS → 125k RPS on the same hardware
  • Model accuracy: Maintained 97.2% of original performance

Key Insights

  1. Profile before optimizing: Most "performance problems" are actually architectural problems
  2. Cache intelligence matters: Generic caching helps; behavioral caching transforms
  3. Model complexity ≠ model quality: Simpler models often perform just as well in production
  4. Measure what matters: Latency percentiles tell a different story than averages (see the short example below)
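A toy example of why the last point matters (the numbers are illustrative, not from the client's system): a small fraction of slow requests barely moves the mean but dominates the p99.

import numpy as np

# 980 requests at 20ms plus 20 outliers at 400ms
latencies_ms = np.concatenate([np.full(980, 20.0), np.full(20, 400.0)])
print(f"mean: {latencies_ms.mean():.1f}ms")              # 27.6ms  -- looks healthy
print(f"p99:  {np.percentile(latencies_ms, 99):.1f}ms")  # 400.0ms -- what unlucky users see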

The real lesson? Sometimes the best way to scale is to step back and question whether you need to scale at all.


Need help optimizing your ML infrastructure? We've helped dozens of companies achieve similar results. Get in touch.