INDUSTRY / RECOMMENDATION_SYSTEMS
CTR Optimization @ Scale
TorchRec + FBGEMM + DLRM
Industrial-scale recommendation for 10^11 user-item pairs with sub-10ms latency at millions of QPS.
Meta AI
TorchRec
FBGEMM-GPU
DLRM
Two-Tower
INT8
$ system.monitor --live
QPS:1,247,893
P99_LATENCY:8.3ms
CTR_LIFT:+2.47%
EMBEDDING_IDS:10^11
[LIVE] Production metrics • Meta-scale infrastructure
/*
PROBLEM:
Digital ads and content feeds must select and rank items in milliseconds
under massive sparsity and tight latency/throughput SLOs.
*/
SCALE_CHALLENGES
EMBEDDING_SCALE
TB-level tables
10^8–10^11 IDs
QPS_RANKING
Millions/sec
Sub-10ms budget
SPARSITY
Billions of pairs
Massive cold-start
CONTINUAL_TRAIN
Shifting catalogs
Non-stationary
{
"solution": "TorchRec + FBGEMM + DLRM",
"training": "DistributedModelParallel",
"sharding": "TW/RW/CW auto-planner",
"inference": "INT8 quantized FBGEMM",
"scale": "thousands of GPUs"
}
TorchRec
PyTorch's domain library for large-scale recommenders
Parallelism Primitives
DistributedModelParallel (DMP) wires model/data parallelism with auto-planning
01
Embedding Sharding
Table-wise, row-wise, column-wise, and hybrid TW-RW-CW schemes with EmbeddingPlanner
02
FBGEMM Integration
INT8/FP16 quantized kernels, table-batched embedding bags for training & inference
03
Production-Ready
Open-sourced by Meta AI, proven at Meta-scale with ecosystem adoption
04
github.com/meta-pytorch/torchrec
System Blueprint
Training → Serving
Event Logs
[1/7]
Feature Store
[2/7]
TorchRec Training
[3/7]
Model Registry
[4/7]
Inference Build
[5/7]
Online Ranker
[6/7]
Monitoring
[7/7]
Sharding & Planning
TW-RW-CW schemes, EmbeddingPlanner auto-generates strategies
FBGEMM Kernels
INT8/FP16 ops, table-batched embedding bags
DLRM Architecture
Dense MLP + sparse embeddings + interaction
Models
DLRM
Deep Learning Recommendation Model
Bottom MLP (dense) + EmbeddingBagCollection (sparse) → feature interaction (dot/cat) → Top MLP → CTR/CVR head
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.models.dlrm import DLRM  # library implementation of the sketch above

N_users, N_items = 100_000, 50_000  # example cardinalities

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="user_id",
            feature_names=["user_id"],
            num_embeddings=N_users,
            embedding_dim=64,
        ),
        EmbeddingBagConfig(
            name="item_id",
            feature_names=["item_id"],
            num_embeddings=N_items,
            embedding_dim=64,
        ),
    ]
)

# Dense arch must end at embedding_dim (64) so the dot-product
# interaction can mix dense and sparse features.
model = DLRM(
    embedding_bag_collection=ebc,
    dense_in_features=32,
    dense_arch_layer_sizes=[128, 64],
    over_arch_layer_sizes=[256, 128, 1],
)
→ Production default | Meta-proven architecture
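FORWARD_PASS_SKETCH
How a batch feeds the model above: sparse IDs arrive as a variable-length KeyedJaggedTensor. A minimal sketch; the batch contents are illustrative.

import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Batch of 2: one user_id each; example 0 has two item_id features,
# example 1 has one.
sparse = KeyedJaggedTensor.from_lengths_sync(
    keys=["user_id", "item_id"],
    values=torch.tensor([101, 202, 7, 8, 9]),  # flat IDs, key-major
    lengths=torch.tensor([1, 1, 2, 1]),        # per-example lengths per key
)
dense = torch.rand(2, 32)                      # matches dense_in_features=32

logits = model(dense_features=dense, sparse_features=sparse)  # shape [2, 1]
ctr = torch.sigmoid(logits)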
Two-Tower
Retrieval Model
User tower and item tower trained with a contrastive loss; ANN retrieval (pre-rank) sends the top-K candidates to the DLRM ranker (sketch after this card)
User Tower
Embedding + MLP → user_vector
Item Tower
Embedding + MLP → item_vector
ANN Search
Sub-10ms recall at high throughput
→ Candidate generation | Databricks-endorsed
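TWO_TOWER_SKETCH
A minimal, self-contained sketch of the two towers with in-batch contrastive (softmax) loss. Tower, the dimensions, and the temperature are illustrative choices, not a TorchRec API.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Embedding + MLP -> L2-normalized vector (user or item)."""
    def __init__(self, num_ids: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(num_ids, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(self.emb(ids)), dim=-1)

user_tower, item_tower = Tower(100_000), Tower(50_000)
user_ids = torch.randint(0, 100_000, (256,))
item_ids = torch.randint(0, 50_000, (256,))  # positives, aligned by row

u, v = user_tower(user_ids), item_tower(item_ids)
logits = u @ v.T / 0.05  # temperature-scaled similarities
# In-batch contrastive loss: row i's positive is column i;
# every other item in the batch acts as a negative.
loss = F.cross_entropy(logits, torch.arange(256))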
DISTRIBUTED_TRAINING
Training at Scale
[1]
DMP Parallelism
Mixes data/model parallel; large tables are row/table/column-wise sharded
Planner auto-chooses per-table plan
[2]
2D Sparse Parallel
Training across thousands of GPUs via 2D embedding parallel
Available via TorchRec's DMPCollection
[3]
Fused Optimizers
Fused embedding optimizers; Adagrad/Lion/Adam for dense towers (sketch after the sharding example below)
Mixed precision: FP16/BF16 dense, 8-bit sparse
[4]
Streaming Updates
Mini-batching from log streams; periodic full re-index of new IDs
Handles catalog changes and cold-start
SHARDING_EXAMPLE
import torch.distributed as dist
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology

# Assumes the torch.distributed process group is initialized;
# W is the number of ranks, e.g. dist.get_world_size().
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=W, compute_device="cuda")
)
plan = planner.collective_plan(model)  # syncs the plan across ranks
sharded = DistributedModelParallel(module=model, plan=plan)
# Planner chooses table-/row-/column-wise sharding per table;
# DMP executes the resulting plan.
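FUSED_OPTIMIZER_SKETCH
A hedged sketch of item [3]: passing fused_params through the sharder so FBGEMM applies rowwise Adagrad inside the embedding backward kernel. The keys shown are the common ones; exact options depend on TorchRec/FBGEMM versions.

from fbgemm_gpu.split_embedding_configs import EmbOptimType
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.model_parallel import DistributedModelParallel

# The sparse update runs inside the FBGEMM backward kernel, so
# gradient tensors are never materialized for the huge tables.
fused_params = {
    "optimizer": EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    "learning_rate": 0.02,
    "eps": 1e-8,
}
sharded = DistributedModelParallel(
    module=model,
    sharders=[EmbeddingBagCollectionSharder(fused_params=fused_params)],
)
# Dense towers keep a regular optimizer (e.g. Adam) over the
# remaining, non-fused parameters.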
INFERENCE_STACK
Quantization & Serving
Edge (Two-Tower)
Candidate retrieval
<5–10ms
Ranker (DLRM)
Final scoring
<50ms P99
OPTIMIZATION_STACK
[✓]
Quantize & shard for serving with TorchRec inference APIs
[✓]
Table-batched embedding bag ops minimize kernel launches
[✓]
Memory-bandwidth optimized FBGEMM paths
[✓]
Multi-host sharded inference with hot-ID caches
QUANTIZATION_API
from torchrec.inference.modules import (
    quantize_inference_model,
    shard_quant_model,
)

device = "cuda"

# Swap the trained model's embedding tables for INT8 FBGEMM kernels.
quant_model = quantize_inference_model(model)

# Shard the quantized model for distributed serving.
quant_model, _ = shard_quant_model(
    quant_model,
    compute_device=device,
    sharding_device=device,
)
# Deployment-ready: quantized & sharded.
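Design note: FBGEMM's rowwise INT8 format stores 8-bit weights plus a per-row scale and bias, cutting embedding memory roughly 4x vs FP32; the accuracy delta is typically small but should be validated offline before rollout.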
Case Blueprints
A
Ads CTR Ranker (DLRM + TorchRec)
BUILD_STEPS
1. Schema: user/item/campaign sparse IDs + dense context
2. Model: DLRM with TorchRec EmbeddingBagCollection
3. Sharding: TW for tiny tables, RW for user_id, CW for item_id (constraint sketch after this card)
4. Training: Mixed precision, in-batch negatives, thousands-GPU scaling
5. Inference: Quantize + shard with TorchRec APIs, FBGEMM serving
OBJECTIVE
Maximize CTR/ROAS under latency & budget constraints
DELIVERABLES
Lift curves by segment, calibration report, latency P95/P99, shard plan & memory budget
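CONSTRAINTS_SKETCH
Step 3's per-table plan, pinned via planner constraints; a sketch assuming TorchRec's ParameterConstraints API, with W as in the sharding example above.

from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints

# Pin a sharding scheme per table; the planner places anything
# left unconstrained.
constraints = {
    "campaign_id": ParameterConstraints(sharding_types=["table_wise"]),  # tiny
    "user_id": ParameterConstraints(sharding_types=["row_wise"]),
    "item_id": ParameterConstraints(sharding_types=["column_wise"]),
}
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=W, compute_device="cuda"),
    constraints=constraints,
)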
B
Two-Stage Recommender (Retrieval → Ranking)
BUILD_STEPS
1. Retrieval: Two-Tower (user/item towers) with ANN recall (FAISS sketch after this card)
2. Ranking: DLRM on the top-K candidates
3. TorchRec handles both models' embeddings
4. Quantization: INT8 tables for retrieval, FP16 for ranker
5. Online: Canary rollout, CTR & dwell-time lift, tail-latency alarms
OBJECTIVE
Scale to 10^8–10^9 items with strict SLOs
DELIVERABLES
Reference Two-Tower + DLRM pipeline on TorchRec, with recall@K and retrieval/ranker latency breakdown
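ANN_RETRIEVAL_SKETCH
The ANN recall step from step 1, sketched with FAISS (one common choice; the index type and K are illustrative).

import faiss
import numpy as np

d, n_items, K = 64, 1_000_000, 500

item_vecs = np.random.rand(n_items, d).astype("float32")
faiss.normalize_L2(item_vecs)  # cosine similarity via inner product

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(item_vecs)

user_vec = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_vec)
scores, candidate_ids = index.search(user_vec, K)  # top-K for the DLRM ranker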
C
Catalog Cold-Start & Long-Tail Boost
BUILD_STEPS
1. Side-info embeddings (text/category/price) → item-tower bootstrap
2. Frequency-aware negatives + re-weighting in loss (sketch after this card)
3. Serving: Exploration bonus for cold items, damped rank updates
OBJECTIVE
Mitigate cold-start and long-tail starvation
DELIVERABLES
Catalog coverage ↑, new-item CTR ↑, head-tail exposure parity ↑
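TAIL_REWEIGHTING_SKETCH
Step 2's loss re-weighting in minimal form; the exponent alpha is a tunable assumption.

import torch
import torch.nn.functional as F

def freq_weighted_bce(logits, labels, item_ids, item_freq, alpha=0.5):
    """BCE with per-example weights ~ 1 / freq^alpha.

    item_freq: impression counts per item id. Head items are
    down-weighted; tail items are up-weighted.
    """
    w = item_freq[item_ids].float().clamp(min=1.0).pow(-alpha)
    w = w / w.mean()  # keep the average loss scale unchanged
    return F.binary_cross_entropy_with_logits(logits, labels, weight=w)

# Illustrative usage
item_freq = torch.randint(1, 10_000, (50_000,))
logits = torch.randn(256)
labels = torch.randint(0, 2, (256,)).float()
ids = torch.randint(0, 50_000, (256,))
loss = freq_weighted_bce(logits, labels, ids, item_freq)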
MLOps & SRE
Registry
• Checkpoint
• Shard plan
• Quant config
• Featureset version
CI/CD
• Offline eval gates
• Schema drift tests
• Blue/green deploy
• Cache seeding
Observability
• CTR/CVR tracking
• Exposure parity
• Feature staleness
• GPU metrics
Governance
• Content fairness
• Drift monitors
• Audit trails
• Rollback triggers
PRODUCTION_KPIS
Quality
CTR lift vs baseline
+3–10%
Quality
AUC / PR-AUC
≥ baseline + 0.5–1.5 pts
Latency
Ranker P99
≤ 50 ms
Latency
Retrieval P99
≤ 10 ms
Robustness
Exposure parity (JSD)
↓ 20–40%
Ops
Timeouts
< 0.3%
Efficiency
GPU mem util.
≥ 70% sustained
Offline Evaluation
→ AUC / PR-AUC, calibration (ECE; sketch below)
→ Lift vs heuristic
→ Counterfactual replay (IPS/DR optional)
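ECE_SKETCH
Expected calibration error, as referenced above, in minimal form (15 equal-width bins is a common default).

import torch

def expected_calibration_error(probs, labels, n_bins: int = 15) -> float:
    """ECE: |mean predicted prob - empirical CTR| per bin,
    averaged with bin-occupancy weights."""
    bins = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.float().mean() * (probs[mask].mean() - labels[mask].mean()).abs()
    return float(ece)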
Online A/B Testing
→ CTR/CVR uplift, RPM/ROAS
→ Canary (5–10% traffic)
→ Rollback on CTR floor breach
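CANARY_GUARD_SKETCH
Illustrative only: the CTR-floor rollback rule as a predicate; the threshold is an assumption, not a specific platform API.

def should_rollback(canary_ctr: float, control_ctr: float,
                    floor_ratio: float = 0.98) -> bool:
    """Trigger rollback if the canary's CTR drops below
    floor_ratio * control CTR (e.g. a 2% relative floor)."""
    return canary_ctr < floor_ratio * control_ctr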
Deploy Meta-scale recommendation systems with TorchRec
CTR_LIFT
+3–10%
P99_LATENCY
<50ms
SCALE
10^11
PRECISION
INT8
TorchRec • FBGEMM • DLRM • Meta AI • Production-Proven