# MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion
A generative diffusion model for Marketing Mix Modeling (MMM) that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.
## v2 Fixes (from v1)

### Problem 1: Sales Alignment (predicted sales didn't match total sales)

**Root cause:** v1 had `loss_sales = 0.0` → no gradient signal for sales reconstruction.

**Fix:** Added a differentiable sales reconstruction loss (`L_sales`) that flows through the coefficient → contribution → total sales path. Uses a warmup schedule (the first 25% of epochs focus on core coefficient denoising, then the sales loss ramps in).
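A minimal sketch of this fix, assuming illustrative helper names (`aux_weight`, `sales_from_coeffs`) rather than the repo's actual API:

```python
import numpy as np

def aux_weight(epoch, total_epochs, warmup_frac=0.25):
    """Ramp the auxiliary-loss weight from 0 to 1 after the warmup phase.

    Hypothetical helper illustrating the v2 schedule: the first 25% of
    epochs train only the core coefficient denoising, then the weight
    ramps linearly to 1.
    """
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return 0.0
    return min(1.0, (epoch - warmup) / max(1.0, total_epochs - warmup))

def sales_from_coeffs(beta_media, media_transformed, beta_ctrl, controls, base):
    """Differentiable reconstruction: coefficients -> contributions -> sales.

    beta_media:        (T, n_media) time-varying media coefficients
    media_transformed: (T, n_media) Hill(Adstock(spend)) features
    beta_ctrl/controls:(T, n_ctrl)  control coefficients and variables
    """
    contrib_media = (beta_media * media_transformed).sum(axis=-1)  # (T,)
    contrib_ctrl = (beta_ctrl * controls).sum(axis=-1)             # (T,)
    return base + contrib_media + contrib_ctrl
```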
### Problem 2: Coefficients Too Smooth (compared to GT)

**Root cause:** The smoothness loss weight (0.1) was too high relative to the reconstruction loss, and GT coefficient volatility was too low (OU process σ=0.05).

**Fixes:**
- **Spectral loss (`L_spectral`):** Log-magnitude FFT loss that penalizes frequency-spectrum differences, with higher weights on high frequencies to fight smoothing
- **Multi-scale temporal loss:** Matches 1st AND 2nd order temporal derivatives (velocity + acceleration)
- **Higher GT volatility:** Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) + regime-change jumps
- **Contribution matching loss:** Directly matches predicted channel-level contributions to GT
- **Reduced smoothness weight:** 0.1 → 0.05
- **Loss warmup:** Core denoising trained first, auxiliary losses ramped in after 25% of training
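The two loss additions most responsible for restoring volatility can be sketched in numpy (the weighting scheme here is illustrative; the real implementation operates on torch tensors):

```python
import numpy as np

def spectral_loss(pred, target, hf_weight=2.0):
    """Log-magnitude FFT loss with extra weight on high-frequency bins.

    Minimal 1-D sketch of L_spectral; `hf_weight` is an assumed
    parameter controlling how much high frequencies count.
    """
    eps = 1e-8
    P = np.log(np.abs(np.fft.rfft(pred)) + eps)
    Q = np.log(np.abs(np.fft.rfft(target)) + eps)
    # weights increase linearly toward the highest frequency bin
    w = 1.0 + (hf_weight - 1.0) * np.arange(len(P)) / max(1, len(P) - 1)
    return float(np.mean(w * (P - Q) ** 2))

def temporal_loss(pred, target):
    """Multi-scale temporal loss: match 1st-order (velocity) and
    2nd-order (acceleration) differences along the time axis."""
    d1p, d1t = np.diff(pred), np.diff(target)
    d2p, d2t = np.diff(pred, 2), np.diff(target, 2)
    return float(np.mean((d1p - d1t) ** 2) + 0.5 * np.mean((d2p - d2t) ** 2))
```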
## Architecture
```
┌──────────────────┐    ┌────────────────────────────────────────────┐
│  CONDITIONING    │    │ STAGE 1: Campaign/Geo Denoiser             │
│                  │    │ (≈ Kimodo Root Denoiser)                   │
│ • Media Spend    │───▶│ Transformer (4 layers, d=192)              │
│   (5 channels)   │    │ Denoises aggregate patterns                │
│ • Controls       │    │ from controls + total sales                │
│   (3 variables)  │    └──────────────┬─────────────────────────────┘
│ • Total Sales    │                   │ Campaign Context
└──────────────────┘                   ▼
          ┌──────────────────────────────────────────────────┐
          │ STAGE 2: Channel Denoiser                        │
          │ (≈ Kimodo Body Denoiser)                         │
          │ Cross-Attention + Transformer (6 layers, d=256)  │
          │                                                  │
          │ CONSTRAINTS:                                     │
          │ • Log-space for media (exp → always ≥ 0)         │
          │ • PhysDiff-style projection every K steps        │
          │ • Soft sign penalty loss                         │
          └──────────────────────┬───────────────────────────┘
                                 ▼
          ┌──────────────────────────────────────────────────┐
          │ OUTPUT: Time-Varying Coefficients (T, 8)         │
          │ β_TV, β_Digital, β_Social, β_Print, β_Radio      │
          │ β_Seasonality, β_Trend, β_CompetitorPrice        │
          │ → Sales = base + Σ β_m·Hill(Adstock(x))          │
          │           + Σ β_c·ctrl + noise                   │
          └──────────────────────────────────────────────────┘
```
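The decomposition in the output box uses standard MMM media transforms. A minimal sketch with assumed parameters (`decay`, `half_sat`, `slope` are illustrative, not the repo's fitted values):

```python
import numpy as np

def adstock(x, decay=0.6):
    """Geometric adstock: spend in week t carries over into later weeks
    with exponentially decaying effect (assumed decay rate)."""
    out = np.zeros_like(x, dtype=float)
    carry = 0.0
    for t in range(len(x)):
        carry = x[t] + decay * carry
        out[t] = carry
    return out

def hill(x, half_sat=1.0, slope=2.0):
    """Hill saturation: diminishing returns on adstocked spend,
    bounded in [0, 1); hill(half_sat) = 0.5."""
    xs = np.power(x, slope)
    return xs / (xs + half_sat ** slope)

# Media contribution for one channel: beta_t * Hill(Adstock(spend_t))
spend = np.array([1.0, 0.0, 0.0, 2.0])
feature = hill(adstock(spend, decay=0.5), half_sat=1.0, slope=2.0)
```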
## Kimodo → MMM Mapping
| Kimodo (Motion Generation) | MMM-Diffusion (Marketing) |
|---|---|
| Text prompts | Media spend, non-marketing vars, total sales |
| Motion/position constraints | Sign constraints (β_media ≥ 0) + prior bounds |
| Root denoiser (trajectory) | Campaign/Geo denoiser (aggregate patterns) |
| Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
| Skeleton positions/rotations | Time-varying coefficients for decomposition |
| Foot contact constraints | Media positivity constraint |
| Velocity loss | Multi-scale temporal loss |
## Losses (v2)
```
L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign

where aux_weight ramps from 0→1 after 25% warmup

L_campaign = MSE(agg_pred, agg_target)                  ← Stage 1 x₀-prediction
L_channel  = MSE(coeff_pred, coeff_target)              ← Stage 2 x₀-prediction (PRIMARY)
L_spectral = MSE(log|FFT(pred)|, log|FFT(target)|)      ← Frequency preservation
L_temporal = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target)  ← Multi-scale
L_sales    = MSE(pred_sales/scale, actual_sales/scale)  ← Sales reconstruction
L_contrib  = MSE(pred_contrib, true_contrib)            ← Channel contribution matching
L_sign     = ReLU(-β_media_log - 5)                     ← Soft positivity
```
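As a sanity check, the weighted combination above can be written out directly (hypothetical `total_loss` helper; `parts` holds the per-component scalar losses):

```python
def total_loss(parts, aux_weight):
    """Combine the v2 loss components with the weights from the formula
    above. `parts` maps component names to scalar losses; `aux_weight`
    is 0 during warmup and ramps to 1 afterwards."""
    return (parts['campaign'] + 2.0 * parts['channel']
            + 0.5 * parts['spectral'] + 0.1 * parts['temporal']
            + aux_weight * (0.2 * parts['sales'] + 0.2 * parts['contrib'])
            + 0.01 * parts['sign'])
```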
## Sampling
Two samplers available:
- DDPM (500 steps): Stochastic, well-calibrated temporal variation
- DDIM (50-100 steps): 5-10x faster, deterministic (eta=0)
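For reference, a single DDIM update from the model's x₀-prediction looks like this (a generic DDIM step in standard notation, not the repo's exact code; with `eta=0` it is fully deterministic):

```python
import numpy as np

def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev, eta=0.0):
    """One DDIM update from timestep t to the previous (sub)step.

    x0_pred is the denoiser's x0-prediction; alpha_bar_* are cumulative
    noise-schedule products. eta=0 gives the deterministic sampler.
    """
    # recover the implied noise from the x0-prediction
    eps = (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)
    # stochasticity scale (zero when eta=0)
    sigma = eta * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
                          * (1.0 - alpha_bar_t / alpha_bar_prev))
    # deterministic direction pointing toward x_{t-1}
    dir_xt = np.sqrt(np.maximum(1.0 - alpha_bar_prev - sigma ** 2, 0.0)) * eps
    noise = sigma * np.random.randn(*np.shape(x_t))
    return np.sqrt(alpha_bar_prev) * x0_pred + dir_xt + noise
```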
## Results (v2, GPU training, 150 epochs)
| Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
|---|---|---|
| Final training loss | 0.129 | 0.68 (includes the added v2 loss terms; not directly comparable) |
| Channel loss | n/a | 0.14 |
| Media positivity | ✓ 100% | ✓ 100% |
| Temporal variation ratio | 0.2-0.4 (too smooth) | 0.3-1.2 (calibrated) |
| MAPE (fitted base) | n/a | 7.0% |
| Model size | 2.7M | 7.2M |
Key improvement: Temporal variation ratio (pred_std / GT_std) improved from 0.2-0.4 to 0.3-1.2, meaning predicted coefficients now exhibit realistic temporal dynamics instead of being over-smoothed.
Note on coefficient correlation: Per-channel correlation with GT is near zero. This is expected: the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates plausible coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, ensemble multiple samples for uncertainty quantification.
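A sketch of that ensemble workflow (hypothetical `ensemble_stats` helper; `samples` would come from repeated `model.sample` calls on the same conditioning):

```python
import numpy as np

def ensemble_stats(samples):
    """Summarize an ensemble of sampled coefficient paths.

    samples: array (n_samples, T, n_coeffs) of diffusion draws.
    Returns the mean path plus a pointwise 90% interval, which is the
    honest way to report coefficients given the ill-posed recovery.
    """
    mean = samples.mean(axis=0)
    lo, hi = np.percentile(samples, [5, 95], axis=0)
    return mean, lo, hi
```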
## Files
- `mmm_diffusion_v2.py` – Full v2 implementation with all fixes
- `mmm_diffusion_v2.pt` – Best model checkpoint (v2, 150 epochs on GPU)
- `training_history_v2.png` – Training loss curves (all 7 loss components)
- `coeff_comparison_v2.png` – True vs. predicted time-varying coefficients
- `sales_decomposition_v2.png` – Sales decomposition with R² and MAPE
- `mmm_diffusion.py` – Original v1 implementation (kept for reference)
- `mmm_diffusion_model.pt` – v1 model checkpoint
## Usage
```python
from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
import torch

# Generate synthetic training data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)
dataset = MMMDiffusionDataset(samples, normalize=True)

# Build and load model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
                          d_model_channel=256, n_layers_campaign=4,
                          n_layers_channel=6, T_diff=500)
ckpt = torch.load('mmm_diffusion_v2.pt', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate coefficients (DDPM)
conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=500)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] guaranteed positive (media channels)

# Or faster with DDIM (50 steps)
coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
```
## References
- GMD (arXiv:2305.12577) – Two-stage trajectory + body diffusion
- MDM (arXiv:2209.14916) – Transformer denoiser, x₀-prediction
- PhysDiff (arXiv:2212.02500) – Physics-based constraint projection
- PDM (arXiv:2402.03559) – Projected diffusion for hard constraints
- NNN (arXiv:2504.06212) – Neural network MMM (Google)
- Improved DDPM (arXiv:2102.09672) – Cosine noise schedule
- DDIM (arXiv:2010.02502) – Deterministic sampling
## License
MIT