# MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion
A generative diffusion model for Marketing Mix Modeling (MMM) that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.
## v2 Fixes (from v1)

### Problem 1: Sales Alignment (predicted sales didn't match total sales)

**Root cause:** v1 had `loss_sales = 0.0` → no gradient signal for sales reconstruction.

**Fix:** Added a differentiable sales reconstruction loss (`L_sales`) that flows through the coefficient → contribution → total sales path. Uses a warmup schedule (the first 25% of epochs focus on core coefficient denoising, then the sales loss ramps in).
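A minimal sketch of this fix, assuming illustrative helper names (`aux_weight`, `sales_from_coeffs`) rather than the repo's actual API:

```python
import numpy as np

def aux_weight(epoch, total_epochs, warmup_frac=0.25):
    """Ramp the auxiliary-loss weight from 0 to 1 after the warmup phase.

    Hypothetical helper illustrating the v2 schedule: the first 25% of
    epochs train only the core coefficient denoising, then the weight
    ramps linearly to 1.
    """
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return 0.0
    return min(1.0, (epoch - warmup) / max(1.0, total_epochs - warmup))

def sales_from_coeffs(beta_media, media_transformed, beta_ctrl, controls, base):
    """Differentiable reconstruction: coefficients -> contributions -> sales.

    beta_media:        (T, n_media) time-varying media coefficients
    media_transformed: (T, n_media) Hill(Adstock(spend)) features
    beta_ctrl/controls:(T, n_ctrl)  control coefficients and variables
    """
    contrib_media = (beta_media * media_transformed).sum(axis=-1)  # (T,)
    contrib_ctrl = (beta_ctrl * controls).sum(axis=-1)             # (T,)
    return base + contrib_media + contrib_ctrl
```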
### Problem 2: Coefficients Too Smooth (compared to GT)

**Root cause:** The smoothness loss weight (0.1) was too high relative to the reconstruction loss, and GT coefficient volatility was too low (OU process σ=0.05).

**Fixes:**
- **Spectral loss (`L_spectral`):** Log-magnitude FFT loss that penalizes frequency-spectrum differences, with higher weights on high frequencies to fight smoothing
- **Multi-scale temporal loss:** Matches 1st AND 2nd order temporal derivatives (velocity + acceleration)
- **Higher GT volatility:** Increased OU volatility (0.05→0.12 for media, 0.03→0.08 for controls) + regime-change jumps
- **Contribution matching loss:** Directly matches predicted channel-level contributions to GT
- **Reduced smoothness weight:** 0.1 → 0.05
- **Loss warmup:** Core denoising trained first, auxiliary losses ramped in after 25% of training
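The two loss additions most responsible for restoring volatility can be sketched in numpy (the weighting scheme here is illustrative; the real implementation operates on torch tensors):

```python
import numpy as np

def spectral_loss(pred, target, hf_weight=2.0):
    """Log-magnitude FFT loss with extra weight on high-frequency bins.

    Minimal 1-D sketch of L_spectral; `hf_weight` is an assumed
    parameter controlling how much high frequencies count.
    """
    eps = 1e-8
    P = np.log(np.abs(np.fft.rfft(pred)) + eps)
    Q = np.log(np.abs(np.fft.rfft(target)) + eps)
    # weights increase linearly toward the highest frequency bin
    w = 1.0 + (hf_weight - 1.0) * np.arange(len(P)) / max(1, len(P) - 1)
    return float(np.mean(w * (P - Q) ** 2))

def temporal_loss(pred, target):
    """Multi-scale temporal loss: match 1st-order (velocity) and
    2nd-order (acceleration) differences along the time axis."""
    d1p, d1t = np.diff(pred), np.diff(target)
    d2p, d2t = np.diff(pred, 2), np.diff(target, 2)
    return float(np.mean((d1p - d1t) ** 2) + 0.5 * np.mean((d2p - d2t) ** 2))
```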
## Architecture
```
┌──────────────────┐    ┌────────────────────────────────────────────┐
│  CONDITIONING    │    │ STAGE 1: Campaign/Geo Denoiser             │
│                  │    │ (≈ Kimodo Root Denoiser)                   │
│ • Media Spend    │───▶│ Transformer (4 layers, d=192)              │
│   (5 channels)   │    │ Denoises aggregate patterns                │
│ • Controls       │    │ from controls + total sales                │
│   (3 variables)  │    └──────────────┬─────────────────────────────┘
│ • Total Sales    │                   │ Campaign Context
└──────────────────┘                   ▼
          ┌──────────────────────────────────────────────────┐
          │ STAGE 2: Channel Denoiser                        │
          │ (≈ Kimodo Body Denoiser)                         │
          │ Cross-Attention + Transformer (6 layers, d=256)  │
          │                                                  │
          │ CONSTRAINTS:                                     │
          │ • Log-space for media (exp → always ≥ 0)         │
          │ • PhysDiff-style projection every K steps        │
          │ • Soft sign penalty loss                         │
          └──────────────────────┬───────────────────────────┘
                                 ▼
          ┌──────────────────────────────────────────────────┐
          │ OUTPUT: Time-Varying Coefficients (T, 8)         │
          │ β_TV, β_Digital, β_Social, β_Print, β_Radio      │
          │ β_Seasonality, β_Trend, β_CompetitorPrice        │
          │ → Sales = base + Σ β_m·Hill(Adstock(x))          │
          │           + Σ β_c·ctrl + noise                   │
          └──────────────────────────────────────────────────┘
```
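The decomposition in the output box uses standard MMM media transforms. A minimal sketch with assumed parameters (`decay`, `half_sat`, `slope` are illustrative, not the repo's fitted values):

```python
import numpy as np

def adstock(x, decay=0.6):
    """Geometric adstock: spend in week t carries over into later weeks
    with exponentially decaying effect (assumed decay rate)."""
    out = np.zeros_like(x, dtype=float)
    carry = 0.0
    for t in range(len(x)):
        carry = x[t] + decay * carry
        out[t] = carry
    return out

def hill(x, half_sat=1.0, slope=2.0):
    """Hill saturation: diminishing returns on adstocked spend,
    bounded in [0, 1); hill(half_sat) = 0.5."""
    xs = np.power(x, slope)
    return xs / (xs + half_sat ** slope)

# Media contribution for one channel: beta_t * Hill(Adstock(spend_t))
spend = np.array([1.0, 0.0, 0.0, 2.0])
feature = hill(adstock(spend, decay=0.5), half_sat=1.0, slope=2.0)
```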
## Kimodo → MMM Mapping
| Kimodo (Motion Generation) | MMM-Diffusion (Marketing) |
|---|---|
| Text prompts | Media spend, non-marketing vars, total sales |
| Motion/position constraints | Sign constraints (β_media ≥ 0) + prior bounds |
| Root denoiser (trajectory) | Campaign/Geo denoiser (aggregate patterns) |
| Body denoiser (joint angles) | Channel denoiser (per-channel coefficients) |
| Skeleton positions/rotations | Time-varying coefficients for decomposition |
| Foot contact constraints | Media positivity constraint |
| Velocity loss | Multi-scale temporal loss |
## Losses (v2)
```
L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign

where aux_weight ramps from 0→1 after 25% warmup

L_campaign = MSE(agg_pred, agg_target)                  ← Stage 1 x₀-prediction
L_channel  = MSE(coeff_pred, coeff_target)              ← Stage 2 x₀-prediction (PRIMARY)
L_spectral = MSE(log|FFT(pred)|, log|FFT(target)|)      ← Frequency preservation
L_temporal = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target)  ← Multi-scale
L_sales    = MSE(pred_sales/scale, actual_sales/scale)  ← Sales reconstruction
L_contrib  = MSE(pred_contrib, true_contrib)            ← Channel contribution matching
L_sign     = ReLU(-β_media_log - 5)                     ← Soft positivity
```
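As a sanity check, the weighted combination above can be written out directly (hypothetical `total_loss` helper; `parts` holds the per-component scalar losses):

```python
def total_loss(parts, aux_weight):
    """Combine the v2 loss components with the weights from the formula
    above. `parts` maps component names to scalar losses; `aux_weight`
    is 0 during warmup and ramps to 1 afterwards."""
    return (parts['campaign'] + 2.0 * parts['channel']
            + 0.5 * parts['spectral'] + 0.1 * parts['temporal']
            + aux_weight * (0.2 * parts['sales'] + 0.2 * parts['contrib'])
            + 0.01 * parts['sign'])
```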
## Sampling
Two samplers available:
- DDPM (500 steps): Stochastic, well-calibrated temporal variation
- DDIM (50-100 steps): 5-10x faster, deterministic (eta=0)
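For reference, a single DDIM update from the model's x₀-prediction looks like this (a generic DDIM step in standard notation, not the repo's exact code; with `eta=0` it is fully deterministic):

```python
import numpy as np

def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev, eta=0.0):
    """One DDIM update from timestep t to the previous (sub)step.

    x0_pred is the denoiser's x0-prediction; alpha_bar_* are cumulative
    noise-schedule products. eta=0 gives the deterministic sampler.
    """
    # recover the implied noise from the x0-prediction
    eps = (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)
    # stochasticity scale (zero when eta=0)
    sigma = eta * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
                          * (1.0 - alpha_bar_t / alpha_bar_prev))
    # deterministic direction pointing toward x_{t-1}
    dir_xt = np.sqrt(np.maximum(1.0 - alpha_bar_prev - sigma ** 2, 0.0)) * eps
    noise = sigma * np.random.randn(*np.shape(x_t))
    return np.sqrt(alpha_bar_prev) * x0_pred + dir_xt + noise
```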
## Results (v2, GPU training, 150 epochs)
| Metric | v1 (CPU, 30 epochs) | v2 (GPU, 150 epochs) |
|---|---|---|
| Final training loss | 0.129 | 0.68 (includes the added v2 loss terms; not directly comparable) |
| Channel loss | n/a | 0.14 |
| Media positivity | ✓ 100% | ✓ 100% |
| Temporal variation ratio | 0.2-0.4 (too smooth) | 0.3-1.2 (calibrated) |
| MAPE (fitted base) | n/a | 7.0% |
| Model size | 2.7M | 7.2M |
Key improvement: Temporal variation ratio (pred_std / GT_std) improved from 0.2-0.4 to 0.3-1.2, meaning predicted coefficients now exhibit realistic temporal dynamics instead of being over-smoothed.
Note on coefficient correlation: Per-channel correlation with GT is near zero. This is expected: the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates plausible coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, ensemble multiple samples for uncertainty quantification.
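A sketch of that ensemble workflow (hypothetical `ensemble_stats` helper; `samples` would come from repeated `model.sample` calls on the same conditioning):

```python
import numpy as np

def ensemble_stats(samples):
    """Summarize an ensemble of sampled coefficient paths.

    samples: array (n_samples, T, n_coeffs) of diffusion draws.
    Returns the mean path plus a pointwise 90% interval, which is the
    honest way to report coefficients given the ill-posed recovery.
    """
    mean = samples.mean(axis=0)
    lo, hi = np.percentile(samples, [5, 95], axis=0)
    return mean, lo, hi
```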
## Files
- `mmm_diffusion_v2.py` – Full v2 implementation with all fixes
- `mmm_diffusion_v2.pt` – Best model checkpoint (v2, 150 epochs on GPU)
- `training_history_v2.png` – Training loss curves (all 7 loss components)
- `coeff_comparison_v2.png` – True vs. predicted time-varying coefficients
- `sales_decomposition_v2.png` – Sales decomposition with R² and MAPE
- `mmm_diffusion.py` – Original v1 implementation (kept for reference)
- `mmm_diffusion_model.pt` – v1 model checkpoint
## Usage
```python
from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
import torch

# Generate synthetic training data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)
dataset = MMMDiffusionDataset(samples, normalize=True)

# Build and load model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
                          d_model_channel=256, n_layers_campaign=4,
                          n_layers_channel=6, T_diff=500)
ckpt = torch.load('mmm_diffusion_v2.pt', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate coefficients (DDPM)
conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=500)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] guaranteed positive (media channels)

# Or faster with DDIM (50 steps)
coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)
```
## References
- GMD (arXiv:2305.12577) – Two-stage trajectory + body diffusion
- MDM (arXiv:2209.14916) – Transformer denoiser, x₀-prediction
- PhysDiff (arXiv:2212.02500) – Physics-based constraint projection
- PDM (arXiv:2402.03559) – Projected diffusion for hard constraints
- NNN (arXiv:2504.06212) – Neural network MMM (Google)
- Improved DDPM (arXiv:2102.09672) – Cosine noise schedule
- DDIM (arXiv:2010.02502) – Deterministic sampling
## License
MIT