MMM-Diffusion: Marketing Mix Modeling via Dual-Denoiser Diffusion

A generative diffusion model for Marketing Mix Modeling (MMM) that predicts time-varying coefficients for sales decomposition. Adapted from NVIDIA's Kimodo/GMD dual-denoiser architecture.

v2 Fixes (from v1)

Problem 1: Sales Alignment (predicted sales didn't match total sales)

Root cause: v1 had loss_sales = 0.0, so there was no gradient signal for sales reconstruction. Fix: added a differentiable sales reconstruction loss (L_sales) that flows through the coefficient → contribution → total sales path, with a warmup schedule (the first 25% of epochs focus on core coefficient denoising, then the sales loss ramps in).
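The fix can be sketched as follows. This is a minimal numpy illustration of the coefficient → contribution → total-sales path; array shapes and names are illustrative, and the real loss runs on torch tensors so the gradient actually flows:

```python
import numpy as np

def sales_reconstruction_loss(coeffs, media_t, controls, base, actual_sales, scale):
    """L_sales: reconstruct total sales from predicted coefficients.

    coeffs:   (T, 8) time-varying coefficients (5 media + 3 control)
    media_t:  (T, 5) media spend after Adstock/Hill transforms
    controls: (T, 3) control variables

    In the real model this runs on torch tensors, so the MSE gradient
    flows back through coefficient -> contribution -> total sales.
    """
    media_contrib = coeffs[:, :5] * media_t        # per-channel contributions
    ctrl_contrib = coeffs[:, 5:] * controls
    pred_sales = base + media_contrib.sum(axis=1) + ctrl_contrib.sum(axis=1)
    return float(np.mean(((pred_sales - actual_sales) / scale) ** 2))
```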

Problem 2: Coefficients Too Smooth (compared to GT)

Root cause: the smoothness loss weight (0.1) was too high relative to the reconstruction loss, and GT coefficient volatility was too low (OU process σ=0.05). Fixes:

  1. Spectral loss (L_spectral): Log-magnitude FFT loss that penalizes frequency-spectrum differences, with higher weights on high frequencies to fight over-smoothing
  2. Multi-scale temporal loss: Matches 1st AND 2nd order temporal derivatives (velocity + acceleration)
  3. Higher GT volatility: Increased OU volatility (0.05 → 0.12 for media, 0.03 → 0.08 for controls) + regime-change jumps
  4. Contribution matching loss: Directly matches predicted channel-level contributions to GT
  5. Reduced smoothness weight: 0.1 → 0.05
  6. Loss warmup: Core denoising trained first, auxiliary losses ramped in after 25% of training
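Fixes 1 and 2 can be sketched in numpy. The frequency-weighting scheme below is illustrative; the actual weights live in mmm_diffusion_v2.py:

```python
import numpy as np

def spectral_loss(pred, target):
    """Log-magnitude FFT loss on (T, C) trajectories.

    Higher-frequency bins get larger weights (linear ramp here,
    an illustrative choice) so the model is penalized more for
    smoothing away fast temporal variation.
    """
    log_p = np.log(np.abs(np.fft.rfft(pred, axis=0)) + 1e-8)
    log_t = np.log(np.abs(np.fft.rfft(target, axis=0)) + 1e-8)
    weights = np.linspace(1.0, 2.0, log_p.shape[0])[:, None]  # up-weight high freqs
    return float(np.mean(weights * (log_p - log_t) ** 2))

def temporal_loss(pred, target):
    """Multi-scale temporal loss: match 1st (velocity) and 2nd
    (acceleration) order temporal differences."""
    d1 = np.diff(pred, 1, axis=0) - np.diff(target, 1, axis=0)
    d2 = np.diff(pred, 2, axis=0) - np.diff(target, 2, axis=0)
    return float(np.mean(d1 ** 2) + 0.5 * np.mean(d2 ** 2))
```

Because temporal_loss only sees differences, it is invariant to a constant offset; the spectral and reconstruction losses pin down the level.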

Architecture

┌───────────────────┐   ┌──────────────────────────────────────────┐
│  CONDITIONING     │   │  STAGE 1: Campaign/Geo Denoiser          │
│                   │   │  (≈ Kimodo Root Denoiser)                │
│  • Media Spend    │──▶│  Transformer (4 layers, d=192)           │
│    (5 channels)   │   │  Denoises aggregate patterns             │
│  • Controls       │   │  from controls + total sales             │
│    (3 variables)  │   └────────────────┬─────────────────────────┘
│  • Total Sales    │                    │ Campaign Context
└───────────────────┘                    ▼
                    ┌─────────────────────────────────────────────────┐
                    │  STAGE 2: Channel Denoiser                      │
                    │  (≈ Kimodo Body Denoiser)                       │
                    │  Cross-Attention + Transformer (6 layers, d=256)│
                    │                                                 │
                    │  CONSTRAINTS:                                   │
                    │  • Log-space for media (exp → always ≥ 0)       │
                    │  • PhysDiff-style projection every K steps      │
                    │  • Soft sign penalty loss                       │
                    └───────────────┬─────────────────────────────────┘
                                    ▼
                    ┌─────────────────────────────────────────────────┐
                    │  OUTPUT: Time-Varying Coefficients (T, 8)       │
                    │  β_TV, β_Digital, β_Social, β_Print, β_Radio    │
                    │  β_Seasonality, β_Trend, β_CompetitorPrice      │
                    │  → Sales = base + Σ β_m·Hill(Adstock(x))        │
                    │          + Σ β_c·ctrl + noise                   │
                    └─────────────────────────────────────────────────┘
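The decomposition in the output block uses the standard MMM media transforms. A minimal sketch (decay and Hill parameters are illustrative defaults, not the trained values):

```python
import numpy as np

def adstock(x, decay=0.6):
    """Geometric adstock: media effect carries over to later weeks."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    carry = 0.0
    for t in range(len(x)):
        carry = x[t] + decay * carry
        out[t] = carry
    return out

def hill(x, half_sat=1.0, shape=2.0):
    """Hill saturation: diminishing returns, output in [0, 1)."""
    return x ** shape / (half_sat ** shape + x ** shape)

# Sales = base + sum_m beta_m * hill(adstock(x_m)) + sum_c beta_c * ctrl_c + noise
```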

Kimodo → MMM Mapping

Kimodo (Motion Generation)     →  MMM-Diffusion (Marketing)
Text prompts                   →  Media spend, non-marketing vars, total sales
Motion/position constraints    →  Sign constraints (β_media ≥ 0) + prior bounds
Root denoiser (trajectory)     →  Campaign/Geo denoiser (aggregate patterns)
Body denoiser (joint angles)   →  Channel denoiser (per-channel coefficients)
Skeleton positions/rotations   →  Time-varying coefficients for decomposition
Foot contact constraints       →  Media positivity constraint
Velocity loss                  →  Multi-scale temporal loss

Losses (v2)

L_total = L_campaign + 2·L_channel + 0.5·L_spectral + 0.1·L_temporal
        + aux_weight · (0.2·L_sales + 0.2·L_contrib) + 0.01·L_sign

where aux_weight ramps from 0 → 1 after the 25% warmup

L_campaign  = MSE(agg_pred, agg_target)              - Stage 1 x₀-prediction
L_channel   = MSE(coeff_pred, coeff_target)          - Stage 2 x₀-prediction (PRIMARY)
L_spectral  = MSE(log|FFT(pred)|, log|FFT(target)|)  - Frequency preservation
L_temporal  = MSE(Δ¹pred, Δ¹target) + 0.5·MSE(Δ²pred, Δ²target)  - Multi-scale
L_sales     = MSE(pred_sales/scale, actual_sales/scale)  - Sales reconstruction
L_contrib   = MSE(pred_contrib, true_contrib)        - Channel contribution matching
L_sign      = ReLU(-β_media_log - 5)                 - Soft positivity
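Assembled with the weights above. The linear ramp shape for aux_weight is an assumption; the source only states that it goes 0 → 1 after the 25% warmup:

```python
def aux_weight(epoch, total_epochs, warmup_frac=0.25):
    """Auxiliary-loss weight: 0 during warmup, then a linear ramp to 1
    (ramp shape assumed; only the 0 -> 1 behavior is documented)."""
    frac = epoch / total_epochs
    if frac < warmup_frac:
        return 0.0
    return min(1.0, (frac - warmup_frac) / (1.0 - warmup_frac))

def total_loss(L, epoch, total_epochs):
    """Combine the v2 loss components; L is a dict with keys
    campaign, channel, spectral, temporal, sales, contrib, sign."""
    w = aux_weight(epoch, total_epochs)
    return (L["campaign"] + 2.0 * L["channel"]
            + 0.5 * L["spectral"] + 0.1 * L["temporal"]
            + w * (0.2 * L["sales"] + 0.2 * L["contrib"])
            + 0.01 * L["sign"])
```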

Sampling

Two samplers available:

  • DDPM (500 steps): Stochastic, well-calibrated temporal variation
  • DDIM (50-100 steps): 5-10x faster, deterministic (eta=0)

Results (v2, GPU training, 150 epochs)

Metric                      v1 (CPU, 30 epochs)    v2 (GPU, 150 epochs)
Final training loss         0.129                  0.68
Channel loss                -                      0.14
Media positivity            ✅ 100%                ✅ 100%
Temporal variation ratio    0.2-0.4 (too smooth)   0.3-1.2 (calibrated)
MAPE (fitted base)          -                      7.0%
Model size                  2.7M params            7.2M params

Key improvement: Temporal variation ratio (pred_std / GT_std) improved from 0.2-0.4 to 0.3-1.2, meaning predicted coefficients now exhibit realistic temporal dynamics instead of being over-smoothed.

Note on coefficient correlation: Per-channel correlation with GT is near zero. This is expected: the MMM coefficient recovery problem is fundamentally ill-posed (many coefficient combinations produce similar sales). The diffusion model generates plausible coefficient trajectories conditioned on the input data, not deterministic point estimates. For practical use, ensemble multiple samples for uncertainty quantification.
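The ensembling recommendation can be sketched as follows (sample_fn stands in for a call like model.sample; names are illustrative):

```python
import numpy as np

def ensemble_coefficients(sample_fn, conditioning, n_samples=20):
    """Draw several coefficient trajectories and summarize them.

    Returns the per-timestep mean and a 90% interval across draws:
    a point estimate plus an uncertainty band, rather than one sample.
    """
    draws = np.stack([sample_fn(conditioning) for _ in range(n_samples)])  # (S, T, 8)
    mean = draws.mean(axis=0)
    lo, hi = np.quantile(draws, [0.05, 0.95], axis=0)
    return mean, lo, hi
```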

Files

  • mmm_diffusion_v2.py β€” Full v2 implementation with all fixes
  • mmm_diffusion_v2.pt β€” Best model checkpoint (v2, 150 epochs on GPU)
  • training_history_v2.png β€” Training loss curves (all 7 loss components)
  • coeff_comparison_v2.png β€” True vs predicted time-varying coefficients
  • sales_decomposition_v2.png β€” Sales decomposition with RΒ² and MAPE
  • mmm_diffusion.py β€” Original v1 implementation (kept for reference)
  • mmm_diffusion_model.pt β€” v1 model checkpoint

Usage

from mmm_diffusion_v2 import MMMDiffusionModel, MMMDataGenerator, MMMDiffusionDataset
import torch

# Generate data
gen = MMMDataGenerator(n_weeks=104, seed=42)
samples = gen.generate_dataset(100)
dataset = MMMDiffusionDataset(samples, normalize=True)

# Build and load model
model = MMMDiffusionModel(n_media=5, n_ctrl=3, d_model_campaign=192,
                          d_model_channel=256, n_layers_campaign=4,
                          n_layers_channel=6, T_diff=500)
ckpt = torch.load('mmm_diffusion_v2.pt', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate coefficients (DDPM)
conditioning = ...  # (1, T, 9) [media_spend, controls, total_sales]
coefficients = model.sample(conditioning, n_steps=500)
decoded = dataset.decode_coefficients(coefficients)
# decoded[:, :, :5] guaranteed positive (media channels)

# Or faster with DDIM (50 steps)
coefficients = model.sample_ddim(conditioning, n_steps=50, eta=0.0)

References

  • GMD (arxiv:2305.12577) β€” Two-stage trajectory + body diffusion
  • MDM (arxiv:2209.14916) β€” Transformer denoiser, xβ‚€-prediction
  • PhysDiff (arxiv:2212.02500) β€” Physics-based constraint projection
  • PDM (arxiv:2402.03559) β€” Projected diffusion for hard constraints
  • NNN (arxiv:2504.06212) β€” Neural network MMM (Google)
  • Improved DDPM (arxiv:2102.09672) β€” Cosine noise schedule
  • DDIM (arxiv:2010.02502) β€” Deterministic sampling

License

MIT
