arxiv:2605.14438

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Published on May 14

Submitted by Jialiang Cheng on May 15

alibaba-inc

Abstract

BEAM enables dynamic expert selection in Mixture-of-Experts models through trainable binary masks, achieving significant computational savings while maintaining high performance.

AI-generated summary

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from a severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
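The abstract names three ingredients: a binary mask over experts, a straight-through estimator (STE), and an auxiliary regularization loss. The pure-Python sketch below shows how they could fit together for a single token. It is not the paper's implementation. The abstract does not say where the mask is applied, how it is thresholded, or what form the regularizer takes, so everything here is an assumption made for illustration: the function names, the sigmoid-plus-0.5 threshold, applying the mask on top of the Top-K candidates, the keep-at-least-one-expert fallback, and the mean-probability penalty.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_k_indices(scores, k):
    # Indices of the k largest router scores, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def masked_route(router_scores, mask_logits, k):
    """Forward pass: pick Top-K candidates, then apply a hard 0/1 mask to each."""
    chosen = top_k_indices(router_scores, k)
    probs = {i: sigmoid(mask_logits[i]) for i in chosen}
    hard = {i: 1.0 if probs[i] > 0.5 else 0.0 for i in chosen}
    # Assumption: always keep the best-scoring expert so no token is left unprocessed.
    if not any(hard.values()):
        hard[chosen[0]] = 1.0
    return chosen, probs, hard

def ste_mask_grads(upstream, probs):
    """Backward pass with a straight-through estimator.

    The hard threshold has zero gradient almost everywhere, so the STE treats it
    as the identity and backpropagates through the sigmoid: p * (1 - p).
    """
    return {i: upstream[i] * probs[i] * (1.0 - probs[i]) for i in probs}

def sparsity_penalty(probs, lam):
    # Hypothetical auxiliary loss: lam times the mean soft mask probability,
    # which pushes the model toward activating fewer experts.
    return lam * sum(probs.values()) / len(probs)

# Toy token with 6 experts and Top-3 routing (all numbers are made up).
router_scores = [0.1, 2.0, 0.3, 1.5, -0.4, 0.9]
mask_logits = [0.0, 3.0, 0.0, -2.0, 0.0, 0.5]

chosen, probs, hard = masked_route(router_scores, mask_logits, k=3)
active = [i for i in chosen if hard[i] == 1.0]
saved = 1.0 - len(active) / len(chosen)  # fraction of expert calls skipped for this token
grads = ste_mask_grads({i: 1.0 for i in chosen}, probs)
```

In this toy case the router proposes experts 1, 3, and 5, the mask drops expert 3, and one third of the expert computation for the token is skipped. The dropped expert still receives a nonzero gradient through the STE, which is what lets the mask be trained end to end despite the hard threshold.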

Community

Julius-L

Paper submitter 3 days ago

Interesting work on practical MoE inference acceleration. BEAM introduces trainable binary expert masks for token-adaptive routing, which helps avoid the redundant computation of fixed Top-K routing while staying more deployment-friendly than methods that require major architectural changes or expensive retraining. The results are also amazing: over 98% performance retention with up to 85% MoE FLOPs reduction, 2.5× faster decoding, and 1.4× higher throughput. Overall, this is a simple and practical plug-and-play solution for making MoE models more efficient in real-world inference.
