The Hidden Tax of Sequential Drafting
Speculative decoding is one of the cleverest tricks in modern LLM inference. A small, fast draft model races ahead to propose several tokens; the large target model verifies them all in one parallel forward pass. When draft quality is high, you get multiple tokens for the cost of a single verification step — a genuine free lunch.
But there is a subtle, rarely-examined inefficiency at the heart of every standard speculative decoding system: drafting and verification happen one after the other. The draft model generates its proposals. It finishes. Only then does the target model begin to verify. Those drafting milliseconds sit directly on the critical path — not hidden, not amortized, not free. Every time the draft model runs is a period during which the massive target model sits completely idle.
The severity of this problem compounds in production batch-serving settings. With tens of concurrent requests, the draft model must generate proposals for every active request at each step. The target model's entire forward pass waits. And because performance is acutely sensitive to acceptance rates, any degradation in draft quality can flip the system from faster to slower than if you'd never used speculative decoding at all.
Prior work addressed this by improving draft quality: better draft models (EAGLE, EAGLE-2), smarter token selection (TETRIS), tree-based speculation. MineDraft takes a complementary angle: what if we could draft and verify simultaneously, and spend the saved GPU time letting the drafter generate even better proposals?
The name is a deliberate nod to the Minecraft game engine, which loads the next world chunk into memory while the player is still navigating the current one — so the world appears seamless. MineDraft applies exactly this background-loading principle to LLM inference, ensuring the draft model is always one step ahead and its compute time never appears on the critical path.
Batch Parallelism: Always One Step Ahead
The core mechanism is elegant. A standard system processes m requests per step. MineDraft maintains 2m requests split into two batches, Batch 0 and Batch 1. At any given moment, the target model is verifying Batch 0 (the current target batch) while the draft model is speculating ahead for Batch 1 (the current draft batch). At the next step they swap roles: Batch 0 becomes the draft batch, Batch 1 the target batch, and so on.
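The alternation above can be sketched in a few lines. This is a toy simulation, not MineDraft's implementation: `draft` and `verify` are stand-ins that just record which batch they touched, and the thread pool plays the role of running the draft and target models on separate devices.

```python
from concurrent.futures import ThreadPoolExecutor

log = []  # records which model touched which batch, for illustration

def draft(batch):
    # Stand-in for the small draft model proposing k tokens per request.
    log.append(("draft", tuple(batch)))
    return batch

def verify(batch):
    # Stand-in for the large target model's single verification pass.
    log.append(("verify", tuple(batch)))
    return batch

def sd_step(target_batch, draft_batch):
    """One MineDraft step: verify one batch while drafting for the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        v = pool.submit(verify, target_batch)
        d = pool.submit(draft, draft_batch)
        v.result()
        d.result()  # sync point: both models finish before roles swap
    return draft_batch, target_batch  # old draft batch becomes new target batch

batch0, batch1 = ["req0", "req1"], ["req2", "req3"]
for _ in range(2):
    batch0, batch1 = sd_step(batch0, batch1)
```

After two steps, the log shows each batch being verified exactly once and drafted exactly once, with the roles swapped between steps, which is the whole point of the scheme.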
Four Interacting Components
MineDraft decomposes into four precisely-specified subsystems that work together to realize Batch Parallelism without architectural surgery to the underlying LLM stack:
**Batch Manager.** Assigns each running request a batch ID (0 for Batch 0, 1 for Batch 1) using a balance counter that tracks the size difference between the two batches, and recycles batch IDs on termination (finish, abort, or preemption).

**Scheduler.** Manages the full request lifecycle (Waiting → Running → Finished) and patches vLLM's KV-block allocator so the draft batch doesn't wastefully pre-allocate GPU memory for tokens that aren't being verified yet.

**Drafter.** Runs the small draft model on a dedicated separate GPU, generating k speculative tokens for the draft batch while the Verifier processes the target batch.

**Verifier.** Runs the large target model on the primary GPU cluster, evaluates the Drafter's proposed tokens in one forward pass, and returns the accepted tokens plus the target sampler output to the Drafter.
Each SD step ends at a sync point — the moment the Drafter returns output to the Scheduler. At this point, the batches alternate roles: the old target batch becomes the new draft batch, and vice versa. This alternation requires no centralized orchestration beyond the Batch Manager's lightweight bookkeeping.
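The Batch Manager's bookkeeping reduces to a single signed counter. Here is a minimal sketch of that idea; the class and method names are my own illustration, not MineDraft's actual API:

```python
class BatchManager:
    """Toy balance-counter bookkeeping: new requests go to whichever
    batch is currently smaller, and terminating requests return their
    batch ID so the counter stays accurate."""

    def __init__(self):
        self.balance = 0  # len(Batch 0) - len(Batch 1)

    def assign(self):
        # Prefer the smaller batch; ties go to Batch 0.
        batch_id = 0 if self.balance <= 0 else 1
        self.balance += 1 if batch_id == 0 else -1
        return batch_id

    def recycle(self, batch_id):
        # Called on termination: finish, abort, or preemption.
        self.balance += -1 if batch_id == 0 else 1

mgr = BatchManager()
ids = [mgr.assign() for _ in range(4)]  # alternates to keep batches even
```

Starting from empty batches, four assignments split evenly between the two batches, and recycling a Batch 0 slot makes Batch 0 the next assignment target again.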
Theoretical Efficiency Guarantee
MineDraft doesn't just work empirically; it is provably faster. Under mild assumptions about the draft model's acceptance function f(t) = 1 − e^(−αt) (the standard exponential model used throughout the speculative decoding literature), the authors prove that Batch Parallelism yields a guaranteed speedup over sequential SD.
The product αV captures both draft quality (α, the acceptance sharpness parameter) and verification cost (V). When either the verifier is slower or the draft model is more accurate, the advantage of parallelism grows. This explains why MineDraft pairs especially powerfully with strong drafting strategies: better acceptance rates don't just improve token throughput — they amplify the parallel speedup.
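To make the acceptance model's shape concrete, here is a tiny numerical sketch. The α values and drafting budgets below are illustrative, chosen by me, and not taken from the paper; the only thing this shows is that f(t) = 1 − e^(−αt) saturates toward 1 faster as α grows, which is the sense in which a sharper drafter amplifies the speedup.

```python
import math

def acceptance(t, alpha):
    """Exponential acceptance model: f(t) = 1 - exp(-alpha * t)."""
    return 1.0 - math.exp(-alpha * t)

# Illustrative sweep: larger alpha means acceptance saturates sooner,
# so more drafting effort pays off before hitting diminishing returns.
for alpha in (0.5, 1.0, 1.5):
    row = [round(acceptance(t, alpha), 3) for t in (1, 2, 4)]
    print(f"alpha={alpha}: f(1), f(2), f(4) = {row}")
```

Note that f is concave and bounded by 1, so past a certain drafting budget t the marginal gain vanishes; the parallel design matters precisely because it makes that drafting budget effectively free.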
Consistent Gains Across Every Setting
MineDraft was evaluated across seven target–draft model pairings, four benchmark datasets (Arena, ShareGPT, Spec-Bench, LLM-Tough-Questions), and extensive ablations. The conclusion is unambiguous: MineDraft outperforms standard SD and all baselines in every configuration without exception.
Throughput
| Target Model | Draft Model | Avg Gain vs Best BL ↑ | Max Gain vs SD ↑ |
|---|---|---|---|
| Qwen3-32B | Qwen3-0.6B | 40–47% | 70.3% |
| Qwen3-32B | Qwen3-1.7B | 38–48% | 75.6% |
| Qwen3-32B | Qwen3-4B | 49–65% | 75.1% |
| Qwen3-32B | Qwen3-8B | — | OOM for baseline |
| Llama-3.3-70B AWQ-INT4 | Llama-3.1-8B | 21–30% | 37.0% |
| vicuna-33b | EAGLE-Vicuna-33B | 1–7.5% | 22.0% |
| vicuna-13b | EAGLE-Vicuna-13B | 2–6.9% | 21.0% |
End-to-End Latency
| Target Model | Draft Model | Avg Gain vs Best BL ↑ | Max Gain vs SD ↑ |
|---|---|---|---|
| Qwen3-32B | Qwen3-0.6B | 22–26% | 37.8% |
| Qwen3-32B | Qwen3-1.7B | 18–28% | 39.5% |
| Qwen3-32B | Qwen3-4B | 25–31% | 38.9% |
| Qwen3-32B | Qwen3-8B | — | OOM for baseline |
| Llama-3.3-70B AWQ-INT4 | Llama-3.1-8B | 7–15% | 20.6% |
| vicuna-33b | EAGLE-Vicuna-33B | 0.7–6% | 15.5% |
| vicuna-13b | EAGLE-Vicuna-13B | 2–5.9% | 16.3% |
An important secondary finding: by placing the draft model on a dedicated GPU, MineDraft eliminates the VRAM contention that afflicts standard SD with large draft models. In Setting 4 (Qwen3-32B with Qwen3-8B as draft), standard SD fails entirely with an out-of-memory error. MineDraft continues operating at 332–610 tokens/sec.
MineDraft also composes cleanly with existing drafting strategies. When integrated with TETRIS, it outperforms standalone TETRIS; paired with EAGLE, it outperforms standalone EAGLE. This orthogonality is load-bearing: any improvement to draft acceptance rates directly amplifies the parallel speedup.
One Thing to Remember
For ML engineers deploying LLM inference at scale: if you are already running speculative decoding and can spare one GPU, MineDraft is a plug-in upgrade. It composes with whatever drafting strategy you already use, and the theoretical guarantee means you benefit even when draft quality is mediocre. The larger and slower your verifier, the more drafting latency gets hidden — which means MineDraft's advantage is strongest precisely where you need it most.
Known Trade-offs
There are two structural limitations of the Batch Parallelism design:
When a request is terminated via user abort or preemption, new replacements are assigned to the draft batch to prevent verifying requests without available drafts. This can permanently skew batch sizes, degrading the load-balancing that Batch Parallelism relies on. Chunked prefill further exacerbates this. A future version of MineDraft will apply the full balance-tracking logic to all subsequent steps.
When one batch empties (all of its requests finish) and the other has no ready draft tokens, the system falls back to standard sequential SD, capping the achievable speedup in practice. A proposed mitigation, drawn from PEARL, is to propose draft tokens for the remaining batch while it is being verified, re-drafting any requests that fail verification, so that even the final steps of a request batch are partially overlapped.
Future work will also explore: extending MineDraft to the vLLM v1 engine and its chunked prefill mode; eliminating the dedicated GPU requirement using weight-padding techniques; and studying whether MineDraft can be combined with parallel drafting methods (DFlash, P-EAGLE) for compounding gains.
Cite this Work
```bibtex
@misc{tang2026minedraft,
  title   = {MineDraft: A Framework for Batch Parallel Speculative Decoding},
  author  = {Tang, Zhenwei and Verma, Arun and Zhou, Zijian and Wu, Zhaoxuan and Prakash, Alok and Rus, Daniela and Low, Bryan Kian Hsiang},
  journal = {arXiv preprint arXiv:2603.18016},
  year    = {2026},
}
```