MineDraft: A Framework for Batch Parallel Speculative Decoding

What if your draft model never blocked the verifier? MineDraft overlaps drafting and verification across parallel batches, hiding latency that standard speculative decoding leaves on the table.

Zhenwei Tang (1,2,*), Arun Verma (2,*,†), Zijian Zhou (2,3), Zhaoxuan Wu (2), Alok Prakash (2), Daniela Rus (4,2), Bryan Kian Hsiang Low (3,2)
(1) College of Computing and Data Science, Nanyang Technological University, Singapore
(2) Singapore-MIT Alliance for Research and Technology Centre, Singapore
(3) Department of Computer Science, National University of Singapore, Singapore
(4) CSAIL, Massachusetts Institute of Technology, USA
(* Equal contribution, † Corresponding author)

TL;DR

MineDraft introduces Batch Parallel Speculative Decoding, which runs the drafter and the verifier simultaneously. While the verifier processes one batch, the drafter, running on a dedicated GPU, is already generating tokens for the next batch. The two batches alternate continuously, hiding the drafter's latency entirely within the verifier's computation time. The result: up to 75% more throughput and 39% lower end-to-end latency than standard speculative decoding, with just one extra GPU.

The Problem

The Hidden Tax of Sequential Drafting

Speculative decoding is one of the cleverest tricks in modern LLM inference. A small, fast draft model races ahead to propose several tokens; the large target model verifies them all in one parallel forward pass. When draft quality is high, you get multiple tokens for the cost of a single verification step, a genuine free lunch.

But there is a subtle, rarely-examined inefficiency at the heart of every standard speculative decoding system: drafting and verification happen one after the other. The draft model generates its proposals. It finishes. Only then does the target model begin to verify. Those drafting milliseconds sit directly on the critical path: not hidden, not amortized, not free. Every cycle the draft model runs is a cycle the massive target model sits completely idle.

In standard speculative decoding, frequent rejections can make SD slower than ordinary autoregressive generation, because you have paid for drafting and gained nothing.

The severity of this problem compounds in production batch-serving settings. With tens of concurrent requests, the draft model must generate proposals for every active request at each step. The target model's entire forward pass waits. And because performance is acutely sensitive to acceptance rates, any degradation in draft quality can flip the system from faster to slower than if you had never used speculative decoding at all.
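To make the break-even point concrete, here is a minimal sketch of the usual back-of-the-envelope model for sequential speculative decoding. It assumes each draft token is accepted independently with probability `accept_rate`, and that one verification pass costs the same as one ordinary autoregressive step of the target model. Both are simplifying assumptions for illustration, and the names `expected_tokens`, `sd_speedup`, `draft_cost`, and `verify_cost` are hypothetical rather than taken from MineDraft.

```python
def expected_tokens(accept_rate: float, k: int) -> float:
    """Expected tokens produced per SD step with k draft tokens.

    Under i.i.d. acceptance this is 1 + a + a^2 + ... + a^k: the accepted
    prefix plus the one token the target model always contributes.
    """
    return sum(accept_rate ** i for i in range(k + 1))


def sd_speedup(accept_rate: float, k: int, draft_cost: float,
               verify_cost: float = 1.0) -> float:
    """Speedup of sequential SD over plain autoregressive decoding.

    Autoregressive decoding yields one token per verify_cost. Sequential SD
    pays k * draft_cost for drafting and then verify_cost for verification,
    because the two phases sit back to back on the critical path.
    """
    step_time = k * draft_cost + verify_cost
    return expected_tokens(accept_rate, k) * verify_cost / step_time


# A well-matched drafter pays off; a poorly matched, slower one does not.
good = sd_speedup(accept_rate=0.8, k=4, draft_cost=0.1)
bad = sd_speedup(accept_rate=0.2, k=4, draft_cost=0.3)
print(f"good drafter: {good:.2f}x, poor drafter: {bad:.2f}x")
```

In this toy model the second configuration lands below 1x, which is the "slower than never using SD" regime described above: the drafting cost is paid in full on every step regardless of how many tokens survive verification.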

Prior work addressed this by improving draft quality: better draft models (EAGLE, EAGLE-2), smarter token selection (TETRIS), and tree-based speculation. MineDraft takes a fundamentally different angle. Rather than making individual draft tokens more likely to be accepted, it asks: what if drafting and verification ran simultaneously, and the drafter used the time this frees up to generate even better proposals?

The name is a deliberate nod to the Minecraft game engine, which loads the next world chunk into memory while the player is still navigating the current one, so the world appears seamless. MineDraft applies exactly this background-loading principle to LLM inference, ensuring the draft model is always one step ahead and its compute time never appears on the critical path.


Method

Batch Parallelism: Always One Step Ahead

The core mechanism is elegant. A standard system processes m requests per step. MineDraft maintains 2m requests split into two alternating batches. At any given moment, the target model is verifying Batch 0 while the draft model is speculating ahead for Batch 1. At the next step, they swap roles. The draft model is never idle waiting for verification to finish; it is always working on the other batch.

[Figure: timeline comparison of the two schedules. Standard Speculative Decoding (sequential): a single process alternates Draft, Verify, Draft, Verify, and so on; drafting sits on the critical path and the verifier idles during every drafting phase. MineDraft (Batch Parallel Speculative Decoding): the Drafter works on Batch 0, Batch 1, Batch 0, Batch 1, and so on, while the Verifier, after an init step, verifies Batch 0, Batch 1, Batch 0, and so on, with a GPU-to-GPU sync between steps; drafting and verification fully overlap. Legend: Drafting, Verification, GPU-to-GPU sync, Idle / init.]
MineDraft parallelizes drafting and verification by maintaining two alternating batches. The Drafter (running on a dedicated GPU) is always one step ahead, so its compute time is hidden inside the Verifier's forward pass. The one-step init cost is negligible over long request sequences.
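The two schedules can be compared with a small idealized timing model. This is a sketch under stated assumptions, not the paper's cost model: drafting and verification take fixed times per step, `steps` counts verification passes, and the hypothetical `sync_ms` parameter stands in for the GPU-to-GPU sync cost. All function and parameter names are illustrative.

```python
def sequential_time(steps: int, draft_ms: float, verify_ms: float) -> float:
    """Standard SD: every step pays for drafting, then for verification."""
    return steps * (draft_ms + verify_ms)


def batch_parallel_time(steps: int, draft_ms: float, verify_ms: float,
                        sync_ms: float = 0.0) -> float:
    """Idealized two-batch schedule with overlapped drafting.

    Timeline: one un-overlapped draft (the init step), then steps - 1 phases
    in which verifying one batch overlaps with drafting the other, then a
    final verification with nothing left to draft.
    """
    if steps == 0:
        return 0.0
    overlapped = (steps - 1) * (max(draft_ms, verify_ms) + sync_ms)
    return draft_ms + overlapped + verify_ms


for draft_ms, verify_ms in [(10.0, 30.0), (40.0, 30.0)]:
    seq = sequential_time(10, draft_ms, verify_ms)
    par = batch_parallel_time(10, draft_ms, verify_ms)
    print(f"draft={draft_ms} verify={verify_ms}: sequential={seq} parallel={par}")
```

With a 10 ms draft and a 30 ms verification over 10 steps, the sequential model gives 400 ms and the overlapped model 310 ms. The `max(...)` term also shows the limit of the idea in this model: drafting is fully hidden only while it is no slower than verification.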

Four Interacting Components

MineDraft decomposes into four precisely-specified subsystems that work together to realize Batch Parallelism without architectural surgery to the underlying LLM stack:

Batch Manager

Assigns each incoming request to Batch 0 or Batch 1 using a balance counter that tracks size difference. Recycles batch IDs on completion or preemption. Tracks skip_batch, the current draft batch awaiting its first verification.

Scheduler

Manages the full request lifecycle: Waiting → Running → Finished. Patches vLLM's KV-block allocator so the draft batch does not wastefully pre-allocate GPU memory for tokens that are not being verified yet.

Drafter

Runs the small draft model on a dedicated GPU. Generates k speculative tokens for the draft batch while the Verifier processes the target batch. Communicates via direct GPU-to-GPU transfer, with no CPU round-trip.

Verifier

Runs the large target model on the primary GPU cluster. Receives draft tokens from the Drafter, evaluates them in one forward pass, and returns accepted tokens plus target sampler output back to the Drafter.
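As a rough illustration of the Batch Manager's bookkeeping, the sketch below assigns requests with a single balance counter and releases them on completion or preemption. It is a simplified assumption of how such a counter could work, not MineDraft's implementation: the class name, method names, and tie-breaking rule are hypothetical, and `skip_batch` tracking is not modeled.

```python
class BatchManager:
    """Toy two-batch assigner driven by a balance counter."""

    def __init__(self) -> None:
        # balance = size(batch 0) - size(batch 1)
        self.balance = 0
        self.batch_of: dict[str, int] = {}

    def assign(self, request_id: str) -> int:
        """Place a new request in the smaller batch (ties go to batch 0)."""
        batch = 0 if self.balance <= 0 else 1
        self.batch_of[request_id] = batch
        self.balance += 1 if batch == 0 else -1
        return batch

    def release(self, request_id: str) -> None:
        """Free a request's slot when it finishes or is preempted."""
        batch = self.batch_of.pop(request_id)
        self.balance -= 1 if batch == 0 else -1


manager = BatchManager()
print([manager.assign(r) for r in ["a", "b", "c"]])  # alternates between batches
manager.release("a")
print(manager.balance)
```

Keeping only the size difference, rather than two separate counts, is enough to decide where the next request goes, which is what makes the bookkeeping lightweight.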

Each SD step ends at a sync point, the moment the Drafter returns output to the Scheduler. At this point, the batches alternate roles: the old target batch becomes the new draft batch, and vice versa. This alternation requires no centralized orchestration beyond the Batch Manager's lightweight bookkeeping.
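The alternation at each sync point can be sketched as a loop that submits one verification and one drafting job concurrently, waits for both, and then swaps the batch roles. This is a control-flow illustration only: `draft` and `verify` are hypothetical stand-ins for the two models, Python threads stand in for separate GPUs, and the toy drafter does not condition on previously accepted tokens as a real one would.

```python
from concurrent.futures import ThreadPoolExecutor


def draft(batch: list[str], k: int) -> dict[str, list[str]]:
    """Stand-in drafter: propose k placeholder tokens per request."""
    return {req: [f"{req}-d{i}" for i in range(k)] for req in batch}


def verify(proposals: dict[str, list[str]]) -> dict[str, list[str]]:
    """Stand-in verifier: accept only the first proposed token."""
    return {req: tokens[:1] for req, tokens in proposals.items()}


def run(batches: list[list[str]], steps: int, k: int = 2) -> dict[str, list[str]]:
    accepted: dict[str, list[str]] = {req: [] for b in batches for req in b}
    draft_id = 0
    pending = draft(batches[draft_id], k)  # init step: nothing to verify yet
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(steps):
            draft_id = 1 - draft_id  # roles alternate between the two batches
            verifying = pool.submit(verify, pending)
            drafting = pool.submit(draft, batches[draft_id], k)
            # Sync point: both jobs must finish before the next step starts.
            for req, tokens in verifying.result().items():
                accepted[req].extend(tokens)
            pending = drafting.result()
    return accepted


print(run([["a", "b"], ["c"]], steps=4))
```

Each request is verified every other step, so four steps give every request in this toy run two verification passes.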

Theoretical Efficiency Guarantee

MineDraft does not just work empirically; it is provably faster. Under mild assumptions about the draft model's acceptance function $f(t) = 1 - e^{-\alpha t}$ (the standard exponential model used throughout the speculative decoding literature), we prove:

Theorem 1: Formal Guarantee
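For $\alpha V \geq -W_{-1}\!\left(-\frac{1}{2e}\right) - 1 \approx 1.68$, where $W_{-1}$ is the lower branch of the Lambert W function, standard SD satisfies $T_{\mathrm{SD}} \geq 1.59 \cdot T_{\mathrm{PSD}}$, meaning MineDraft's Parallel SD is at least 37% faster than standard SD, with improvement growing with draft quality.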

The product αV captures both draft quality (α is the acceptance sharpness parameter) and verification cost (V). When either the verifier is slower or the draft model is more accurate, the advantage of parallelism grows. This explains why MineDraft pairs especially powerfully with strong drafting strategies: better acceptance rates do not just improve token throughput; they amplify the parallel speedup.
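The two constants in Theorem 1 can be checked numerically. The sketch below evaluates the lower Lambert W branch by bisection using only the standard library, then converts the 1.59 time ratio into a relative time saving. The helper name `lambert_w_minus1` and the bracket `[-50, -1]` are illustrative choices, and the code only reproduces the arithmetic in the theorem statement, not its proof.

```python
import math


def lambert_w_minus1(x: float, tol: float = 1e-12) -> float:
    """Lower branch W_{-1}(x) for -1/e < x < 0, found by bisection.

    On w < -1 the function w * exp(w) decreases from about 0 toward -1/e
    as w rises to -1, so the root of w * exp(w) = x is bracketed.
    """
    lo, hi = -50.0, -1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * math.exp(mid) > x:
            lo = mid  # value still above x: the root lies closer to -1
        else:
            hi = mid
    return (lo + hi) / 2


threshold = -lambert_w_minus1(-1 / (2 * math.e)) - 1
time_saving = 1 - 1 / 1.59

print(f"alpha * V threshold: {threshold:.3f}")  # roughly 1.68
print(f"time saving at ratio 1.59: {time_saving:.1%}")  # roughly 37%
```

Read this way, "at least 37% faster" corresponds to Parallel SD needing at most about 63% of the time of standard SD once the product of acceptance sharpness and verification cost clears the threshold.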


Results

Consistent Gains Across Every Setting

MineDraft was evaluated across seven target–draft model pairings, four benchmark datasets (Arena, ShareGPT, Spec-Bench, LLM-Tough-Questions), and extensive ablations. The conclusion is unambiguous: MineDraft outperforms standard SD and all prior parallelism baselines in every configuration without exception.

75.68%
Max throughput gain over standard SD (tokens/sec)
39.51%
Max end-to-end latency reduction vs standard SD
1
Extra GPU required (dedicated to the draft model)

Throughput: Qwen3-32B + 1.7B Draft, Arena

Representative throughput numbers at 2 speculative tokens per step on the Arena dataset. MineDraft (extra=1) adds one additional concurrent sequence per batch on top of the base configuration.

MineDraft (extra=1) ~ 570 tok/s
MineDraft (standalone) ~ 520 tok/s
PEARL + TETRIS (extra=1) ~ 430 tok/s
PEARL ~ 390 tok/s
Standard SD ~ 310 tok/s

Throughput Results Summary

| Target Model | Draft Model | Avg Throughput Gain ↑ | Max Gain vs SD ↑ |
| --- | --- | --- | --- |
| Qwen3-32B | Qwen3-0.6B | 40–47% | 70.32% |
| Qwen3-32B | Qwen3-1.7B | 38–48% | 75.68% |
| Qwen3-32B | Qwen3-4B | 49–65% | 75.12% |
| Qwen3-32B | Qwen3-8B | OOM for baseline | OOM for baseline |
| Llama-3.3-70B | Llama-3.1-8B | 21–31% | 37.06% |
| Vicuna-33B + EAGLE | EAGLE-Vicuna-33B | 1–7.5% | 22.09% |
| Vicuna-13B + EAGLE | EAGLE-Vicuna-13B | 2–6.9% | 21.0% |

End-to-End Latency (E2EL)

Throughput and latency tell different stories. MineDraft excels at both. In the Qwen3-32B + 1.7B setting, it achieves an average E2E latency improvement of 24.6% over the best baseline, with a maximum reduction of 39.5% compared to standard SD. For Qwen3-32B + 4B, maximum reductions reach 38.9%.

E2EL Reduction: Qwen3-32B + 1.7B Draft, Arena

Latency reduction relative to standard SD at 2 speculative tokens per step. Higher bars indicate greater latency savings.

MineDraft (extra=1) 39.5% reduction
MineDraft (standalone) ~33% reduction
PEARL + TETRIS (extra=1) ~18% reduction
PEARL ~12% reduction
Standard SD 0% (baseline)

End-to-End Latency Results Summary

| Target Model | Draft Model | Avg E2EL Reduction ↓ | Max Reduction vs SD ↓ |
| --- | --- | --- | --- |
| Qwen3-32B | Qwen3-0.6B | 22–26% | 37.89% |
| Qwen3-32B | Qwen3-1.7B | 18–28% | 39.51% |
| Qwen3-32B | Qwen3-4B | 25–31% | 38.94% |
| Qwen3-32B | Qwen3-8B | OOM for baseline | OOM for baseline |
| Llama-3.3-70B | Llama-3.1-8B | 7–15% | 20.63% |
| Vicuna-33B | EAGLE-Vicuna-33B | 0.7–6% | 15.52% |
| Vicuna-13B | EAGLE-Vicuna-13B | 2–6% | 16.3% |

An important secondary finding: by placing the draft model on a dedicated GPU, MineDraft eliminates the VRAM contention that afflicts standard SD with large draft models. In Setting 4 (Qwen3-32B with Qwen3-8B as draft), standard SD fails entirely with an out-of-memory error. MineDraft continues operating at 250–560 tokens/sec.

MineDraft also composes cleanly with existing drafting strategies. When integrated with TETRIS, it outperforms standalone TETRIS; paired with EAGLE, it outperforms standalone EAGLE. This orthogonality matters in practice: any improvement to draft acceptance rates directly amplifies the parallel speedup.


Key Insight

One Thing to Remember

Standard speculative decoding puts the draft model on the critical path. MineDraft takes it off. By running the drafter one batch ahead on its own GPU, drafting latency becomes a sunk cost hidden inside verification, yielding a speedup that is guaranteed by theory and confirmed across every model family tested.

For ML engineers deploying LLM inference at scale: if you are already running speculative decoding and can spare one GPU, MineDraft is a plug-in upgrade. It composes with whatever drafting strategy you already use. Because the guarantee in Theorem 1 depends on the product αV rather than on draft quality alone, a slow verifier can compensate for a mediocre drafter. The larger and slower your verifier, the more drafting latency gets hidden, which means MineDraft's advantage is strongest precisely where you need it most.


Limitations & Future Work

Known Trade-offs

There are two structural limitations of the Batch Parallelism design:

Irrecoverable Batch Imbalance

When a request is terminated via user abort or preemption, new replacements are assigned to the draft batch, the only batch that can accept new requests. This can permanently skew batch sizes, degrading the load-balancing that Batch Parallelism relies on. Chunked prefill further exacerbates this. A future version of MineDraft will apply the full balance-tracking logic to all subsequent steps rather than switching to a simpler policy post-initialization.
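A toy example shows how this skew accumulates. The sketch assumes the rule stated above, that every replacement request goes to the current draft batch, and tracks only batch sizes; the function name and the starting sizes are made up for illustration.

```python
def replace_terminated(sizes: tuple[int, int], terminated_from: int,
                       draft_batch: int) -> tuple[int, int]:
    """Remove one request from a batch and admit its replacement.

    The replacement always joins the draft batch, because in this sketch
    that is the only batch allowed to accept new requests.
    """
    updated = list(sizes)
    updated[terminated_from] -= 1
    updated[draft_batch] += 1
    return (updated[0], updated[1])


sizes = (4, 4)
# Two requests in batch 0 are aborted while batch 1 is the draft batch.
for _ in range(2):
    sizes = replace_terminated(sizes, terminated_from=0, draft_batch=1)
print(sizes)  # batch 1 has grown at batch 0's expense
```

Terminations inside the draft batch itself leave the sizes unchanged in this model, so the drift comes only from terminations in the batch currently being verified.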

Batch Exhaustion Tail Effect

When one batch empties (all requests finish) and the other has no ready draft tokens, the system falls back to standard sequential SD. This tail effect limits the theoretical ceiling below 50% total speedup. A proposed mitigation draws from PEARL: instruct the Drafter to generate tokens for the remaining batch's in-flight requests, re-drafting on failure, to partially overlap even the final steps of a request batch.

Future work will also explore: extending MineDraft to the vLLM v1 engine and its chunked prefill mode; eliminating the dedicated GPU requirement using weight-padding techniques (enabling the draft and target models to share GPU memory); and studying whether Batch Parallelism can be combined with tree-attention and parallel drafting methods (DFlash, P-EAGLE) for compounding gains.


Citation

Cite this Work

BibTeX
@inproceedings{tang2026minedraft,
  title         = {MineDraft: A Framework for Batch Parallel Speculative Decoding},
  author        = {Tang, Zhenwei and Verma, Arun and Zhou, Zijian and
                  Wu, Zhaoxuan and Prakash, Alok and
                  Rus, Daniela and Low, Bryan Kian Hsiang},
  booktitle     = {Proceedings of the 43rd International Conference on Machine Learning},
  year          = {2026},
}