ML-ADSA-87: Benchmarks and Sizes
Companion to the specification (docs/30) and verification dossier (docs/31). All numbers below are
measured by go test -bench on the actual reference implementation — not estimates.
Measurement conditions (honest).
- Platform: Apple M5 (arm64), Go reference implementation in qrl-integration/ml-adsa/qrysm/mladsa/
(byte-identical to the canonical go-mladsa/).
- Command: go test ./mladsa/ -run='^$' -bench=. -benchmem (default benchtime); deterministic KAT
inputs (bench_test.go reuses kat_test.go conditions).
- This is the reference implementation: portable, not optimized, and not yet constant-time
hardened. AggregateF includes Fiat-Shamir-with-aborts rejection sampling, so per-op timing has
real variance. An optimized (AVX2/assembly) and batched implementation would be substantially faster
(docs/32 item #6). The numbers indicate the algorithm's shape, not the speed of a production-tuned build.
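As a rough illustration of what the three reported columns (ns/op, B/op, allocs/op) mean, here is a minimal sketch that drives a benchmark through the standard library's `testing.Benchmark`. The workload `standInExpand` is a hypothetical placeholder that hashes a seed into a buffer; it is not ExpandA and its timings say nothing about the real implementation. The fixed all-zero seed only mirrors the idea of deterministic inputs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// standInExpand is a hypothetical placeholder workload, NOT the real ExpandA.
// It stretches a 32-byte seed into n bytes by hashing seed||counter.
func standInExpand(seed [32]byte, n int) []byte {
	out := make([]byte, 0, n)
	var block [33]byte
	copy(block[:], seed[:])
	for ctr := byte(0); len(out) < n; ctr++ {
		block[32] = ctr
		sum := sha256.Sum256(block[:])
		out = append(out, sum[:]...)
	}
	return out[:n]
}

// benchStandIn has the same shape as a BenchmarkXxx function in a _test.go file.
func benchStandIn(b *testing.B) {
	var seed [32]byte // fixed input, so every iteration does identical work
	b.ReportAllocs()  // in a _test.go file this is what -benchmem enables globally
	for i := 0; i < b.N; i++ {
		_ = standInExpand(seed, 4096)
	}
}

func main() {
	r := testing.Benchmark(benchStandIn)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

With the default benchtime the harness picks `b.N` itself, so absolute figures vary between runs and machines; only the column meanings carry over to the tables below.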
1. Primitive operations
| Operation | ns/op | ≈ time | B/op | allocs/op |
|---|---|---|---|---|
| ExpandA(ρ) (matrix expansion) | 166,835 | 0.167 ms | 121,664 | 177 |
| ContentKeyDerive (per-content key refresh, L1) | 101,076 | 0.101 ms | 203,937 | 147 |
| MemberKeyGen (registration keypair + PoP) | 807,869 | 0.81 ms | 920,402 | 589 |
| Verify (in-house, = FIPS-204 verify) | 433,461 | 0.43 ms | 356,668 | 318 |
| Verify (go-qrllib native ML-DSA-87) | 137,960 | 0.138 ms | 1,158 | 4 |
MemberKeyGen is a one-time, per-epoch-registration cost (it generates an ML-DSA registration keypair and
a proof-of-possession). ContentKeyDerive is the recurring per-content refresh.
2. Aggregation (AggregateF) by committee size N
Includes rejection sampling; the output is a single ML-DSA-87 signature regardless of N.
| N (signers) | ns/op | ≈ time | per-signer | B/op | allocs/op |
|---|---|---|---|---|---|
| 1 | 399,119 | 0.40 ms | 0.40 ms | 708,286 | 521 |
| 4 | 851,372 | 0.85 ms | 0.21 ms | 1,470,873 | 1,139 |
| 16 | 4,983,889 | 4.98 ms | 0.31 ms | 9,011,744 | 7,107 |
| 64 | 18,927,026 | 18.9 ms | 0.30 ms | 16,729,178 | 13,393 |
| 128 | 23,825,097 | 23.8 ms | 0.19 ms | 33,002,311 | 26,452 |
Aggregation is ≈ linear in N (one A·y and one response per signer, plus the shared combine), modulated by
rejection retries — hence the per-op variance and the not-perfectly-monotone per-signer column. Crucially
this is a one-time combine per slot/content, and in the decentralized deployment it is split across the
committee (each signer does its own ContentParts/ContentResponse; any party runs the public combine).
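The per-signer column is simply the total ns/op divided by N. A small sketch that recomputes it from the measured totals in the table (the function name `perSignerMs` is an illustrative choice, not part of the reference implementation):

```go
package main

import "fmt"

// perSignerMs converts a whole-aggregate ns/op figure into milliseconds per signer.
func perSignerMs(nsPerOp float64, n int) float64 {
	return nsPerOp / 1e6 / float64(n)
}

func main() {
	// Totals copied from the AggregateF table above.
	rows := []struct {
		n  int
		ns float64
	}{
		{1, 399119},
		{4, 851372},
		{16, 4983889},
		{64, 18927026},
		{128, 23825097},
	}
	for _, r := range rows {
		fmt.Printf("N=%3d  total %6.2f ms  per-signer %.2f ms\n",
			r.n, r.ns/1e6, perSignerMs(r.ns, r.n))
	}
}
```

The per-signer values stay within roughly 0.2 to 0.4 ms across the whole range, which is what "approximately linear, modulated by retries" looks like in practice.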
3. Verification cost is O(1) in N: the headline result
The verifier does one ML-DSA-87 verify against pk* regardless of how many signers aggregated:
| Scheme | bytes on the wire | verify cost |
|---|---|---|
| Per-attester signature list (naive PQ port) | N × 4627 (≤ 592,256 at N=128) | N verifies (O(N)) |
| ML-ADSA aggregate | 4627 (constant) | 1 verify (~0.14–0.43 ms), O(1) |
This is the BLS-like win: constant signature size and constant verification, independent of committee size.
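To put the O(N) versus O(1) contrast in time units, the sketch below multiplies the two measured single-verify latencies from section 1 by N. It assumes the naive list is verified sequentially and unbatched, which is an assumption of this sketch rather than a measured result; the function names are illustrative.

```go
package main

import "fmt"

// Verify latencies in ms/op, copied from the primitive-operations table.
const (
	nativeVerifyMs  = 0.138 // go-qrllib native ML-DSA-87
	inHouseVerifyMs = 0.433 // in-house reference verify
)

// listVerifyMs estimates verifying n independent signatures one after another.
// Assumption: no batching and no parallelism.
func listVerifyMs(n int, perVerifyMs float64) float64 {
	return float64(n) * perVerifyMs
}

// aggregateVerifyMs is one verify against pk*; the committee size is ignored.
func aggregateVerifyMs(_ int, perVerifyMs float64) float64 {
	return perVerifyMs
}

func main() {
	for _, n := range []int{1, 16, 64, 128} {
		fmt.Printf("N=%3d  list %.2f to %.2f ms  aggregate %.2f to %.2f ms\n",
			n,
			listVerifyMs(n, nativeVerifyMs), listVerifyMs(n, inHouseVerifyMs),
			aggregateVerifyMs(n, nativeVerifyMs), aggregateVerifyMs(n, inHouseVerifyMs))
	}
}
```

These are extrapolations from a single-verify measurement, not benchmarks of a list verifier.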
4. Sizes
| Object | Size | Note |
|---|---|---|
| Aggregate public key pk* | 2592 B | a valid ML-DSA-87 public key (constant) |
| Aggregate signature σ* | 4627 B | a valid ML-DSA-87 signature (constant, any N) |
| Aggregation bits | ⌈N/8⌉ + 1 B (≤ ~17 B at N=128) | already present upstream; identifies the signer set |
| Master secret seed | 32 B | the only long-term secret per signer |
Compression vs the per-attester list
| Committee N | list size | ML-ADSA | reduction |
|---|---|---|---|
| 8 | 37,016 B | 4,627 B | 8× |
| 16 | 74,032 B | 4,627 B | 16× |
| 64 | 296,128 B | 4,627 B | 64× |
| 128 (max) | 592,256 B | 4,627 B | 128× |
No needed information is lost: signer set = the public aggregation bits, key = Σ epoch-tree tᵢ, validity
= one FIPS-204 verify.
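The size figures in this section follow from two formulas: N × 4627 for the list and ⌈N/8⌉ + 1 for the aggregation bits. A short sketch that reproduces them (only the byte count of the aggregation bits is modeled here; their encoding is not specified in this document, and the helper names are illustrative):

```go
package main

import "fmt"

const sigBytes = 4627 // ML-DSA-87 signature size in bytes

// listBytes is the wire size of n independent signatures.
func listBytes(n int) int { return n * sigBytes }

// aggBitsBytes is ceil(n/8) + 1, the size given for the aggregation bits.
func aggBitsBytes(n int) int { return (n+7)/8 + 1 }

func main() {
	for _, n := range []int{8, 16, 64, 128} {
		fmt.Printf("N=%3d  list %7d B  aggregate %d B (+%2d B bits)  reduction %dx\n",
			n, listBytes(n), sigBytes, aggBitsBytes(n), listBytes(n)/sigBytes)
	}
}
```

The reduction factor equals N exactly because the aggregation bits are excluded from both sides of the comparison, in line with the note that they are already present upstream.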
4b. After optimization pass (pure-Go, byte-identical, same machine)
The pure-Go arithmetic was optimized (branchless constant-time modQ/cabs; a fused allocation-free
multiply-accumulate pwacc replacing padd(acc, pw(...)) in the A·y/A·z matrix products). Output is
byte-identical (all KATs unchanged). Measured deltas (Apple M5):
| Op | time before | time after | allocs/op (before → after) | B/op (before → after) |
|---|---|---|---|---|
| ContentKeyDerive | 0.101 ms | 0.077 ms | 147 → 91 | 204 KB → 89 KB |
| AggregateF N=16 | 4.98 ms | 5.3 ms¹ | 7,107 → 5,202 | 9.0 MB → 5.1 MB |
| AggregateF N=64 | 18.9 ms | 15.8 ms | 13,393 → 9,753 | 16.7 MB → 9.3 MB |
| AggregateF N=128 | 23.8 ms | 19.4 ms | 26,452 → 19,228 | 33.0 MB → 18.2 MB |
| Verify (in-house) | 0.43 ms | 0.37 ms | 318 → 262 | 357 KB → 242 KB |
¹ N=16 time is within rejection-sampling noise. Per the table, allocs/op dropped ~18–38% and B/op
~32–56% across the board, which is the durable win. On the AggregateF rows that is roughly 43–45%
fewer bytes allocated, and the rows outside the N=16 noise are ~14–24% faster, with byte-for-byte
identical output. Further gains (in-place NTT, lazy reduction, AVX2 asm) are in docs/34.
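The relative improvements can be recomputed directly from the before/after columns. A minimal sketch, using the allocs/op pairs from the table above (the helper `pctDrop` is an illustrative name):

```go
package main

import "fmt"

// pctDrop returns the relative reduction from before to after, in percent.
func pctDrop(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// allocs/op pairs copied from the table above.
	rows := []struct {
		op            string
		before, after float64
	}{
		{"ContentKeyDerive", 147, 91},
		{"AggregateF N=16", 7107, 5202},
		{"AggregateF N=64", 13393, 9753},
		{"AggregateF N=128", 26452, 19228},
		{"Verify (in-house)", 318, 262},
	}
	for _, r := range rows {
		fmt.Printf("%-18s allocs/op down %.1f%%\n", r.op, pctDrop(r.before, r.after))
	}
}
```

The same function applies to the time and B/op columns once the units on both sides match.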
5. Reproduce
cd qrl-integration/ml-adsa/qrysm
go test ./mladsa/ -run='^$' -bench=. -benchmem # all benchmarks above
Indicative production-tuning headroom (not yet done, docs/32): NTT/poly arithmetic in assembly/AVX2,
batched verification, allocation reduction (the reference allocates liberally), and a constant-time
hardened build. These would primarily speed up AggregateF (the rejection-sampling loop) and bring
in-house Verify toward the go-qrllib native figure.