// changelog

Releases

Name: Smedjan
Author: Andrei Dodu

Every version of Smedjan, generated from the git history. No ghostwritten release notes — just the commits that shipped.

v0.1.3 · 2026-06-28 · 23 commits

Added

6e42661 feat: Rust edition 2024 + un-ignore all tests (v0.1.3)
dfedce7 feat: forward_with_cache + RoPE cache + decode benchmark baseline
eb45406 feat: RoPE frequency cache — precomputed cos/sin table eliminates per-thread pow+sincos
a4edd8d feat: KV-cache in full-attention forward + Muon optimizer + decode kernel in qmul
b81afb3 feat: full craft — decode kernel in forward, GPU embedding gather, LoRA on attn, BPE, CLI, self-distill
56e2d4c feat: output-centric quantized decode kernel (3.5x faster) + KV-cache + self-distillation + spec-decode
84fde1f feat(qwen3.5): LoRA fine-tuning — frozen 9B base + trainable low-rank adapters
4a4b58b feat(qwen3.5): quantized forward produces real non-zero logits from 9B weights on Metal
e55adf9 feat(qwen3.5): quantized GEMM — 9B Qwythos loads + runs forward on 16GB M1
3a95ebd feat(qwen3.5): strict_qwen35 forward — real activations (softplus+dt_bias+A_log+RMSNormGated)
aae43e9 feat(qwen3.5): affine-int4 safetensors loader — 927-tensor Q4 artifact → Qwen35Model
c9a538d feat(qwen3.5): full hybrid model forward (Qwen35Model) — verified end-to-end
e05084e feat(qwen3.5): full-attention layer (GQA+QK-norm+partial-RoPE+causal+gate) — verified
e7793be feat(qwen3.5): config_from_hf_qwen35 parser (Phase 5 start)
ba44dfc feat(qwen3.5): full Gated-DeltaNet mixer layer assembly — verified end-to-end
2370309 feat(qwen3.5): asymmetric GQA head expansion (expand_heads) for DeltaNet 16k/32v heads
fa08195 feat(qwen3.5): sigmoid (via softmax, no new kernel) + attention output-gate
0d153f0 feat(qwen3.5): causal conv1d + partial-RoPE primitives (composed ops, verified)
de13d81 feat(gated-deltanet): Qwen3.5 linear-attention mixer — materialized, composed-ops, both backends

Changed

9030687 refactor: genericize distillation API naming

Docs

55fc089 docs: re-render banner — SMEDJAN wordmark now Zen Dots (was Electrolize), matches the site; cache-bust ?v=3
8b1753f docs: re-render README banner with the correct Zen Dots headline (was a fallback bold sans)

Other

bd3d855 readme: cache-bust the banner URL (?v=2) so GitHub camo serves the corrected Zen Dots banner

v0.1.2 · 2026-06-24 · 27 commits

Other

1221930 docs+release: CUDA at full parity (269/269 both backends) + NVIDIA perf; bump v0.1.2
de8fcde train: name every forged model (Smedjan-5M, Smedjan-98M, ...)
1612280 train: live Unicode loss sparkline in the progress line

CUDA

fcc9aeb cuda: parallelize l2_norm_check (gradient clipping) — multi-block atomicAdd
53dc40a cuda: route grad-weight GEMM (TN) through cuBLAS — free the backward
d456e26 cuda: bf16 tensor-core GEMM on the fast path (opt-in via SMEDJAN_BF16_GEMM)
28abef4 cuda: key buffer pool by device instance — fix cross-stream reuse race
84e841b cuda: real buffer pool — eliminate per-op cudaMalloc thrash

Tests

e86ea4b test(cuda): gate the fp16-range-clamp asserts to Metal (CUDA fast path is now TF32)
51bdbf1 test(cuda): gate perf-benchmark timing assertions to the Metal backend
90db798 test(cuda): make matmul range tests hermetic (pin fp16 path explicitly)

Added

8b2995c feat(cuda): TF32 tensor cores for the training path (phase 3 — the actual speedup)
622a07a feat(cuda): route precise 2D + batched matmuls through cuBLAS (phase 2)
8f498b1 feat(cuda): wire cuBLAS GEMM (tensor cores) — phase 1: gpu_matmul (NN)
7dd3a16 feat(cuda): port YaRN RoPE to the CUDA backend
e0f3b47 feat(train): --checkpoint-interval CLI flag (was hardcoded 5000, no flag)
de9cf11 feat(interop): close website-listed gaps — bf16/f16 import, config.json→model, GGUF q4_0; correct mixer status

Fixed

8371b69 fix(cuda): use cudarc result::sgemm + sys::lib().cublasSetMathMode (correct API paths)
0cdcbbd fix(cuda): fp32 batched matmul precise path (batched/block-sparse gradchecks)
5f96e8a fix(cuda): fp32 precise matmul path (fixes gradchecks numeric=0 from fp16 rounding)
cf91df5 fix(cuda): remove duplicate kernel definitions that broke NVRTC for the whole backend
7cc12cd fix(gguf): align every tensor data offset to general.alignment (32)

Chore

1eefad0 chore(cuda): turnkey remote CUDA verify script for vast.ai box
61195f2 chore(gitignore): ignore secrets/credentials (vast keys, SSH keys, .env, certs)

Docs

9868aa7 docs(readme): honest GGUF/llama.cpp framing (valid GGML blocks; turnkey inference is roadmap)
892c3fe docs(quantize): export_gguf doc comment lists q4_0 too
779e7a7 docs: forge banner, re-measured performance, zero-warning rustdoc

v0.1.1 · 2026-06-24 · 320 commits

Other

9a2780e readme: logo + title header; honest re-measured performance
b7dfd43 style(safetensors): format export import module
f3ca531 ci(runtime): lock before syncing checkout
99e8b60 ci(runtime): add train smokes and serialize GPU gates
19caa25 ci(runtime): enforce full Mac readiness gate
84add82 tests: wrap seg buffers in u32_to_buf so causal_doc_mask tests typecheck under CUDA (Buf != BufU32)
385010d bench: fix KV-cache overflow in decode section (both backends)
3d4e788 wip(cuda): backend port — de-objc2 shared code, BufU32, cuda Buf/Arc + stubs
53dcf21 revert(train): the loss-readout "fix" (1eb974d) regressed training — restore correct behavior
d7c0357 investigate(checkpointing): f16 fix helps but a separate recompute-gradient bug remains — keep off
653a90f Fused transpose+RoPE kernel + wire scaled_causal_softmax
d874a8a Revert inline FP16 cast — pre-cast path 36% faster (bandwidth-bound)
9778967 Lift fused kernel d_model limit + buffer pool recycling
0a5b43c Persistent Metal compute encoder + conditional workspace
55f4d50 Fix checkpoint corruption + NaN gradients + proper ReLoRA
923b7b7 Add repetition penalty to text generation
cf7d9b2 Fix BPE chunk split, GPU backend aliases, DPO bounds checks
f068b80 GQA strided attention: skip repeat_kv copy for inference
8fbe24e Guard against zero-weight data mixing (division by zero)
713b623 Fix fused CE wrong offset + sample_token panic on empty probs
163a8d9 Fix MoD train/inference mismatch, stale flash comment, softmax 1D
75edead Fix 5 bugs: unicode escapes, Ollama parser, SamplingConfig, inverse_sqrt_lr, GALORE guard
9a97dfc CRITICAL FIX: skip weight decay for norm weights (1D params)
9171a70 Remove mid-step loss spike check (breaks GPU batching)
03090a3 Skip catastrophic loss batches (> 3× EMA) before backward
2fc89a3 Enable embedding gradient via weight-tying (remove detach)
5107f1a Revert embedding detach (NaN from LM head backward) + shuffled loader
160544b Fix embedding gradient + shuffled data loader
4b10f9a CRITICAL FIX: causal_mask now records on autograd tape
04ca936 Fix grow_model: scaled init (0.01×) instead of zero/full random
1de16d5 Pretrained loading + zeroed grow_model for stable progressive training
448b658 Fused kernels + persistent layer + bench command + MegaFfn backward
c1d0620 Gate accumulate_grad_for_test behind #[cfg(test)]
cd44ce1 8 new unit tests: ReLU, AXPY, WSD, inverse-sqrt, EMA, SliceCols, Muon
c450c88 EMA self-distillation via LM head + grow_model lowrank support
2b137d8 grow_model: lowrank + QK-norm + MTP support for progressive training
ba88dd4 Remove all _ prefix suppressions — wire every variable properly
e1d2785 Loss delta tracking
d115bb1 Safetensors export + enhanced training summary + overflow fix
7d365c2 LR schedules (WSO/invsqrt/trapezoid) + best model auto-save
039ebda Training UX: ETA estimation, loss spike detection, training summary, checkpoint merging
bccdf60 AXPY in Muon NS loop (3→2 dispatches per iteration)
3fbad81 Muon pre-allocated workspace (+11%), AXPY kernel, DataMixer
961da19 DataMixer + ReLU routing + remaining SIMD + sliding window + freeze
a5f65f3 ReLU activation + ReMoE routing + progressive layer freezing + sliding window wired
03b3b16 Wire sliding window into attention + progressive layer freezing
5162d51 Wire sliding window attention into causal mask + scaled_causal_softmax
4b68afe SIMD-optimize softmax_backward + rms_norm_backward (9/11 kernels)
9c1b008 Implement ReLoRA merge + anti-PGD noise + SIMD l2_norm_check
70ff224 Fix z-loss buffer aliasing: pool recycling caused scalar_buf corruption
bdff8b7 SIMD-optimize cross_entropy, rms_norm_residual, logsumexp kernels
9bf4d4f SIMD-first reductions: 4x faster on Apple Silicon (softmax, rms_norm, scaled_causal_softmax)
4f0b986 CPU optimizer (Apple Silicon zero-copy) + forward_hidden + apply_lm_head
f95b65b Wire FusedLinearCrossEntropy into training loop + forward_hidden
13d6494 Fix speculative pretraining: batch ref forward in no_grad + explicit flush
55c1476 Fix gradient checkpointing: Metal GPU matmul non-determinism
3cd429a Shared-expert MoE (DeepSeek-V3) + Phase B progress
cf95659 FusedLinearCrossEntropy: chunked logit computation saving ~2GB peak memory
8a2d92a Complete Phase A: bias-based MoE balancing + ReLoRA config
3694c37 Muon optimizer: 2.5x convergence via Newton-Schulz orthogonalization
0f05eb3 Phase A: WSD schedule + EMA self-distillation + anti-PGD noise config
763afd2 Gate QK-norm at d>=512, revert fused projections (dispatch analysis)
9c06241 Disable z-loss (bug: loss drops to 0.018), verify all configs
0cddb4e FP16 activation compression between transformer layers
162c4ff Mixture of Depths + GGUF export + BitNet INT2 confirmed
0a56c8b Wire --stochastic-depth and --sliding-window CLI args
27e952f Stochastic depth + sliding window attention config
05d9df5 Out-of-place RoPE forward: eliminate copy+in-place (2→1 dispatch)
461bbd4 Z-loss + LogSumExp kernel for MoE training stability
9759836 QK-norm + curriculum learning
059c6e6 Multi-token prediction (Meta 2024): 4x sample efficiency
ed24c25 +21% algorithmic throughput: 39,430 → 47,700 tok/s
b9f2b0f ALBERT shared layers + multi-token prediction prep
8032ee2 Wire Sophia optimizer into training loop — 2x faster convergence
4d983a0 Sophia optimizer + AIQ analysis (items #2-#7 from research list)
763b6e2 BitNet b1.58: ternary weight caching + output scaling fix
bc3b7d2 Flash Attention v2: shared memory K/V tiles (forward + backward)
a6ee4de Revert broken Flash v2 merge — backward shader incomplete
643a4e5 WIP: Flash Attention v2 — shared memory tiles + LSE + split backward
0993307 Low-rank attention projections: Q,K,V,O decomposed → 2478 tok/s
4766d72 Embedding factorization + Lion optimizer + sliding window prep
c94b90f GALORE + Speculative Pretraining — all 5 PhD compounds done
309dcfa Multi-scale progressive training: grow_model + CLI
a8d98c1 Online data pruning + PhD-level compounding framework
416ce62 Low-rank FFN training: 2x throughput on M1 (PhD-level compound)
09f5eea Pre-allocated loss workspace + deep allocation analysis
52145be Update AndreOS backend to match andreos-gpu crate API
b00e746 Wire BitNet to training: --bitnet flag for ternary FFN weights
7274d6d FP16 weight cache: avoid redundant float→half casts per step
af968d7 Zero warnings: wire all 9 via gpu_diagnostic API
2fcdb0c Wire warnings: 3 new tests, gpu.rs backend-agnostic import
8787280 Phase-by-phase optimization sweep
9304a1a Optimize scale_rows backward + batch size tuning
1766090 MoE: on-tape routing via matmul selector + scale_rows
579316d Fix MoE gradient flow: scale_rows op + on-tape column extraction
b9c8f2b Audit: MoE training produces uniform output (known bug)
7ca3be5 Phase 3 + 6: Data quality pipeline + API distillation
75dcbae Phase 4: μP (Maximal Update Parameterization)
6a5cfdd Phase 5: BitNet ternary matmul — add/subtract only, no multiply
2b7abe6 MoE GPU routing: 16x faster with gather/scatter kernels
95603df Phase 2: Mixture of Experts (MoE) — N×params at 1×compute
a2c9835 Phase 1 complete: Flash Attention forward + backward
1cb9c2c Phase 1: Flash Attention forward — fused QK^T→mask→softmax→@V
72070bc Add comprehensive README + update project documentation
4b933c8 Add AndreOS backend — zero-overhead direct GPU access
a395cc0 CUDA backend: full kernel parity with Metal (46 kernels)
b791ea3 Add CUDA backend scaffold — portable training across Metal + NVIDIA
d436123 Rust-level optimizations: pre-alloc data buffers + write_u32_to_buffer
f895e83 Async optimizer flush + params_buffer aliasing fix + checkpointing verified
8a9e65d Fix misleading comment: clamp is [1,256] not [1,512]
b2a5377 Restore FP16 backward — NaN was from buffer recycling, not precision
b853333 Fix FP16 NaN: clamp half casts + FP32 backward pass
261ee5d Fix critical grad accum buffer recycling bug + E2E validation
2808a63 NICE TO HAVE: RoPE scaling, warm restarts, file logging, early stopping, GQA backward fix
9d49729 Fix resume off-by-one: start from step+1 after checkpoint
5825f0c Add checkpoint resume, validation loss, dropout config (production-ready)
418137c Revert batched FP16 matmuls — cast overhead hurts small attention dims
25fce52 Complete FP16: all 12 matmul variants use half-precision inputs
14f12be FP16 forward+backward: all non-batched matmuls use half inputs
377ec32 FP16 tensor storage for matmul inputs: cast + half* reads
469d301 FP16 mixed-precision matmul: half shared memory + accumulate
47e2a80 Clean up clip_gradients batching (no perf change)
1da0933 Merge forward+backward into single GPU command batch
7aafda7 Document buffer pool recycling safety (no functional change)
30f1d53 PERF-14: RoPE sincos single instruction + rms_norm_residual_with_sum
8fc1fe7 Fix 3 critical bugs found in manual audit
ac8acd7 Add gradient accumulation, GQA, and DPO (1,771 lines)
57b8397 PERF-11: Move loss readback to log interval (eliminate per-step GPU sync)
71ca782 Revert "PERF-1+2: simdgroup matrix intrinsics + bank conflict fix"
d671d77 PERF-1+2: simdgroup matrix intrinsics + bank conflict fix
9a285e4 Fix w_norm GPU sync regression: check every 100 steps, not every log step
5c98985 Wire all warnings + audit fix: 0 warnings, 60 tests
6f677e5 Add speculative decoding + knowledge distillation (901 lines)
832787e Revert broken SIMD shaders + params pool from WIP commit
b362f29 WIP: Round 2 audit — SIMD shaders, KV cache, sampling optimizations
eb8485d Fix BUG-1 concat_parts assertion + BUG-2 loss requires_grad
d3e0f8b Fix RISK-4 regression: targeted copy only for shared grad buffers
d9243e5 Wire batched matmul shaders: 96 dispatches → 3 per attention layer
94fa132 Fix RISK-4 gradient corruption + RISK-5 memory pressure + batched MSL shaders
48dbd2b Systematic audit fixes: bugs, performance, safety
717aee5 Deep bug audit: fix overflow, add 15 tests (55 total)
39435bd Deep perf audit: batched grad clip, embedding backward, mmap zero-copy, tokenizer early-exit
0f79979 Fix Retained<GpuBuffer> type mismatch in batched gradient clipping
9e1f75c Systematic audit: fix bugs, maximize GPU perf, minimize latency
d6f8b47 GPU-resident KV cache concat + Tensor::from_buffer
6bf4419 Buffer pool + fused residual+RMSNorm kernel
c0ccda7 Fused SiLU-gate kernel: eliminate 2 dispatches + 1 temp buffer per transformer layer
6ff3fe4 Clean up temp file, add to gitignore
2244205 Major perf: command buffer batching + eliminate all forward-path CPU roundtrips
1fc4871 Fix bugs + GPU-accelerate backward pass: eliminate 14/15 CPU roundtrips
bae7bc7 Launch full autopilot pipeline (batch=32, 1774 tok/s)
43b1bfb Add training pipeline automation and monitoring scripts
41a32a2 Prepare 43K SFT instruction pairs for fine-tuning
a753ae3 Optimize BPE tokenizer: 27MB in 51s (was hours)
a9893ef Add test suite, SFT fine-tuning, quantization, Rust API
606d25d Zero warnings, zero clippy lints — production-grade codebase
66109b3 Clean stale imports from objc2-metal migration and refactors
f2a6aee Fix attention gradient flow with batched matmul ops
bd68d30 Add data pipeline, gradient checkpointing, training stabilization

Chore

54fe5cf chore: add FUNDING.yml (GitHub Sponsors → smedjan org)
4c58c95 chore: add authors field to Cargo.toml
866ef6b chore: slim crates.io package (exclude dev/CI data + scripts)
498af28 chore: add OSS package metadata (license/repository/readme/keywords)
1e74061 chore: drop useless vec! in matmul_precise test (clippy clean)
5af4fc2 chore: clear all 39 clippy style warnings → clippy/build warning-free
140a96e chore: make cargo clippy pass (pre-existing correctness-lint errors)

Renamed

58397f2 rename: AndreAI -> Smedjan; prepare for open-source release

Fixed

490b11b fix(generate): speculative decoding respects max_tokens; cover untested live paths
88f6676 fix(attention): sliding window attends to exactly w keys, not w+1
c49ef70 fix(checkpoint): validate model geometry on load
98b20ca fix(data): reject malformed pipeline inputs without panicking
fd76a5d fix(bench): reject zero batch/seq/iters before runtime
91c0b29 fix(grow): validate progressive-growth geometry
f856052 fix(convert): harden quantized checkpoint loading
bb0c5ba fix(train): validate auxiliary training runtimes
28cfe38 fix(generate): validate sampling args before runtime
fdfead1 fix(ci): serialize mac poller reports
6bb7a7c fix(runtime): harden longctx eval and per-sample loss
42f2df2 fix(loss): honor upstream cross entropy gradients
1cdaeb5 fix(train): harden runtime validation and moe routing
7dbdf77 fix(train): validate yarn scale before training
3d8d852 fix(checkpoint): persist yarn rope config
ec9fcd1 fix(safetensors): reject malformed imports precisely
68fc3c1 fix(rope): keep yarn fused rope ci-clean and differentiable
b2aad2b fix(eval): keep long-context suite ci-clean
bed9459 fix(checkpoint): reject malformed artifacts without panics
23bd467 fix(tokenizer): reject malformed tokenizer files instead of panicking
245be58 fix(data): validate mixed shard inputs
89ae410 fix(cli): report runtime input errors without panics
3fa5496 fix(train): reject unknown training config values
92001d8 fix(data): reject malformed training datasets
3078be5 fix(train): resume from final state without skipping steps
68165f2 fix(runtime): restore AdamW resume state across backends
9bdb13e fix(runtime): harden fused training paths
db80161 fix(runtime): harden training sanitizer paths
e9a8a2d fix(cuda): restore backend parity after cautious muon
652c03c fix(cuda): real fp16 cast pack/unpack — the f16 buffer is f32-backed
39e51e4 fix(rwkv): numerically-stable wkv — RWKV now trains (was loss-flat at seq>=~32)
abcdeff fix(attention,train): make --block-sparse-top-k train (gather dispatch + step-level pool bypass)
43df13d fix(autograd): zero the scatter-add accumulator in backward_gather_blocks
9bfe2cc fix(train): flush between grad-accum micro-steps — fixes gradient corruption at seq_len >= 256
d96ba2e fix(metal): quarantine buffer pool by default — fixes silent gradient corruption at seq_len >= 256
83200e4 fix(metal): Flash Attention partial-last-q-block bug (fwd+bwd) + add grad-check & dense-equivalence tests
b248ffc fix(train): loss readout — copy loss BEFORE backward (correct buffer-hazard fix)
a3f21f3 fix(metal): RoPE transpose kernel under-dispatch (corrupted cached decode) + thread-local matmul-path flags
1eb974d fix(train): correct loss readout at large batch (was a degenerate constant 1.0)
9652e00 fix(train,checkpoint): kill optimizer double-allocation + de-flake the hybrid test (self-audit)
9084b07 fix(docs): bf16-matmul is overflow-mitigation only, NOT a default — real-run finding
dc28af6 fix(optim): AdamW eps 1e-8 -> 1e-5 — the second half of the instability fix
70aa3d8 fix(rmsnorm): bound the backward to kill AdamW's instability (gradient explosion → divergence)
48fe541 fix(checkpointing): re-enable gradient checkpointing — root-cause was buffer-pool corruption
9ace2f2 fix(metal): invalidate fp16/ternary conversion cache on buffer reuse (systemic aliasing bug)
9d00376 fix: memory estimation accounts for frozen base weights in ReLoRA mode
f02b789 fix: GGUF export writes actual quantized data for Q8_0, shared_layers naming
edb1d82 fix: param_count accounts for lowrank (ReLoRA) and shared_layers (ALBERT)
931d922 fix: shared_layers (ALBERT) mode returned duplicate params N times
a719b45 fix: dedup generate prefill, use zero-copy as_slice for logit readback
d21249a fix: flash attention backward dK/dV data race — use atomic adds
c2a9c22 fix: DPO optimizer runs batched, recycle gradient/cache buffers
95a8499 fix: SFT optimizer runs batched, recycle gradient/cache buffers
3ad0ca0 fix: recycle ternary cache buffers instead of dropping on clear
a0c13db fix: forward_hidden now respects checkpointed flag
a6a3bbf fix: load_training_state accepts v4 checkpoint format, handles ReLoRA base params

Changed

1c2e0e7 refactor(model): extract embed_lookup; grad-check embedding scatter-add backward
3d737e9 refactor(backend): route shared code through crate::gpu abstraction layer
2097a0e refactor(optim): delete dead GaLore code + correct stale "shipped" doc claims
5a65be4 refactor: eliminate the clippy.toml thresholds with real refactors (no suppression, no allows)
ce0c5c2 refactor: remove ALL #[allow] annotations — fix every warning at the root, no suppression
0def7fc refactor: extract dispatch_backward_op, fix batch matmul, -96 lines

Tests

6ed5b2a test: grad-check broadcast_rows, concat, slice, exp, relu backward
7a4de10 test(runtime): promote serial correctness gates
db4fa85 test(rwkv): grad-check the existing wkv backward + scope chunked RWKV (design done, gated on a primitive)
7959811 test(autograd): grad-check the whole attention kernel path — close the unverified-custom-backward gap
d58b1fa test(optim): ground the beta2=0.999 anomaly (not a bug) + handoff: delivered roadmap
04ba2b2 test(train): end-to-end convergence smoke test — proves the loop actually learns
def000d test: add 7 new tests covering checkpoint, loss, optimizer, quantize, data

Added

86c10b5 feat(train): --yarn-scale CLI flag + train_smoke yarn case + CUDA yarn guard
2494018 feat(rope): activate YaRN from config (with_yarn) end-to-end
c1089e7 feat(rope): YaRN per-frequency RoPE scaling in the fused rope kernels
605a21d feat(safetensors): HF-Llama import/export (load external weights as retrofit init)
08d1103 feat(safetensors): zero-dep export/import with bit-exact round-trip
7bc10d2 feat(eval): long-context NIAH/RULER retrieval+reasoning suite
77f25bc feat(optim): Cautious Muon (--cautious) — sign-agreement update masking
5bc8f64 feat(cuda): flash attention forward + backward (online softmax, tiled)
b5e1a93 feat(cuda): 8-bit (block-wise int8) AdamW optimizer
e339b62 feat(cuda): BitNet ternary matmul/quantize + drop all #[allow] suppressions
15eafc1 feat(cuda): real bf16-tile matmul + bf16 flag
3333ca4 feat(cuda): wire simdgroup matmul flag + variants (alias to tiled/fp32)
e2c5fab feat(cuda): port MoE gather/scatter + precise fp32-tile matmul
3e44d1d feat(cuda): port block-sparse attention + Muon/NorMuon optimizer to CUDA
8905a0a feat(train): wire --ssm / --rwkv / --linear-attn / --linear-attn-period flags
b16d52f feat(cuda): wire matmul batch + fp32 path; forward runs to transpose_rope
b4ced5f feat(cuda): CUDA backend compiles clean (--features cuda, 0 errors)
da17124 feat(bench): --simdgroup-matmul flag so the hardware MMA path is measurable
a6a4c23 feat(ssm): chunked O(seq·chunk) SSD forward — verified equal to the materialised O(seq²) form
e45fd1b feat(attention,autograd): block-sparse TRAINABLE backward (§3) + fix latent non-square batched_matmul_trans_a
a36ba92 feat(attention,model): seq-packing — thread seg_ids through forward (per-document attention)
6b752b8 feat(ci): origin/main poll driver + launchd agent for the Mac CI runner (B4)
c29b75b feat(tests,metal): Phase B GPU-correctness harness + buffer_from_slice cache fix
9d366e1 feat(generate): no-repeat-ngram-size control + land EMA-export and NorMuon batch
b8a82ef feat(attention): MLA incremental decode with latent KV cache (caches c, not K/V)
d5e21cf feat(attention): true-subquadratic block-sparse gather attention (4× fewer FLOPs, 1.65× at seq=1024)
60ab749 feat(attention): per-batch causal_doc_mask; revert broken packing model-integration (kept op-level)
e973e41 feat(rmsnorm): root-cause + grounded test for the activation-collapse instability (#5)
d663815 feat(checkpoint,optim): persist muon/hybrid/8-bit optimizer state across resume
48b93d1 feat(metal): batched simdgroup MMA — extend the hardware fast path to attention matmuls
eae5412 feat(attention): block-sparse attention (MoBA/NSA) — subq.ai's quality-preserving sparse attn
f45cf32 feat(data): sequence packing / varlen — block-diagonal causal mask, no padding waste
c118f9c feat(attention): MLA — Multi-head Latent Attention (16× KV-cache shrink, new AttnKind)
305b2f2 feat(metal): bf16 default-matmul option — fp32 range without the fp16 ±65504 clamp
bfe1b68 feat(optim): 8-bit AdamW — block-wise int8 moments (~4× less optimizer memory)
165fba0 feat(metal): simdgroup_matrix MMA matmul — hardware matrix units (measured 1.29× at 1024³)
aa4a377 feat(optim): Muon+AdamW hybrid (role-aware) + configurable AdamW + update/per-tensor clipping
b2e627c feat(cli): wire generate_batch to generate --batch-file (one prompt per line)
972eead feat(generate): batched multi-sequence generation through one KV cache
76ddbe3 feat(tokenizer): import external BPE from a GPT-2/HF merges.txt
e3821e2 feat(metal): bf16 tiled matmul — fp32 range without the fp16 ±65504 clamp
2022a29 feat(eval,sampling): perplexity metric + min-p / locally-typical sampling
5532ccb feat(metal): dedicated broadcast_rows kernel (replaces the K=1 outer-product matmul) — #5
baedd06 feat(metal): opt-in full-FP32 matmul (precision/range path) — addresses the fp16-tile clamp
2945ff8 feat(model): wire the RWKV time-mix into the Transformer block (closes the mixer set)
b587793 feat(model): wire the SSM (Mamba-2/SSD) mixer into the Transformer block
ccd5268 feat(model): hybrid per-layer topology — alternate transformer & linear-attention layers
128bf76 feat(rwkv): RWKV-6-style time-mix core (token-shift + per-channel WKV)
a4db8f1 feat(ssm): selective state-space (Mamba-2/SSD-style) token mixer core
8780d52 feat(autograd): elementwise exp op (SSM/RWKV selective-decay primitive)
2823fa5 feat(linear-attn): wire O(N) linear attention into the model (config + checkpoint v5)
21c18e6 feat(linear-attn): O(N) chunked-parallel form — the genuine linear-scaling win
c48ef46 feat(autograd): batched_matmul_trans_a op — the linear-attn / SSM state-update primitive
5d13e71 feat(linear-attn): softmax-free linear attention core (Stage A, masked reference)

CUDA

405378b cuda: wire adamw update_clip (kernel ignored it; now clamps normalized update like Metal)
5e9af02 cuda: wire 9 utility kernels + causal_doc_mask true -inf (suite 140->164)
574f5f7 cuda: GQA (repeat_kv) + seq-packing (causal_doc_mask) + rms_norm clamp
79b0800 cuda: CE grad mean-scaling via Rust post-scale + wire ema_update
aa330b1 cuda: fix rms_norm_backward (missing cross-term) + cross_entropy grad scaling
054a036 cuda: fix transpose_perm_backward arg order (was zeroing attention output)
d13a2c9 cuda: fix gpu_l2_norm read count (l2_norm_check writes 2 floats [sum_sq, nan]); read 2 not 1
09f824e cuda: fix scaled_causal_softmax OOB + bind ctx for raw htod (training memcheck-clean)
2231b37 cuda: route decode through primitive FFN path (fused megakernel is Metal-only)
1243c0a cuda: wire dense forward+backward kernel batch (forward runs at 43k tok/s)

Docs

2806f50 docs: CUDA training bring-up — memcheck-clean fwd+bwd, scaled_causal_softmax fix, diagnostic recipe
9c6e548 docs(cuda): record runtime bring-up state + remaining kernel-wiring map
f3d26b9 docs(cuda): scaffold + rented-box playbook for CUDA training-parity
641c354 docs: scope the two remaining #12 throughput items precisely (chunked SSM tractable; MLA absorbed = arch fork)
a6783c9 docs: Phase A verification record (M3/air) — fixes verified, NorMuon characterized
e667601 docs(handoff): drain bleeding & gaps — prioritized plan for a fresh session
bc55404 docs(handoff): round-3 results — block-sparse gather + MLA decode delivered; #6 findings + readout-regression postmortem
a4b3151 docs(handoff): record drained follow-ups (opt-state persist, batched simdgroup, #5 root-cause) + scope remaining
797b3b2 docs: handoff — AdamW hardening roadmap + efficiency techniques (10x capacity @ 1/10 usage)
2762c37 docs(test): update AdamW notes — instability fixed (RMSNorm clamp + eps), Muon kept for speed
4f2ecf2 docs: correct two matmul comments proven false by measurement

Performance

3fe2128 perf(autograd): recycle fresh chained activations in clear_tape (pool reuse 4%->14%)
a9c92ca perf(metal): route batched/attention backward through hardware MMA
20239aa perf(matmul): enable hardware MMA by default for train + inference
e807476 perf(metal): hardware MMA kernels for the backward pass (trans_a/trans_b)
9dfd90c perf: batch-write dataset files instead of per-token syscalls
6b57aca perf: streaming generate uses zero-copy as_slice for decode logits
3448f3b perf: avoid intermediate Vec allocation in clear_tape/clear_tape_keep_grads
d467b34 perf: reduce peak memory during checkpointed backward, increase pool cap
af4f376 perf: clippy audit — eliminate 33 warnings, remove hot-path allocations

v0.1.0 · 2026-03-22 · 1 commit

Other

7c3372d AndreAI: pure Rust AI engine built from zero