// changelog
Releases
Every version of Smedjan, generated from the git history. No ghostwritten release notes — just the commits that shipped.
v0.1.3 · 2026-06-28 · 23 commits
Added
- 6e42661 feat: Rust edition 2024 + un-ignore all tests (v0.1.3)
- dfedce7 feat: forward_with_cache + RoPE cache + decode benchmark baseline
- eb45406 feat: RoPE frequency cache — precomputed cos/sin table eliminates per-thread pow+sincos
- a4edd8d feat: KV-cache in full-attention forward + Muon optimizer + decode kernel in qmul
- b81afb3 feat: full craft — decode kernel in forward, GPU embedding gather, LoRA on attn, BPE, CLI, self-distill
- 56e2d4c feat: output-centric quantized decode kernel (3.5x faster) + KV-cache + self-distillation + spec-decode
- 84fde1f feat(qwen3.5): LoRA fine-tuning — frozen 9B base + trainable low-rank adapters
- 4a4b58b feat(qwen3.5): quantized forward produces real non-zero logits from 9B weights on Metal
- e55adf9 feat(qwen3.5): quantized GEMM — 9B Qwythos loads + runs forward on 16GB M1
- 3a95ebd feat(qwen3.5): strict_qwen35 forward — real activations (softplus+dt_bias+A_log+RMSNormGated)
- aae43e9 feat(qwen3.5): affine-int4 safetensors loader — 927-tensor Q4 artifact → Qwen35Model
- c9a538d feat(qwen3.5): full hybrid model forward (Qwen35Model) — verified end-to-end
- e05084e feat(qwen3.5): full-attention layer (GQA+QK-norm+partial-RoPE+causal+gate) — verified
- e7793be feat(qwen3.5): config_from_hf_qwen35 parser (Phase 5 start)
- ba44dfc feat(qwen3.5): full Gated-DeltaNet mixer layer assembly — verified end-to-end
- 2370309 feat(qwen3.5): asymmetric GQA head expansion (expand_heads) for DeltaNet 16k/32v heads
- fa08195 feat(qwen3.5): sigmoid (via softmax, no new kernel) + attention output-gate
- 0d153f0 feat(qwen3.5): causal conv1d + partial-RoPE primitives (composed ops, verified)
- de13d81 feat(gated-deltanet): Qwen3.5 linear-attention mixer — materialized, composed-ops, both backends
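The RoPE frequency cache entry (eb45406) describes replacing per-thread pow+sincos with a precomputed cos/sin table. A minimal CPU-side sketch of that idea follows. The names `RopeCache`, `new`, and `apply` are hypothetical, and the interleaved pairing of `(x[2i], x[2i+1])` is an assumption; Smedjan's actual kernels and layout may differ.

```rust
/// Precomputed cos/sin table for rotary position embeddings (RoPE).
/// Hypothetical illustration, not Smedjan's actual API.
struct RopeCache {
    half_dim: usize,
    cos: Vec<f32>, // row-major, max_pos rows of half_dim entries
    sin: Vec<f32>, // same layout as `cos`
}

impl RopeCache {
    /// Build the table once, so the hot path does no pow or sin/cos calls.
    fn new(max_pos: usize, head_dim: usize, theta: f32) -> Self {
        assert!(head_dim % 2 == 0, "head_dim must be even");
        let half_dim = head_dim / 2;
        let mut cos = Vec::with_capacity(max_pos * half_dim);
        let mut sin = Vec::with_capacity(max_pos * half_dim);
        for pos in 0..max_pos {
            for i in 0..half_dim {
                // inv_freq = theta^(-2i / head_dim)
                let inv_freq = theta.powf(-(2.0 * i as f32) / head_dim as f32);
                let (s, c) = (pos as f32 * inv_freq).sin_cos();
                cos.push(c);
                sin.push(s);
            }
        }
        Self { half_dim, cos, sin }
    }

    /// Rotate one head vector in place using table lookups only.
    /// Assumes interleaved pairs (x[2i], x[2i+1]).
    fn apply(&self, x: &mut [f32], pos: usize) {
        assert_eq!(x.len(), 2 * self.half_dim);
        let row = pos * self.half_dim;
        for i in 0..self.half_dim {
            let (c, s) = (self.cos[row + i], self.sin[row + i]);
            let (a, b) = (x[2 * i], x[2 * i + 1]);
            x[2 * i] = a * c - b * s;
            x[2 * i + 1] = a * s + b * c;
        }
    }
}
```

Calling `RopeCache::new(max_pos, head_dim, 10000.0)` once and then `apply` per position trades a small amount of memory for removing the transcendental calls from the inner loop.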
Changed
- 9030687 refactor: genericize distillation API naming
Docs
Other
- bd3d855 readme: cache-bust the banner URL (?v=2) so GitHub camo serves the corrected Zen Dots banner
v0.1.2 · 2026-06-24 · 27 commits
Other
CUDA
- fcc9aeb cuda: parallelize l2_norm_check (gradient clipping) — multi-block atomicAdd
- 53dc40a cuda: route grad-weight GEMM (TN) through cuBLAS — free the backward
- d456e26 cuda: bf16 tensor-core GEMM on the fast path (opt-in via SMEDJAN_BF16_GEMM)
- 28abef4 cuda: key buffer pool by device instance — fix cross-stream reuse race
- 84e841b cuda: real buffer pool — eliminate per-op cudaMalloc thrash
Tests
Added
- 8b2995c feat(cuda): TF32 tensor cores for the training path (phase 3 — the actual speedup)
- 622a07a feat(cuda): route precise 2D + batched matmuls through cuBLAS (phase 2)
- 8f498b1 feat(cuda): wire cuBLAS GEMM (tensor cores) — phase 1: gpu_matmul (NN)
- 7dd3a16 feat(cuda): port YaRN RoPE to the CUDA backend
- e0f3b47 feat(train): --checkpoint-interval CLI flag (was hardcoded 5000, no flag)
- de9cf11 feat(interop): close website-listed gaps — bf16/f16 import, config.json→model, GGUF q4_0; correct mixer status
Fixed
- 8371b69 fix(cuda): use cudarc result::sgemm + sys::lib().cublasSetMathMode (correct API paths)
- 0cdcbbd fix(cuda): fp32 batched matmul precise path (batched/block-sparse gradchecks)
- 5f96e8a fix(cuda): fp32 precise matmul path (fixes gradchecks numeric=0 from fp16 rounding)
- cf91df5 fix(cuda): remove duplicate kernel definitions that broke NVRTC for the whole backend
- 7cc12cd fix(gguf): align every tensor data offset to general.alignment (32)
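The GGUF fix (7cc12cd) aligns each tensor's data offset to `general.alignment`, which is 32 here. The arithmetic can be sketched as below. The helper names `align_up` and `tensor_offsets` are hypothetical and only illustrate the rounding rule, not Smedjan's exporter.

```rust
/// Round `offset` up to the next multiple of `alignment`.
/// Hypothetical helper for illustration.
fn align_up(offset: u64, alignment: u64) -> u64 {
    assert!(alignment > 0, "alignment must be non-zero");
    (offset + alignment - 1) / alignment * alignment
}

/// Assign an aligned data offset to each tensor, given its size in bytes.
/// Offsets are relative to the start of the tensor data region.
fn tensor_offsets(sizes: &[u64], alignment: u64) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut cursor = 0u64;
    for &size in sizes {
        // Pad up to the alignment boundary before placing the next tensor.
        cursor = align_up(cursor, alignment);
        offsets.push(cursor);
        cursor += size;
    }
    offsets
}
```

For example, tensors of 10, 40, and 32 bytes with an alignment of 32 land at offsets 0, 32, and 96.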
Chore
v0.1.1 · 2026-06-24 · 320 commits
Other
- 9a2780e readme: logo + title header; honest re-measured performance
- b7dfd43 style(safetensors): format export import module
- f3ca531 ci(runtime): lock before syncing checkout
- 99e8b60 ci(runtime): add train smokes and serialize GPU gates
- 19caa25 ci(runtime): enforce full Mac readiness gate
- 84add82 tests: wrap seg buffers in u32_to_buf so causal_doc_mask tests typecheck under CUDA (Buf != BufU32)
- 385010d bench: fix KV-cache overflow in decode section (both backends)
- 3d4e788 wip(cuda): backend port — de-objc2 shared code, BufU32, cuda Buf/Arc + stubs
- 53dcf21 revert(train): the loss-readout "fix" (1eb974d) regressed training — restore correct behavior
- d7c0357 investigate(checkpointing): f16 fix helps but a separate recompute-gradient bug remains — keep off
- 653a90f Fused transpose+RoPE kernel + wire scaled_causal_softmax
- d874a8a Revert inline FP16 cast — pre-cast path 36% faster (bandwidth-bound)
- 9778967 Lift fused kernel d_model limit + buffer pool recycling
- 0a5b43c Persistent Metal compute encoder + conditional workspace
- 55f4d50 Fix checkpoint corruption + NaN gradients + proper ReLoRA
- 923b7b7 Add repetition penalty to text generation
- cf7d9b2 Fix BPE chunk split, GPU backend aliases, DPO bounds checks
- f068b80 GQA strided attention: skip repeat_kv copy for inference
- 8fbe24e Guard against zero-weight data mixing (division by zero)
- 713b623 Fix fused CE wrong offset + sample_token panic on empty probs
- 163a8d9 Fix MoD train/inference mismatch, stale flash comment, softmax 1D
- 75edead Fix 5 bugs: unicode escapes, Ollama parser, SamplingConfig, inverse_sqrt_lr, GALORE guard
- 9a97dfc CRITICAL FIX: skip weight decay for norm weights (1D params)
- 9171a70 Remove mid-step loss spike check (breaks GPU batching)
- 03090a3 Skip catastrophic loss batches (> 3× EMA) before backward
- 2fc89a3 Enable embedding gradient via weight-tying (remove detach)
- 5107f1a Revert embedding detach (NaN from LM head backward) + shuffled loader
- 160544b Fix embedding gradient + shuffled data loader
- 4b10f9a CRITICAL FIX: causal_mask now records on autograd tape
- 04ca936 Fix grow_model: scaled init (0.01×) instead of zero/full random
- 1de16d5 Pretrained loading + zeroed grow_model for stable progressive training
- 448b658 Fused kernels + persistent layer + bench command + MegaFfn backward
- c1d0620 Gate accumulate_grad_for_test behind #[cfg(test)]
- cd44ce1 8 new unit tests: ReLU, AXPY, WSD, inverse-sqrt, EMA, SliceCols, Muon
- c450c88 EMA self-distillation via LM head + grow_model lowrank support
- 2b137d8 grow_model: lowrank + QK-norm + MTP support for progressive training
- ba88dd4 Remove all _ prefix suppressions — wire every variable properly
- e1d2785 Loss delta tracking
- d115bb1 Safetensors export + enhanced training summary + overflow fix
- 7d365c2 LR schedules (WSO/invsqrt/trapezoid) + best model auto-save
- 039ebda Training UX: ETA estimation, loss spike detection, training summary, checkpoint merging
- bccdf60 AXPY in Muon NS loop (3→2 dispatches per iteration)
- 3fbad81 Muon pre-allocated workspace (+11%), AXPY kernel, DataMixer
- 961da19 DataMixer + ReLU routing + remaining SIMD + sliding window + freeze
- a5f65f3 ReLU activation + ReMoE routing + progressive layer freezing + sliding window wired
- 03b3b16 Wire sliding window into attention + progressive layer freezing
- 5162d51 Wire sliding window attention into causal mask + scaled_causal_softmax
- 4b68afe SIMD-optimize softmax_backward + rms_norm_backward (9/11 kernels)
- 9c1b008 Implement ReLoRA merge + anti-PGD noise + SIMD l2_norm_check
- 70ff224 Fix z-loss buffer aliasing: pool recycling caused scalar_buf corruption
- bdff8b7 SIMD-optimize cross_entropy, rms_norm_residual, logsumexp kernels
- 9bf4d4f SIMD-first reductions: 4x faster on Apple Silicon (softmax, rms_norm, scaled_causal_softmax)
- 4f0b986 CPU optimizer (Apple Silicon zero-copy) + forward_hidden + apply_lm_head
- f95b65b Wire FusedLinearCrossEntropy into training loop + forward_hidden
- 13d6494 Fix speculative pretraining: batch ref forward in no_grad + explicit flush
- 55c1476 Fix gradient checkpointing: Metal GPU matmul non-determinism
- 3cd429a Shared-expert MoE (DeepSeek-V3) + Phase B progress
- cf95659 FusedLinearCrossEntropy: chunked logit computation saving ~2GB peak memory
- 8a2d92a Complete Phase A: bias-based MoE balancing + ReLoRA config
- 3694c37 Muon optimizer: 2.5x convergence via Newton-Schulz orthogonalization
- 0f05eb3 Phase A: WSD schedule + EMA self-distillation + anti-PGD noise config
- 763afd2 Gate QK-norm at d>=512, revert fused projections (dispatch analysis)
- 9c06241 Disable z-loss (bug: loss drops to 0.018), verify all configs
- 0cddb4e FP16 activation compression between transformer layers
- 162c4ff Mixture of Depths + GGUF export + BitNet INT2 confirmed
- 0a56c8b Wire --stochastic-depth and --sliding-window CLI args
- 27e952f Stochastic depth + sliding window attention config
- 05d9df5 Out-of-place RoPE forward: eliminate copy+in-place (2→1 dispatch)
- 461bbd4 Z-loss + LogSumExp kernel for MoE training stability
- 9759836 QK-norm + curriculum learning
- 059c6e6 Multi-token prediction (Meta 2024): 4x sample efficiency
- ed24c25 +21% algorithmic throughput: 39,430 → 47,700 tok/s
- b9f2b0f ALBERT shared layers + multi-token prediction prep
- 8032ee2 Wire Sophia optimizer into training loop — 2x faster convergence
- 4d983a0 Sophia optimizer + AIQ analysis (items #2-#7 from research list)
- 763b6e2 BitNet b1.58: ternary weight caching + output scaling fix
- bc3b7d2 Flash Attention v2: shared memory K/V tiles (forward + backward)
- a6ee4de Revert broken Flash v2 merge — backward shader incomplete
- 643a4e5 WIP: Flash Attention v2 — shared memory tiles + LSE + split backward
- 0993307 Low-rank attention projections: Q,K,V,O decomposed → 2478 tok/s
- 4766d72 Embedding factorization + Lion optimizer + sliding window prep
- c94b90f GALORE + Speculative Pretraining — all 5 PhD compounds done
- 309dcfa Multi-scale progressive training: grow_model + CLI
- a8d98c1 Online data pruning + PhD-level compounding framework
- 416ce62 Low-rank FFN training: 2x throughput on M1 (PhD-level compound)
- 09f5eea Pre-allocated loss workspace + deep allocation analysis
- 52145be Update AndreOS backend to match andreos-gpu crate API
- b00e746 Wire BitNet to training: --bitnet flag for ternary FFN weights
- 7274d6d FP16 weight cache: avoid redundant float→half casts per step
- af968d7 Zero warnings: wire all 9 via gpu_diagnostic API
- 2fcdb0c Wire warnings: 3 new tests, gpu.rs backend-agnostic import
- 8787280 Phase-by-phase optimization sweep
- 9304a1a Optimize scale_rows backward + batch size tuning
- 1766090 MoE: on-tape routing via matmul selector + scale_rows
- 579316d Fix MoE gradient flow: scale_rows op + on-tape column extraction
- b9c8f2b Audit: MoE training produces uniform output (known bug)
- 7ca3be5 Phase 3 + 6: Data quality pipeline + API distillation
- 75dcbae Phase 4: μP (Maximal Update Parameterization)
- 6a5cfdd Phase 5: BitNet ternary matmul — add/subtract only, no multiply
- 2b7abe6 MoE GPU routing: 16x faster with gather/scatter kernels
- 95603df Phase 2: Mixture of Experts (MoE) — N×params at 1×compute
- a2c9835 Phase 1 complete: Flash Attention forward + backward
- 1cb9c2c Phase 1: Flash Attention forward — fused QK^T→mask→softmax→@V
- 72070bc Add comprehensive README + update project documentation
- 4b933c8 Add AndreOS backend — zero-overhead direct GPU access
- a395cc0 CUDA backend: full kernel parity with Metal (46 kernels)
- b791ea3 Add CUDA backend scaffold — portable training across Metal + NVIDIA
- d436123 Rust-level optimizations: pre-alloc data buffers + write_u32_to_buffer
- f895e83 Async optimizer flush + params_buffer aliasing fix + checkpointing verified
- 8a9e65d Fix misleading comment: clamp is [1,256] not [1,512]
- b2a5377 Restore FP16 backward — NaN was from buffer recycling, not precision
- b853333 Fix FP16 NaN: clamp half casts + FP32 backward pass
- 261ee5d Fix critical grad accum buffer recycling bug + E2E validation
- 2808a63 NICE TO HAVE: RoPE scaling, warm restarts, file logging, early stopping, GQA backward fix
- 9d49729 Fix resume off-by-one: start from step+1 after checkpoint
- 5825f0c Add checkpoint resume, validation loss, dropout config (production-ready)
- 418137c Revert batched FP16 matmuls — cast overhead hurts small attention dims
- 25fce52 Complete FP16: all 12 matmul variants use half-precision inputs
- 14f12be FP16 forward+backward: all non-batched matmuls use half inputs
- 377ec32 FP16 tensor storage for matmul inputs: cast + half* reads
- 469d301 FP16 mixed-precision matmul: half shared memory + accumulate
- 47e2a80 Clean up clip_gradients batching (no perf change)
- 1da0933 Merge forward+backward into single GPU command batch
- 7aafda7 Document buffer pool recycling safety (no functional change)
- 30f1d53 PERF-14: RoPE sincos single instruction + rms_norm_residual_with_sum
- 8fc1fe7 Fix 3 critical bugs found in manual audit
- ac8acd7 Add gradient accumulation, GQA, and DPO (1,771 lines)
- 57b8397 PERF-11: Move loss readback to log interval (eliminate per-step GPU sync)
- 71ca782 Revert "PERF-1+2: simdgroup matrix intrinsics + bank conflict fix"
- d671d77 PERF-1+2: simdgroup matrix intrinsics + bank conflict fix
- 9a285e4 Fix w_norm GPU sync regression: check every 100 steps, not every log step
- 5c98985 Wire all warnings + audit fix: 0 warnings, 60 tests
- 6f677e5 Add speculative decoding + knowledge distillation (901 lines)
- 832787e Revert broken SIMD shaders + params pool from WIP commit
- b362f29 WIP: Round 2 audit — SIMD shaders, KV cache, sampling optimizations
- eb8485d Fix BUG-1 concat_parts assertion + BUG-2 loss requires_grad
- d3e0f8b Fix RISK-4 regression: targeted copy only for shared grad buffers
- d9243e5 Wire batched matmul shaders: 96 dispatches → 3 per attention layer
- 94fa132 Fix RISK-4 gradient corruption + RISK-5 memory pressure + batched MSL shaders
- 48dbd2b Systematic audit fixes: bugs, performance, safety
- 717aee5 Deep bug audit: fix overflow, add 15 tests (55 total)
- 39435bd Deep perf audit: batched grad clip, embedding backward, mmap zero-copy, tokenizer early-exit
- 0f79979 Fix Retained<GpuBuffer> type mismatch in batched gradient clipping
- 9e1f75c Systematic audit: fix bugs, maximize GPU perf, minimize latency
- d6f8b47 GPU-resident KV cache concat + Tensor::from_buffer
- 6bf4419 Buffer pool + fused residual+RMSNorm kernel
- c0ccda7 Fused SiLU-gate kernel: eliminate 2 dispatches + 1 temp buffer per transformer layer
- 6ff3fe4 Clean up temp file, add to gitignore
- 2244205 Major perf: command buffer batching + eliminate all forward-path CPU roundtrips
- 1fc4871 Fix bugs + GPU-accelerate backward pass: eliminate 14/15 CPU roundtrips
- bae7bc7 Launch full autopilot pipeline (batch=32, 1774 tok/s)
- 43b1bfb Add training pipeline automation and monitoring scripts
- 41a32a2 Prepare 43K SFT instruction pairs for fine-tuning
- a753ae3 Optimize BPE tokenizer: 27MB in 51s (was hours)
- a9893ef Add test suite, SFT fine-tuning, quantization, Rust API
- 606d25d Zero warnings, zero clippy lints — production-grade codebase
- 66109b3 Clean stale imports from objc2-metal migration and refactors
- f2a6aee Fix attention gradient flow with batched matmul ops
- bd68d30 Add data pipeline, gradient checkpointing, training stabilization
Chore
- 54fe5cf chore: add FUNDING.yml (GitHub Sponsors → smedjan org)
- 4c58c95 chore: add authors field to Cargo.toml
- 866ef6b chore: slim crates.io package (exclude dev/CI data + scripts)
- 498af28 chore: add OSS package metadata (license/repository/readme/keywords)
- 1e74061 chore: drop useless vec! in matmul_precise test (clippy clean)
- 5af4fc2 chore: clear all 39 clippy style warnings → clippy/build warning-free
- 140a96e chore: make cargo clippy pass (pre-existing correctness-lint errors)
Renamed
- 58397f2 rename: AndreAI -> Smedjan; prepare for open-source release
Fixed
- 490b11b fix(generate): speculative decoding respects max_tokens; cover untested live paths
- 88f6676 fix(attention): sliding window attends to exactly w keys, not w+1
- c49ef70 fix(checkpoint): validate model geometry on load
- 98b20ca fix(data): reject malformed pipeline inputs without panicking
- fd76a5d fix(bench): reject zero batch/seq/iters before runtime
- 91c0b29 fix(grow): validate progressive-growth geometry
- f856052 fix(convert): harden quantized checkpoint loading
- bb0c5ba fix(train): validate auxiliary training runtimes
- 28cfe38 fix(generate): validate sampling args before runtime
- fdfead1 fix(ci): serialize mac poller reports
- 6bb7a7c fix(runtime): harden longctx eval and per-sample loss
- 42f2df2 fix(loss): honor upstream cross entropy gradients
- 1cdaeb5 fix(train): harden runtime validation and moe routing
- 7dbdf77 fix(train): validate yarn scale before training
- 3d8d852 fix(checkpoint): persist yarn rope config
- ec9fcd1 fix(safetensors): reject malformed imports precisely
- 68fc3c1 fix(rope): keep yarn fused rope ci-clean and differentiable
- b2aad2b fix(eval): keep long-context suite ci-clean
- bed9459 fix(checkpoint): reject malformed artifacts without panics
- 23bd467 fix(tokenizer): reject malformed tokenizer files instead of panicking
- 245be58 fix(data): validate mixed shard inputs
- 89ae410 fix(cli): report runtime input errors without panics
- 3fa5496 fix(train): reject unknown training config values
- 92001d8 fix(data): reject malformed training datasets
- 3078be5 fix(train): resume from final state without skipping steps
- 68165f2 fix(runtime): restore AdamW resume state across backends
- 9bdb13e fix(runtime): harden fused training paths
- db80161 fix(runtime): harden training sanitizer paths
- e9a8a2d fix(cuda): restore backend parity after cautious muon
- 652c03c fix(cuda): real fp16 cast pack/unpack — the f16 buffer is f32-backed
- 39e51e4 fix(rwkv): numerically-stable wkv — RWKV now trains (was loss-flat at seq>=~32)
- abcdeff fix(attention,train): make --block-sparse-top-k train (gather dispatch + step-level pool bypass)
- 43df13d fix(autograd): zero the scatter-add accumulator in backward_gather_blocks
- 9bfe2cc fix(train): flush between grad-accum micro-steps — fixes gradient corruption at seq_len >= 256
- d96ba2e fix(metal): quarantine buffer pool by default — fixes silent gradient corruption at seq_len >= 256
- 83200e4 fix(metal): Flash Attention partial-last-q-block bug (fwd+bwd) + add grad-check & dense-equivalence tests
- b248ffc fix(train): loss readout — copy loss BEFORE backward (correct buffer-hazard fix)
- a3f21f3 fix(metal): RoPE transpose kernel under-dispatch (corrupted cached decode) + thread-local matmul-path flags
- 1eb974d fix(train): correct loss readout at large batch (was a degenerate constant 1.0)
- 9652e00 fix(train,checkpoint): kill optimizer double-allocation + de-flake the hybrid test (self-audit)
- 9084b07 fix(docs): bf16-matmul is overflow-mitigation only, NOT a default — real-run finding
- dc28af6 fix(optim): AdamW eps 1e-8 -> 1e-5 — the second half of the instability fix
- 70aa3d8 fix(rmsnorm): bound the backward to kill AdamW's instability (gradient explosion → divergence)
- 48fe541 fix(checkpointing): re-enable gradient checkpointing — root-cause was buffer-pool corruption
- 9ace2f2 fix(metal): invalidate fp16/ternary conversion cache on buffer reuse (systemic aliasing bug)
- 9d00376 fix: memory estimation accounts for frozen base weights in ReLoRA mode
- f02b789 fix: GGUF export writes actual quantized data for Q8_0, shared_layers naming
- edb1d82 fix: param_count accounts for lowrank (ReLoRA) and shared_layers (ALBERT)
- 931d922 fix: shared_layers (ALBERT) mode returned duplicate params N times
- a719b45 fix: dedup generate prefill, use zero-copy as_slice for logit readback
- d21249a fix: flash attention backward dK/dV data race — use atomic adds
- c2a9c22 fix: DPO optimizer runs batched, recycle gradient/cache buffers
- 95a8499 fix: SFT optimizer runs batched, recycle gradient/cache buffers
- 3ad0ca0 fix: recycle ternary cache buffers instead of dropping on clear
- a0c13db fix: forward_hidden now respects checkpointed flag
- a6a3bbf fix: load_training_state accepts v4 checkpoint format, handles ReLoRA base params
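The sliding-window fix (88f6676) is a typical off-by-one: a causal window of width w should admit exactly w keys. A minimal sketch of such a predicate follows. The function name `in_window` is hypothetical, and counting the query's own position as one of the w keys is an assumption about the intended semantics.

```rust
/// True if query position `q` may attend to key position `k` under a
/// causal sliding window of width `w`. The query's own position counts
/// as one of the `w` keys (assumption). Hypothetical helper.
fn in_window(q: usize, k: usize, w: usize) -> bool {
    // `k <= q` enforces causality and guards the subtraction against underflow.
    // Using `q - k <= w` here instead would admit w + 1 keys.
    k <= q && q - k < w
}
```

With q = 10 and w = 4 the predicate accepts keys 7 through 10, and near the start of the sequence it accepts fewer keys because fewer exist.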
Changed
- 1c2e0e7 refactor(model): extract embed_lookup; grad-check embedding scatter-add backward
- 3d737e9 refactor(backend): route shared code through crate::gpu abstraction layer
- 2097a0e refactor(optim): delete dead GaLore code + correct stale "shipped" doc claims
- 5a65be4 refactor: eliminate the clippy.toml thresholds with real refactors (no suppression, no allows)
- ce0c5c2 refactor: remove ALL #[allow] annotations — fix every warning at the root, no suppression
- 0def7fc refactor: extract dispatch_backward_op, fix batch matmul, -96 lines
Tests
- 6ed5b2a test: grad-check broadcast_rows, concat, slice, exp, relu backward
- 7a4de10 test(runtime): promote serial correctness gates
- db4fa85 test(rwkv): grad-check the existing wkv backward + scope chunked RWKV (design done, gated on a primitive)
- 7959811 test(autograd): grad-check the whole attention kernel path — close the unverified-custom-backward gap
- d58b1fa test(optim): ground the beta2=0.999 anomaly (not a bug) + handoff: delivered roadmap
- 04ba2b2 test(train): end-to-end convergence smoke test — proves the loop actually learns
- def000d test: add 7 new tests covering checkpoint, loss, optimizer, quantize, data
Added
- 86c10b5 feat(train): --yarn-scale CLI flag + train_smoke yarn case + CUDA yarn guard
- 2494018 feat(rope): activate YaRN from config (with_yarn) end-to-end
- c1089e7 feat(rope): YaRN per-frequency RoPE scaling in the fused rope kernels
- 605a21d feat(safetensors): HF-Llama import/export (load external weights as retrofit init)
- 08d1103 feat(safetensors): zero-dep export/import with bit-exact round-trip
- 7bc10d2 feat(eval): long-context NIAH/RULER retrieval+reasoning suite
- 77f25bc feat(optim): Cautious Muon (--cautious) — sign-agreement update masking
- 5bc8f64 feat(cuda): flash attention forward + backward (online softmax, tiled)
- b5e1a93 feat(cuda): 8-bit (block-wise int8) AdamW optimizer
- e339b62 feat(cuda): BitNet ternary matmul/quantize + drop all #[allow] suppressions
- 15eafc1 feat(cuda): real bf16-tile matmul + bf16 flag
- 3333ca4 feat(cuda): wire simdgroup matmul flag + variants (alias to tiled/fp32)
- e2c5fab feat(cuda): port MoE gather/scatter + precise fp32-tile matmul
- 3e44d1d feat(cuda): port block-sparse attention + Muon/NorMuon optimizer to CUDA
- 8905a0a feat(train): wire --ssm / --rwkv / --linear-attn / --linear-attn-period flags
- b16d52f feat(cuda): wire matmul batch + fp32 path; forward runs to transpose_rope
- b4ced5f feat(cuda): CUDA backend compiles clean (--features cuda, 0 errors)
- da17124 feat(bench): --simdgroup-matmul flag so the hardware MMA path is measurable
- a6a4c23 feat(ssm): chunked O(seq·chunk) SSD forward — verified equal to the materialised O(seq²) form
- e45fd1b feat(attention,autograd): block-sparse TRAINABLE backward (§3) + fix latent non-square batched_matmul_trans_a
- a36ba92 feat(attention,model): seq-packing — thread seg_ids through forward (per-document attention)
- 6b752b8 feat(ci): origin/main poll driver + launchd agent for the Mac CI runner (B4)
- c29b75b feat(tests,metal): Phase B GPU-correctness harness + buffer_from_slice cache fix
- 9d366e1 feat(generate): no-repeat-ngram-size control + land EMA-export and NorMuon batch
- b8a82ef feat(attention): MLA incremental decode with latent KV cache (caches c, not K/V)
- d5e21cf feat(attention): true-subquadratic block-sparse gather attention (4× fewer FLOPs, 1.65× at seq=1024)
- 60ab749 feat(attention): per-batch causal_doc_mask; revert broken packing model-integration (kept op-level)
- e973e41 feat(rmsnorm): root-cause + grounded test for the activation-collapse instability (#5)
- d663815 feat(checkpoint,optim): persist muon/hybrid/8-bit optimizer state across resume
- 48b93d1 feat(metal): batched simdgroup MMA — extend the hardware fast path to attention matmuls
- eae5412 feat(attention): block-sparse attention (MoBA/NSA) — subq.ai's quality-preserving sparse attn
- f45cf32 feat(data): sequence packing / varlen — block-diagonal causal mask, no padding waste
- c118f9c feat(attention): MLA — Multi-head Latent Attention (16× KV-cache shrink, new AttnKind)
- 305b2f2 feat(metal): bf16 default-matmul option — fp32 range without the fp16 ±65504 clamp
- bfe1b68 feat(optim): 8-bit AdamW — block-wise int8 moments (~4× less optimizer memory)
- 165fba0 feat(metal): simdgroup_matrix MMA matmul — hardware matrix units (measured 1.29× at 1024³)
- aa4a377 feat(optim): Muon+AdamW hybrid (role-aware) + configurable AdamW + update/per-tensor clipping
- b2e627c feat(cli): wire generate_batch to generate --batch-file (one prompt per line)
- 972eead feat(generate): batched multi-sequence generation through one KV cache
- 76ddbe3 feat(tokenizer): import external BPE from a GPT-2/HF merges.txt
- e3821e2 feat(metal): bf16 tiled matmul — fp32 range without the fp16 ±65504 clamp
- 2022a29 feat(eval,sampling): perplexity metric + min-p / locally-typical sampling
- 5532ccb feat(metal): dedicated broadcast_rows kernel (replaces the K=1 outer-product matmul) — #5
- baedd06 feat(metal): opt-in full-FP32 matmul (precision/range path) — addresses the fp16-tile clamp
- 2945ff8 feat(model): wire the RWKV time-mix into the Transformer block (closes the mixer set)
- b587793 feat(model): wire the SSM (Mamba-2/SSD) mixer into the Transformer block
- ccd5268 feat(model): hybrid per-layer topology — alternate transformer & linear-attention layers
- 128bf76 feat(rwkv): RWKV-6-style time-mix core (token-shift + per-channel WKV)
- a4db8f1 feat(ssm): selective state-space (Mamba-2/SSD-style) token mixer core
- 8780d52 feat(autograd): elementwise exp op (SSM/RWKV selective-decay primitive)
- 2823fa5 feat(linear-attn): wire O(N) linear attention into the model (config + checkpoint v5)
- 21c18e6 feat(linear-attn): O(N) chunked-parallel form — the genuine linear-scaling win
- c48ef46 feat(autograd): batched_matmul_trans_a op — the linear-attn / SSM state-update primitive
- 5d13e71 feat(linear-attn): softmax-free linear attention core (Stage A, masked reference)
CUDA
- 405378b cuda: wire adamw update_clip (kernel ignored it; now clamps normalized update like Metal)
- 5e9af02 cuda: wire 9 utility kernels + causal_doc_mask true -inf (suite 140->164)
- 574f5f7 cuda: GQA (repeat_kv) + seq-packing (causal_doc_mask) + rms_norm clamp
- 79b0800 cuda: CE grad mean-scaling via Rust post-scale + wire ema_update
- aa330b1 cuda: fix rms_norm_backward (missing cross-term) + cross_entropy grad scaling
- 054a036 cuda: fix transpose_perm_backward arg order (was zeroing attention output)
- d13a2c9 cuda: fix gpu_l2_norm read count (l2_norm_check writes 2 floats [sum_sq, nan]); read 2 not 1
- 09f824e cuda: fix scaled_causal_softmax OOB + bind ctx for raw htod (training memcheck-clean)
- 2231b37 cuda: route decode through primitive FFN path (fused megakernel is Metal-only)
- 1243c0a cuda: wire dense forward+backward kernel batch (forward runs at 43k tok/s)
Docs
- 2806f50 docs: CUDA training bring-up — memcheck-clean fwd+bwd, scaled_causal_softmax fix, diagnostic recipe
- 9c6e548 docs(cuda): record runtime bring-up state + remaining kernel-wiring map
- f3d26b9 docs(cuda): scaffold + rented-box playbook for CUDA training-parity
- 641c354 docs: scope the two remaining #12 throughput items precisely (chunked SSM tractable; MLA absorbed = arch fork)
- a6783c9 docs: Phase A verification record (M3/air) — fixes verified, NorMuon characterized
- e667601 docs(handoff): drain bleeding & gaps — prioritized plan for a fresh session
- bc55404 docs(handoff): round-3 results — block-sparse gather + MLA decode delivered; #6 findings + readout-regression postmortem
- a4b3151 docs(handoff): record drained follow-ups (opt-state persist, batched simdgroup, #5 root-cause) + scope remaining
- 797b3b2 docs: handoff — AdamW hardening roadmap + efficiency techniques (10x capacity @ 1/10 usage)
- 2762c37 docs(test): update AdamW notes — instability fixed (RMSNorm clamp + eps), Muon kept for speed
- 4f2ecf2 docs: correct two matmul comments proven false by measurement
Performance
- 3fe2128 perf(autograd): recycle fresh chained activations in clear_tape (pool reuse 4%->14%)
- a9c92ca perf(metal): route batched/attention backward through hardware MMA
- 20239aa perf(matmul): enable hardware MMA by default for train + inference
- e807476 perf(metal): hardware MMA kernels for the backward pass (trans_a/trans_b)
- 9dfd90c perf: batch-write dataset files instead of per-token syscalls
- 6b57aca perf: streaming generate uses zero-copy as_slice for decode logits
- 3448f3b perf: avoid intermediate Vec allocation in clear_tape/clear_tape_keep_grads
- d467b34 perf: reduce peak memory during checkpointed backward, increase pool cap
- af4f376 perf: clippy audit — eliminate 33 warnings, remove hot-path allocations
v0.1.0 · 2026-03-22 · 1 commit
Other
- 7c3372d AndreAI: pure Rust AI engine built from zero