// changelog

Releases

Every version of Smedjan, generated from the git history. No ghostwritten release notes — just the commits that shipped.
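
The section headings on this page follow the prefix of each commit subject. A minimal sketch of that grouping, assuming the mapping that can be read off the entries listed here (the name `section_for` and the regular expression are illustrative, not Smedjan's actual tooling):

```python
import re

# Section names as they appear in this changelog, keyed by commit-subject prefix.
# The mapping is inferred from the entries on this page (an assumption),
# not taken from Smedjan's own release tooling.
SECTIONS = {
    "feat": "Added",
    "fix": "Fixed",
    "refactor": "Changed",
    "perf": "Performance",
    "docs": "Docs",
    "test": "Tests",
    "chore": "Chore",
    "rename": "Renamed",
    "cuda": "CUDA",
}

# A prefix is a lowercase word, an optional "(scope)", then a colon.
PREFIX = re.compile(r"^([a-z]+)(?:\([^)]*\))?:")


def section_for(subject: str) -> str:
    """Return the changelog section for one commit subject line."""
    match = PREFIX.match(subject)
    if match is None:
        return "Other"
    return SECTIONS.get(match.group(1), "Other")
```

Under this sketch, subjects without a recognized lowercase `type:` or `type(scope):` prefix, such as `docs+release: ...` or `Fix 5 bugs: ...`, fall through to Other.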

v0.1.3 · 2026-06-28 · 23 commits

Added

  • 6e42661 feat: Rust edition 2024 + un-ignore all tests (v0.1.3)
  • dfedce7 feat: forward_with_cache + RoPE cache + decode benchmark baseline
  • eb45406 feat: RoPE frequency cache — precomputed cos/sin table eliminates per-thread pow+sincos
  • a4edd8d feat: KV-cache in full-attention forward + Muon optimizer + decode kernel in qmul
  • b81afb3 feat: full craft — decode kernel in forward, GPU embedding gather, LoRA on attn, BPE, CLI, self-distill
  • 56e2d4c feat: output-centric quantized decode kernel (3.5x faster) + KV-cache + self-distillation + spec-decode
  • 84fde1f feat(qwen3.5): LoRA fine-tuning — frozen 9B base + trainable low-rank adapters
  • 4a4b58b feat(qwen3.5): quantized forward produces real non-zero logits from 9B weights on Metal
  • e55adf9 feat(qwen3.5): quantized GEMM — 9B Qwythos loads + runs forward on 16GB M1
  • 3a95ebd feat(qwen3.5): strict_qwen35 forward — real activations (softplus+dt_bias+A_log+RMSNormGated)
  • aae43e9 feat(qwen3.5): affine-int4 safetensors loader — 927-tensor Q4 artifact → Qwen35Model
  • c9a538d feat(qwen3.5): full hybrid model forward (Qwen35Model) — verified end-to-end
  • e05084e feat(qwen3.5): full-attention layer (GQA+QK-norm+partial-RoPE+causal+gate) — verified
  • e7793be feat(qwen3.5): config_from_hf_qwen35 parser (Phase 5 start)
  • ba44dfc feat(qwen3.5): full Gated-DeltaNet mixer layer assembly — verified end-to-end
  • 2370309 feat(qwen3.5): asymmetric GQA head expansion (expand_heads) for DeltaNet 16k/32v heads
  • fa08195 feat(qwen3.5): sigmoid (via softmax, no new kernel) + attention output-gate
  • 0d153f0 feat(qwen3.5): causal conv1d + partial-RoPE primitives (composed ops, verified)
  • de13d81 feat(gated-deltanet): Qwen3.5 linear-attention mixer — materialized, composed-ops, both backends

Changed

  • 9030687 refactor: genericize distillation API naming

Docs

  • 55fc089 docs: re-render banner — SMEDJAN wordmark now Zen Dots (was Electrolize), matches the site; cache-bust ?v=3
  • 8b1753f docs: re-render README banner with the correct Zen Dots headline (was a fallback bold sans)

Other

  • bd3d855 readme: cache-bust the banner URL (?v=2) so GitHub camo serves the corrected Zen Dots banner

v0.1.2 · 2026-06-24 · 27 commits

Other

  • 1221930 docs+release: CUDA at full parity (269/269 both backends) + NVIDIA perf; bump v0.1.2
  • de8fcde train: name every forged model (Smedjan-5M, Smedjan-98M, ...)
  • 1612280 train: live Unicode loss sparkline in the progress line

CUDA

  • fcc9aeb cuda: parallelize l2_norm_check (gradient clipping) — multi-block atomicAdd
  • 53dc40a cuda: route grad-weight GEMM (TN) through cuBLAS — free the backward
  • d456e26 cuda: bf16 tensor-core GEMM on the fast path (opt-in via SMEDJAN_BF16_GEMM)
  • 28abef4 cuda: key buffer pool by device instance — fix cross-stream reuse race
  • 84e841b cuda: real buffer pool — eliminate per-op cudaMalloc thrash

Tests

  • e86ea4b test(cuda): gate the fp16-range-clamp asserts to Metal (CUDA fast path is now TF32)
  • 51bdbf1 test(cuda): gate perf-benchmark timing assertions to the Metal backend
  • 90db798 test(cuda): make matmul range tests hermetic (pin fp16 path explicitly)

Added

  • 8b2995c feat(cuda): TF32 tensor cores for the training path (phase 3 — the actual speedup)
  • 622a07a feat(cuda): route precise 2D + batched matmuls through cuBLAS (phase 2)
  • 8f498b1 feat(cuda): wire cuBLAS GEMM (tensor cores) — phase 1: gpu_matmul (NN)
  • 7dd3a16 feat(cuda): port YaRN RoPE to the CUDA backend
  • e0f3b47 feat(train): --checkpoint-interval CLI flag (was hardcoded 5000, no flag)
  • de9cf11 feat(interop): close website-listed gaps — bf16/f16 import, config.json→model, GGUF q4_0; correct mixer status

Fixed

  • 8371b69 fix(cuda): use cudarc result::sgemm + sys::lib().cublasSetMathMode (correct API paths)
  • 0cdcbbd fix(cuda): fp32 batched matmul precise path (batched/block-sparse gradchecks)
  • 5f96e8a fix(cuda): fp32 precise matmul path (fixes gradchecks numeric=0 from fp16 rounding)
  • cf91df5 fix(cuda): remove duplicate kernel definitions that broke NVRTC for the whole backend
  • 7cc12cd fix(gguf): align every tensor data offset to general.alignment (32)

Chore

  • 1eefad0 chore(cuda): turnkey remote CUDA verify script for vast.ai box
  • 61195f2 chore(gitignore): ignore secrets/credentials (vast keys, SSH keys, .env, certs)

Docs

  • 9868aa7 docs(readme): honest GGUF/llama.cpp framing (valid GGML blocks; turnkey inference is roadmap)
  • 892c3fe docs(quantize): export_gguf doc comment lists q4_0 too
  • 779e7a7 docs: forge banner, re-measured performance, zero-warning rustdoc

v0.1.1 · 2026-06-24 · 320 commits

Other

  • 9a2780e readme: logo + title header; honest re-measured performance
  • b7dfd43 style(safetensors): format export import module
  • f3ca531 ci(runtime): lock before syncing checkout
  • 99e8b60 ci(runtime): add train smokes and serialize GPU gates
  • 19caa25 ci(runtime): enforce full Mac readiness gate
  • 84add82 tests: wrap seg buffers in u32_to_buf so causal_doc_mask tests typecheck under CUDA (Buf != BufU32)
  • 385010d bench: fix KV-cache overflow in decode section (both backends)
  • 3d4e788 wip(cuda): backend port — de-objc2 shared code, BufU32, cuda Buf/Arc + stubs
  • 53dcf21 revert(train): the loss-readout "fix" (1eb974d) regressed training — restore correct behavior
  • d7c0357 investigate(checkpointing): f16 fix helps but a separate recompute-gradient bug remains — keep off
  • 653a90f Fused transpose+RoPE kernel + wire scaled_causal_softmax
  • d874a8a Revert inline FP16 cast — pre-cast path 36% faster (bandwidth-bound)
  • 9778967 Lift fused kernel d_model limit + buffer pool recycling
  • 0a5b43c Persistent Metal compute encoder + conditional workspace
  • 55f4d50 Fix checkpoint corruption + NaN gradients + proper ReLoRA
  • 923b7b7 Add repetition penalty to text generation
  • cf7d9b2 Fix BPE chunk split, GPU backend aliases, DPO bounds checks
  • f068b80 GQA strided attention: skip repeat_kv copy for inference
  • 8fbe24e Guard against zero-weight data mixing (division by zero)
  • 713b623 Fix fused CE wrong offset + sample_token panic on empty probs
  • 163a8d9 Fix MoD train/inference mismatch, stale flash comment, softmax 1D
  • 75edead Fix 5 bugs: unicode escapes, Ollama parser, SamplingConfig, inverse_sqrt_lr, GALORE guard
  • 9a97dfc CRITICAL FIX: skip weight decay for norm weights (1D params)
  • 9171a70 Remove mid-step loss spike check (breaks GPU batching)
  • 03090a3 Skip catastrophic loss batches (> 3× EMA) before backward
  • 2fc89a3 Enable embedding gradient via weight-tying (remove detach)
  • 5107f1a Revert embedding detach (NaN from LM head backward) + shuffled loader
  • 160544b Fix embedding gradient + shuffled data loader
  • 4b10f9a CRITICAL FIX: causal_mask now records on autograd tape
  • 04ca936 Fix grow_model: scaled init (0.01×) instead of zero/full random
  • 1de16d5 Pretrained loading + zeroed grow_model for stable progressive training
  • 448b658 Fused kernels + persistent layer + bench command + MegaFfn backward
  • c1d0620 Gate accumulate_grad_for_test behind #[cfg(test)]
  • cd44ce1 8 new unit tests: ReLU, AXPY, WSD, inverse-sqrt, EMA, SliceCols, Muon
  • c450c88 EMA self-distillation via LM head + grow_model lowrank support
  • 2b137d8 grow_model: lowrank + QK-norm + MTP support for progressive training
  • ba88dd4 Remove all _ prefix suppressions — wire every variable properly
  • e1d2785 Loss delta tracking
  • d115bb1 Safetensors export + enhanced training summary + overflow fix
  • 7d365c2 LR schedules (WSO/invsqrt/trapezoid) + best model auto-save
  • 039ebda Training UX: ETA estimation, loss spike detection, training summary, checkpoint merging
  • bccdf60 AXPY in Muon NS loop (3→2 dispatches per iteration)
  • 3fbad81 Muon pre-allocated workspace (+11%), AXPY kernel, DataMixer
  • 961da19 DataMixer + ReLU routing + remaining SIMD + sliding window + freeze
  • a5f65f3 ReLU activation + ReMoE routing + progressive layer freezing + sliding window wired
  • 03b3b16 Wire sliding window into attention + progressive layer freezing
  • 5162d51 Wire sliding window attention into causal mask + scaled_causal_softmax
  • 4b68afe SIMD-optimize softmax_backward + rms_norm_backward (9/11 kernels)
  • 9c1b008 Implement ReLoRA merge + anti-PGD noise + SIMD l2_norm_check
  • 70ff224 Fix z-loss buffer aliasing: pool recycling caused scalar_buf corruption
  • bdff8b7 SIMD-optimize cross_entropy, rms_norm_residual, logsumexp kernels
  • 9bf4d4f SIMD-first reductions: 4x faster on Apple Silicon (softmax, rms_norm, scaled_causal_softmax)
  • 4f0b986 CPU optimizer (Apple Silicon zero-copy) + forward_hidden + apply_lm_head
  • f95b65b Wire FusedLinearCrossEntropy into training loop + forward_hidden
  • 13d6494 Fix speculative pretraining: batch ref forward in no_grad + explicit flush
  • 55c1476 Fix gradient checkpointing: Metal GPU matmul non-determinism
  • 3cd429a Shared-expert MoE (DeepSeek-V3) + Phase B progress
  • cf95659 FusedLinearCrossEntropy: chunked logit computation saving ~2GB peak memory
  • 8a2d92a Complete Phase A: bias-based MoE balancing + ReLoRA config
  • 3694c37 Muon optimizer: 2.5x convergence via Newton-Schulz orthogonalization
  • 0f05eb3 Phase A: WSD schedule + EMA self-distillation + anti-PGD noise config
  • 763afd2 Gate QK-norm at d>=512, revert fused projections (dispatch analysis)
  • 9c06241 Disable z-loss (bug: loss drops to 0.018), verify all configs
  • 0cddb4e FP16 activation compression between transformer layers
  • 162c4ff Mixture of Depths + GGUF export + BitNet INT2 confirmed
  • 0a56c8b Wire --stochastic-depth and --sliding-window CLI args
  • 27e952f Stochastic depth + sliding window attention config
  • 05d9df5 Out-of-place RoPE forward: eliminate copy+in-place (2→1 dispatch)
  • 461bbd4 Z-loss + LogSumExp kernel for MoE training stability
  • 9759836 QK-norm + curriculum learning
  • 059c6e6 Multi-token prediction (Meta 2024): 4x sample efficiency
  • ed24c25 +21% algorithmic throughput: 39,430 → 47,700 tok/s
  • b9f2b0f ALBERT shared layers + multi-token prediction prep
  • 8032ee2 Wire Sophia optimizer into training loop — 2x faster convergence
  • 4d983a0 Sophia optimizer + AIQ analysis (items #2-#7 from research list)
  • 763b6e2 BitNet b1.58: ternary weight caching + output scaling fix
  • bc3b7d2 Flash Attention v2: shared memory K/V tiles (forward + backward)
  • a6ee4de Revert broken Flash v2 merge — backward shader incomplete
  • 643a4e5 WIP: Flash Attention v2 — shared memory tiles + LSE + split backward
  • 0993307 Low-rank attention projections: Q,K,V,O decomposed → 2478 tok/s
  • 4766d72 Embedding factorization + Lion optimizer + sliding window prep
  • c94b90f GALORE + Speculative Pretraining — all 5 PhD compounds done
  • 309dcfa Multi-scale progressive training: grow_model + CLI
  • a8d98c1 Online data pruning + PhD-level compounding framework
  • 416ce62 Low-rank FFN training: 2x throughput on M1 (PhD-level compound)
  • 09f5eea Pre-allocated loss workspace + deep allocation analysis
  • 52145be Update AndreOS backend to match andreos-gpu crate API
  • b00e746 Wire BitNet to training: --bitnet flag for ternary FFN weights
  • 7274d6d FP16 weight cache: avoid redundant float→half casts per step
  • af968d7 Zero warnings: wire all 9 via gpu_diagnostic API
  • 2fcdb0c Wire warnings: 3 new tests, gpu.rs backend-agnostic import
  • 8787280 Phase-by-phase optimization sweep
  • 9304a1a Optimize scale_rows backward + batch size tuning
  • 1766090 MoE: on-tape routing via matmul selector + scale_rows
  • 579316d Fix MoE gradient flow: scale_rows op + on-tape column extraction
  • b9c8f2b Audit: MoE training produces uniform output (known bug)
  • 7ca3be5 Phase 3 + 6: Data quality pipeline + API distillation
  • 75dcbae Phase 4: μP (Maximal Update Parameterization)
  • 6a5cfdd Phase 5: BitNet ternary matmul — add/subtract only, no multiply
  • 2b7abe6 MoE GPU routing: 16x faster with gather/scatter kernels
  • 95603df Phase 2: Mixture of Experts (MoE) — N×params at 1×compute
  • a2c9835 Phase 1 complete: Flash Attention forward + backward
  • 1cb9c2c Phase 1: Flash Attention forward — fused QK^T→mask→softmax→@V
  • 72070bc Add comprehensive README + update project documentation
  • 4b933c8 Add AndreOS backend — zero-overhead direct GPU access
  • a395cc0 CUDA backend: full kernel parity with Metal (46 kernels)
  • b791ea3 Add CUDA backend scaffold — portable training across Metal + NVIDIA
  • d436123 Rust-level optimizations: pre-alloc data buffers + write_u32_to_buffer
  • f895e83 Async optimizer flush + params_buffer aliasing fix + checkpointing verified
  • 8a9e65d Fix misleading comment: clamp is [1,256] not [1,512]
  • b2a5377 Restore FP16 backward — NaN was from buffer recycling, not precision
  • b853333 Fix FP16 NaN: clamp half casts + FP32 backward pass
  • 261ee5d Fix critical grad accum buffer recycling bug + E2E validation
  • 2808a63 NICE TO HAVE: RoPE scaling, warm restarts, file logging, early stopping, GQA backward fix
  • 9d49729 Fix resume off-by-one: start from step+1 after checkpoint
  • 5825f0c Add checkpoint resume, validation loss, dropout config (production-ready)
  • 418137c Revert batched FP16 matmuls — cast overhead hurts small attention dims
  • 25fce52 Complete FP16: all 12 matmul variants use half-precision inputs
  • 14f12be FP16 forward+backward: all non-batched matmuls use half inputs
  • 377ec32 FP16 tensor storage for matmul inputs: cast + half* reads
  • 469d301 FP16 mixed-precision matmul: half shared memory + accumulate
  • 47e2a80 Clean up clip_gradients batching (no perf change)
  • 1da0933 Merge forward+backward into single GPU command batch
  • 7aafda7 Document buffer pool recycling safety (no functional change)
  • 30f1d53 PERF-14: RoPE sincos single instruction + rms_norm_residual_with_sum
  • 8fc1fe7 Fix 3 critical bugs found in manual audit
  • ac8acd7 Add gradient accumulation, GQA, and DPO (1,771 lines)
  • 57b8397 PERF-11: Move loss readback to log interval (eliminate per-step GPU sync)
  • 71ca782 Revert "PERF-1+2: simdgroup matrix intrinsics + bank conflict fix"
  • d671d77 PERF-1+2: simdgroup matrix intrinsics + bank conflict fix
  • 9a285e4 Fix w_norm GPU sync regression: check every 100 steps, not every log step
  • 5c98985 Wire all warnings + audit fix: 0 warnings, 60 tests
  • 6f677e5 Add speculative decoding + knowledge distillation (901 lines)
  • 832787e Revert broken SIMD shaders + params pool from WIP commit
  • b362f29 WIP: Round 2 audit — SIMD shaders, KV cache, sampling optimizations
  • eb8485d Fix BUG-1 concat_parts assertion + BUG-2 loss requires_grad
  • d3e0f8b Fix RISK-4 regression: targeted copy only for shared grad buffers
  • d9243e5 Wire batched matmul shaders: 96 dispatches → 3 per attention layer
  • 94fa132 Fix RISK-4 gradient corruption + RISK-5 memory pressure + batched MSL shaders
  • 48dbd2b Systematic audit fixes: bugs, performance, safety
  • 717aee5 Deep bug audit: fix overflow, add 15 tests (55 total)
  • 39435bd Deep perf audit: batched grad clip, embedding backward, mmap zero-copy, tokenizer early-exit
  • 0f79979 Fix Retained<GpuBuffer> type mismatch in batched gradient clipping
  • 9e1f75c Systematic audit: fix bugs, maximize GPU perf, minimize latency
  • d6f8b47 GPU-resident KV cache concat + Tensor::from_buffer
  • 6bf4419 Buffer pool + fused residual+RMSNorm kernel
  • c0ccda7 Fused SiLU-gate kernel: eliminate 2 dispatches + 1 temp buffer per transformer layer
  • 6ff3fe4 Clean up temp file, add to gitignore
  • 2244205 Major perf: command buffer batching + eliminate all forward-path CPU roundtrips
  • 1fc4871 Fix bugs + GPU-accelerate backward pass: eliminate 14/15 CPU roundtrips
  • bae7bc7 Launch full autopilot pipeline (batch=32, 1774 tok/s)
  • 43b1bfb Add training pipeline automation and monitoring scripts
  • 41a32a2 Prepare 43K SFT instruction pairs for fine-tuning
  • a753ae3 Optimize BPE tokenizer: 27MB in 51s (was hours)
  • a9893ef Add test suite, SFT fine-tuning, quantization, Rust API
  • 606d25d Zero warnings, zero clippy lints — production-grade codebase
  • 66109b3 Clean stale imports from objc2-metal migration and refactors
  • f2a6aee Fix attention gradient flow with batched matmul ops
  • bd68d30 Add data pipeline, gradient checkpointing, training stabilization

Chore

  • 54fe5cf chore: add FUNDING.yml (GitHub Sponsors → smedjan org)
  • 4c58c95 chore: add authors field to Cargo.toml
  • 866ef6b chore: slim crates.io package (exclude dev/CI data + scripts)
  • 498af28 chore: add OSS package metadata (license/repository/readme/keywords)
  • 1e74061 chore: drop useless vec! in matmul_precise test (clippy clean)
  • 5af4fc2 chore: clear all 39 clippy style warnings → clippy/build warning-free
  • 140a96e chore: make cargo clippy pass (pre-existing correctness-lint errors)

Renamed

  • 58397f2 rename: AndreAI -> Smedjan; prepare for open-source release

Fixed

  • 490b11b fix(generate): speculative decoding respects max_tokens; cover untested live paths
  • 88f6676 fix(attention): sliding window attends to exactly w keys, not w+1
  • c49ef70 fix(checkpoint): validate model geometry on load
  • 98b20ca fix(data): reject malformed pipeline inputs without panicking
  • fd76a5d fix(bench): reject zero batch/seq/iters before runtime
  • 91c0b29 fix(grow): validate progressive-growth geometry
  • f856052 fix(convert): harden quantized checkpoint loading
  • bb0c5ba fix(train): validate auxiliary training runtimes
  • 28cfe38 fix(generate): validate sampling args before runtime
  • fdfead1 fix(ci): serialize mac poller reports
  • 6bb7a7c fix(runtime): harden longctx eval and per-sample loss
  • 42f2df2 fix(loss): honor upstream cross entropy gradients
  • 1cdaeb5 fix(train): harden runtime validation and moe routing
  • 7dbdf77 fix(train): validate yarn scale before training
  • 3d8d852 fix(checkpoint): persist yarn rope config
  • ec9fcd1 fix(safetensors): reject malformed imports precisely
  • 68fc3c1 fix(rope): keep yarn fused rope ci-clean and differentiable
  • b2aad2b fix(eval): keep long-context suite ci-clean
  • bed9459 fix(checkpoint): reject malformed artifacts without panics
  • 23bd467 fix(tokenizer): reject malformed tokenizer files instead of panicking
  • 245be58 fix(data): validate mixed shard inputs
  • 89ae410 fix(cli): report runtime input errors without panics
  • 3fa5496 fix(train): reject unknown training config values
  • 92001d8 fix(data): reject malformed training datasets
  • 3078be5 fix(train): resume from final state without skipping steps
  • 68165f2 fix(runtime): restore AdamW resume state across backends
  • 9bdb13e fix(runtime): harden fused training paths
  • db80161 fix(runtime): harden training sanitizer paths
  • e9a8a2d fix(cuda): restore backend parity after cautious muon
  • 652c03c fix(cuda): real fp16 cast pack/unpack — the f16 buffer is f32-backed
  • 39e51e4 fix(rwkv): numerically-stable wkv — RWKV now trains (was loss-flat at seq>=~32)
  • abcdeff fix(attention,train): make --block-sparse-top-k train (gather dispatch + step-level pool bypass)
  • 43df13d fix(autograd): zero the scatter-add accumulator in backward_gather_blocks
  • 9bfe2cc fix(train): flush between grad-accum micro-steps — fixes gradient corruption at seq_len >= 256
  • d96ba2e fix(metal): quarantine buffer pool by default — fixes silent gradient corruption at seq_len >= 256
  • 83200e4 fix(metal): Flash Attention partial-last-q-block bug (fwd+bwd) + add grad-check & dense-equivalence tests
  • b248ffc fix(train): loss readout — copy loss BEFORE backward (correct buffer-hazard fix)
  • a3f21f3 fix(metal): RoPE transpose kernel under-dispatch (corrupted cached decode) + thread-local matmul-path flags
  • 1eb974d fix(train): correct loss readout at large batch (was a degenerate constant 1.0)
  • 9652e00 fix(train,checkpoint): kill optimizer double-allocation + de-flake the hybrid test (self-audit)
  • 9084b07 fix(docs): bf16-matmul is overflow-mitigation only, NOT a default — real-run finding
  • dc28af6 fix(optim): AdamW eps 1e-8 -> 1e-5 — the second half of the instability fix
  • 70aa3d8 fix(rmsnorm): bound the backward to kill AdamW's instability (gradient explosion → divergence)
  • 48fe541 fix(checkpointing): re-enable gradient checkpointing — root-cause was buffer-pool corruption
  • 9ace2f2 fix(metal): invalidate fp16/ternary conversion cache on buffer reuse (systemic aliasing bug)
  • 9d00376 fix: memory estimation accounts for frozen base weights in ReLoRA mode
  • f02b789 fix: GGUF export writes actual quantized data for Q8_0, shared_layers naming
  • edb1d82 fix: param_count accounts for lowrank (ReLoRA) and shared_layers (ALBERT)
  • 931d922 fix: shared_layers (ALBERT) mode returned duplicate params N times
  • a719b45 fix: dedup generate prefill, use zero-copy as_slice for logit readback
  • d21249a fix: flash attention backward dK/dV data race — use atomic adds
  • c2a9c22 fix: DPO optimizer runs batched, recycle gradient/cache buffers
  • 95a8499 fix: SFT optimizer runs batched, recycle gradient/cache buffers
  • 3ad0ca0 fix: recycle ternary cache buffers instead of dropping on clear
  • a0c13db fix: forward_hidden now respects checkpointed flag
  • a6a3bbf fix: load_training_state accepts v4 checkpoint format, handles ReLoRA base params

Changed

  • 1c2e0e7 refactor(model): extract embed_lookup; grad-check embedding scatter-add backward
  • 3d737e9 refactor(backend): route shared code through crate::gpu abstraction layer
  • 2097a0e refactor(optim): delete dead GaLore code + correct stale "shipped" doc claims
  • 5a65be4 refactor: eliminate the clippy.toml thresholds with real refactors (no suppression, no allows)
  • ce0c5c2 refactor: remove ALL #[allow] annotations — fix every warning at the root, no suppression
  • 0def7fc refactor: extract dispatch_backward_op, fix batch matmul, -96 lines

Tests

  • 6ed5b2a test: grad-check broadcast_rows, concat, slice, exp, relu backward
  • 7a4de10 test(runtime): promote serial correctness gates
  • db4fa85 test(rwkv): grad-check the existing wkv backward + scope chunked RWKV (design done, gated on a primitive)
  • 7959811 test(autograd): grad-check the whole attention kernel path — close the unverified-custom-backward gap
  • d58b1fa test(optim): ground the beta2=0.999 anomaly (not a bug) + handoff: delivered roadmap
  • 04ba2b2 test(train): end-to-end convergence smoke test — proves the loop actually learns
  • def000d test: add 7 new tests covering checkpoint, loss, optimizer, quantize, data

Added

  • 86c10b5 feat(train): --yarn-scale CLI flag + train_smoke yarn case + CUDA yarn guard
  • 2494018 feat(rope): activate YaRN from config (with_yarn) end-to-end
  • c1089e7 feat(rope): YaRN per-frequency RoPE scaling in the fused rope kernels
  • 605a21d feat(safetensors): HF-Llama import/export (load external weights as retrofit init)
  • 08d1103 feat(safetensors): zero-dep export/import with bit-exact round-trip
  • 7bc10d2 feat(eval): long-context NIAH/RULER retrieval+reasoning suite
  • 77f25bc feat(optim): Cautious Muon (--cautious) — sign-agreement update masking
  • 5bc8f64 feat(cuda): flash attention forward + backward (online softmax, tiled)
  • b5e1a93 feat(cuda): 8-bit (block-wise int8) AdamW optimizer
  • e339b62 feat(cuda): BitNet ternary matmul/quantize + drop all #[allow] suppressions
  • 15eafc1 feat(cuda): real bf16-tile matmul + bf16 flag
  • 3333ca4 feat(cuda): wire simdgroup matmul flag + variants (alias to tiled/fp32)
  • e2c5fab feat(cuda): port MoE gather/scatter + precise fp32-tile matmul
  • 3e44d1d feat(cuda): port block-sparse attention + Muon/NorMuon optimizer to CUDA
  • 8905a0a feat(train): wire --ssm / --rwkv / --linear-attn / --linear-attn-period flags
  • b16d52f feat(cuda): wire matmul batch + fp32 path; forward runs to transpose_rope
  • b4ced5f feat(cuda): CUDA backend compiles clean (--features cuda, 0 errors)
  • da17124 feat(bench): --simdgroup-matmul flag so the hardware MMA path is measurable
  • a6a4c23 feat(ssm): chunked O(seq·chunk) SSD forward — verified equal to the materialised O(seq²) form
  • e45fd1b feat(attention,autograd): block-sparse TRAINABLE backward (§3) + fix latent non-square batched_matmul_trans_a
  • a36ba92 feat(attention,model): seq-packing — thread seg_ids through forward (per-document attention)
  • 6b752b8 feat(ci): origin/main poll driver + launchd agent for the Mac CI runner (B4)
  • c29b75b feat(tests,metal): Phase B GPU-correctness harness + buffer_from_slice cache fix
  • 9d366e1 feat(generate): no-repeat-ngram-size control + land EMA-export and NorMuon batch
  • b8a82ef feat(attention): MLA incremental decode with latent KV cache (caches c, not K/V)
  • d5e21cf feat(attention): true-subquadratic block-sparse gather attention (4× fewer FLOPs, 1.65× at seq=1024)
  • 60ab749 feat(attention): per-batch causal_doc_mask; revert broken packing model-integration (kept op-level)
  • e973e41 feat(rmsnorm): root-cause + grounded test for the activation-collapse instability (#5)
  • d663815 feat(checkpoint,optim): persist muon/hybrid/8-bit optimizer state across resume
  • 48b93d1 feat(metal): batched simdgroup MMA — extend the hardware fast path to attention matmuls
  • eae5412 feat(attention): block-sparse attention (MoBA/NSA) — subq.ai's quality-preserving sparse attn
  • f45cf32 feat(data): sequence packing / varlen — block-diagonal causal mask, no padding waste
  • c118f9c feat(attention): MLA — Multi-head Latent Attention (16× KV-cache shrink, new AttnKind)
  • 305b2f2 feat(metal): bf16 default-matmul option — fp32 range without the fp16 ±65504 clamp
  • bfe1b68 feat(optim): 8-bit AdamW — block-wise int8 moments (~4× less optimizer memory)
  • 165fba0 feat(metal): simdgroup_matrix MMA matmul — hardware matrix units (measured 1.29× at 1024³)
  • aa4a377 feat(optim): Muon+AdamW hybrid (role-aware) + configurable AdamW + update/per-tensor clipping
  • b2e627c feat(cli): wire generate_batch to generate --batch-file (one prompt per line)
  • 972eead feat(generate): batched multi-sequence generation through one KV cache
  • 76ddbe3 feat(tokenizer): import external BPE from a GPT-2/HF merges.txt
  • e3821e2 feat(metal): bf16 tiled matmul — fp32 range without the fp16 ±65504 clamp
  • 2022a29 feat(eval,sampling): perplexity metric + min-p / locally-typical sampling
  • 5532ccb feat(metal): dedicated broadcast_rows kernel (replaces the K=1 outer-product matmul) — #5
  • baedd06 feat(metal): opt-in full-FP32 matmul (precision/range path) — addresses the fp16-tile clamp
  • 2945ff8 feat(model): wire the RWKV time-mix into the Transformer block (closes the mixer set)
  • b587793 feat(model): wire the SSM (Mamba-2/SSD) mixer into the Transformer block
  • ccd5268 feat(model): hybrid per-layer topology — alternate transformer & linear-attention layers
  • 128bf76 feat(rwkv): RWKV-6-style time-mix core (token-shift + per-channel WKV)
  • a4db8f1 feat(ssm): selective state-space (Mamba-2/SSD-style) token mixer core
  • 8780d52 feat(autograd): elementwise exp op (SSM/RWKV selective-decay primitive)
  • 2823fa5 feat(linear-attn): wire O(N) linear attention into the model (config + checkpoint v5)
  • 21c18e6 feat(linear-attn): O(N) chunked-parallel form — the genuine linear-scaling win
  • c48ef46 feat(autograd): batched_matmul_trans_a op — the linear-attn / SSM state-update primitive
  • 5d13e71 feat(linear-attn): softmax-free linear attention core (Stage A, masked reference)

CUDA

  • 405378b cuda: wire adamw update_clip (kernel ignored it; now clamps normalized update like Metal)
  • 5e9af02 cuda: wire 9 utility kernels + causal_doc_mask true -inf (suite 140->164)
  • 574f5f7 cuda: GQA (repeat_kv) + seq-packing (causal_doc_mask) + rms_norm clamp
  • 79b0800 cuda: CE grad mean-scaling via Rust post-scale + wire ema_update
  • aa330b1 cuda: fix rms_norm_backward (missing cross-term) + cross_entropy grad scaling
  • 054a036 cuda: fix transpose_perm_backward arg order (was zeroing attention output)
  • d13a2c9 cuda: fix gpu_l2_norm read count (l2_norm_check writes 2 floats [sum_sq, nan]); read 2 not 1
  • 09f824e cuda: fix scaled_causal_softmax OOB + bind ctx for raw htod (training memcheck-clean)
  • 2231b37 cuda: route decode through primitive FFN path (fused megakernel is Metal-only)
  • 1243c0a cuda: wire dense forward+backward kernel batch (forward runs at 43k tok/s)

Docs

  • 2806f50 docs: CUDA training bring-up — memcheck-clean fwd+bwd, scaled_causal_softmax fix, diagnostic recipe
  • 9c6e548 docs(cuda): record runtime bring-up state + remaining kernel-wiring map
  • f3d26b9 docs(cuda): scaffold + rented-box playbook for CUDA training-parity
  • 641c354 docs: scope the two remaining #12 throughput items precisely (chunked SSM tractable; MLA absorbed = arch fork)
  • a6783c9 docs: Phase A verification record (M3/air) — fixes verified, NorMuon characterized
  • e667601 docs(handoff): drain bleeding & gaps — prioritized plan for a fresh session
  • bc55404 docs(handoff): round-3 results — block-sparse gather + MLA decode delivered; #6 findings + readout-regression postmortem
  • a4b3151 docs(handoff): record drained follow-ups (opt-state persist, batched simdgroup, #5 root-cause) + scope remaining
  • 797b3b2 docs: handoff — AdamW hardening roadmap + efficiency techniques (10x capacity @ 1/10 usage)
  • 2762c37 docs(test): update AdamW notes — instability fixed (RMSNorm clamp + eps), Muon kept for speed
  • 4f2ecf2 docs: correct two matmul comments proven false by measurement

Performance

  • 3fe2128 perf(autograd): recycle fresh chained activations in clear_tape (pool reuse 4%->14%)
  • a9c92ca perf(metal): route batched/attention backward through hardware MMA
  • 20239aa perf(matmul): enable hardware MMA by default for train + inference
  • e807476 perf(metal): hardware MMA kernels for the backward pass (trans_a/trans_b)
  • 9dfd90c perf: batch-write dataset files instead of per-token syscalls
  • 6b57aca perf: streaming generate uses zero-copy as_slice for decode logits
  • 3448f3b perf: avoid intermediate Vec allocation in clear_tape/clear_tape_keep_grads
  • d467b34 perf: reduce peak memory during checkpointed backward, increase pool cap
  • af4f376 perf: clippy audit — eliminate 33 warnings, remove hot-path allocations

v0.1.0 · 2026-03-22 · 1 commit

Other

  • 7c3372d AndreAI: pure Rust AI engine built from zero