Troubleshooting & FAQ
Common failures and how to clear them, a plain account of what is and isn't done yet, and quick answers.
Troubleshooting
Loss is NaN or diverging
Gradient clipping with NaN/Inf detection is on by default. If it still blows up: lower --lr, increase --warmup, and check your data. --bf16-matmul exists for genuine FP16 overflow, but it has coarser precision and can destabilize an otherwise-healthy run — reach for it last.
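For readers unfamiliar with the technique, here is a minimal sketch of global-norm gradient clipping with a non-finite check. It is an illustration of the general idea, not Smedjan's actual implementation: the names `clip_grad_norm` and `ClipOutcome`, and the choice to work on a flat `f32` buffer on the CPU, are assumptions made for this example.

```rust
/// Outcome of one clipping pass over a flat gradient buffer.
/// (Hypothetical type for illustration; not taken from Smedjan's source.)
#[derive(Debug, PartialEq)]
pub enum ClipOutcome {
    /// The global norm was already within `max_norm`; gradients are untouched.
    Unchanged,
    /// Gradients were scaled down; holds the norm measured before clipping.
    Clipped(f32),
    /// A NaN or Inf was found; the caller would typically skip this step.
    NonFinite,
}

/// Clip `grads` in place so that their global L2 norm is at most `max_norm`.
pub fn clip_grad_norm(grads: &mut [f32], max_norm: f32) -> ClipOutcome {
    // Accumulate in f64 so the sum of squares is less likely to overflow.
    let mut sum_sq = 0.0f64;
    for &g in grads.iter() {
        if !g.is_finite() {
            // Bail out before touching anything: scaling NaN/Inf cannot repair it.
            return ClipOutcome::NonFinite;
        }
        sum_sq += (g as f64) * (g as f64);
    }

    let norm = sum_sq.sqrt() as f32;
    if norm <= max_norm {
        return ClipOutcome::Unchanged;
    }

    // Rescale every component by the same factor so the direction is preserved.
    let scale = max_norm / norm;
    for g in grads.iter_mut() {
        *g *= scale;
    }
    ClipOutcome::Clipped(norm)
}
```

A training loop built this way would drop the optimizer update whenever it sees `NonFinite`. If clipping fires on most steps, that usually points at the learning rate or the data rather than at the clipping itself.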
Out of memory
In order: lower --batch-size / --seq-len, add --grad-accum, turn on --gradient-checkpointing, then --fused-ce and --fp16-activations. See Performance & tuning.
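The reason --grad-accum pairs well with a smaller --batch-size can be shown with a little arithmetic. The sketch below assumes --grad-accum has its conventional meaning (the number of micro-batches whose gradients are summed before one optimizer step); the function names are hypothetical and the "activation memory scales with tokens per micro-batch" rule is a rough approximation, not a measurement of Smedjan.

```rust
/// Tokens processed in a single forward/backward pass.
/// Activation memory grows roughly in proportion to this number.
pub fn tokens_per_micro_batch(batch_size: usize, seq_len: usize) -> usize {
    batch_size * seq_len
}

/// Tokens that contribute to one optimizer step, assuming gradients from
/// `grad_accum` micro-batches are accumulated before the update.
pub fn tokens_per_step(batch_size: usize, seq_len: usize, grad_accum: usize) -> usize {
    tokens_per_micro_batch(batch_size, seq_len) * grad_accum
}
```

Under these assumptions, going from a batch size of 32 with no accumulation to a batch size of 8 with 4 accumulation steps at sequence length 1024 keeps 32,768 tokens per optimizer step while cutting the tokens held in memory at once from 32,768 to 8,192.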
Throughput is terrible
Build with cargo build --release. The hardware simdgroup matmul is on by default; a debug build or a fallback path is many times slower.
CUDA build
Install the CUDA toolkit (12.x) and build with cargo build --release --no-default-features --features cuda. The Metal path is the default on macOS and needs no setup.
Limitations & roadmap
Smedjan is one engineer's engine, and it says so. The honest state:
- safetensors import reads F32, BF16, and F16, and config.json maps straight to a Smedjan model via import-hf. Export works too.
- GGUF export covers f32, q8_0, and q4_0 as standard GGML blocks (norms stay f32).
- Faithful bit-exact HuggingFace inference parity is still on the roadmap. The config.json → model + BF16/F16 import path works for continued training; reproducing HF inference to the bit (half-split RoPE, fixed QK-norm) is a separate, deliberate divergence that continued training adapts away.
- RWKV and block-sparse both train. The RWKV WKV now uses a numerically-stable decay form and converges at long sequence; block-sparse trains like dense. --ssm and --linear-attn train as well.
- CUDA backward parity for a few specialized kernels is still being completed; the Metal path is the most exercised.
- Long-context evaluation (NIAH / RULER) ships: run smedjan eval --longctx. The strength of the curve depends on how well the model was trained.
FAQ
Does it need Python or PyTorch?
No. The entire dependency tree is a handful of small crates (clap, rand, memmap2, byteorder) plus the GPU FFI bindings.
Can I move a checkpoint between Metal and CUDA?
Yes — the checkpoint format is portable across both backends. Train on a Mac, resume on an NVIDIA GPU.
What's the smallest useful model?
tiny (~2M parameters) or small is enough to prove the pipeline end to end and to build tiny on-device models. Scale up from there.
Can I export to GGUF / run under llama.cpp?
Partly. Export standard GGML weights with smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q4_0, where --quant is one of f32, q8_0, or q4_0. The blocks are validated against the reference GGUF dequantizer, but a Smedjan checkpoint is not yet a turnkey llama.cpp inference model: the tokenizer isn't embedded and the RoPE/QK-norm conventions differ. Direct llama.cpp inference is on the roadmap.
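To make "standard GGML blocks" concrete, here is a small sketch of how a q8_0 block is formed: 32 values share one scale, and each value is stored as a signed 8-bit integer. This follows the publicly documented GGML q8_0 scheme as best understood, not Smedjan's exporter code. The names `Q8Block`, `quantize_q8_0`, and `dequantize_q8_0` are hypothetical, and the scale is kept as `f32` here for simplicity, whereas the on-disk format stores it as a 16-bit float.

```rust
/// Number of values per q8_0 block in GGML.
pub const QK8_0: usize = 32;

/// One quantized block (illustrative layout, not the on-disk byte format).
#[derive(Debug, Clone, PartialEq)]
pub struct Q8Block {
    /// Per-block scale. GGML stores this as f16; f32 is a simplification here.
    pub d: f32,
    /// Quantized values in the range [-127, 127].
    pub qs: [i8; QK8_0],
}

/// Quantize 32 floats into one block: scale = max |x| / 127, q = round(x / scale).
pub fn quantize_q8_0(x: &[f32; QK8_0]) -> Q8Block {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    // Guard against an all-zero block, where the scale would be zero.
    let inv_d = if d != 0.0 { 1.0 / d } else { 0.0 };

    let mut qs = [0i8; QK8_0];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv_d).round() as i8;
    }
    Q8Block { d, qs }
}

/// Reconstruct approximate floats from a block: x ≈ q * scale.
pub fn dequantize_q8_0(block: &Q8Block) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (o, q) in out.iter_mut().zip(block.qs.iter()) {
        *o = f32::from(*q) * block.d;
    }
    out
}
```

Round-tripping a block through these two functions recovers each value to within about half a scale step, which is the kind of property a check against a reference dequantizer would exercise.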
License?
MIT. Own it, fork it, ship it.