// reference

Troubleshooting & FAQ

Common failures and how to clear them, a plain account of what is and isn't done yet, and quick answers.

// 01

Troubleshooting

the usual suspects

Loss is NaN or diverging

Gradient clipping with NaN/Inf detection is on by default. If it still blows up: lower --lr, increase --warmup, and check your data. --bf16-matmul exists for genuine FP16 overflow, but it has coarser precision and can destabilize an otherwise-healthy run — reach for it last.
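For orientation, here is a minimal sketch of what global-norm gradient clipping with NaN/Inf detection usually looks like. It is not Smedjan's implementation: `clip_grad_norm`, `ClipOutcome`, and the flat `&mut [f32]` gradient buffer are hypothetical names chosen for illustration.

```rust
/// Outcome of one clipping pass over a flat gradient buffer.
/// (Hypothetical type, not part of Smedjan.)
#[derive(Debug, PartialEq)]
pub enum ClipOutcome {
    /// All gradients were finite; carries the pre-clip global L2 norm.
    Finite { norm: f32 },
    /// The norm was NaN or Inf; the buffer is left untouched so the
    /// caller can skip this optimizer step.
    NonFinite,
}

/// Scale `grads` in place so its global L2 norm is at most `max_norm`.
pub fn clip_grad_norm(grads: &mut [f32], max_norm: f32) -> ClipOutcome {
    // Accumulate in f64 so squaring large f32 values does not overflow.
    let sum_sq: f64 = grads.iter().map(|&g| (g as f64) * (g as f64)).sum();
    let norm = sum_sq.sqrt() as f32;

    // A single NaN or Inf gradient poisons the sum, so one check on the
    // norm covers the whole buffer.
    if !norm.is_finite() {
        return ClipOutcome::NonFinite;
    }

    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
    ClipOutcome::Finite { norm }
}
```

A training loop built this way would skip the weight update whenever it sees `NonFinite`. If that happens on most steps, the problem is upstream (learning rate, warmup, or data), which is why those are the first things to check.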

Out of memory

In order: lower --batch-size / --seq-len, add --grad-accum, turn on --gradient-checkpointing, then --fused-ce and --fp16-activations. See Performance & tuning.
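The arithmetic behind the first two steps can be sketched as follows. It assumes that --grad-accum counts the micro-steps accumulated per weight update, which is the usual meaning but is not spelled out here. `StepShape` and its methods are hypothetical, not Smedjan types.

```rust
/// Shape of one optimizer step. Field names mirror the CLI flags, but the
/// struct itself is hypothetical.
#[derive(Clone, Copy, Debug)]
pub struct StepShape {
    pub batch_size: usize,
    pub seq_len: usize,
    pub grad_accum: usize,
}

impl StepShape {
    /// Tokens resident in one forward/backward pass. Activation memory
    /// grows roughly in proportion to this number.
    pub fn tokens_per_micro_step(&self) -> usize {
        self.batch_size * self.seq_len
    }

    /// Tokens contributing to each weight update.
    pub fn tokens_per_update(&self) -> usize {
        self.tokens_per_micro_step() * self.grad_accum
    }

    /// Halve the micro-batch and double the accumulation count: the same
    /// tokens per update, about half the activation footprint.
    /// Assumes an even batch size.
    pub fn halve_batch(&self) -> StepShape {
        StepShape {
            batch_size: self.batch_size / 2,
            grad_accum: self.grad_accum * 2,
            ..*self
        }
    }
}
```

Trading batch size for accumulation keeps the optimization behaviour close to the original run while shrinking peak memory, which is why it comes before the flags that change how activations are stored.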

Throughput is terrible

Build with cargo build --release. The hardware simdgroup matmul is on by default; a debug build or a fallback path is many times slower.

CUDA build

Install the CUDA toolkit (12.x) and build with --no-default-features --features cuda. The Metal path is the default on macOS and needs no setup.

// 02

Limitations & roadmap

what's done, what isn't

Smedjan is one engineer's engine, and it says so. The honest state:

safetensors import reads F32, BF16, and F16, and config.json maps straight to a Smedjan model via import-hf. Export works too.
GGUF export covers f32, q8_0, and q4_0 as standard GGML blocks (norms stay f32).
Bit-exact HuggingFace inference parity is still on the roadmap. The config.json → model + BF16/F16 import path works for continued training. Reproducing HF inference to the bit (half-split RoPE, fixed QK-norm) is a separate matter: Smedjan deliberately diverges there, and continued training adapts the difference away.
RWKV and block-sparse both train. The RWKV WKV now uses a numerically stable decay form and converges at long sequence lengths; block-sparse trains like dense. --ssm and --linear-attn train as well.
CUDA backward parity for a few specialized kernels is still being completed; the Metal path is the most exercised.
Long-context evaluation (NIAH / RULER) ships — run smedjan eval --longctx. The strength of the curve depends on how well the model was trained.
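To make the RoPE point in the list above concrete, here is a sketch of the two common ways to pair dimensions for rotary embeddings: interleaved pairs and the half-split pairing. Which one Smedjan uses internally is not stated here. The function names and the standard frequency schedule are assumptions for illustration, not Smedjan code.

```rust
/// Rotation angle for pair `i` at position `pos`, using the standard RoPE
/// frequency schedule base^(-2i/dim).
fn angle(pos: usize, i: usize, dim: usize, base: f32) -> f32 {
    pos as f32 * base.powf(-2.0 * i as f32 / dim as f32)
}

/// Interleaved convention: pair i is (x[2i], x[2i+1]).
pub fn rope_interleaved(x: &[f32], pos: usize, base: f32) -> Vec<f32> {
    let dim = x.len();
    let mut out = vec![0.0f32; dim];
    for i in 0..dim / 2 {
        let (s, c) = angle(pos, i, dim, base).sin_cos();
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s;
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c;
    }
    out
}

/// Half-split convention: pair i is (x[i], x[i + dim/2]).
pub fn rope_half_split(x: &[f32], pos: usize, base: f32) -> Vec<f32> {
    let dim = x.len();
    let half = dim / 2;
    let mut out = vec![0.0f32; dim];
    for i in 0..half {
        let (s, c) = angle(pos, i, dim, base).sin_cos();
        out[i] = x[i] * c - x[i + half] * s;
        out[i + half] = x[i] * s + x[i + half] * c;
    }
    out
}
```

Both rotate the same number of pairs by the same angles, but they pair different elements, so a head trained under one layout gives different attention scores under the other. The two are related by a fixed permutation of the head dimension, which is why the mismatch is a convention difference and not a loss of information.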

// 03

FAQ

quick answers

Does it need Python or PyTorch?

No. The entire dependency tree is a handful of small crates (clap, rand, memmap2, byteorder) plus the GPU FFI bindings.

Can I move a checkpoint between Metal and CUDA?

Yes — the checkpoint format is portable across both backends. Train on a Mac, resume on an NVIDIA GPU.

What's the smallest useful model?

The tiny (~2M) or small presets are perfect for proving the pipeline and for tiny on-device models. Scale up from there.

Can I export to GGUF / run under llama.cpp?

Export standard GGML weights with smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q4_0 (f32, q8_0, or q4_0). The blocks are validated against the reference GGUF dequantizer, but a Smedjan checkpoint is not yet a turnkey llama.cpp inference model — the tokenizer isn't embedded and the RoPE/QK-norm conventions differ. Direct llama.cpp inference is on the roadmap.
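For a sense of what a q8_0 block holds, here is a minimal round-trip sketch of the GGML q8_0 scheme: 32 values per block, one scale, and one signed byte per value. It is a simplification and not Smedjan's exporter. The on-disk format stores the scale as f16, while this sketch keeps it as f32. `Q8Block`, `quantize_q8_0`, and `dequantize_q8_0` are hypothetical names.

```rust
/// Number of values per q8_0 block in GGML.
pub const QK8_0: usize = 32;

/// One quantized block. Real GGML stores `d` as f16; f32 is used here
/// to keep the sketch dependency-free.
pub struct Q8Block {
    pub d: f32,
    pub qs: [i8; QK8_0],
}

/// Quantize 32 floats: scale so the largest magnitude maps to 127.
pub fn quantize_q8_0(x: &[f32; QK8_0]) -> Q8Block {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    // Guard the all-zero block, where the scale is zero.
    let inv_d = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; QK8_0];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (*v * inv_d).round() as i8;
    }
    Q8Block { d, qs }
}

/// Dequantize: each value is simply scale times quant.
pub fn dequantize_q8_0(b: &Q8Block) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (o, q) in out.iter_mut().zip(b.qs.iter()) {
        *o = b.d * (*q as f32);
    }
    out
}
```

The round-trip error per value is bounded by about half the block scale, so blocks with one large outlier lose the most precision on their small values.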

License?

MIT. Own it, fork it, ship it.

Getting started → Back to the four-command quickstart.
CLI reference → Look up any command.