Core concepts
The handful of ideas you need to read the rest of the docs: how text becomes tokens, what the model is, what a checkpoint holds, and why the same files run on two different GPUs.
Tokenizer
A from-scratch byte-pair encoder (BPE). You train it once on your corpus with smedjan tokenizer; it learns a vocabulary of sub-word units and maps text to integer token IDs and back. Byte-level fallback means it never chokes on an unknown character. The same tokenizer must be used for training and for generation — it is part of the model's contract. You can also import a GPT-2 / HuggingFace merges.txt with smedjan import-bpe.
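To make the idea concrete, here is a minimal byte-level BPE sketch in Python. It is an illustration of the general technique, not smedjan's implementation: the function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, and it skips pre-tokenization and special tokens. It shows why byte-level fallback never fails: every string is first reduced to UTF-8 bytes, which are always in the base vocabulary.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merge rules, starting from the 256 byte values."""
    ids = list(corpus)
    merges = {}  # (left_id, right_id) -> new token id, in learned order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text: str, merges):
    """Text -> token IDs: start from raw bytes, then replay merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Token IDs -> text: expand every token back to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train_bpe(b"low lower lowest low low", 10)
ids = encode("lowest flow ☃", merges)
text = decode(ids, merges)
```

The snowman character never appeared in the training corpus, yet it encodes cleanly as its three UTF-8 bytes. Decoding with a different `merges` table would map the same IDs to different bytes, which is why the tokenizer is part of the model's contract.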
The model
A pre-norm, decoder-only transformer — the same family as GPT and Llama:
- Normalization: RMSNorm, applied pre-block.
- Positions: Rotary Position Embeddings (RoPE), with NTK-aware and YaRN scaling for context extension.
- Attention: Multi-Head or Grouped-Query (GQA) via --kv-heads. Alternative O(N) mixers are available: Linear attention, SSM (Mamba-2/SSD), MLA, block-sparse, and RWKV. All of them train.
- Feed-forward: SwiGLU. Optional Mixture-of-Experts routing.
- Head: the output projection is weight-tied to the token embedding.
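Three of the building blocks above are small enough to sketch in plain Python. This is a hedged illustration of the standard definitions, not smedjan's kernels: the function names are hypothetical, the RoPE variant rotates adjacent channel pairs (some implementations pair the first half with the second half instead), and the contiguous GQA grouping is the common convention, assumed here rather than confirmed.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def rope(x, position, base=10000.0):
    """RoPE: rotate each adjacent channel pair by an angle that grows with position."""
    out = list(x)
    d = len(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def kv_head_for(query_head, n_heads, n_kv_heads):
    """GQA: each group of n_heads // n_kv_heads query heads shares one K/V head."""
    return query_head // (n_heads // n_kv_heads)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Attention scores under RoPE depend only on the relative offset (2 in both cases).
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))

# With 8 query heads and 2 K/V heads, four query heads share each K/V head.
mapping = [kv_head_for(h, 8, 2) for h in range(8)]
```

The relative-offset property is what the NTK-aware and YaRN scaling schemes build on: they change how the rotation angles are computed so that positions beyond the training length still fall in a range the model has seen.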
Model sizes
Presets run from tiny (~2M parameters) through max (up to ~6.5B). Or go fully custom with --size custom --dim --layers --heads --ffn-mult. Parameter count depends on your vocabulary size, so ask the binary rather than guessing:
# print exact parameter counts for the presets at your vocab size
smedjan sizes --vocab-size 8192
A concrete example: the medium preset is 45M parameters — dim 512, 12 layers, 8 heads.
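To see why vocabulary size moves the total, the following Python sketch gives a back-of-the-envelope estimate. It is not the binary's accounting: it assumes no bias terms, two RMSNorm gains per block plus one final norm, a three-matrix SwiGLU, and a tied embedding. `estimate_params` and the `ffn_hidden` value used below are hypothetical placeholders, so treat `smedjan sizes` as the source of truth.

```python
def estimate_params(vocab_size, dim, layers, heads, ffn_hidden, kv_heads=None):
    """Rough parameter count for a pre-norm decoder with a tied embedding/head."""
    kv_heads = kv_heads or heads
    head_dim = dim // heads

    attn = dim * dim                         # query projection
    attn += 2 * dim * kv_heads * head_dim    # key and value projections (smaller under GQA)
    attn += dim * dim                        # output projection
    ffn = 3 * dim * ffn_hidden               # SwiGLU: gate, up, and down matrices
    norms = 2 * dim                          # one RMSNorm gain vector before each sub-block

    embedding = vocab_size * dim             # counted once: the head reuses these weights
    final_norm = dim
    return embedding + layers * (attn + ffn + norms) + final_norm

# Assumed shape for illustration only; ffn_hidden=1536 is a guess, not a preset value.
small_vocab = estimate_params(8192, dim=512, layers=12, heads=8, ffn_hidden=1536)
large_vocab = estimate_params(16384, dim=512, layers=12, heads=8, ffn_hidden=1536)
extra = large_vocab - small_vocab            # exactly the added embedding rows
```

Under these assumptions, doubling the vocabulary from 8192 to 16384 adds 8192 × 512 parameters and nothing else, because weight tying means the embedding table is the only vocabulary-dependent tensor.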
Checkpoints & resume
A training checkpoint holds the model weights, the optimizer state, and the step counter, so --resume continues a run exactly where it stopped: the same trajectory, not an approximation. Smaller final.bin-style files hold just the weights, for inference and export. Checkpoints are written to --checkpoint-dir at a fixed interval.
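The toy Python sketch below shows why the optimizer state has to travel with the weights. It is not smedjan's checkpoint format: the dictionary keys, the JSON serialization, and the one-parameter model are all assumptions chosen to keep the example short. It trains with SGD plus momentum, "writes" a checkpoint at step 8, and compares a full resume against a weights-only resume.

```python
import json

def train(state, steps, lr=0.1, beta=0.9):
    """Minimise f(w) = (w - 3)^2 with SGD + momentum, starting from `state`."""
    w, momentum, step = state["weights"], state["optimizer"], state["step"]
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        momentum = beta * momentum + grad
        w = w - lr * momentum
        step += 1
    return {"weights": w, "optimizer": momentum, "step": step}

fresh = {"weights": 0.0, "optimizer": 0.0, "step": 0}

# Reference run: 20 steps with no interruption.
uninterrupted = train(fresh, 20)

# Interrupted run: checkpoint after 8 steps, then resume for the remaining 12.
checkpoint = json.dumps(train(fresh, 8))
resumed = train(json.loads(checkpoint), 12)

# Weights-only restart: same weights and step, but the momentum buffer is lost.
weights_only = dict(json.loads(checkpoint), optimizer=0.0)
diverged = train(weights_only, 12)
```

Here `resumed` matches `uninterrupted` exactly, while `diverged` ends at different weights even though it started from the same ones. That difference is what separates a training checkpoint from a weights-only file.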
Two backends, one codebase
The GPU backend is selected at compile time: Metal on Apple Silicon (the default), CUDA on NVIDIA (--no-default-features --features cuda). Everything above the backend line — the model, autograd, and training loop — is backend-agnostic, and checkpoints are portable across both. Train on a Mac, resume on an H100, keep the same format.
Current status: on CUDA, the forward pass and training work; backward parity for a few specialized kernels is still being completed (see the roadmap). The Metal path is the most exercised.
Autograd
Gradients come from a hand-written, tape-based reverse-mode autodiff — no external ML framework. Gradient checkpointing trades compute for memory by recomputing activations in the backward pass instead of storing them, which is what lets larger models fit on small GPUs.
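For readers new to the term, a tape-based reverse-mode autodiff can be sketched in a few lines of Python. This is a scalar toy under stated assumptions, not smedjan's autograd: the `Tape` class and its methods are hypothetical, and it supports only addition and multiplication. The forward pass appends each operation to a list (the tape) together with its local derivatives; the backward pass walks the list in reverse and applies the chain rule.

```python
class Tape:
    def __init__(self):
        self.values = []    # forward value of every node, in creation order
        self.parents = []   # per node: list of (parent_index, local_derivative)

    def _push(self, value, parents):
        self.values.append(value)
        self.parents.append(parents)
        return len(self.values) - 1   # a node is just its index on the tape

    def leaf(self, value):
        return self._push(value, [])

    def add(self, a, b):
        return self._push(self.values[a] + self.values[b], [(a, 1.0), (b, 1.0)])

    def mul(self, a, b):
        va, vb = self.values[a], self.values[b]
        return self._push(va * vb, [(a, vb), (b, va)])

    def grad(self, output):
        """Return d(output)/d(node) for every node, walking the tape backwards."""
        grads = [0.0] * len(self.values)
        grads[output] = 1.0
        for node in range(output, -1, -1):
            for parent, local in self.parents[node]:
                grads[parent] += local * grads[node]
        return grads

tape = Tape()
x = tape.leaf(2.0)
y = tape.leaf(5.0)
z = tape.add(tape.mul(x, x), tape.mul(x, y))   # z = x*x + x*y
grads = tape.grad(z)                           # dz/dx = 2x + y, dz/dy = x
```

Note that `mul` stores the forward values it needs for its derivative. In a real model those stored activations dominate memory, and gradient checkpointing discards most of them, re-running parts of the forward pass during `grad` to rebuild them on demand.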