// concepts

Core concepts

The handful of ideas you need to read the rest of the docs: how text becomes tokens, what the model is, what a checkpoint holds, and why the same files run on two different GPUs.

// 01

Tokenizer

text ↔ integer tokens

A from-scratch byte-pair encoder (BPE). You train it once on your corpus with smedjan tokenizer; it learns a vocabulary of sub-word units and maps text to integer token IDs and back. Byte-level fallback means it never chokes on an unknown character: anything outside the learned vocabulary is encoded as raw bytes. The same tokenizer must be used for training and for generation, because token IDs index the model's embedding table; it is part of the model's contract. You can also import a GPT-2 / HuggingFace merges.txt with smedjan import-bpe.
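To make the idea concrete, here is a minimal sketch of byte-level BPE in plain Python. It is not smedjan's implementation: the function names (`train_bpe`, `encode`, `decode`) are hypothetical, and the merge rule (always merge the most frequent adjacent pair) is the textbook one. Because the base vocabulary is the 256 byte values, any string round-trips, including characters never seen during training.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `corpus`."""
    ids = list(corpus.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                         # (left_id, right_id) -> new token id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                       # no pair repeats, nothing worth merging
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Text -> token IDs: start from bytes, replay merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Token IDs -> text: expand every token back to its byte string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe("low lower lowest low low", num_merges=10)
ids = encode("lowest smörgåsbord", merges)
print(ids, decode(ids, merges))
```

Decoding with a different `merges` table would map the same IDs to different bytes, which is the practical reason the tokenizer has to travel with the model.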

// 02

The model

decoder-only transformer

A pre-norm, decoder-only transformer — the same family as GPT and Llama:

Normalization: RMSNorm, applied pre-block.
Positions: Rotary Position Embeddings (RoPE), with NTK-aware and YaRN scaling for context extension.
Attention: Multi-Head or Grouped-Query (GQA) via --kv-heads. Alternative mixers are available, including O(N) ones: linear attention, SSM (Mamba-2/SSD), MLA, block-sparse, and RWKV. All of them train.
Feed-forward: SwiGLU. Optional Mixture-of-Experts routing.
Head: the output projection is weight-tied to the token embedding.
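Three of the building blocks above are small enough to sketch in plain Python. This is an illustration under stated assumptions, not smedjan's kernels: the function names are hypothetical, RoPE uses the interleaved-pair convention with base 10000 (implementations differ on pairing), and no context-extension scaling is applied.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale per channel."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):  # i is already 2k, so the exponent is -2k/d
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def kv_head_for(query_head, n_heads, kv_heads):
    """GQA: each group of n_heads // kv_heads query heads shares one K/V head."""
    assert n_heads % kv_heads == 0
    return query_head // (n_heads // kv_heads)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]
# Both scores use a query/key offset of 2, so they agree up to rounding.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
print([kv_head_for(h, n_heads=8, kv_heads=2) for h in range(8)])
```

The first print shows the property that makes RoPE useful: the attention score depends on the distance between positions, not on the absolute positions themselves.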

// 03

Model sizes

tiny (2M) → 6.5B, or custom

Presets run from tiny (~2M parameters) through max (up to ~6.5B). Or go fully custom with --size custom --dim --layers --heads --ffn-mult. Parameter count depends on your vocabulary size, so ask the binary rather than guessing:

# print exact parameter counts for the presets at your vocab size
smedjan sizes --vocab-size 8192

A concrete example: the medium preset is about 45M parameters (dim 512, 12 layers, 8 heads).
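The vocabulary dependence is easy to see with a back-of-the-envelope estimator. The sketch below assumes the architecture described earlier (tied embedding, RMSNorm, SwiGLU, optional GQA) with no bias terms; `count_params` and the `ffn_hidden` value are hypothetical, and the result is not expected to match `smedjan sizes` exactly, so treat the binary as the source of truth.

```python
def count_params(vocab_size, dim, layers, heads, ffn_hidden, kv_heads=None):
    """Rough parameter count for a pre-norm decoder with a tied output head."""
    kv_heads = heads if kv_heads is None else kv_heads
    head_dim = dim // heads

    embed = vocab_size * dim                 # shared with the output projection
    attn = dim * dim                         # Q projection
    attn += 2 * dim * (kv_heads * head_dim)  # K and V (smaller under GQA)
    attn += dim * dim                        # attention output projection
    ffn = 3 * dim * ffn_hidden               # SwiGLU: gate, up, down
    norms = 2 * dim                          # two RMSNorm weights per block
    final_norm = dim

    return embed + layers * (attn + ffn + norms) + final_norm


# ffn_hidden=1408 is an illustrative value, not a smedjan default.
for vocab in (8192, 32000):
    n = count_params(vocab, dim=512, layers=12, heads=8, ffn_hidden=1408)
    print(f"vocab {vocab}: {n / 1e6:.1f}M parameters")
```

Only the embedding term changes with the vocabulary, so at dim 512 every extra 1,000 tokens adds 512,000 parameters.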

// 04

Checkpoints & resume

weights + optimizer + step

A training checkpoint holds the model weights, the optimizer state, and the step counter, so --resume continues a run exactly where it stopped: the same trajectory, not an approximation. Smaller final.bin-style files hold only the weights, for inference and export. Checkpoints are written to --checkpoint-dir at a fixed interval.
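Why the optimizer state matters can be shown with a toy run. The sketch below is not smedjan's checkpoint format: it uses SGD with momentum on a trivial loss and JSON as a stand-in serializer, and all names are hypothetical. A real run would also need a reproducible data order to stay on the same trajectory.

```python
import json


def fresh_state():
    return {"weights": [1.0, -2.0, 0.5], "momentum": [0.0, 0.0, 0.0], "step": 0}


def train(state, steps, lr=0.1, beta=0.9):
    """Run `steps` updates of SGD with momentum on the toy loss sum(w**2)."""
    for _ in range(steps):
        grads = [2.0 * w for w in state["weights"]]
        state["momentum"] = [beta * m + g
                             for m, g in zip(state["momentum"], grads)]
        state["weights"] = [w - lr * m
                            for w, m in zip(state["weights"], state["momentum"])]
        state["step"] += 1
    return state


def save_checkpoint(state):
    return json.dumps(state)   # weights + optimizer state + step counter


def load_checkpoint(blob):
    return json.loads(blob)


uninterrupted = train(fresh_state(), 10)

# Stop after 4 steps, checkpoint, then resume for the remaining 6.
blob = save_checkpoint(train(fresh_state(), 4))
resumed = train(load_checkpoint(blob), 6)

# Resuming from weights alone (momentum reset) lands somewhere else.
weights_only = load_checkpoint(blob)
weights_only["momentum"] = [0.0, 0.0, 0.0]
diverged = train(weights_only, 6)

print(resumed == uninterrupted, diverged["weights"] == uninterrupted["weights"])
```

The full checkpoint reproduces the uninterrupted run, while the weights-only restart does not, which is the distinction between a training checkpoint and an inference file.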

// 05

Two backends, one codebase

Metal · CUDA

The GPU backend is selected at compile time: Metal on Apple Silicon (the default), CUDA on NVIDIA (--no-default-features --features cuda). Everything above the backend line — the model, autograd, and training loop — is backend-agnostic, and checkpoints are portable across both. Train on a Mac, resume on an H100, keep the same format.

Current status: the CUDA forward pass and training work; backward parity for a few specialized kernels is still being completed (see the roadmap). The Metal path is the most exercised.

// 06

Autograd

tape-based reverse-mode

Gradients come from a hand-written, tape-based reverse-mode autodiff — no external ML framework. Gradient checkpointing trades compute for memory by recomputing activations in the backward pass instead of storing them, which is what lets larger models fit on small GPUs.
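For readers new to the idea, a tape-based reverse-mode autodiff fits in a few dozen lines. This is a scalar sketch with a hypothetical `Tape` class, not smedjan's autograd: every operation appends its result and its local derivatives to a list, and the backward pass walks that list in reverse, accumulating gradients with the chain rule.

```python
import math


class Tape:
    def __init__(self):
        self.values = []    # forward value of every node, in recording order
        self.parents = []   # per node: list of (parent_index, local_gradient)

    def _push(self, value, parents):
        self.values.append(value)
        self.parents.append(parents)
        return len(self.values) - 1   # a node is just its index on the tape

    def var(self, value):
        return self._push(value, [])

    def add(self, a, b):
        return self._push(self.values[a] + self.values[b], [(a, 1.0), (b, 1.0)])

    def mul(self, a, b):
        va, vb = self.values[a], self.values[b]
        return self._push(va * vb, [(a, vb), (b, va)])

    def tanh(self, a):
        t = math.tanh(self.values[a])
        return self._push(t, [(a, 1.0 - t * t)])

    def grad(self, output):
        """Return d(output)/d(node) for every node on the tape."""
        grads = [0.0] * len(self.values)
        grads[output] = 1.0
        for node in range(output, -1, -1):   # reverse of recording order
            for parent, local in self.parents[node]:
                grads[parent] += local * grads[node]
        return grads


tape = Tape()
x = tape.var(0.5)
y = tape.var(-1.5)
z = tape.add(tape.mul(x, y), tape.tanh(x))   # z = x*y + tanh(x)
g = tape.grad(z)
print(g[x], g[y])   # dz/dx = y + 1 - tanh(x)**2, dz/dy = x
```

Note that the tape stores every intermediate value. Gradient checkpointing is the decision to drop some of those stored values and recompute them during the backward pass.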

Data & tokenizer → Build and clean the corpus you'll train on.
Training → Put these concepts to work.