// concepts

Core concepts

The handful of ideas you need to read the rest of the docs: how text becomes tokens, what the model is, what a checkpoint holds, and why the same files run on two different GPUs.

// 01

Tokenizer

text ↔ integer tokens

A from-scratch byte-pair encoder (BPE). You train it once on your corpus with smedjan tokenizer; it learns a vocabulary of sub-word units and maps text to integer token IDs and back. Byte-level fallback means any character, even one absent from the training corpus, still encodes to valid token IDs. The same tokenizer must be used for training and for generation: it is part of the model's contract. You can also import a GPT-2 / Hugging Face merges.txt with smedjan import-bpe.
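
To make the idea concrete, here is a minimal byte-level BPE sketch in Python. It is an illustration of the general technique, not smedjan's implementation: the function names (train_bpe, merge_pair, encode, decode) and the greedy most-frequent-pair rule are assumptions chosen for brevity. IDs 0 to 255 are the raw bytes, which is what makes the byte-level fallback work.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merges; IDs 0..255 stay reserved for raw bytes."""
    ids = list(corpus)
    merges = []  # ordered list of ((left_id, right_id), new_id)
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not compress
        merges.append((pair, next_id))
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges

def encode(text: str, merges):
    """Text -> token IDs: start from UTF-8 bytes, apply merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Token IDs -> text: expand every ID back to the bytes it stands for."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8")

merges = train_bpe(b"low lower lowest low low", 10)
ids = encode("lowest ünknown", merges)   # "ü" never appeared in the corpus
assert decode(ids, merges) == "lowest ünknown"
```

The round trip only works because encode and decode share the same merges list, which is the reason the tokenizer counts as part of the model's contract.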

// 02

The model

decoder-only transformer

A pre-norm, decoder-only transformer — the same family as GPT and Llama:

  • Normalization: RMSNorm, applied pre-block.
  • Positions: Rotary Position Embeddings (RoPE), with NTK-aware and YaRN scaling for context extension.
  • Attention: Multi-Head or Grouped-Query (GQA) via --kv-heads. Alternative mixers are available, including O(N) ones: linear attention, SSM (Mamba-2/SSD), MLA, block-sparse, and RWKV. All of them train.
  • Feed-forward: SwiGLU. Optional Mixture-of-Experts routing.
  • Head: the output projection is weight-tied to the token embedding.
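
As one illustration of the list above, this is a minimal pure-Python sketch of Rotary Position Embeddings. It assumes the common base of 10000 and the interleaved pairing (x0, x1), (x2, x3), ...; smedjan's actual layout may differ, and NTK-aware or YaRN scaling is not shown. The names rope and dot are hypothetical helpers.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `pos`."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # later pairs rotate more slowly
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# The attention score depends only on the distance between the two positions:
near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, distance 3
far = dot(rope(q, 103), rope(k, 100))    # positions 103 and 100, distance 3
assert abs(near - far) < 1e-9
```

Because the score depends only on relative distance, position information enters through the rotation of queries and keys and no separate position table is added to the token embeddings.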

// 03

Model sizes

tiny (2M) → 6.5B, or custom

Presets run from tiny (~2M parameters) through max (up to ~6.5B). Or go fully custom with --size custom --dim --layers --heads --ffn-mult. Parameter count depends on your vocabulary size, so ask the binary rather than guessing:

# print exact parameter counts for the presets at your vocab size
smedjan sizes --vocab-size 8192

A concrete example: the medium preset is about 45M parameters (dim 512, 12 layers, 8 heads); the exact figure depends on your vocabulary size.
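
To see why the vocabulary size matters, here is a rough back-of-the-envelope estimator in Python. It is a sketch under stated assumptions, not the formula smedjan uses: no bias terms, one gain vector per RMSNorm, a tied embedding counted once, and an FFN hidden width you pass in yourself (ffn_hidden and the function name estimate_params are hypothetical). Its totals will not necessarily match the presets, so treat smedjan sizes as the source of truth.

```python
def estimate_params(vocab_size, dim, layers, heads, ffn_hidden, kv_heads=None):
    """Approximate parameter count for a pre-norm decoder with tied embeddings."""
    kv_heads = heads if kv_heads is None else kv_heads  # MHA unless GQA is requested
    head_dim = dim // heads

    embedding = vocab_size * dim              # shared with the output head (weight tying)
    attention = dim * dim                     # query projection
    attention += 2 * dim * kv_heads * head_dim  # key and value projections
    attention += dim * dim                    # output projection
    feed_forward = 3 * dim * ffn_hidden       # SwiGLU: gate, up, and down matrices
    norms = 2 * dim                           # assumed: two RMSNorm gains per block

    return embedding + layers * (attention + feed_forward + norms) + dim  # + final norm

# Same architecture, two vocabulary sizes (ffn_hidden=2048 is an arbitrary example value).
small_vocab = estimate_params(8192, 512, 12, 8, ffn_hidden=2048)
large_vocab = estimate_params(32768, 512, 12, 8, ffn_hidden=2048)
print(large_vocab - small_vocab)  # the difference is exactly (32768 - 8192) * 512
```

For small models the embedding table is a large share of the total, which is why the same preset reports different counts at different vocabulary sizes.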

// 04

Checkpoints & resume

weights + optimizer + step

A training checkpoint holds the model weights, the optimizer state, and the step counter, so --resume continues a run exactly where it stopped: the same trajectory, not an approximation. Smaller final.bin-style files hold only the weights, for inference and export. Checkpoints are written to --checkpoint-dir at a fixed interval.
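
The toy example below shows why the optimizer state has to travel with the weights. It is a sketch, not smedjan's checkpoint format: the JSON layout, the field names, and the one-parameter momentum optimizer are all assumptions made to keep it short.

```python
import json

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2

def train(state, steps, lr=0.1, beta=0.9):
    """SGD with momentum; `state` carries weight, optimizer state, and step counter."""
    for _ in range(steps):
        state["momentum"] = beta * state["momentum"] + grad(state["weight"])
        state["weight"] -= lr * state["momentum"]
        state["step"] += 1
    return state

def fresh():
    return {"weight": 0.0, "momentum": 0.0, "step": 0}

# Uninterrupted run: 20 steps.
full = train(fresh(), 20)

# Interrupted run: 10 steps, checkpoint, reload, 10 more steps.
checkpoint = json.dumps(train(fresh(), 10))
resumed = train(json.loads(checkpoint), 10)
assert resumed == full  # identical weight, momentum, and step

# Weights-only "resume": the momentum buffer is lost and restarts at zero.
weights_only = json.loads(checkpoint)
weights_only["momentum"] = 0.0
drifted = train(weights_only, 10)
assert drifted["weight"] != full["weight"]  # a different trajectory
```

A weights-only file is still enough for inference, since nothing in the forward pass reads the optimizer state.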

// 05

Two backends, one codebase

Metal · CUDA

The GPU backend is selected at compile time: Metal on Apple Silicon (the default), CUDA on NVIDIA (--no-default-features --features cuda). Everything above the backend line — the model, autograd, and training loop — is backend-agnostic, and checkpoints are portable across both. Train on a Mac, resume on an H100, keep the same format.

Current state: on CUDA, the forward pass and training both work; backward-pass parity for a few specialized kernels is still being completed (see the roadmap). The Metal path is the most exercised.

// 06

Autograd

tape-based reverse-mode

Gradients come from a hand-written, tape-based reverse-mode autodiff — no external ML framework. Gradient checkpointing trades compute for memory by recomputing activations in the backward pass instead of storing them, which is what lets larger models fit on small GPUs.
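
A tape-based reverse-mode autodiff fits in a few lines for scalars. The sketch below is illustrative only and says nothing about how smedjan structures its tape: the Tape and Var classes, and the choice to store closures, are assumptions. Each forward operation appends a backward step to the tape; the backward pass replays the tape in reverse.

```python
class Tape:
    def __init__(self):
        self.ops = []  # backward closures, recorded in forward order

class Var:
    def __init__(self, tape, value):
        self.tape, self.value, self.grad = tape, value, 0.0

    def __add__(self, other):
        out = Var(self.tape, self.value + other.value)
        def step():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        self.tape.ops.append(step)
        return out

    def __mul__(self, other):
        out = Var(self.tape, self.value * other.value)
        def step():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        self.tape.ops.append(step)
        return out

def backward(tape, output):
    """Seed the output gradient with 1 and replay the tape in reverse."""
    output.grad = 1.0
    for step in reversed(tape.ops):
        step()

tape = Tape()
x, y = Var(tape, 3.0), Var(tape, 4.0)
z = x * y + x * x          # z = xy + x^2
backward(tape, z)
assert (x.grad, y.grad) == (10.0, 3.0)  # dz/dx = y + 2x, dz/dy = x
```

Note that every closure keeps its inputs alive until the backward pass runs. That stored state is the memory cost gradient checkpointing reduces, by dropping some of it and recomputing the forward values when the backward pass needs them.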