Data & tokenizer
Everything upstream of training: learn a vocabulary, turn raw text into the binary token format, and clean, dedup, and mix your corpus. Every step is a subcommand of the same smedjan binary.
Train or import a tokenizer
Train a BPE tokenizer on your corpus, or import an existing GPT-2 / HuggingFace merges file. Pick the vocabulary size up front — larger vocab means shorter sequences but a bigger embedding table.
    # train a byte-pair tokenizer (--vocab-size defaults to 32000)
    smedjan tokenizer --input corpus.txt --vocab-size 16000 --output tokenizer.bin

    # or import a GPT-2 / HuggingFace merges.txt as a byte-level BPE
    smedjan import-bpe --merges merges.txt --output tokenizer.bin
| Flag | Default | What it does |
|---|---|---|
| --input | — | Text corpus to learn the vocabulary from (tokenizer). |
| --vocab-size | 32000 | Target vocabulary size. |
| --merges | — | GPT-2/HF merges.txt to import (import-bpe). |
| --output | tokenizer.bin | Where to write the tokenizer. |
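To make the vocabulary-size trade-off concrete, here is a minimal sketch of how byte-level BPE learns merges: start from the 256 single-byte symbols and repeatedly fuse the most frequent adjacent pair. It is an illustration only, not smedjan's implementation; `learn_bpe_merges` and its tie-breaking rule are hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges byte-pair merges from an iterable of words."""
    # Each word starts as a tuple of single-byte symbols.
    corpus = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the larger pair so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the pair with one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
vocab_size = 256 + len(merges)  # 256 byte symbols plus one new symbol per merge
```

Each merge adds one vocabulary entry and shortens every sequence that contains the pair, which is why a larger target size yields shorter sequences at the cost of more embedding rows.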
Prepare training data
Tokenize raw text into the binary stream the trainer memory-maps. Run it once per corpus; the output is what you pass to train --dataset.
    # tokenize raw text → memory-mappable binary token stream
    smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output train.bin
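The point of a memory-mappable stream is that a reader can slice token windows straight from the file without loading it all. The sketch below shows the idea with the Python standard library. The on-disk layout of train.bin is not specified here, so the sketch assumes a headerless run of native-endian unsigned 16-bit token IDs; that layout and the file name `example.bin` are assumptions for illustration.

```python
import mmap
import os
import tempfile
from array import array

# ASSUMPTION: tokens are native-endian uint16 with no header.
# The real train.bin format may differ.
tokens = array("H", [17, 4, 1024, 65535, 9])

path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    tokens.tofile(f)

with open(path, "rb") as f:
    # Map the whole file read-only; pages are loaded lazily by the OS.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the raw bytes as 16-bit integers without copying.
        with memoryview(mm) as raw, raw.cast("H") as view:
            count = len(view)              # number of tokens in the file
            window = view[1:4].tolist()    # copy out one small training window
```

Only the pages backing the requested window are read, which is what lets a trainer sample random offsets from a corpus much larger than RAM.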
Clean with provenance
For real corpora, run text through the cleaning pipeline first. It splits documents on a separator and can record provenance — source name, URL, and license — to a log, so you keep an audit trail of what went into the model.
    # clean text and record where it came from, splitting documents on blank lines
    smedjan process \
      --input raw.txt --tokenizer tokenizer.bin --output clean.bin \
      --separator "\n\n" \
      --provenance-log prov.log --source-name wikipedia --source-url https://… \
      --license CC-BY-SA
| Flag | Default | What it does |
|---|---|---|
| --separator | "\n\n" | Document boundary. Empty = treat the file as one document. |
| --provenance-log | — | Append a provenance record to this file. |
| --source-name | unknown | Source label for provenance. |
| --source-url | "" | Source URL for provenance. |
| --license | unknown | License string for provenance. |
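As a rough picture of what the separator and provenance flags describe, the sketch below splits text into documents and builds one log entry. The record schema (a JSON line with these field names) and the decision to drop empty documents are assumptions made for this example; the actual format of prov.log is not documented here. `split_documents` and `provenance_record` are hypothetical helpers.

```python
import json


def split_documents(text, separator="\n\n"):
    """Split text on separator; an empty separator means one document."""
    if separator == "":
        return [text] if text.strip() else []
    # ASSUMPTION: whitespace-only pieces are dropped rather than kept.
    return [doc.strip() for doc in text.split(separator) if doc.strip()]


def provenance_record(documents, source_name="unknown", source_url="",
                      license_name="unknown"):
    """Build one provenance entry as a JSON line (hypothetical schema)."""
    record = {
        "source_name": source_name,
        "source_url": source_url,
        "license": license_name,
        "documents": documents,
    }
    return json.dumps(record, sort_keys=True)


docs = split_documents("first doc\n\nsecond doc\n\n\n\nthird doc")
line = provenance_record(len(docs), source_name="wikipedia",
                         license_name="CC-BY-SA")
```

Appending one such line per processed source gives a log that can be replayed later to answer which sources and licenses fed a given dataset.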
Deduplicate & filter
Near-duplicate documents waste training compute and can hurt generalization. dedup removes them using MinHash similarity and also drops documents whose quality score falls below a threshold. Input is one document per line.
    # MinHash near-duplicate removal + quality filtering (one document per line)
    smedjan dedup --input docs.txt --output filtered.txt \
      --threshold 0.8 --min-quality 0.3
| Flag | Default | What it does |
|---|---|---|
| --threshold | 0.8 | MinHash similarity (0–1) above which documents are considered duplicates. |
| --min-quality | 0.3 | Minimum quality score (0–1) to keep a document. |
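For readers unfamiliar with MinHash, the sketch below shows how a similarity threshold can be applied: each document is reduced to a fixed-length signature, and the fraction of matching signature slots estimates the Jaccard similarity of the documents' shingle sets. The shingle size, number of hash functions, hash choice, and the all-pairs comparison are assumptions for illustration, not a description of smedjan's internals. The quality score is not defined in this document, so it is left out.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=128):
    """Keep the minimum hash of the shingle set under each seeded hash."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.8):
    """Keep a document only if no already-kept document exceeds the threshold."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if any(estimated_similarity(sig, other) > threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Comparing every document against every kept signature is quadratic; production pipelines usually bucket signatures with locality-sensitive hashing so only likely matches are compared.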
Mix datasets
Combine tokenized shards in fixed proportions — useful for balancing domains (e.g. 70% books, 30% web). Weights are relative.
    # blend tokenized shards with weights (path:weight, comma-separated)
    smedjan mix --shards books.bin:0.7,web.bin:0.3 --output train.bin
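One way to read "relative" weights is that they are normalized by their sum before use. The sketch below parses a `path:weight` list in the format shown above, normalizes it, and draws a shard for each chunk in proportion to its share. How smedjan actually interleaves shards is not stated here; `parse_shards` and `sample_shard_order` are hypothetical helpers, and the seeded random draw is an assumption.

```python
import random


def parse_shards(spec):
    """Parse 'path:weight,path:weight' into (path, probability) pairs."""
    pairs = []
    for item in spec.split(","):
        # Split on the last colon so paths containing ':' still parse.
        path, _, weight = item.rpartition(":")
        pairs.append((path, float(weight)))
    total = sum(weight for _, weight in pairs)
    return [(path, weight / total) for path, weight in pairs]


def sample_shard_order(shards, num_draws, seed=0):
    """Choose which shard each of num_draws chunks is read from."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    paths = [path for path, _ in shards]
    probs = [prob for _, prob in shards]
    return rng.choices(paths, weights=probs, k=num_draws)


shards = parse_shards("books.bin:7,web.bin:3")
order = sample_shard_order(shards, num_draws=1000)
```

Under this reading, `books.bin:7,web.bin:3` and `books.bin:0.7,web.bin:0.3` describe the same 70/30 blend.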
Reproducibility
smedjan hash --file train.bin prints the file's SHA-256 digest, so you can pin exactly which data produced a checkpoint.
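If you want to verify a dataset from a script, a SHA-256 digest can be computed with the Python standard library alone. The sketch below streams the file in chunks so large datasets need not fit in memory. `sha256_file` is a hypothetical helper, and whether its hex string matches the exact output format of smedjan hash is an assumption.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest next to each checkpoint makes it possible to confirm later that a training run used the exact same bytes.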