Objective#
The goal of this project is to design, implement, and train a small-scale Large Language Model (LLM) from scratch, progressing through the full training lifecycle:
- Pre-training on large-scale unlabeled text.
- Supervised Fine-Tuning (SFT) on high-quality instruction-following datasets.
- Parameter-Efficient Fine-Tuning with LoRA (Low-Rank Adaptation) for resource-efficient adaptation.
- Direct Preference Optimization (DPO) for aligning the model with human preferences.
The project aims to serve as a practical, hands-on implementation of LLM training concepts from recent research.
Environment Setup#
- macOS with M Series chip ‼️ MPS is not optimized for training

  Tested on a macOS MPS device (M4, 64GB RAM):

  ```
  PyTorch version: 2.3.0
  MPS available: True
  Matrix 1024x1024: 10.40 TFLOPS | Time: 20.65ms
  Matrix 2048x2048: 13.45 TFLOPS | Time: 127.76ms
  Matrix 4096x4096: 13.49 TFLOPS | Time: 1018.53ms
  Matrix 8192x8192: 12.82 TFLOPS | Time: 8573.45ms
  Matrix 16384x16384: 9.37 TFLOPS | Time: 93871.68ms
  ```
- Windows with CUDA (recommended)

  Tested on a CUDA device (RTX 2080 Ti, 11GB memory):

  ```
  PyTorch version: 2.8.0+cu129
  CUDA available: True
  Matrix 1024x1024: 65.62 TFLOPS | Time: 3.27ms
  Matrix 2048x2048: 634.46 TFLOPS | Time: 2.71ms
  Matrix 4096x4096: 4447.00 TFLOPS | Time: 3.09ms
  Matrix 8192x8192: 34163.30 TFLOPS | Time: 3.22ms
  Matrix 16384x16384: 199933.93 TFLOPS | Time: 4.40ms
  ```
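Below is a hypothetical sketch of the kind of matmul benchmark that could produce output like the above; the script and exact methodology are assumptions, not the project's actual code. One caveat worth knowing: CUDA kernels launch asynchronously, so timing without an explicit synchronize measures only launch overhead, which would explain CUDA TFLOPS figures far above the 2080 Ti's hardware peak. The sketch synchronizes before reading the clock:

```python
# Hypothetical matmul TFLOPS benchmark (not the project's actual script).
import time

import torch

def bench_matmul(n: int, device: str, iters: int = 10) -> None:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    for _ in range(3):  # warm-up so lazy init/compilation is not timed
        a @ b
    # Pick the right barrier for the device; a no-op on CPU.
    sync = (torch.cuda.synchronize if device == "cuda"
            else torch.mps.synchronize if device == "mps"
            else (lambda: None))
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    sync()  # without this, CUDA times only the async kernel launches
    elapsed = (time.perf_counter() - start) / iters
    tflops = 2 * n**3 / elapsed / 1e12  # ~2*n^3 FLOPs per n x n matmul
    print(f"Matrix {n}x{n}: {tflops:.2f} TFLOPS | Time: {elapsed * 1e3:.2f}ms")

device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"PyTorch version: {torch.__version__}")
for n in (1024, 2048, 4096, 8192, 16384):
    bench_matmul(n, device)
```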
- Python packages for CUDA support (verification and LoRA sketches follow this list)
  - torch ‼️ Be careful: the PyTorch wheel's CUDA version must match your installed CUDA (12.6, 12.8, 12.9): `pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129`
  - transformers: `pip install transformers`, or NVIDIA's Transformer Engine for CUDA: `pip3 install --no-build-isolation transformer_engine[pytorch]`
  - peft: `pip install peft`
  - CUDA Toolkit
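After installing, a quick sanity check can confirm the CUDA/PyTorch pairing; these are standard attributes of each library, and the commented values are only examples:

```python
import torch
import transformers
import peft

print(torch.__version__)          # e.g. 2.8.0+cu129
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # False usually means a driver/wheel CUDA mismatch
print(transformers.__version__, peft.__version__)
```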
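Since peft is listed for the LoRA stage, here is a minimal sketch of attaching LoRA adapters to a causal LM. The base model name and hyperparameters are illustrative assumptions, not the project's configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; the project trains its own small LLM.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrap attention projections with low-rank adapters; only these train.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable vs. total parameter counts
```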
Dataset#
- tokenizer dataset
- pre-training dataset
- SFT (Supervised Fine-Tuning) dataset
- DPO (Direct Preference Optimization) dataset
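The exact file formats depend on the datasets chosen; as a rough orientation, these are the record shapes conventionally used at each stage (all field names are assumptions, not the project's schema):

```python
# Conventional JSONL-style record shapes per training stage (illustrative only).
pretrain_record = {"text": "raw unlabeled document text ..."}

sft_record = {
    "instruction": "Summarize the following paragraph.",
    "input": "optional context ...",
    "output": "target response the model should produce",
}

dpo_record = {
    "prompt": "user prompt ...",
    "chosen": "preferred response",     # preferred by human raters
    "rejected": "dispreferred response",
}
```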