This project is a full ground-up implementation of the Transformer architecture for Neural Machine Translation (NMT), specifically for the English → Nepali language pair. Every component — from Multi-Head Attention to Beam Search decoding — was implemented by reading and understanding the original paper:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. NeurIPS 2017. arXiv:1706.03762
No transformers library. No pre-trained weights. No shortcuts.
The model follows the original Transformer encoder-decoder architecture with the following components:

**Encoder**

| Component | Details |
|---|---|
| Input Embedding | nn.Embedding(en_vocab=32000, d_model=256) |
| Positional Encoding | Sinusoidal — fixed, not learned |
| Encoder Layers | 6 stacked layers |
| Multi-Head Attention | 8 heads, head_dim = 32 |
| Feed Forward Network | d_model=256 → ff_dim=1024 → d_model=256 |
| Add & Norm | Residual connection + LayerNorm |
**Decoder**

| Component | Details |
|---|---|
| Input Embedding | nn.Embedding(ne_vocab=16000, d_model=256) |
| Positional Encoding | Sinusoidal — same as encoder |
| Decoder Layers | 6 stacked layers |
| Masked Multi-Head Attention | Causal mask prevents future token leakage |
| Cross-Attention | Queries from decoder, Keys/Values from encoder |
| Feed Forward Network | Same as encoder |
| Add & Norm | Residual connection + LayerNorm |
| Output Projection | nn.Linear(d_model=256, ne_vocab=16000) |
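The fixed sinusoidal positional encoding shared by encoder and decoder can be sketched as follows (a minimal version; the function name and tensor layout are mine, following the paper's formulation with d_model=256 and MAX_LEN=50):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed (not learned) encoding from Vaswani et al. (2017), Sec. 3.5."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 256)  # (MAX_LEN, d_model)
```

In the model this tensor would be registered as a buffer and added to the scaled embeddings, so it contributes no trainable parameters.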
- Scaled Dot-Product Attention: scores divided by √(head_dim) so that large dot products don't push softmax into its saturated region, where gradients vanish
- Causal Masking: upper triangular mask in decoder self-attention ensures autoregressive generation
- Embedding Scaling: embeddings multiplied by √(d_model) before adding positional encoding — as per paper
- No Weight Tying: encoder and decoder embeddings are kept separate (not tied) for this language pair
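A minimal sketch of the scaled dot-product attention described above, with the optional causal mask used in decoder self-attention (the function name and `causal` flag are mine):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: (batch, heads, seq, head_dim)."""
    head_dim = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # scale by sqrt(d_k)
    if causal:
        seq = scores.size(-1)
        future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```

With `causal=True` the upper-triangular positions get zero weight, so position *i* can only attend to positions ≤ *i*.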
| Parameter | Value | Reason |
|---|---|---|
| d_model | 256 | Smaller than paper's 512 — fits dataset size |
| num_heads | 8 | Divides evenly into d_model |
| ff_dim | 1024 | 4× d_model — follows paper ratio |
| num_layers | 6 | Same as paper |
| MAX_LEN | 50 | Covers ~95% of sentence lengths in dataset |
| batch_size | 256 | Maximizes T4 GPU utilization |
| warmup_steps | 4000 | NoamOpt schedule |
| en_vocab_size | 32,000 | BPE tokenizer |
| ne_vocab_size | 16,000 | BPE tokenizer — smaller for Devanagari |
Used SentencePiece BPE (Byte Pair Encoding) tokenization separately for each language:
- English → `spm_en.model` (vocab: 32,000 tokens)
- Nepali → `spm_ne.model` (vocab: 16,000 tokens)
character_coverage=1.0 was used for both languages to ensure complete Devanagari script coverage — critical for Nepali since missing characters would silently corrupt translations.
Encoding scheme:
```
Encoder input  : [token_ids] + [EOS]  # encoder knows when the sentence ends
Decoder input  : [BOS] + [token_ids]  # decoder starts with BOS
Decoder target : [token_ids] + [EOS]  # shifted right — classic teacher forcing
```
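The scheme above, sketched as a small helper (the ids 0/1/2 for PAD/BOS/EOS are an assumption for illustration; the actual ids come from the SentencePiece models):

```python
BOS, EOS, PAD = 1, 2, 0  # assumed special-token ids

def make_training_example(src_ids, tgt_ids, max_len=50):
    enc_in  = src_ids + [EOS]   # encoder knows when the sentence ends
    dec_in  = [BOS] + tgt_ids   # decoder is primed with BOS
    dec_tgt = tgt_ids + [EOS]   # dec_in shifted left by one: teacher forcing

    def pad(seq):
        return (seq + [PAD] * max_len)[:max_len]

    return pad(enc_in), pad(dec_in), pad(dec_tgt)
```

Note that `dec_tgt[i] == dec_in[i + 1]` up to EOS, which is exactly the one-step shift teacher forcing requires.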
Implemented the learning rate schedule from the paper:
```
lr = d_model^(-0.5) × min(step^(-0.5), step × warmup_steps^(-1.5))
```
This linearly increases the learning rate for the first warmup_steps steps, then decays proportionally to the inverse square root of the step number. Critical for Transformer stability — standard Adam without warmup causes instability early in training.
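The schedule can be written directly from the formula (a sketch; it could be plugged into training via e.g. `torch.optim.lr_scheduler.LambdaLR`):

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, so with these settings the learning rate peaks at step 4000 and decays as 1/√step afterwards.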
CrossEntropyLoss(ignore_index=0) — PAD tokens are excluded from loss computation so the model doesn't waste capacity learning to predict padding.
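A small sanity check (with made-up shapes and a toy vocabulary) that `ignore_index` really removes PAD positions from the loss: perturbing the logits at padded targets leaves the loss unchanged.

```python
import torch
import torch.nn as nn

PAD = 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

torch.manual_seed(0)
logits = torch.randn(2, 5, 100)                 # (batch, seq, vocab), toy vocab
targets = torch.tensor([[ 8,  9,  2, PAD, PAD],
                        [12,  2, PAD, PAD, PAD]])

loss = criterion(logits.reshape(-1, 100), targets.reshape(-1))

perturbed = logits.clone()
perturbed[targets == PAD] += 100.0              # wreck predictions at PAD slots
loss_pad = criterion(perturbed.reshape(-1, 100), targets.reshape(-1))
# loss == loss_pad: PAD positions contribute nothing to the gradient signal
```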
- Platform: Kaggle Notebooks
- GPU: NVIDIA T4 (15GB VRAM)
- Optimization:
  - `torch.compile()` for kernel-fusion speedup
  - Dataset pinned to GPU: the entire tokenized dataset is loaded into VRAM to eliminate the CPU data-loading bottleneck
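A sketch of the GPU-pinned batching idea (sizes and variable names are mine; it falls back to CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical pre-tokenized dataset: moved to the device once, up front.
src = torch.randint(0, 32000, (1024, 50)).to(device)  # (N, MAX_LEN)
tgt = torch.randint(0, 16000, (1024, 50)).to(device)

def batches(batch_size=256):
    """Shuffled mini-batches as device-side index selects: no per-step CPU copies."""
    perm = torch.randperm(src.size(0), device=device)
    for i in range(0, src.size(0), batch_size):
        idx = perm[i:i + batch_size]
        yield src[idx], tgt[idx]
```

Compared with a standard `DataLoader`, every batch here is a pure GPU-side gather, which is what removes the CPU bottleneck described above.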
| Epoch | Loss | Accuracy | Notes |
|---|---|---|---|
| 1 | 8.50 | 9.0% | Random initialization |
| 5 | 4.30 | 27.1% | Basic patterns emerging |
| 10 | 3.10 | 38.5% | Structural learning |
| 20 | 1.74 | 57.5% | Vocabulary mapping |
| 30 | 0.86 | 76.9% | Strong token prediction |
| 40 | 0.50 | 85.0% | Near convergence |
| 55 | 0.37 | 88.93% | ✅ Sweet spot — best generalization |
| 60 | 0.13 | 96.5% | ❌ Overfitting, loss unstable |
A key experimental finding of this project was identifying the exact epoch where overfitting begins:
Epoch 55 → Loss: 0.366 | Accuracy: 88.93% ✅ Best generalization
Epoch 57 → Loss: 0.092 | Accuracy: 98.09% ⚠️ 4x loss drop in 2 epochs
Epoch 58 → Loss: 0.069 | Accuracy: 98.83% ❌ Memorizing training data
Epoch 60 → Loss: 0.137 | Accuracy: 96.52% ❌ Loss bouncing — unstable
A ~5x loss drop in 3 epochs (0.37 → 0.07) is a clear signature of memorization rather than generalization. Inference quality at epoch 60 was demonstrably worse than epoch 55 on unseen sentences, confirming overfitting.
Conclusion: Epoch 55 checkpoint is used for inference. This experimentally validates the importance of early stopping in sequence-to-sequence tasks.
Implemented Beam Search decoding with a configurable beam size:

```python
translate_beam_search("He is playing football.", beam_size=5)
```

At each decoding step:
- Expand each beam by the top-k next tokens
- Score candidates by cumulative log probability
- Keep the top `beam_size` candidates
- Stop when all beams hit `[EOS]` or `max_len` is reached
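The loop above can be sketched generically (the `step_log_probs` callback stands in for the real decoder forward pass; all names here are mine):

```python
import math

def beam_search(step_log_probs, bos, eos, beam_size=5, max_len=50):
    """step_log_probs(prefix) -> {token: log-prob of that token coming next}."""
    beams = [([bos], 0.0)]   # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:          # beam already complete
                finished.append((seq, score))
                continue
            for tok, lp in step_log_probs(seq).items():  # expand by next tokens
                candidates.append((seq + [tok], score + lp))
        if not candidates:              # every beam has hit EOS
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the best beam_size candidates
    finished.extend(b for b in beams if b[0][-1] == eos)
    return max(finished or beams, key=lambda c: c[1])[0]
```

Scoring by cumulative log probability (rather than multiplying raw probabilities) keeps long hypotheses numerically stable.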
| English | Nepali (Model) | Notes |
|---|---|---|
| He is playing football. | झिक्नु पर्ने मार्ग हो । | Partial — got structure right |
| She went to the market. | तिनी को घरमा जानु पर्नेहोल | Got "she" (तिनी) and "go" (जानु) ✅ |
| The weather is very cold today. | धेरै वास्तविक मौसम अस्पष्ट छ | Got "very" (धेरै) and "weather" (मौसम) ✅ |
| I love you. | म तपाईंलाई जान्छु | Got "I" (म) and "you" (तपाईंलाई) ✅ |
| Now get some sleep. | केहीं सूत्र निदाई छ | Got "sleep" (निदाई) ✅ |
The model correctly identifies key content words (pronouns, nouns, some verbs) and produces grammatically structured Nepali output, despite imperfect semantic accuracy. This is expected behavior for a model of this scale trained on a moderately sized dataset.
```
├── translator.py        # Full training code
├── README.md            # This file
├── training_curves.png  # Loss & accuracy plots
├── spm_en.model         # English BPE tokenizer
├── spm_ne.model         # Nepali BPE tokenizer
└── checkpoint_ep55.pt   # Best model weights (epoch 55)
```
- CPU-GPU Bottleneck: Initial training ran at ~11 it/s due to CPU data-loading overhead. Pinning the entire tokenized dataset to GPU VRAM eliminated this bottleneck entirely.
- `torch.compile()`: PyTorch 2.0's compile significantly reduced per-step time by fusing operations. The first epoch is slow (tracing); subsequent epochs are fast.
- Overfitting in NMT: Overfitting in translation is subtle — training accuracy can reach 98% while actual translation quality degrades. Always evaluate on held-out sentences, not just training metrics.
- NoamOpt is critical: Standard Adam without the warmup schedule caused training instability in early epochs. The warmup period is essential for Transformer convergence.
- BPE Tokenization: Separate tokenizers per language with `character_coverage=1.0` are essential for Devanagari script — shared tokenizers or incomplete coverage silently degrade Nepali output.
- Sweet Spot Identification: The loss curve showed clear convergence around epochs 40-50, with a sharp drop indicating memorization after epoch 55. Visual inspection of the loss curve combined with qualitative inference testing is the most reliable way to pick the early-stopping point.
- Vaswani et al. (2017) — Attention is All You Need
- Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer
- The Annotated Transformer — Harvard NLP
Sajak Basnet, ML Researcher
Built this to deeply understand Transformer internals by implementing every component from the paper — not just calling library functions.
"I love you" → "म तपाईंलाई जान्छु" — the model learned "I" and "you" correctly, just confused love with going. Training continues.
