This project is a full ground-up implementation of the Transformer architecture for Neural Machine Translation (NMT), specifically for the English → Nepali language pair. Every component — from Multi-Head Attention to Beam Search decoding — was implemented by reading and understanding the original paper:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. NeurIPS 2017. arXiv:1706.03762
No transformers library. No pre-trained weights. No shortcuts.
The model follows the original Transformer encoder-decoder architecture with the following components:

**Encoder**

| Component | Details |
|---|---|
| Input Embedding | nn.Embedding(en_vocab=32000, d_model=256) |
| Positional Encoding | Sinusoidal — fixed, not learned |
| Encoder Layers | 6 stacked layers |
| Multi-Head Attention | 8 heads, head_dim = 32 |
| Feed Forward Network | d_model=256 → ff_dim=1024 → d_model=256 |
| Add & Norm | Residual connection + LayerNorm |
**Decoder**

| Component | Details |
|---|---|
| Input Embedding | nn.Embedding(ne_vocab=16000, d_model=256) |
| Positional Encoding | Sinusoidal — same as encoder |
| Decoder Layers | 6 stacked layers |
| Masked Multi-Head Attention | Causal mask prevents future token leakage |
| Cross-Attention | Queries from decoder, Keys/Values from encoder |
| Feed Forward Network | Same as encoder |
| Add & Norm | Residual connection + LayerNorm |
| Output Projection | nn.Linear(d_model=256, ne_vocab=16000) |
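The fixed sinusoidal positional encoding shared by encoder and decoder can be sketched as follows (a minimal version; the function name and tensor layout are mine, following the paper's formulation with d_model=256 and MAX_LEN=50):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed (not learned) encoding from Vaswani et al. (2017), Sec. 3.5."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 256)  # (MAX_LEN, d_model)
```

In the model this tensor would be registered as a buffer and added to the scaled embeddings, so it contributes no trainable parameters.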
- Scaled Dot-Product Attention: scores divided by √(head_dim) so that large dot products don't push softmax into its saturated region, where gradients vanish
- Causal Masking: upper triangular mask in decoder self-attention ensures autoregressive generation
- Embedding Scaling: embeddings multiplied by √(d_model) before adding positional encoding — as per paper
- No Weight Tying: encoder and decoder embeddings are kept separate (not tied) for this language pair
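A minimal sketch of the scaled dot-product attention described above, with the optional causal mask used in decoder self-attention (the function name and `causal` flag are mine):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: (batch, heads, seq, head_dim)."""
    head_dim = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # scale by sqrt(d_k)
    if causal:
        seq = scores.size(-1)
        future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```

With `causal=True` the upper-triangular positions get zero weight, so position *i* can only attend to positions ≤ *i*.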
| Parameter | Value | Reason |
|---|---|---|
| d_model | 256 | Smaller than paper's 512 — fits dataset size |
| num_heads | 8 | Divides evenly into d_model |
| ff_dim | 1024 | 4× d_model — follows paper ratio |
| num_layers | 6 | Same as paper |
| MAX_LEN | 50 | Covers ~95% of sentence lengths in dataset |
| batch_size | 256 | Maximizes T4 GPU utilization |
| warmup_steps | 4000 | NoamOpt schedule |
| en_vocab_size | 32,000 | BPE tokenizer |
| ne_vocab_size | 16,000 | BPE tokenizer — smaller for Devanagari |
Used SentencePiece BPE (Byte Pair Encoding) tokenization separately for each language:
- English → `spm_en.model` (vocab: 32,000 tokens)
- Nepali → `spm_ne.model` (vocab: 16,000 tokens)
character_coverage=1.0 was used for both languages to ensure complete Devanagari script coverage — critical for Nepali since missing characters would silently corrupt translations.
Encoding scheme:
```
Encoder input  : [token_ids] + [EOS]  # encoder knows when the sentence ends
Decoder input  : [BOS] + [token_ids]  # decoder starts with BOS
Decoder target : [token_ids] + [EOS]  # shifted right — classic teacher forcing
```
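The scheme above, sketched as a small helper (the ids 0/1/2 for PAD/BOS/EOS are an assumption for illustration; the actual ids come from the SentencePiece models):

```python
BOS, EOS, PAD = 1, 2, 0  # assumed special-token ids

def make_training_example(src_ids, tgt_ids, max_len=50):
    enc_in  = src_ids + [EOS]   # encoder knows when the sentence ends
    dec_in  = [BOS] + tgt_ids   # decoder is primed with BOS
    dec_tgt = tgt_ids + [EOS]   # dec_in shifted left by one: teacher forcing

    def pad(seq):
        return (seq + [PAD] * max_len)[:max_len]

    return pad(enc_in), pad(dec_in), pad(dec_tgt)
```

Note that `dec_tgt[i] == dec_in[i + 1]` up to EOS, which is exactly the one-step shift teacher forcing requires.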
Implemented the learning rate schedule from the paper:
```
lr = d_model^(-0.5) × min(step^(-0.5), step × warmup_steps^(-1.5))
```
This linearly increases the learning rate for the first warmup_steps steps, then decays proportionally to the inverse square root of the step number. Critical for Transformer stability — standard Adam without warmup causes instability early in training.
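The schedule can be written directly from the formula (a sketch; it could be plugged into training via e.g. `torch.optim.lr_scheduler.LambdaLR`):

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, so with these settings the learning rate peaks at step 4000 and decays as 1/√step afterwards.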
CrossEntropyLoss(ignore_index=0) — PAD tokens are excluded from loss computation so the model doesn't waste capacity learning to predict padding.
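A small sanity check (with made-up shapes and a toy vocabulary) that `ignore_index` really removes PAD positions from the loss: perturbing the logits at padded targets leaves the loss unchanged.

```python
import torch
import torch.nn as nn

PAD = 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

torch.manual_seed(0)
logits = torch.randn(2, 5, 100)                 # (batch, seq, vocab), toy vocab
targets = torch.tensor([[ 8,  9,  2, PAD, PAD],
                        [12,  2, PAD, PAD, PAD]])

loss = criterion(logits.reshape(-1, 100), targets.reshape(-1))

perturbed = logits.clone()
perturbed[targets == PAD] += 100.0              # wreck predictions at PAD slots
loss_pad = criterion(perturbed.reshape(-1, 100), targets.reshape(-1))
# loss == loss_pad: PAD positions contribute nothing to the gradient signal
```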
- Platform: Kaggle Notebooks
- GPU: NVIDIA T4 (15GB VRAM)
- Optimization:
  - `torch.compile()` for kernel-fusion speedup
  - Dataset pinned to GPU: the entire tokenized dataset is loaded into VRAM to eliminate the CPU data-loading bottleneck
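A sketch of the GPU-pinned batching idea (sizes and variable names are mine; it falls back to CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical pre-tokenized dataset: moved to the device once, up front.
src = torch.randint(0, 32000, (1024, 50)).to(device)  # (N, MAX_LEN)
tgt = torch.randint(0, 16000, (1024, 50)).to(device)

def batches(batch_size=256):
    """Shuffled mini-batches as device-side index selects: no per-step CPU copies."""
    perm = torch.randperm(src.size(0), device=device)
    for i in range(0, src.size(0), batch_size):
        idx = perm[i:i + batch_size]
        yield src[idx], tgt[idx]
```

Compared with a standard `DataLoader`, every batch here is a pure GPU-side gather, which is what removes the CPU bottleneck described above.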
| Epoch | Loss | Accuracy | Notes |
|---|---|---|---|
| 1 | 8.50 | 9.0% | Random initialization |
| 5 | 4.30 | 27.1% | Basic patterns emerging |
| 10 | 3.10 | 38.5% | Structural learning |
| 20 | 1.74 | 57.5% | Vocabulary mapping |
| 30 | 0.86 | 76.9% | Strong token prediction |
| 40 | 0.50 | 85.0% | Near convergence |
| 55 | 0.37 | 88.93% | ✅ Sweet spot — best generalization |
| 60 | 0.13 | 96.5% | ❌ Overfitting, loss unstable |
A key experimental finding of this project was identifying the exact epoch where overfitting begins:
Epoch 55 → Loss: 0.366 | Accuracy: 88.93% ✅ Best generalization
Epoch 57 → Loss: 0.092 | Accuracy: 98.09% ⚠️ 4x loss drop in 2 epochs
Epoch 58 → Loss: 0.069 | Accuracy: 98.83% ❌ Memorizing training data
Epoch 60 → Loss: 0.137 | Accuracy: 96.52% ❌ Loss bouncing — unstable
A ~5x loss drop in 3 epochs (0.37 → 0.07) is a clear signature of memorization rather than generalization. Inference quality at epoch 60 was demonstrably worse than epoch 55 on unseen sentences, confirming overfitting.
Conclusion: Epoch 55 checkpoint is used for inference. This experimentally validates the importance of early stopping in sequence-to-sequence tasks.
Implemented Beam Search decoding with a configurable beam size:

```python
translate_beam_search("He is playing football.", beam_size=5)
```

At each decoding step:
- Expand each beam by the top-k next tokens
- Score candidates by cumulative log probability
- Keep the top `beam_size` candidates
- Stop when all beams hit `[EOS]` or `max_len` is reached
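The loop above can be sketched generically (the `step_log_probs` callback stands in for the real decoder forward pass; all names here are mine):

```python
import math

def beam_search(step_log_probs, bos, eos, beam_size=5, max_len=50):
    """step_log_probs(prefix) -> {token: log-prob of that token coming next}."""
    beams = [([bos], 0.0)]   # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:          # beam already complete
                finished.append((seq, score))
                continue
            for tok, lp in step_log_probs(seq).items():  # expand by next tokens
                candidates.append((seq + [tok], score + lp))
        if not candidates:              # every beam has hit EOS
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the best beam_size candidates
    finished.extend(b for b in beams if b[0][-1] == eos)
    return max(finished or beams, key=lambda c: c[1])[0]
```

Scoring by cumulative log probability (rather than multiplying raw probabilities) keeps long hypotheses numerically stable.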
| English | Nepali (Model) | Notes |
|---|---|---|
| He is playing football. | झिक्नु पर्ने मार्ग हो । | Partial — got structure right |
| She went to the market. | तिनी को घरमा जानु पर्नेहोल | Got "she" (तिनी) and "go" (जानु) ✅ |
| The weather is very cold today. | धेरै वास्तविक मौसम अस्पष्ट छ | Got "very" (धेरै) and "weather" (मौसम) ✅ |
| I love you. | म तपाईंलाई जान्छु | Got "I" (म) and "you" (तपाईंलाई) ✅ |
| Now get some sleep. | केहीं सूत्र निदाई छ | Got "sleep" (निदाई) ✅ |
The model correctly identifies key content words (pronouns, nouns, some verbs) and produces grammatically structured Nepali output, despite imperfect semantic accuracy. This is expected behavior for a model of this scale trained on a moderately sized dataset.
```
├── translator.py        # Full training code
├── README.md            # This file
├── training_curves.png  # Loss & accuracy plots
├── spm_en.model         # English BPE tokenizer
├── spm_ne.model         # Nepali BPE tokenizer
└── checkpoint_ep55.pt   # Best model weights (epoch 55)
```
- CPU-GPU Bottleneck: Initial training ran at ~11 it/s due to CPU data-loading overhead. Pinning the entire tokenized dataset to GPU VRAM eliminated this bottleneck entirely.
- `torch.compile()`: PyTorch 2.0's compile significantly reduced per-step time by fusing operations. The first epoch is slow (tracing); subsequent epochs are fast.
- Overfitting in NMT: Overfitting in translation is subtle — training accuracy can reach 98% while actual translation quality degrades. Always evaluate on held-out sentences, not just training metrics.
- NoamOpt is critical: Standard Adam without the warmup schedule caused training instability in early epochs. The warmup period is essential for Transformer convergence.
- BPE Tokenization: Separate tokenizers per language with `character_coverage=1.0` are essential for Devanagari script — shared tokenizers or incomplete coverage silently degrade Nepali output.
- Sweet Spot Identification: The loss curve showed clear convergence around epochs 40-50, with a sharp drop indicating memorization after epoch 55. Visual inspection of the loss curve combined with qualitative inference testing is the most reliable way to pick the early-stopping point.
- Vaswani et al. (2017) — Attention is All You Need
- Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer
- The Annotated Transformer — Harvard NLP
Sajak Basnet, ML Researcher
Built this to deeply understand Transformer internals by implementing every component from the paper — not just calling library functions.
"I love you" → "म तपाईंलाई जान्छु" — the model learned "I" and "you" correctly, just confused love with going. Training continues.
