🔤 English → Nepali Neural Machine Translator

Transformer Implemented from Scratch — "Attention is All You Need"

![Training Curves](training_curves.png)


📌 Overview

This project is a full ground-up implementation of the Transformer architecture for Neural Machine Translation (NMT), specifically for the English → Nepali language pair. Every component — from Multi-Head Attention to Beam Search decoding — was implemented by reading and understanding the original paper:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. NeurIPS 2017. arXiv:1706.03762

No transformers library. No pre-trained weights. No shortcuts.


🏗️ Architecture

The model follows the original Transformer encoder-decoder architecture with the following components:

Encoder

| Component | Details |
| --- | --- |
| Input Embedding | `nn.Embedding(en_vocab=32000, d_model=256)` |
| Positional Encoding | Sinusoidal — fixed, not learned |
| Encoder Layers | 6 stacked layers |
| Multi-Head Attention | 8 heads, `head_dim = 32` |
| Feed Forward Network | `d_model=256 → ff_dim=1024 → d_model=256` |
| Add & Norm | Residual connection + LayerNorm |
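The fixed sinusoidal positional encoding can be sketched as follows (a minimal version following the paper's formula, using the `d_model=256` and `MAX_LEN=50` values from this project; the actual module layout in `translator.py` may differ):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))             # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dims: cosine
    return pe  # added to the (scaled) token embeddings, no learned parameters

pe = sinusoidal_positional_encoding(max_len=50, d_model=256)
```

Because the encodings are deterministic functions of position, they need no training and extrapolate to any position up to `max_len`.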

Decoder

| Component | Details |
| --- | --- |
| Input Embedding | `nn.Embedding(ne_vocab=16000, d_model=256)` |
| Positional Encoding | Sinusoidal — same as encoder |
| Decoder Layers | 6 stacked layers |
| Masked Multi-Head Attention | Causal mask prevents future token leakage |
| Cross-Attention | Queries from decoder, Keys/Values from encoder |
| Feed Forward Network | Same as encoder |
| Add & Norm | Residual connection + LayerNorm |
| Output Projection | `nn.Linear(d_model=256, ne_vocab=16000)` |

Key Design Choices

  • Scaled Dot-Product Attention: scores divided by √(head_dim) so large dot products don't push the softmax into regions with vanishing gradients
  • Causal Masking: upper triangular mask in decoder self-attention ensures autoregressive generation
  • Embedding Scaling: embeddings multiplied by √(d_model) before adding positional encoding — as per paper
  • No Weight Tying: encoder and decoder embeddings are kept separate (not tied) for this language pair
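The first two design choices can be sketched together (a minimal, hedged version; the actual multi-head plumbing in `translator.py` may be structured differently):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal: bool = False):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(head_dim)) V, optional causal mask."""
    head_dim = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)    # (..., seq_q, seq_k)
    if causal:
        seq_q, seq_k = scores.shape[-2:]
        # Upper-triangular mask: position i may only attend to positions <= i
        mask = torch.triu(torch.ones(seq_q, seq_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# (batch, heads, seq, head_dim) matching the table above: 8 heads, head_dim 32
q = k = v = torch.randn(2, 8, 50, 32)
out = scaled_dot_product_attention(q, k, v, causal=True)
```

With the causal mask, position 0 can only attend to itself, so its output equals its own value vector — a quick sanity check that no future information leaks.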

📊 Hyperparameters

| Parameter | Value | Reason |
| --- | --- | --- |
| `d_model` | 256 | Smaller than paper's 512 — fits dataset size |
| `num_heads` | 8 | Divides evenly into `d_model` |
| `ff_dim` | 1024 | 4× `d_model` — follows paper ratio |
| `num_layers` | 6 | Same as paper |
| `MAX_LEN` | 50 | Covers ~95% of sentence lengths in dataset |
| `batch_size` | 256 | Maximizes T4 GPU utilization |
| `warmup_steps` | 4000 | NoamOpt schedule |
| `en_vocab_size` | 32,000 | BPE tokenizer |
| `ne_vocab_size` | 16,000 | BPE tokenizer — smaller for Devanagari |

🔤 Tokenization

Used SentencePiece BPE (Byte Pair Encoding) tokenization separately for each language:

English  → spm_en.model  (vocab: 32,000 tokens)
Nepali   → spm_ne.model  (vocab: 16,000 tokens)

character_coverage=1.0 was used for both languages to ensure complete Devanagari script coverage — critical for Nepali since missing characters would silently corrupt translations.

Encoding scheme:

Encoder input  : [token_ids] + [EOS]           # encoder knows when sentence ends
Decoder input  : [BOS] + [token_ids]           # decoder starts with BOS
Decoder target : [token_ids] + [EOS]           # decoder input shifted by one step — classic teacher forcing
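The scheme above can be sketched with plain lists (the special-token ids `PAD=0`, `BOS=1`, `EOS=2` are hypothetical — the actual ids depend on the SentencePiece configuration):

```python
# Hypothetical special-token ids; the real ones come from the SentencePiece models
PAD, BOS, EOS = 0, 1, 2

def make_training_example(src_ids, tgt_ids, max_len=50):
    """Build encoder input, decoder input, and decoder target, padded to max_len."""
    def pad(seq):
        return seq[:max_len] + [PAD] * (max_len - len(seq))
    encoder_input  = pad(src_ids + [EOS])    # encoder sees the full source + EOS
    decoder_input  = pad([BOS] + tgt_ids)    # decoder is primed with BOS
    decoder_target = pad(tgt_ids + [EOS])    # decoder input shifted one step left
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_tgt = make_training_example([11, 12, 13], [21, 22])
```

At training time the decoder predicts `decoder_target[i]` given `decoder_input[:i+1]`, which is exactly teacher forcing.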

⚙️ Training

Optimizer — NoamOpt (Warm-up Schedule)

Implemented the learning rate schedule from the paper:

lr = d_model^(-0.5) × min(step^(-0.5), step × warmup_steps^(-1.5))

This linearly increases the learning rate for the first warmup_steps steps, then decays proportionally to the inverse square root of the step number. Critical for Transformer stability — standard Adam without warmup causes instability early in training.
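The schedule is a one-liner (a direct transcription of the formula above, with this project's `d_model=256` and `warmup_steps=4000` as defaults):

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, where the learning rate peaks; a function like this can be plugged into a scheduler such as `torch.optim.lr_scheduler.LambdaLR`.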

Loss Function

CrossEntropyLoss(ignore_index=0) — PAD tokens are excluded from loss computation so the model doesn't waste capacity learning to predict padding.
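In PyTorch this is a one-line configuration; the sketch below (with illustrative shapes matching this project's `MAX_LEN=50` and `ne_vocab=16000`) shows that positions holding PAD (id 0) contribute nothing to the average:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)  # index 0 = PAD

# logits: (batch, seq, vocab) must be flattened for CrossEntropyLoss
logits  = torch.randn(4, 50, 16000)
targets = torch.randint(1, 16000, (4, 50))
targets[:, 40:] = 0                              # simulate trailing padding
loss = criterion(logits.view(-1, 16000), targets.view(-1))
```

The result is identical to computing the unmasked loss over only the non-PAD positions, so the model never spends gradient signal on predicting padding.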

Hardware

  • Platform: Kaggle Notebooks
  • GPU: NVIDIA T4 (15GB VRAM)
  • Optimization: torch.compile() for kernel fusion speedup
  • Dataset pinned to GPU: entire tokenized dataset loaded into VRAM to eliminate CPU bottleneck

📈 Training Results

Loss & Accuracy Curves

![Training Curves](training_curves.png)

Epoch Milestones

| Epoch | Loss | Accuracy | Notes |
| --- | --- | --- | --- |
| 1 | 8.50 | 9.0% | Random initialization |
| 5 | 4.30 | 27.1% | Basic patterns emerging |
| 10 | 3.10 | 38.5% | Structural learning |
| 20 | 1.74 | 57.5% | Vocabulary mapping |
| 30 | 0.86 | 76.9% | Strong token prediction |
| 40 | 0.50 | 85.0% | Near convergence |
| 55 | 0.37 | 88.93% | ✅ Sweet spot — best generalization |
| 60 | 0.13 | 96.5% | ⚠️ Overfitting begins |

Overfitting Analysis

A key experimental finding of this project was identifying the exact epoch where overfitting begins:

Epoch 55 → Loss: 0.366  | Accuracy: 88.93%  ✅ Best generalization
Epoch 57 → Loss: 0.092  | Accuracy: 98.09%  ⚠️ 4x loss drop in 2 epochs
Epoch 58 → Loss: 0.069  | Accuracy: 98.83%  ❌ Memorizing training data
Epoch 60 → Loss: 0.137  | Accuracy: 96.52%  ❌ Loss bouncing — unstable

A ~5x loss drop in 3 epochs (0.37 → 0.07) is a clear signature of memorization rather than generalization. Inference quality at epoch 60 was demonstrably worse than epoch 55 on unseen sentences, confirming overfitting.

Conclusion: Epoch 55 checkpoint is used for inference. This experimentally validates the importance of early stopping in sequence-to-sequence tasks.


🔍 Inference — Beam Search

Implemented Beam Search decoding with configurable beam size:

translate_beam_search("He is playing football.", beam_size=5)

At each decoding step:

  1. Expand each beam by top-k next tokens
  2. Score candidates by cumulative log probability
  3. Keep top beam_size candidates
  4. Stop when all beams hit [EOS] or max_len reached
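The four steps above can be sketched generically (a minimal version: `step_fn` is a hypothetical stand-in for one decoder forward pass that returns the top-k `(token, log_prob)` candidates for a prefix; the real `translate_beam_search` also runs the encoder and tokenizers):

```python
import math

def beam_search(step_fn, bos, eos, beam_size=5, max_len=50):
    """Generic beam search over step_fn(prefix) -> [(token, log_prob), ...]."""
    beams = [([bos], 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(max_len):
        if all(seq[-1] == eos for seq, _ in beams):
            break  # every beam has emitted EOS
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))          # keep finished beams as-is
                continue
            for token, logp in step_fn(seq):             # 1. expand by top-k tokens
                candidates.append((seq + [token], score + logp))  # 2. cumulative score
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]  # 3. prune
    return beams[0][0]  # highest-scoring sequence

# Toy step_fn: after BOS (1) prefer token 5, then prefer EOS (2)
def step_fn(prefix):
    if prefix[-1] == 1:
        return [(5, math.log(0.9)), (6, math.log(0.1))]
    return [(2, math.log(0.8)), (5, math.log(0.2))]

result = beam_search(step_fn, bos=1, eos=2, beam_size=3)  # → [1, 5, 2]
```

Scoring by cumulative log probability (rather than multiplying raw probabilities) keeps the arithmetic numerically stable for long sequences.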

Sample Translations (Epoch 55)

| English | Nepali (Model) | Notes |
| --- | --- | --- |
| He is playing football. | झिक्नु पर्ने मार्ग हो । | Partial — got structure right |
| She went to the market. | तिनी को घरमा जानु पर्नेहोल | Got "she" (तिनी) and "go" (जानु) ✅ |
| The weather is very cold today. | धेरै वास्तविक मौसम अस्पष्ट छ | Got "very" (धेरै) and "weather" (मौसम) ✅ |
| I love you. | म तपाईंलाई जान्छु | Got "I" (म) and "you" (तपाईंलाई) ✅ |
| Now get some sleep. | केहीं सूत्र निदाई छ | Got "sleep" (निदाई) ✅ |

The model correctly identifies key content words (pronouns, nouns, some verbs) and produces grammatically structured Nepali output, despite imperfect semantic accuracy. This is expected behavior for a model of this scale trained on a moderately sized dataset.


🗂️ Repository Structure

├── translator.py           # Full training code
├── README.md               # This file
├── training_curves.png     # Loss & accuracy plots
├── spm_en.model            # English BPE tokenizer
├── spm_ne.model            # Nepali BPE tokenizer
└── checkpoint_ep55.pt      # Best model weights (epoch 55)

🔬 Key Learnings & Observations

  1. CPU-GPU Bottleneck: Initial training ran at ~11 it/s due to CPU data loading overhead. Pinning the entire tokenized dataset to GPU VRAM eliminated this bottleneck entirely.

  2. torch.compile(): PyTorch 2.0's compile significantly reduced per-step time by fusing operations. First epoch is slow (tracing) — subsequent epochs are fast.

  3. Overfitting in NMT: Overfitting in translation is subtle — training accuracy can reach 98% while actual translation quality degrades. Always evaluate on held-out sentences, not just training metrics.

  4. NoamOpt is critical: Standard Adam without the warmup schedule caused training instability in early epochs. The warmup period is essential for Transformer convergence.

  5. BPE Tokenization: Separate tokenizers per language with character_coverage=1.0 is essential for Devanagari script — shared tokenizers or incomplete coverage silently degrade Nepali output.

  6. Sweet Spot Identification: The loss curve showed clear convergence around epochs 40-50, with a sharp drop indicating memorization after epoch 55. Visual inspection of the loss curve combined with qualitative inference testing is the most reliable method for identifying the early stopping point.


📚 References

  1. Vaswani et al. (2017) — Attention is All You Need
  2. Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer
  3. The Annotated Transformer — Harvard NLP

👤 Author

Sajak Basnet, ML Researcher

Built this to deeply understand Transformer internals by implementing every component from the paper — not just calling library functions.

"I love you" → "म तपाईंलाई जान्छु" — the model learned "I" and "you" correctly, just confused love with going. Training continues.

