Phase 1: Setup and Baseline
Set up a Python environment (e.g., Conda).
Install necessary libraries:
pip install transformers datasets torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers scikit-learn numpy pandas
Start with a smaller ByT5 model from the Hugging Face Hub:
google/byt5-small or google/byt5-base
These are more manageable for training on a single GPU (e.g., an NVIDIA RTX 4090).
Baseline Embedding Extraction
Load the pre-trained ByT5 model.
Implement Mean Pooling:
Get last hidden states from the encoder for all bytes in an input sentence.
Compute the mean across the sequence length dimension.
ByT5 has no [CLS] token; mean pooling is a standard approach (similar to Sentence-BERT and Sentence-T5).
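The mean-pooling step above can be sketched as a small masked-average function (the function and tensor names are illustrative, not a fixed API):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average encoder outputs over real (non-padding) positions only."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts
```

In practice, `last_hidden_state` would come from the ByT5 encoder, e.g. `model.encoder(input_ids, attention_mask=attention_mask).last_hidden_state`.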
Use the STS Benchmark (STS-B): stsb_multi_mt subset "en" (available in the datasets library).
Generate embeddings using your mean pooling method without fine-tuning.
Calculate cosine similarity between sentence pairs.
Compute the Spearman correlation with human similarity scores.
This gives you a baseline performance number.
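The baseline metric can be sketched as follows, assuming the two embedding matrices and the gold scores are already NumPy arrays (loading STS-B itself would go through the `datasets` library):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray) -> float:
    """Cosine similarity per sentence pair, then Spearman correlation with human scores."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = (e1 * e2).sum(axis=1)   # row-wise cosine similarity
    return float(spearmanr(sims, gold).correlation)
```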
Phase 2: Contrastive Fine-Tuning
Focus on contrastive learning: it is well established for sentence embeddings and aligns with your proposal.
Follow the Sentence-T5 approach.
Use the sentence-transformers library.
Provides contrastive learning components: dataloaders, loss functions (e.g., MultipleNegativesRankingLoss).
You’ll need to adapt it for ByT5’s byte-level tokenization and model structure.
Train the model to:
Pull embeddings of similar sentences (positive pairs) closer.
Push embeddings of dissimilar sentences (negative pairs) further apart.
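This objective can be sketched as an in-batch softmax contrastive loss, which is essentially what `MultipleNegativesRankingLoss` computes (the scale of 20 matches the library's default for cosine similarity; treat the details here as a sketch, not the library's implementation):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                              scale: float = 20.0) -> torch.Tensor:
    """Each anchor's positive is the matching row; all other rows act as negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    scores = a @ p.T * scale                                   # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # diagonal = true pairs
    return F.cross_entropy(scores, labels)
```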
SNLI and MultiNLI (available via datasets):
Treat (premise, hypothesis) pairs labeled 'entailment' as positive pairs.
Use in-batch negatives.
Use QQP (from glue, subset qqp) as an alternative.
Use duplicate question pairs as positive pairs.
NLI datasets tend to provide more robust results.
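Building positive pairs from NLI data can be sketched as a simple filter, assuming the `snli`/`multi_nli` label convention (0 = entailment, and -1 marks examples without a gold label):

```python
def entailment_pairs(examples):
    """Keep (premise, hypothesis) pairs labeled entailment as contrastive positives."""
    return [(ex["premise"], ex["hypothesis"])
            for ex in examples
            if ex["label"] == 0]  # 0 = entailment; 1/2 = neutral/contradiction; -1 = no gold label
```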
Adapt a sentence-transformers training script:
Use the ByT5 tokenizer (it operates on raw UTF-8 bytes rather than subwords, so sequences are noticeably longer).
Modify model loading to use the ByT5 encoder.
Ensure the pooling layer performs mean pooling on encoder outputs.
Use a contrastive loss function (e.g., MultipleNegativesRankingLoss).
Fine-tune byt5-small or byt5-base on contrastive datasets (e.g., SNLI + MNLI combined).
Monitor GPU memory usage and adjust batch size or use gradient accumulation.
Training may take hours to days depending on dataset size and hyperparameters.
Save model checkpoints regularly.
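Gradient accumulation can be sketched as follows (the toy model and batch shapes are placeholders; the point is to scale each micro-batch loss and step the optimizer every `accum_steps` batches):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

micro_batches = torch.randn(8, 4, 8)  # stand-in for a real dataloader
for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean() / accum_steps  # scale so accumulated gradients average out
    loss.backward()                                  # gradients accumulate across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```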
Phase 3: Evaluation
Evaluate your fine-tuned model on STS-B (same test set as the baseline).
Compare Spearman correlation before and after fine-tuning.
Transfer Tasks (Optional)
Use a few GLUE tasks to assess generalization:
Examples: MRPC (paraphrase identification), QQP.
Use sentence embeddings as features for a logistic regression classifier (as done in SentEval).
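The SentEval-style probe is just a linear classifier fit on frozen embeddings; a minimal sketch with scikit-learn (random features stand in for real sentence embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_emb, train_y, test_emb, test_y):
    """Fit logistic regression on frozen sentence embeddings, report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_y)
    return clf.score(test_emb, test_y)
```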
Focused Evaluation (Choose ONE)
Option 1: Robustness to Noise
Create a noisy version of the STS-B test set (e.g., character swaps, deletions, insertions).
Measure performance drop compared to clean version.
Optionally compare with a token-based model like Sentence-BERT.
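The noising step can be sketched as a simple character-corruption function (the noise types and the rate parameter are illustrative choices):

```python
import random

def corrupt(text: str, p: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters, delete characters, or insert random ones."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p and i + 1 < len(chars):   # swap with the next character
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * p:                    # delete this character
            i += 1
        elif r < 3 * p:                    # insert a random lowercase letter
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            out.append(chars[i])
            i += 1
        else:                              # keep the character unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)
```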
Option 2: Multilingual Transfer
Evaluate zero-shot performance on multilingual STS tasks:
Use stsb_multi_mt subsets for languages other than English.
Since the model is fine-tuned only on English, this tests cross-lingual transfer.
Option 3: Pooling Strategies
Implement other pooling strategies:
Max pooling
First byte representation
Compare performance to mean pooling on STS-B.
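The two alternatives above can be sketched against the same masked encoder outputs used for mean pooling (function names are illustrative):

```python
import torch

def max_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Element-wise max over real positions; padding is set to -inf so it never wins."""
    masked = last_hidden_state.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
    return masked.max(dim=1).values

def first_byte_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Use the representation of the first byte position (a [CLS]-style stand-in)."""
    return last_hidden_state[:, 0]
```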
Phase 4: Analysis, Write-up, and Presentation
Compare baseline and fine-tuned STS-B performance.
Analyze transfer/generalization results.
Review findings from the focused evaluation if performed.
Acknowledge limitations:
Only small models tested.
Limited hyperparameter tuning.
A single specific contrastive setup evaluated.