Phase 1: Setup and Baseline
Set up a Python environment (e.g., Conda).
Install necessary libraries:
pip install transformers datasets torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers scikit-learn numpy pandas
Start with a smaller ByT5 model from the Hugging Face Hub:
google/byt5-small or google/byt5-base
These are more manageable for training on a single GPU (e.g., an NVIDIA RTX 4090).
Baseline Embedding Extraction
Load the pre-trained ByT5 model.
Implement Mean Pooling:
Get last hidden states from the encoder for all bytes in an input sentence.
Compute the mean across the sequence length dimension.
ByT5 has no [CLS] token; mean pooling is a standard approach (similar to Sentence-BERT and Sentence-T5).
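The mean-pooling step above can be sketched as a small masked-average function (the function and tensor names are illustrative, not a fixed API):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average encoder outputs over real (non-padding) positions only."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts
```

In practice, `last_hidden_state` would come from the ByT5 encoder, e.g. `model.encoder(input_ids, attention_mask=attention_mask).last_hidden_state`.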
Use the STS Benchmark (STS-B): stsb_multi_mt subset "en" (available in the datasets library).
Generate embeddings using your mean pooling method without fine-tuning.
Calculate cosine similarity between sentence pairs.
Compute the Spearman correlation with human similarity scores.
This gives you a baseline performance number.
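The baseline metric can be sketched as follows, assuming the two embedding matrices and the gold scores are already NumPy arrays (loading STS-B itself would go through the `datasets` library):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray) -> float:
    """Cosine similarity per sentence pair, then Spearman correlation with human scores."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = (e1 * e2).sum(axis=1)   # row-wise cosine similarity
    return float(spearmanr(sims, gold).correlation)
```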
Phase 2: Contrastive Fine-Tuning
Focus on contrastive learning: it is well established for sentence embeddings and aligns with your proposal.
Follow the Sentence-T5 approach.
Use the sentence-transformers library.
Provides contrastive learning components: dataloaders, loss functions (e.g., MultipleNegativesRankingLoss).
You’ll need to adapt it for ByT5’s byte-level tokenization and model structure.
Train the model to:
Pull embeddings of similar sentences (positive pairs) closer.
Push embeddings of dissimilar sentences (negative pairs) further apart.
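This objective can be sketched as an in-batch softmax contrastive loss, which is essentially what `MultipleNegativesRankingLoss` computes (the scale of 20 matches the library's default for cosine similarity; treat the details here as a sketch, not the library's implementation):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                              scale: float = 20.0) -> torch.Tensor:
    """Each anchor's positive is the matching row; all other rows act as negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    scores = a @ p.T * scale                                   # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # diagonal = true pairs
    return F.cross_entropy(scores, labels)
```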
SNLI and MultiNLI (available via datasets):
Treat (premise, hypothesis) pairs labeled 'entailment' as positive pairs.
Use in-batch negatives.
Use QQP (from glue, subset qqp) as an alternative.
Use duplicate question pairs as positive pairs.
NLI datasets tend to provide more robust results.
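Building positive pairs from NLI data can be sketched as a simple filter, assuming the `snli`/`multi_nli` label convention (0 = entailment, and -1 marks examples without a gold label):

```python
def entailment_pairs(examples):
    """Keep (premise, hypothesis) pairs labeled entailment as contrastive positives."""
    return [(ex["premise"], ex["hypothesis"])
            for ex in examples
            if ex["label"] == 0]  # 0 = entailment; 1/2 = neutral/contradiction; -1 = no gold label
```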
Adapt a sentence-transformers training script:
Use the ByT5 tokenizer (it operates on raw UTF-8 bytes rather than subwords, so sequences are noticeably longer).
Modify model loading to use the ByT5 encoder.
Ensure the pooling layer performs mean pooling on encoder outputs.
Use a contrastive loss function (e.g., MultipleNegativesRankingLoss).
Fine-tune byt5-small or byt5-base on contrastive datasets (e.g., SNLI + MNLI combined).
Monitor GPU memory usage and adjust batch size or use gradient accumulation.
Training may take hours to days depending on dataset size and hyperparameters.
Save model checkpoints regularly.
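Gradient accumulation can be sketched as follows (the toy model and batch shapes are placeholders; the point is to scale each micro-batch loss and step the optimizer every `accum_steps` batches):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

micro_batches = torch.randn(8, 4, 8)  # stand-in for a real dataloader
for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean() / accum_steps  # scale so accumulated gradients average out
    loss.backward()                                  # gradients accumulate across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```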
Phase 3: Evaluation
Evaluate your fine-tuned model on STS-B (same test set as the baseline).
Compare Spearman correlation before and after fine-tuning.
Transfer Tasks (Optional)
Use a few GLUE tasks to assess generalization:
Examples: MRPC (paraphrase identification), QQP.
Use sentence embeddings as features for a logistic regression classifier (as done in SentEval).
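The SentEval-style probe is just a linear classifier fit on frozen embeddings; a minimal sketch with scikit-learn (random features stand in for real sentence embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_emb, train_y, test_emb, test_y):
    """Fit logistic regression on frozen sentence embeddings, report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_y)
    return clf.score(test_emb, test_y)
```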
Focused Evaluation (Choose ONE)
Option 1: Robustness to Noise
Create a noisy version of the STS-B test set (e.g., character swaps, deletions, insertions).
Measure performance drop compared to clean version.
Optionally compare with a token-based model like Sentence-BERT.
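The noising step can be sketched as a simple character-corruption function (the noise types and the rate parameter are illustrative choices):

```python
import random

def corrupt(text: str, p: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters, delete characters, or insert random ones."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p and i + 1 < len(chars):   # swap with the next character
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * p:                    # delete this character
            i += 1
        elif r < 3 * p:                    # insert a random lowercase letter
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            out.append(chars[i])
            i += 1
        else:                              # keep the character unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)
```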
Option 2: Multilingual Transfer
Evaluate zero-shot performance on multilingual STS tasks:
Use stsb_multi_mt subsets for languages other than English.
Since the model is fine-tuned only on English, this tests cross-lingual transfer.
Option 3: Pooling Strategies
Implement other pooling strategies:
Max pooling
First byte representation
Compare performance to mean pooling on STS-B.
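The two alternatives above can be sketched against the same masked encoder outputs used for mean pooling (function names are illustrative):

```python
import torch

def max_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Element-wise max over real positions; padding is set to -inf so it never wins."""
    masked = last_hidden_state.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
    return masked.max(dim=1).values

def first_byte_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Use the representation of the first byte position (a [CLS]-style stand-in)."""
    return last_hidden_state[:, 0]
```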
Phase 4: Analysis, Write-up, and Presentation
Compare baseline and fine-tuned STS-B performance.
Analyze transfer/generalization results.
Review findings from the focused evaluation if performed.
Acknowledge limitations:
Only small models tested.
Limited hyperparameter tuning.
A single specific contrastive setup evaluated.