Phase 1: Setup and Baseline

Environment Setup

  • Set up a Python environment (e.g., Conda).
  • Install necessary libraries:
    pip install transformers datasets torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    pip install sentence-transformers scikit-learn numpy pandas

Model Selection

  • Start with a smaller ByT5 model from the Hugging Face Hub:
    • google/byt5-small or google/byt5-base
  • These are manageable for training on a single consumer GPU (e.g., an NVIDIA RTX 4090).

Baseline Embedding Extraction

  • Load the pre-trained ByT5 model.
  • Implement Mean Pooling:
    • Get last hidden states from the encoder for all bytes in an input sentence.
    • Compute the mean across the sequence length dimension.
    • ByT5 has no [CLS] token; mean pooling is a standard approach (similar to Sentence-BERT and Sentence-T5).
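Masked mean pooling can be sketched as below. The function name is ours, and a dummy tensor stands in for real encoder outputs; in practice the hidden states would come from the ByT5 encoder (e.g., `T5EncoderModel.from_pretrained("google/byt5-small")`), with the attention mask from the tokenizer:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average encoder outputs over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real bytes, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Demo with a dummy tensor in place of real ByT5 encoder outputs.
hidden = torch.ones(2, 4, 8)
hidden[0, 2:] = 0.0                                  # padded positions, must be ignored
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 8])
```

Dividing by the mask sum (rather than the full sequence length) is important with ByT5: byte-level sequences are long and padding-heavy, so an unmasked mean would be badly skewed.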

Baseline Evaluation

  • Use the STS Benchmark (STS-B): stsb_multi_mt subset "en" (available in the datasets library).
  • Generate embeddings using your mean pooling method without fine-tuning.
  • Calculate cosine similarity between sentence pairs.
  • Compute the Spearman correlation with human similarity scores.
  • This gives you a baseline performance number.
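The scoring side of this loop can be sketched as follows. Random vectors stand in for real model embeddings, and `sts_spearman` is our own helper name; using the similarities themselves as stand-in gold scores gives a sanity-check correlation of 1.0:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between paired embedding matrices."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_n * b_n).sum(axis=1)

def sts_spearman(emb1: np.ndarray, emb2: np.ndarray, gold_scores) -> float:
    """Spearman correlation between cosine similarities and human scores."""
    rho, _ = spearmanr(cosine_sim(emb1, emb2), gold_scores)
    return rho

# Demo with random embeddings in place of model output.
rng = np.random.default_rng(0)
e1 = rng.normal(size=(10, 16))
e2 = e1 + 0.01 * rng.normal(size=(10, 16))
scores = cosine_sim(e1, e2)          # stand-in gold scores for the sanity check
rho = sts_spearman(e1, e2, scores)
print(round(rho, 3))  # 1.0 by construction
```

Spearman (rank) correlation is the standard STS metric because it is insensitive to monotonic rescaling of the similarity scores.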

Phase 2: Fine-tuning

Fine-tuning Strategy

  • Focus on contrastive learning, which is well-established for sentence embeddings and aligns with your proposal.
  • Follow the Sentence-T5 approach.

Framework

  • Use the sentence-transformers library.
    • Provides contrastive learning components: dataloaders, loss functions (e.g., MultipleNegativesRankingLoss).
    • You’ll need to adapt it for ByT5’s byte-level tokenization and model structure.

Objective

  • Train the model to:
    • Pull embeddings of similar sentences (positive pairs) closer.
    • Push embeddings of dissimilar sentences (negative pairs) further apart.
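This objective can be written as cross-entropy over an in-batch similarity matrix, which is the core of what MultipleNegativesRankingLoss computes. A self-contained sketch (function name and the scale value are our choices):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                              scale: float = 20.0) -> torch.Tensor:
    """Cross-entropy over a cosine-similarity matrix: each anchor should rank
    its own positive (the diagonal entry) above every other positive in the
    batch, which serve as in-batch negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    sims = a @ p.T * scale                    # (batch, batch) similarity logits
    labels = torch.arange(a.size(0))          # diagonal entries are the targets
    return F.cross_entropy(sims, labels)

# Demo: perfectly matched pairs give near-zero loss; random pairs score higher.
torch.manual_seed(0)
x = torch.randn(8, 32)
loss_matched = in_batch_contrastive_loss(x, x)
loss_random = in_batch_contrastive_loss(x, torch.randn(8, 32))
print(loss_matched.item() < loss_random.item())  # True
```

A side effect of this formulation is that larger batches give more (and harder) negatives for free, which is one reason batch size matters for contrastive fine-tuning.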

Dataset Selection

NLI Datasets

  • SNLI and MultiNLI (available via datasets):
    • Treat (premise, hypothesis) pairs labeled 'entailment' as positive pairs.
    • Use in-batch negatives.
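The conversion from NLI examples to training pairs is a simple filter. In the Hugging Face `snli` dataset, label 0 = entailment, 1 = neutral, 2 = contradiction, and -1 marks examples without a gold label; the helper name below is ours:

```python
def entailment_pairs(examples):
    """Keep only entailment examples; each (premise, hypothesis)
    becomes an (anchor, positive) pair for contrastive training."""
    return [
        (ex["premise"], ex["hypothesis"])
        for ex in examples
        if ex["label"] == 0
    ]

# Tiny hand-written sample in the SNLI field layout.
sample = [
    {"premise": "A man plays guitar.", "hypothesis": "A person makes music.", "label": 0},
    {"premise": "A man plays guitar.", "hypothesis": "A man plays piano.", "label": 2},
    {"premise": "Kids run outside.", "hypothesis": "Children are outdoors.", "label": 0},
]
pairs = entailment_pairs(sample)
print(len(pairs))  # 2
```

With in-batch negatives, no explicit negative mining is needed: the other positives in each batch act as negatives automatically.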

Paraphrase Datasets

  • Use QQP (from glue, subset qqp) as an alternative.
    • Use duplicate question pairs as positive pairs.
  • NLI datasets tend to provide more robust results.

Implementation

  • Adapt a sentence-transformers training script:
    • Use the ByT5 tokenizer correctly.
    • Modify model loading to use the ByT5 encoder.
    • Ensure the pooling layer performs mean pooling on encoder outputs.
    • Use a contrastive loss function (e.g., MultipleNegativesRankingLoss).

Training

  • Fine-tune byt5-small or byt5-base on contrastive datasets (e.g., SNLI + MNLI combined).
  • Monitor GPU memory usage and adjust batch size or use gradient accumulation.
  • Training may take hours to days depending on dataset size and hyperparameters.
  • Save model checkpoints regularly.

Phase 3: Evaluation

Core Evaluation

  • Evaluate your fine-tuned model on STS-B (same test set as baseline).
  • Compare Spearman correlation before and after fine-tuning.

Transfer Tasks (Optional)

  • Use a few GLUE tasks to assess generalization:
    • Examples: MRPC (paraphrase identification), QQP.
  • Use sentence embeddings as features for a logistic regression classifier (as done in SentEval).
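A sketch of this SentEval-style protocol: freeze the encoder, build fixed features from the sentence embeddings, and fit a light classifier on top. For a pair task like MRPC, a common feature vector is [u, v, |u − v|, u ∗ v]; here random vectors and a synthetic label stand in for real embeddings and gold labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pair_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Standard pair-classification features over two sentence embeddings."""
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=1)

# Random vectors stand in for real sentence embeddings of paired sentences.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(200, 16)), rng.normal(size=(200, 16))
dist = np.linalg.norm(u - v, axis=1)
y = (dist < np.median(dist)).astype(int)   # synthetic stand-in labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, pair_features(u, v), y, cv=5)
print(scores.shape)  # (5,)
```

Keeping the classifier this simple is deliberate: any accuracy gain must come from the embeddings themselves, not from classifier capacity.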

Focused Evaluation (Choose ONE)

Robustness

  • Create a noisy version of the STS-B test set (e.g., character swaps, deletions, additions).
  • Measure performance drop compared to clean version.
  • Optionally compare with a token-based model like Sentence-BERT.
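The noisy test set can be generated with a small character-level corruption function like the sketch below (the function name and noise rate are our choices; a fixed seed keeps the noisy set reproducible across models):

```python
import random

def corrupt(text: str, p: float = 0.05, seed: int = 0) -> str:
    """With probability p per character, apply one noise operation:
    swap it with the next character, delete it, or insert a random letter."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p:
            op = rng.choice(["swap", "delete", "insert"])
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            if op == "delete":
                i += 1
                continue
            if op == "insert":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        out.append(chars[i])
        i += 1
    return "".join(out)

print(corrupt("A man is playing the guitar.", p=0.2))
```

The hypothesis being tested is that byte-level models like ByT5 degrade more gracefully under this noise than subword tokenizers, whose token boundaries shatter on misspellings.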

Multilingual

  • Evaluate zero-shot performance on multilingual STS tasks:
    • Use stsb_multi_mt subsets for languages other than English.
    • Since fine-tuning uses only English data, this measures cross-lingual transfer.

Alternative Pooling

  • Implement other pooling strategies:
    • Max pooling
    • First byte representation
  • Compare performance to mean pooling on STS-B.
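Both alternatives are a few lines on top of the encoder outputs; a sketch with dummy tensors (function names are ours), using the same (hidden states, attention mask) interface as mean pooling:

```python
import torch

def max_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Element-wise max over non-padding positions
    (padding is set to -inf before taking the max)."""
    h = hidden.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return h.max(dim=1).values

def first_byte_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Representation of the first byte position; mask is unused but kept
    for a uniform pooling interface. ByT5 has no [CLS] token, so this is
    just an arbitrary fixed position."""
    return hidden[:, 0, :]

# Dummy encoder outputs: batch of 2, seq_len 3, hidden size 4.
hidden = torch.arange(24, dtype=torch.float).reshape(2, 3, 4)
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
print(max_pool(hidden, mask).shape, first_byte_pool(hidden, mask).shape)
```

Masking before the max matters for the same reason as in mean pooling: otherwise padding positions can dominate the result.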

Phase 4: Analysis, Write-up, and Presentation

Analyze Results

  • Compare baseline and fine-tuned STS-B performance.
  • Analyze transfer/generalization results.
  • Review findings from the focused evaluation if performed.

Discuss Limitations

  • Only small models tested.
  • Limited hyperparameter tuning.
  • Only one specific contrastive setup was explored.