This project tackles a regression problem on tokenized prompts, where each input prompt is represented as:

- input_ids (integer tokens)
- attention_mask
- a continuous target label in [0, 1]

The goal is to predict the label with minimum Mean Absolute Error (MAE).
This repository documents not only the final models, but the entire modeling journey:
- architecture exploration
- ensemble design
- Mixture-of-Experts (MoE)
- feature engineering
- leakage detection
- leaderboard overfitting analysis
Final Result:
- Top-10 finish on Private Leaderboard
- Team: ML Monarchs
Each sample contains:
- input_ids: List[int] — tokenized prompt (max length 256)
- attention_mask: List[int] — valid token positions
- label: float ∈ [0, 1] (train only)
- Vocabulary size: ~50k tokens
- Variable sequence lengths
- Tokens are obfuscated (no semantic meaning)
- No external metadata
Since token semantics are unavailable, the task relies heavily on:
- structural patterns
- token repetition
- entropy
- sequence dynamics
Validation protocol:
- 5-fold cross-validation using predefined folds
- Strict separation of train / validation
- Extensive checks for:
- duplicate prompts
- memorization leakage
- fold contamination
Only CV-safe scores were trusted during development.
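The duplicate-prompt and fold-contamination checks above can be sketched as follows; the function name and data layout are illustrative, not the repository's actual API:

```python
import numpy as np

def check_fold_contamination(fold_ids, prompts):
    """Flag identical prompts that appear in more than one fold.

    fold_ids: array of fold indices (0..4), one per sample.
    prompts:  list of token-id tuples (hashable) per sample.
    """
    seen = {}          # prompt -> set of folds it appears in
    leaks = []
    for fold, prompt in zip(fold_ids, prompts):
        folds = seen.setdefault(prompt, set())
        folds.add(fold)
        if len(folds) > 1:
            leaks.append(prompt)
    return leaks

# A duplicate prompt landing in folds 0 and 1 is contamination.
folds = np.array([0, 0, 1, 1, 2])
prompts = [(1, 2, 3), (4, 5), (1, 2, 3), (6,), (7, 8)]
assert check_fold_contamination(folds, prompts) == [(1, 2, 3)]
```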
CNN baselines:
- Multi-kernel CNNs (kernel sizes 3, 5, 7)
- Mean pooling + max pooling
- Lightweight and strong baselines
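A minimal PyTorch sketch of a multi-kernel CNN regressor with mean + max pooling, as described above. Hidden sizes and the way pooled features are combined are assumptions, not the repository's exact configuration:

```python
import torch
import torch.nn as nn

class MultiKernelCNN(nn.Module):
    """Multi-kernel CNN (kernels 3/5/7) with mean + max pooling."""
    def __init__(self, vocab_size=50_000, embed_dim=128, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, channels, k, padding=k // 2) for k in (3, 5, 7)
        ])
        # mean + max pooling over each of the 3 conv outputs -> 6 * channels
        self.head = nn.Linear(6 * channels, 1)

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids).transpose(1, 2)        # (B, E, T)
        mask = attention_mask.unsqueeze(1).float()       # (B, 1, T)
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x)) * mask               # zero out padding
            pooled.append(h.sum(-1) / mask.sum(-1).clamp(min=1))    # mean pool
            pooled.append(h.masked_fill(mask == 0, -1e9).amax(-1))  # max pool
        # sigmoid keeps the regression output inside [0, 1]
        return torch.sigmoid(self.head(torch.cat(pooled, dim=-1))).squeeze(-1)

ids = torch.randint(1, 50_000, (2, 256))
mask = torch.ones(2, 256, dtype=torch.long)
out = MultiKernelCNN()(ids, mask)
assert out.shape == (2,) and ((0 <= out) & (out <= 1)).all()
```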
GRU models:
- Bidirectional GRUs
- Mean + max pooling
- Segment-wise pooling (beginning / middle / end)
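Segment-wise pooling splits each sequence into beginning / middle / end windows and pools each separately. A minimal NumPy sketch, assuming equal-width segments (the function name is illustrative):

```python
import numpy as np

def segment_pool(hidden, n_segments=3):
    """Mean-pool per-token states over beginning / middle / end segments.

    hidden: (T, D) array of per-token states; returns (n_segments * D,).
    """
    T = hidden.shape[0]
    bounds = np.linspace(0, T, n_segments + 1).astype(int)
    return np.concatenate([
        hidden[a:b].mean(axis=0) if b > a else np.zeros(hidden.shape[1])
        for a, b in zip(bounds[:-1], bounds[1:])
    ])

h = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, 2 dims
v = segment_pool(h)
assert v.shape == (6,)                         # 3 segments x 2 dims
assert np.allclose(v[:2], h[:2].mean(axis=0))  # beginning segment
```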
Attention pooling:
- Attention-pooled CNNs
- Attention-pooled GRUs
- Stability fallback using mean pooling
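A sketch of attention pooling with a mean-pooling fallback. Averaging the two pooled vectors is one plausible reading of the "stability fallback"; how the repository actually combined them is not specified:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention pooling blended with mean pooling for stability."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h, attention_mask):
        # h: (B, T, D); mask out padding positions before the softmax
        w = self.score(h).squeeze(-1).masked_fill(attention_mask == 0, -1e9)
        w = torch.softmax(w, dim=-1).unsqueeze(-1)           # (B, T, 1)
        attn = (w * h).sum(1)
        mask = attention_mask.unsqueeze(-1).float()
        mean = (h * mask).sum(1) / mask.sum(1).clamp(min=1)  # fallback term
        return 0.5 * (attn + mean)

h = torch.randn(2, 10, 16)
mask = torch.ones(2, 10, dtype=torch.long)
pooled = AttentionPool(16)(h, mask)
assert pooled.shape == (2, 16)
```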
Dilated CNNs:
- Dilations: 1, 2, 4, 8
- Larger receptive fields without deeper stacks
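The receptive-field gain from dilation is easy to verify: each layer with dilation d and kernel size k adds d·(k−1) tokens of context. A small sketch assuming kernel size 3 (the kernel size is an assumption):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)
    return rf

# Four k=3 layers with dilations 1, 2, 4, 8 cover 31 tokens,
# versus 9 tokens for four undilated layers of the same depth.
assert receptive_field(3, [1, 2, 4, 8]) == 31
assert receptive_field(3, [1, 1, 1, 1]) == 9
```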
Each model outputs a sigmoid-scaled regression value.
In addition to neural embeddings, multiple structural features were extracted:
- Normalized sequence length
- Unique token ratio
- Token entropy
- Adjacent repetition ratio
- Constraint score (tokens appearing ≥3 times)
- Relative position variance (RPV)
- Boundary density (tokens concentrated at 25/50/75%)
- Windowed entropy shifts (phase changes)
- Global token rarity profiles
- Compression ratio (zlib)
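A few of the listed features can be sketched as follows; the exact definitions and normalizations used in the repository may differ:

```python
import math
import zlib
from collections import Counter

def structural_features(tokens, max_len=256):
    """Sketch of a handful of the structural features (names illustrative)."""
    n = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    raw = b"".join(t.to_bytes(4, "little") for t in tokens)
    return {
        "norm_length": n / max_len,
        "unique_ratio": len(counts) / n,
        "token_entropy": entropy,
        "adjacent_repeat": sum(a == b for a, b in zip(tokens, tokens[1:]))
                          / max(n - 1, 1),
        "constraint_score": sum(1 for c in counts.values() if c >= 3)
                            / len(counts),
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }

f = structural_features([5, 5, 5, 9, 9, 7])
assert f["unique_ratio"] == 0.5                 # 3 unique tokens out of 6
assert abs(f["adjacent_repeat"] - 0.6) < 1e-9   # 3 repeats in 5 adjacent pairs
```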
These features captured prompt regularity and rigidity, which showed moderate correlation with labels.
Eight diverse neural models were trained:
- CNN
- CNN (different seed)
- BiGRU
- BiGRU (different seed)
- Segmented GRU
- Attention GRU
- Attention CNN
- Dilated CNN
Out-of-fold (OOF) predictions were collected for all models.
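The OOF collection loop can be sketched as below: each sample is predicted only by the model that never saw its fold. `train_fn`/`predict_fn` are hypothetical stand-ins for the repository's training code:

```python
import numpy as np

def collect_oof(train_fn, predict_fn, X, y, fold_ids):
    """Collect out-of-fold predictions across the predefined folds."""
    oof = np.zeros(len(y))
    for fold in np.unique(fold_ids):
        hold = fold_ids == fold
        model = train_fn(X[~hold], y[~hold])   # train without the held-out fold
        oof[hold] = predict_fn(model, X[hold]) # predict only the held-out fold
    return oof

# Toy check with a "model" that just memorises the training mean.
X = np.arange(6, dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
folds = np.array([0, 1, 0, 1, 0, 1])
oof = collect_oof(lambda X, y: y.mean(),
                  lambda m, X: np.full(len(X), m), X, y, folds)
assert np.allclose(oof[folds == 0], y[folds == 1].mean())
```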
Multiple stacking approaches were tested:
- LightGBM regression
- Logit-space regression
- Disagreement features (mean, std, range)
- Per-model deviation features
- Defensive regularization variants
- Ridge and ElasticNet baselines
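A minimal sketch of disagreement-feature stacking with a closed-form ridge meta-model, standing in for the LightGBM/Ridge stackers above (helper names are illustrative):

```python
import numpy as np

def stack_features(oof_preds):
    """Augment per-model OOF predictions with disagreement features."""
    mean = oof_preds.mean(axis=1, keepdims=True)
    std = oof_preds.std(axis=1, keepdims=True)
    rng = (oof_preds.max(axis=1) - oof_preds.min(axis=1))[:, None]
    return np.hstack([oof_preds, mean, std, rng])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (a defensive, regularized meta-model)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])        # add intercept column
    A = X1.T @ X1 + alpha * np.eye(X1.shape[1])
    return np.linalg.solve(A, X1.T @ y)

oof = np.array([[0.2, 0.3], [0.8, 0.6], [0.5, 0.5]])
Z = stack_features(oof)
assert Z.shape == (3, 5)      # 2 model columns + mean, std, range
w = fit_ridge(Z, np.array([0.25, 0.7, 0.5]))
assert w.shape == (6,)        # 5 features + intercept
```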
Best CV MAE achieved: ~0.146
Best Private LB MAE: ~0.150
To address heterogeneous prompt behavior, several MoE strategies were explored.
Binned experts (uniform averaging):
Experts trained on bins defined by:
- sequence length
- entropy
- constraint score
Predictions were averaged across experts.
Gated MoE:
- Experts trained per bin
- Separate gate model predicts expert weights
- Softmax-normalized expert routing
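The softmax-normalized routing step can be sketched as below; the gate model itself is omitted and the names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_moe(expert_preds, gate_logits):
    """Blend per-expert predictions with softmax-normalized gate weights.

    expert_preds: (N, E) predictions from E bin experts.
    gate_logits:  (N, E) scores from a separate gate model.
    """
    weights = softmax(gate_logits)
    return (weights * expert_preds).sum(axis=1)

preds = np.array([[0.2, 0.8], [0.4, 0.6]])
# Gate strongly prefers expert 0 for sample 0, is neutral for sample 1.
logits = np.array([[10.0, -10.0], [0.0, 0.0]])
out = gated_moe(preds, logits)
assert abs(out[0] - 0.2) < 1e-6   # routed almost entirely to expert 0
assert abs(out[1] - 0.5) < 1e-9   # equal-weight blend of 0.4 and 0.6
```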
MoE yielded small but consistent gains, confirming that prompt sub-types exist, though gains were bounded by representation limits.
Approaches that failed:
- KNN correction on embeddings → overfitting
- Excessive feature stacking → diminishing returns
- Over-calibration (isotonic, quantile) → hurt generalization
- Deep meta-stacks → unstable CV
These experiments are documented to show what not to do.
Key lessons:
- Structural signals matter, but are weak individually
- Ensembling provides large gains early, small gains later
- MoE helps only when experts see meaningfully different data
- Validation discipline is critical — many signals looked strong but failed on private LB
- Transformer-style token interactions were likely the missing inductive bias
- 🏆 Top-10 Private Leaderboard
- 📉 Private MAE: 0.15016
- 👥 Authors: MonarchOfCoding