Commit c4aa426

feat: add glomerulus weight vector fitting functionality
- Introduced a new CLI entry point for fitting glomerulus weights via `door-fit-glomerulus-weights`.
- Added `fit_glomerulus_weights.py` script as a wrapper for the new CLI functionality.
- Implemented `cli_glomerulus.py` to handle command-line arguments and integrate with the fitting process.
- Created `glomerulus_features.py` for mapping receptor responses to glomerulus-level features.
- Developed `glomerulus_regression.py` for fitting regression models and exporting results.
- Added tests for glomerulus feature construction and weight vector regression in `test_glomerulus_pipeline.py`.
- Updated `pyproject.toml` to include new dependencies and entry points.
1 parent 4be90f7 commit c4aa426

11 files changed

Lines changed: 1574 additions & 0 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -3,9 +3,11 @@ __pycache__/
 *.py[cod]
 *$py.class
 
+out/
 # C extensions
 *.so
 
+outputs/
 # Distribution / packaging
 .Python
 build/

README.md

Lines changed: 89 additions & 0 deletions
@@ -682,6 +682,95 @@ door-pathways --cache door_cache --predict-behavior "ethyl butyrate"

---

## Baseline Glomerulus Weight Vector from Control PER

Fit a single regularized (LASSO/Ridge/ElasticNet) regression from **glomerulus-level** odor feature vectors to baseline/control PER (Proboscis Extension Response) labels. The pipeline maps DoOR receptor responses to glomeruli using the authoritative `door_to_flywire_mapping.csv`, then produces a ranked weight vector with each glomerulus classified as **positive (+)**, **negative (−)**, or **zero (0)**.

### Concept

1. **Receptor → Glomerulus mapping**: Each of the 78 DoOR receptors is mapped to its target glomerulus. When multiple receptors converge on the same glomerulus, their responses are aggregated (default: `max`).
2. **Feature-set filtering**: Glomeruli can be filtered to those "active" (response > threshold) for the odors of interest, using `union` (active in any odor), `intersection` (active in all), or `all` (no filtering).
3. **Regression**: A regularized regression fits the design matrix **(n_odors × n_glomeruli)** to the PER labels. LASSO is the default for sparsity + sign interpretability.
4. **Weight classification**: Weights are classified as `+1` (w > ε), `-1` (w < −ε), or `0` (|w| ≤ ε), where ε is configurable (default: 1e-6).
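
The aggregation, filtering, and classification steps above can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: the function names, the receptor names (`R1`, `R2`, `R3`), and the glomerulus names (`G_A`, `G_B`) are all hypothetical, and the real pipeline's handling of unmapped or ambiguous receptors may differ.

```python
def aggregate_to_glomeruli(receptor_responses, receptor_to_glom, how="max"):
    """Collapse receptor responses into one value per glomerulus."""
    grouped = {}
    for receptor, response in receptor_responses.items():
        glom = receptor_to_glom.get(receptor)
        if glom is None:  # assumption: unmapped receptors are skipped
            continue
        grouped.setdefault(glom, []).append(response)
    reducers = {"max": max, "sum": sum, "mean": lambda v: sum(v) / len(v)}
    return {glom: reducers[how](values) for glom, values in grouped.items()}


def select_features(per_odor, feature_set="union", threshold=0.05):
    """Choose glomeruli by activity across odors: 'union', 'intersection', or 'all'."""
    all_gloms = set().union(*(features.keys() for features in per_odor))
    if feature_set == "all":
        return sorted(all_gloms)
    active = [{g for g, r in features.items() if r > threshold} for features in per_odor]
    if feature_set == "union":
        return sorted(set.union(*active))
    return sorted(set.intersection(*active))


def classify_weight(w, eps=1e-6):
    """Return +1, -1, or 0 using the |w| <= eps rule for zero."""
    if w > eps:
        return 1
    if w < -eps:
        return -1
    return 0


# Hypothetical toy data: two receptors converge on G_A, one maps to G_B.
toy_mapping = {"R1": "G_A", "R2": "G_A", "R3": "G_B"}
odor_1 = aggregate_to_glomeruli({"R1": 0.2, "R2": 0.6, "R3": 0.01}, toy_mapping)
odor_2 = aggregate_to_glomeruli({"R1": 0.1, "R2": 0.3, "R3": 0.4}, toy_mapping)

print(select_features([odor_1, odor_2], "union"))
print(select_features([odor_1, odor_2], "intersection"))
print([classify_weight(w) for w in (0.014, -0.0129, 5e-7)])
```

In this toy case `G_B` is active for only one odor, so it survives `union` filtering but not `intersection`.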

### Limitations

- **Only 4 datapoints**: the model is severely underdetermined. Results are exploratory and hypothesis-generating, not predictive.
- Glomerulus-level aggregation loses receptor-level resolution (e.g., co-expressed receptors in the same sensillum).
- The mapping has a few ambiguous entries (see `is_ambiguous` column in the mapping CSV).

### CLI Usage

```bash
# Union of active glomeruli (default)
python scripts/fit_glomerulus_weights.py \
    --config configs/glomerulus_weight_baseline.yaml \
    --feature-set union \
    --outdir out/glomerulus_baseline_union

# Intersection of active glomeruli
python scripts/fit_glomerulus_weights.py \
    --config configs/glomerulus_weight_baseline.yaml \
    --feature-set intersection \
    --outdir out/glomerulus_baseline_intersection

# All glomeruli, Ridge regression, custom threshold
python scripts/fit_glomerulus_weights.py \
    --config configs/glomerulus_weight_baseline.yaml \
    --feature-set all --model ridge \
    --activation-threshold 0.1 \
    --outdir out/glomerulus_ridge_all
```

Or via the installed console script:

```bash
door-fit-glomerulus-weights --config configs/glomerulus_weight_baseline.yaml --outdir out/baseline
```

### Configuration

See `configs/glomerulus_weight_baseline.yaml` for the default config with 4 odors and their control PER labels. CLI flags override config defaults.
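
One common way to implement "CLI flags override config defaults" is to give every flag a default of `None` and merge only the flags that were actually supplied. The sketch below assumes that approach; `resolve_settings` and `build_parser` are hypothetical names, and the real `cli_glomerulus.py` may resolve settings differently. The default values are copied from the `defaults:` block of the baseline config.

```python
import argparse

# Values taken from the `defaults:` block of glomerulus_weight_baseline.yaml.
CONFIG_DEFAULTS = {
    "feature_set": "intersection",
    "activation_threshold": 0.05,
    "model": "ridge",
    "target_mode": "centered",
}


def build_parser():
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag not given" from an explicit value.
    parser.add_argument("--feature-set", choices=["all", "union", "intersection"], default=None)
    parser.add_argument("--model", choices=["lasso", "ridge", "elasticnet"], default=None)
    parser.add_argument("--activation-threshold", type=float, default=None)
    parser.add_argument("--target-mode", choices=["centered", "raw"], default=None)
    return parser


def resolve_settings(config_defaults, argv):
    """Start from the config defaults, then apply only the flags the user passed."""
    args = vars(build_parser().parse_args(argv))
    merged = dict(config_defaults)
    merged.update({key: value for key, value in args.items() if value is not None})
    return merged


settings = resolve_settings(
    CONFIG_DEFAULTS, ["--feature-set", "union", "--activation-threshold", "0.1"]
)
print(settings)
```

Here `feature_set` and `activation_threshold` come from the command line, while `model` and `target_mode` fall back to the config.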

### Output Files

| File | Description |
|------|-------------|
| `weights.csv` | Glomerulus, weight, sign (+1/−1/0), abs_weight, rank |
| `model_summary.json` | Model type, alpha, R², MSE, sign counts, full config |

### Python API

```python
from door_toolkit.encoder import DoOREncoder
from door_toolkit.glomerulus_features import (
    load_receptor_to_glomerulus_mapping,
    build_design_matrix,
)
from door_toolkit.glomerulus_regression import (
    fit_glomerulus_weight_vector,
    export_weight_report,
)

encoder = DoOREncoder("door_cache", use_torch=False)
mapping, _ = load_receptor_to_glomerulus_mapping("data/mappings/door_to_flywire_mapping.csv")

# Odors and their control PER labels, in matching order
odors = ["ethyl butyrate", "1-hexanol", "benzaldehyde", "3-octanol"]
y = [0.48, 0.07, 0.04, 0.14]

# Design matrix: one row per odor, one column per selected glomerulus
X, glom_names, meta = build_design_matrix(
    odors, encoder, mapping, feature_set="union", activation_threshold=0.05,
)
results = fit_glomerulus_weight_vector(X, y, glom_names, model="lasso")
export_weight_report(results, "out/my_analysis")
```

### How to Extend Later

This pipeline produces a **single weight vector** from baseline PER. A natural extension is a **two-pathway opponent model** that fits separate excitatory/inhibitory weight vectors (e.g., approach vs. avoidance pathways) using opto vs. control PER differences. That is out of scope for this module, but the glomerulus feature construction (`build_design_matrix`) is designed to be reusable.

---

## Neural Network Preprocessing

Prepare DoOR data for neural network training with sparse encoding and augmentation.
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Baseline glomerulus weight vector configuration
# ================================================
# Fits a single regularized regression from glomerulus-level odor features
# to control/baseline PER (Proboscis Extension Response) labels.
#
# NOTE: With only 4 datapoints this is severely underdetermined.
# Results are exploratory / hypothesis-generating only.

odors:
  - name: ethyl butyrate
    per_label: 0.48
  - name: 1-hexanol
    per_label: 0.07
  - name: benzaldehyde
    per_label: 0.04
  - name: 3-octanol
    per_label: 0.14

defaults:
  feature_set: intersection     # all | union | intersection
  activation_threshold: 0.05    # glomerulus considered "active" if aggregated response > this
  aggregation: max              # max | mean | sum (when multiple receptors → same glomerulus)
  model: ridge                  # lasso | ridge | elasticnet (ridge recommended for small n_samples)
  standardize: true
  random_state: 0
  zero_eps: 1.0e-6              # |weight| < eps → classified as zero
  target_mode: centered         # centered: fit y - mean(y); raw: fit raw PER labels
  door_cache_path: door_cache
  mapping_csv_path: data/mappings/door_to_flywire_mapping.csv
# NOTE: With only 4 samples, use intersection feature set to reduce dimensions.
# Ridge is preferred over LASSO for small samples (less overfitting).
# Centered mode is recommended: intercept ≈ 0, weights represent deviations from mean PER.
# Use --target-mode raw to get legacy behavior (intercept ≈ mean PER).
Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
# Why Are Glomerulus Weights So Small? A Complete Explanation

## The Problem You Observed

Running the baseline pipeline produces weights like **0.01–0.02** even with Ridge regression:

```
Top weights from glomerulus_ridge_union:
glomerulus      weight
ORN_VM7v      0.014136
ORN_DM1       0.013120
ORN_VA3      -0.012905
```

You might expect weights like 0.1–1.0, but 0.01 is actually correct. Here's why.

---

## Root Cause: Severe Underdetermination

### The Problem Setup

| Factor | Value | Implication |
|--------|-------|-------------|
| **Samples** | 4 odors | 4 independent equations |
| **Features** (union) | 44 glomeruli | 44 unknown weights to fit |
| **Degrees of freedom** | 4 − 44 = −40 | **40 more unknowns than equations** |
| **Feature/Sample Ratio** | 44/4 = 11 | 11× more unknowns than equations; extremely ill-posed |

With least-squares regression (or Ridge/LASSO), you're trying to fit:

```
y = X·w    (4×1 = 4×44 · 44×1)
```

This is a **heavily underdetermined** system. Ridge CV chooses a weak regularization (α=1e-4) to fit the 4 points nearly perfectly (R²≈1), but the weights must be **distributed across 44 dimensions**, so each weight is tiny.

### Mathematical Intuition

If you had 1 odor with response `[1, 1, ..., 1]` (all glomeruli equally active) and target PER = 0.1:

```
0.1 = w₁ + w₂ + ... + w₄₄    (sum of 44 weights)
```

The smallest-norm way to satisfy this spreads the target evenly, so each weight is ≈ 0.1/44 ≈ **0.002**. With 4 different odors the weights are no longer equal, but they stay small for the same reason.
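
The single-equation case has a closed-form minimum-norm solution, `w = x · y / (x·x)`, which makes the scaling with feature count easy to check. The helper name below is hypothetical, and the all-ones response is the same idealization used in the text, not real DoOR data.

```python
def min_norm_weights(x, y):
    """Minimum L2-norm w that solves the single equation sum(x_i * w_i) = y."""
    scale = y / sum(v * v for v in x)
    return [v * scale for v in x]


for n_features in (44, 25):
    w = min_norm_weights([1.0] * n_features, 0.1)
    # Every weight is identical here, so printing the first one is enough.
    print(n_features, round(w[0], 5))
```

With 44 equally active features each weight is 0.1/44; with 25 it is 0.1/25. The per-weight scale is inversely proportional to the number of features that share the target.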

---

## Why Intersection Helps (But Not Enough)

When you use **intersection** instead of **union**:
- Union: 44 features → max weight ≈ 0.014
- Intersection: 25 features → max weight ≈ 0.024

Cutting the features from 44 to 25 raises the largest weight by about **1.7×** (0.014 → 0.024), close to the inverse scaling you would expect (44/25 ≈ 1.76). But we're still at ~0.02 because **4 samples is fundamentally too few**.

---

## Why Ridge's CV Alpha Is So Small

RidgeCV uses cross-validation to pick the best regularization strength. With only 4 samples:

1. CV tries α values: [1e-4, 1e-3, 0.01, 0.1, 1, 10, ...]
2. For each α, it evaluates generalization error on left-out samples
3. **With n=4, the model has so much capacity that it fits the training points perfectly even at the weakest α tried (1e-4)**
4. CV can't distinguish good regularization from bad (only 4 train/test splits)
5. **Result: α stays at 1e-4** (weak regularization) → distributed, tiny weights

---

## Why LASSO Gives All Zeros

LASSO (L1-penalized regression) **enforces sparsity** aggressively. With weights at the tiny ~0.01 scale and the default ε = 1e-6:

- The L1 penalty pushes most of these small weights to exactly zero, and anything left with |w| ≤ ε is also classified as zero
- LASSO CV selects high regularization (α=10) to avoid overfitting
- **Result: almost all weights are exactly zero**, and remaining ones are tiny
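
The mechanism behind the exact zeros is soft-thresholding. In the idealized case of an orthonormal design, the LASSO solution is the unpenalized coefficient shrunk toward zero by the penalty and clipped at zero. The sketch below applies that operator to the three Ridge weights quoted earlier, used only as stand-in magnitudes. The real design matrix is not orthonormal and scikit-learn scales its penalty differently, so the threshold values here are illustrative assumptions, not the pipeline's α.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Stand-in coefficients at the ~0.01 scale seen in the Ridge output.
unpenalized = [0.014136, 0.013120, -0.012905]

for lam in (0.001, 0.01, 0.02):
    shrunk = [round(soft_threshold(z, lam), 6) for z in unpenalized]
    print(lam, shrunk)
```

Once the threshold exceeds the largest coefficient magnitude, every weight is exactly zero, which is what happens when weights are this small and CV picks a strong penalty.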

---

## Solutions: How to Get "Bigger" Weights

### 1. **Reduce Features (Recommended)**
Use **intersection** instead of **union**:
```bash
python scripts/fit_glomerulus_weights.py \
    --config configs/glomerulus_weight_baseline.yaml \
    --feature-set intersection
```

**Result**: Weights ≈ 0.02 (about 1.7× larger). Still small, but interpretable.

### 2. **Manually Increase Regularization**
The CV-selected α is very weak. You can fix α manually instead, which means modifying the code or using a config override. Note that a larger Ridge α shrinks the overall weight norm, so this stabilizes the solution rather than enlarging the weights. For example:

```python
# In glomerulus_regression.py, modify fit_glomerulus_weight_vector:
from sklearn.linear_model import Ridge

alpha = 0.1  # Instead of CV-selected 1e-4
reg = Ridge(alpha=alpha)  # Fixed alpha instead of RidgeCV
```

With α=0.1 the fit to the 4 points loosens and the overall weight norm shrinks further. The gain is a more stable, less overfit solution, not larger magnitudes.
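
Before editing the pipeline, the effect of a fixed α on weight scale can be checked with a closed-form Ridge sketch. The code below uses the dual form `w = Xᵀ (X Xᵀ + αI)⁻¹ y`, which only needs a 4×4 solve. It runs on a synthetic random 4×44 design (an assumption, not DoOR data) and omits the intercept, standardization, and target centering that the real pipeline applies, so only the trend across α is meaningful.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_weights(X, y, alpha):
    """Ridge without intercept via the dual form: w = X^T (X X^T + alpha I)^-1 y."""
    n = len(X)
    gram = [[sum(p * q for p, q in zip(X[i], X[j])) + (alpha if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    dual = solve_linear(gram, y)
    return [sum(dual[i] * X[i][k] for i in range(n)) for k in range(len(X[0]))]


def l2_norm(w):
    return sum(v * v for v in w) ** 0.5


rng = random.Random(0)
X = [[rng.random() for _ in range(44)] for _ in range(4)]  # synthetic 4 x 44 design
y = [0.48, 0.07, 0.04, 0.14]                               # PER labels from the config

norms = {alpha: l2_norm(ridge_weights(X, y, alpha)) for alpha in (1e-4, 0.1, 1.0, 10.0)}
for alpha, norm in norms.items():
    print(f"alpha={alpha:g}  ||w||={norm:.4f}")
```

In this sketch the weight norm decreases as α grows, which is the general behavior of Ridge: the penalty trades exactness of fit for smaller weights.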

### 3. **Collect More Data**
With only 4 datapoints, the problem is inherently underdetermined. Collecting 10–20 odors would:
- Give more equations to constrain the solution (still fewer than the 25–44 features, so regularization remains necessary)
- Allow meaningful weight differences
- Enable proper cross-validation

---

## What the Small Weights Actually Mean

**Small weights don't mean the glomeruli are unimportant.** They mean:

- Each glomerulus contributes a tiny fractional amount to the PER prediction
- The contributions sum across all active glomeruli to produce the PER
- Example: if 25 glomeruli are active and average weight is 0.02, their combined contribution is ~0.5 PER units

The **sign** (positive/negative) is more meaningful than the **magnitude**:
- **Positive weight** → glomerulus promotes approach (attractive)
- **Negative weight** → glomerulus promotes avoidance (aversive)

---

## Verification: Is the Small Scale Correct?

Here's a sanity check. Ridge intersection on the 4 odors:

```
Odor              PER_actual   PER_predicted   Error
ethyl butyrate    0.48         0.48            0.00
1-hexanol         0.07         0.07            0.00
benzaldehyde      0.04         0.04            0.00
3-octanol         0.14         0.14            0.00
```

The model fits perfectly (R² = 1) with tiny weights because:
1. 4 odors = 4 equations
2. 25 glomeruli = 25 unknowns (in intersection mode)
3. 25 − 4 = 21 free dimensions → many weight combinations fit perfectly
4. Ridge selects the minimum-norm solution among them (smallest total weight magnitude)

Exact-fit solutions with larger weights (e.g., 0.1–0.5) also exist, since they satisfy the same 4 equations, but Ridge's penalty on weight magnitude rules them out in favor of the smallest-norm one.
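
The non-uniqueness is easy to see in a one-equation toy case: two very different weight vectors reproduce the same target exactly, and they differ only in norm. The example below assumes 25 equally active glomeruli and a target of 0.5, an idealization rather than the fitted model.

```python
def predict(x, w):
    return sum(a * b for a, b in zip(x, w))


def l2_norm(w):
    return sum(v * v for v in w) ** 0.5


x = [1.0] * 25   # idealized odor: all 25 glomeruli equally active
target = 0.5

spread = [target / 25] * 25            # minimum-norm solution: 0.02 on every glomerulus
concentrated = [target] + [0.0] * 24   # all weight on a single glomerulus

for name, w in (("spread", spread), ("concentrated", concentrated)):
    print(name, round(predict(x, w), 6), round(l2_norm(w), 6))
```

Both vectors predict the target, but the spread solution has one fifth of the norm, so a norm penalty picks it. This is why small, distributed weights are the expected output here.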

---

## Recommended Interpretation Strategy

1. **Trust the sign, not the magnitude.**
   - Positive weights: ORN_DM1, ORN_DM2, ORN_VM7v (attractant-like)
   - Negative weights: ORN_VA3, ORN_DL4 (aversant-like)

2. **Focus on relative differences.**
   - ORN_DM1 (0.024) is ~2× the weight of ORN_VA2 (0.012)
   - This ranking is meaningful for hypothesis generation

3. **Use as exploratory, not predictive.**
   - 4 datapoints → model is exploratory/hypothesis-generating
   - Validate with more data before drawing conclusions

---

## Expected Output for Small-Sample Regression

This pipeline is **working correctly**. Small weights are the **correct and expected output** for this problem. Consider this your baseline; improvements require more data or a different problem formulation (e.g., a two-pathway opponent model with opto/control contrasts).
