|
| 1 | +# Taxonomic Level Classification |
| 2 | + |
| 3 | +Train tree species classification models at different taxonomic levels (species or genus) with the same codebase. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +```python |
| 8 | +from neon_tree_classification.core.datamodule import NeonCrownDataModule |
| 9 | + |
| 10 | +# Species-level classification (167 classes - more challenging) |
| 11 | +datamodule = NeonCrownDataModule( |
| 12 | + csv_path="data/metadata/combined_dataset.csv", |
| 13 | + hdf5_path="data/combined_dataset.h5", |
| 14 | + modalities=["rgb"], |
| 15 | + taxonomic_level="species", # Default |
| 16 | + batch_size=32, |
| 17 | +) |
| 18 | + |
| 19 | +# Genus-level classification (60 classes - easier, better for initial experiments) |
| 20 | +datamodule = NeonCrownDataModule( |
| 21 | + csv_path="data/metadata/combined_dataset.csv", |
| 22 | + hdf5_path="data/combined_dataset.h5", |
| 23 | + modalities=["rgb"], |
| 24 | + taxonomic_level="genus", # Extract genus from species names |
| 25 | + batch_size=32, |
| 26 | +) |
| 27 | +``` |
| 28 | + |
| 29 | +## Taxonomic Levels |
| 30 | + |
| 31 | +### Species Level (Default) |
| 32 | +- **Classes**: 167 unique species |
| 33 | +- **Label format**: USDA plant codes (e.g., "ACRU", "PSMEM") |
| 34 | +- **Full names**: e.g., "Acer rubrum L.", "Pseudotsuga menziesii" |
| 35 | +- **Use when**: You need fine-grained species identification |
| 36 | + |
| 37 | +### Genus Level |
| 38 | +- **Classes**: 60 unique genera |
| 39 | +- **Label format**: Genus names (e.g., "Acer", "Pseudotsuga") |
| 40 | +- **Extraction**: First word from species_name column |
| 41 | +- **Use when**: |
| 42 | + - Initial model development and testing (~3x fewer classes) |
| 43 | + - Evaluating model architectures |
| 44 | + - Limited training data or compute |
| 45 | + - Ecological studies at genus level |
| 46 | + |
| 47 | +## Class Distribution |
| 48 | + |
| 49 | +| Level | Classes | Top Class | Samples | Rare Classes (< 10 samples) | |
| 50 | +|-------|---------|-----------|---------|----------------------------| |
| 51 | +| **Species** | 167 | Acer rubrum | 5,684 (11.8%) | 14 (8.4%) | |
| 52 | +| **Genus** | 60 | Quercus | 7,479 (15.6%) | 5 (8.3%) | |
| 53 | + |
| 54 | +**Expected Performance Difference**: Genus-level accuracy typically 10-20% higher than species-level due to: |
| 55 | +- Fewer classes (60 vs 167) |
| 56 | +- More samples per class (average ~800 vs ~287) |
| 57 | +- Less inter-class confusion |
| 58 | + |
| 59 | +## Data Quality Check |
| 60 | + |
| 61 | +**⚠️ IMPORTANT**: Always inspect your labels before training at genus level! |
| 62 | + |
| 63 | +### Step 1: Run Label Inspection |
| 64 | + |
| 65 | +```bash |
| 66 | +python processing/misc/inspect_labels.py |
| 67 | +``` |
| 68 | + |
| 69 | +This will show: |
| 70 | +- All 60 genus names with sample counts |
| 71 | +- Complete genus → species mappings |
| 72 | +- Special cases and potential issues |
| 73 | +- Edge cases (Unknown, Pinaceae, etc.) |
| 74 | + |
| 75 | +### Step 2: Review Output |
| 76 | + |
| 77 | +Look for potential issues: |
| 78 | + |
| 79 | +✅ **Normal cases** (59 genera): |
| 80 | +``` |
| 81 | +Acer 6,635 samples 10 species (Maples) |
| 82 | +Quercus 7,479 samples 27 species (Oaks) |
| 83 | +Pinus 6,600 samples 19 species (Pines) |
| 84 | +``` |
| 85 | + |
| 86 | +⚠️ **Edge cases to be aware of**: |
| 87 | + |
| 88 | +1. **Unknown species** (147 samples, 0.31%) |
| 89 | + - Label: "Unknown plant", "Unknown softwood plant" |
| 90 | + - Genus extracted: "Unknown" |
| 91 | + - **Status**: Valid class representing unidentified species |
| 92 | + - **Action**: Keep or filter - your choice |
| 93 | + |
| 94 | +2. **Pinaceae** (26 samples, 0.05%) |
| 95 | + - Label: "Pinaceae sp." |
| 96 | + - Genus extracted: "Pinaceae" (actually a **family name**, not genus) |
| 97 | + - Represents truly unidentified conifers from WREF site |
| 98 | + - **Status**: Minor edge case, negligible impact |
| 99 | + - **Action**: Keep (recommended) or filter |
| 100 | + |
| 101 | +### Step 3: Filtering (Optional) |
| 102 | + |
| 103 | +If you want taxonomically pure genus-level training: |
| 104 | + |
| 105 | +```python |
| 106 | +# Option A: Filter specific species codes |
| 107 | +datamodule = NeonCrownDataModule( |
| 108 | + ..., |
| 109 | + taxonomic_level="genus", |
| 110 | + species_filter=["PINACE"], # Exclude Pinaceae (will filter BEFORE genus extraction) |
| 111 | +) |
| 112 | + |
| 113 | +# Option B: Filter after inspecting |
| 114 | +# See inspect_labels.py output for USDA codes to exclude |
| 115 | +species_to_exclude = ["PINACE", "2PLANT", "2PLANT-S"] # Example |
| 116 | +datamodule = NeonCrownDataModule( |
| 117 | + ..., |
| 118 | + taxonomic_level="genus", |
| 119 | + species_filter=species_to_exclude, |
| 120 | +) |
| 121 | +``` |
| 122 | + |
| 123 | +## Genus Extraction Method |
| 124 | + |
| 125 | +The genus extraction is simple and robust: |
| 126 | + |
| 127 | +```python |
| 128 | +genus = species_name.split()[0] |
| 129 | +``` |
| 130 | + |
| 131 | +**Examples**: |
| 132 | +``` |
| 133 | +"Acer rubrum L." → "Acer" |
| 134 | +"Pseudotsuga menziesii (Mirb.) Franco var. menziesii" → "Pseudotsuga" |
| 135 | +"Betula papyrifera Marshall" → "Betula" |
| 136 | +"Pinaceae sp." → "Pinaceae" (family name, but treated as genus) |
| 137 | +``` |
| 138 | + |
| 139 | +This method: |
| 140 | +- ✅ Works for all 167 species in the dataset |
| 141 | +- ✅ Handles varieties and subspecies automatically |
| 142 | +- ✅ Requires no manual mapping or preprocessing |
| 143 | +- ✅ Validated against 47,971 samples with 99.7% consistency |
| 144 | + |
| 145 | +## Training Examples |
| 146 | + |
| 147 | +### Basic Training |
| 148 | + |
| 149 | +```python |
| 150 | +import lightning as L |
| 151 | +from neon_tree_classification.core.datamodule import NeonCrownDataModule |
| 152 | +from neon_tree_classification.models.lightning_modules import RGBClassifier |
| 153 | + |
| 154 | +# Setup data at genus level |
| 155 | +datamodule = NeonCrownDataModule( |
| 156 | + csv_path="data/metadata/combined_dataset.csv", |
| 157 | + hdf5_path="data/combined_dataset.h5", |
| 158 | + modalities=["rgb"], |
| 159 | + taxonomic_level="genus", # 60 classes |
| 160 | + batch_size=64, |
| 161 | +) |
| 162 | + |
| 163 | +# Create model (num_classes will be auto-set by Lightning from datamodule) |
| 164 | +model = RGBClassifier( |
| 165 | + model_type="resnet50", # Use pretrained ResNet50 |
| 166 | + num_classes=60, # Will match datamodule |
| 167 | + learning_rate=1e-3, |
| 168 | +) |
| 169 | + |
| 170 | +# Train |
| 171 | +trainer = L.Trainer(max_epochs=50, accelerator="gpu") |
| 172 | +trainer.fit(model, datamodule) |
| 173 | +``` |
| 174 | + |
| 175 | +### With Filtering |
| 176 | + |
| 177 | +```python |
| 178 | +# Clean genus-level training (exclude edge cases) |
| 179 | +datamodule = NeonCrownDataModule( |
| 180 | + csv_path="data/metadata/combined_dataset.csv", |
| 181 | + hdf5_path="data/combined_dataset.h5", |
| 182 | + modalities=["rgb"], |
| 183 | + taxonomic_level="genus", |
| 184 | + species_filter=["PINACE"], # Exclude Pinaceae family |
| 185 | + batch_size=64, |
| 186 | +) |
| 187 | +# Now training on 59 true genera only |
| 188 | +``` |
| 189 | + |
| 190 | +### Progressive Training Strategy |
| 191 | + |
| 192 | +```python |
| 193 | +# Phase 1: Genus-level baseline (fast iteration) |
| 194 | +genus_datamodule = NeonCrownDataModule(..., taxonomic_level="genus") |
| 195 | +genus_model = RGBClassifier(model_type="resnet50", num_classes=60) |
| 196 | +trainer.fit(genus_model, genus_datamodule) |
| 197 | +# Expected: ~75-85% test accuracy |
| 198 | + |
| 199 | +# Phase 2: Species-level fine-tuning (final model) |
| 200 | +species_datamodule = NeonCrownDataModule(..., taxonomic_level="species") |
| 201 | +species_model = RGBClassifier(model_type="resnet50", num_classes=167) |
| 202 | +trainer.fit(species_model, species_datamodule) |
| 203 | +# Expected: ~65-75% test accuracy |
| 204 | +``` |
| 205 | + |
| 206 | +## Command-Line Usage |
| 207 | + |
| 208 | +```bash |
| 209 | +# Train at genus level |
| 210 | +python examples/train.py \ |
| 211 | + --csv_path data/metadata/combined_dataset.csv \ |
| 212 | + --hdf5_path data/combined_dataset.h5 \ |
| 213 | + --modality rgb \ |
| 214 | + --taxonomic_level genus \ |
| 215 | + --model_type resnet50 \ |
| 216 | + --batch_size 64 \ |
| 217 | + --epochs 50 |
| 218 | + |
| 219 | +# Train at species level |
| 220 | +python examples/train.py \ |
| 221 | + --csv_path data/metadata/combined_dataset.csv \ |
| 222 | + --hdf5_path data/combined_dataset.h5 \ |
| 223 | + --modality rgb \ |
| 224 | + --taxonomic_level species \ |
| 225 | + --model_type resnet50 \ |
| 226 | + --batch_size 64 \ |
| 227 | + --epochs 50 |
| 228 | +``` |
| 229 | + |
| 230 | +## Model Considerations |
| 231 | + |
| 232 | +### num_classes Parameter |
| 233 | + |
| 234 | +**Important**: Make sure your model's `num_classes` matches your taxonomic level! |
| 235 | + |
| 236 | +```python |
| 237 | +# Species level |
| 238 | +datamodule = NeonCrownDataModule(..., taxonomic_level="species") # 167 classes |
| 239 | +model = RGBClassifier(num_classes=167) # ✓ Correct |
| 240 | + |
| 241 | +# Genus level |
| 242 | +datamodule = NeonCrownDataModule(..., taxonomic_level="genus") # 60 classes |
| 243 | +model = RGBClassifier(num_classes=60) # ✓ Correct |
| 244 | +``` |
| 245 | + |
| 246 | +The number of classes will vary slightly based on your filtering: |
| 247 | +- Species level: 167 classes (default) |
| 248 | +- Genus level: 60 classes (default), 59 if filtering Pinaceae |
| 249 | + |
| 250 | +## Validation Warnings |
| 251 | + |
| 252 | +When using `taxonomic_level="genus"`, the DataModule automatically validates genus extraction and warns about: |
| 253 | + |
| 254 | +1. **Non-alphabetic genus names** (e.g., "Unknown", "2PLANT") |
| 255 | +2. **Known family names** (e.g., "Pinaceae") |
| 256 | +3. **Sample counts for edge cases** |
| 257 | + |
| 258 | +Example warning: |
| 259 | +``` |
| 260 | +UserWarning: Found family names treated as genera: {'Pinaceae': 26}. |
| 261 | +These represent unidentified species within that family. |
| 262 | +See docs/taxonomic_levels.md for more information. |
| 263 | +``` |
| 264 | + |
| 265 | +**These are informational** - training will proceed normally. Filter if desired using `species_filter`. |
| 266 | + |
| 267 | +## FAQ |
| 268 | + |
| 269 | +**Q: Should I train at genus or species level?** |
| 270 | +- Start with **genus** for faster iteration and architecture selection |
| 271 | +- Move to **species** for final production models and fine-grained identification |
| 272 | + |
| 273 | +**Q: Can I use pretrained weights from genus-level for species-level?** |
| 274 | +- Yes! Transfer learning between taxonomic levels works well |
| 275 | +- The backbone features transfer, just replace the classification head |
| 276 | + |
| 277 | +**Q: What about Pinaceae?** |
| 278 | +- It's a family name, not genus, but only 26 samples (0.05%) |
| 279 | +- Keep it (recommended): Represents "unidentified conifer" class |
| 280 | +- Filter it: Use `species_filter=["PINACE"]` if you need taxonomic purity |
| 281 | + |
| 282 | +**Q: How do I know how many classes I have?** |
| 283 | +```python |
| 284 | +datamodule.setup() |
| 285 | +print(f"Number of classes: {datamodule.full_dataset.num_classes}") |
| 286 | +print(f"Class names: {datamodule.full_dataset.idx_to_label}") |
| 287 | +``` |
| 288 | + |
| 289 | +**Q: Can I add more taxonomic levels (family, order)?** |
| 290 | +- Yes! The same pattern extends to any taxonomic level |
| 291 | +- Would need to modify genus extraction logic |
| 292 | +- Contact maintainers if this is needed |
| 293 | + |
| 294 | +## Performance Benchmarks |
| 295 | + |
| 296 | +Expected accuracy ranges on NEON combined dataset (RGB only, ResNet50): |
| 297 | + |
| 298 | +| Taxonomic Level | Classes | Baseline | With Pretrained | With Tuning | |
| 299 | +|-----------------|---------|----------|----------------|-------------| |
| 300 | +| **Genus** | 60 | 70-75% | 75-80% | 80-85% | |
| 301 | +| **Species** | 167 | 50-55% | 65-70% | 70-75% | |
| 302 | + |
| 303 | +*Note: Actual performance depends on data quality, hyperparameters, and training strategy* |
| 304 | + |
| 305 | +## Additional Resources |
| 306 | + |
| 307 | +- **Data inspection**: `python processing/misc/inspect_labels.py` |
| 308 | +- **Training examples**: `examples/train.py` |
| 309 | +- **Model architectures**: `docs/training.md` |
| 310 | +- **Data processing**: `docs/processing.md` |
| 311 | + |
| 312 | +## Citation |
| 313 | + |
| 314 | +If you use genus-level classification in your research, please cite both the package and note the taxonomic level in your methods. |
0 commit comments