Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ __pycache__/
lightning_logs/
results_temp_dir/
.comet.config
GSoC_2025_Final_Submission.md

# Training outputs
outputs/
Expand Down
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@ A comprehensive toolkit for multi-modal tree species classification using NEON e

This repository aims to provide an end-to-end solution for tree species classification:

- [x] **Dataset**: Ready-to-use multi-modal tree crown dataset with 167 species
- [x] **Dataset**: Ready-to-use multi-modal tree crown dataset with 167 species. It's curated using the code in preprocessing directory in this repo.
- [ ] **Data Processing**: Tools for downloading and processing raw NEON data products
- [ ] **Classification Models**: Pre-trained models and training pipelines
- [ ] **Classification Models**: Pre-trained models and training pipelines (Ongoing. ETA End of Feb 2026)
- [ ] **DeepForest Integration**: Automated crown detection and classification workflow

## What's Available Now
Expand Down
314 changes: 314 additions & 0 deletions docs/taxonomic_levels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,314 @@
# Taxonomic Level Classification

Train tree species classification models at different taxonomic levels (species or genus) with the same codebase.

## Quick Start

```python
from neon_tree_classification.core.datamodule import NeonCrownDataModule

# Species-level classification (167 classes - more challenging)
datamodule = NeonCrownDataModule(
csv_path="data/metadata/combined_dataset.csv",
hdf5_path="data/combined_dataset.h5",
modalities=["rgb"],
taxonomic_level="species", # Default
batch_size=32,
)

# Genus-level classification (60 classes - easier, better for initial experiments)
datamodule = NeonCrownDataModule(
csv_path="data/metadata/combined_dataset.csv",
hdf5_path="data/combined_dataset.h5",
modalities=["rgb"],
taxonomic_level="genus", # Extract genus from species names
batch_size=32,
)
```

## Taxonomic Levels

### Species Level (Default)
- **Classes**: 167 unique species
- **Label format**: USDA plant codes (e.g., "ACRU", "PSMEM")
- **Full names**: e.g., "Acer rubrum L.", "Pseudotsuga menziesii"
- **Use when**: You need fine-grained species identification

### Genus Level
- **Classes**: 60 unique genera
- **Label format**: Genus names (e.g., "Acer", "Pseudotsuga")
- **Extraction**: First word from species_name column
- **Use when**:
- Initial model development and testing (~3x fewer classes)
- Evaluating model architectures
- Limited training data or compute
- Ecological studies at genus level

## Class Distribution

| Level | Classes | Top Class | Samples | Rare Classes (< 10 samples) |
|-------|---------|-----------|---------|----------------------------|
| **Species** | 167 | Acer rubrum | 5,684 (11.8%) | 14 (8.4%) |
| **Genus** | 60 | Quercus | 7,479 (15.6%) | 5 (8.3%) |

**Expected Performance Difference**: Genus-level accuracy typically 10-20% higher than species-level due to:
- Fewer classes (60 vs 167)
- More samples per class (average ~800 vs ~287)
- Less inter-class confusion

## Data Quality Check

**⚠️ IMPORTANT**: Always inspect your labels before training at genus level!

### Step 1: Run Label Inspection

```bash
python processing/misc/inspect_labels.py
Comment thread
Ritesh313 marked this conversation as resolved.
Outdated
```

This will show:
- All 60 genus names with sample counts
- Complete genus → species mappings
- Special cases and potential issues
- Edge cases (Unknown, Pinaceae, etc.)

### Step 2: Review Output

Look for potential issues:

✅ **Normal cases** (59 genera):
```
Acer 6,635 samples 10 species (Maples)
Quercus 7,479 samples 27 species (Oaks)
Pinus 6,600 samples 19 species (Pines)
```

⚠️ **Edge cases to be aware of**:

1. **Unknown species** (147 samples, 0.31%)
- Label: "Unknown plant", "Unknown softwood plant"
- Genus extracted: "Unknown"
- **Status**: Valid class representing unidentified species
- **Action**: Keep or filter - your choice

2. **Pinaceae** (26 samples, 0.05%)
- Label: "Pinaceae sp."
- Genus extracted: "Pinaceae" (actually a **family name**, not genus)
- Represents truly unidentified conifers from WREF site
- **Status**: Minor edge case, negligible impact
- **Action**: Keep (recommended) or filter

### Step 3: Filtering (Optional)

If you want taxonomically pure genus-level training:

```python
# Option A: Filter specific species codes
datamodule = NeonCrownDataModule(
...,
taxonomic_level="genus",
species_filter=["PINACE"], # Exclude Pinaceae (will filter BEFORE genus extraction)
)

# Option B: Filter after inspecting
# See inspect_labels.py output for USDA codes to exclude
species_to_exclude = ["PINACE", "2PLANT", "2PLANT-S"] # Example
datamodule = NeonCrownDataModule(
...,
taxonomic_level="genus",
species_filter=species_to_exclude,
Comment thread
Ritesh313 marked this conversation as resolved.
Outdated
)
```

## Genus Extraction Method

The genus extraction is simple and robust:

```python
genus = species_name.split()[0]
```

**Examples**:
```
"Acer rubrum L." → "Acer"
"Pseudotsuga menziesii (Mirb.) Franco var. menziesii" → "Pseudotsuga"
"Betula papyrifera Marshall" → "Betula"
"Pinaceae sp." → "Pinaceae" (family name, but treated as genus)
```

This method:
- ✅ Works for all 167 species in the dataset
- ✅ Handles varieties and subspecies automatically
- ✅ Requires no manual mapping or preprocessing
- ✅ Validated against 47,971 samples with 99.7% consistency

## Training Examples

### Basic Training

```python
import lightning as L
from neon_tree_classification.core.datamodule import NeonCrownDataModule
from neon_tree_classification.models.lightning_modules import RGBClassifier

# Setup data at genus level
datamodule = NeonCrownDataModule(
csv_path="data/metadata/combined_dataset.csv",
hdf5_path="data/combined_dataset.h5",
modalities=["rgb"],
taxonomic_level="genus", # 60 classes
batch_size=64,
)

# Create model (num_classes will be auto-set by Lightning from datamodule)
model = RGBClassifier(
model_type="resnet50", # Use pretrained ResNet50
num_classes=60, # Will match datamodule
learning_rate=1e-3,
)

# Train
trainer = L.Trainer(max_epochs=50, accelerator="gpu")
trainer.fit(model, datamodule)
```

### With Filtering

```python
# Clean genus-level training (exclude edge cases)
datamodule = NeonCrownDataModule(
csv_path="data/metadata/combined_dataset.csv",
hdf5_path="data/combined_dataset.h5",
modalities=["rgb"],
taxonomic_level="genus",
species_filter=["PINACE"], # Exclude Pinaceae family
batch_size=64,
)
# Now training on 59 true genera only
```

### Progressive Training Strategy

```python
# Phase 1: Genus-level baseline (fast iteration)
genus_datamodule = NeonCrownDataModule(..., taxonomic_level="genus")
genus_model = RGBClassifier(model_type="resnet50", num_classes=60)
trainer.fit(genus_model, genus_datamodule)
# Expected: ~75-85% test accuracy

# Phase 2: Species-level fine-tuning (final model)
species_datamodule = NeonCrownDataModule(..., taxonomic_level="species")
species_model = RGBClassifier(model_type="resnet50", num_classes=167)
trainer.fit(species_model, species_datamodule)
# Expected: ~65-75% test accuracy
```

## Command-Line Usage

```bash
# Train at genus level
python examples/train.py \
--csv_path data/metadata/combined_dataset.csv \
--hdf5_path data/combined_dataset.h5 \
--modality rgb \
--taxonomic_level genus \
--model_type resnet50 \
--batch_size 64 \
--epochs 50

# Train at species level
python examples/train.py \
--csv_path data/metadata/combined_dataset.csv \
--hdf5_path data/combined_dataset.h5 \
--modality rgb \
--taxonomic_level species \
--model_type resnet50 \
--batch_size 64 \
--epochs 50
```

## Model Considerations

### num_classes Parameter

**Important**: Make sure your model's `num_classes` matches your taxonomic level!

```python
# Species level
datamodule = NeonCrownDataModule(..., taxonomic_level="species") # 167 classes
model = RGBClassifier(num_classes=167) # ✓ Correct

# Genus level
datamodule = NeonCrownDataModule(..., taxonomic_level="genus") # 60 classes
model = RGBClassifier(num_classes=60) # ✓ Correct
```

The number of classes will vary slightly based on your filtering:
- Species level: 167 classes (default)
- Genus level: 60 classes (default), 59 if filtering Pinaceae

## Validation Warnings

When using `taxonomic_level="genus"`, the DataModule automatically validates genus extraction and warns about:

1. **Non-alphabetic genus names** (e.g., "Unknown", "2PLANT")
2. **Known family names** (e.g., "Pinaceae")
3. **Sample counts for edge cases**

Example warning:
```
UserWarning: Found family names treated as genera: {'Pinaceae': 26}.
These represent unidentified species within that family.
See docs/taxonomic_levels.md for more information.
```

**These are informational** - training will proceed normally. Filter if desired using `species_filter`.

## FAQ

**Q: Should I train at genus or species level?**
- Start with **genus** for faster iteration and architecture selection
- Move to **species** for final production models and fine-grained identification

**Q: Can I use pretrained weights from genus-level for species-level?**
- Yes! Transfer learning between taxonomic levels works well
- The backbone features transfer, just replace the classification head

**Q: What about Pinaceae?**
- It's a family name, not genus, but only 26 samples (0.05%)
- Keep it (recommended): Represents "unidentified conifer" class
- Filter it: Use `species_filter=["PINACE"]` if you need taxonomic purity

**Q: How do I know how many classes I have?**
```python
datamodule.setup()
print(f"Number of classes: {datamodule.full_dataset.num_classes}")
print(f"Class names: {datamodule.full_dataset.idx_to_label}")
```

**Q: Can I add more taxonomic levels (family, order)?**
- Yes! The same pattern extends to any taxonomic level
- Would need to modify genus extraction logic
- Contact maintainers if this is needed

## Performance Benchmarks

Expected accuracy ranges on NEON combined dataset (RGB only, ResNet50):

| Taxonomic Level | Classes | Baseline | With Pretrained | With Tuning |
|-----------------|---------|----------|----------------|-------------|
| **Genus** | 60 | 70-75% | 75-80% | 80-85% |
| **Species** | 167 | 50-55% | 65-70% | 70-75% |

*Note: Actual performance depends on data quality, hyperparameters, and training strategy*

## Additional Resources

- **Data inspection**: `python processing/misc/inspect_labels.py`
- **Training examples**: `examples/train.py`
- **Model architectures**: `docs/training.md`
- **Data processing**: `docs/processing.md`

## Citation

If you use genus-level classification in your research, please cite both the package and note the taxonomic level in your methods.
18 changes: 9 additions & 9 deletions docs/training.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,18 +36,18 @@ uv run python examples/train.py \

## Baseline Results

Preliminary single-modality baseline results for 167-species classification using the `combined` dataset configuration (seed=42, no hyperparameter optimization):
Single-modality baseline results using the `combined` dataset configuration (47,971 samples, seed=42):

| Modality | Test Accuracy | Model | Notes |
|----------|---------------|-------|-------|
| RGB | 53.5% | ResNet | Standard computer vision approach |
| HSI | 27.3% | Spectral CNN | 369-band hyperspectral data |
| LiDAR | 11.5% | Structural CNN | Canopy height model |
| Modality | Test Accuracy | Model | Hyperparameters | Notes |
|----------|---------------|-------|-----------------|-------|
| **RGB (Species)** | **75.9%** | ResNetRGB | lr=5e-5, wd=5e-4, bs=256 | 167 species classes, optimized |
| **RGB (Genus)** | **72.2%** | ResNetRGB | lr=5e-5, wd=5e-4, bs=256 | 60 genus classes, coarser taxonomy |
| HSI | 27.3% | Spectral CNN | Default params | 369-band hyperspectral data |
| LiDAR | 11.5% | Structural CNN | Default params | Canopy height model |

**Important Notes:**
- 167-species classification is inherently challenging
- These are basic preliminary results with default parameters
- Significant improvements possible with hyperparameter tuning, data augmentation, and architectural improvements
- RGB performance achieved through config: lr=5e-5, weight_decay=5e-4, batch_size=256, AdamW optimizer
- HSI and LiDAR results are preliminary with default parameters - significant improvement expected with optimization
- Multi-modal fusion is expected to significantly improve performance

## Reproducing Baseline Results
Expand Down
Loading
Loading