
Commit dd2686c

Simplify README with project vision and create comprehensive documentation structure (#9)
* Simplify README and create comprehensive documentation
* Add note to update_readme.py script about new documentation structure
* Improve deprecation notice in update_readme.py based on code review
* Restructure README to focus on project vision and clarify available features vs future plans
* Add checkboxes to project vision and remove citation section
1 parent 80b8f85 commit dd2686c

7 files changed

Lines changed: 1491 additions & 405 deletions

File tree

README.md

Lines changed: 61 additions & 403 deletions
Large diffs are not rendered by default.

docs/README.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# NEON Tree Classification Dataset - Documentation

Welcome to the comprehensive documentation for the NEON Multi-Modal Tree Species Classification Dataset.

## Quick Links

- [README](../README.md) - Main README with dataset overview and quick start
- [Advanced Usage](advanced_usage.md) - Custom filtering, Lightning DataModule, and advanced features
- [Training Guide](training.md) - Model training examples and baseline results
- [Visualization Guide](visualization.md) - Data visualization tools and examples
- [Processing Pipeline](processing.md) - NEON data processing workflow

## Getting Started

1. **New Users**: Start with the [main README](../README.md) for installation and basic usage
2. **Training Models**: See the [Training Guide](training.md) for model training and baseline results
3. **Data Exploration**: Check out the [Visualization Guide](visualization.md) for exploring the dataset
4. **Advanced Features**: Read [Advanced Usage](advanced_usage.md) for custom configurations
5. **Data Processing**: For processing raw NEON data, see the [Processing Pipeline](processing.md)

## Documentation Structure

### [Advanced Usage](advanced_usage.md)
- Custom data filtering with Lightning DataModule
- Split methods (random, site-based, year-based)
- External test sets
- Advanced dataloader configuration
- Direct dataset usage
- Multi-GPU training
- Custom training loops

### [Training Guide](training.md)
- Quick training with examples script
- Baseline results and reproduction steps
- Custom model architectures
- Training best practices
- Multi-modal training
- Experiment tracking (Comet ML, W&B)
- Common issues and solutions

### [Visualization Guide](visualization.md)
- Overview of visualization tools
- RGB, HSI, and LiDAR visualization
- Interactive Jupyter notebook
- Custom visualizations
- Multi-modal comparisons
- Advanced spectral analysis

### [Processing Pipeline](processing.md)
- Complete data processing workflow
- NEON data product details
- Quality control procedures
- HDF5 dataset creation
- Configuration subset creation
- Processing best practices

## Support

For issues, questions, or contributions:
- GitHub Issues: [Report a bug or request a feature](https://github.com/Ritesh313/NeonTreeClassification/issues)
- Contributing: See [CONTRIBUTING.md](../CONTRIBUTING.md)

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{neon_tree_classification_2024,
  title={NEON Multi-Modal Tree Species Classification Dataset},
  author={[Author Names]},
  year={2024},
  publisher={GitHub},
  url={https://github.com/Ritesh313/NeonTreeClassification}
}
```

## License

See [LICENSE](../LICENSE) file for details.

## Acknowledgments

- National Ecological Observatory Network (NEON)
- Dataset statistics generated on 2025-08-28

docs/advanced_usage.md

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
# Advanced Usage

This guide covers advanced features for experienced users who need custom data filtering or specialized training configurations, or who want to use the PyTorch Lightning DataModule directly.

## Custom Data Filtering with Lightning DataModule

The `NeonCrownDataModule` provides flexible filtering and splitting options for advanced use cases.

### Basic Configuration

```python
from neon_tree_classification.core.datamodule import NeonCrownDataModule

# Basic configuration with species/site filtering
datamodule = NeonCrownDataModule(
    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
    modalities=["rgb"],  # Single modality training
    batch_size=32,
    # Filtering options
    species_filter=["PSMEM", "TSHE"],  # Train on specific species
    site_filter=["HARV", "OSBS"],      # Train on specific sites
    year_filter=[2018, 2019],          # Train on specific years
    # Split method options
    split_method="random",  # Options: "random", "site", "year"
    val_ratio=0.15,
    test_ratio=0.15
)

datamodule.setup("fit")
```
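
Once `setup("fit")` has run, it is worth pulling a single batch to confirm the filters did what you expect. A minimal sanity-check sketch, assuming `NeonCrownDataModule` exposes the standard Lightning `train_dataloader()` hook and the `rgb` / `species_idx` batch keys shown later in this guide:

```python
# Quick sanity check on the training split
batch = next(iter(datamodule.train_dataloader()))

rgb = batch["rgb"]             # image tensor, e.g. (batch_size, 3, H, W)
labels = batch["species_idx"]  # integer class indices

print(f"RGB batch shape: {tuple(rgb.shape)}")
print(f"Label range: {labels.min().item()}..{labels.max().item()}")
print(f"Species represented in this batch: {labels.unique().numel()}")
```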

### Split Methods

The DataModule supports three splitting strategies:

**1. Random Split** (default)
```python
datamodule = NeonCrownDataModule(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    split_method="random",
    val_ratio=0.15,
    test_ratio=0.15
)
```

**2. Site-Based Split**

Useful for testing generalization across geographic locations:
```python
datamodule = NeonCrownDataModule(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    split_method="site",
    val_ratio=0.15,
    test_ratio=0.15
)
```

**3. Year-Based Split**

Useful for testing temporal generalization:
```python
datamodule = NeonCrownDataModule(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    split_method="year",
    val_ratio=0.15,
    test_ratio=0.15
)
```

### External Test Sets

For domain adaptation or cross-site validation:

```python
datamodule = NeonCrownDataModule(
    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
    external_test_csv_path="path/to/external_test.csv",
    external_test_hdf5_path="path/to/external_test.h5",  # Optional, uses main HDF5 if not provided
    modalities=["rgb"]
)

datamodule.setup("fit")  # Auto-filters species for compatibility
```

## Advanced DataLoader Configuration

### Custom Normalization

Each modality supports different normalization methods:

**RGB Normalization:**
- `"0_1"`: Scale to [0, 1] range (default)
- `"imagenet"`: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- `"per_sample"`: Normalize each sample independently

**HSI Normalization:**
- `"per_sample"`: Normalize each sample independently (default)
- `"global"`: Use global dataset statistics
- `"none"`: No normalization

**LiDAR Normalization:**
- `"height"`: Normalize by maximum canopy height (default)
- `"per_sample"`: Normalize each sample independently
- `"none"`: No normalization

Example:
```python
from scripts.get_dataloaders import get_dataloaders

train_loader, test_loader = get_dataloaders(
    config='large',
    modalities=['rgb', 'hsi', 'lidar'],
    batch_size=32,
    rgb_norm_method='imagenet',
    hsi_norm_method='global',
    lidar_norm_method='height'
)
```
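
If you need to reproduce a normalization step outside the dataloaders (for example when inspecting raw HDF5 crops), per-sample normalization usually amounts to standardizing each sample by its own statistics. The sketch below illustrates that idea in plain PyTorch; it is an assumption about the general approach, not necessarily the exact formula used inside the dataset code:

```python
import torch

def per_sample_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize one sample (channels, H, W) by its own mean and std."""
    return (x - x.mean()) / (x.std() + eps)

num_bands = 426  # hypothetical band count; check the dataset for the real value
hsi_crop = torch.rand(num_bands, 12, 12)
hsi_normalized = per_sample_normalize(hsi_crop)
```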

### Custom Image Sizes

Adjust the spatial resolution for each modality:

```python
train_loader, test_loader = get_dataloaders(
    config='large',
    modalities=['rgb', 'hsi', 'lidar'],
    batch_size=32,
    rgb_size=(224, 224),   # Larger RGB for fine-grained features
    hsi_size=(16, 16),     # Higher HSI resolution
    lidar_size=(16, 16)    # Higher LiDAR resolution
)
```

## Direct Dataset Usage

For maximum control, use the `NeonCrownDataset` class directly:

```python
from neon_tree_classification.core.dataset import NeonCrownDataset
from torch.utils.data import DataLoader

# Create dataset with custom parameters
dataset = NeonCrownDataset(
    csv_path="_neon_tree_classification_dataset_files/metadata/large_dataset.csv",
    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
    modalities=['rgb', 'hsi'],
    species_filter=['ACRU', 'TSCA'],  # Limit to specific species
    site_filter=['HARV', 'MLBS'],     # Limit to specific sites
    year_filter=[2018, 2019, 2020],   # Limit to specific years
    include_metadata=True,            # Include crown_id, species names, etc.
    rgb_size=(128, 128),
    hsi_size=(12, 12),
    rgb_norm_method='imagenet',
    hsi_norm_method='per_sample'
)

# Create custom DataLoader
train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,
    pin_memory=True
)
```

## Accessing Metadata

Enable metadata in batches to access crown IDs, species names, and site information:

```python
from torch.utils.data import DataLoader
from neon_tree_classification.core.dataset import NeonCrownDataset

# Note: get_dataloaders doesn't support include_metadata yet,
# so use NeonCrownDataset directly:
dataset = NeonCrownDataset(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    modalities=['rgb'],
    include_metadata=True
)

# Access metadata in batches
for batch in DataLoader(dataset, batch_size=32):
    rgb = batch['rgb']
    labels = batch['species_idx']
    crown_ids = batch['crown_id']
    species_names = batch['species']
    sites = batch['site']
```
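
The metadata fields make it easy to slice the dataset, for instance tallying how many crowns come from each site or species. A small sketch that builds only on the batch keys above (with PyTorch's default collation, string fields come back as plain Python lists):

```python
from collections import Counter
from torch.utils.data import DataLoader

site_counts = Counter()
species_counts = Counter()

for batch in DataLoader(dataset, batch_size=32):
    site_counts.update(batch["site"])        # list of site codes
    species_counts.update(batch["species"])  # list of species names

print("Samples per site:", dict(site_counts))
print("Most common species:", species_counts.most_common(5))
```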

## Multi-GPU Training

For distributed training with PyTorch Lightning:

```python
import pytorch_lightning as pl
from neon_tree_classification.core.datamodule import NeonCrownDataModule

# Configure DataModule
datamodule = NeonCrownDataModule(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    modalities=["rgb"],
    batch_size=32  # Per-GPU batch size
)

# Create trainer with multi-GPU support
trainer = pl.Trainer(
    devices=4,        # Number of GPUs
    strategy='ddp',   # Distributed Data Parallel
    precision=16,     # Mixed precision training
    max_epochs=100
)

# Your Lightning module
trainer.fit(model, datamodule=datamodule)
```
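
The `model` passed to `trainer.fit` above is assumed to exist already. For completeness, here is a minimal sketch of a LightningModule that classifies RGB crops; the ResNet-18 backbone and the `num_species` value are illustrative placeholders, not part of this package:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
import torchvision

class RGBClassifier(pl.LightningModule):
    """Minimal RGB-only classifier for illustration."""

    def __init__(self, num_species: int = 10, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        # Plain ResNet-18 backbone with a fresh classification head
        self.backbone = torchvision.models.resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_species)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        logits = self(batch["rgb"])
        loss = self.criterion(logits, batch["species_idx"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

model = RGBClassifier(num_species=10)  # set num_species to match your filtered dataset
```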

## Custom Training Loop

Example of a custom training loop without PyTorch Lightning:

```python
import torch
from scripts.get_dataloaders import get_dataloaders

# Get dataloaders
train_loader, test_loader = get_dataloaders(
    config='large',
    modalities=['rgb'],
    batch_size=64
)

# Your model
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    for batch in train_loader:
        rgb = batch['rgb'].cuda()
        labels = batch['species_idx'].cuda()

        optimizer.zero_grad()
        outputs = model(rgb)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in test_loader:
            rgb = batch['rgb'].cuda()
            labels = batch['species_idx'].cuda()
            outputs = model(rgb)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    accuracy = 100. * correct / total
    print(f'Epoch {epoch}: Accuracy = {accuracy:.2f}%')
```
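
`YourModel` in the loop above is a placeholder. If you just want something that runs end to end, a small CNN along these lines works as a stand-in (again, `num_species` must match your filtered dataset):

```python
import torch.nn as nn

class YourModel(nn.Module):
    """Tiny CNN baseline for RGB crown crops (illustrative only)."""

    def __init__(self, num_species: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.classifier = nn.Linear(64, num_species)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)
```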

## Performance Tips

1. **Use larger batch sizes**: The dataset fits in memory efficiently due to HDF5 compression
2. **Increase num_workers**: More workers can significantly speed up data loading
3. **Enable pin_memory**: Speeds up CPU-to-GPU transfer
4. **Use persistent_workers**: Reduces worker initialization overhead

```python
train_loader, test_loader = get_dataloaders(
    config='large',
    modalities=['rgb'],
    batch_size=256,   # Larger batch size
    num_workers=16,   # More workers (adjust based on CPU cores)
)
```
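
Tips 3 and 4 refer to standard PyTorch `DataLoader` options. If `get_dataloaders` does not expose them, you can build the loader yourself on top of `NeonCrownDataset` (see Direct Dataset Usage above); a sketch:

```python
from torch.utils.data import DataLoader
from neon_tree_classification.core.dataset import NeonCrownDataset

dataset = NeonCrownDataset(
    csv_path="path/to/dataset.csv",
    hdf5_path="path/to/dataset.h5",
    modalities=["rgb"],
)

train_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=16,           # adjust to available CPU cores
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```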
