This guide documents the NEON data processing pipeline for advanced users who want to process raw NEON data products or understand how the dataset was created.
The `processing/` folder contains the complete pipeline for converting raw NEON data products into the final training-ready dataset. This is useful for:
- Processing new NEON data
- Customizing the dataset creation process
- Understanding data quality control steps
- Creating similar datasets from other sources
Download RGB, HSI, and LiDAR data from the NEON API:
```bash
python processing/neon_data_processing/neon_downloader.py \
    --site HARV \
    --year 2018 \
    --output_dir /path/to/output
```

What it does:
- Downloads NEON data products from the API
- Organizes files by site, year, and modality
- Validates downloads and checks file integrity
Key parameters:
- `--site`: NEON site code (e.g., HARV, MLBS, GRSM)
- `--year`: Data collection year
- `--products`: Which products to download (rgb, hsi, lidar)
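Under the hood the script talks to the public NEON data API. As a rough sketch of what a query for one product looks like (the product codes are NEON's standard identifiers for the three modalities; the month, helper name, and response handling here are illustrative, not the script's actual internals):

```python
import requests

# NEON's standard data product codes for the three modalities
PRODUCTS = {
    "rgb": "DP3.30010.001",    # High-resolution orthorectified camera imagery mosaic
    "hsi": "DP3.30006.001",    # Spectrometer surface directional reflectance mosaic
    "lidar": "DP3.30015.001",  # Ecosystem structure (canopy height model)
}

def list_available_files(product_code: str, site: str, year_month: str):
    """List the files NEON has released for one product/site/month."""
    url = f"https://data.neonscience.org/api/v0/data/{product_code}/{site}/{year_month}"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"]["files"]

# Illustrative month; check the API for the actual flight dates per site/year
for f in list_available_files(PRODUCTS["rgb"], "HARV", "2018-08")[:5]:
    print(f["name"], f["url"])
```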
Quality control and tile selection:
```bash
python processing/neon_data_processing/curate_tiles.py \
    --input_dir /path/to/downloaded/tiles \
    --output_dir /path/to/curated/tiles \
    --quality_threshold 0.8
```

What it does:
- Checks for data completeness (all three modalities present)
- Validates spatial alignment between modalities
- Filters out low-quality or corrupted tiles
- Creates metadata about tile quality
Quality checks:
- Spatial overlap between RGB, HSI, and LiDAR
- Missing data percentage
- Coordinate system consistency
- File format validation
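For example, the spatial-overlap and CRS checks can be approximated with rasterio bounds arithmetic; this is a minimal sketch, not curate_tiles.py's actual implementation:

```python
import rasterio

def bounds_overlap(path_a: str, path_b: str, min_fraction: float = 0.99) -> bool:
    """True if raster A's extent is (nearly) fully covered by raster B's."""
    with rasterio.open(path_a) as a, rasterio.open(path_b) as b:
        if a.crs != b.crs:
            return False  # CRS mismatch fails the consistency check outright
        left = max(a.bounds.left, b.bounds.left)
        right = min(a.bounds.right, b.bounds.right)
        bottom = max(a.bounds.bottom, b.bounds.bottom)
        top = min(a.bounds.top, b.bounds.top)
        if right <= left or top <= bottom:
            return False  # disjoint extents
        inter = (right - left) * (top - bottom)
        area_a = (a.bounds.right - a.bounds.left) * (a.bounds.top - a.bounds.bottom)
        return inter / area_a >= min_fraction

print(bounds_overlap("rgb_tile.tif", "chm_tile.tif"))
```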
Extract crown metadata and validate annotations:
```bash
cd processing/shapefile_processing
python process_shapefiles.py \
    --shapefile_dir /path/to/shapefiles \
    --output_csv crowns_metadata.csv
```

What it does:
- Extracts tree crown polygons from shapefiles
- Links crowns to individual tree measurements
- Validates crown annotations
- Merges ecological metadata (height, diameter, etc.)
Output format:
- CSV with crown ID, species, location, measurements
- Validated polygon geometries
- Quality flags for each annotation
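In essence the script is a geopandas read-validate-export loop. A stripped-down sketch (the attribute names crown_id, species, and height are illustrative; the real shapefiles may use a different schema):

```python
import geopandas as gpd

crowns = gpd.read_file("crowns.shp")

# Flag rather than silently drop invalid geometries
crowns["valid_geom"] = crowns.geometry.is_valid
crowns = crowns[crowns["valid_geom"]]

# Geometry serializes to WKT in the CSV; keep it alongside the metadata
crowns[["crown_id", "species", "height", "geometry", "valid_geom"]].to_csv(
    "crowns_metadata.csv", index=False
)
```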
Extract individual tree crowns from tiles:
```bash
python processing/neon_data_processing/crop_crowns_multimodal.py \
    --tiles_dir /path/to/curated/tiles \
    --crowns_csv crowns_metadata.csv \
    --output_dir /path/to/cropped/crowns \
    --rgb_size 128 \
    --hsi_size 12 \
    --lidar_size 12
```

What it does:
- Extracts bounding boxes around each crown
- Crops corresponding regions from RGB, HSI, and LiDAR tiles
- Resamples to target resolutions
- Handles coordinate transformations between modalities
Parameters:
- `--rgb_size`: Target RGB resolution (default: 128x128)
- `--hsi_size`: Target HSI resolution (default: 12x12)
- `--lidar_size`: Target LiDAR resolution (default: 12x12)
- `--padding`: Additional padding around crowns (default: 0)
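The heart of the cropping step is a windowed, resampled raster read. A sketch with rasterio (paths and coordinates are placeholders; the script additionally handles the per-modality coordinate transforms):

```python
import rasterio
from rasterio.windows import from_bounds
from rasterio.enums import Resampling

def crop_crown(tile_path: str, bounds: tuple, out_size: int):
    """Crop a crown bounding box (in the tile's CRS) and resample to out_size."""
    with rasterio.open(tile_path) as src:
        window = from_bounds(*bounds, transform=src.transform)
        # Read straight to the target resolution with bilinear resampling
        return src.read(
            window=window,
            out_shape=(src.count, out_size, out_size),
            resampling=Resampling.bilinear,
        )

# bounds = (minx, miny, maxx, maxy) from the crown polygon, in the tile's CRS
rgb_chip = crop_crown("rgb_tile.tif", (731000.0, 4713000.0, 731005.0, 4713005.0), 128)
print(rgb_chip.shape)  # (3, 128, 128)
```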
Optimize data storage:
```bash
# Convert TIF to NumPy
python processing/neon_data_processing/convert_tif_to_npy.py \
    --input_dir /path/to/cropped/crowns \
    --output_dir /path/to/numpy/arrays

# Convert HSI H5 to TIF
python processing/neon_data_processing/hsi_convert_h5_to_tif.py \
    --input_dir /path/to/hsi/h5 \
    --output_dir /path/to/hsi/tif
```

What it does:
- Converts various formats to efficient storage
- Applies compression where appropriate
- Validates converted data
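The TIF-to-NumPy conversion is conceptually one read and one save per file; a minimal sketch:

```python
from pathlib import Path
import numpy as np
import rasterio

in_dir, out_dir = Path("cropped_crowns"), Path("numpy_arrays")
out_dir.mkdir(parents=True, exist_ok=True)

for tif in in_dir.glob("*.tif"):
    with rasterio.open(tif) as src:
        arr = src.read()  # (bands, height, width)
    # .npy loads much faster during training; georeferencing is dropped
    # here and kept separately in the metadata CSV
    np.save(out_dir / f"{tif.stem}.npy", arr)
```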
Create final training/test CSVs:
```bash
python processing/neon_data_processing/create_training_csv.py \
    --crowns_dir /path/to/cropped/crowns \
    --metadata_csv crowns_metadata.csv \
    --output_csv training_dataset.csv
```

What it does:
- Combines all metadata
- Validates data availability for each sample
- Adds file paths to HDF5 dataset
- Creates train/val/test splits
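Species labels are heavily imbalanced, so the splits should be stratified. A sketch of that step with scikit-learn (the `species` column name and 80/10/10 ratios are assumptions, not the script's documented behavior):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_dataset.csv")

# 80/10/10, stratified by species so each class appears in every split.
# Note: stratification fails for single-sample classes, which is one
# reason rare species get filtered in the refinement step.
train, rest = train_test_split(df, test_size=0.2, stratify=df["species"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["species"], random_state=42)

df.loc[train.index, "split"] = "train"
df.loc[val.index, "split"] = "val"
df.loc[test.index, "split"] = "test"
df.to_csv("training_dataset.csv", index=False)
```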
Dataset refinement:
```bash
# Filter rare species
python processing/misc/filter_rare_species.py \
    --input_csv training_dataset.csv \
    --output_csv filtered_dataset.csv \
    --min_samples 50

# Combine multiple datasets
python processing/misc/dataset_combiner.py \
    --input_csvs dataset1.csv dataset2.csv dataset3.csv \
    --output_csv combined_dataset.csv
```

What it does:
- Removes species with insufficient samples
- Combines datasets from different sites/years
- Ensures species compatibility across datasets
- Creates configuration subsets (large, high_quality, combined)
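The rare-species filter reduces to a grouped count threshold; a minimal pandas equivalent:

```python
import pandas as pd

df = pd.read_csv("training_dataset.csv")

# Keep only species with at least min_samples annotated crowns
min_samples = 50
counts = df["species"].value_counts()
kept = counts[counts >= min_samples].index
filtered = df[df["species"].isin(kept)]

print(f"{df['species'].nunique()} -> {filtered['species'].nunique()} species, "
      f"{len(df)} -> {len(filtered)} samples")
filtered.to_csv("filtered_dataset.csv", index=False)
```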
```
processing/
├── neon_data_processing/           # Main processing scripts
│   ├── neon_downloader.py          # Download NEON data
│   ├── curate_tiles.py             # Quality control
│   ├── crop_crowns_multimodal.py   # Extract crowns
│   ├── convert_tif_to_npy.py       # Format conversion
│   ├── hsi_convert_h5_to_tif.py    # HSI format conversion
│   └── create_training_csv.py      # Generate training CSVs
├── shapefile_processing/           # Shapefile tools
│   ├── process_shapefiles.py       # Extract crown metadata
│   └── README.md                   # Shapefile processing guide
└── misc/                           # Utility scripts
    ├── filter_rare_species.py      # Species filtering
    └── dataset_combiner.py         # Combine datasets
```
RGB: High-Resolution Orthorectified Camera Imagery
- Resolution: 10cm
- Format: GeoTIFF
- Bands: RGB (3 channels)
- Coverage: Full site mosaics
HSI: Surface Directional Reflectance
- Resolution: 1m
- Format: HDF5
- Bands: 426 spectral bands (380-2510 nm)
- Processing: Atmospheric correction applied
- Note: Reduced to 369 bands in dataset (removed noisy bands)
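The dropped channels are mainly the water-vapor absorption windows (roughly 1340-1445 nm and 1790-1955 nm) and the noisy detector edges; the exact indices live in the processing scripts, but the operation is a simple boolean mask. The cut-offs below are illustrative:

```python
import numpy as np

hsi = np.random.rand(426, 12, 12)            # placeholder reflectance cube
wavelengths = np.linspace(380, 2510, 426)    # nm, one per band

# Illustrative noisy regions: water-vapor windows and the long-wave edge
noisy = (
    ((wavelengths > 1340) & (wavelengths < 1445))
    | ((wavelengths > 1790) & (wavelengths < 1955))
    | (wavelengths > 2400)
)
keep = ~noisy
hsi_clean = hsi[keep]  # these cut-offs keep ~350 bands; the dataset keeps 369
print(keep.sum(), "bands kept")
```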
LiDAR: Ecosystem Structure
- Resolution: 1m
- Format: GeoTIFF
- Data: Canopy Height Model (CHM)
- Derived from: Point cloud classification
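Reading the CHM is an ordinary single-band GeoTIFF read; heights are in meters:

```python
import rasterio

with rasterio.open("chm_tile.tif") as src:
    chm = src.read(1)  # canopy height in meters, 1 m pixels
    print(chm.shape, src.crs, f"{chm.max():.1f} m max height")
```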
Ensure all modalities are properly aligned:
```python
from processing.utils import check_spatial_alignment

# Verify alignment
aligned = check_spatial_alignment(
    rgb_path='path/to/rgb.tif',
    hsi_path='path/to/hsi.h5',
    lidar_path='path/to/lidar.tif',
    tolerance=0.5  # meters
)

if not aligned:
    print("Warning: Modalities not aligned!")
```

Handle missing or corrupted data:
```python
from processing.utils import validate_data

# Check data quality
quality = validate_data(
    crown_id='HARV_123',
    rgb_path='path/to/rgb.npy',
    hsi_path='path/to/hsi.npy',
    lidar_path='path/to/lidar.npy'
)

print(f"Quality score: {quality['score']:.2f}")
print(f"Issues: {quality['issues']}")
```

Convert processed crowns to HDF5:
```python
import h5py
import numpy as np

# processed_crowns is assumed to be a dict mapping crown ID ->
# {'rgb': array, 'hsi': array, 'lidar': array}
with h5py.File('neon_dataset.h5', 'w') as f:
    # One group per modality
    groups = {m: f.create_group(m) for m in ('rgb', 'hsi', 'lidar')}

    # Add crown data, gzip-compressed at the highest level
    for crown_id, data in processed_crowns.items():
        for modality, group in groups.items():
            group.create_dataset(
                crown_id,
                data=data[modality],
                compression='gzip',
                compression_opts=9,
            )
```

Create different dataset configurations:
All available data (47,971 samples, 167 species):

```bash
python processing/misc/create_config.py \
    --input_csv all_crowns.csv \
    --output_csv combined_dataset.csv \
    --config combined
```

Main training set (~42,000 samples, ~162 species):

```bash
python processing/misc/create_config.py \
    --input_csv all_crowns.csv \
    --output_csv large_dataset.csv \
    --config large \
    --min_samples_per_species 50
```

Curated subset (~5,500 samples, ~96 species):

```bash
python processing/misc/create_config.py \
    --input_csv all_crowns.csv \
    --output_csv high_quality_dataset.csv \
    --config high_quality \
    --quality_threshold 0.9 \
    --min_samples_per_species 100
```

Process one site first to validate the pipeline:
```bash
# Test with a single site
bash process_site.sh HARV 2018
```

Use parallel processing for large-scale operations:

```bash
# Process multiple sites in parallel (GNU parallel appends each site code)
parallel -j 4 bash process_site.sh ::: HARV MLBS GRSM DELA
```

Monitor disk usage - raw and processed data can be large:
- Raw tiles: ~100GB per site
- Processed crowns: ~50GB per site
- Final HDF5 dataset: ~600MB (compressed)
Always validate processed data:
```bash
python processing/utils/validate_dataset.py \
    --csv_path training_dataset.csv \
    --hdf5_path neon_dataset.h5
```

Common issues and solutions:

- Modalities misaligned: check coordinate reference systems (CRS) and reproject if needed
- Missing or invalid HSI bands: verify HSI data download and band extraction
- Too many low-quality samples: increase quality thresholds in the curation step
- Out-of-memory errors: process in batches or use more efficient data types
For custom processing workflows:
```python
from processing.pipeline import ProcessingPipeline

# Create custom pipeline
pipeline = ProcessingPipeline(
    sites=['HARV', 'MLBS'],
    years=[2018, 2019, 2020],
    output_dir='custom_dataset'
)

# Configure processing
pipeline.set_quality_threshold(0.85)
pipeline.set_crown_sizes(rgb=224, hsi=16, lidar=16)

# Run pipeline
pipeline.run()
```