This guide will get you started with the NTSB Aviation Accident Database in under 15 minutes. Choose your workflow:
- Option A: PostgreSQL Database (Recommended) - Full-featured database with optimized queries, geospatial analysis, and monitoring
- Option B: CSV/DuckDB Analysis - Quick exploration without database setup
- Operating System: Linux (tested on CachyOS/Arch), macOS, or Windows WSL2
- Disk Space: 2-5 GB free
- Memory: 4 GB+ RAM recommended
- PostgreSQL: Version 18.0+ (for Option A)
- Python: Version 3.11+ with venv
- Tools: Git, mdbtools (for .mdb extraction)
# Check PostgreSQL (Option A)
psql --version
# Expected: PostgreSQL 18.0 or higher
# Check Python
python --version
# Expected: Python 3.11 or higher
# Check mdbtools
mdb-ver --version
# Expected: mdbtools version info
# Check disk space
df -h .
# Ensure 2-5 GB free spaceFor optimal query performance, advanced analytics, geospatial queries, and production-ready monitoring infrastructure.
git clone https://github.com/doublegate/NTSB-Dataset_Analysis.git
cd NTSB-Dataset_Analysis# Run the automated setup script
./scripts/setup_database.sh
# This will:
# - Start PostgreSQL service
# - Create 'ntsb_aviation' database
# - Enable PostGIS, pg_trgm, pgcrypto extensions
# - Create 11 tables + indexes
# - Set up staging tables and load tracking
# - Transfer ownership to current user (NO SUDO after setup)# 1. Start PostgreSQL
sudo systemctl start postgresql
sudo systemctl enable postgresql # Optional: auto-start on boot
# 2. Create database
sudo -u postgres createdb ntsb_aviation
# 3. Enable extensions
sudo -u postgres psql -d ntsb_aviation -c "CREATE EXTENSION IF NOT EXISTS postgis;"
sudo -u postgres psql -d ntsb_aviation -c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
sudo -u postgres psql -d ntsb_aviation -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;"
# 4. Apply schema
sudo -u postgres psql -d ntsb_aviation -f scripts/schema.sql
# 5. Grant permissions to your user
sudo -u postgres psql -d ntsb_aviation -f scripts/transfer_ownership.sql
# 6. Verify setup
psql -d ntsb_aviation -c "\dt" # Should show 11 tables
psql -d ntsb_aviation -c "\di" # Should show 59+ indexes# 1. Activate Python environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# 2. Install dependencies
pip install -r requirements.txt
# 3. Load current data (2008-present, ~29,773 events)
python scripts/load_with_staging.py --source datasets/avall.mdb
# Expected output:
# - Loading 11 tables in dependency order
# - Progress updates for each table
# - Total: ~478,631 rows loaded
# - Duration: 5-10 minutes
# - Success confirmation
# 4. Load historical data (optional, 1982-2007)
python scripts/load_with_staging.py --source datasets/Pre2008.mdb# Create materialized views and additional indexes
psql -d ntsb_aviation -f scripts/optimize_queries.sql
# This creates:
# - 6 materialized views (yearly, state, aircraft, decade, crew, findings)
# - 9 additional performance indexes
# - Analyzes all tables for query planner
# - Result: 30-114x speedup for analytical queries# Run comprehensive data quality checks
psql -d ntsb_aviation -f scripts/validate_data.sql > validation_report.txt
# Review results
less validation_report.txt
# Key checks:
# ✓ Row counts match expectations
# ✓ Primary keys are unique
# ✓ Foreign keys are valid
# ✓ Coordinates are within valid ranges
# ✓ Dates are within expected range
# ✓ Partitions are populated
# ✓ Indexes are created
# Quick manual verification
psql -d ntsb_aviation -c "SELECT COUNT(*) FROM events;"
# Expected: ~92,771 events (if both avall.mdb and Pre2008.mdb loaded)
psql -d ntsb_aviation -c "SELECT * FROM vw_database_health;"
# Shows overall database health snapshotFor production deployments with automated data updates, set up monitoring infrastructure:
# 1. Create monitoring views
psql -d ntsb_aviation -f scripts/create_monitoring_views.sql
# 2. Run anomaly detection
python scripts/detect_anomalies.py --lookback-days 30 --output json
# 3. Check data quality
psql -d ntsb_aviation -c "SELECT * FROM vw_data_quality_checks;"
# 4. View monthly trends
psql -d ntsb_aviation -c "SELECT * FROM vw_monthly_event_trends ORDER BY year DESC, month DESC LIMIT 12;"# 1. Configure PostgreSQL for Docker access (ONE-TIME SETUP)
# Edit postgresql.conf: set listen_addresses = '*'
# Edit pg_hba.conf: add entry for Docker bridge (172.17.0.0/16)
# Restart PostgreSQL: sudo systemctl restart postgresql
# 2. Start Airflow services
cd airflow/
docker compose up -d
# 3. Access Web UI
open http://localhost:8080 # Login: airflow/airflow
# 4. Configure notifications (optional)
# Edit airflow/.env:
# - Add Slack webhook URL
# - Add email SMTP settings
# - See docs/MONITORING_SETUP_GUIDE.md for details
# 5. Trigger production DAG
docker compose exec airflow-scheduler airflow dags trigger monthly_sync_ntsb_data
# 6. Stop services
docker compose downSee AIRFLOW_SETUP_GUIDE.md and MONITORING_SETUP_GUIDE.md for detailed documentation.
Run a test query to confirm everything is working:
psql -d ntsb_aviation -c "
SELECT
ev_year,
COUNT(*) as accidents,
SUM(CASE WHEN ev_highest_injury = 'FATL' THEN 1 ELSE 0 END) as fatal,
SUM(inj_tot_f) as fatalities
FROM events
WHERE ev_year >= 2020
GROUP BY ev_year
ORDER BY ev_year DESC;
"Expected: Yearly statistics for 2020-2025 with sub-second response time.
✅ Congratulations! Your PostgreSQL database is ready for analysis.
For quick exploration without database setup. Ideal for initial data exploration, one-off analyses, or users who prefer working with CSV files.
git clone https://github.com/doublegate/NTSB-Dataset_Analysis.git
cd NTSB-Dataset_Analysis# Setup tools (Fish shell required)
./setup.fish # Install tools: mdbtools, Python, Rust tools
# Extract all tables from databases
./scripts/extract_all_tables.fish datasets/avall.mdb
./scripts/extract_all_tables.fish datasets/Pre2008.mdb
./scripts/extract_all_tables.fish datasets/PRE1982.MDB
# Or extract single table
mdb-export datasets/avall.mdb events > data/events.csv# View first 10 rows
head -n 10 data/avall-events.csv
# Count rows
wc -l data/avall-events.csv
# View column names
head -n 1 data/avall-events.csv | tr ',' '\n'
# Quick statistics (requires csvkit)
csvstat data/avall-events.csv
# Filter by state
csvgrep -c ev_state -m "CA" data/avall-events.csv > data/ca_events.csv# Launch interactive DuckDB
duckdb
# Or one-liner queries
duckdb -c "SELECT COUNT(*) FROM 'data/avall-events.csv'"
duckdb -c "
SELECT ev_state, COUNT(*) as count
FROM 'data/avall-events.csv'
GROUP BY ev_state
ORDER BY count DESC
LIMIT 10
"
# Join multiple tables
duckdb -c "
SELECT e.ev_id, e.ev_date, e.ev_state, a.acft_make, a.acft_model
FROM 'data/avall-events.csv' e
LEFT JOIN 'data/avall-aircraft.csv' a ON e.ev_id = a.ev_id
WHERE e.ev_year >= 2020
LIMIT 100
"# Activate Python environment
source .venv/bin/activate
# Start Jupyter
jupyter lab
# Or run example scripts
python examples/quick_analysis.py
python examples/advanced_analysis.py
python examples/geospatial_analysis.pyimport pandas as pd
import duckdb
# Load data with DuckDB (fast)
df = duckdb.query("SELECT * FROM 'data/avall-events.csv' WHERE ev_year >= 2020").to_df()
# Or with pandas
df = pd.read_csv('data/avall-events.csv')
# Quick exploration
print(df.shape)
print(df.columns)
print(df.head())
print(df.describe())
# Filter and analyze
recent = df[df['ev_year'] >= 2020]
print(f"Accidents since 2020: {len(recent)}")
print(f"Total fatalities: {recent['inj_tot_f'].sum()}")
# Group by state
by_state = df.groupby('ev_state').size().sort_values(ascending=False)
print(by_state.head(10))# Count rows
xsv count data/avall-events.csv
# Select specific columns
xsv select ev_id,ev_date,ev_state,ev_city data/avall-events.csv | head
# Search
xsv search -s ev_state "CA" data/avall-events.csv
# Statistics
xsv stats data/avall-events.csv
# Frequency count
xsv frequency -s ev_state data/avall-events.csv-- Recent fatal accidents with aircraft details
SELECT e.ev_id, e.ev_date, e.ev_type, e.ev_state, e.inj_tot_f,
a.acft_make, a.acft_model
FROM events e
JOIN aircraft a ON e.ev_id = a.ev_id
WHERE e.inj_tot_f > 0 AND e.ev_year >= 2020
ORDER BY e.ev_date DESC
LIMIT 100;
-- Yearly accident statistics (from materialized view)
SELECT * FROM mv_yearly_stats
WHERE year >= 2020
ORDER BY year DESC;
-- Top 10 aircraft types by accident count
SELECT * FROM mv_aircraft_stats
ORDER BY total_accidents DESC
LIMIT 10;
-- Geospatial query: accidents near a location (requires PostGIS)
SELECT ev_id, ev_date, ev_state,
ST_Distance(location_geom, ST_MakePoint(-122.4194, 37.7749)) as distance_meters
FROM events
WHERE location_geom IS NOT NULL
AND ST_DWithin(location_geom, ST_MakePoint(-122.4194, 37.7749), 50000)
ORDER BY distance_meters;-- Recent fatal accidents
SELECT ev_id, ev_date, ev_type, ev_state, inj_tot_f
FROM 'data/avall-events.csv'
WHERE inj_tot_f > 0 AND ev_year >= 2020
ORDER BY ev_date DESC;
-- Most common occurrence types
SELECT occurrence_code, COUNT(*) as count
FROM 'data/avall-Occurrences.csv'
GROUP BY occurrence_code
ORDER BY count DESC
LIMIT 10;
-- Events with aircraft details
SELECT e.ev_id, e.ev_date, e.ev_state, a.acft_make, a.acft_model
FROM 'data/avall-events.csv' e
LEFT JOIN 'data/avall-aircraft.csv' a ON e.ev_id = a.ev_id
WHERE e.ev_year >= 2020
LIMIT 100;# Start service
sudo systemctl start postgresql
# Enable auto-start on boot
sudo systemctl enable postgresql
# Check status
systemctl status postgresql# Check if port 5432 is listening
sudo netstat -tlnp | grep 5432
# Check PostgreSQL logs
sudo journalctl -u postgresql -n 50# Re-grant permissions to your user
sudo -u postgres psql -d ntsb_aviation -f scripts/transfer_ownership.sql
# Or manually
sudo -u postgres psql -d ntsb_aviation -c "
GRANT ALL PRIVILEGES ON DATABASE ntsb_aviation TO $USER;
GRANT ALL PRIVILEGES ON SCHEMA public TO $USER;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO $USER;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO $USER;
"# Run ANALYZE to update statistics
psql -d ntsb_aviation -c "ANALYZE;"
# Check if indexes are being used
psql -d ntsb_aviation -c "
EXPLAIN ANALYZE
SELECT * FROM events WHERE ev_state = 'CA' LIMIT 10;
"
# Look for "Index Scan" in output (not "Seq Scan")
# Check buffer cache hit ratio (should be >95%)
psql -d ntsb_aviation -c "
SELECT
ROUND(100.0 * sum(blks_hit) / NULLIF(sum(blks_hit) + sum(blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'ntsb_aviation';
"# Load tables one at a time
python scripts/load_with_staging.py --source datasets/avall.mdb
# Or reduce chunk size
python scripts/load_with_staging.py --source datasets/avall.mdb --chunk-size 500# mdbtools is in AUR, install with:
paru -S mdbtools
# Or fix PKGBUILD if autoconf errors:
./fix_mdbtools_pkgbuild.fishsource .venv/bin/activate
pip install <module_name># Use Polars instead of pandas (10x faster, lower memory)
import polars as pl
df = pl.read_csv('data/avall-events.csv')
# Or use DuckDB for querying
import duckdb
result = duckdb.query("SELECT * FROM 'data/avall-events.csv' WHERE ev_year >= 2020").to_df()
# Or read in chunks
for chunk in pd.read_csv('data/avall-events.csv', chunksize=10000):
process(chunk)# Try different encodings
df = pd.read_csv('file.csv', encoding='latin1')
df = pd.read_csv('file.csv', encoding='utf-8')# Check database size
psql -d ntsb_aviation -c "SELECT pg_size_pretty(pg_database_size('ntsb_aviation'));"
# Count events
psql -d ntsb_aviation -c "SELECT COUNT(*) FROM events;"
# Refresh materialized views
psql -d ntsb_aviation -c "SELECT * FROM refresh_all_materialized_views();"
# Run anomaly detection
python scripts/detect_anomalies.py --lookback-days 30
# Check data quality
psql -d ntsb_aviation -c "SELECT * FROM vw_data_quality_checks;"
# View database health
psql -d ntsb_aviation -c "SELECT * FROM vw_database_health;"# Extract table from MDB
mdb-export datasets/avall.mdb events > data/events.csv
# Count rows
wc -l data/avall-events.csv
# Filter by column
csvgrep -c ev_state -m "CA" data/avall-events.csv
# Quick statistics
csvstat data/avall-events.csv
# Convert to JSON
csvjson data/avall-events.csv > events.jsonQuick reference for common NTSB codes (see ref_docs/codman.pdf for complete list):
- 100: ABRUPT MANEUVER
- 110: AIRFRAME/COMPONENT/SYSTEM FAILURE/MALFUNCTION
- 200: ENGINE FAILURE
- 250: FIRE
- 300: FUEL EXHAUSTION
- 350: MIDAIR COLLISION
- 400: WEATHER
- 500: STANDING
- 510: TAXI
- 520: TAKEOFF
- 530: CLIMB
- 540: CRUISE
- 560: DESCENT
- 580: APPROACH
- 590: LANDING
- 600: MANEUVERING
- 10000-11700: Airframe (wings, fuselage, landing gear, flight controls)
- 12000-13500: Systems (hydraulic, electrical, environmental, fuel)
- 14000-17710: Powerplant (engines, propellers, turbines, exhaust)
- Explore Examples: See
examples/directory for Jupyter notebooks and Python scripts - Read Documentation: Check TOOLS_AND_UTILITIES.md for comprehensive tool catalog
- View Schema: See
scripts/schema.sqlfor complete database structure - Learn Coding System: Read
ref_docs/codman.pdffor NTSB coding manual
- Set Up Airflow: See AIRFLOW_SETUP_GUIDE.md
- Configure Monitoring: See MONITORING_SETUP_GUIDE.md
- Performance Tuning: See PERFORMANCE_BENCHMARKS.md
- Historical Data: Load PRE1982.MDB (1962-1981) - see PRE1982_ANALYSIS.md
- Machine Learning: See MACHINE_LEARNING_APPLICATIONS.md
- Geospatial Analysis: See GEOSPATIAL_ADVANCED.md
- NLP/Text Mining: See NLP_TEXT_MINING.md
- Project Roadmap: See to-dos/PHASE_2_ANALYTICS.md
After completing setup, verify:
- PostgreSQL service is running
- Database
ntsb_aviationexists - 11 tables created (events, aircraft, Flight_Crew, etc.)
- 59+ indexes created
- ~92,771 events loaded (both databases)
- Foreign key relationships valid (0 orphaned records)
- Query performance <100ms for indexed queries
- Spatial queries working (PostGIS enabled)
- Full-text search working (pg_trgm enabled)
- Materialized views created (6 views)
- Monitoring views created (4 views)
- Data quality: 9/9 checks passed
- MDB files extracted to CSV
- CSV files in
data/directory - DuckDB queries working
- Python environment activated
- Example scripts running successfully
- Jupyter notebooks accessible
- p50 Latency: <10ms for simple queries
- p95 Latency: <100ms for complex analytical queries
- p99 Latency: <500ms for heavy aggregations
- Buffer Cache Hit Ratio: >95%
- Index Usage: >99% on primary tables
- avall.mdb: ~30 seconds for full load (29,773 events)
- Pre2008.mdb: ~90 seconds for full load (63,000+ events)
- Throughput: 15,000-45,000 rows/sec (varies by table)
- Complete Documentation: See README.md
- Database Schema: See
scripts/schema.sqlandref_docs/eadmspub.pdf - Coding Manual: See
ref_docs/codman.pdffor NTSB coding system - Installation Issues: See INSTALLATION.md
- Contributing: See CONTRIBUTING.md
- Security: See SECURITY.md
Last Updated: November 7, 2025 Version: 2.1.0 Sprint: Phase 1 Sprint 3 Week 3 - COMPLETE (Monitoring & Observability) Production Ready: December 1st, 2025