Complete installation guide for the NTSB Aviation Accident Database analysis environment on CachyOS Linux with the Fish shell.
- CachyOS Linux (Arch-based)
- Fish shell
- Internet connection
- ~5GB free disk space
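The prerequisites above can be checked before running anything else. The following is a minimal sketch using only the Python standard library; the function names and the 5 GiB threshold (an approximation of the "~5GB" figure) are illustrative assumptions, not part of this repository.

```python
import shutil

# Assumption: "~5GB" from the prerequisites list, treated here as 5 GiB.
REQUIRED_FREE_GB = 5


def free_gb(path="."):
    """Return free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_prerequisites(path=".", required_gb=REQUIRED_FREE_GB):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if shutil.which("fish") is None:
        problems.append("fish shell not found on PATH")
    available = free_gb(path)
    if available < required_gb:
        problems.append(
            f"only {available:.1f} GiB free, need about {required_gb}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the directory where the datasets will live, since free space is measured on that filesystem.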
The easiest way to get started:
# Clone or download this repository
cd /path/to/NTSB_Datasets
# Run automated setup
./setup.fish
This will install:
- ✅ System packages (sqlite, python, gdal, etc.)
- ✅ AUR packages (mdbtools, duckdb, DBeaver, Quarto) if paru is installed
- ✅ Python virtual environment
- ✅ All Python packages (pandas, polars, duckdb, jupyter, etc.)
- ✅ Rust tools (xsv, qsv, polars-cli) if cargo is installed
- ✅ NLP models for text analysis
If you prefer to install components individually:
sudo pacman -S --needed \
sqlite \
postgresql \
python \
python-pip \
gdal \
hdf5 \
jq \
yq \
bat \
ripgrep \
fd \
fzf \
cmake \
base-devel \
gettext \
autoconf-archive \
txt2man
Note:
- cmake and base-devel are required for building Rust tools like qsv
- gettext, autoconf-archive, and txt2man are required for building mdbtools from AUR
# Requires paru (recommended AUR helper)
# mdbtools and duckdb are REQUIRED for this project
# dbeaver and quarto-cli are optional
paru -S --needed mdbtools duckdb dbeaver quarto-cli
Important: Neither mdbtools nor duckdb is in the official Arch repositories; both must be installed from AUR:
- mdbtools - Required for extracting data from .mdb database files
- duckdb - Required for fast SQL queries on CSV files (used by scripts)
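To confirm that the AUR packages actually landed on your PATH, something like the sketch below may help. The executable names are assumptions (mdb-tables is one of the binaries shipped by mdbtools; dbeaver and quarto are the usual launcher names), and the helper function is hypothetical.

```python
import shutil

# Assumed executable names; adjust if your packages install different ones.
REQUIRED = ["mdb-tables", "duckdb"]
OPTIONAL = ["dbeaver", "quarto"]


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be resolved to an executable on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED):
        print(f"MISSING (required): {name}")
    for name in missing_tools(OPTIONAL):
        print(f"missing (optional): {name}")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default is enough.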
# Create virtual environment
python -m venv .venv
# Activate it
source .venv/bin/activate.fish
# Upgrade pip
pip install --upgrade pip wheel setuptools
Core Data Science:
pip install pandas polars numpy scipy statsmodels scikit-learn
Visualization:
pip install matplotlib seaborn plotly altair
Geospatial:
pip install geopandas folium geopy shapely
Text Analysis:
pip install nltk spacy wordcloud textblob
python -m spacy download en_core_web_sm
Jupyter & Notebooks:
pip install jupyter jupyterlab ipython jupyterlab-git
Dashboards:
pip install streamlit dash panel
High Performance:
pip install dask[complete] pyarrow fastparquet
Database Tools:
pip install duckdb sqlalchemy
CLI Tools:
pip install csvkit
Utilities:
pip install python-dateutil pytz
If you have Rust/Cargo installed:
# Install each tool separately for better control
cargo install xsv --locked # CSV toolkit (simpler, stable)
cargo install qsv --locked --features feature_capable # CSV toolkit (advanced features)
cargo install polars-cli --locked # Polars CLI for DataFrames
cargo install datafusion-cli --locked # SQL query engine
If qsv fails to compile, use one of these options:
Option 1 - Use xsv instead (recommended):
cargo install xsv --locked # Similar tool, more stable
Option 2 - Install qsv from git (latest fixes):
cargo install --git https://github.com/jqnatividad/qsv qsv --features='feature_capable'
Option 3 - Wait for qsv v9.1.1: The issue is known to maintainers and will be fixed in the next release.
Note: qsv requires the feature_capable feature for the full binary. Use --features lite for a lighter version.
To install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
After installation, extract data from the MDB databases:
# Extract all tables from current database (2008-present)
./scripts/extract_all_tables.fish datasets/avall.mdb
# Optional: Extract from historical databases
./scripts/extract_all_tables.fish datasets/Pre2008.mdb
./scripts/extract_all_tables.fish datasets/PRE1982.MDB
# mdbtools
mdb-tables --version
# DuckDB
duckdb --version
# SQLite
sqlite3 --version
# Python
python --version
# Cargo (optional)
cargo --version
source .venv/bin/activate.fish
# List installed packages
pip list
# Test imports
python -c "import pandas, polars, duckdb, matplotlib; print('All packages OK')"
# List all scripts
ls -lh scripts/*.fish examples/*.py
# Test database info script
./scripts/show_database_info.fish datasets/avall.mdb
# Test extract script (dry run)
./scripts/extract_table.fish datasets/avall.mdb events
Every time you start a new shell session:
source .venv/bin/activate.fish
Optional: Add to ~/.config/fish/config.fish for auto-activation:
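# Auto-activate NTSB venv when a shell starts in the project directory
# (config.fish runs once at startup, so this does not trigger on a later cd)
if test -d ~/Code/NTSB_Datasets/.venv; and test "$PWD" = ~/Code/NTSB_Datasets
    source ~/Code/NTSB_Datasets/.venv/bin/activate.fish
end
Add to ~/.config/fish/config.fish: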
# NTSB shortcuts
abbr -a ntsb-setup 'cd ~/Code/NTSB_Datasets && source .venv/bin/activate.fish'
abbr -a ntsb-query './scripts/quick_query.fish'
abbr -a ntsb-extract './scripts/extract_table.fish'
abbr -a ntsb-search './scripts/search_data.fish'
abbr -a ntsb-jupyter 'source .venv/bin/activate.fish && jupyter lab'
# 1. Extract data
./scripts/extract_all_tables.fish datasets/avall.mdb
# 2. Run quick analysis
source .venv/bin/activate.fish
python examples/quick_analysis.py
# 3. Open Jupyter
jupyter lab
mdbtools is in AUR, not in the official Arch repositories:
# Install using paru (recommended)
paru -S mdbtools
# Or using yay
yay -S mdbtools
# Or manually from AUR
git clone https://aur.archlinux.org/mdbtools.git
cd mdbtools
makepkg -si
Note: If you don't have an AUR helper installed, install paru first:
sudo pacman -S --needed base-devel git
git clone https://aur.archlinux.org/paru.git
cd paru
makepkg -si
If the mdbtools build fails in prepare() with "possibly undefined macro" errors (full message below), the AUR PKGBUILD needs to be patched so that autoreconf can find the gettext m4 macros.
Root Cause: The PKGBUILD uses autoreconf -i -f but needs -I /usr/share/gettext/m4 to locate gettext macros.
Quick Fix - Use provided script:
./fix_mdbtools_pkgbuild.fish
Manual Fix:
# 1. Clone mdbtools from AUR
cd /tmp
git clone https://aur.archlinux.org/mdbtools.git
cd mdbtools
# 2. Edit PKGBUILD - change line in prepare() function from:
# autoreconf -i -f
# to:
# autoreconf -i -f -I /usr/share/gettext/m4
# 3. Build and install
makepkg -si
Error message looks like:
configure:21182: error: possibly undefined macro: AC_LIB_PREPARE_PREFIX
configure:21183: error: possibly undefined macro: AC_LIB_RPATH
autoreconf: error: /usr/bin/autoconf failed with exit status: 1
==> ERROR: A failure occurred in prepare().
Related Issue: mdbtools/mdbtools#370
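The manual PKGBUILD edit described above is a one-line substitution, so it can be scripted. The sketch below is a hypothetical helper, not the contents of fix_mdbtools_pkgbuild.fish; it assumes the PKGBUILD contains a bare `autoreconf -i -f` line exactly as quoted in the manual fix, and the sample text in the demo is illustrative rather than the real PKGBUILD.

```python
import re

# Directory named in the root-cause note above.
GETTEXT_M4 = "/usr/share/gettext/m4"

# Matches a line that is exactly `autoreconf -i -f` (plus indentation).
_BARE_AUTORECONF = re.compile(r"^([ \t]*autoreconf -i -f)[ \t]*$", re.MULTILINE)


def patch_pkgbuild(text):
    """Append `-I <gettext m4 dir>` to bare `autoreconf -i -f` lines.

    Lines that already carry extra arguments are left alone, so running
    the function twice gives the same result as running it once.
    """
    return _BARE_AUTORECONF.sub(rf"\1 -I {GETTEXT_M4}", text)


if __name__ == "__main__":
    # Illustrative fragment only; the real PKGBUILD will differ.
    sample = "prepare() {\n  autoreconf -i -f\n}\n"
    print(patch_pkgbuild(sample))
```

To apply it for real, read the cloned PKGBUILD, pass its text through `patch_pkgbuild`, write the result back, and then run makepkg -si as in step 3.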
duckdb is in AUR, not in the official Arch repositories:
# Install using paru (recommended)
paru -S duckdb
# Or using yay
yay -S duckdb
# Or manually from AUR
git clone https://aur.archlinux.org/duckdb.git
cd duckdb
makepkg -si
Missing Python module:
source .venv/bin/activate.fish
pip install <module_name>
Scripts not executable:
chmod +x scripts/*.fish examples/*.py
Broken virtual environment:
# Recreate venv
rm -rf .venv
python -m venv .venv
source .venv/bin/activate.fish
pip install -r requirements.txt # if you create one
GDAL dependency issues:
# Install system GDAL first
sudo pacman -S gdal
# Then install Python packages
pip install geopandas
The databases are large (1.6 GB total). CSV exports will add roughly 2-3 GB more.
# Check disk space
df -h
# Extract only needed tables
./scripts/extract_table.fish datasets/avall.mdb events
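./scripts/extract_table.fish datasets/avall.mdb aircraft
If pip downloads are slow, set the index explicitly (https://pypi.org/simple is the default; substitute a mirror closer to you if one is available):
pip install --index-url https://pypi.org/simple <package>
Or install in batches instead of all at once.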
If qsv fails to compile during setup, clean up and use alternatives:
Quick cleanup using script:
./cleanup_qsv.fish
Manual cleanup:
# 1. Uninstall qsv (if partially installed)
cargo uninstall qsv
# 2. Remove temporary build directories
rm -rf /tmp/cargo-install*
# 3. Clean qsv from cargo registry
find ~/.cargo/registry/cache -name "qsv-*.crate" -delete
find ~/.cargo/registry/src -type d -name "qsv-*" -exec rm -rf {} +
# 4. Check cache size
du -sh ~/.cargo/registry
Deep cleanup of all unused Rust dependencies (optional):
# Install cargo-cache tool
cargo install cargo-cache
# Auto-clean unused dependencies
cargo cache --autoclean
# Or for more aggressive cleanup
cargo cache --autoclean-expensive
Alternatives to qsv:
- Use xsv (already installed) - Similar functionality, more stable
- Install qsv from git (has fixes):
cargo install --git https://github.com/jqnatividad/qsv qsv --features='feature_capable'
- Wait for qsv v9.1.1 - Official fix coming soon
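The manual cleanup earlier in this section deletes cached qsv files with find. Before deleting anything, it can be useful to preview what would match. The sketch below is a hypothetical dry-run helper that only lists paths; it assumes the registry lives at ~/.cargo/registry, as in the commands above, and mirrors the same name patterns.

```python
from pathlib import Path


def find_qsv_leftovers(registry):
    """List cached qsv crate archives and unpacked qsv source directories.

    Nothing is deleted; this only reports what the find commands in the
    manual cleanup would match.
    """
    registry = Path(registry)
    matches = []
    cache_dir = registry / "cache"
    src_dir = registry / "src"
    if cache_dir.is_dir():
        matches += [p for p in cache_dir.rglob("qsv-*.crate") if p.is_file()]
    if src_dir.is_dir():
        matches += [p for p in src_dir.rglob("qsv-*") if p.is_dir()]
    return sorted(matches)


if __name__ == "__main__":
    for path in find_qsv_leftovers(Path.home() / ".cargo" / "registry"):
        print(path)
```

If the list looks right, run the find commands from the manual cleanup to remove the files.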
After successful installation:
- Read the documentation:
  - README.md - Project overview
  - QUICKSTART.md - Quick reference
  - SCRIPTS_REFERENCE.md - Complete script guide
- Extract and explore data:
  ./scripts/show_database_info.fish datasets/avall.mdb
  ./scripts/extract_all_tables.fish datasets/avall.mdb
- Run example analyses:
  source .venv/bin/activate.fish
  python examples/quick_analysis.py
  python examples/advanced_analysis.py
- Start Jupyter for interactive work:
  jupyter lab
  # Open examples/starter_notebook.ipynb
- Explore the tools:
  - Review TOOLS_AND_UTILITIES.md for advanced tools
  - Try different Fish scripts in scripts/
  - Experiment with SQL queries using quick_query.fish
- Official Docs:
- This Repository:
  - CLAUDE.md - Database schema reference
  - scripts/README.md - Script documentation
  - examples/README.md - Python examples guide
  - TOOLS_AND_UTILITIES.md - Comprehensive tool list
If you encounter issues:
- Check the troubleshooting section above
- Review error messages carefully
- Verify all prerequisites are installed
- Check file permissions (chmod +x scripts/*.fish)
- Ensure you're in the correct directory
- Verify virtual environment is activated for Python commands
Common fixes:
# Reset permissions
chmod +x scripts/*.fish examples/*.py setup.fish
# Recreate virtual environment
rm -rf .venv && python -m venv .venv && source .venv/bin/activate.fish
# Update packages
pip install --upgrade pip
pip install --upgrade pandas duckdb jupyter
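After applying any of these fixes, it is worth confirming that the virtual environment is active and that the core packages resolve. The sketch below is one way to do that with the standard library; the function names are hypothetical, and the module list simply mirrors the import test used earlier in this guide.

```python
import importlib.util
import sys

# Same modules as the "Test imports" one-liner in this guide.
CORE_MODULES = ["pandas", "polars", "duckdb", "matplotlib"]


def in_virtualenv():
    """True when Python is running from a venv rather than the system install."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not in_virtualenv():
        print("WARNING: no virtual environment active; "
              "run: source .venv/bin/activate.fish")
    missing = missing_modules(CORE_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All core modules found")
```

Because `find_spec` only locates modules, this runs quickly, but it will not catch a package that is installed yet broken at import time; the `python -c "import ..."` test covers that case.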