An end-to-end pipeline for short-term rental market analysis. Given a geographic center point (latitude/longitude) and search radius, it scrapes hundreds of Airbnb listings and their reviews, generates AI-powered summaries using GPT-4.1-mini, enriches each listing with AirDNA financial metrics via per-listing rentalizer lookups, and produces market-intelligence reports that identify what drives higher nightly rates and occupancy. The final output includes correlation analyses (e.g., "Jacuzzi prevalence is +34.3pp higher among listings that outperform their size-predicted ADR"), description quality scoring via OLS regression, an ADR prediction model, and actionable recommendations for hosts — all generated automatically from a single config.json.
Before setting up the project, ensure you have the following; both are required for a full pipeline run:
An active OpenAI account with API access and a funded balance. The pipeline uses gpt-4.1-mini by default and makes hundreds of API calls across four separate LLM use cases:
| Use Case | Description |
|---|---|
| Property summaries | Each listing's reviews → structured summary with mention counts |
| Area summary | All property summaries → area-level insights |
| Correlation insights | Statistical data + descriptions → market analysis for ADR/Occupancy |
| Description scoring | Each listing description scored on 7 quality dimensions, then synthesized |
For a typical run of ~300 listings, expect roughly $2–5 in API costs (GPT-4.1-mini at $0.40/1M input, $1.60/1M output tokens). The pipeline includes a built-in cost tracker (utils/cost_tracker.py) that logs every request to logs/cost_tracking.json so you can monitor spend in real time.
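The arithmetic behind that estimate can be sketched directly. The prices are those quoted above; the per-listing token counts below are illustrative assumptions, not measured values from the pipeline:

```python
# Back-of-the-envelope cost estimate using the per-token prices quoted above.
# The per-listing token counts are illustrative assumptions, not measurements.
INPUT_PER_M = 0.40   # $ per 1M input tokens (gpt-4.1-mini)
OUTPUT_PER_M = 1.60  # $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# ~300 listings, assuming ~20K input + ~1K output tokens per listing:
cost = estimate_cost(300 * 20_000, 300 * 1_000)  # lands inside the $2-5 range
```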
AirDNA is a third-party short-term rental analytics platform. A paid subscription is required to access rentalizer data, which provides financial metrics not available from Airbnb directly:
- ADR (Average Daily Rate) — average price per booked night
- Occupancy — percentage of available days that are booked
- Revenue — annual rental revenue
- Days Available — how many days the property is listed per year
The pipeline automatically looks up each listing discovered in the Airbnb search on AirDNA's rentalizer page — no manual comp set creation required. Without AirDNA data, the correlation analysis and description quality analysis stages will lack the financial metrics needed to function.
- AirDNA Per-Listing Lookup — Enrich each listing with ADR, Occupancy, Revenue, and Days Available via Playwright/CDP rentalizer pages
- Property Search — Find Airbnb listings within a geographic area using lat/long center + radius via `pyairbnb`
- Review Scraping — Pull all reviews for discovered listings with per-file caching
- Property Details Scraping — Scrape amenities, descriptions, house rules, and neighborhood info
- Details Fileset Build — Transform raw details + AirDNA financials into structured CSVs (amenity matrix, descriptions, house rules)
- AI Property Summaries — Generate structured summaries per property using GPT-4.1-mini: pros/cons with mention percentages, amenity analysis, and rating context vs. area average
- AI Area Summary — Roll up all property summaries into area-level trends and insights
- Data Extraction — LLM-powered parsing of summaries into structured numeric data with sentiment categories
- Correlation Analysis — Statistical comparison of top/bottom percentile tiers by ADR or Occupancy, with LLM-generated market insights
- Description Quality Analysis — OLS regression of ADR against 160+ features, LLM scoring of descriptions on 7 quality dimensions, and correlation of language quality with pricing premiums
- ADR Prediction Model — Two-stage residual XGBoost model predicting nightly rates from property features and amenities
- Prediction Web App — Flask UI for interactive ADR predictions using the trained model
- TTL-Based Caching — Pipeline cache with cascade invalidation prevents redundant scraping and API calls
- Cost Tracking — Monitor OpenAI API usage and costs per session
Requires Python 3.13.
# Clone the repository
git clone https://github.com/threnjen/AirBNB-Review-Scraper.git
cd AirBNB-Review-Scraper
# Install dependencies
pipenv install --dev
# Install Playwright browsers (required for AirDNA scraping)
pipenv run playwright install chromium
# Set OpenAI API key (or copy .env.example to .env and fill it in)
cp .env.example .env
# Then edit .env with your key, or export directly:
export OPENAI_API_KEY="your-api-key-here"

The pipeline combines two data sources and four LLM use cases to produce market intelligence:
Data Sources:
- Airbnb (via `pyairbnb`) — listing search by geographic bounding box, review text, property details, amenities, descriptions, and house rules
- AirDNA (via Playwright/CDP) — financial metrics (ADR, Occupancy, Revenue, Days Available) scraped per-listing from AirDNA's rentalizer page via a logged-in Chrome session connected over Chrome DevTools Protocol
AI Processing:
- All OpenAI calls go through `review_aggregator/openai_aggregator.py`, which handles token estimation via `tiktoken`, automatic chunking at 120K tokens with a merge step, and 3 retries with exponential backoff
- Property-level prompts produce structured summaries with mention counts in "X of Y Reviews (Z%)" format
- Area-level prompts aggregate all property summaries into a single area analysis
- Correlation prompts receive statistical comparison data and produce actionable market insights
- Description prompts use a two-phase approach: first scoring each description on 7 dimensions (1–10), then synthesizing findings with top/bottom examples
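The chunk-then-merge flow can be sketched roughly as follows. This is a simplified stand-in, not the project's actual code: the real aggregator counts tokens with `tiktoken`, while this sketch substitutes a crude characters-per-token heuristic so it has no third-party dependencies.

```python
# Sketch of the chunk-then-merge pattern: pack reviews into chunks that stay
# under a token budget; each chunk is summarized separately, then a final
# "merge" call combines the per-chunk summaries.
CHUNK_TOKEN_LIMIT = 120_000

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4-chars-per-token heuristic, not tiktoken

def chunk_reviews(reviews: list[str], limit: int = CHUNK_TOKEN_LIMIT) -> list[list[str]]:
    chunks, current, used = [], [], 0
    for review in reviews:
        cost = estimate_tokens(review)
        if current and used + cost > limit:
            chunks.append(current)       # budget exceeded: close current chunk
            current, used = [], 0
        current.append(review)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```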
Caching:
- TTL-based cache (`utils/pipeline_cache_manager.py`) checks file existence and `os.path.getmtime()` against the configured TTL — no metadata file needed
- Refreshing an early stage automatically cascades invalidation to all downstream stages
- Per-file caching for reviews and details allows incremental scraping of new listings
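The freshness check amounts to a few lines of `os.path` logic. A minimal sketch mirroring the behavior described above (not copied from `utils/pipeline_cache_manager.py`):

```python
import os
import time

TTL_DAYS = 90  # matches the pipeline_cache_ttl_days default

def is_fresh(path: str, ttl_days: int = TTL_DAYS) -> bool:
    """A file is 'fresh' if it exists and its mtime is within the TTL window."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < ttl_days * 86_400

def stage_is_fresh(expected_outputs: list[str], ttl_days: int = TTL_DAYS) -> bool:
    """A stage is skipped only when *all* of its declared outputs are fresh."""
    return bool(expected_outputs) and all(is_fresh(p, ttl_days) for p in expected_outputs)
```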
This pipeline automates data collection, but it is designed to browse no faster than a human would. Every scraping stage inserts randomized delays between requests, so the pace of data retrieval is consistent with a person manually clicking through listings. The AirDNA scraper goes further — it connects to a real Chrome browser session via Chrome DevTools Protocol, meaning it appears on the network as a normal logged-in user navigating page by page.
Key principles:
- Human-speed pacing — Randomized pauses between every request ensure automated browsing is no faster than manual browsing. There is no parallel request fan-out; listings are visited one at a time.
- Caching prevents redundant requests — TTL-based caching means previously scraped listings are never re-fetched unless their cache expires. A second run against the same search area hits zero external endpoints if all data is still fresh.
- Backoff on rate-limit signals — If the pipeline detects signs of rate limiting (e.g., AirDNA returning empty results), it pauses for an extended cooldown period before retrying, rather than retrying immediately.
- No API abuse — OpenAI calls use exponential backoff with retry limits. Token usage and costs are tracked per-session so users can monitor spend.
This project is intended for personal market research. The scraping approach is deliberately conservative — the automation saves time by running unattended, not by going faster.
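The pacing principle can be sketched as follows; the delay bounds here are illustrative, not the project's actual values:

```python
import random
import time

def human_pause(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep a randomized, human-scale interval between requests."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Listings are visited strictly one at a time, with a pause between each:
# for listing_id in listing_ids:
#     scrape(listing_id)
#     human_pause()
```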
The reports/ directory contains example analytical output from a full pipeline run on the Mount Hood, Oregon area (843 properties analyzed):
Area Summary — reports/area_summary_mt_hood.md
Aggregated area-level insights from all property summaries. Identifies a diverse selection of accommodations — cabins, lodges, mountain homes, condos, chalets, yurts, tiny homes, treehouses, and glamping tents — set in forested, riverfront, and mountain settings. Top positives: hot tubs, location, cleanliness, host communication, well-stocked kitchens. Top issues: maintenance inconsistencies, hot tub problems, noise/privacy concerns, heating limitations, Wi-Fi reliability.
ADR Correlation Analysis — reports/correlation_insights_adr_mt_hood.md
Identifies what drives higher nightly rates after adjusting for property size. Uses XGBoost regression on capacity, bedrooms, beds, and bathrooms (R² = 0.747) to compute size-adjusted residuals, then compares the top 25% (residual +$42.98, n=207) against the bottom 25% (residual −$53.10, n=207). Key finding: luxury amenities and experiential features — not additional space per guest — drive premium pricing.
| Feature | High Tier | Low Tier | Difference (pp) |
|---|---|---|---|
| Jacuzzi | 67.6% | 33.3% | +34.3 |
| Ocean View | 38.6% | 15.0% | +23.7 |
| Firepit | 59.4% | 44.9% | +14.5 |
| Grill | 82.6% | 69.1% | +13.5 |
| Dishwasher | 86.5% | 74.4% | +12.1 |
The pipeline also generates Occupancy Correlation Analysis and Description Quality Analysis reports (see stages 8 and 9 in Pipeline Flow), which are not included as checked-in examples.
Edit config.json to configure the pipeline. All pipeline behavior is controlled here — there is no CLI argument interface.
| Key | Type | Description |
|---|---|---|
| `search_results` | bool | Search for Airbnb listings by lat/long bounding box |
| `details_scrape` | bool | Scrape property details (amenities, rules) |
| `airdna_data` | bool | Scrape AirDNA per-listing metrics |
| `reviews_scrape` | bool | Scrape reviews for discovered listings |
| `details_results` | bool | Transform scraped details + AirDNA financials into structured datasets |
| `listing_summaries` | bool | Generate AI summaries for each property |
| `area_summary` | bool | Generate area-level summary + extract structured data from summaries |
| `correlation_results` | bool | Run correlation analysis of amenities/capacity vs. ADR and Occupancy |
| `description_analysis` | bool | Run description quality scoring and regression analysis |
| `machine_learning_model` | bool | Train two-stage XGBoost ADR prediction model |
Stage dependencies: Stages run in order and depend on upstream outputs. Stages 1–5 produce the raw data; stages 6–9 consume it. Stage 10 (ML model) depends on stage 5. For example, listing_summaries (6) requires search_results (1) and reviews_scrape (4); correlation_results (8) and description_analysis (9) require details_results (5) and airdna_data (3). If you enable a downstream stage without having run its upstream stages first, the pipeline will fail or produce empty results.
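Those dependency rules can be expressed as a small validation check. This is a hypothetical sketch covering only the examples named above — the pipeline does not necessarily expose a map like this, and cached upstream outputs would also satisfy a dependency:

```python
# Hypothetical dependency map built from the examples stated above.
STAGE_DEPS = {
    "listing_summaries": ["search_results", "reviews_scrape"],
    "correlation_results": ["details_results", "airdna_data"],
    "description_analysis": ["details_results", "airdna_data"],
    "machine_learning_model": ["details_results"],
}

def missing_upstream(config: dict) -> dict:
    """Return enabled stages whose upstream stages are not also enabled."""
    problems = {}
    for stage, deps in STAGE_DEPS.items():
        if config.get(stage):
            missing = [d for d in deps if not config.get(d)]
            if missing:
                problems[stage] = missing
    return problems
```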
AirDNA data required for stages 8–9: The correlation_results and description_analysis stages silently filter out properties without AirDNA financial data. If you skip the airdna_data stage, these analysis stages will have no properties to analyze. Run airdna_data first to populate AirDNA metrics.
Entire-home filter: The details_results stage only includes listings with room type "Entire home/apt". Shared rooms, private rooms, and hotel rooms are silently excluded. All downstream analysis operates on entire-home listings only.
Minimum property requirements: The correlation analyzer requires at least 4 properties with valid metric values per tier. The description analyzer's OLS regression requires more properties than features (~161+ for the full amenity set). Small search areas with few listings may produce empty or degraded results.
| Key | Type | Description |
|---|---|---|
| `search_zone_name` | string | Required. A label for this search area (e.g., "mt_hood"). Used in output filenames. Alphanumeric, underscores, and hyphens only. |
| `start_lat` | float | Required. Center latitude of the search area (e.g., 45.347161) |
| `start_long` | float | Required. Center longitude of the search area (e.g., -121.9646838) |
| `search_radius_miles` | float | Required. Search radius in miles from the center point (e.g., 15) |
| `poi_lat` | float | Point-of-interest latitude, used for distance calculations in the ML model |
| `poi_long` | float | Point-of-interest longitude |
| `iso_code` | string | Country code (e.g., "us") |
| `num_listings_to_search` | int | Max listings to find in search |
| `num_listings_to_summarize` | int | Max listings to process with AI |
| `review_thresh_to_include_prop` | int | Minimum reviews required to process a listing |
| `num_summary_to_process` | int | Max property summaries to process in downstream stages |
| `dataset_use_categoricals` | bool | Use categorical encoding for amenity features in analysis |
Tip: To find coordinates for your area, right-click any location in Google Maps and select the lat/long values.
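The radius-to-bounding-box conversion (handled by `scraper/location_calculator.py`) can be approximated in a few lines. This sketch uses a flat-earth approximation and may differ from the project's actual implementation:

```python
import math

MILES_PER_DEG_LAT = 69.0  # roughly constant across latitudes

def bounding_box(lat: float, lon: float, radius_miles: float) -> tuple:
    """Approximate (south, west, north, east) box around a center point."""
    dlat = radius_miles / MILES_PER_DEG_LAT
    # a degree of longitude shrinks with the cosine of the latitude
    dlon = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

# e.g. bounding_box(45.347161, -121.9646838, 15) for the Mt. Hood example
```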
| Key | Type | Description |
|---|---|---|
| `min_days_available` | int | Minimum Days Available to include a listing in the cleaned amenities matrix used by description analysis (default: 100). The raw matrix used by correlation analysis is unfiltered. |
| `airdna_cdp_url` | string | Chrome DevTools Protocol URL (default: http://localhost:9222) |
| `airdna_inspect_mode` | bool | Pause after navigation for DOM selector discovery |
| Key | Type | Description |
|---|---|---|
| `correlation_metrics` | array | Metrics to analyze (e.g., ["adr", "occupancy"]) |
| `correlation_top_percentile` | int | Top percentile tier threshold (e.g., 25 = top 25%) |
| `correlation_bottom_percentile` | int | Bottom percentile tier threshold (e.g., 25 = bottom 25%) |
| Key | Type | Description |
|---|---|---|
| `openai.model` | string | Model to use (default: "gpt-4.1-mini") |
| `openai.temperature` | float | Response randomness (0.0–1.0) |
| `openai.max_tokens` | int | Max tokens per response |
| `openai.chunk_token_limit` | int | Token limit per chunk sent to the API |
| `openai.enable_cost_tracking` | bool | Log API costs to logs/cost_tracking.json |
The pipeline includes a TTL-based cache that prevents redundant scraping and processing. When enabled, each stage's outputs are tracked with timestamps. If all outputs for a stage are still within the TTL window, the stage is skipped entirely on the next run. For per-file stages (reviews, details), individual listings are skipped when their cached files are fresh.
| Key | Type | Default | Description |
|---|---|---|---|
| `pipeline_cache_enabled` | bool | `true` | Enable/disable pipeline-level TTL caching |
| `pipeline_cache_ttl_days` | int | `90` | Number of days before cached outputs expire |
| `force_refresh_search_results` | bool | `false` | Force re-run area search |
| `force_refresh_details_scrape` | bool | `false` | Force re-scrape all property details |
| `force_refresh_details_results` | bool | `false` | Force rebuild details fileset |
| `force_refresh_reviews_scrape` | bool | `false` | Force re-scrape all reviews |
| `force_refresh_airdna_data` | bool | `false` | Force re-run AirDNA scraping even if cached |
| `force_refresh_listing_summaries` | bool | `false` | Force regenerate property summaries |
| `force_refresh_area_summary` | bool | `false` | Force regenerate area summary + data extraction |
| `force_refresh_correlation_results` | bool | `false` | Force re-run correlation analysis |
| `force_refresh_description_analysis` | bool | `false` | Force re-run description quality analysis |
| `force_refresh_machine_learning_model` | bool | `false` | Force retrain ADR prediction model |
How it works:
- Freshness is determined by file existence and `os.path.getmtime()` — each stage declares its expected output files, and a stage is fresh when all files exist with mtime within the TTL
- On each run, the pipeline checks whether outputs exist and are within the TTL before executing a stage
- Refreshing an early stage cascades invalidation to all downstream stages
- The `force_refresh_*` flags let you bypass the cache for specific stages
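Because the stages run in a fixed order, cascade invalidation reduces to "the refreshed stage plus everything after it." A sketch of that idea (the real implementation in `utils/pipeline_cache_manager.py` may differ):

```python
# Stage names follow the config.json toggles, in pipeline execution order.
STAGE_ORDER = [
    "search_results", "details_scrape", "airdna_data", "reviews_scrape",
    "details_results", "listing_summaries", "area_summary",
    "correlation_results", "description_analysis", "machine_learning_model",
]

def stages_to_invalidate(refreshed_stage: str) -> list[str]:
    """Refreshing a stage invalidates it and every downstream stage."""
    idx = STAGE_ORDER.index(refreshed_stage)
    return STAGE_ORDER[idx:]
```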
Example: Run the full pipeline, then re-run immediately — all stages will be skipped:
pipenv run python main.py # First run: executes all enabled stages
pipenv run python main.py # Second run: skips cached stages

To force a single stage to re-run:
{"force_refresh_reviews_scrape": true}

Enrich each discovered listing with financial metrics (ADR, Occupancy, Revenue, Days Available) from AirDNA's rentalizer page. The pipeline automatically feeds listing IDs from the Airbnb search into AirDNA — no manual comp set creation needed.
Setup:
- Quit Chrome completely first (Cmd+Q on macOS) — the debug port won't open if Chrome is already running.
- Launch Chrome with remote debugging:
```shell
# macOS:
make chrome-debug
# Or manually:
open -a "Google Chrome" --args --remote-debugging-port=9222

# Linux:
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug-profile &

# Windows (PowerShell):
& "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="$env:TEMP\chrome-debug-profile"
```
- In the Chrome window that opens, navigate to AirDNA and log in with your account
Run:
# Set config.json: "airdna_data": true
pipenv run python main.py
# Or:
make scrape-airdna

The scraper visits https://app.airdna.co/data/rentalizer?&listing_id=abnb_{id} for each listing and extracts header metrics (Bedrooms, Bathrooms, Max Guests, Rating, Review Count) and KPI cards (Revenue, Days Available, Annual Revenue, Occupancy, ADR). All listings are saved regardless of Days Available; filtering by min_days_available (default: 100) is applied later when the cleaned amenities matrix is built in the details_results stage.
Output: listing_{id}.json — one file per listing in outputs/03_airdna_data/:
{
"1050769200886027711": {"ADR": 487.5, "Occupancy": 32, "Revenue": 51700.0, "Days_Available": 333, "Bedrooms": 4, "Bathrooms": 3, "Max_Guests": 15, "Rating": 4.7, "Review_Count": 287, "LY_Revenue": 0.0}
}

Note: `LY_Revenue` (Last Year Revenue) is a placeholder field — it is always `0.0` in the current implementation and can be ignored.
Inspect mode: If selectors break (AirDNA UI changes), enable "airdna_inspect_mode": true to pause the browser and use Playwright Inspector to discover new selectors.
# 1. Configure your search area in config.json:
# "search_zone_name": "my_area",
# "start_lat": 45.347161,
# "start_long": -121.9646838,
# "search_radius_miles": 15
# 2. Scrape listings and reviews
# Set config.json: "search_results": true, "reviews_scrape": true
pipenv run python main.py
# 3. Generate property summaries
# Set config.json: "listing_summaries": true
pipenv run python main.py
# 4. Generate area summary
# Set config.json: "area_summary": true
pipenv run python main.py

Enable all stages in config.json:
{
"search_zone_name": "mt_hood",
"start_lat": 45.347161,
"start_long": -121.9646838,
"search_radius_miles": 15,
"search_results": true,
"details_scrape": true,
"airdna_data": true,
"reviews_scrape": true,
"details_results": true,
"listing_summaries": true,
"area_summary": true,
"correlation_results": true,
"description_analysis": true,
"machine_learning_model": true
}

Then run:
pipenv run python main.py

Lat/Long + config.json
↓
┌───────────────────────────────────────┐
│ 1. Search Results │
│ pyairbnb.search_all() by lat/long│
│ → outputs/01_search_results/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 2. Details Scrape │
│ pyairbnb.get_details() per listing│
│ → outputs/02_details_scrape/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 3. Comp Sets (AirDNA) │
│ Playwright/CDP → Chrome → AirDNA │
│ → outputs/03_airdna_data/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 4. Reviews Scrape │
│ pyairbnb.get_reviews() per listing│
│ → outputs/04_reviews_scrape/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 5. Details Results │
│ Raw details + AirDNA financials │
│ → amenity matrix, descriptions │
│ → outputs/07_details_results/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 6. Listing Summaries (GPT) │
│ Reviews → structured summaries │
│ → outputs/05_listing_summaries/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 7. Area Summary (GPT) │
│ Summaries → area insights │
│ → reports/area_summary_*.md │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 8. Correlation Results (GPT) │
│ Top/bottom percentile comparison │
│ → outputs/08_correlation_results/ │
│ → reports/correlation_insights_* │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 9. Description Analysis │
│ OLS regression + GPT scoring │
│ → outputs/09_description_analysis/│
│ → reports/description_quality_* │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 10. ML Model (XGBoost) │
│ Two-stage residual ADR predictor │
│ → ml/model/residual/ │
└───────────────────────────────────────┘
| Directory | Content |
|---|---|
| `outputs/01_search_results/` | Search results by geographic area |
| `outputs/02_details_scrape/` | Property details (amenities, rules, descriptions) |
| `outputs/03_airdna_data/` | AirDNA per-listing metrics (ADR, Occupancy, Revenue, Days Available) + master comp set |
| `outputs/04_reviews_scrape/` | Raw review JSON per listing |
| `outputs/07_details_results/` | Structured CSVs and JSON: amenity matrix, house rules, descriptions, neighborhood highlights |
| `outputs/05_listing_summaries/` | AI-generated summary per property |
| `outputs/08_correlation_results/` | Correlation statistics (JSON) for each metric |
| `outputs/09_description_analysis/` | Description quality statistics (JSON) |
| `reports/` | Markdown and JSON reports: area summaries, correlation insights, description quality analysis |
| `ml/model/residual/` | Trained XGBoost model artifacts and metrics |
| `logs/cost_tracking.json` | OpenAI API cost logs per session |
main.py # Entry point — config-driven pipeline orchestrator
├── steps/
│ ├── __init__.py # Shared helper (load_search_results)
│ ├── 01_search_results.py # Listing discovery by lat/long bounding box
│ ├── 02_details_scrape.py # Scrape property details
│ ├── 03_airdna_data.py # AirDNA per-listing lookup + master comp set
│ ├── 04_reviews_scrape.py # Scrape reviews per listing
│ ├── 07_details_results.py # Transform details + AirDNA → structured data
│ ├── 05_listing_summaries.py # Per-property AI summaries
│ ├── 06_area_summary.py # Area-level AI summary
│ ├── 08_correlation_results.py # Percentile-based metric correlation
│ ├── 09_description_analysis.py # OLS regression + description scoring
│ └── 10_ml_model.py # Two-stage XGBoost ADR prediction model
├── scraper/
│ ├── airbnb_searcher.py # Lat/long center → bounding box → listing search
│ ├── airdna_scraper.py # AirDNA per-listing rentalizer scraper (Playwright/CDP)
│ ├── reviews_scraper.py # Fetch reviews per listing
│ ├── details_scraper.py # Fetch property details
│ ├── details_fileset_build.py # Transform to structured data + merge AirDNA financials
│ └── location_calculator.py # Lat/long + radius → bounding box
├── review_aggregator/
│ ├── property_review_aggregator.py # Per-property AI summaries
│ ├── area_review_aggregator.py # Area-level AI summaries
│ ├── openai_aggregator.py # OpenAI client with chunking, retry, cost tracking
│ ├── correlation_analyzer.py # Percentile-based metric correlation analysis
│ └── description_analyzer.py # OLS regression + LLM description quality scoring
├── ml/
│ ├── train_2_level.py # Two-stage residual XGBoost training pipeline
│ └── ml_utils.py # Shared ML utilities (train, save, refit)
├── app/
│ └── app.py # Flask web app — ADR prediction UI
├── utils/
│ ├── cost_tracker.py # OpenAI API cost tracking (per-session)
│ ├── pipeline_cache_manager.py # TTL-based caching with cascade invalidation
│ ├── local_file_handler.py # File system utilities
│ └── tiny_file_handler.py # JSON I/O + config validation
└── prompts/
├── prompt.json # Property-level prompt template
├── zone_prompt.json # Area-level prompt template
├── correlation_prompt.json # Correlation analysis prompts (ADR + Occupancy)
└── description_analysis_prompt.json # Description scoring + synthesis prompts
# Run full test suite with coverage (75% minimum required)
make test
# Or directly:
pipenv run pytest
# Fast mode — fail on first error, no coverage:
make test-fast
# Generate HTML coverage report:
make coverage
# → opens coverage_html/index.html

| Target | Description |
|---|---|
| `make setup` | Install dependencies and Playwright browsers |
| `make test` | Run pytest with coverage |
| `make test-fast` | Run pytest, fail-fast, no coverage |
| `make coverage` | Generate HTML coverage report |
| `make chrome-debug` | Launch Chrome with remote debugging port for AirDNA scraping |
| `make scrape-airdna` | Run AirDNA scraper standalone |
| `make train` | Train the ADR prediction model |
| `make run-app` | Start the ADR prediction web app |
| Notebook | Purpose |
|---|---|
| `model_notebook.ipynb` | Modeling experiments on listing features and pricing relationships |
The project includes a Flask web app that serves interactive ADR (Average Daily Rate) predictions using the trained two-stage XGBoost model.
Prerequisites: Train the model first by running pipeline step 10 ("machine_learning_model": true) or directly via make train. This produces model artifacts in ml/model/residual/.
Run:
make run-app
# Or manually:
pipenv run flask --app app/app run --debug

The app provides a form where you enter property features (bedrooms, bathrooms, capacity, amenity categories) and returns a predicted ADR with error margin. See app/RUNNING.md for production deployment options.
| Problem | Cause | Solution |
|---|---|---|
| AirDNA scraper can't connect | Chrome not running with `--remote-debugging-port`, or regular Chrome already open | Quit Chrome fully (Cmd+Q), then `make chrome-debug` |
| AirDNA returns empty results | Rate limiting or session expired | Wait 3 minutes and retry; re-login to AirDNA in the debug Chrome window |
| Correlation/description analysis produces empty results | Missing AirDNA data or too few properties | Ensure `airdna_data` stage ran first; check that the search area has enough entire-home listings (4+ for correlation, 161+ for full OLS regression) |
| OpenAI API errors | Insufficient credits or rate limits | Check your OpenAI balance; the pipeline retries 3x with exponential backoff automatically |
| Search returns fewer listings than expected | `pyairbnb` caps at ~280 listings per geographic bounding box | This is a library limitation; the pipeline subdivides the search area into a 2×2 grid automatically, but very dense areas may still hit caps |
| `make chrome-debug` hangs | Chrome debug port already in use | Quit all Chrome processes and retry |
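The retry-with-exponential-backoff behavior referenced above can be sketched generically; the delay values here are illustrative, not the project's exact numbers:

```python
import random
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying up to `retries` times with exponential delay + jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error
            # delays grow as base, 2*base, 4*base, ... plus a little jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2))
```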
The pipeline scrapes publicly visible Airbnb review text, which may include reviewer names and personally identifiable information. All scraped data is stored locally in the outputs/ directory. Users are responsible for handling this data in compliance with applicable privacy regulations and should avoid sharing raw review data publicly.
Chrome DevTools Protocol (CDP) port exposure: The AirDNA scraper requires Chrome to be launched with --remote-debugging-port=9222. This opens a CDP endpoint on localhost:9222 that allows any local process to attach to the browser and control it — including accessing the authenticated AirDNA session, reading cookies, and navigating to arbitrary URLs. On a single-user development machine this is low risk, but on shared servers, multi-user environments, or CI runners it can be exploited by other users or processes on the same host. Mitigations:
- Only launch Chrome with `--remote-debugging-port` when actively running the AirDNA scraper, and close it immediately afterward.
- Do not expose port 9222 to the network (the default `localhost` binding is not reachable externally, but verify your firewall configuration).
- Avoid running the AirDNA scraper on shared or multi-tenant machines.
This tool scrapes data from Airbnb and AirDNA. Use of web scraping may be subject to the terms of service of these platforms. This project is intended for personal market research and analysis. Users are responsible for ensuring their use complies with applicable terms of service and laws.
Issues and pull requests are welcome. Please ensure all tests pass (make test) before submitting.
MIT