AirBNB Review Scraper & Analyzer

An end-to-end pipeline for short-term rental market analysis. Given a geographic center point (latitude/longitude) and search radius, it scrapes hundreds of Airbnb listings and their reviews, generates AI-powered summaries using GPT-4.1-mini, enriches each listing with AirDNA financial metrics via per-listing rentalizer lookups, and produces market-intelligence reports that identify what drives higher nightly rates and occupancy. The final output includes correlation analyses (e.g., "Jacuzzi prevalence is +34.3pp higher among listings that outperform their size-predicted ADR"), description quality scoring via OLS regression, an ADR prediction model, and actionable recommendations for hosts — all generated automatically from a single config.json.

Prerequisites

Before setting up the project, ensure you have the following. These are required for a full pipeline run:

1. OpenAI API Key and Credits

An active OpenAI account with API access and a funded balance. The pipeline uses gpt-4.1-mini by default and makes hundreds of API calls across four separate LLM use cases:

| Use Case | Description |
| --- | --- |
| Property summaries | Each listing's reviews → structured summary with mention counts |
| Area summary | All property summaries → area-level insights |
| Correlation insights | Statistical data + descriptions → market analysis for ADR/Occupancy |
| Description scoring | Each listing description scored on 7 quality dimensions, then synthesized |

For a typical run of ~300 listings, expect roughly $2–5 in API costs (GPT-4.1-mini at $0.40/1M input, $1.60/1M output tokens). The pipeline includes a built-in cost tracker (utils/cost_tracker.py) that logs every request to logs/cost_tracking.json so you can monitor spend in real time.
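As a rough sanity check before a run, the cost math can be sketched directly from the quoted rates. The per-listing token volumes below are illustrative assumptions, not measured values:

```python
# Back-of-envelope OpenAI cost estimate for a pipeline run.
# Rates are the gpt-4.1-mini list prices quoted above; the token
# volumes per listing are illustrative assumptions only.

INPUT_RATE = 0.40 / 1_000_000   # USD per input token
OUTPUT_RATE = 1.60 / 1_000_000  # USD per output token

def estimate_cost(num_listings: int,
                  input_tokens_per_listing: int = 15_000,
                  output_tokens_per_listing: int = 1_500) -> float:
    """Estimate total USD spend across all per-listing LLM calls."""
    input_cost = num_listings * input_tokens_per_listing * INPUT_RATE
    output_cost = num_listings * output_tokens_per_listing * OUTPUT_RATE
    return input_cost + output_cost

print(f"~${estimate_cost(300):.2f} for 300 listings")
```

With these assumed volumes a 300-listing run lands inside the $2–5 range quoted above; the cost tracker's log is the authoritative number.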

2. Paid AirDNA Subscription

AirDNA is a third-party short-term rental analytics platform. A paid subscription is required to access rentalizer data, which provides financial metrics not available from Airbnb directly:

  • ADR (Average Daily Rate) — average price per booked night
  • Occupancy — percentage of available days that are booked
  • Revenue — annual rental revenue
  • Days Available — how many days the property is listed per year

The pipeline automatically looks up each listing discovered in the Airbnb search on AirDNA's rentalizer page — no manual comp set creation required. Without AirDNA data, the correlation analysis and description quality analysis stages will lack the financial metrics needed to function.

Features

  • AirDNA Per-Listing Lookup — Enrich each listing with ADR, Occupancy, Revenue, and Days Available via Playwright/CDP rentalizer pages
  • Property Search — Find Airbnb listings within a geographic area using lat/long center + radius via pyairbnb
  • Review Scraping — Pull all reviews for discovered listings with per-file caching
  • Property Details Scraping — Scrape amenities, descriptions, house rules, and neighborhood info
  • Details Fileset Build — Transform raw details + AirDNA financials into structured CSVs (amenity matrix, descriptions, house rules)
  • AI Property Summaries — Generate structured summaries per property using GPT-4.1-mini: pros/cons with mention percentages, amenity analysis, and rating context vs. area average
  • AI Area Summary — Roll up all property summaries into area-level trends and insights
  • Data Extraction — LLM-powered parsing of summaries into structured numeric data with sentiment categories
  • Correlation Analysis — Statistical comparison of top/bottom percentile tiers by ADR or Occupancy, with LLM-generated market insights
  • Description Quality Analysis — OLS regression of ADR against 160+ features, LLM scoring of descriptions on 7 quality dimensions, and correlation of language quality with pricing premiums
  • ADR Prediction Model — Two-stage residual XGBoost model predicting nightly rates from property features and amenities
  • Prediction Web App — Flask UI for interactive ADR predictions using the trained model
  • TTL-Based Caching — Pipeline cache with cascade invalidation prevents redundant scraping and API calls
  • Cost Tracking — Monitor OpenAI API usage and costs per session

Installation

Requires Python 3.13.

# Clone the repository
git clone https://github.com/threnjen/AirBNB-Review-Scraper.git
cd AirBNB-Review-Scraper

# Install dependencies
pipenv install --dev

# Install Playwright browsers (required for AirDNA scraping)
pipenv run playwright install chromium

# Set OpenAI API key (or copy .env.example to .env and fill it in)
cp .env.example .env
# Then edit .env with your key, or export directly:
export OPENAI_API_KEY="your-api-key-here"

How It Works

The pipeline combines two data sources and four LLM use cases to produce market intelligence:

Data Sources:

  • Airbnb (via pyairbnb) — listing search by geographic bounding box, review text, property details, amenities, descriptions, and house rules
  • AirDNA (via Playwright/CDP) — financial metrics (ADR, Occupancy, Revenue, Days Available) scraped per-listing from AirDNA's rentalizer page via a logged-in Chrome session connected over Chrome DevTools Protocol

AI Processing:

  • All OpenAI calls go through review_aggregator/openai_aggregator.py, which handles token estimation via tiktoken, automatic chunking at 120K tokens with a merge step, and 3 retries with exponential backoff
  • Property-level prompts produce structured summaries with mention counts in "X of Y Reviews (Z%)" format
  • Area-level prompts aggregate all property summaries into a single area analysis
  • Correlation prompts receive statistical comparison data and produce actionable market insights
  • Description prompts use a two-phase approach: first scoring each description on 7 dimensions (1–10), then synthesizing findings with top/bottom examples
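The retry behavior can be sketched as follows. The function name and jitter scheme here are illustrative, not the actual openai_aggregator.py interface:

```python
import random
import time

def call_with_backoff(request_fn, max_retries: int = 3,
                      base_delay: float = 2.0):
    """Call request_fn, retrying up to max_retries times on failure
    with exponential backoff (2s, 4s, 8s, ...) plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            # delay doubles each attempt, scaled by random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The jitter spreads retries out so that many chunked requests failing at once do not all retry in lockstep.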

Caching:

  • TTL-based cache (utils/pipeline_cache_manager.py) checks file existence and os.path.getmtime() against the configured TTL — no metadata file needed
  • Refreshing an early stage automatically cascades invalidation to all downstream stages
  • Per-file caching for reviews and details allows incremental scraping of new listings

Respectful Scraping

This pipeline automates data collection, but it is designed to browse no faster than a human would. Every scraping stage inserts randomized delays between requests, so the pace of data retrieval is consistent with a person manually clicking through listings. The AirDNA scraper goes further — it connects to a real Chrome browser session via Chrome DevTools Protocol, meaning it appears on the network as a normal logged-in user navigating page by page.

Key principles:

  • Human-speed pacing — Randomized pauses between every request ensure automated browsing is no faster than manual browsing. There is no parallel request fan-out; listings are visited one at a time.
  • Caching prevents redundant requests — TTL-based caching means previously scraped listings are never re-fetched unless their cache expires. A second run against the same search area hits zero external endpoints if all data is still fresh.
  • Backoff on rate-limit signals — If the pipeline detects signs of rate limiting (e.g., AirDNA returning empty results), it pauses for an extended cooldown period before retrying, rather than retrying immediately.
  • No API abuse — OpenAI calls use exponential backoff with retry limits. Token usage and costs are tracked per-session so users can monitor spend.
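The pacing principle amounts to a few lines. The delay bounds here are illustrative assumptions, not the pipeline's actual values:

```python
import random
import time

def polite_fetch(urls, fetch_fn, min_delay: float = 3.0,
                 max_delay: float = 8.0):
    """Visit URLs strictly one at a time, sleeping a randomized
    human-speed interval before each request (no parallel fan-out)."""
    results = {}
    for url in urls:
        time.sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch_fn(url)
    return results
```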

This project is intended for personal market research. The scraping approach is deliberately conservative — the automation saves time by running unattended, not by going faster.

Example Output

The reports/ directory contains example analytical output from a full pipeline run on the Mount Hood, Oregon area (843 properties analyzed):

Area Summary — reports/area_summary_*.md

Aggregated area-level insights from all property summaries. Identifies a diverse selection of accommodations — cabins, lodges, mountain homes, condos, chalets, yurts, tiny homes, treehouses, and glamping tents — set in forested, riverfront, and mountain settings. Top positives: hot tubs, location, cleanliness, host communication, well-stocked kitchens. Top issues: maintenance inconsistencies, hot tub problems, noise/privacy concerns, heating limitations, Wi-Fi reliability.

ADR Correlation Analysis — reports/correlation_insights_adr_mt_hood.md

Identifies what drives higher nightly rates after adjusting for property size. Uses XGBoost regression on capacity, bedrooms, beds, and bathrooms (R² = 0.747) to compute size-adjusted residuals, then compares the top 25% (residual +$42.98, n=207) against the bottom 25% (residual −$53.10, n=207). Key finding: luxury amenities and experiential features — not additional space per guest — drive premium pricing.

| Feature | High Tier | Low Tier | Difference (pp) |
| --- | --- | --- | --- |
| Jacuzzi | 67.6% | 33.3% | +34.3 |
| Ocean View | 38.6% | 15.0% | +23.7 |
| Firepit | 59.4% | 44.9% | +14.5 |
| Grill | 82.6% | 69.1% | +13.5 |
| Dishwasher | 86.5% | 74.4% | +12.1 |
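The size-adjusted tiering behind this kind of table can be sketched with a plain linear fit standing in for the XGBoost regressor (synthetic data, illustrative only — the real pipeline uses the trained model's residuals):

```python
import numpy as np

def amenity_lift_pp(size_features, adr, has_amenity, pct: int = 25):
    """Regress ADR on size features (linear stand-in for the XGBoost
    fit), take residuals as size-adjusted over/under-pricing, and
    return the amenity prevalence gap in percentage points between
    the top-pct and bottom-pct residual tiers."""
    X = np.column_stack([np.ones(len(adr)), size_features])
    coef, *_ = np.linalg.lstsq(X, adr, rcond=None)
    residuals = adr - X @ coef  # positive = priced above size-predicted ADR
    lo, hi = np.percentile(residuals, [pct, 100 - pct])
    top = has_amenity[residuals >= hi].mean()
    bottom = has_amenity[residuals <= lo].mean()
    return 100.0 * (top - bottom)
```

On synthetic data where an amenity adds a flat nightly premium, the function recovers a large positive lift, mirroring the Jacuzzi row above.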

The pipeline also generates Occupancy Correlation Analysis and Description Quality Analysis reports (see stages 8 and 9 in Pipeline Flow), which are not included as checked-in examples.

Configuration

Edit config.json to configure the pipeline. All pipeline behavior is controlled here — there is no CLI argument interface.

Pipeline Steps

| Key | Type | Description |
| --- | --- | --- |
| search_results | bool | Search for Airbnb listings by lat/long bounding box |
| details_scrape | bool | Scrape property details (amenities, rules) |
| airdna_data | bool | Scrape AirDNA per-listing metrics |
| reviews_scrape | bool | Scrape reviews for discovered listings |
| details_results | bool | Transform scraped details + AirDNA financials into structured datasets |
| listing_summaries | bool | Generate AI summaries for each property |
| area_summary | bool | Generate area-level summary + extract structured data from summaries |
| correlation_results | bool | Run correlation analysis of amenities/capacity vs. ADR and Occupancy |
| description_analysis | bool | Run description quality scoring and regression analysis |
| machine_learning_model | bool | Train two-stage XGBoost ADR prediction model |

Stage dependencies: Stages run in order and depend on upstream outputs. Stages 1–5 produce the raw data; stages 6–9 consume it. Stage 10 (ML model) depends on stage 5. For example, listing_summaries (6) requires search_results (1) and reviews_scrape (4); correlation_results (8) and description_analysis (9) require details_results (5) and airdna_data (3). If you enable a downstream stage without having run its upstream stages first, the pipeline will fail or produce empty results.

AirDNA data required for stages 8–9: The correlation_results and description_analysis stages silently filter out properties without AirDNA financial data. If you skip the airdna_data stage, these analysis stages will have no properties to analyze. Run airdna_data first to populate AirDNA metrics.

Entire-home filter: The details_results stage only includes listings with room type "Entire home/apt". Shared rooms, private rooms, and hotel rooms are silently excluded. All downstream analysis operates on entire-home listings only.

Minimum property requirements: The correlation analyzer requires at least 4 properties with valid metric values per tier. The description analyzer's OLS regression requires more properties than features (~161+ for the full amenity set). Small search areas with few listings may produce empty or degraded results.

Search Parameters

| Key | Type | Description |
| --- | --- | --- |
| search_zone_name | string | Required. A label for this search area (e.g., "mt_hood"). Used in output filenames. Alphanumeric, underscores, and hyphens only. |
| start_lat | float | Required. Center latitude of the search area (e.g., 45.347161) |
| start_long | float | Required. Center longitude of the search area (e.g., -121.9646838) |
| search_radius_miles | float | Required. Search radius in miles from the center point (e.g., 15) |
| poi_lat | float | Point-of-interest latitude, used for distance calculations in the ML model |
| poi_long | float | Point-of-interest longitude |
| iso_code | string | Country code (e.g., "us") |
| num_listings_to_search | int | Max listings to find in search |
| num_listings_to_summarize | int | Max listings to process with AI |
| review_thresh_to_include_prop | int | Minimum reviews required to process a listing |
| num_summary_to_process | int | Max property summaries to process in downstream stages |
| dataset_use_categoricals | bool | Use categorical encoding for amenity features in analysis |

Tip: To find coordinates for your area, right-click any location in Google Maps and select the lat/long values.
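For illustration, here is a plausible sketch of the center-plus-radius conversion (the actual scraper/location_calculator.py may differ in its details):

```python
import math

def bounding_box(lat: float, lon: float, radius_miles: float):
    """Convert a center point plus radius into a lat/long bounding
    box. One degree of latitude spans ~69 miles; longitude degrees
    shrink by the cosine of the latitude."""
    dlat = radius_miles / 69.0
    dlon = radius_miles / (69.0 * math.cos(math.radians(lat)))
    # (south, west, north, east)
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

print(bounding_box(45.347161, -121.9646838, 15))
```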

AirDNA Settings

| Key | Type | Description |
| --- | --- | --- |
| min_days_available | int | Minimum Days Available to include a listing in the cleaned amenities matrix used by description analysis (default: 100). The raw matrix used by correlation analysis is unfiltered. |
| airdna_cdp_url | string | Chrome DevTools Protocol URL (default: http://localhost:9222) |
| airdna_inspect_mode | bool | Pause after navigation for DOM selector discovery |

Correlation Settings

| Key | Type | Description |
| --- | --- | --- |
| correlation_metrics | array | Metrics to analyze (e.g., ["adr", "occupancy"]) |
| correlation_top_percentile | int | Top percentile tier threshold (e.g., 25 = top 25%) |
| correlation_bottom_percentile | int | Bottom percentile tier threshold (e.g., 25 = bottom 25%) |

OpenAI Settings

| Key | Type | Description |
| --- | --- | --- |
| openai.model | string | Model to use (default: "gpt-4.1-mini") |
| openai.temperature | float | Response randomness (0.0–1.0) |
| openai.max_tokens | int | Max tokens per response |
| openai.chunk_token_limit | int | Token limit per chunk sent to the API |
| openai.enable_cost_tracking | bool | Log API costs to logs/cost_tracking.json |

Pipeline Caching (TTL)

The pipeline includes a TTL-based cache that prevents redundant scraping and processing. When enabled, each stage's outputs are tracked with timestamps. If all outputs for a stage are still within the TTL window, the stage is skipped entirely on the next run. For per-file stages (reviews, details), individual listings are skipped when their cached files are fresh.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| pipeline_cache_enabled | bool | true | Enable/disable pipeline-level TTL caching |
| pipeline_cache_ttl_days | int | 90 | Number of days before cached outputs expire |
| force_refresh_search_results | bool | false | Force re-run area search |
| force_refresh_details_scrape | bool | false | Force re-scrape all property details |
| force_refresh_details_results | bool | false | Force rebuild details fileset |
| force_refresh_reviews_scrape | bool | false | Force re-scrape all reviews |
| force_refresh_airdna_data | bool | false | Force re-run AirDNA scraping even if cached |
| force_refresh_listing_summaries | bool | false | Force regenerate property summaries |
| force_refresh_area_summary | bool | false | Force regenerate area summary + data extraction |
| force_refresh_correlation_results | bool | false | Force re-run correlation analysis |
| force_refresh_description_analysis | bool | false | Force re-run description quality analysis |
| force_refresh_machine_learning_model | bool | false | Force retrain ADR prediction model |

How it works:

  • Freshness is determined by file existence and os.path.getmtime() — each stage declares its expected output files, and a stage is fresh when all files exist with mtime within the TTL
  • On each run, the pipeline checks whether outputs exist and are within the TTL before executing a stage
  • Refreshing an early stage cascades invalidation to all downstream stages
  • The force_refresh_* flags let you bypass the cache for specific stages
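The freshness check itself amounts to a few lines (a sketch; the real utils/pipeline_cache_manager.py adds cascade invalidation and the force-refresh flags on top of this):

```python
import os
import time

def stage_is_fresh(output_files, ttl_days: float) -> bool:
    """A stage is fresh when every declared output file exists and
    its modification time falls within the TTL window."""
    cutoff = time.time() - ttl_days * 86_400  # TTL in seconds
    return all(
        os.path.exists(f) and os.path.getmtime(f) >= cutoff
        for f in output_files
    )
```

Because freshness is derived from the files themselves, deleting an output file (or touching an upstream one) is enough to trigger a re-run; no separate metadata store can drift out of sync.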

Example: Run the full pipeline, then re-run immediately — all stages will be skipped:

pipenv run python main.py   # First run: executes all enabled stages
pipenv run python main.py   # Second run: skips cached stages

To force a single stage to re-run:

{"force_refresh_reviews_scrape": true}

Usage

AirDNA Per-Listing Lookup

Enrich each discovered listing with financial metrics (ADR, Occupancy, Revenue, Days Available) from AirDNA's rentalizer page. The pipeline automatically feeds listing IDs from the Airbnb search into AirDNA — no manual comp set creation needed.

Setup:

  1. Quit Chrome completely first (Cmd+Q on macOS) — the debug port won't open if Chrome is already running.
  2. Launch Chrome with remote debugging:
    # macOS:
    make chrome-debug
    # Or manually:
    open -a "Google Chrome" --args --remote-debugging-port=9222
    
    # Linux:
    google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug-profile &
    
    # Windows (PowerShell):
    & "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="$env:TEMP\chrome-debug-profile"
  3. In the Chrome window that opens, navigate to AirDNA and log in with your account

Run:

# Set config.json: "airdna_data": true
pipenv run python main.py
# Or:
make scrape-airdna

The scraper visits https://app.airdna.co/data/rentalizer?&listing_id=abnb_{id} for each listing and extracts header metrics (Bedrooms, Bathrooms, Max Guests, Rating, Review Count) and KPI cards (Revenue, Days Available, Annual Revenue, Occupancy, ADR). All listings are saved regardless of Days Available; filtering by min_days_available (default: 100) is applied later when the cleaned amenities matrix is built in the details_results stage.
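At its core, the lookup builds one rentalizer URL per Airbnb listing ID and drives a CDP-attached page to it. A minimal sketch (helper names are illustrative; the real selectors and extraction logic live in scraper/airdna_scraper.py):

```python
def rentalizer_url(listing_id: str) -> str:
    """Build the per-listing AirDNA rentalizer URL."""
    return f"https://app.airdna.co/data/rentalizer?&listing_id=abnb_{listing_id}"

def open_rentalizer(listing_id: str, cdp_url: str = "http://localhost:9222"):
    """Attach to the logged-in Chrome session over CDP and navigate
    to the listing's rentalizer page. Playwright is imported lazily
    so the URL helper above works without it installed."""
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        # reuse the first tab of the existing logged-in session
        page = browser.contexts[0].pages[0]
        page.goto(rentalizer_url(listing_id))
        # ...extract header metrics and KPI cards from the DOM here...
```

Because the scraper attaches to an existing session rather than launching a fresh browser, AirDNA sees an ordinary logged-in Chrome user.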

Output: listing_{id}.json — one file per listing in outputs/03_airdna_data/:

{
    "1050769200886027711": {"ADR": 487.5, "Occupancy": 32, "Revenue": 51700.0, "Days_Available": 333, "Bedrooms": 4, "Bathrooms": 3, "Max_Guests": 15, "Rating": 4.7, "Review_Count": 287, "LY_Revenue": 0.0}
}

Note: LY_Revenue (Last Year Revenue) is a placeholder field — it is always 0.0 in the current implementation and can be ignored.

Inspect mode: If selectors break (AirDNA UI changes), enable "airdna_inspect_mode": true to pause the browser and use Playwright Inspector to discover new selectors.

Basic Workflow

# 1. Configure your search area in config.json:
#    "search_zone_name": "my_area",
#    "start_lat": 45.347161,
#    "start_long": -121.9646838,
#    "search_radius_miles": 15

# 2. Scrape listings and reviews
# Set config.json: "search_results": true, "reviews_scrape": true
pipenv run python main.py

# 3. Generate property summaries
# Set config.json: "listing_summaries": true
pipenv run python main.py

# 4. Generate area summary
# Set config.json: "area_summary": true
pipenv run python main.py

Full Pipeline Example

Enable all stages in config.json:

{
  "search_zone_name": "mt_hood",
  "start_lat": 45.347161,
  "start_long": -121.9646838,
  "search_radius_miles": 15,
  "search_results": true,
  "details_scrape": true,
  "airdna_data": true,
  "reviews_scrape": true,
  "details_results": true,
  "listing_summaries": true,
  "area_summary": true,
  "correlation_results": true,
  "description_analysis": true,
  "machine_learning_model": true
}

Then run:

pipenv run python main.py

Pipeline Flow

Lat/Long + config.json
        ↓
┌───────────────────────────────────────┐
│  1. Search Results                    │
│     pyairbnb.search_all() by lat/long │
│     → outputs/01_search_results/      │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  2. Details Scrape                    │
│     pyairbnb.get_details() per listing│
│     → outputs/02_details_scrape/      │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  3. AirDNA Data                       │
│     Playwright/CDP → Chrome → AirDNA  │
│     → outputs/03_airdna_data/         │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  4. Reviews Scrape                    │
│     pyairbnb.get_reviews() per listing│
│     → outputs/04_reviews_scrape/      │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  5. Details Results                   │
│     Raw details + AirDNA financials   │
│     → amenity matrix, descriptions    │
│     → outputs/07_details_results/     │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  6. Listing Summaries (GPT)           │
│     Reviews → structured summaries    │
│     → outputs/05_listing_summaries/   │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  7. Area Summary (GPT)                │
│     Summaries → area insights         │
│     → reports/area_summary_*.md       │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  8. Correlation Results (GPT)         │
│     Top/bottom percentile comparison  │
│     → outputs/08_correlation_results/ │
│     → reports/correlation_insights_*  │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  9. Description Analysis              │
│     OLS regression + GPT scoring      │
│     → outputs/09_description_analysis/│
│     → reports/description_quality_*   │
└───────────────────────────────────────┘
        ↓
┌───────────────────────────────────────┐
│  10. ML Model (XGBoost)               │
│      Two-stage residual ADR predictor │
│      → ml/model/residual/             │
└───────────────────────────────────────┘

Output Files

| Path | Content |
| --- | --- |
| outputs/01_search_results/ | Search results by geographic area |
| outputs/02_details_scrape/ | Property details (amenities, rules, descriptions) |
| outputs/03_airdna_data/ | AirDNA per-listing metrics (ADR, Occupancy, Revenue, Days Available) + master comp set |
| outputs/04_reviews_scrape/ | Raw review JSON per listing |
| outputs/07_details_results/ | Structured CSVs and JSON: amenity matrix, house rules, descriptions, neighborhood highlights |
| outputs/05_listing_summaries/ | AI-generated summary per property |
| outputs/08_correlation_results/ | Correlation statistics (JSON) for each metric |
| outputs/09_description_analysis/ | Description quality statistics (JSON) |
| reports/ | Markdown and JSON reports: area summaries, correlation insights, description quality analysis |
| ml/model/residual/ | Trained XGBoost model artifacts and metrics |
| logs/cost_tracking.json | OpenAI API cost logs per session |

Architecture

main.py                          # Entry point — config-driven pipeline orchestrator
├── steps/
│   ├── __init__.py              # Shared helper (load_search_results)
│   ├── 01_search_results.py     # Listing discovery by lat/long bounding box
│   ├── 02_details_scrape.py     # Scrape property details
│   ├── 03_airdna_data.py        # AirDNA per-listing lookup + master comp set
│   ├── 04_reviews_scrape.py     # Scrape reviews per listing
│   ├── 07_details_results.py    # Transform details + AirDNA → structured data
│   ├── 05_listing_summaries.py  # Per-property AI summaries
│   ├── 06_area_summary.py       # Area-level AI summary
│   ├── 08_correlation_results.py # Percentile-based metric correlation
│   ├── 09_description_analysis.py # OLS regression + description scoring
│   └── 10_ml_model.py           # Two-stage XGBoost ADR prediction model
├── scraper/
│   ├── airbnb_searcher.py       # Lat/long center → bounding box → listing search
│   ├── airdna_scraper.py        # AirDNA per-listing rentalizer scraper (Playwright/CDP)
│   ├── reviews_scraper.py       # Fetch reviews per listing
│   ├── details_scraper.py       # Fetch property details
│   ├── details_fileset_build.py # Transform to structured data + merge AirDNA financials
│   └── location_calculator.py   # Lat/long + radius → bounding box
├── review_aggregator/
│   ├── property_review_aggregator.py  # Per-property AI summaries
│   ├── area_review_aggregator.py      # Area-level AI summaries
│   ├── openai_aggregator.py           # OpenAI client with chunking, retry, cost tracking
│   ├── correlation_analyzer.py        # Percentile-based metric correlation analysis
│   └── description_analyzer.py        # OLS regression + LLM description quality scoring
├── ml/
│   ├── train_2_level.py         # Two-stage residual XGBoost training pipeline
│   └── ml_utils.py              # Shared ML utilities (train, save, refit)
├── app/
│   └── app.py                   # Flask web app — ADR prediction UI
├── utils/
│   ├── cost_tracker.py          # OpenAI API cost tracking (per-session)
│   ├── pipeline_cache_manager.py # TTL-based caching with cascade invalidation
│   ├── local_file_handler.py    # File system utilities
│   └── tiny_file_handler.py     # JSON I/O + config validation
└── prompts/
    ├── prompt.json              # Property-level prompt template
    ├── zone_prompt.json         # Area-level prompt template
    ├── correlation_prompt.json  # Correlation analysis prompts (ADR + Occupancy)
    └── description_analysis_prompt.json # Description scoring + synthesis prompts

Running Tests

# Run full test suite with coverage (75% minimum required)
make test

# Or directly:
pipenv run pytest

# Fast mode — fail on first error, no coverage:
make test-fast

# Generate HTML coverage report:
make coverage
# → opens coverage_html/index.html

Makefile Targets

| Target | Description |
| --- | --- |
| make setup | Install dependencies and Playwright browsers |
| make test | Run pytest with coverage |
| make test-fast | Run pytest, fail-fast, no coverage |
| make coverage | Generate HTML coverage report |
| make chrome-debug | Launch Chrome with remote debugging port for AirDNA scraping |
| make scrape-airdna | Run AirDNA scraper standalone |
| make train | Train the ADR prediction model |
| make run-app | Start the ADR prediction web app |

Notebooks

| Notebook | Purpose |
| --- | --- |
| model_notebook.ipynb | Modeling experiments on listing features and pricing relationships |

Web Application

The project includes a Flask web app that serves interactive ADR (Average Daily Rate) predictions using the trained two-stage XGBoost model.

Prerequisites: Train the model first by running pipeline step 10 ("machine_learning_model": true) or directly via make train. This produces model artifacts in ml/model/residual/.

Run:

make run-app
# Or manually:
pipenv run flask --app app/app run --debug

The app provides a form where you enter property features (bedrooms, bathrooms, capacity, amenity categories) and returns a predicted ADR with error margin. See app/RUNNING.md for production deployment options.

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| AirDNA scraper can't connect | Chrome not running with --remote-debugging-port, or regular Chrome already open | Quit Chrome fully (Cmd+Q), then make chrome-debug |
| AirDNA returns empty results | Rate limiting or session expired | Wait 3 minutes and retry; re-login to AirDNA in the debug Chrome window |
| Correlation/description analysis produces empty results | Missing AirDNA data or too few properties | Ensure the airdna_data stage ran first; check that the search area has enough entire-home listings (4+ for correlation, 161+ for full OLS regression) |
| OpenAI API errors | Insufficient credits or rate limits | Check your OpenAI balance; the pipeline retries 3x with exponential backoff automatically |
| Search returns fewer listings than expected | pyairbnb caps at ~280 listings per geographic bounding box | This is a library limitation; the pipeline subdivides the search area into a 2×2 grid automatically, but very dense areas may still hit caps |
| make chrome-debug hangs | Chrome debug port already in use | Quit all Chrome processes and retry |

Data Privacy

The pipeline scrapes publicly visible Airbnb review text, which may include reviewer names and personally identifiable information. All scraped data is stored locally in the outputs/ directory. Users are responsible for handling this data in compliance with applicable privacy regulations and should avoid sharing raw review data publicly.

Security Notes

Chrome DevTools Protocol (CDP) port exposure: The AirDNA scraper requires Chrome to be launched with --remote-debugging-port=9222. This opens a CDP endpoint on localhost:9222 that allows any local process to attach to the browser and control it — including accessing the authenticated AirDNA session, reading cookies, and navigating to arbitrary URLs. On a single-user development machine this is low risk, but on shared servers, multi-user environments, or CI runners it can be exploited by other users or processes on the same host. Mitigations:

  • Only launch Chrome with --remote-debugging-port when actively running the AirDNA scraper, and close it immediately afterward.
  • Do not expose port 9222 to the network (the default localhost binding is not reachable externally, but verify your firewall configuration).
  • Avoid running the AirDNA scraper on shared or multi-tenant machines.

Disclaimer

This tool scrapes data from Airbnb and AirDNA. Use of web scraping may be subject to the terms of service of these platforms. This project is intended for personal market research and analysis. Users are responsible for ensuring their use complies with applicable terms of service and laws.

Contributing

Issues and pull requests are welcome. Please ensure all tests pass (make test) before submitting.

License

MIT
