An end-to-end pipeline for short-term rental market analysis. Given a geographic center point (latitude/longitude) and search radius, it scrapes hundreds of Airbnb listings and their reviews, generates AI-powered summaries using GPT-4.1-mini, enriches each listing with AirDNA financial metrics via per-listing rentalizer lookups, and produces market-intelligence reports that identify what drives higher nightly rates and occupancy. The final output includes correlation analyses (e.g., "Jacuzzi prevalence is +34.3pp higher among listings that outperform their size-predicted ADR"), description quality scoring via OLS regression, an ADR prediction model, and actionable recommendations for hosts — all generated automatically from a single config.json.
Before setting up the project, ensure you have the following; both are required for a full pipeline run:
An active OpenAI account with API access and a funded balance. The pipeline uses gpt-4.1-mini by default and makes hundreds of API calls across four separate LLM use cases:
| Use Case | Description |
|---|---|
| Property summaries | Each listing's reviews → structured summary with mention counts |
| Area summary | All property summaries → area-level insights |
| Correlation insights | Statistical data + descriptions → market analysis for ADR/Occupancy |
| Description scoring | Each listing description scored on 7 quality dimensions, then synthesized |
For a typical run of ~300 listings, expect roughly $2–5 in API costs (GPT-4.1-mini at $0.40/1M input, $1.60/1M output tokens). The pipeline includes a built-in cost tracker (utils/cost_tracker.py) that logs every request to logs/cost_tracking.json so you can monitor spend in real time.
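The arithmetic behind that estimate can be sketched directly. The prices are those quoted above; the per-listing token counts below are illustrative assumptions, not measured values from the pipeline:

```python
# Back-of-the-envelope cost estimate using the per-token prices quoted above.
# The per-listing token counts are illustrative assumptions, not measurements.
INPUT_PER_M = 0.40   # $ per 1M input tokens (gpt-4.1-mini)
OUTPUT_PER_M = 1.60  # $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# ~300 listings, assuming ~20K input + ~1K output tokens per listing:
cost = estimate_cost(300 * 20_000, 300 * 1_000)  # lands inside the $2-5 range
```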
AirDNA is a third-party short-term rental analytics platform. A paid subscription is required to access rentalizer data, which provides financial metrics not available from Airbnb directly:
- ADR (Average Daily Rate) — average price per booked night
- Occupancy — percentage of available days that are booked
- Revenue — annual rental revenue
- Days Available — how many days the property is listed per year
The pipeline automatically looks up each listing discovered in the Airbnb search on AirDNA's rentalizer page — no manual comp set creation required. Without AirDNA data, the correlation analysis and description quality analysis stages will lack the financial metrics needed to function.
- AirDNA Per-Listing Lookup — Enrich each listing with ADR, Occupancy, Revenue, and Days Available via Playwright/CDP rentalizer pages
- Property Search — Find Airbnb listings within a geographic area using lat/long center + radius via `pyairbnb`
- Review Scraping — Pull all reviews for discovered listings with per-file caching
- Property Details Scraping — Scrape amenities, descriptions, house rules, and neighborhood info
- Details Fileset Build — Transform raw details + AirDNA financials into structured CSVs (amenity matrix, descriptions, house rules)
- AI Property Summaries — Generate structured summaries per property using GPT-4.1-mini: pros/cons with mention percentages, amenity analysis, and rating context vs. area average
- AI Area Summary — Roll up all property summaries into area-level trends and insights
- Data Extraction — LLM-powered parsing of summaries into structured numeric data with sentiment categories
- Correlation Analysis — Statistical comparison of top/bottom percentile tiers by ADR or Occupancy, with LLM-generated market insights
- Description Quality Analysis — OLS regression of ADR against 160+ features, LLM scoring of descriptions on 7 quality dimensions, and correlation of language quality with pricing premiums
- ADR Prediction Model — Two-stage residual XGBoost model predicting nightly rates from property features and amenities
- Prediction Web App — Flask UI for interactive ADR predictions using the trained model
- TTL-Based Caching — Pipeline cache with cascade invalidation prevents redundant scraping and API calls
- Cost Tracking — Monitor OpenAI API usage and costs per session
Requires Python 3.13.
# Clone the repository
git clone https://github.com/threnjen/AirBNB-Review-Scraper.git
cd AirBNB-Review-Scraper
# Install dependencies
pipenv install --dev
# Install Playwright browsers (required for AirDNA scraping)
pipenv run playwright install chromium
# Set OpenAI API key (or copy .env.example to .env and fill it in)
cp .env.example .env
# Then edit .env with your key, or export directly:
export OPENAI_API_KEY="your-api-key-here"

The pipeline combines two data sources and four LLM use cases to produce market intelligence:
Data Sources:
- Airbnb (via `pyairbnb`) — listing search by geographic bounding box, review text, property details, amenities, descriptions, and house rules
- AirDNA (via Playwright/CDP) — financial metrics (ADR, Occupancy, Revenue, Days Available) scraped per-listing from AirDNA's rentalizer page via a logged-in Chrome session connected over Chrome DevTools Protocol
AI Processing:
- All OpenAI calls go through `review_aggregator/openai_aggregator.py`, which handles token estimation via `tiktoken`, automatic chunking at 120K tokens with a merge step, and 3 retries with exponential backoff
- Property-level prompts produce structured summaries with mention counts in "X of Y Reviews (Z%)" format
- Area-level prompts aggregate all property summaries into a single area analysis
- Correlation prompts receive statistical comparison data and produce actionable market insights
- Description prompts use a two-phase approach: first scoring each description on 7 dimensions (1–10), then synthesizing findings with top/bottom examples
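The chunk-then-merge flow can be sketched roughly as follows. This is a simplified stand-in, not the project's actual code: the real aggregator counts tokens with `tiktoken`, while this sketch substitutes a crude characters-per-token heuristic so it has no third-party dependencies.

```python
# Sketch of the chunk-then-merge pattern: pack reviews into chunks that stay
# under a token budget; each chunk is summarized separately, then a final
# "merge" call combines the per-chunk summaries.
CHUNK_TOKEN_LIMIT = 120_000

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4-chars-per-token heuristic, not tiktoken

def chunk_reviews(reviews: list[str], limit: int = CHUNK_TOKEN_LIMIT) -> list[list[str]]:
    chunks, current, used = [], [], 0
    for review in reviews:
        cost = estimate_tokens(review)
        if current and used + cost > limit:
            chunks.append(current)       # budget exceeded: close current chunk
            current, used = [], 0
        current.append(review)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```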
Caching:
- TTL-based cache (`utils/pipeline_cache_manager.py`) checks file existence and `os.path.getmtime()` against the configured TTL — no metadata file needed
- Refreshing an early stage automatically cascades invalidation to all downstream stages
- Per-file caching for reviews and details allows incremental scraping of new listings
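The freshness check amounts to a few lines of `os.path` logic. A minimal sketch mirroring the behavior described above (not copied from `utils/pipeline_cache_manager.py`):

```python
import os
import time

TTL_DAYS = 90  # matches the pipeline_cache_ttl_days default

def is_fresh(path: str, ttl_days: int = TTL_DAYS) -> bool:
    """A file is 'fresh' if it exists and its mtime is within the TTL window."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < ttl_days * 86_400

def stage_is_fresh(expected_outputs: list[str], ttl_days: int = TTL_DAYS) -> bool:
    """A stage is skipped only when *all* of its declared outputs are fresh."""
    return bool(expected_outputs) and all(is_fresh(p, ttl_days) for p in expected_outputs)
```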
This pipeline automates data collection, but it is designed to browse no faster than a human would. Every scraping stage inserts randomized delays between requests, so the pace of data retrieval is consistent with a person manually clicking through listings. The AirDNA scraper goes further — it connects to a real Chrome browser session via Chrome DevTools Protocol, meaning it appears on the network as a normal logged-in user navigating page by page.
Key principles:
- Human-speed pacing — Randomized pauses between every request ensure automated browsing is no faster than manual browsing. There is no parallel request fan-out; listings are visited one at a time.
- Caching prevents redundant requests — TTL-based caching means previously scraped listings are never re-fetched unless their cache expires. A second run against the same search area hits zero external endpoints if all data is still fresh.
- Backoff on rate-limit signals — If the pipeline detects signs of rate limiting (e.g., AirDNA returning empty results), it pauses for an extended cooldown period before retrying, rather than retrying immediately.
- No API abuse — OpenAI calls use exponential backoff with retry limits. Token usage and costs are tracked per-session so users can monitor spend.
This project is intended for personal market research. The scraping approach is deliberately conservative — the automation saves time by running unattended, not by going faster.
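The pacing principle can be sketched as follows; the delay bounds here are illustrative, not the project's actual values:

```python
import random
import time

def human_pause(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep a randomized, human-scale interval between requests."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Listings are visited strictly one at a time, with a pause between each:
# for listing_id in listing_ids:
#     scrape(listing_id)
#     human_pause()
```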
The reports/ directory contains example analytical output from a full pipeline run on the Mount Hood, Oregon area (843 properties analyzed):
Area Summary — reports/area_summary_mt_hood.md
Aggregated area-level insights from all property summaries. Identifies a diverse selection of accommodations — cabins, lodges, mountain homes, condos, chalets, yurts, tiny homes, treehouses, and glamping tents — set in forested, riverfront, and mountain settings. Top positives: hot tubs, location, cleanliness, host communication, well-stocked kitchens. Top issues: maintenance inconsistencies, hot tub problems, noise/privacy concerns, heating limitations, Wi-Fi reliability.
ADR Correlation Analysis — reports/correlation_insights_adr_mt_hood.md
Identifies what drives higher nightly rates after adjusting for property size. Uses XGBoost regression on capacity, bedrooms, beds, and bathrooms (R² = 0.747) to compute size-adjusted residuals, then compares the top 25% (residual +$42.98, n=207) against the bottom 25% (residual −$53.10, n=207). Key finding: luxury amenities and experiential features — not additional space per guest — drive premium pricing.
| Feature | High Tier | Low Tier | Difference (pp) |
|---|---|---|---|
| Jacuzzi | 67.6% | 33.3% | +34.3 |
| Ocean View | 38.6% | 15.0% | +23.7 |
| Firepit | 59.4% | 44.9% | +14.5 |
| Grill | 82.6% | 69.1% | +13.5 |
| Dishwasher | 86.5% | 74.4% | +12.1 |
The pipeline also generates Occupancy Correlation Analysis and Description Quality Analysis reports (see stages 8 and 9 in Pipeline Flow), which are not included as checked-in examples.
Edit config.json to configure the pipeline. All pipeline behavior is controlled here — there is no CLI argument interface.
| Key | Type | Description |
|---|---|---|
| `search_results` | bool | Search for Airbnb listings by lat/long bounding box |
| `details_scrape` | bool | Scrape property details (amenities, rules) |
| `airdna_data` | bool | Scrape AirDNA per-listing metrics |
| `reviews_scrape` | bool | Scrape reviews for discovered listings |
| `details_results` | bool | Transform scraped details + AirDNA financials into structured datasets |
| `listing_summaries` | bool | Generate AI summaries for each property |
| `area_summary` | bool | Generate area-level summary + extract structured data from summaries |
| `correlation_results` | bool | Run correlation analysis of amenities/capacity vs. ADR and Occupancy |
| `description_analysis` | bool | Run description quality scoring and regression analysis |
| `machine_learning_model` | bool | Train two-stage XGBoost ADR prediction model |
Stage dependencies: Stages run in order and depend on upstream outputs. Stages 1–5 produce the raw data; stages 6–9 consume it. Stage 10 (ML model) depends on stage 5. For example, listing_summaries (6) requires search_results (1) and reviews_scrape (4); correlation_results (8) and description_analysis (9) require details_results (5) and airdna_data (3). If you enable a downstream stage without having run its upstream stages first, the pipeline will fail or produce empty results.
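Those dependency rules can be expressed as a small validation check. This is a hypothetical sketch covering only the examples named above — the pipeline does not necessarily expose a map like this, and cached upstream outputs would also satisfy a dependency:

```python
# Hypothetical dependency map built from the examples stated above.
STAGE_DEPS = {
    "listing_summaries": ["search_results", "reviews_scrape"],
    "correlation_results": ["details_results", "airdna_data"],
    "description_analysis": ["details_results", "airdna_data"],
    "machine_learning_model": ["details_results"],
}

def missing_upstream(config: dict) -> dict:
    """Return enabled stages whose upstream stages are not also enabled."""
    problems = {}
    for stage, deps in STAGE_DEPS.items():
        if config.get(stage):
            missing = [d for d in deps if not config.get(d)]
            if missing:
                problems[stage] = missing
    return problems
```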
AirDNA data required for stages 8–9: The correlation_results and description_analysis stages silently filter out properties without AirDNA financial data. If you skip the airdna_data stage, these analysis stages will have no properties to analyze. Run airdna_data first to populate AirDNA metrics.
Entire-home filter: The details_results stage only includes listings with room type "Entire home/apt". Shared rooms, private rooms, and hotel rooms are silently excluded. All downstream analysis operates on entire-home listings only.
Minimum property requirements: The correlation analyzer requires at least 4 properties with valid metric values per tier. The description analyzer's OLS regression requires more properties than features (~161+ for the full amenity set). Small search areas with few listings may produce empty or degraded results.
| Key | Type | Description |
|---|---|---|
| `search_zone_name` | string | Required. A label for this search area (e.g., "mt_hood"). Used in output filenames. Alphanumeric, underscores, and hyphens only. |
| `start_lat` | float | Required. Center latitude of the search area (e.g., 45.347161) |
| `start_long` | float | Required. Center longitude of the search area (e.g., -121.9646838) |
| `search_radius_miles` | float | Required. Search radius in miles from the center point (e.g., 15) |
| `poi_lat` | float | Point-of-interest latitude, used for distance calculations in the ML model |
| `poi_long` | float | Point-of-interest longitude |
| `iso_code` | string | Country code (e.g., "us") |
| `num_listings_to_search` | int | Max listings to find in search |
| `num_listings_to_summarize` | int | Max listings to process with AI |
| `review_thresh_to_include_prop` | int | Minimum reviews required to process a listing |
| `num_summary_to_process` | int | Max property summaries to process in downstream stages |
| `dataset_use_categoricals` | bool | Use categorical encoding for amenity features in analysis |
Tip: To find coordinates for your area, right-click any location in Google Maps and select the lat/long values.
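The radius-to-bounding-box conversion (handled by `scraper/location_calculator.py`) can be approximated in a few lines. This sketch uses a flat-earth approximation and may differ from the project's actual implementation:

```python
import math

MILES_PER_DEG_LAT = 69.0  # roughly constant across latitudes

def bounding_box(lat: float, lon: float, radius_miles: float) -> tuple:
    """Approximate (south, west, north, east) box around a center point."""
    dlat = radius_miles / MILES_PER_DEG_LAT
    # a degree of longitude shrinks with the cosine of the latitude
    dlon = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

# e.g. bounding_box(45.347161, -121.9646838, 15) for the Mt. Hood example
```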
| Key | Type | Description |
|---|---|---|
| `min_days_available` | int | Minimum Days Available to include a listing in the cleaned amenities matrix used by description analysis (default: 100). The raw matrix used by correlation analysis is unfiltered. |
| `airdna_cdp_url` | string | Chrome DevTools Protocol URL (default: http://localhost:9222) |
| `airdna_inspect_mode` | bool | Pause after navigation for DOM selector discovery |
| Key | Type | Description |
|---|---|---|
| `correlation_metrics` | array | Metrics to analyze (e.g., ["adr", "occupancy"]) |
| `correlation_top_percentile` | int | Top percentile tier threshold (e.g., 25 = top 25%) |
| `correlation_bottom_percentile` | int | Bottom percentile tier threshold (e.g., 25 = bottom 25%) |
| Key | Type | Description |
|---|---|---|
| `openai.model` | string | Model to use (default: "gpt-4.1-mini") |
| `openai.temperature` | float | Response randomness (0.0–1.0) |
| `openai.max_tokens` | int | Max tokens per response |
| `openai.chunk_token_limit` | int | Token limit per chunk sent to the API |
| `openai.enable_cost_tracking` | bool | Log API costs to logs/cost_tracking.json |
The pipeline includes a TTL-based cache that prevents redundant scraping and processing. When enabled, each stage's outputs are tracked with timestamps. If all outputs for a stage are still within the TTL window, the stage is skipped entirely on the next run. For per-file stages (reviews, details), individual listings are skipped when their cached files are fresh.
| Key | Type | Default | Description |
|---|---|---|---|
| `pipeline_cache_enabled` | bool | `true` | Enable/disable pipeline-level TTL caching |
| `pipeline_cache_ttl_days` | int | `90` | Number of days before cached outputs expire |
| `force_refresh_search_results` | bool | `false` | Force re-run area search |
| `force_refresh_details_scrape` | bool | `false` | Force re-scrape all property details |
| `force_refresh_details_results` | bool | `false` | Force rebuild details fileset |
| `force_refresh_reviews_scrape` | bool | `false` | Force re-scrape all reviews |
| `force_refresh_airdna_data` | bool | `false` | Force re-run AirDNA scraping even if cached |
| `force_refresh_listing_summaries` | bool | `false` | Force regenerate property summaries |
| `force_refresh_area_summary` | bool | `false` | Force regenerate area summary + data extraction |
| `force_refresh_correlation_results` | bool | `false` | Force re-run correlation analysis |
| `force_refresh_description_analysis` | bool | `false` | Force re-run description quality analysis |
| `force_refresh_machine_learning_model` | bool | `false` | Force retrain ADR prediction model |
How it works:
- Freshness is determined by file existence and `os.path.getmtime()` — each stage declares its expected output files, and a stage is fresh when all files exist with mtime within the TTL
- On each run, the pipeline checks whether outputs exist and are within the TTL before executing a stage
- Refreshing an early stage cascades invalidation to all downstream stages
- The `force_refresh_*` flags let you bypass the cache for specific stages
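Because the stages run in a fixed order, cascade invalidation reduces to "the refreshed stage plus everything after it." A sketch of that idea (the real implementation in `utils/pipeline_cache_manager.py` may differ):

```python
# Stage names follow the config.json toggles, in pipeline execution order.
STAGE_ORDER = [
    "search_results", "details_scrape", "airdna_data", "reviews_scrape",
    "details_results", "listing_summaries", "area_summary",
    "correlation_results", "description_analysis", "machine_learning_model",
]

def stages_to_invalidate(refreshed_stage: str) -> list[str]:
    """Refreshing a stage invalidates it and every downstream stage."""
    idx = STAGE_ORDER.index(refreshed_stage)
    return STAGE_ORDER[idx:]
```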
Example: Run the full pipeline, then re-run immediately — all stages will be skipped:
pipenv run python main.py # First run: executes all enabled stages
pipenv run python main.py # Second run: skips cached stages

To force a single stage to re-run:
{"force_refresh_reviews_scrape": true}

Enrich each discovered listing with financial metrics (ADR, Occupancy, Revenue, Days Available) from AirDNA's rentalizer page. The pipeline automatically feeds listing IDs from the Airbnb search into AirDNA — no manual comp set creation needed.
Setup:
- Quit Chrome completely first (Cmd+Q on macOS) — the debug port won't open if Chrome is already running.
- Launch Chrome with remote debugging:
```shell
# macOS:
make chrome-debug
# Or manually:
open -a "Google Chrome" --args --remote-debugging-port=9222

# Linux:
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug-profile &

# Windows (PowerShell):
& "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="$env:TEMP\chrome-debug-profile"
```
- In the Chrome window that opens, navigate to AirDNA and log in with your account
Run:
# Set config.json: "airdna_data": true
pipenv run python main.py
# Or:
make scrape-airdna

The scraper visits https://app.airdna.co/data/rentalizer?&listing_id=abnb_{id} for each listing and extracts header metrics (Bedrooms, Bathrooms, Max Guests, Rating, Review Count) and KPI cards (Revenue, Days Available, Annual Revenue, Occupancy, ADR). All listings are saved regardless of Days Available; filtering by min_days_available (default: 100) is applied later when the cleaned amenities matrix is built in the details_results stage.
Output: listing_{id}.json — one file per listing in outputs/03_airdna_data/:
{
"1050769200886027711": {"ADR": 487.5, "Occupancy": 32, "Revenue": 51700.0, "Days_Available": 333, "Bedrooms": 4, "Bathrooms": 3, "Max_Guests": 15, "Rating": 4.7, "Review_Count": 287, "LY_Revenue": 0.0}
}

Note: `LY_Revenue` (Last Year Revenue) is a placeholder field — it is always `0.0` in the current implementation and can be ignored.
Inspect mode: If selectors break (AirDNA UI changes), enable "airdna_inspect_mode": true to pause the browser and use Playwright Inspector to discover new selectors.
# 1. Configure your search area in config.json:
# "search_zone_name": "my_area",
# "start_lat": 45.347161,
# "start_long": -121.9646838,
# "search_radius_miles": 15
# 2. Scrape listings and reviews
# Set config.json: "search_results": true, "reviews_scrape": true
pipenv run python main.py
# 3. Generate property summaries
# Set config.json: "listing_summaries": true
pipenv run python main.py
# 4. Generate area summary
# Set config.json: "area_summary": true
pipenv run python main.py

Enable all stages in config.json:
{
"search_zone_name": "mt_hood",
"start_lat": 45.347161,
"start_long": -121.9646838,
"search_radius_miles": 15,
"search_results": true,
"details_scrape": true,
"airdna_data": true,
"reviews_scrape": true,
"details_results": true,
"listing_summaries": true,
"area_summary": true,
"correlation_results": true,
"description_analysis": true,
"machine_learning_model": true
}

Then run:
pipenv run python main.py

Lat/Long + config.json
↓
┌───────────────────────────────────────┐
│ 1. Search Results │
│ pyairbnb.search_all() by lat/long│
│ → outputs/01_search_results/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 2. Details Scrape │
│ pyairbnb.get_details() per listing│
│ → outputs/02_details_scrape/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 3. Comp Sets (AirDNA) │
│ Playwright/CDP → Chrome → AirDNA │
│ → outputs/03_airdna_data/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 4. Reviews Scrape │
│ pyairbnb.get_reviews() per listing│
│ → outputs/04_reviews_scrape/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 5. Details Results │
│ Raw details + AirDNA financials │
│ → amenity matrix, descriptions │
│ → outputs/07_details_results/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 6. Listing Summaries (GPT) │
│ Reviews → structured summaries │
│ → outputs/05_listing_summaries/ │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 7. Area Summary (GPT) │
│ Summaries → area insights │
│ → reports/area_summary_*.md │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 8. Correlation Results (GPT) │
│ Top/bottom percentile comparison │
│ → outputs/08_correlation_results/ │
│ → reports/correlation_insights_* │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 9. Description Analysis │
│ OLS regression + GPT scoring │
│ → outputs/09_description_analysis/│
│ → reports/description_quality_* │
└───────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ 10. ML Model (XGBoost) │
│ Two-stage residual ADR predictor │
│ → ml/model/residual/ │
└───────────────────────────────────────┘
| Directory | Content |
|---|---|
| `outputs/01_search_results/` | Search results by geographic area |
| `outputs/02_details_scrape/` | Property details (amenities, rules, descriptions) |
| `outputs/03_airdna_data/` | AirDNA per-listing metrics (ADR, Occupancy, Revenue, Days Available) + master comp set |
| `outputs/04_reviews_scrape/` | Raw review JSON per listing |
| `outputs/07_details_results/` | Structured CSVs and JSON: amenity matrix, house rules, descriptions, neighborhood highlights |
| `outputs/05_listing_summaries/` | AI-generated summary per property |
| `outputs/08_correlation_results/` | Correlation statistics (JSON) for each metric |
| `outputs/09_description_analysis/` | Description quality statistics (JSON) |
| `reports/` | Markdown and JSON reports: area summaries, correlation insights, description quality analysis |
| `ml/model/residual/` | Trained XGBoost model artifacts and metrics |
| `logs/cost_tracking.json` | OpenAI API cost logs per session |
main.py # Entry point — config-driven pipeline orchestrator
├── steps/
│ ├── __init__.py # Shared helper (load_search_results)
│ ├── 01_search_results.py # Listing discovery by lat/long bounding box
│ ├── 02_details_scrape.py # Scrape property details
│ ├── 03_airdna_data.py # AirDNA per-listing lookup + master comp set
│ ├── 04_reviews_scrape.py # Scrape reviews per listing
│ ├── 07_details_results.py # Transform details + AirDNA → structured data
│ ├── 05_listing_summaries.py # Per-property AI summaries
│ ├── 06_area_summary.py # Area-level AI summary
│ ├── 08_correlation_results.py # Percentile-based metric correlation
│ ├── 09_description_analysis.py # OLS regression + description scoring
│ └── 10_ml_model.py # Two-stage XGBoost ADR prediction model
├── scraper/
│ ├── airbnb_searcher.py # Lat/long center → bounding box → listing search
│ ├── airdna_scraper.py # AirDNA per-listing rentalizer scraper (Playwright/CDP)
│ ├── reviews_scraper.py # Fetch reviews per listing
│ ├── details_scraper.py # Fetch property details
│ ├── details_fileset_build.py # Transform to structured data + merge AirDNA financials
│ └── location_calculator.py # Lat/long + radius → bounding box
├── review_aggregator/
│ ├── property_review_aggregator.py # Per-property AI summaries
│ ├── area_review_aggregator.py # Area-level AI summaries
│ ├── openai_aggregator.py # OpenAI client with chunking, retry, cost tracking
│ ├── correlation_analyzer.py # Percentile-based metric correlation analysis
│ └── description_analyzer.py # OLS regression + LLM description quality scoring
├── ml/
│ ├── train_2_level.py # Two-stage residual XGBoost training pipeline
│ └── ml_utils.py # Shared ML utilities (train, save, refit)
├── app/
│ └── app.py # Flask web app — ADR prediction UI
├── utils/
│ ├── cost_tracker.py # OpenAI API cost tracking (per-session)
│ ├── pipeline_cache_manager.py # TTL-based caching with cascade invalidation
│ ├── local_file_handler.py # File system utilities
│ └── tiny_file_handler.py # JSON I/O + config validation
└── prompts/
├── prompt.json # Property-level prompt template
├── zone_prompt.json # Area-level prompt template
├── correlation_prompt.json # Correlation analysis prompts (ADR + Occupancy)
└── description_analysis_prompt.json # Description scoring + synthesis prompts
# Run full test suite with coverage (75% minimum required)
make test
# Or directly:
pipenv run pytest
# Fast mode — fail on first error, no coverage:
make test-fast
# Generate HTML coverage report:
make coverage
# → opens coverage_html/index.html

| Target | Description |
|---|---|
| `make setup` | Install dependencies and Playwright browsers |
| `make test` | Run pytest with coverage |
| `make test-fast` | Run pytest, fail-fast, no coverage |
| `make coverage` | Generate HTML coverage report |
| `make chrome-debug` | Launch Chrome with remote debugging port for AirDNA scraping |
| `make scrape-airdna` | Run AirDNA scraper standalone |
| `make train` | Train the ADR prediction model |
| `make run-app` | Start the ADR prediction web app |
| Notebook | Purpose |
|---|---|
| `model_notebook.ipynb` | Modeling experiments on listing features and pricing relationships |
The project includes a Flask web app that serves interactive ADR (Average Daily Rate) predictions using the trained two-stage XGBoost model.
Prerequisites: Train the model first by running pipeline step 10 ("machine_learning_model": true) or directly via make train. This produces model artifacts in ml/model/residual/.
Run:
make run-app
# Or manually:
pipenv run flask --app app/app run --debug

The app provides a form where you enter property features (bedrooms, bathrooms, capacity, amenity categories) and returns a predicted ADR with error margin. See app/RUNNING.md for production deployment options.
| Problem | Cause | Solution |
|---|---|---|
| AirDNA scraper can't connect | Chrome not running with `--remote-debugging-port`, or regular Chrome already open | Quit Chrome fully (Cmd+Q), then `make chrome-debug` |
| AirDNA returns empty results | Rate limiting or session expired | Wait 3 minutes and retry; re-login to AirDNA in the debug Chrome window |
| Correlation/description analysis produces empty results | Missing AirDNA data or too few properties | Ensure `airdna_data` stage ran first; check that the search area has enough entire-home listings (4+ for correlation, 161+ for full OLS regression) |
| OpenAI API errors | Insufficient credits or rate limits | Check your OpenAI balance; the pipeline retries 3x with exponential backoff automatically |
| Search returns fewer listings than expected | `pyairbnb` caps at ~280 listings per geographic bounding box | This is a library limitation; the pipeline subdivides the search area into a 2×2 grid automatically, but very dense areas may still hit caps |
| `make chrome-debug` hangs | Chrome debug port already in use | Quit all Chrome processes and retry |
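The retry-with-exponential-backoff behavior referenced above can be sketched generically; the delay values here are illustrative, not the project's exact numbers:

```python
import random
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying up to `retries` times with exponential delay + jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error
            # delays grow as base, 2*base, 4*base, ... plus a little jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2))
```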
The pipeline scrapes publicly visible Airbnb review text, which may include reviewer names and personally identifiable information. All scraped data is stored locally in the outputs/ directory. Users are responsible for handling this data in compliance with applicable privacy regulations and should avoid sharing raw review data publicly.
Chrome DevTools Protocol (CDP) port exposure: The AirDNA scraper requires Chrome to be launched with --remote-debugging-port=9222. This opens a CDP endpoint on localhost:9222 that allows any local process to attach to the browser and control it — including accessing the authenticated AirDNA session, reading cookies, and navigating to arbitrary URLs. On a single-user development machine this is low risk, but on shared servers, multi-user environments, or CI runners it can be exploited by other users or processes on the same host. Mitigations:
- Only launch Chrome with `--remote-debugging-port` when actively running the AirDNA scraper, and close it immediately afterward.
- Do not expose port 9222 to the network (the default `localhost` binding is not reachable externally, but verify your firewall configuration).
- Avoid running the AirDNA scraper on shared or multi-tenant machines.
This tool scrapes data from Airbnb and AirDNA. Use of web scraping may be subject to the terms of service of these platforms. This project is intended for personal market research and analysis. Users are responsible for ensuring their use complies with applicable terms of service and laws.
Issues and pull requests are welcome. Please ensure all tests pass (make test) before submitting.
MIT