felipepov/airbnb-search-classifier

Airbnb Lucene Analyzer: Search App & Classifier

Airbnb Lucene Analyzer is a dual-purpose Information Retrieval project built with Java 21 and Apache Lucene 10. It works both as a search application and as a text classifier, and it demonstrates robust CSV indexing, relational data modeling across two indices, an interactive search UI, and text classification models.

The project processes the Los Angeles Airbnb Data (June 2025) dataset obtained from Kaggle. This dataset contains the public information published by the Inside Airbnb project for the city of Los Angeles, California. The data was collected on June 17, 2025, and provides a detailed view of Airbnb listings at that time. The dataset comes as a single listings.csv file with 79 columns and about 45,000 rows.


🚀 Key Features

  • Robust Indexing Engine: Parses complex CSV data, handling multi-line fields, missing values, and data anomalies.
  • Relational Index Design: Implements a "Join" strategy by maintaining separate indices for Properties and Hosts, linked by ID.
  • Interactive Search App: A JavaFX graphical interface providing real-time search, faceted filtering (neighborhoods, property types), and geospatial queries.
  • Machine Learning Classifier: Evaluates algorithms (Naive Bayes, KNN, Fuzzy KNN) to categorize listings based on unstructured text (e.g., predicting neighborhood from description).
  • Advanced Text Analysis: Uses EnglishAnalyzer for stemming and stopword removal on descriptions, and StandardAnalyzer for unstemmed metadata such as names and amenities.

🛠 Prerequisites

  • Java: JDK 21 or higher.
  • Maven: 3.6+ installed.
  • Data: listings.csv in the project root.

📦 Build & Run

1. Compilation

We use Maven to manage dependencies and package the application.

# Clean and compile
mvn clean compile

# Create standalone JAR (includes all dependencies)
mvn clean package

Artifact created at: target/airbnb-indexer.jar

2. Indexing the Data

Before searching, you must build the indices. This process reads listings.csv and generates the Lucene indices.

java -jar target/airbnb-indexer.jar --input ./listings.csv --index-root ./index_root --mode build
  • --mode build: Creates new indices. Use rebuild to wipe and recreate, or update to add new records.
  • Output: Generates index_root/index_properties and index_root/index_hosts.

3. Launching the Search UI

The JavaFX application allows you to explore the indexed data visually.

mvn javafx:run -Djavafx.args="--index-root ./index_root"

Search UI Screenshot

4. Running Classifiers

Analyze the performance of Lucene's classification module.

java -cp target/airbnb-indexer.jar AirbnbClasificador --index-root ./index_root

📂 Index Structure & Schema

The project splits data into two optimized indices. Here are the key fields and their configurations:

🏠 Index: Properties (index_properties)

| Field | Lucene Type | Analyzer | Note |
|---|---|---|---|
| id | IntPoint | - | Primary Key |
| name | TextField | Standard | Full-text searchable |
| description | TextField | English | Stemmed, stopwords removed |
| neighbourhood_cleansed | StringField + Facet | Keyword | Normalized lowercase |
| property_type | StringField + Facet | Keyword | Normalized lowercase |
| amenity | TextField (Multivalued) | Standard | Individual amenities indexed |
| price | DoublePoint | - | Numeric ranges |
| location | LatLonPoint | - | Geo-spatial queries |
| host_id | StringField | Keyword | Foreign Key to Host Index |

👤 Index: Hosts (index_hosts)

| Field | Lucene Type | Analyzer | Note |
|---|---|---|---|
| host_id | StringField | Keyword | Primary Key |
| host_name | TextField | Standard | |
| host_about | TextField | English | |
| host_response_time | StringField + Facet | Keyword | |
| host_is_superhost | IntPoint (0/1) | - | Boolean flag |

πŸ” Search Guide & Syntax

You can use Luke (Lucene Index Toolbox) or the provided Search App to query the data.

Common Query Syntax (Lucene Standard)

  • Keyword Search: beach (Finds "beach" in default field)
  • Specific Field: name:ocean OR description:pool
  • Phrase Match: description:"walking distance"
  • Wildcards: name:beac* (Matches beach, beacon, etc.)
  • Boolean Logic: pool AND wifi NOT party
  • Numeric Ranges: bedrooms:[2 TO 4] (Inclusive)

πŸ— Architecture & Design Decisions

1. Analyzers & Normalization

We deliberately chose EnglishAnalyzer for description and host_about to handle English morphology (e.g., matching "running" with "run"). For metadata like amenities or name, StandardAnalyzer is used to preserve more specific tokens. String identifiers (Neighborhoods) are normalized to lowercase to ensure case-insensitive exact matching.
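The lowercase normalization described above could look roughly like the following. This is a minimal sketch, not the project's code: the class name `KeywordNormalizer` is hypothetical, and the whitespace collapsing is an added assumption (the text only mentions lowercasing).

```java
import java.util.Locale;

// Hypothetical helper: normalizes keyword-style values (e.g. neighbourhood names)
// before they are stored in an exact-match field.
final class KeywordNormalizer {
    private KeywordNormalizer() {}

    /** Trims, collapses inner whitespace (assumption), and lowercases with a fixed locale. */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

For exact matching to stay case-insensitive, the same function would have to be applied to user input at query time, since an exact-match field is compared verbatim.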

2. Relational Indexing

Instead of denormalizing host data into every property listing (which would duplicate "Host Name" thousands of times), we maintained a normalized schema.

  • Pros: Smaller index size, faster updates for host details.
  • Cons: Requires application-level "joins" (two queries) to display host info alongside property results.
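The application-level join mentioned in the trade-offs can be pictured as a second lookup keyed by host_id. The sketch below uses plain Java collections to stand in for the two indices; the record types and the `HostJoin` class are hypothetical, and in the real application the `Map` lookup would be a query against index_hosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a two-step "join": property hits first, host lookup second.
final class HostJoin {
    record Property(String id, String name, String hostId) {}
    record Host(String hostId, String hostName) {}
    record Row(Property property, Host host) {}

    private HostJoin() {}

    /** Pairs each property hit with its host, or with null when the host is unknown. */
    static List<Row> join(List<Property> hits, Map<String, Host> hostsById) {
        List<Row> rows = new ArrayList<>();
        for (Property p : hits) {
            rows.add(new Row(p, hostsById.get(p.hostId())));
        }
        return rows;
    }
}
```

Because each displayed result needs its own host lookup, caching hosts that appear on the current result page is one way to keep the cost of the second step small.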

3. Robust CSV Parsing

The indexer implements custom multi-line CSV handling because line-oriented parsing fails on user-generated content that contains newlines inside quoted fields. A max-errors threshold lets the run finish even if a few rows are corrupt, so a handful of bad records does not abort the whole indexing job.
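One common way to handle quoted newlines is to keep appending physical lines until the number of quote characters is even, and only then split the record into fields. The sketch below illustrates that idea with an error budget; it is not the project's parser, and the class name `MultiLineCsv`, the `expectedColumns` check, and the exception type are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: joins physical lines into logical CSV records and tolerates bad rows.
final class MultiLineCsv {
    private MultiLineCsv() {}

    static List<List<String>> read(List<String> physicalLines, int expectedColumns, int maxErrors) {
        List<List<String>> rows = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int errors = 0;
        for (String line : physicalLines) {
            if (pending.length() == 0 && line.isBlank()) continue; // skip empty lines between records
            if (pending.length() > 0) pending.append('\n');        // restore the newline inside the field
            pending.append(line);
            if (!quotesBalanced(pending)) continue;                // record continues on the next line
            List<String> fields = parse(pending.toString());
            pending.setLength(0);
            if (fields.size() == expectedColumns) {
                rows.add(fields);
            } else if (++errors > maxErrors) {
                throw new IllegalStateException("Too many malformed rows: " + errors);
            }
        }
        // An unterminated quoted field at end of input counts as one more error.
        if (pending.length() > 0 && ++errors > maxErrors) {
            throw new IllegalStateException("Too many malformed rows: " + errors);
        }
        return rows;
    }

    /** An even number of quote characters means no quoted field is still open. */
    private static boolean quotesBalanced(CharSequence s) {
        int quotes = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') quotes++;
        }
        return quotes % 2 == 0;
    }

    /** Splits one logical record on commas, honouring quotes and doubled ("") quotes. */
    static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    cur.append('"');
                    i++; // consume the second quote of the escaped pair
                } else if (c == '"') {
                    inQuotes = false;
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

Counting quote parity works because an escaped quote is written as two quote characters, so it never changes whether the total is even or odd.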


🧪 Classification Methodology & Experimental Setup

The classification experiments follow a controlled and reproducible evaluation protocol:

  • Dataset split: 70% training / 30% test using DatasetSplitter
  • Sampling: Stratified sampling with fixed seed to preserve class proportions
  • Metrics: Accuracy, Precision, Recall, and F1-score computed via ConfusionMatrixGenerator
  • Field Comparison:
    • description: Pure unstructured text
    • contents: MegaField aggregating description, name, host info, neighborhood, etc.
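To make the split protocol concrete, here is a plain-Java sketch of a seeded, stratified 70/30 split. It does not reproduce DatasetSplitter, which operates on Lucene indices; the class `StratifiedSplit` and its signature are hypothetical and only illustrate the sampling idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch: per-class shuffling with a fixed seed keeps class proportions
// roughly equal in the training and test sets.
final class StratifiedSplit {
    record Split<T>(List<T> train, List<T> test) {}

    private StratifiedSplit() {}

    static <T> Split<T> split(Map<String, List<T>> byClass, double trainRatio, long seed) {
        Random rnd = new Random(seed);
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        // TreeMap fixes the class iteration order, so the same seed gives the same split.
        for (Map.Entry<String, List<T>> e : new TreeMap<>(byClass).entrySet()) {
            List<T> shuffled = new ArrayList<>(e.getValue());
            Collections.shuffle(shuffled, rnd);
            int cut = (int) Math.round(shuffled.size() * trainRatio);
            train.addAll(shuffled.subList(0, cut));
            test.addAll(shuffled.subList(cut, shuffled.size()));
        }
        return new Split<>(train, test);
    }
}
```

Splitting each class separately is what makes the sampling stratified: a rare class such as 5+ bedrooms still contributes about 30% of its examples to the test set instead of possibly none.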

Across tasks, the MegaField consistently improves performance except when the task requires extracting very specific numeric information from text.
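As an illustration of how an aggregated contents field might be assembled, the sketch below concatenates whichever source values are present. The class name, the space separator, and the choice of inputs are assumptions; the project's actual field list is only summarized above.

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: builds one aggregated text value from several optional fields.
final class MegaField {
    private MegaField() {}

    /** Joins the non-empty parts with single spaces, skipping nulls and blanks. */
    static String build(String... parts) {
        return Stream.of(parts)
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(" "));
    }
}
```

A call such as `MegaField.build(description, name, hostAbout, neighbourhood)` would then produce the text that is analyzed and indexed as contents. Skipping missing values matters here because many listings leave optional fields empty.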

📊 Classification Tasks & Key Findings

1️⃣ Neighbourhood Group Classification (neighbourhood_group_cleansed)

  • Goal: Predict one of three macro-areas: City of Los Angeles, Other Cities, Unincorporated Areas
  • Best Classifier: ✔ SimpleNaiveBayesClassifier
  • Why it works: Geographic classification relies on strong, independent lexical cues (street names, landmarks). Naive Bayes excels when individual terms are highly predictive.
  • Best Input: MegaField (contents) - Aggregating description + host + location context enriches the geographic vocabulary.
  • Result: Accuracy ≈ 0.80, outperforming KNN.
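To show why individually predictive terms help Naive Bayes, here is a tiny multinomial Naive Bayes with add-one smoothing in plain Java. It is a teaching sketch under simplifying assumptions and makes no claim about how Lucene's SimpleNaiveBayesClassifier is implemented; the class `TinyNaiveBayes` is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: multinomial Naive Bayes over pre-tokenized documents.
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the label with the highest log-posterior, or null if nothing was trained. */
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior: fraction of training documents with this label
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String t : tokens) {
                // add-one smoothing so unseen terms do not zero out the class
                int c = counts.getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Each term contributes an independent additive log-probability, so one strongly class-specific word (a landmark or street name, for example) can shift the decision on its own.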

2️⃣ Property Type Classification (property_type)

  • Goal: Reduce 94 property types into macro-classes: Rental unit, Home, Guesthouse
  • Best Classifier: ✔ KNearestFuzzyClassifier (k=5)
  • Why Fuzzy KNN: User descriptions are ambiguous (a "guesthouse" may sound like a "home"). Fuzzy KNN assigns graded membership instead of rigid voting, handling this overlap better.
  • Best Input: MegaField (contents) - Contextual clues (house rules, host details) help disambiguate borderline cases.
  • Result: Accuracy above 0.93.
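The difference between rigid voting and graded membership can be shown with a small distance-weighted example. The sketch below follows the textbook fuzzy KNN idea (weights proportional to the inverse squared distance); it is not taken from Lucene's KNearestFuzzyClassifier, and the class `FuzzyVote`, the weighting formula, and the sample distances are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compares majority voting with distance-weighted class memberships.
final class FuzzyVote {
    record Neighbor(String label, double distance) {}

    private FuzzyVote() {}

    /** Membership per class in [0, 1]; memberships sum to 1 for a non-empty neighbor list. */
    static Map<String, Double> memberships(List<Neighbor> neighbors) {
        Map<String, Double> weightByLabel = new LinkedHashMap<>();
        double total = 0.0;
        for (Neighbor n : neighbors) {
            // closer neighbors weigh more; the small constant guards against division by zero
            double w = 1.0 / (n.distance() * n.distance() + 1e-9);
            weightByLabel.merge(n.label(), w, Double::sum);
            total += w;
        }
        final double norm = total;
        weightByLabel.replaceAll((label, w) -> w / norm);
        return weightByLabel;
    }

    /** Plain majority vote: every neighbor counts once, regardless of distance. */
    static String majority(List<Neighbor> neighbors) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (Neighbor n : neighbors) {
            votes.merge(n.label(), 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

With five neighbors where two very close ones are labeled Home and three distant ones Guesthouse, the majority vote picks Guesthouse, while the weighted memberships favor Home. That is the kind of borderline case the graded approach is meant to handle.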

3️⃣ Bedroom Count Classification (bedrooms)

  • Goal: Predict labels: 0, 1, 2, 3, 4, 5+
  • Best Classifier: ✔ KNearestFuzzyClassifier (k=5)
  • Key Insight: Description beats MegaField.
    • Bedroom count relies on explicit phrases ("studio", "two bedroom").
    • MegaFields introduce noise (prices, codes, distances). Restricting input improves signal-to-noise ratio.
  • Why Fuzzy KNN: Handles class imbalance well (many 1-2 bedrooms, few 5+).
  • Result: Accuracy ≈ 0.81 with improved recall for extreme classes.
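Since accuracy alone hides how rare classes such as 5+ bedrooms behave, per-class recall is worth computing explicitly. The sketch below shows only the arithmetic on a count matrix indexed as `matrix[actual][predicted]`; the class `ConfusionMetrics` is hypothetical and separate from Lucene's ConfusionMatrixGenerator.

```java
// Hypothetical sketch: metrics derived from a confusion matrix of counts.
final class ConfusionMetrics {
    private ConfusionMetrics() {}

    /** Correct predictions (the diagonal) divided by all predictions. */
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m[a].length; p++) {
                total += m[a][p];
                if (a == p) correct += m[a][p];
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** True positives divided by all items whose actual class is cls (row sum). */
    static double recall(int[][] m, int cls) {
        long rowSum = 0;
        for (int p = 0; p < m[cls].length; p++) rowSum += m[cls][p];
        return rowSum == 0 ? 0.0 : (double) m[cls][cls] / rowSum;
    }

    /** True positives divided by all items predicted as cls (column sum). */
    static double precision(int[][] m, int cls) {
        long colSum = 0;
        for (int[] row : m) colSum += row[cls];
        return colSum == 0 ? 0.0 : (double) m[cls][cls] / colSum;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(int[][] m, int cls) {
        double p = precision(m, cls);
        double r = recall(m, cls);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For the two-class matrix {{8, 2}, {1, 9}}, accuracy is 17/20 = 0.85, recall for class 0 is 8/10 = 0.8, and precision for class 0 is 8/9.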

📌 Practical Takeaways

  • No universal winner: Effectiveness depends on task semantics and class balance.
  • Naive Bayes is ideal for broad semantic categories with strong lexical signals.
  • Fuzzy KNN excels when classes overlap semantically or data is imbalanced.
  • Lucene Classification: In these experiments, Lucene's classification module was competitive, interpretable, and easy to integrate into an IR pipeline.

📚 Additional Resources & Text Analysis

It is recommended to consult the Javadoc for the specific version of Lucene being used (currently 10.3.0 / 10.3.1).

  • Main Documentation: Lucene 10.3.0 Documentation
  • Text Analysis: The Analysis Package is particularly important, as it handles converting input text into tokens. Lucene performs a sequence of operations on the text, such as splitting by whitespace, removing stop words, and stemming.
  • Demo: Lucene Demo
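The sequence of operations described under Text Analysis (tokenizing, stop word removal, stemming) can be mimicked in a few lines of plain Java. This toy pipeline is for illustration only: the stop word list is a small invented sample, the tokenizer splits on non-alphanumeric characters rather than only whitespace, and the suffix stripping is far cruder than the stemmer in Lucene's EnglishAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy pipeline: lowercase -> tokenize -> drop stop words -> crude stemming.
final class ToyAnalyzer {
    // Small sample list for the example; real analyzers ship longer lists.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "to", "of", "in", "is");

    private ToyAnalyzer() {}

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    /** Very crude suffix stripping, only meant to show where stemming sits in the chain. */
    static String stem(String t) {
        if (t.length() > 5 && t.endsWith("ing")) {
            return t.substring(0, t.length() - 3);
        }
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }
}
```

Running `ToyAnalyzer.analyze("The pools and walking trails")` yields the tokens pool, walk, and trail. The order of the stages matters: stop words are removed before stemming so that the stop list can be matched against unmodified words.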

👥 Authors

Project developed for the Information Retrieval (RI) course, 2025, with Emilio Guillen Alvarez.

About

Full-stack Information Retrieval system featuring a JavaFX Search UI, relational Lucene indexing, and text classification models (Naive Bayes, KNN).
