Airbnb Lucene Analyzer: Search App & Classifier

Airbnb Lucene Analyzer is a dual-purpose Information Retrieval project built with Java 21 and Apache Lucene 10. It functions as both a high-performance Search Application and a Machine Learning Classifier. It demonstrates how to build a production-grade ecosystem featuring robust CSV indexing, relational data modeling, an interactive search UI, and text classification models.

The project processes the Los Angeles Airbnb Data (June 2025) dataset obtained from Kaggle. This dataset contains the public information that the Inside Airbnb project publishes for the city of Los Angeles, California. The data was collected on June 17, 2025, and provides a detailed view of Airbnb listings at that time. It ships as a single listings.csv file with 79 columns and about 45,000 records.

🚀 Key Features

Robust Indexing Engine: Parses complex CSV data, handling multi-line fields, missing values, and data anomalies.
Relational Index Design: Implements a "Join" strategy by maintaining separate indices for Properties and Hosts, linked by ID.
Interactive Search App: A JavaFX graphical interface providing real-time search, faceted filtering (neighborhoods, property types), and geospatial queries.
Machine Learning Classifier: Evaluates algorithms (Naive Bayes, KNN, Fuzzy KNN) to categorize listings based on unstructured text (e.g., predicting neighborhood from description).
Advanced Text Analysis: Utilizes EnglishAnalyzer for stemming/stopwords on descriptions and StandardAnalyzer for exact metadata.

🛠 Prerequisites

Java: JDK 21 or higher.
Maven: 3.6+ installed.
Data: listings.csv in the project root.

📦 Build & Run

1. Compilation

We use Maven to manage dependencies and package the application.

# Clean and compile
mvn clean compile

# Create standalone JAR (includes all dependencies)
mvn clean package

Artifact created at: target/airbnb-indexer.jar

2. Indexing the Data

Before searching, you must build the indices. This process reads listings.csv and generates the Lucene indices.

java -jar target/airbnb-indexer.jar --input ./listings.csv --index-root ./index_root --mode build

--mode build: Creates new indices. Use rebuild to wipe and recreate, or update to add new records.
Output: Generates index_root/index_properties and index_root/index_hosts.

3. Launching the Search UI

The JavaFX application allows you to explore the indexed data visually.

mvn javafx:run -Djavafx.args="--index-root ./index_root"

4. Running Classifiers

Analyze the performance of Lucene's classification module.

java -cp target/airbnb-indexer.jar AirbnbClasificador --index-root ./index_root

📂 Index Structure & Schema

The project splits data into two optimized indices. Here are the key fields and their configurations:

🏠 Index: Properties (`index_properties`)

| Field | Lucene Type | Analyzer | Note |
| --- | --- | --- | --- |
| `id` | IntPoint | - | Primary Key |
| `name` | TextField | Standard | Full-text searchable |
| `description` | TextField | English | Stemmed, stopwords removed |
| `neighbourhood_cleansed` | StringField + Facet | Keyword | Normalized lowercase |
| `property_type` | StringField + Facet | Keyword | Normalized lowercase |
| `amenity` | TextField (Multivalued) | Standard | Individual amenities indexed |
| `price` | DoublePoint | - | Numeric ranges |
| `location` | LatLonPoint | - | Geo-spatial queries |
| `host_id` | StringField | Keyword | Foreign Key to Host Index |

👤 Index: Hosts (`index_hosts`)

| Field | Lucene Type | Analyzer | Note |
| --- | --- | --- | --- |
| `host_id` | StringField | Keyword | Primary Key |
| `host_name` | TextField | Standard | |
| `host_about` | TextField | English | |
| `host_response_time` | StringField + Facet | Keyword | |
| `host_is_superhost` | IntPoint (0/1) | - | Boolean flag |

🔍 Search Guide & Syntax

You can use Luke (Lucene Index Toolbox) or the provided Search App to query the data.

Common Query Syntax (Lucene Standard)

Keyword Search: beach (Finds "beach" in default field)
Specific Field: name:ocean OR description:pool
Phrase Match: description:"walking distance"
Wildcards: name:beac* (Matches beach, beacon, etc.)
Boolean Logic: pool AND wifi NOT party
Numeric Ranges: bedrooms:[2 TO 4] (Inclusive)

🏗 Architecture & Design Decisions

1. Analyzers & Normalization

We deliberately chose EnglishAnalyzer for description and host_about to handle English morphology (e.g., matching "running" with "run"). For metadata like amenities or name, StandardAnalyzer is used so that tokens are kept unstemmed and match more precisely. String identifiers (such as neighbourhood names) are normalized to lowercase to ensure case-insensitive exact matching.
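The lowercase normalization of string identifiers can be illustrated with a few lines of plain Java. This is a minimal sketch, not the project's code: the class name `FacetNormalizer` is hypothetical, and trimming plus whitespace collapsing are assumptions added on top of the lowercasing the README describes.

```java
import java.util.Locale;

/** Hypothetical helper: normalizes identifier-like values before exact-match indexing. */
final class FacetNormalizer {
    private FacetNormalizer() {}

    /** Trims, collapses inner whitespace (an assumption), and lowercases with a fixed locale. */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Venice  Beach ")); // prints "venice beach"
        System.out.println(normalize("HOLLYWOOD"));        // prints "hollywood"
    }
}
```

Because a StringField is indexed verbatim as a single token, the same normalization has to be applied to the user's filter value at query time, otherwise exact matches would silently fail.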

2. Relational Indexing

Instead of denormalizing host data into every property listing (which would duplicate "Host Name" thousands of times), we maintained a normalized schema.

Pros: Smaller index size, faster updates for host details.
Cons: Requires application-level "joins" (two queries) to display host info alongside property results.
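The application-level join can be pictured as follows. This is only a sketch under stated assumptions: the in-memory `Map` stands in for a lookup against index_hosts, and the names `HostJoinSketch`, `Property`, `Host`, and `Row` are hypothetical and do not come from the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a two-step "join": property hits first, host lookup second. */
final class HostJoinSketch {
    record Property(int id, String name, String hostId) {}
    record Host(String hostId, String hostName) {}
    record Row(Property property, Host host) {}

    /** Resolves each hit's host_id against the host lookup (stand-in for the second query). */
    static List<Row> join(List<Property> hits, Map<String, Host> hostsById) {
        List<Row> rows = new ArrayList<>();
        for (Property p : hits) {
            // host is null when the property references a host that is not in the lookup
            rows.add(new Row(p, hostsById.get(p.hostId())));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Host> hosts = Map.of("h1", new Host("h1", "Ana"));
        List<Property> hits = List.of(new Property(1, "Venice loft", "h1"));
        join(hits, hosts).forEach(r ->
                System.out.println(r.property().name() + " hosted by " + r.host().hostName()));
    }
}
```

The cost of this design is one extra lookup per displayed result, which is usually acceptable because only the current page of hits needs host details.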

3. Robust CSV Parsing

The indexer implements custom multi-line CSV handling because standard parsers often fail on user-generated content that contains newlines inside quoted fields. A max-errors threshold lets the process finish even if a few rows are corrupt, so a handful of bad records does not abort the whole run.
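One way to handle quoted newlines and tolerate a bounded number of bad rows is sketched below. It is an illustration, not the project's parser: the class `MultiLineCsvSketch`, the column-count check used to detect corrupt rows, and the exact meaning given to `maxErrors` are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: joins physical lines into logical CSV records and skips bad rows. */
final class MultiLineCsvSketch {
    /** A record continues onto the next physical line while a quote is still open. */
    static List<List<String>> parse(Iterable<String> physicalLines, int expectedColumns, int maxErrors) {
        List<List<String>> rows = new ArrayList<>();
        StringBuilder record = new StringBuilder();
        int errors = 0;
        for (String line : physicalLines) {
            if (record.length() > 0) {
                record.append('\n'); // restore the newline that belonged to the quoted field
            }
            record.append(line);
            if (hasOpenQuote(record)) {
                continue; // still inside a quoted field: keep accumulating
            }
            List<String> fields = splitFields(record.toString());
            record.setLength(0);
            if (fields.size() == expectedColumns) {
                rows.add(fields);
            } else if (++errors > maxErrors) {
                throw new IllegalStateException("Too many malformed rows: " + errors);
            }
        }
        return rows;
    }

    /** An odd number of quote characters means a quoted field has not been closed yet. */
    static boolean hasOpenQuote(CharSequence s) {
        int quotes = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') quotes++;
        }
        return quotes % 2 == 1;
    }

    /** Splits one logical record on commas, honoring quotes and doubled ("") escapes. */
    static List<String> splitFields(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    cur.append(c);
                } else if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    cur.append('"'); // doubled quote is an escaped literal quote
                    i++;
                } else {
                    inQuotes = false;
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1,Loft,\"Sunny,", "quiet \"\"gem\"\"\"", "bad,row", "2,Cabin,Cozy");
        System.out.println(parse(lines, 3, 1).size() + " valid records");
    }
}
```

Counting quote characters works because an escaped quote is written as two characters, so it never changes the parity of the count.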

🧪 Classification Methodology & Experimental Setup

The classification experiments follow a controlled and reproducible evaluation protocol:

Dataset split: 70% training / 30% test using DatasetSplitter
Sampling: Stratified sampling with fixed seed to preserve class proportions
Metrics: Accuracy, Precision, Recall, and F1-score computed via ConfusionMatrixGenerator
Field Comparison:
- description: Pure unstructured text
- contents: MegaField aggregating description, name, host info, neighborhood, etc.
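The idea behind a stratified 70/30 split with a fixed seed can be sketched in plain Java. This is not Lucene's DatasetSplitter and not the project's code; `StratifiedSplitSketch` and its records are hypothetical names, and rounding each class to the nearest whole example is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical sketch: splits each class separately so both sides keep the class proportions. */
final class StratifiedSplitSketch {
    record Example(String text, String label) {}
    record Split(List<Example> train, List<Example> test) {}

    static Split split(List<Example> data, double trainRatio, long seed) {
        // Group by label; TreeMap keeps the iteration order deterministic.
        Map<String, List<Example>> byLabel = new TreeMap<>();
        for (Example e : data) {
            byLabel.computeIfAbsent(e.label(), k -> new ArrayList<>()).add(e);
        }
        Random rng = new Random(seed); // fixed seed makes the split reproducible
        List<Example> train = new ArrayList<>();
        List<Example> test = new ArrayList<>();
        for (List<Example> group : byLabel.values()) {
            Collections.shuffle(group, rng);
            int cut = (int) Math.round(group.size() * trainRatio);
            train.addAll(group.subList(0, cut));
            test.addAll(group.subList(cut, group.size()));
        }
        return new Split(train, test);
    }

    public static void main(String[] args) {
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(new Example("rare " + i, "5+"));
        for (int i = 0; i < 20; i++) data.add(new Example("common " + i, "1"));
        Split s = split(data, 0.7, 42L);
        System.out.println(s.train().size() + " train / " + s.test().size() + " test");
    }
}
```

Splitting per class matters most for rare labels: a purely random split could leave a small class almost entirely on one side.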

Across tasks, the MegaField consistently improves performance except when the task requires extracting very specific numeric information from text.
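A MegaField of this kind is essentially a concatenation of several source fields into one text. The sketch below shows one plausible way to build it; the class `MegaFieldSketch`, the space separator, and the skipping of empty parts are assumptions, and the README does not specify the exact field order.

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: concatenates the available text parts into one "contents" string. */
final class MegaFieldSketch {
    private MegaFieldSketch() {}

    /** Joins non-null, non-blank parts with single spaces. */
    static String buildContents(String... parts) {
        return Stream.of(parts)
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String description = "Sunny loft near the beach";
        String name = "Venice Loft";
        String hostAbout = null; // missing values are simply skipped
        String neighbourhood = "venice";
        System.out.println(buildContents(description, name, hostAbout, neighbourhood));
    }
}
```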

📊 Classification Tasks & Key Findings

1️⃣ Neighbourhood Group Classification (`neighbourhood_group_cleansed`)

Goal: Predict one of three macro-areas: City of Los Angeles, Other Cities, Unincorporated Areas
Best Classifier: ✔ SimpleNaiveBayesClassifier
Why it works: Geographic classification relies on strong, independent lexical cues (street names, landmarks). Naive Bayes excels when individual terms are highly predictive.
Best Input: MegaField (contents) - Aggregating description + host + location context enriches the geographic vocabulary.
Result: Accuracy ≈ 0.80, outperforming KNN.
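To see why highly predictive individual terms favor Naive Bayes, here is a from-scratch multinomial Naive Bayes with add-one smoothing. It is a teaching sketch, not Lucene's SimpleNaiveBayesClassifier; the class `TinyNaiveBayes`, the regex tokenizer, and the toy training sentences are all invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes with add-one (Laplace) smoothing. */
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    void train(String text, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokenize(text)) {
            if (t.isEmpty()) continue;
            counts.merge(t, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the label with the highest log-posterior, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior + sum of smoothed log likelihoods, each term treated independently
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String t : tokenize(text)) {
                if (t.isEmpty()) continue;
                int c = counts.getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("Steps from the Hollywood Walk of Fame", "City of Los Angeles");
        nb.train("Quiet home near Pasadena Old Town", "Other Cities");
        System.out.println(nb.classify("Apartment by the Hollywood sign"));
    }
}
```

A single place name such as "Hollywood" shifts the score toward one class regardless of the rest of the text, which is the independence assumption working in the classifier's favor.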

2️⃣ Property Type Classification (`property_type`)

Goal: Predict a macro-class (Rental unit, Home, Guesthouse) obtained by collapsing the 94 raw property types
Best Classifier: ✔ KNearestFuzzyClassifier (k=5)
Why Fuzzy KNN: User descriptions are ambiguous (a "guesthouse" may sound like a "home"). Fuzzy KNN assigns graded membership instead of rigid voting, handling this overlap better.
Best Input: MegaField (contents) - Contextual clues (house rules, host details) help disambiguate borderline cases.
Result: Accuracy above 0.93.
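The difference between graded membership and rigid voting can be shown with a small sketch. This is not Lucene's KNearestFuzzyClassifier: it assumes the k nearest neighbors and their distances are already known, uses inverse squared distance as the weight (a common fuzzy KNN choice), and the names `FuzzyKnnSketch` and `Neighbor` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: distance-weighted class memberships instead of a majority vote. */
final class FuzzyKnnSketch {
    record Neighbor(String label, double distance) {}

    /** Returns memberships in [0, 1] that sum to 1, weighting each neighbor by 1 / d^2. */
    static Map<String, Double> memberships(List<Neighbor> neighbors) {
        Map<String, Double> weights = new TreeMap<>();
        double total = 0.0;
        for (Neighbor n : neighbors) {
            // small epsilon avoids division by zero for an exact duplicate
            double w = 1.0 / (n.distance() * n.distance() + 1e-9);
            weights.merge(n.label(), w, Double::sum);
            total += w;
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<Neighbor> neighbors = List.of(
                new Neighbor("Home", 0.2),
                new Neighbor("Guesthouse", 0.9),
                new Neighbor("Guesthouse", 1.0),
                new Neighbor("Guesthouse", 1.1),
                new Neighbor("Rental unit", 1.5));
        memberships(neighbors).forEach((label, mu) ->
                System.out.printf("%s -> %.3f%n", label, mu));
    }
}
```

In the example data a plain majority vote would pick Guesthouse (three of five neighbors), while the weighted memberships favor Home because its single neighbor is far closer.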

3️⃣ Bedroom Count Classification (`bedrooms`)

Goal: Predict labels: 0, 1, 2, 3, 4, 5+
Best Classifier: ✔ KNearestFuzzyClassifier (k=5)
Key Insight: Description beats MegaField.
- Bedroom count relies on explicit phrases ("studio", "two bedroom").
- MegaFields introduce noise (prices, codes, distances). Restricting input improves signal-to-noise ratio.
Why Fuzzy KNN: Handles class imbalance well (many 1-2 bedrooms, few 5+).
Result: Accuracy ≈ 0.81 with improved recall for extreme classes.
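Accuracy and per-class recall can tell different stories on imbalanced labels, which is why both are reported. The sketch below computes the metrics from a confusion matrix in plain Java; it is not the ConfusionMatrixGenerator API, the class name `ConfusionMetricsSketch` is hypothetical, and the matrix values in `main` are made up for illustration.

```java
/** Hypothetical sketch: metrics from a confusion matrix where m[actual][predicted] is a count. */
final class ConfusionMetricsSketch {
    private ConfusionMetricsSketch() {}

    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m[a].length; p++) {
                total += m[a][p];
                if (a == p) correct += m[a][p];
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Of everything predicted as cls, the fraction that really is cls (column-wise). */
    static double precision(int[][] m, int cls) {
        long predicted = 0;
        for (int[] row : m) predicted += row[cls];
        return predicted == 0 ? 0.0 : (double) m[cls][cls] / predicted;
    }

    /** Of everything that really is cls, the fraction that was found (row-wise). */
    static double recall(int[][] m, int cls) {
        long actual = 0;
        for (int count : m[cls]) actual += count;
        return actual == 0 ? 0.0 : (double) m[cls][cls] / actual;
    }

    static double f1(int[][] m, int cls) {
        double p = precision(m, cls);
        double r = recall(m, cls);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Invented counts: two frequent classes and one rare class (index 2).
        int[][] m = {
            {90, 10, 0},
            {8, 80, 2},
            {0, 6, 4}
        };
        System.out.printf("accuracy=%.2f recall(rare)=%.2f f1(rare)=%.2f%n",
                accuracy(m), recall(m, 2), f1(m, 2));
    }
}
```

With these invented counts the overall accuracy is 0.87 while recall on the rare class is only 0.40, so a gain in recall for extreme classes is worth reporting separately from accuracy.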

📌 Practical Takeaways

No universal winner: Effectiveness depends on task semantics and class balance.
Naive Bayes is ideal for broad semantic categories with strong lexical signals.
Fuzzy KNN excels when classes overlap semantically or data is imbalanced.
Lucene Classification: Competitive in these experiments, interpretable, and easy to integrate into IR pipelines.

📚 Additional Resources & Text Analysis

It is recommended to consult the Javadoc for the specific version of Lucene being used (currently 10.3.0 / 10.3.1).

Main Documentation: Lucene 10.3.0 Documentation
Text Analysis: The Analysis Package is particularly important, as it handles converting input text into tokens. Lucene performs a sequence of operations on the text, such as splitting by whitespace, removing stop words, and stemming.
Demo: Lucene Demo

👥 Authors

Project developed for the Information Retrieval (RI) course, 2025 with Emilio Guillen Alvarez
