Airbnb Lucene Analyzer is a dual-purpose Information Retrieval project built with Java 21 and Apache Lucene 10. It functions as both a high-performance Search Application and a Machine Learning Classifier. It demonstrates how to build a production-grade ecosystem featuring robust CSV indexing, relational data modeling, an interactive search UI, and text classification models.
The project processes the Los Angeles Airbnb Data (June 2025) dataset from Kaggle. The dataset contains the public information that the Inside Airbnb project publishes for the city of Los Angeles, California. It was collected on June 17, 2025, and gives a detailed view of Airbnb listings at that time. It ships as a single `listings.csv` file with 79 columns and about 45,000 rows.
- Robust Indexing Engine: Parses complex CSV data, handling multi-line fields, missing values, and data anomalies.
- Relational Index Design: Implements a "Join" strategy by maintaining separate indices for `Properties` and `Hosts`, linked by ID.
- Interactive Search App: A JavaFX graphical interface providing real-time search, faceted filtering (Neighborhoods, Types), and geospatial queries.
- Machine Learning Classifier: Evaluates algorithms (Naive Bayes, KNN, Fuzzy KNN) to categorize listings based on unstructured text (e.g., predicting neighborhood from description).
- Advanced Text Analysis: Utilizes `EnglishAnalyzer` for stemming/stopwords on descriptions and `StandardAnalyzer` for exact metadata.
- Java: JDK 21 or higher.
- Maven: 3.6+ installed.
- Data: `listings.csv` in the project root.
We use Maven to manage dependencies and package the application.

    # Clean and compile
    mvn clean compile

    # Create standalone JAR (includes all dependencies)
    mvn clean package

Artifact created at: `target/airbnb-indexer.jar`

Before searching, you must build the indices. This process reads listings.csv and generates the Lucene indices.

    java -jar target/airbnb-indexer.jar --input ./listings.csv --index-root ./index_root --mode build

- `--mode build`: Creates new indices. Use `rebuild` to wipe and recreate, or `update` to add new records.
- Output: Generates `index_root/index_properties` and `index_root/index_hosts`.

The JavaFX application allows you to explore the indexed data visually.

    mvn javafx:run -Djavafx.args="--index-root ./index_root"

Analyze the performance of Lucene's classification module.

    java -cp target/airbnb-indexer.jar AirbnbClasificador --index-root ./index_root

The project splits data into two optimized indices. Here are the key fields and their configurations:

Properties index (`index_properties`):

| Field | Lucene Type | Analyzer | Note |
|---|---|---|---|
| `id` | IntPoint | - | Primary Key |
| `name` | TextField | Standard | Full-text searchable |
| `description` | TextField | English | Stemmed, stopwords removed |
| `neighbourhood_cleansed` | StringField + Facet | Keyword | Normalized lowercase |
| `property_type` | StringField + Facet | Keyword | Normalized lowercase |
| `amenity` | TextField (Multivalued) | Standard | Individual amenities indexed |
| `price` | DoublePoint | - | Numeric ranges |
| `location` | LatLonPoint | - | Geo-spatial queries |
| `host_id` | StringField | Keyword | Foreign Key to Host Index |

Hosts index (`index_hosts`):

| Field | Lucene Type | Analyzer | Note |
|---|---|---|---|
| `host_id` | StringField | Keyword | Primary Key |
| `host_name` | TextField | Standard | |
| `host_about` | TextField | English | |
| `host_response_time` | StringField + Facet | Keyword | |
| `host_is_superhost` | IntPoint (0/1) | - | Boolean flag |

You can use Luke (Lucene Index Toolbox) or the provided Search App to query the data.
- Keyword Search: `beach` (Finds "beach" in default field)
- Specific Field: `name:ocean` OR `description:pool`
- Phrase Match: `description:"walking distance"`
- Wildcards: `name:beac*` (Matches beach, beacon, etc.)
- Boolean Logic: `pool AND wifi NOT party`
- Numeric Ranges: `bedrooms:[2 TO 4]` (Inclusive)

We deliberately chose EnglishAnalyzer for description and host_about to handle English morphology (e.g., matching "running" with "run"). For metadata like amenities or name, StandardAnalyzer is used to preserve more specific tokens. String identifiers (Neighborhoods) are normalized to lowercase to ensure case-insensitive exact matching.
Instead of denormalizing host data into every property listing (which would duplicate "Host Name" thousands of times), we maintained a normalized schema.
- Pros: Smaller index size, faster updates for host details.
- Cons: Requires application-level "joins" (two queries) to display host info alongside property results.
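
The two-query "join" described above can be sketched in plain Java. This is only an illustration of the pattern, not the project's code: in-memory maps stand in for the two Lucene indices, and the `PropertyHit` and `Host` records and the `joinHosts` method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of an application-level join; a plain map stands in for the host index. */
final class HostJoinSketch {

    // Hypothetical, simplified views of a property hit and a host document.
    record PropertyHit(String id, String name, String hostId) {}
    record Host(String hostId, String hostName) {}

    /**
     * Query 1 has already produced the property hits. Query 2 looks up each
     * distinct host_id once, and the results are merged in application code.
     */
    static Map<PropertyHit, Host> joinHosts(List<PropertyHit> hits, Map<String, Host> hostIndex) {
        // Collect distinct foreign keys so each host is fetched only once.
        Set<String> hostIds = new LinkedHashSet<>();
        for (PropertyHit hit : hits) {
            hostIds.add(hit.hostId());
        }
        // Stand-in for the second query against the host index.
        Map<String, Host> fetched = new LinkedHashMap<>();
        for (String hostId : hostIds) {
            Host host = hostIndex.get(hostId);
            if (host != null) {
                fetched.put(hostId, host);
            }
        }
        // Merge: attach the host (or null if it is missing) to every property hit.
        Map<PropertyHit, Host> joined = new LinkedHashMap<>();
        for (PropertyHit hit : hits) {
            joined.put(hit, fetched.get(hit.hostId()));
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Map<String, Host> hostIndex = Map.of("h1", new Host("h1", "Maria"));
        List<PropertyHit> hits = List.of(
                new PropertyHit("10", "Venice Beach Loft", "h1"),
                new PropertyHit("11", "Downtown Studio", "h2"));
        joinHosts(hits, hostIndex).forEach((hit, host) ->
                System.out.println(hit.name() + " -> " + (host == null ? "unknown host" : host.hostName())));
    }
}
```

Deduplicating the `host_id` values before the second lookup keeps the extra cost proportional to the number of distinct hosts on a result page rather than the number of listings.
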
The indexer implements custom multi-line CSV handling because standard parsers often fail on user-generated content that contains newlines inside quoted fields. It also counts malformed rows against a max-errors limit, so the run can finish even if a few rows are corrupt.
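
One way to handle quoted newlines with an error budget is sketched below. It is a minimal illustration under stated assumptions, not the indexer's actual code: the class and method names are hypothetical, a record is considered complete once its double quotes are balanced, and a row counts as malformed when its column count differs from the expected one.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of quote-aware CSV reading with an error budget (all names are illustrative). */
final class MultiLineCsvSketch {

    /** Splits one logical record into fields, honoring quotes and "" escapes. */
    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);   // commas and newlines are literal here
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** True when the text has an odd number of quotes, i.e. a quoted field is still open. */
    static boolean hasOpenQuote(CharSequence text) {
        int quotes = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') quotes++;
        }
        return quotes % 2 == 1;
    }

    /** Joins physical lines into logical records and tolerates up to maxErrors bad rows. */
    static List<List<String>> readAll(Iterable<String> physicalLines, int expectedColumns, int maxErrors) {
        List<List<String>> rows = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int errors = 0;
        for (String line : physicalLines) {
            if (pending.length() > 0) pending.append('\n');
            pending.append(line);
            if (hasOpenQuote(pending)) continue; // record continues on the next line
            List<String> fields = parseRecord(pending.toString());
            pending.setLength(0);
            if (fields.size() == expectedColumns) {
                rows.add(fields);
            } else if (++errors > maxErrors) {
                throw new IllegalStateException("Too many malformed rows: " + errors);
            }
        }
        // An unterminated quoted field at end of input also counts as one error.
        if (pending.length() > 0 && ++errors > maxErrors) {
            throw new IllegalStateException("Too many malformed rows: " + errors);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample: row 1 spans two physical lines, row 2 is malformed.
        List<String> lines = List.of(
                "id,name,description",
                "1,Loft,\"Bright loft,",
                "near the beach\"",
                "2,broken row",
                "3,Studio,\"Says \"\"hi\"\"\"");
        for (List<String> row : readAll(lines, 3, 5)) {
            System.out.println(row);
        }
    }
}
```

Counting quotes works because an escaped quote (`""`) contributes two characters and so never changes the parity.
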
The classification experiments follow a controlled and reproducible evaluation protocol:
- Dataset split: 70% training / 30% test using `DatasetSplitter`
- Sampling: Stratified sampling with fixed seed to preserve class proportions
- Metrics: Accuracy, Precision, Recall, and F1-score computed via `ConfusionMatrixGenerator`
- Field Comparison:
  - `description`: Pure unstructured text
  - `contents`: MegaField aggregating description, name, host info, neighborhood, etc.

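
The idea behind a stratified 70/30 split with a fixed seed can be sketched in plain Java. This illustrates the sampling scheme only and makes no claim about how `DatasetSplitter` works internally; the `Example` and `Split` records and the `split` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Sketch of a reproducible stratified train/test split (names are illustrative). */
final class StratifiedSplitSketch {

    record Example(String text, String label) {}
    record Split(List<Example> train, List<Example> test) {}

    static Split split(List<Example> data, double trainRatio, long seed) {
        // Group examples by class; TreeMap keeps the iteration order deterministic.
        Map<String, List<Example>> byLabel = new TreeMap<>();
        for (Example e : data) {
            byLabel.computeIfAbsent(e.label(), k -> new ArrayList<>()).add(e);
        }
        // A single seeded generator makes the whole split repeatable.
        Random random = new Random(seed);
        List<Example> train = new ArrayList<>();
        List<Example> test = new ArrayList<>();
        for (List<Example> group : byLabel.values()) {
            Collections.shuffle(group, random);
            // Cut every class at the same ratio so class proportions are preserved.
            int cut = (int) Math.round(group.size() * trainRatio);
            train.addAll(group.subList(0, cut));
            test.addAll(group.subList(cut, group.size()));
        }
        return new Split(train, test);
    }

    public static void main(String[] args) {
        // Made-up, imbalanced toy data: 10 examples of class A, 20 of class B.
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(new Example("a" + i, "A"));
        for (int i = 0; i < 20; i++) data.add(new Example("b" + i, "B"));
        Split split = split(data, 0.7, 42L);
        System.out.println("train=" + split.train().size() + " test=" + split.test().size());
    }
}
```

Splitting each class separately is what keeps rare classes represented in the test set; a plain random 70/30 cut gives no such guarantee.
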
Across tasks, the MegaField consistently improves performance except when the task requires extracting very specific numeric information from text.
- Goal: Predict one of three macro-areas: City of Los Angeles, Other Cities, Unincorporated Areas
- Best Classifier: `SimpleNaiveBayesClassifier`
- Why it works: Geographic classification relies on strong, independent lexical cues (street names, landmarks). Naive Bayes excels when individual terms are highly predictive.
- Best Input: MegaField (`contents`). Aggregating description + host + location context enriches the geographic vocabulary.
- Result: Accuracy ≈ 0.80, outperforming KNN.

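
To make the "highly predictive terms" argument concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is a textbook formulation written for illustration, not the internals of Lucene's `SimpleNaiveBayesClassifier`; the class name, the tokenizer, and the toy training sentences are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Textbook multinomial Naive Bayes with add-one smoothing (illustrative only). */
final class NaiveBayesSketch {

    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Very rough tokenizer: lowercase, split on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    void train(String text, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            if (term.isEmpty()) continue;
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    /** Returns the class with the highest log prior plus summed log term likelihoods. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int total = totalTerms.getOrDefault(label, 0);
            for (String term : tokenize(text)) {
                if (term.isEmpty()) continue;
                // Each term contributes independently; unseen terms get a smoothed count of 1.
                int count = counts.getOrDefault(term, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up training sentences for two of the macro-areas.
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Steps from Venice Beach boardwalk", "City of Los Angeles");
        nb.train("Quiet home near Pasadena Old Town", "Other Cities");
        System.out.println(nb.classify("Walk to the Venice boardwalk"));
    }
}
```

Because every term adds its own log-likelihood to the score, a few strongly class-specific words such as a landmark name can decide the outcome on their own.
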
- Goal: Reduce 94 property types into macro-classes: Rental unit, Home, Guesthouse
- Best Classifier: `KNearestFuzzyClassifier` (k=5)
- Why Fuzzy KNN: User descriptions are ambiguous (a "guesthouse" may sound like a "home"). Fuzzy KNN assigns graded membership instead of rigid voting, handling this overlap better.
- Best Input: MegaField (`contents`). Contextual clues (house rules, host details) help disambiguate borderline cases.
- Result: Accuracy above 0.93.

- Goal: Predict labels: 0, 1, 2, 3, 4, 5+
- Best Classifier: `KNearestFuzzyClassifier` (k=5)
- Key Insight: Description beats MegaField.
  - Bedroom count relies on explicit phrases ("studio", "two bedroom").
  - MegaFields introduce noise (prices, codes, distances). Restricting input improves signal-to-noise ratio.
- Why Fuzzy KNN: Handles class imbalance well (many 1-2 bedrooms, few 5+).
- Result: Accuracy ≈ 0.81 with improved recall for extreme classes.

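
Accuracy and per-class recall measure different things, which matters for imbalanced labels such as bedroom counts. The sketch below shows how the reported metrics are derived from a confusion matrix of counts. It is a plain-Java illustration with hypothetical names and made-up numbers, not the output or the API of `ConfusionMatrixGenerator`.

```java
/** Sketch: accuracy, precision, recall and F1 from a confusion matrix of counts. */
final class ConfusionMetricsSketch {

    /** matrix[actual][predicted] holds the number of test documents. */
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < matrix.length; a++) {
            for (int p = 0; p < matrix[a].length; p++) {
                total += matrix[a][p];
                if (a == p) correct += matrix[a][p];
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Recall: of all documents that truly belong to cls, how many were found. */
    static double recall(int[][] matrix, int cls) {
        long rowSum = 0;
        for (int p = 0; p < matrix[cls].length; p++) rowSum += matrix[cls][p];
        return rowSum == 0 ? 0.0 : (double) matrix[cls][cls] / rowSum;
    }

    /** Precision: of all documents predicted as cls, how many were correct. */
    static double precision(int[][] matrix, int cls) {
        long colSum = 0;
        for (int[] row : matrix) colSum += row[cls];
        return colSum == 0 ? 0.0 : (double) matrix[cls][cls] / colSum;
    }

    /** F1: harmonic mean of precision and recall. */
    static double f1(int[][] matrix, int cls) {
        double p = precision(matrix, cls);
        double r = recall(matrix, cls);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up counts for three classes; the last class is rare.
        int[][] matrix = {
                {50, 5, 0},
                {8, 30, 2},
                {1, 3, 6}
        };
        System.out.printf("accuracy=%.3f%n", accuracy(matrix));
        for (int cls = 0; cls < matrix.length; cls++) {
            System.out.printf("class %d: precision=%.3f recall=%.3f f1=%.3f%n",
                    cls, precision(matrix, cls), recall(matrix, cls), f1(matrix, cls));
        }
    }
}
```

A classifier can score high accuracy while missing most examples of a rare class, so per-class recall is the number to watch for labels like 0 or 5+ bedrooms.
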
- No universal winner: Effectiveness depends on task semantics and class balance.
- Naive Bayes is ideal for broad semantic categories with strong lexical signals.
- Fuzzy KNN excels when classes overlap semantically or data is imbalanced.
- Lucene Classification: Proven to be competitive, interpretable, and easy to integrate into IR pipelines.
It is recommended to consult the Javadoc for the specific version of Lucene being used (currently 10.3.0 / 10.3.1).
- Main Documentation: Lucene 10.3.0 Documentation
- Text Analysis: The Analysis Package is particularly important, as it handles converting input text into tokens. Lucene performs a sequence of operations on the text, such as splitting by whitespace, removing stop words, and stemming.
- Demo: Lucene Demo
Project developed for the Information Retrieval (RI) course, 2025 with Emilio Guillen Alvarez
