AI/ML, Research
Airbnb NYC Popularity — Explainable ML

Machine Learning

Using Random Forest and SHAP analysis to predict and explain Airbnb listing popularity

Data Scientist & Author
Academic Project
Noor Naila Imtinan Himam
Python, scikit-learn, SHAP, BayesSearchCV, matplotlib, seaborn
Overview

This machine learning project predicts how many reviews an Airbnb listing might receive per month using the 2019 NYC Airbnb dataset (48,895 listings, 16 features). Through iterative model development, from a baseline DummyRegressor to an optimized Random Forest, and SHAP-based explainability analysis, the project turns predictions into actionable strategies for hosts. The final model achieved R2 = 0.59, meaning it explains about 59% of the variance in reviews per month, the project's measure of listing popularity.

The Problem

With nearly 49,000 Airbnb listings in New York City, hosts face intense competition. Understanding what drives listing popularity - and being able to predict it - could help hosts optimize their profiles and help Airbnb improve its recommendation engine. But popularity depends on a complex mix of location, pricing, availability, and guest feedback, making simple heuristics insufficient.

48,895 Airbnb listings in the dataset

16 features analyzed

R2 = 0.59 for the final model (share of variance explained, not classification accuracy)

Process
📊

EDA

Exploratory Data Analysis

Analyzed price distributions across boroughs (Manhattan median ~$150, Bronx ~$65), availability patterns, room type preferences, and correlations between features and review counts.
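The borough comparison above can be sketched with a pandas group-by. This is only an illustration: the six rows are made-up values chosen so the medians land on the quoted figures, and the column names `neighbourhood_group` and `price` are assumed to follow the public 2019 NYC Airbnb CSV.

```python
import pandas as pd

# Made-up sample rows, NOT the real data. Column names are an assumption
# based on the public 2019 NYC Airbnb CSV.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "price": [120, 150, 400, 50, 65, 90],
})

# Use the median rather than the mean: nightly prices are right-skewed,
# so a few very expensive listings would otherwise dominate.
median_price = listings.groupby("neighbourhood_group")["price"].median()
print(median_price)
```

The same group-by applied to the full dataset would give the per-borough medians that the violin plots visualize.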

Violin Plots, Correlation Analysis, Distribution Analysis
📏

Baseline

Establishing a Baseline

Started with a DummyRegressor (mean strategy) as the baseline: MAE 1.15, R2 -3.33. A negative R2 means the predictions fit the evaluation data worse than a constant equal to that data's own mean. This set the floor for meaningful model improvement.
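A minimal sketch of such a baseline, run on synthetic data from `make_regression` rather than the Airbnb listings (the data, split, and variable names are assumptions for illustration). A constant prediction can never score an R2 above zero on the data it is evaluated on, which is why the baseline is a floor rather than a target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listing features and reviews-per-month target.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The "mean" strategy ignores X and always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_mae = mean_absolute_error(y_test, y_pred)
baseline_r2 = r2_score(y_test, y_pred)
print(f"MAE={baseline_mae:.3f}  R2={baseline_r2:.3f}")
```

Any later model has to beat both numbers on the same split to count as an improvement.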

DummyRegressor, Baseline Metrics
📈

Linear Models

Ridge Regression

Applied Ridge Regression with GridSearchCV over alpha values. Achieved MAE 0.81 (down from the baseline's 1.15) and R2 0.38, a clear improvement that still left most of the variance unexplained.
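The alpha search might look like the sketch below. The synthetic data, the scaling step, the alpha grid, and the scoring choice are all assumptions, since the page does not list the values actually searched.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the Airbnb listings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Scaling inside the pipeline keeps each CV fold free of leakage and makes
# the L2 penalty treat all features on the same scale.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Hypothetical grid; the project's real alpha values are not stated.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_cv_mae = -search.best_score_  # scikit-learn negates error metrics
print(f"best alpha={best_alpha}, CV MAE={best_cv_mae:.3f}")
```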

Ridge Regression, GridSearchCV, Cross-Validation
🔄

Model Comparison

Comparing Advanced Models

Tested KNN Regressor (CV MSE: 1.49), SVR (CV MSE: 1.44), and Random Forest (CV MSE: 1.20). Random Forest had the lowest cross-validated error of the three and was carried forward.
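A comparison like this can be run with `cross_val_score` so that every model is scored on the same folds. The sketch below uses synthetic data and default hyperparameters (both assumptions), so its ranking need not match the one reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption), not the Airbnb listings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# KNN and SVR are distance-based, so they get a scaler; trees do not need one.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=5
    )
    cv_mse[name] = -scores.mean()  # flip the sign back to a positive MSE

# Print from lowest (best) to highest cross-validated error.
for name, mse in sorted(cv_mse.items(), key=lambda item: item[1]):
    print(f"{name}: CV MSE = {mse:.2f}")
```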

KNN, SVR, Random Forest, CV Comparison

Optimization

Bayesian Hyperparameter Tuning

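Used BayesSearchCV to tune the Random Forest. Best parameters found: n_estimators = 115, max_depth = 14, max_features = sqrt. Best CV MSE: 1.13, compared with 1.20 for the Random Forest in the model comparison.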
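The sketch below does not rerun the Bayesian search (BayesSearchCV comes from the separate scikit-optimize package). It only shows how the reported best parameters map onto scikit-learn's `RandomForestRegressor`, using synthetic data as an assumed stand-in.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), not the Airbnb listings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hyperparameters as reported for the tuned model; random_state is added
# here only to make the sketch reproducible.
tuned_forest = RandomForestRegressor(
    n_estimators=115,
    max_depth=14,
    max_features="sqrt",
    random_state=0,
)

# cross_val_score fits clones, so the estimator itself is fitted afterwards.
scores = cross_val_score(
    tuned_forest, X, y, scoring="neg_mean_squared_error", cv=5
)
tuned_cv_mse = -scores.mean()
tuned_forest.fit(X, y)
print(f"CV MSE with reported parameters: {tuned_cv_mse:.2f}")
```

Capping `max_depth` and sampling `sqrt` of the features at each split both limit how closely individual trees can fit the training data, which is the usual reason tuned forests generalize slightly better than defaults.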

BayesSearchCV, Hyperparameter Tuning
🔍

Explainability

SHAP Feature Importance

Applied SHAP TreeExplainer to understand which features drive predictions. Number of reviews, minimum nights, and availability were the top drivers - not price or location.
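For readers new to SHAP, the ranking above rests on two standard relations, written here in generic notation (the symbols are introduced for illustration and are not taken from the project). Each prediction decomposes into a base value plus one contribution per feature, and a feature's global importance is usually summarized as its mean absolute contribution over the dataset:

```latex
% Local accuracy: a prediction is the base value plus per-feature contributions
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x), \qquad \phi_0 = \mathbb{E}[f(X)]

% Global importance of feature j over N listings
I_j = \frac{1}{N} \sum_{i=1}^{N} \left| \phi_j\!\left(x^{(i)}\right) \right|
```

Here $M$ is the number of features, $\phi_j(x)$ is the SHAP value of feature $j$ for listing $x$, and a large $I_j$ is what places a feature among the "top drivers".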

SHAP, TreeExplainer, Feature Importance
Final Model Performance

Mean Absolute Error (MAE): 0.578

Mean Squared Error (MSE): 0.986

Root MSE (RMSE): 0.993

R-squared (R2): 0.593

Best model: Random Forest (optimized)

Optimization method: BayesSearchCV
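The four metrics are related, and a quick check shows the reported RMSE is simply the square root of the reported MSE. The sketch below computes all four on a tiny made-up example (the arrays are assumptions, not project data) and then repeats the square-root check on the published figure.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([0.5, 1.0, 2.0, 4.0])
y_pred = np.array([0.7, 0.9, 1.5, 3.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = math.sqrt(mse)                      # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Consistency check on the reported figures: sqrt(0.986) rounds to 0.993.
reported_rmse_check = math.sqrt(0.986)
print(round(reported_rmse_check, 3))
```

Because the MSE is close to 1, the RMSE and MSE are nearly equal here. The MAE is noticeably smaller than the RMSE, which suggests that a minority of listings with large errors contribute disproportionately to the squared-error metrics.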

SHAP-Driven Insights for Hosts
1. Number of existing reviews is the most impactful feature. Encouraging guests to leave reviews may create a compounding popularity effect.

2. Shorter minimum-night requirements are associated with more reviews per month. Flexibility appears to attract more guests.

3. Year-round availability strongly correlates with more reviews. Keeping the calendar open appears to pay off.

4. Location and price had a lower impact than expected. In this model, popularity tracks listing behavior more than geography.

5. Explainable AI (SHAP) turns a black-box model into actionable business recommendations.

What I Learned
1. Iterative model development

Progressing from baseline to Ridge to Random Forest with Bayesian optimization showed how each step provides meaningful improvement - and when to stop iterating.

2. Explainability > accuracy alone

A model that explains 59% of variance but reveals actionable insights (via SHAP) is more valuable than a marginally better black-box model.

3. Feature engineering matters

The dataset lacked temporal and seasonal features, limiting accuracy. Understanding what data you don't have is as important as modeling what you do.

4. Writing for a technical audience

Publishing the analysis on Medium taught me to balance technical depth with readability - making data science accessible without oversimplifying.

scikit-learn (RF, Ridge, KNN, SVR), SHAP Explainability, BayesSearchCV, Exploratory Data Analysis, Technical Writing (Medium), Python Data Science Stack