Uber Fare Price Predictor

Fall 2024

Python, Scikit-learn, Pandas, Matplotlib/Seaborn

Problem

Uber shows a trip's price only when a rider is ready to book, even though millions of people rely on ridesharing every month. This project builds a machine-learning system that predicts a ride's price in advance from historical trip data, so riders can estimate and plan their transportation costs.

Dataset & Features

The model was trained on a large Kaggle dataset of Uber rides containing fare amount, pickup and drop-off GPS coordinates, timestamps, and passenger count. From this data, I engineered additional features such as day of week, hour of day, and trip distance computed from latitude and longitude.
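The feature engineering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (pickup_datetime, pickup_latitude, and so on) are assumed, and the haversine great-circle formula is one common way to turn two GPS points into a distance; the project may have used a different distance measure.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_features(df):
    """Return a copy of df with hour, day_of_week, and distance_km columns.

    Column names are assumptions made for this sketch.
    """
    out = df.copy()
    ts = pd.to_datetime(out["pickup_datetime"], utc=True)
    out["hour"] = ts.dt.hour                # 0..23
    out["day_of_week"] = ts.dt.dayofweek    # Monday = 0, Sunday = 6
    out["distance_km"] = haversine_km(
        out["pickup_latitude"], out["pickup_longitude"],
        out["dropoff_latitude"], out["dropoff_longitude"],
    )
    return out
```

Calling add_features on the raw ride table yields the three engineered columns alongside the original ones, ready for cleaning and scaling.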

Preprocessing Pipeline

Removed incomplete and invalid trips
Filtered outliers in fare, distance, and GPS coordinates
Engineered temporal features (day of week, hour of day)
Computed trip distance from pickup and drop-off coordinates
Standardized numerical features to zero mean and unit variance
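The cleaning and scaling steps listed above might look something like the sketch below. The thresholds (maximum fare, maximum distance, passenger range) and the column names are hypothetical placeholders chosen for illustration; the actual cut-offs used in the project are not stated here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_trips(df, max_fare=100.0, max_distance_km=100.0):
    """Drop incomplete rows and rows outside plausible ranges.

    All thresholds and column names are illustrative assumptions.
    """
    out = df.dropna()
    out = out[(out["fare_amount"] > 0) & (out["fare_amount"] <= max_fare)]
    out = out[(out["distance_km"] > 0) & (out["distance_km"] <= max_distance_km)]
    out = out[out["passenger_count"].between(1, 6)]
    # Coordinates must at least be valid latitudes and longitudes.
    for col in ("pickup_latitude", "dropoff_latitude"):
        out = out[out[col].between(-90, 90)]
    for col in ("pickup_longitude", "dropoff_longitude"):
        out = out[out[col].between(-180, 180)]
    return out.reset_index(drop=True)


def standardize(df, columns):
    """Scale the given columns to zero mean and unit variance."""
    scaler = StandardScaler()
    scaled = df.copy()
    scaled[columns] = scaler.fit_transform(df[columns])
    return scaled, scaler
```

Returning the fitted scaler lets the same transformation be reused on new rides. In a train/test setup, the scaler would normally be fitted on the training split only and then applied to the test split.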

Models

I trained both supervised and unsupervised models to predict fares. Linear regression and random forest regressors were used as supervised baselines, while a Gaussian Mixture Model (GMM) was used to cluster trips into pricing regimes such as short city rides or longer rush-hour trips. New rides were assigned to a cluster and priced using that cluster’s average fare.
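The cluster-then-price idea can be sketched as below. This is one possible reading of the approach, under stated assumptions: the GMM is fitted on the trip features only (fares are not part of the clustering input), each ride is assigned to its most probable component, and the class name ClusterMeanFarePredictor and the default number of components are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class ClusterMeanFarePredictor:
    """Assign each ride to a GMM component and predict that component's mean fare."""

    def __init__(self, n_components=8, random_state=0):
        # n_components is an illustrative default, not the project's setting.
        self.gmm = GaussianMixture(n_components=n_components, random_state=random_state)

    def fit(self, X, fares):
        X = np.asarray(X, dtype=float)
        fares = np.asarray(fares, dtype=float)
        labels = self.gmm.fit_predict(X)
        # Components that end up with no training rides fall back to the overall mean.
        self.cluster_means_ = np.full(self.gmm.n_components, fares.mean())
        for k in np.unique(labels):
            self.cluster_means_[k] = fares[labels == k].mean()
        return self

    def predict(self, X):
        labels = self.gmm.predict(np.asarray(X, dtype=float))
        return self.cluster_means_[labels]
```

A predictor like this can only output as many distinct fares as there are components, so the number of components controls how finely the pricing regimes are resolved.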

Results

Gaussian Mixture Model: R² = 0.801, MAE = 1.266, RMSE = 1.660
Random Forest: R² = 0.663, MAE = 1.547, RMSE = 2.209
Linear Regression: R² = 0.651, MAE = 1.574, RMSE = 2.209
GMM reduced error by over 40% compared to guessing the average fare
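For reference, metrics of this kind can be computed with scikit-learn as in the sketch below. The comparison against "guessing the average fare" is modelled here as the relative reduction in RMSE versus a constant prediction equal to the mean of the evaluated fares; that choice of metric and baseline is an assumption, since the figures above do not say which error measure the percentage refers to.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return R², MAE, RMSE, and the RMSE reduction relative to a mean-fare baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    # Baseline: always predict the average fare (assumed definition).
    baseline = np.full_like(y_true, y_true.mean())
    baseline_rmse = float(np.sqrt(mean_squared_error(y_true, baseline)))
    return {
        "r2": float(r2_score(y_true, y_pred)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": rmse,
        "rmse_reduction_vs_mean": 1.0 - rmse / baseline_rmse,
    }
```

Running evaluate on the held-out fares and each model's predictions gives one row of the comparison per model.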

Why It Works

The unsupervised GMM captured natural groupings of trips along features such as distance, passenger count, and time of day. Because each group carries its own average fare, the model could follow nonlinear pricing patterns more effectively than the regression baselines.