
Uber Fare Price Predictor
Problem
Millions of people rely on ridesharing every month, yet Uber shows the cost of a trip only when a rider is ready to book. This project builds a machine-learning system that predicts a ride's price in advance from historical trip data, so riders can estimate and plan their transportation costs.
Dataset & Features
The models were trained on a large Kaggle dataset of Uber rides containing fare amount, pickup and drop-off GPS coordinates, timestamps, and passenger count. From this data, I engineered additional features: day of week, hour of day, and trip distance computed from the pickup and drop-off latitude and longitude.
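The engineered features could be derived along the following lines. This is a minimal sketch, not the project's actual code: the haversine (great-circle) formula is one common way to turn two latitude/longitude pairs into a distance, and the write-up does not say which formula was used. The dictionary keys and the timestamp format are assumptions chosen for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def engineer_features(trip):
    """Build model features from one raw trip record.

    The key names and the "%Y-%m-%d %H:%M:%S" timestamp format are
    assumptions for this sketch, not confirmed column names.
    """
    ts = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,
        "distance_km": haversine_km(
            trip["pickup_latitude"], trip["pickup_longitude"],
            trip["dropoff_latitude"], trip["dropoff_longitude"],
        ),
        "passenger_count": trip["passenger_count"],
    }


# Illustrative record (made-up values, not taken from the dataset).
example_trip = {
    "pickup_datetime": "2015-05-07 19:52:06",
    "pickup_latitude": 40.7384, "pickup_longitude": -73.9998,
    "dropoff_latitude": 40.7233, "dropoff_longitude": -73.9995,
    "passenger_count": 1,
}
features = engineer_features(example_trip)
print(features)
```

The haversine distance is a straight-line estimate over the Earth's surface, so it understates the driven route length, but it needs nothing beyond the coordinates already in the data.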
Preprocessing Pipeline
- Removed incomplete and invalid trips
- Filtered outliers in fare, distance, and GPS coordinates
- Engineered temporal features (day of week, hour of day)
- Computed trip distance from pickup and drop-off coordinates
- Standardized numerical features to zero mean and unit variance
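The filtering and standardization steps above can be sketched as follows. The thresholds (MAX_FARE, MAX_DISTANCE_KM) and the field names are hypothetical placeholders, since the write-up does not state the actual cut-offs; only the shape of the logic is meant to carry over.

```python
from statistics import mean, pstdev

# Hypothetical cut-offs for this sketch; the real thresholds are not stated.
MAX_FARE = 100.0
MAX_DISTANCE_KM = 100.0

REQUIRED = ("fare", "distance_km", "pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon")


def is_valid(trip):
    """Return True if a trip is complete and inside plausible ranges."""
    # Drop incomplete records.
    if any(trip.get(key) is None for key in REQUIRED):
        return False
    # Drop impossible GPS coordinates.
    lats_ok = all(-90.0 <= trip[k] <= 90.0 for k in ("pickup_lat", "dropoff_lat"))
    lons_ok = all(-180.0 <= trip[k] <= 180.0 for k in ("pickup_lon", "dropoff_lon"))
    if not (lats_ok and lons_ok):
        return False
    # Drop fare and distance outliers.
    return 0.0 < trip["fare"] <= MAX_FARE and 0.0 < trip["distance_km"] <= MAX_DISTANCE_KM


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Illustrative records (made-up values).
good = {"fare": 7.5, "distance_km": 1.7, "pickup_lat": 40.74, "pickup_lon": -74.0,
        "dropoff_lat": 40.72, "dropoff_lon": -74.0}
bad = dict(good, pickup_lat=412.3)  # latitude outside [-90, 90]

clean = [t for t in (good, bad) if is_valid(t)]
z_scores = standardize([1.0, 2.0, 3.0])
```

In practice the mean and standard deviation would be computed on the training split only and reused for new rides, so that test data does not leak into the scaling.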
Models
I trained both supervised and unsupervised models to predict fares. Linear regression and a random forest regressor served as supervised baselines, while a Gaussian Mixture Model (GMM) clustered trips into pricing regimes such as short city rides or longer rush-hour trips. Each new ride was assigned to a cluster and priced at that cluster’s average fare.
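The cluster-then-price idea can be sketched with scikit-learn's GaussianMixture. This is an illustration under stated assumptions rather than the project's implementation: the number of components, the random seed, the use of hard cluster assignment, and the synthetic two-feature data are all choices made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_cluster_pricer(X, fares, n_components, seed=0):
    """Fit a GMM on trip features and record the mean fare of each cluster."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    labels = gmm.fit_predict(X)  # hard assignment: most likely component per trip
    overall_mean = fares.mean()
    cluster_fares = np.array([
        # Fall back to the overall mean if a component ends up with no trips.
        fares[labels == k].mean() if np.any(labels == k) else overall_mean
        for k in range(n_components)
    ])
    return gmm, cluster_fares


def predict_fares(gmm, cluster_fares, X_new):
    """Price each new ride at the average fare of its assigned cluster."""
    return cluster_fares[gmm.predict(X_new)]


# Synthetic, standardized features for two made-up regimes (not real trip data).
rng = np.random.default_rng(0)
short_trips = rng.normal(loc=[-1.0, -1.0], scale=0.2, size=(100, 2))
long_trips = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(100, 2))
X = np.vstack([short_trips, long_trips])
fares = np.concatenate([rng.normal(6.0, 0.5, 100), rng.normal(20.0, 0.5, 100)])

gmm, cluster_fares = fit_cluster_pricer(X, fares, n_components=2)
new_rides = np.array([[-0.9, -1.1], [1.1, 0.9]])
predicted = predict_fares(gmm, cluster_fares, new_rides)
print(predicted)
```

Because every ride in a cluster receives the same price, the number of components controls how fine-grained the predictions can be. A softer variant would weight the cluster averages by gmm.predict_proba instead of picking a single cluster.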
Results
- Gaussian Mixture Model: R² = 0.801, MAE = 1.266, RMSE = 1.660
- Random Forest: R² = 0.663, MAE = 1.547, RMSE = 2.209
- Linear Regression: R² = 0.651, MAE = 1.574, RMSE = 2.209
- The GMM reduced error by over 40% relative to a baseline that always predicts the average fare
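For reference, the three reported metrics and a comparison against the predict-the-average baseline can be computed as in the sketch below. The toy numbers are illustrative only and are unrelated to the results above. The write-up does not say which error metric the 40% figure refers to, so the helper takes the metric name as a parameter.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired lists of true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


def error_reduction_vs_mean(y_true, y_pred, metric="mae"):
    """Fractional reduction in an error metric relative to always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    baseline = regression_metrics(y_true, [mean_y] * len(y_true))
    model = regression_metrics(y_true, y_pred)
    return 1.0 - model[metric] / baseline[metric]


# Toy fares, made up for illustration.
y_true = [4.0, 6.0, 8.0, 10.0]
y_pred = [5.0, 6.0, 7.0, 10.0]

metrics = regression_metrics(y_true, y_pred)
reduction = error_reduction_vs_mean(y_true, y_pred, metric="mae")
print(metrics, reduction)
```

By construction, a model that always predicts the mean of the evaluation set scores R² = 0, which is why R² can be read as the share of fare variance explained beyond that baseline.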
Why It Works
The unsupervised GMM captured natural groupings of trips along distance, passenger count, and time of day, which allowed it to model nonlinear pricing patterns more effectively than the standard regression models.