Live Project: Uber Rides Data Analysis using Python
Project Info: This project focused on analyzing real-world Uber rides data using Python to derive business insights & uncover travel patterns. It involved in-depth data cleaning, preprocessing & visualization to explore ride purposes, categories & trends across time dimensions like day, month & hour. The goal was to understand user behavior & ride distribution for personal vs business use, as well as to derive actionable insights for operational decisions.
Key objectives included:
Data wrangling & feature engineering
Analyzing ride patterns across different time periods
Uncovering relationships between ride categories & distances
Identifying travel behavior trends based on weekday & hour
Project Implementation:
Data Import & Library Setup:
Utilized key libraries like Pandas, NumPy, Matplotlib & Seaborn for handling data, computation & visualization. The dataset was read using pd.read_csv().
Initial Exploration:
Used .head(), .shape & .info() to understand the structure, size & data types. Identified & handled missing values in the PURPOSE column using fillna().
Datetime Processing:
Converted START_DATE & END_DATE columns to datetime format. Extracted time & day segments to categorize rides into Morning, Afternoon, Evening & Night using pd.cut().
Data Cleaning:
Removed nulls & duplicate entries for accurate analysis.
Separated categorical & numeric columns for preprocessing.
Visual Analysis:
Used sns.countplot() to visualize categorical trends across CATEGORY, PURPOSE & time-of-day. Compared business vs personal rides using hue in plots. Plotted day-wise & month-wise ride distribution with bar & line plots.
Observed reduced rides during winter months (Nov–Jan), validating seasonal trends.
Feature Engineering & Encoding:
Applied OneHotEncoding on CATEGORY & PURPOSE. Merged encoded columns with the original dataset after dropping the originals.
Correlation Analysis:
Used heatmap to find correlations among numerical features.
Noted a strong negative correlation between Business & Personal ride categories.
Mileage Analysis:
Created boxplot & distribution plot for short-distance rides (<100 miles).
Found that most rides were between 0–20 miles, with a peak around 4–5 miles.
Key Learnings & Outcomes:
Gained practical experience in preprocessing datetime data
Developed strong command over visualization techniques using Seaborn
Learned how to extract real business insights from noisy travel datasets
Understood user ride behavior across time, purpose & category dimensions
Applied OneHotEncoding for ML-readiness & performed correlation analysis
Importing Libraries
The analysis will be done using the following libraries:
- Pandas: Loads the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
- NumPy: NumPy arrays are very fast and can perform large computations in a very short time.
- Matplotlib / Seaborn: These libraries are used to draw visualizations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Importing Dataset
After importing all the libraries, download the data using the link.
Once downloaded, you can import the dataset using the pandas library.
dataset = pd.read_csv("UberDataset.csv")
dataset.head()
To find the shape of the dataset, we can use dataset.shape
dataset.shape
To understand the data more deeply, we need to know about the null values count, datatype, etc. So for that we will use the below code.
dataset.info()
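Beyond .info(), it helps to get an explicit per-column count of missing values. A minimal sketch on a small synthetic frame (the column names mirror the Uber dataset but the values are made up for illustration):

```python
import pandas as pd

# small synthetic frame mimicking the Uber dataset's columns (values are illustrative)
demo = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Personal", None],
    "PURPOSE": ["Meeting", None, None, "Errand/Supplies"],
    "MILES": [5.1, 4.8, 12.0, 3.3],
})

# count of nulls per column, sorted so the worst offender comes first
null_counts = demo.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

In the real dataset, PURPOSE is the column with by far the most missing values, which is why it gets special treatment in the next step.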
Data Preprocessing
As we saw, there are a lot of null values in the PURPOSE column, so we will be filling the null values with a "NOT" placeholder. You can try something else too.
dataset['PURPOSE'] = dataset['PURPOSE'].fillna("NOT")  # assign back instead of inplace to avoid chained-assignment warnings
Changing the START_DATE and END_DATE columns to datetime format so that they can be used for further analysis.
dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'], errors='coerce')
dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'], errors='coerce')
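The errors='coerce' flag quietly turns unparseable entries into NaT instead of raising an exception, which is why the later dropna() step matters. A minimal standalone sketch:

```python
import pandas as pd

# one malformed entry mixed in with valid timestamps
raw = pd.Series(["01-01-2016 21:11", "not a date", "01-02-2016 01:25"])
parsed = pd.to_datetime(raw, errors='coerce')

# the malformed entry becomes NaT rather than raising a ValueError
print(parsed.isna().tolist())  # [False, True, False]
```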
Splitting START_DATE into date and time columns, then converting the hour into four categories, i.e. Morning, Afternoon, Evening, Night.
dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour

# changing into categories of day and night
dataset['day-night'] = pd.cut(x=dataset['time'],
                              bins=[0, 10, 15, 19, 24],
                              labels=['Morning', 'Afternoon', 'Evening', 'Night'])
dataset.dropna(inplace=True)
dataset.drop_duplicates(inplace=True)
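To see exactly how pd.cut assigns the labels, here is a small standalone run with the same bins. Note that with the default right-closed intervals, hour 0 falls outside (0, 10] and becomes NaN (those rows are then removed by the dropna() above); passing include_lowest=True would keep midnight rides instead.

```python
import pandas as pd

hours = pd.Series([0, 2, 12, 18, 22])
slots = pd.cut(hours,
               bins=[0, 10, 15, 19, 24],
               labels=['Morning', 'Afternoon', 'Evening', 'Night'])

# hour 0 is outside the first interval (0, 10], so it maps to NaN
print(slots.tolist())  # [nan, 'Morning', 'Afternoon', 'Evening', 'Night']
```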
Data Visualization
In this section, we will try to understand and compare all columns.
Let’s start by checking the number of unique values in the columns with object datatype.
obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)

unique_values = {}
for col in object_cols:
    unique_values[col] = dataset[col].unique().size
unique_values
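The same per-column unique counts can be obtained in one line with nunique(); a sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Personal"],
    "PURPOSE": ["Meeting", "Errand/Supplies", "Meeting"],
})

# dict of unique-value counts per object column, equivalent to the loop above
unique_counts = demo.select_dtypes(include='object').nunique().to_dict()
print(unique_counts)  # {'CATEGORY': 2, 'PURPOSE': 2}
```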
Now, we will use the matplotlib and seaborn libraries to draw countplots of the CATEGORY and PURPOSE columns.
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
sns.countplot(x='CATEGORY', data=dataset)  # x= keyword required in recent seaborn
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
sns.countplot(x='PURPOSE', data=dataset)
plt.xticks(rotation=90)
Let’s do the same for the day-night column, which we derived from the time column extracted above.
sns.countplot(x='day-night', data=dataset)
plt.xticks(rotation=90)
Now, we will compare the two ride categories along with the PURPOSE of the user.
plt.figure(figsize=(15, 5))
sns.countplot(data=dataset, x='PURPOSE', hue='CATEGORY')
plt.xticks(rotation=90)
plt.show()
Next, we will apply OneHotEncoding to the CATEGORY and PURPOSE columns so the data is ready for ML tasks.
from sklearn.preprocessing import OneHotEncoder
object_cols = ['CATEGORY', 'PURPOSE']
OH_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output replaces the removed sparse argument (scikit-learn >= 1.2)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(dataset[object_cols]))
OH_cols.index = dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = dataset.drop(object_cols, axis=1)
dataset = pd.concat([df_final, OH_cols], axis=1)
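For a quick exploratory analysis like this one, pandas' own get_dummies gives the same one-hot columns without pulling in scikit-learn; a minimal sketch on a toy frame (the OneHotEncoder route above remains the better fit when the same encoding must later be applied to unseen data):

```python
import pandas as pd

demo = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "MILES": [5.1, 3.3, 12.0],
})

# one-hot encode and drop the original column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(demo, columns=["CATEGORY"], dtype=int)
print(list(encoded.columns))
# ['MILES', 'CATEGORY_Business', 'CATEGORY_Personal']
```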
After that, we can now find the correlation between the columns using heatmap.
# Select only numerical columns for correlation calculation
numeric_dataset = dataset.select_dtypes(include=['number'])
sns.heatmap(numeric_dataset.corr(),
cmap='BrBG',
fmt='.2f',
linewidths=2,
annot=True)
Insights from the heatmap:
- Business and Personal categories are highly negatively correlated; this confirms what we noted earlier from the encoded columns.
- There is not much correlation between the features.
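The strong negative correlation between the Business and Personal columns is a direct artifact of one-hot encoding a two-value category: each row is 1 in exactly one of the two columns, so they are mirror images of each other. A tiny sketch makes this concrete:

```python
import pandas as pd

# one-hot columns of a binary category: exactly one of them is 1 per row
business = pd.Series([1, 0, 1, 1, 0])
personal = 1 - business

# mirror-image columns correlate at exactly -1
print(business.corr(personal))  # -1.0
```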
Now we need to visualize the month data. This can be done the same way as before (for hours).
dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month
month_label = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr',
               5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug',
               9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
dataset["MONTH"] = dataset.MONTH.map(month_label)
mon = dataset.MONTH.value_counts(sort=False)
# total rides per month vs the longest ride recorded in each month
df = pd.DataFrame({"TOTAL RIDES": mon.values,
                   "MAX MILES": dataset.groupby('MONTH',
                                                sort=False)['MILES'].max()})

p = sns.lineplot(data=df)
p.set(xlabel="MONTHS", ylabel="COUNT / MILES")
Insights from the above plot:
- The ride counts are quite irregular across months.
- Still, it is clear that the counts are much lower during Nov, Dec and Jan, which is consistent with winter in Florida, US.
Visualization for days data.
dataset['DAY'] = dataset.START_DATE.dt.weekday
day_label = {
0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thur', 4: 'Fri', 5: 'Sat', 6: 'Sun'
}
dataset['DAY'] = dataset['DAY'].map(day_label)
day_counts = dataset.DAY.value_counts()
sns.barplot(x=day_counts.index, y=day_counts.values)
plt.xlabel('DAY')
plt.ylabel('COUNT')
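The dt.weekday accessor returns 0 for Monday through 6 for Sunday, which is why the mapping dictionary above starts at 0. A standalone sketch on a few known dates:

```python
import pandas as pd

# 2016-01-04 was a Monday, 2016-01-09 a Saturday, 2016-01-10 a Sunday
dates = pd.to_datetime(pd.Series(["2016-01-04", "2016-01-09", "2016-01-10"]))
day_label = {0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thur', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

names = dates.dt.weekday.map(day_label)
print(names.tolist())  # ['Mon', 'Sat', 'Sun']
```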
As the full distribution of the MILES column is hard to read because of long-distance outliers, let’s zoom in on values less than 100.
sns.boxplot(x=dataset[dataset['MILES'] < 100]['MILES'])
sns.histplot(dataset[dataset['MILES'] < 40]['MILES'], kde=True)  # distplot is deprecated in recent seaborn
Insights from the above plots:
- Most cabs are booked for distances of 4-5 miles.
- People mostly choose cabs for distances of 0-20 miles.
- For distances of more than 20 miles, the cab count is nearly negligible.
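Rather than eyeballing the plots, the same short-distance conclusion can be checked numerically with a boolean mean and the median; a sketch on synthetic right-skewed mileage data (the real MILES column would plug in the same way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic right-skewed mileage data, peaking around a few miles like the plots suggest
miles = pd.Series(rng.gamma(shape=2.0, scale=3.0, size=1000))

share_under_20 = (miles < 20).mean()   # fraction of rides under 20 miles
median_trip = miles.median()           # typical trip length

print(f"under 20 miles: {share_under_20:.0%}, median: {median_trip:.1f} miles")
```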