Live Project: Titanic Survival Prediction
The Titanic disaster of 1912 resulted in the deaths of over 1,500 people and remains one of the most studied maritime tragedies in history. This project applies machine learning techniques to the Titanic passenger data to build a predictive model that estimates the probability of survival for each individual on board.
Project Overview: The purpose of this project is to develop a supervised learning model to classify whether a passenger survived the Titanic disaster based on a variety of input features. The model will be trained using historical passenger data that includes demographic, socioeconomic, and categorical information. The primary algorithm used in this project is the Random Forest Classifier, known for its robustness and high accuracy in classification problems.
Dataset Description: The Titanic dataset provided by Kaggle consists of the following CSV files:
- train.csv: Contains 891 records with attributes such as passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, fare, cabin, embarked port, and the target variable Survived.
- test.csv: Includes similar attributes for 418 passengers but lacks the Survived column. This dataset is used to evaluate the model's performance on unseen data.
- gender_submission.csv: A sample submission file used to illustrate the expected format for model output when submitting predictions.
Project Objective: The objective is to construct a machine learning pipeline that can:
Preprocess the dataset by handling missing values and transforming categorical variables.
Engineer new features that may enhance predictive performance.
Train a classification model using the training data.
Generate predictions on the test dataset.
Output the results in the correct submission format.
Implementation Pipeline: The implementation is divided into the following key stages:
1. Feature Engineering
Extraction of titles from names (e.g., Mr, Miss, Mrs) to derive social role information.
Binning of age and fare into categories to reduce noise and identify trends.
Construction of new features such as FamilySize and IsAlone to capture family relationships (see the sketch after this list).
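A hedged sketch of the last point, assuming the train and test DataFrames loaded in the setup below and the standard Kaggle SibSp and Parch columns (the walkthrough itself does not build these features):
# sketch: derive FamilySize and IsAlone from the SibSp/Parch columns
for dataset in (train, test):
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = (dataset['FamilySize'] == 1).astype(int)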
2. Imputation and Data Cleaning
Filling missing values in age, embarked port, and fare using statistical or contextual methods.
Encoding categorical variables such as Sex and Embarked using label encoding or one-hot encoding (a short sketch follows this list).
Dropping features with high missingness or limited predictive power, such as Cabin and Ticket.
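A minimal illustration of the two encoding options named above, using a toy DataFrame (the pipeline below uses the integer-mapping approach):
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C']})
# label encoding: one integer per category
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})
# one-hot encoding: one indicator column per category
toy = pd.get_dummies(toy, columns=['Embarked'], prefix='Embarked')
print(toy)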
3. Model Training and Evaluation
Splitting the training data into training and validation subsets for performance evaluation.
Applying a Random Forest Classifier, chosen for its ability to handle mixed-type features and non-linear relationships.
Performing cross-validation to ensure generalization and reduce overfitting.
4. Prediction
Once the model achieves satisfactory accuracy on the validation set, predictions are generated for the test data.
Results are saved in the required submission format: a CSV file with PassengerId and the predicted Survived status.
1. Importing Libraries and Initial Setup
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
warnings.filterwarnings('ignore')
Now let's read the training and test data into pandas DataFrames.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# number of rows and columns in the training set
train.shape
# (891, 12)
To see information about each column, such as its data type and the number of non-null values, we use the df.info() function.
train.info()
Now let's check whether any NULL values are present in the dataset. This can be done with the isnull() function combined with sum(), which yields the following output.
train.isnull().sum()
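For the standard Kaggle train.csv, three columns contain missing values (all others report 0):
# Age         177
# Cabin       687
# Embarked      2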
2. Data Visualization: Understanding Survival Trends and Passenger Demographics
Now let us visualize the data with pie charts and count plots to get a proper understanding of it.
f, ax = plt.subplots(1, 2, figsize=(12, 4))
train['Survived'].value_counts().plot.pie(
    explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=False)
ax[0].set_title('Survivors (1) and the dead (0)')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survivors (1) and the dead (0)')
plt.show()
Analyzing the Impact of Sex on Survival Rates: Next, we explore how passenger gender influenced the chances of survival.
f, ax = plt.subplots(1, 2, figsize=(12, 4))
train[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survivors by sex')
sns.countplot(x='Sex', hue='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survived (1) and deceased (0): men and women')
plt.show()
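The heading above also promises a look at passenger demographics; as one more example, a histogram of ages (a sketch; Age still contains missing values at this point, which pandas drops when plotting):
# age distribution of passengers (missing ages are dropped by pandas)
train['Age'].plot.hist(bins=30)
plt.xlabel('Age')
plt.title('Age distribution of passengers')
plt.show()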
3. Feature Engineering: Optimizing Data for Model Training
This section focuses on refining the dataset by removing irrelevant features and transforming categorical data into numerical formats. Key tasks include:
- Dropping Redundant Features: Removing columns like Cabin and Ticket that offer limited predictive value.
- Creating New Features: Introducing new columns derived from existing ones, such as age groups and titles.
- Data Transformation: Converting textual data into numerical categories for seamless model training.
train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)
train = train.drop(['Ticket'], axis=1)
test = test.drop(['Ticket'], axis=1)
# replacing the missing values in
# the Embarked feature with S
train = train.fillna({"Embarked": "S"})
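Rather than hardcoding "S", the fill value can be computed as the column's mode ("S" is indeed the most common embarkation port in train.csv); an equivalent sketch:
# equivalent, without hardcoding the value
most_common = train['Embarked'].mode()[0]  # 'S' for the Kaggle train set
train['Embarked'] = train['Embarked'].fillna(most_common)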
We will now sort ages into groups, combining individual ages into broader categories. Doing so leaves fewer distinct values, which reduces noise and suits a categorical treatment of the data.
# sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels=labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels=labels)
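A quick sanity check on the resulting bands (exact counts depend on the data):
# how many passengers fall into each age band
print(train['AgeGroup'].value_counts())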
Next, we extract a Title from each name in both the train and test sets and consolidate the many raw titles into a small, consistent set of classes. We then assign numerical values to the titles for convenient model training.
# create a combined group of both datasets
combine = [train, test]
# extract a title for each Name in the
# train and test datasets
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train['Title'], train['Sex'])
# replace various titles with more common names
for dataset in combine:
    # map royal titles first, so 'Lady' is not absorbed into 'Rare'
    dataset['Title'] = dataset['Title'].replace(
        ['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Jonkheer',
                                                 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
# map each of the title groups to a numerical value
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3,
                 "Master": 4, "Royal": 5, "Rare": 6}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
Now using the title information we can fill in the missing age values.
# modal AgeGroup for each title, for reference
mr_age = train[train["Title"] == 1]["AgeGroup"].mode()      # Young Adult
miss_age = train[train["Title"] == 2]["AgeGroup"].mode()    # Student
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode()     # Adult
master_age = train[train["Title"] == 4]["AgeGroup"].mode()  # Baby
royal_age = train[train["Title"] == 5]["AgeGroup"].mode()   # Adult
rare_age = train[train["Title"] == 6]["AgeGroup"].mode()    # Adult

age_title_mapping = {1: "Young Adult", 2: "Student",
                     3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

# use .loc to avoid chained-assignment issues
for x in range(len(train["AgeGroup"])):
    if train["AgeGroup"][x] == "Unknown":
        train.loc[x, "AgeGroup"] = age_title_mapping[train["Title"][x]]

for x in range(len(test["AgeGroup"])):
    if test["AgeGroup"][x] == "Unknown":
        test.loc[x, "AgeGroup"] = age_title_mapping[test["Title"][x]]
Now assign a numerical value to each age category. Once the ages are mapped into categories, the raw Age feature is no longer needed, so we drop it.
# map each Age value to a numerical value
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3,
               'Student': 4, 'Young Adult': 5, 'Adult': 6,
               'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)
train.head()
# dropping the Age feature for now, might change
train = train.drop(['Age'], axis=1)
test = test.drop(['Age'], axis=1)
Drop the Name feature, since it no longer contains useful information.
train = train.drop(['Name'], axis=1)
test = test.drop(['Name'], axis=1)
# encode Sex as an integer
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

# encode the embarkation port as an integer
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)
Fill the missing Fare value in the test set with the mean fare for that passenger's Pclass, then map fares into quartile bands and drop the raw Fare column.
for x in range(len(test["Fare"])):
    if pd.isnull(test["Fare"][x]):
        pclass = test["Pclass"][x]  # Pclass = 3
        test.loc[x, "Fare"] = round(
            train[train["Pclass"] == pclass]["Fare"].mean(), 4)

# map Fare values into quartile groups
train['FareBand'] = pd.qcut(train['Fare'], 4, labels=[1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4, labels=[1, 2, 3, 4])

# drop the raw Fare column
train = train.drop(['Fare'], axis=1)
test = test.drop(['Fare'], axis=1)
4. Model Training: Building the Predictive Model
In this phase, we employ Random Forest as our algorithm to train the model for predicting survival. Key steps include:
- Data Splitting: Dividing the training data into 80% training and 20% validation subsets using train_test_split() from the sklearn library.
- Model Selection: Leveraging the Random Forest algorithm, known for its robustness and ability to handle diverse data.
- Performance Evaluation: Assessing the trained model's accuracy on the validation data to ensure it generalizes well.
from sklearn.model_selection import train_test_split

# drop the Survived and PassengerId columns from the training set
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(
    predictors, target, test_size=0.2, random_state=0)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

randomforest = RandomForestClassifier()

# fit the model on the training split
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)

# accuracy on the held-out validation split
acc_randomforest = round(accuracy_score(y_val, y_pred) * 100, 2)
print(acc_randomforest)
# e.g. 83.8 (varies from run to run, since the forest is randomized)
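The pipeline overview also mentioned cross-validation; a minimal sketch using scikit-learn's cross_val_score (the 5-fold choice and fixed seed are assumptions):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full training features gives a less
# noisy estimate than a single train/validation split
cv_scores = cross_val_score(RandomForestClassifier(random_state=0),
                            predictors, target, cv=5)
print(cv_scores.mean(), cv_scores.std())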
5. Prediction: Generating Survival Predictions on Test Data
In this final phase, we use the trained Random Forest model to make predictions on the test dataset. The key steps are:
- Running Predictions: Feed the test dataset into the trained model to predict survival outcomes.
- Preparing Results: Store the PassengerId from the test data and the corresponding Survived predictions (0 or 1).
- Saving the Output: Export the predictions to a CSV file for submission, with two columns:
  - PassengerId: ID of each passenger from the test dataset.
  - Survived: Predicted survival status (0 = did not survive, 1 = survived).
ids = test['PassengerId']
predictions = randomforest.predict(test.drop('PassengerId', axis=1))
# set the output as a dataframe and convert
# to csv file named resultfile.csv
output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('resultfile.csv', index=False)
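As a quick sanity check before submitting (the 418-row expectation comes from the test set size):
# verify the submission format: two columns, 418 rows
print(output.head())
print(output.shape)  # expect (418, 2)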
Conclusion
In this project, we built a Random Forest classifier to predict the survival chances of Titanic passengers. Through data preprocessing, feature engineering, imputation, and model training, we produced a robust model that reaches roughly 83.8% accuracy on the held-out validation set.