Titanic Survival Prediction: A Beginner’s Data Science Project


Introduction

The Titanic Survival Prediction challenge is one of the most popular beginner data science projects. It is often the first machine learning project introduced to newcomers, and for good reason. It covers the fundamentals of data science: data cleaning, exploratory data analysis (EDA), feature engineering, model selection, and evaluation. The problem statement is simple: given the details about passengers aboard the Titanic, predict whether they survived or not. In this blog post, we’ll walk through the steps to tackle this project using Python and popular libraries like Pandas, Scikit-learn, and Matplotlib.

This project is hosted on Kaggle, a platform where you can access datasets and competitions, making it a great starting point for beginners. We'll guide you through each step of this project, so you’ll not only understand how to approach it but also gain hands-on experience that will help you in real-world data science tasks.


1. Project Overview

The Titanic dataset contains information on passengers such as their age, gender, class of travel, ticket fare, and whether they survived the tragic sinking of the ship in 1912. Our goal is to build a machine learning model that predicts the likelihood of survival based on these attributes.

Key Questions:

  • What features (or variables) are important for predicting survival?
  • Can we accurately predict who survived the Titanic disaster?
  • What algorithms will provide the best results?

This project is a binary classification task: the target variable, Survived, takes one of two values, 1 (survived) or 0 (did not survive). We'll start with a simple algorithm and then try more powerful models and tune them for better performance.


2. Data Loading and Preprocessing

The first step in any data science project is to load the data and get familiar with its structure. In the case of the Titanic dataset, it’s relatively small and contains both numerical and categorical data, making it an excellent choice for practicing data preprocessing.

Loading the Data:

import pandas as pd

# Load the Titanic dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Check the first few rows of the data
train_data.head()
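
Beyond head(), a couple of quick calls give you the data's shape, column types, and class balance; a minimal sketch:

# Overall shape, plus per-column dtypes and non-null counts
print(train_data.shape)   # (891, 12) for the standard Kaggle training split
train_data.info()

# Class balance of the target: roughly 62% did not survive, 38% did
print(train_data['Survived'].value_counts(normalize=True))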

Exploring the Data: When you first look at the data, you’ll see several columns, such as PassengerId, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, and Survived. Each of these columns contains information about the passengers. Here’s what each of them means:

  • PassengerId: Unique identifier for each passenger.
  • Pclass: The class of the passenger (1st, 2nd, or 3rd).
  • Sex: Gender of the passenger.
  • Age: Age of the passenger.
  • SibSp: Number of siblings or spouses aboard.
  • Parch: Number of parents or children aboard.
  • Fare: The ticket fare the passenger paid.
  • Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
  • Survived: Whether the passenger survived (1 = Yes, 0 = No).
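
A little exploratory analysis goes a long way here. Grouping by Sex and Pclass, for example, reveals the patterns a model will end up learning; a minimal sketch (run it before the encoding step below, while Sex still holds text values):

# Survival rate by gender: women survived at a much higher rate than men
print(train_data.groupby('Sex')['Survived'].mean())

# Survival rate by class: 1st class fared far better than 3rd
print(train_data.groupby('Pclass')['Survived'].mean())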

You’ll notice that the dataset contains missing values: Age is missing for a fair number of passengers, Embarked for just a couple, and Cabin for the majority. Since Cabin is mostly empty, we’ll leave it out of our feature set; the first step is to clean the data by filling in the missing Age and Embarked values.

Handling Missing Data:

# Check for missing values
train_data.isnull().sum()

# Fill missing Age values with the median
# (plain assignment is preferred over inplace=True, which is deprecated
# and unreliable on a column selection in recent versions of pandas)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())

# Fill missing Embarked values with the mode (most frequent value)
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
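
It’s good practice to confirm the fills worked before moving on:

# Verify that no missing values remain in the filled columns
print(train_data[['Age', 'Embarked']].isnull().sum())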

Once the missing values are handled, it’s time to transform the data into a format suitable for machine learning models. This means encoding categorical variables as numbers and, for scale-sensitive models such as KNN, SVM, and logistic regression, scaling numerical features (tree-based models like random forests don’t require scaling).

Encoding Categorical Data: The Sex and Embarked columns contain text, so we need to convert them to numbers. For simplicity, we’ll map the Sex column directly to 0/1 (equivalent to label encoding a binary column) and one-hot encode the Embarked column.

# Convert 'Sex' to numeric (male = 0, female = 1)
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})

# One-Hot Encode the 'Embarked' column
train_data = pd.get_dummies(train_data, columns=['Embarked'])
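
You can confirm what the one-hot encoding produced; get_dummies replaces Embarked with one indicator column per port:

# The Embarked column becomes Embarked_C, Embarked_Q, and Embarked_S
print(train_data.filter(like='Embarked').head())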

Feature Selection: Now that the data is cleaned and transformed, we can select the features used to train the model. To keep things simple, we’ll use Pclass, Sex, Age, SibSp, Parch, and Fare as our input features, with Survived as the target variable; a sketch after the code block shows how to include the Embarked dummy columns as well.

# Selecting features and target variable
X = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = train_data['Survived']
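
If you also want the model to see the port of embarkation, you can fold in the dummy columns created earlier; a minimal sketch using the column names get_dummies generates here:

# Extend the feature set with the one-hot encoded embarkation columns
embarked_cols = ['Embarked_C', 'Embarked_Q', 'Embarked_S']
X = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] + embarked_cols]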

3. Model Building and Evaluation

At this stage, we’re ready to build and train a machine learning model. Since this is a binary classification problem, we’ll start with a simple Logistic Regression model. Logistic Regression is a straightforward algorithm that works well for binary classification tasks like this.

Training the Model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
# (max_iter is raised from the default of 100, which often triggers a
# ConvergenceWarning on unscaled features like Fare)
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")

The accuracy score is the fraction of test cases classified correctly. The confusion matrix breaks the errors down into false positives and false negatives, which is critical for understanding where the model goes wrong.
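
Accuracy alone can be misleading, so it’s worth looking at per-class precision and recall as well; a minimal sketch using scikit-learn’s classification_report on the predictions above:

from sklearn.metrics import classification_report

# Precision, recall, and F1 for each class (0 = did not survive, 1 = survived)
print(classification_report(y_test, y_pred, target_names=['Did not survive', 'Survived']))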


4. Model Improvement

While Logistic Regression is a great starting point, we can experiment with other models to improve accuracy. Some common models for classification tasks include:

  • Random Forest Classifier: A more powerful ensemble method.
  • Support Vector Machine (SVM): Effective for high-dimensional spaces.
  • K-Nearest Neighbors (KNN): A simple yet effective model for classification.

Trying a Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
rf_pred = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_pred)
rf_conf_matrix = confusion_matrix(y_test, rf_pred)

print(f"Random Forest Accuracy: {rf_accuracy}")
print(f"Confusion Matrix:\n{rf_conf_matrix}")

By comparing the accuracy of different models, you can determine which one works best for this problem.
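
A single train/test split can be noisy, so a more reliable comparison uses cross-validation. The sketch below scores each candidate, including the SVM and KNN mentioned above, with 5-fold cross-validation; KNN and SVM are wrapped in a pipeline with StandardScaler because they are sensitive to feature scale, and your exact numbers will vary with preprocessing choices:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# 5-fold cross-validated accuracy for each model on the training data
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")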


5. Conclusion

The Titanic Survival Prediction project is a fantastic way to get started with data science. It covers the entire workflow of a data science project: from loading and cleaning data to building models and evaluating their performance. As a beginner, this project allows you to practice key skills, such as data preprocessing, feature engineering, and model selection.

Moreover, once you’re comfortable with Logistic Regression and Random Forest, you can dive into more advanced topics, like hyperparameter tuning, cross-validation, and other machine learning algorithms. As you progress, you’ll also be able to tackle more complex datasets and challenges.
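
As a first taste of hyperparameter tuning, here is a minimal GridSearchCV sketch for the random forest; the parameter grid is only an illustrative starting point, not a recommendation:

from sklearn.model_selection import GridSearchCV

# Search a small grid of random forest settings with 5-fold cross-validation
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")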

Remember, the Titanic dataset is just the beginning. Data science is a vast field, and as you complete more projects, you’ll develop a deeper understanding of how to apply these techniques to solve real-world problems.

