HW 2 - Titanic dataset

Due Friday, 21 October, 2pm on Gradescope


For this homework assignment we will be exploring the Titanic dataset.

Learning goals

In this assignment, you will…

  • Import data into R
  • Explore numeric variables
  • Compute means for subgroups and compare them
  • Comment on and produce a pairs plot of numeric variables
  • Address overplotting

Getting started

Log in to RStudio

Click on your Stat1010.Rproj that we made the first day of class.

Go to FileNew FileQuarto Document and name the document hw-2 click create.


The following packages will be used in this assignment:

library(tidyverse) # for data manipulation and data visualization
library(here) # to organize files
library(GGally) # for the pairs plot

Data: Titanic

On April 15, 1912, the largest passenger liner ever made collided with an iceberg during her maiden voyage. When the Titanic sank it killed 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck resulted in such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their passenger-class, their sex, and the fare they paid

  • Pclass: The class of passengers on the titanic, with \(1st\) being the highest class, and \(3rd\) the lowest.

  • Survived: A variable that records whether or not a passenger survived the sinking of the titanic.

  • Fare: A variable that records the passenger fare.

  1. Import the data into R
  2. For each value of Pclass, compute the mean fare and standard deviation paid by the passengers in that class. Comment on them in context.
  3. For each value of Survived, compute the mean fare and standard deviation paid by the passengers with that survival status. Comment on them in context.

This may be a good time to render your document.

  1. Draw a pairs plot of all numeric variables in the dataset and comment on all scatterplots and densities. Be sure to use variable names.
  2. Explore any outliers by filtering and looking at the data. Answer questions like this, but include others that spark your curiosity. Include a minimum of 2 additional outliers.
    1. Which was the biggest family aboard?

    2. Which and how many people paid the highest fair?

  3. Draw one plot containing 3 boxplots of the Fare paid for each category of Pclass. What conclusions can you draw from this plot? Consider our last assignment and how the passenger class predicted survival.

This may be a good time to render your document.

  1. Draw a scatterplot of Pclass and Fare, and color it by the survival variable. (Hint: use the function geom_jitter() to avoid overplotting.) Comment on the plot.

Before submitting, make sure you render your document,



Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:


Total points available: 38 points

Component Points
Qu 1 2
Qu 2 4
Qu 3 4
Qu 4 15
Qu 5 4
Qu 6 5
Qu 7 4