HW 1 - Titanic dataset

Due Friday, 23 September, 2pm on Gradescope


For this homework assignment we will be exploring the Titanic dataset.

Learning goals

In this assignment, you will…

  • Download and import data into R
  • Explore categorical variables
  • Compute the expected counts of a \(\chi^2\) distribution and
  • Perform a \(\chi^2\) test on some variables

Getting started

Log in to RStudio

Click on your Stat1010.Rproj.

Go to FileNew FileQuarto Document and name the document hw-1 click create


The following packages will be used in this assignment:

library(tidyverse) # for data manipulation and data visualization
library(here) # to organize files

Data: Titanic

On April 15, 1912, the largest passenger liner ever made collided with an iceberg during her maiden voyage. When the Titanic sank it killed 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck resulted in such loss of life wasthat there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their passenger-class, their sex, and the fare they paid

  • Pclass: The class of passengers on the titanic, with \(1st\) being the highest class, and \(3rd\) the lowest.
  • Survived: A variable that records whether or not a passenger survived the sinking of the titanic.

Part 1: Download and import the data

  1. Download the file into your Stat1010 R project in the data folder.

  2. Import your data into R.


This may be a good time to render your document to ensure that the data loads without errors.

Part 2: Exploring categorical variables

  1. How many observations and how many variables in this dataset?

  2. Classify all the variables into two categories: qualitative or quantitative

  3. How many levels of passenger class are there in the variable?

  4. How many passengers survived the sinking of the titanic?

  5. Make a contingency table of class and survival. Which combination of class and survival is the most common?

    1. What proportion of \(3rd\) class passengers survived?

    2. What proportion of \(1st\) class passengers survived?


This may be a good time to render your document.

  1. Draw a plot that conditions on class and displays the proportions of each that survived and update labels. What does the plot suggest regarding the chance of survival and your class? Does it suggest that they are associated?

  2. Assuming Survived and Pclass are independent, write R coding to find the expected counts for all cells of the contingency table. (Hint: you will need to use the rowSums(.[2:3]) and colSums(.[2:4]) functions, depending on the order that you compute the two.)

  3. Perform a \(\chi^2\) test on the two variables Survived and Pclass. Include both \(H_0\) and \(H_A\) and interpret the \(p-value\). What conclusions can you draw about the independence of Survived and Pclass?


Before submitting, make sure you render your document,



Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:


Total points available: 25 points.

Component Points
Part 1 Qu 1 2
Part 1 Qu 2 3
Part 2 Qu 1 2
Part 2 Qu 2 2
Part 2 Qu 3 1
Part 2 Qu 4 2
Part 2 Qu 5 Pt 1 1
Part 2 Qu 5 Pt 2 1
Part 2 Qu 6 3
Part 2 Qu 7 4
Part 2 Qu 8 4