Introduction to the diamonds dataset

Important

This application exercise is due at noon on 6 Sept at 4:59pm.

Scenario

You own a jewelry business and are working to price your diamonds. To facilitate, you explore the diamond dataset in R and pay special attention to the cost of the diamond. Let’s explore together.

To accomplish this, please paste each line of code into your RStudio instance, and answer the accompanying questions.

Coding basics

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

  • When writing code, load all libraries first.

  • “##” is used to write comments in R and should be used to justify next steps and anything unusual or unexpected steps

  • Comments are really important and are notes to your future self to remind you about steps that you took and decisions that you made.

## loading library
library(tidyverse) # for data analysis and visualisation

Data

The diamonds dataset includes basic information about 53940 diamonds including price and other characteristics.

## View the dataset in a separate tab
glimpse(diamonds)

The ___ dataset has ___ observations and ___ variables.

## What are the variables in this dataset?
?diamonds
  1. What do the “4 c’s” of every diamond mean? What kind of variables are they? (hint: This video may help, copy this into your notes under the title data types)

ggplot is an R package that is used to create data visualizations. We will be using it throughout this course. The basic format for every ggplot command is:

## DO NOT COPY INTO YOUR QUARTO DOCUMENT
## THIS IS TO SHOW YOU THE BASIC FORMAT OF ALL ggplots
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
## What is the distribution of price?
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1) 
  1. What kind of variable is price? What happens if you draw a histogram with another kind of variable?
## What happens if you draw a histogram with a different type of variable?
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = OTHER_VAR), binwidth = 1) 
  1. What does the binwidth in the geom_histogram() function do? (hint: draw multiple plots changing this value) What binwidth value is most appropriate for this data?
## Find an appropriate binwidth
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = OTHER_VALUE) 

The distribution of diamond prices is right skewed (long right tail).

  1. In your notes, copy these histograms and the words used to describe them.

One interesting feature of this graph is the dip in diamonds priced at about $1000. Let’s explore this a little further and look at some basic dplyr commands.

The pipe operator “%>%” tells R to take the output from one function and sends it to the input in the next function.

## Filtering on price
diamonds %>% 
  filter(price > 750 & price < 1250) %>% 
  View()
  1. What does the filter function do?
diamonds %>% 
  filter(price > 750 & price < 1250) %>% ## Filtering on price
  mutate(price_hundred = price %/% 100) %>% ## Finding price_hundred
  View()
  1. What does this “%/%” operator do? (hint: run 10 %/% 3, 10 %/% 5, 7 %/% 3, 7 %/% 5 in the console)

  2. What does the mutate function do? (hint: pay special attention to the dimensions of the data)

diamonds %>% 
  filter(price > 750 & price < 1250) %>% ## Filtering on price
  mutate(price_hundred = price %/% 100) %>% ## Finding price_hundred
  select(price, price_hundred) %>% ## Select vars
  View()
  1. What does the select function do?
diamonds %>% 
  filter(price > 750 & price < 1250) %>% ## Filtering on price
  mutate(price_hundred = price %/% 100) %>% ## Finding price_hundred
  select(price, price_hundred) %>% ## Select vars
  count(price_hundred) %>% ## Count
  View()
  1. Can you identify the values where the dip is using the information above? Why or why not?
diamonds %>% 
  filter(price > 750 & price < 1250) %>% ## Filtering on price
  mutate(price_hundred = price %/% 100) %>% ## Finding price_hundred
  select(price, price_hundred) %>% ## Select vars
  mutate(price_remainder = price %% 100) %>% ## Finding price_remainder
  count(price_hundred, price_remainder) %>% ## Count
  View()

10. What does the “%%” operator do?

  1. Can you identify the values where the dip is using the information above? Why or why not?

Cut may impact upon price

## Cut and price
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, color = cut, fill = cut),
                 binwidth = 100) 
  1. Which cut do you suspect is the most common in the dataset? How many diamonds with that cut are in the dataset?
## Most common cute
diamonds %>% 
  INSERT_FUNCTION_HERE(cut)
  1. The variables color or clarity might impact upon the price. Draw a histogram and color it using one of them.
## Impact of color and clarity
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, color = VAR_NAME, 
                               fill = VAR_NAME), 
                 binwidth = YOUR_BIN_WIDTH) 
  1. What do the fill and color inputs do in the above coding?

  2. What kind of variables should we use for the fill and color inputs? What happens if another type of variable is used?

## Color and fill variables
ggplot(data = diamonds) +
      geom_histogram(mapping = aes(x = price, color = VAR_NAME, 
                                   fill = VAR_NAME), 
                     binwidth = YOUR_BIN_WIDTH) 

So far we have been looking at only 1 continuous variable. To look at two continous variables, we draw scatterplots.

## Color and fill variables
ggplot(data = diamonds) +
      geom_point(mapping = aes(x = price, y = depth)) 
  1. What kinds of variables do you think you could use to color this plot? Please use an appropriate variable from diamonds dataset and do so.
## Color and fill variables
ggplot(data = diamonds) +
      geom_point(mapping = aes(x = price, y = depth))