STAT 1010 - Fall 2022
By the end of this lesson you should:
Perform a visual association test
Know how to read a scatterplot to describe associations between quantitative variables
Know how to quantify associations in quantitative variables
Use a line to describe associations in linear relationships
Understand spurious correlations and lurking variables
library(tidyverse) # dplyr, ggplot2, and the diamonds data
diamonds %>% # start from the diamonds data
  slice_sample(n = nrow(.)) %>% # shuffle the rows at random
  pull(carat) %>% # extract the shuffled carat values
  bind_cols(., diamonds$price) %>% # pair with price in its original order; columns are auto-named ...1 and ...2
  ggplot() + # pass to ggplot
  geom_point(aes(x = ...1, y = ...2)) + # ...1 = shuffled carat, ...2 = price
  labs(title = "Simulated association test",
       x = "Weight of diamond in carat",
       y = "Price of diamonds in US$")
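The shuffling idea behind this plot can also be checked numerically, using the correlation defined in the formulas that follow. The sketch below runs on simulated data, not the diamonds data: the variables `x` and `y`, the sample size, and the seed are all assumptions made for illustration. It compares the correlation before and after shuffling one variable.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1010)
n <- 1000
x <- runif(n, min = 0.2, max = 3)    # stand-in for a predictor
y <- 2 + 5 * x + rnorm(n, sd = 1)    # outcome linearly related to x

# Shuffling x breaks the pairing between x and y
x_shuffled <- sample(x)

cor_original <- cor(x, y)            # strong association
cor_shuffled <- cor(x_shuffled, y)   # should be close to 0

c(original = cor_original, shuffled = cor_shuffled)
```

If the real data show a pattern that the shuffled versions do not, that is evidence of an association.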
\(cov(x, y) = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \ldots + (x_n - \bar{x})(y_n - \bar{y})}{n-1}\)
\[corr(x, y) = \frac{cov(x, y)}{s_x \cdot s_y}\]
R output for the covariance and the correlation of carat and price:
[1] 1742.765
[1] 0.9215913
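The covariance and correlation formulas can be checked against R's built-in `cov()` and `cor()`. This is a minimal sketch on a small made-up data set; the vectors `x` and `y` are hypothetical and are not taken from the diamonds data.

```r
# Hypothetical small data set, used only to illustrate the formulas
x <- c(0.3, 0.5, 0.7, 1.0, 1.5)        # e.g. weights
y <- c(400, 1200, 2500, 5000, 10000)   # e.g. prices

n <- length(x)

# Covariance: sum of products of deviations, divided by n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations
corr_by_hand <- cov_by_hand / (sd(x) * sd(y))

# Compare with R's built-in functions (both comparisons should be TRUE)
all.equal(cov_by_hand, cov(x, y))
all.equal(corr_by_hand, cor(x, y))
```

For the diamonds data, the analogous calls would presumably be `cov(diamonds$carat, diamonds$price)` and `cor(diamonds$carat, diamonds$price)`.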
\[ y = mx + b \]
\[ m = \frac{r \cdot s_y}{s_x} \] \[ b = \bar{y} - m \bar{x} \]
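One way to see where the slope formula comes from is sketched below. It assumes the standard least squares result that the slope equals the covariance divided by the variance of \(x\) (not stated above), and then substitutes \(cov(x, y) = r \cdot s_x \cdot s_y\), which follows from the definition of the correlation:

\[\begin{aligned} m &= \frac{cov(x, y)}{s_x^2} \\ &= \frac{r \cdot s_x \cdot s_y}{s_x^2} \\ &= \frac{r \cdot s_y}{s_x} \end{aligned}\]

Rearranging the intercept formula gives \(\bar{y} = m \bar{x} + b\), so the line always passes through the point \((\bar{x}, \bar{y})\).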
\(r = 0.9215913\)
\(s_{price} = 3989.44\)
\(s_{carat} = 0.4740112\)
\(m = \frac{r \cdot s_{price}}{s_{carat}}\)
\[\begin{aligned} m &= \frac{r \cdot s_{price}}{s_{carat}} \\ &= \frac{0.9215913 \cdot 3989.44}{0.4740112} \\ &= 7756.426 \end{aligned}\]
\(b = \overline{price} - m \cdot \overline{carat}\)
\(\overline{price} = 3932.8\)
\(\overline{carat} = 0.7979397\)
\(m = 7756.426\)
\[\begin{aligned} b &= \overline{price} - m \cdot \overline{carat} \\ &= 3932.8 - 7756.426 \cdot 0.7979397\\ &= -2256.36 \end{aligned}\]
Our model is: \(\widehat{\text{price}} = -2256.36 + 7756.426 \cdot \text{carat}\)
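The arithmetic above can be reproduced in R. This is a sketch that only plugs in the summary statistics quoted in the text; the variable names are chosen for illustration.

```r
# Summary statistics as reported in the text
r          <- 0.9215913
s_price    <- 3989.44
s_carat    <- 0.4740112
mean_price <- 3932.8
mean_carat <- 0.7979397

# Slope and intercept of the least squares line
m <- r * s_price / s_carat
b <- mean_price - m * mean_carat

round(c(slope = m, intercept = b), 2)
```

Small differences in the last digits are possible because the inputs are rounded.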
Prediction for a carat value of \(2.5\):
\[\begin{aligned} \widehat{\text{price}} &= -2256.36 + 7756.426 \cdot \text{carat} \\ &= -2256.36 + 7756.426 \cdot 2.5 \\ &= \$17135 \end{aligned}\]
# Load tidymodels (also attaches ggplot2, which provides diamonds)
library(tidymodels)
# Fit the least squares line
# NOTE: in the formula, outcome first, predictor second
least_squares_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ carat, data = diamonds)
# Predict price for new carat values
predict(least_squares_fit, tibble(carat = c(1, 2, 2.5, 4)))
# A tibble: 4 × 1
   .pred
   <dbl>
1  5500.
2 13256.
3 17135.
4 28769.
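As a cross-check, the hand-computed line gives the same predictions as the fitted model, up to rounding. The sketch below uses only the coefficients from the text; the helper name `predict_price` is hypothetical.

```r
# Coefficients of the hand-computed line from the text
b <- -2256.36   # intercept
m <- 7756.426   # slope

# Hypothetical helper that applies price-hat = b + m * carat
predict_price <- function(carat) b + m * carat

carat_values <- c(1, 2, 2.5, 4)
round(predict_price(carat_values))
```

Rounded to whole dollars, these agree with the `.pred` column above.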
Variables that influence both variables in an association, and so can produce a spurious correlation, are called lurking variables or confounders.
Lurking variables are not considered in the statistical analysis.
Confounders are considered in the analysis.
There are both known and unknown confounders; unknown confounders are lurking variables.
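A small simulation can illustrate how a lurking variable produces a spurious correlation. This is a sketch under stated assumptions: the data are simulated, the variable names `x`, `y`, and `z` are hypothetical, and correlating residuals is just one simple way to account for `z`.

```r
set.seed(2022)
n <- 2000

# z is the lurking variable: it drives both x and y
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)   # y does not depend on x directly

# x and y look strongly associated when z is ignored
cor_xy <- cor(x, y)

# Remove the part of x and y explained by z, then correlate what is left
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))
cor_given_z <- cor(x_resid, y_resid)

c(ignoring_z = cor_xy, accounting_for_z = cor_given_z)
```

Once `z` is included in the analysis, it is treated as a confounder and the apparent association between `x` and `y` largely disappears.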