Revision 1 for exam 1

What tidyverse function would you use to extract rows?

a. filter()

count()
select()
distinct()

What tidyverse function would you use to extract columns?

filter()
count()
c. select()
distinct()

To find how many times each level of a categorical variable appears, and hence which level appears most often, which function should I use?

filter()
b. count()
select()
distinct()
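A minimal sketch of the three verbs asked about above, using a small made-up tibble (the `gems` data and its values are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical toy data, not taken from the exam
gems <- tibble(
  cut   = c("Ideal", "Good", "Ideal", "Fair", "Ideal"),
  color = c("D", "E", "D", "J", "I"),
  price = c(326, 327, 334, 337, 340)
)

ideal_rows <- gems %>% filter(cut == "Ideal")   # keeps rows where the condition is TRUE
two_cols   <- gems %>% select(cut, price)       # keeps only the named columns
cut_counts <- gems %>% count(cut, sort = TRUE)  # one row per level, most frequent first
```

Here `ideal_rows` has three rows, `two_cols` has two columns, and the first row of `cut_counts` is the most common level.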

What does the “%>%” operator do?

a. It pipes the result on its left into the function on its right, chaining dplyr steps in the tidyverse

b. It adds further layers to a ggplot

Will this code run without an error? diamonds %>% mutate(price_hundred = price %/% 100)

It will produce an error
b. It will not produce an error
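As a quick check of the line above, the sketch below runs it, assuming the dplyr and ggplot2 packages are installed (ggplot2 supplies the `diamonds` dataset). The name `with_hundreds` is only an illustrative choice.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# %/% is integer division: it drops the remainder
first_price_hundreds <- 326 %/% 100  # 3

# The pipe passes diamonds as the first argument of mutate()
with_hundreds <- diamonds %>%
  mutate(price_hundred = price %/% 100)
```

The new column `price_hundred` holds each price in whole hundreds of dollars.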

What does the binwidth input in ggplot do?

It controls the opacity of the plot
It can be used to update the labels
c. It controls the width of the bins, and therefore the smoothness, of a histogram
It controls the size of dots

What punctuation must be placed before a function or dataset to access more information?

“.”
“+”
c. “?”
“!”

Where should the variable names go in the following line of code: ggplot(data = A) + B(mapping = aes(C))

A
B
c. C
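One possible filling-in of the template above, assuming ggplot2 is installed: `diamonds` plays the role of A, `geom_bar` of B, and the variable `cut` goes inside `aes()` as C. The choice of `geom_bar` and `cut` is only an example.

```r
library(ggplot2)

# A = dataset, B = geom function, C = variable names inside aes()
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

Printing `p` draws the bar chart.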

What does the here package do?

for data manipulation and display
b. to organize files, by building file paths relative to the project root
none of the above

Which of the following variables are discrete?

a. number of employees in a company.

distance traveled to and from work
taxes paid in 2019
purchases from 2018

What is the difference between distinct() and count()?

they are the same
b. distinct() gives the levels in a variable, count() gives the marginal distribution
distinct() gives the marginal distribution, count() gives the levels in a variable
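The difference can be seen side by side in a small sketch, again on made-up data (the `gems` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
gems <- tibble(cut = c("Ideal", "Good", "Ideal", "Fair", "Ideal"))

levels_only <- gems %>% distinct(cut)  # one row per level, no counts
with_counts <- gems %>% count(cut)     # one row per level, plus a column n of counts
```

`levels_only` lists the three levels. `with_counts` lists the same levels together with how often each occurs, which is the marginal distribution.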

What operator is used in ggplot?

“.”
b. “+”
“?”
“!”

Categorical variables

Find the expected count for clarity VS1 and color I in a \(\chi^2\) test, assuming clarity and color are independent.

\(8171\times 5422/53940\)

# A tibble: 8 × 10
  clarity     D     E     F     G     H     I     J marginal_clarity marginal_…¹
  <ord>   <int> <int> <int> <int> <int> <int> <int>            <dbl>       <dbl>
1 I1         42   102   143   150   162    92    50              741        6775
2 SI2      1370  1713  1609  1548  1563   912   479             9194        9797
3 SI1      2083  2426  2131  1976  2275  1424   750            13065        9542
4 VS2      1697  2470  2201  2347  1643  1169   731            12258       11292
5 VS1       705  1281  1364  2148  1169   962   542             8171        8304
6 VVS2      553   991   975  1443   608   365   131             5066        5422
7 VVS1      252   656   734   999   585   355    74             3655        2808
8 IF         73   158   385   681   299   143    51             1790       53940
# … with abbreviated variable name ¹marginal_color

Note: the marginal_color column lists the color totals for D through J in order, followed by the grand total (53940). Its values do not belong to the clarity in the same row, so the total for color I is the sixth entry, 5422.

[1] 821.3415
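
Under independence, the expected count for a cell is (row total × column total) / grand total. A minimal sketch of that arithmetic, with the three totals read off the table above (the variable names are illustrative):

```r
row_total   <- 8171   # marginal count for clarity VS1
col_total   <- 5422   # marginal count for color I
grand_total <- 53940  # total number of diamonds

# Expected count if clarity and color were independent
expected <- row_total * col_total / grand_total
round(expected, 4)  # 821.3415
```

The observed count for this cell is 962, so the table holds more VS1, color I diamonds than independence would predict.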

What proportion of the Very Good cut diamonds are color I?

0.0997

# A tibble: 5 × 8
  cut            D      E      F      G      H      I      J
  <ord>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Fair      0.0241 0.0229 0.0327 0.0278 0.0365 0.0323 0.0424
2 Good      0.0977 0.0952 0.0953 0.0771 0.0845 0.0963 0.109 
3 Very Good 0.223  0.245  0.227  0.204  0.220  0.222  0.241 
4 Premium   0.237  0.239  0.244  0.259  0.284  0.263  0.288 
5 Ideal     0.418  0.398  0.401  0.433  0.375  0.386  0.319

# A tibble: 5 × 8
# Groups:   cut [5]
  cut           D     E     F     G     H      I      J
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
1 Fair      0.101 0.139 0.194 0.195 0.188 0.109  0.0739
2 Good      0.135 0.190 0.185 0.178 0.143 0.106  0.0626
3 Very Good 0.125 0.199 0.179 0.190 0.151 0.0997 0.0561
4 Premium   0.116 0.169 0.169 0.212 0.171 0.104  0.0586
5 Ideal     0.132 0.181 0.178 0.227 0.145 0.0971 0.0416
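
The answer comes from the second table, where each row sums to 1 (proportions of color within each cut). One way such proportions might be computed is sketched below, assuming dplyr and ggplot2 are installed. This is not necessarily the code that produced the tables above.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Proportion of each color within each cut
color_within_cut <- diamonds %>%
  count(cut, color) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# The cell for Very Good cut and color I
vg_i <- color_within_cut %>%
  filter(cut == "Very Good", color == "I")
```

`vg_i$prop` is the share of Very Good diamonds that are color I.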

Which of the lines below is the mean, median, or mode?

red is median, green is mean
red is mean, green is mode
red is mode, green is mean
d. red is mean, green is median

Is the mean or median a better measure of center in this distribution?

The mean is a better measure of center in a symmetric distribution; the median is better in a skewed distribution.
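
A small illustration with made-up numbers: one extreme value pulls the mean far more than the median.

```r
symmetric <- c(1, 2, 3, 4, 5)
skewed    <- c(1, 2, 3, 4, 50)  # same data with one extreme value

mean(symmetric)    # 3
median(symmetric)  # 3
mean(skewed)       # 12
median(skewed)     # 3
```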

Which plot is best at displaying outliers?

histogram
dotplot
plot of proportions
d. boxplot
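
A boxplot marks as outliers the points lying more than 1.5 times the interquartile range beyond the box. Base R's `boxplot.stats()` applies the same rule and returns those points, as in this sketch on made-up numbers:

```r
x <- c(1, 2, 3, 4, 50)  # hypothetical data with one extreme value

# Points that a boxplot would draw individually as outliers
outliers <- boxplot.stats(x)$out
outliers  # 50
```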