MODULE 1
READING
Pivoting, separating, and uniting: https://r4ds.had.co.nz/tidy-data.html#pivoting (Chp 12.3-12.4)
Filter, mutate, summarise: https://r4ds.had.co.nz/transform.html#filter-rows-with-filter (Chp 5.2, 5.5-5.6)
ggplot: https://r4ds.had.co.nz/data-visualisation.html#first-steps (Chp 3)
DATA TASK 1
This data task is a refresher on some basic operations in R’s tidyverse before we move on to the intermediate modules. It is based on the R data task from the Winter 2023 application cycle.
The tutorials below explain some of the operations you will need for the data task.
Data Task 1
Remember to document your code well. Report all results in a LaTeX-rendered PDF document. Submit your .R file and your .PDF file in a single compressed folder (.7z, .zip, or .rar).
TASK:
Question 1:
At the start of your script, set seed to 725.
For the remaining questions, make sure your answers appear in the correct order within the script and that the whole script can be run from beginning to end in one go.
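As a small illustration of why the seed matters, the sketch below shows that resetting the same seed reproduces the same random draws. The names first_run and second_run are hypothetical and used only for this example.

```r
# Fix the random number generator state once, at the top of the script.
set.seed(725)
first_run <- runif(3)

# Resetting the same seed reproduces exactly the same draws.
set.seed(725)
second_run <- runif(3)

identical(first_run, second_run)
```

In the actual script the seed would normally be set only once, so that every later random draw depends on the order in which the questions are run.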
Question 2:
Construct a vector of length 5000 called possibility_of_brushing, sampled from the Beta distribution with shape1 and shape2 parameters set as 1.2 and 0.7 respectively.
Question 3:
Construct a vector of length 5000 called ID, going from 1 to 5000.
Question 4:
Join the two vectors into a dataframe object.
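One possible way to approach Questions 2 to 4 in base R is sketched below. The names n_people and brushing are hypothetical; the task only fixes the names of the two vectors.

```r
set.seed(725)

n_people <- 5000  # hypothetical helper name

# Beta(1.2, 0.7) draws always lie between 0 and 1.
possibility_of_brushing <- rbeta(n_people, shape1 = 1.2, shape2 = 0.7)

# Integer sequence 1, 2, ..., n_people.
ID <- seq_len(n_people)

# data.frame() keeps the vector names as column names.
brushing <- data.frame(ID, possibility_of_brushing)  # hypothetical name

head(brushing)
```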
Question 5:
Create and fill in a new column called day_1 on the dataframe based on possibility_of_brushing. For this column, randomly sample from the Bernoulli distribution with probability equal to possibility_of_brushing. (You can use rbinom).
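A Bernoulli draw is a binomial draw with size = 1, and rbinom accepts a vector of probabilities, so one call can produce one draw per row. A minimal sketch on a toy probability vector (p and draws are hypothetical names):

```r
set.seed(725)

# Toy probabilities; in the task these would come from possibility_of_brushing.
p <- c(0, 0.25, 0.75, 1)

# One Bernoulli draw per element of p: size = 1 gives 0/1 outcomes.
draws <- rbinom(n = length(p), size = 1, prob = p)

draws
```

The first element is always 0 and the last is always 1, because their probabilities are 0 and 1.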
Question 6:
Create and fill in a new column called month_1 on the dataframe based on possibility_of_brushing. For this column, randomly sample from the Bernoulli distribution 29 times with probability equal to possibility_of_brushing. Format the output as a string separated by dashes. For example: “1-0-0-1-1-0-0-…-1”.
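One way to build such a string, sketched here on three toy probabilities, is to draw 29 values per person and collapse them with paste. The names p, n_days, and month_string are hypothetical.

```r
set.seed(725)

p <- c(0.2, 0.9, 0.5)  # toy probabilities, one per person
n_days <- 29           # number of draws per person

# For each probability, draw n_days Bernoulli values and join them with dashes.
month_string <- vapply(
  p,
  function(prob) paste(rbinom(n_days, size = 1, prob = prob), collapse = "-"),
  character(1)
)

month_string
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one string per person.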
Question 7:
Create and fill in the columns day_2 to day_30 by splitting month_1 into 29 new columns. Do not remove the original month_1 column. At least 1 point is awarded if you create the day columns non-manually (i.e. do not reference day_2, day_3, … etc.).
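The splitting step can be sketched with tidyr's separate on a shortened toy example (three days instead of 29). The data frame toy and the vector day_names are hypothetical; the point is that the column names are generated with paste0 rather than typed out.

```r
library(tidyr)

# Toy data: each string holds three dash-separated draws.
toy <- data.frame(ID = 1:2, month_1 = c("1-0-1", "0-0-1"))

# Generate the new column names programmatically.
day_names <- paste0("day_", 2:4)

toy_wide <- separate(
  toy,
  month_1,
  into = day_names,
  sep = "-",
  remove = FALSE,  # keep the original month_1 column
  convert = TRUE   # turn the "0"/"1" strings into integers
)

toy_wide
```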
Question 8:
Generate a new column called average_brushing as the mean of day_1 to day_30. At least 1 point is awarded if you reference the day columns non-manually (i.e. do not reference day_2, day_3, … etc.).
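A row-wise mean over columns selected by name pattern can be sketched in base R as follows. The toy data frame and the helper day_cols are hypothetical; a tidyverse selection helper such as starts_with would be an alternative.

```r
# Toy data with three day columns instead of thirty.
toy <- data.frame(
  ID = 1:2,
  day_1 = c(1, 0),
  day_2 = c(1, 0),
  day_3 = c(0, 0)
)

# Select every column whose name is "day_" followed by digits.
day_cols <- grepl("^day_[0-9]+$", names(toy))

# Mean across the selected columns, one value per row.
toy$average_brushing <- rowMeans(toy[, day_cols])

toy
```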
Question 9:
Suppose a researcher fumbles and average_brushing now contains measurement error. Add noise drawn from N(0, 0.1), i.e. the normal distribution with mean 0 and standard deviation 0.1, to average_brushing. Then bound average_brushing between 0 and 1 by taking the maximum and minimum as appropriate (for example, a value of 1.23 after adding noise is capped at 1, and a value of -0.15 is capped at 0).
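Adding noise and capping the result can be sketched with rnorm and the element-wise functions pmax and pmin. The names noise_sd, noisy, clamped, and capped_example are hypothetical, and the value of noise_sd is an assumption for illustration only.

```r
set.seed(725)

average_brushing <- c(0.02, 0.50, 0.98)  # toy values
noise_sd <- 0.1                          # assumed value, for illustration only

# Add mean-zero normal noise to each value.
noisy <- average_brushing + rnorm(length(average_brushing), mean = 0, sd = noise_sd)

# pmax() caps from below at 0, pmin() caps from above at 1.
clamped <- pmin(pmax(noisy, 0), 1)

# The two examples from the question, plus an interior value that is unchanged.
capped_example <- pmin(pmax(c(1.23, -0.15, 0.4), 0), 1)
capped_example  # 1.0 0.0 0.4
```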
Question 10:
Generate the following columns:
toothbrush_color, which equals ‘B’ with probability equal to the formula: 0.5 + ((average_brushing - 0.5) / 2), and ‘W’ otherwise. (Use Bernoulli)
gender, randomly being ‘F’ or ‘M’ with probability 0.5. (Use Bernoulli)
income, randomly going from 0 to 10000, using the Uniform distribution.
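The three columns can be sketched on toy values as below. The helper p_blue is a hypothetical name for the probability given by the formula above; ifelse maps the 0/1 Bernoulli draws to labels.

```r
set.seed(725)

average_brushing <- c(0, 0.5, 1)  # toy values
n <- length(average_brushing)

# Probability of 'B' from the formula: maps [0, 1] onto [0.25, 0.75].
p_blue <- 0.5 + ((average_brushing - 0.5) / 2)

# Bernoulli draw per person, then relabel 1 as 'B' and 0 as 'W'.
toothbrush_color <- ifelse(rbinom(n, size = 1, prob = p_blue) == 1, "B", "W")

# Fair coin flip per person.
gender <- ifelse(rbinom(n, size = 1, prob = 0.5) == 1, "F", "M")

# Continuous uniform draw between 0 and 10000.
income <- runif(n, min = 0, max = 10000)
```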
Question 11:
Create a new dataframe that summarizes the mean of average_brushing by toothbrush color and gender. You should have 4 values.
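A grouped summary of this kind can be sketched with dplyr's group_by and summarise. The toy data and the names group_means and mean_brushing are hypothetical.

```r
library(dplyr)

# Toy data with two observations in each color-by-gender cell.
toy <- data.frame(
  toothbrush_color = c("B", "B", "W", "W", "B", "W", "B", "W"),
  gender           = c("F", "M", "F", "M", "F", "M", "M", "F"),
  average_brushing = c(0.8, 0.6, 0.4, 0.2, 0.6, 0.4, 0.8, 0.2)
)

# One row per combination of toothbrush color and gender.
group_means <- toy %>%
  group_by(toothbrush_color, gender) %>%
  summarise(mean_brushing = mean(average_brushing), .groups = "drop")

group_means
```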
Question 12:
Create a new column, high_income, that is equal to 1 if the person has median or above income in the artificial dataframe. You cannot do this by referencing a global environment variable created just for this question or referencing a specific number.
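Because the median can be computed inside mutate, no intermediate variable or hard-coded number is needed. A minimal sketch on toy incomes (the data frame toy is hypothetical):

```r
library(dplyr)

toy <- data.frame(income = c(100, 400, 250, 900, 700))  # toy values, median 400

# Compare each income with the median computed in the same call.
toy <- toy %>%
  mutate(high_income = as.integer(income >= median(income)))

toy$high_income  # 0 1 0 1 1
```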
Question 13:
Create a new dataframe that converts the existing dataframe to a long format, by each day. Refer to the picture below for what it should look like (we’re converting from what it looks like at the top of the picture to what it looks like at the bottom of the picture; days 9 to 30 are removed for clarity in the picture).
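The wide-to-long conversion can be sketched with tidyr's pivot_longer. The output column names day and brushed are assumptions chosen for illustration and may differ from the layout the task expects.

```r
library(tidyr)

# Toy wide data: one row per person, one column per day.
toy <- data.frame(
  ID = 1:2,
  day_1 = c(1, 0),
  day_2 = c(0, 0),
  day_3 = c(1, 1)
)

# One row per person-day; the "day_" prefix is stripped from the names.
toy_long <- pivot_longer(
  toy,
  cols = starts_with("day_"),
  names_to = "day",
  names_prefix = "day_",
  values_to = "brushed"
)

# The day index arrives as text, so convert it to an integer.
toy_long$day <- as.integer(toy_long$day)

toy_long
```

Columns that are not pivoted, such as ID, are repeated on every row that belongs to the same person.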
Question 14:
Based on the old dataframe (before Question 12), create a scatterplot of average brushing (Y-axis) against the possibility of brushing (X-axis), with clear axis labels, a title, a linear best-fit line (OLS), and a clearly visible 99% confidence interval.
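A scatterplot with an OLS line and a 99% confidence band can be sketched in ggplot2 as follows. The data are simulated toy values and the label text is only a placeholder; geom_smooth's level argument sets the confidence level.

```r
library(ggplot2)

set.seed(725)

# Simulated toy data standing in for the real columns.
toy <- data.frame(possibility_of_brushing = runif(200))
toy$average_brushing <- pmin(
  pmax(toy$possibility_of_brushing + rnorm(200, sd = 0.1), 0),
  1
)

p <- ggplot(toy, aes(x = possibility_of_brushing, y = average_brushing)) +
  geom_point(alpha = 0.5) +
  # method = "lm" fits OLS; level = 0.99 widens the band from the default 95%.
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99) +
  labs(
    title = "Average brushing against possibility of brushing",
    x = "Possibility of brushing",
    y = "Average brushing"
  )

p
```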
Question 15:
Create a second scatterplot based on the one above in which each point has a clearly different color depending on the toothbrush color, adjusted so that the difference in the densities of the toothbrush colors is visible (Hint: use transparency). Add a linear best-fit line for each toothbrush color subgroup (two lines in total), with a clearly different color for each line and a 99% confidence interval for each line (the intervals can share a color). Ensure that the axis labels and title are clear, and that the legend title is formal (do not keep it as “toothbrush_color”). Do not do this by creating two different plots - that defeats the purpose.
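Mapping a grouping variable to colour in the top-level aes makes ggplot2 draw both the points and a separate fitted line per group. The sketch below uses simulated toy data; the specific colours and label text are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(725)

# Simulated toy data standing in for the real columns.
toy <- data.frame(possibility_of_brushing = runif(300))
toy$average_brushing <- pmin(
  pmax(toy$possibility_of_brushing + rnorm(300, sd = 0.1), 0),
  1
)
toy$toothbrush_color <- ifelse(
  rbinom(300, size = 1, prob = 0.5 + ((toy$average_brushing - 0.5) / 2)) == 1,
  "B", "W"
)

p <- ggplot(
  toy,
  aes(x = possibility_of_brushing, y = average_brushing, colour = toothbrush_color)
) +
  # Transparency lets overlapping points show where each group is dense.
  geom_point(alpha = 0.3) +
  # One OLS line per colour group; a fixed fill gives both bands the same shade.
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99, fill = "grey60") +
  # A named legend title replaces the raw column name.
  scale_colour_manual(
    name = "Toothbrush color",
    values = c(B = "steelblue", W = "darkorange")
  ) +
  labs(
    title = "Average brushing against possibility of brushing, by toothbrush color",
    x = "Possibility of brushing",
    y = "Average brushing"
  )

p
```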