MODULE 2

READING

Pivoting, separating, and uniting: https://r4ds.had.co.nz/tidy-data.html#pivoting (Chp 12)
Joins: https://r4ds.had.co.nz/relational-data.html (Chp 13)
Strings and regex: https://r4ds.had.co.nz/strings.html (Chp 14)
Regex: https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html (Chp 17)
Lookahead and Lookbehind: https://bookdown.org/Maxine/tidy-text-mining/looking-ahead-and-back.html (Chp A.3)
Regex tester: https://regex101.com/

P Set 1

Remember to document your code well. Report all results in a LaTeX-rendered PDF document. Submit your .R file and your .PDF file in a single compressed folder (.7z, .zip, or .rar).

TASK:

We are interested in understanding the class choices, careers, and lifetime trajectories of UChicago students and alumni, and we have a large amount of data on them. Note that all data used in this p-set are artificially created. This p-set uses several datasets, which you can download here.

classes.csv contains a raw scraped record of the classes that UChicago students have taken.

major_codes.csv contains a list of departments at UChicago and their associated 4-letter codes.

courses.csv contains all courses offered by UChicago.

Each alumni_[year].csv file contains, for one year, a record of UChicago alumni: their income, the type of career they are pursuing, and when they answered the survey. The data span 15 years.

addresses.csv contains a record of UChicago alumni, with a history of their previous addresses and their most recent address.

counties.csv contains a record of US counties and a breakdown of the types of careers that are most common in each county.

In this problem set, you will investigate whether UChicago students pursue careers related to the classes they took, and whether a student’s class choices can predict their career.

Problem 1

Open the databases in the p-set and propose a concrete hypothesis that you wish to test using the data.

Problem 2

classes.csv contains raw data that will need to be cleaned. Using regex, clean the raw data of classes.csv so that each observation in the resulting database contains the student’s name, the quarter, the class they took (the class name, the course department code (e.g. “SOSC”), and the course number (e.g. “14100”)), the units of the class, and the student’s grade. Very briefly describe the data in the style of an academic paper (around one sentence).
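
As a starting point, the sketch below shows how capture groups can split one raw record into the fields listed above. The raw line layout, the pattern, and the column names are all hypothetical; the actual format of classes.csv may differ, so inspect the file and adapt the pattern before relying on it.

```r
library(stringr)

# Hypothetical raw records. The real layout of classes.csv may differ.
raw <- c(
  "Jane Doe | Autumn 2015 | SOSC 14100: Mind I (100 units) - A-",
  "John Roe | Winter 2016 | ECON 20000: Economic Analysis I (100 units) - B+"
)

# One capture group per field:
# name | quarter | DEPT NUMBER: class name (N units) - grade
pattern <- paste0(
  "^(.+?) \\| (\\w+ \\d{4}) \\| ",
  "([A-Z]{4}) (\\d{5}): (.+) ",
  "\\((\\d+) units\\) - ([A-F][+-]?)$"
)

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 onward are the capture groups in order.
m <- str_match(raw, pattern)

classes_clean <- data.frame(
  student    = m[, 2],
  quarter    = m[, 3],
  dept       = m[, 4],
  number     = m[, 5],
  class_name = m[, 6],
  units      = as.integer(m[, 7]),
  grade      = m[, 8],
  stringsAsFactors = FALSE
)
```

Rows that fail to match come back as NA in every column, which makes malformed records easy to find with is.na(). Testing the pattern on a few raw lines at regex101.com before running it on the full file can save time.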

Problem 3

We want to merge the cleaned classes database with the alumni databases.

(a) First, merge all alumni databases into one single database.

(b) Next, pivot the classes database so that each student is its own observation.

(c) Do the same for the alumni data: pivot it so that each alumnus is its own observation.

(d) How would you join the two databases? Briefly discuss what kind of join you would be using.

(e) Finally, join the pivoted class and alumni databases.

(f) Briefly discuss why we needed to pivot the databases. Is the final result long or wide data? Is it possible or optimal to do your analysis in the other data shape?
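
The mechanics of steps (a) through (e) can be illustrated on toy data. Everything in the sketch below is an assumption made for illustration: the tables, the column names (name, income, course, grade), and the choice of a left join. The real files will have more columns and may call for a different key or join type, which is what part (d) asks you to argue.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tables; column names are assumptions, not the p-set schema.
alumni_2010 <- tibble(name = c("Jane Doe", "John Roe"), income = c(50000, 42000))
alumni_2011 <- tibble(name = c("Jane Doe", "John Roe"), income = c(55000, 45000))

# (a) Stack the yearly tables; .id stores each list name in a "year" column.
alumni_long <- bind_rows(
  list(`2010` = alumni_2010, `2011` = alumni_2011),
  .id = "year"
)

# (c) One row per alumnus: one income column per year.
alumni_wide <- alumni_long %>%
  pivot_wider(names_from = year, values_from = income, names_prefix = "income_")

classes <- tibble(
  name   = c("Jane Doe", "Jane Doe", "John Roe"),
  course = c("SOSC_14100", "ECON_20000", "ECON_20000"),
  grade  = c("A-", "B", "B+")
)

# (b) One row per student: one grade column per course.
classes_wide <- classes %>%
  pivot_wider(names_from = course, values_from = grade, names_prefix = "grade_")

# (e) A left join keeps every student, whether or not they answered the survey.
merged <- left_join(classes_wide, alumni_wide, by = "name")
```

In this toy example, courses a student never took appear as NA after pivoting. Joining on name alone is only safe if names are unique, which is worth checking in the real data.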

Problem 4

Merge the addresses database in. Discuss any issues you came across and any decisions you had to make.

Problem 5

Merge the counties database in. Discuss any issues you came across and any decisions you had to make.

Problem 6

Is major_codes.csv necessary? Discuss any decision you make regarding this dataset.

Problem 7

Analyze the data and test the hypothesis you proposed in Problem 1 (a two-sided t-test or a simple regression is sufficient). Briefly discuss your findings.
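
For reference, a two-sided t-test and a simple regression can both be run in base R. The sketch below uses simulated numbers and a hypothetical grouping (whether a student took an economics class); neither the variables nor the hypothesis comes from the p-set data, so substitute the variables that match your own hypothesis.

```r
set.seed(1)

# Simulated toy data: income for two hypothetical groups of alumni.
df <- data.frame(
  took_econ = rep(c(1, 0), each = 30),
  income    = c(rnorm(30, mean = 60000, sd = 10000),
                rnorm(30, mean = 55000, sd = 10000))
)

# Two-sided Welch t-test of equal mean income across the two groups.
tt <- t.test(income ~ took_econ, data = df, alternative = "two.sided")

# The same comparison as a simple regression: the slope on took_econ
# is the difference in group means.
fit <- lm(income ~ took_econ, data = df)

print(tt)
print(summary(fit))
```

The two approaches will give slightly different standard errors here, because t.test() does not assume equal variances by default while lm() does.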