MODULE 3.1

Note: For the pandas modules, you’ll be reading a lot of documentation. This is mostly because pandas is evolving rapidly and the best functions are not covered in most tutorials. Reading documentation is also a good skill to have.

READING:

  1. Modern Pandas, Chp 1

  2. tidyverse - pandas cheatsheet

  3. filter

  4. query

  5. rename

  6. unique

  7. groupby

  8. agg

  9. Modern Pandas, Chp 2 - Method Chaining

  10. Method chaining syntax

INSTALLATION:

For the pandas tutorial, you may wish to install JupyterLab in addition to Jupyter Notebook. Both Jupyter Notebook and JupyterLab can be found here. You might also be interested in VSCode’s Jupyter integration; PyCharm also has a very good one. For now, the pip installation should be fine. In the future, you may wish to use conda, especially if you anticipate that the packages you install will be updated frequently, with functions removed or changed.

A variable inspector is very helpful for data science work. Either use PyCharm Pro or VSCode, both of which have native variable inspectors, or download a plugin for your chosen IDE (Jupyter). I recommend VSCode since it’s free and has many more features, like Git integration.

If you’re already familiar with R’s tidyverse, you could use the cheatsheet here. For the most part, you shouldn’t be choosing between tidyverse’s dplyr and pandas; the real choice is between Python and R. You may want to use pandas if you need to do ML, since that is Python’s strength, but you would want dplyr if you need to run an IV regression: most econometric tools are better in Stata or R. That being said, there are always other packages out there, like datar, which could be better.

MODULE 3.2

READING:

  1. thinkcspy: Chp 12.1-12.3

  2. https://www.dataquest.io/blog/python-dictionaries/

  3. https://stackoverflow.com/questions/22391419/what-is-the-difference-between-curly-brace-and-square-bracket-in-python

SCRIPT AND DATA:

https://drive.google.com/file/d/1B1tm-v1iuZxgW19k2XQLXAD0BHQu9xOJ/view?usp=sharing

TASK:

In the script and data above, you are given a .dta file consisting of child IDs, their race, and their cognitive and non-cognitive test scores.

Question 1:

We’re interested in creating a dictionary where each key is the child’s ID and the values consist of the child’s race, bsl_stdcog, and bsl_stdncog. Read Python1-3data.dta, and convert it into a Python dictionary through pandas. You should get a dictionary with 1736 keys. It should look something like this:

{1096.0: {'race': 'African American', 'bsl_stdcog': 0.2606467306613922, 'bsl_stdncog': -0.019319750368595123}, 1100.0: {'race': 'Hispanic', 'bsl_stdcog': -1.1454229354858398, 'bsl_stdncog': -0.7010607719421387}, 1102.0: {'race': 'Hispa

[…]

stdncog': 0.0}, 4928.0: {'race': 'Hispanic', 'bsl_stdcog': 0.22607022523880005, 'bsl_stdncog': 0.0}, 4931.0: {'race': 'Other', 'bsl_stdcog': -0.5765784382820129, 'bsl_stdncog': 0.0}, 4932.0: {'race': 'Other', 'bsl_stdcog': 1.7583993673324585, 'bsl_stdncog': 0.0}}

(Hints: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
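The pattern the hints point at can be sketched as follows. This is a minimal illustration on a made-up two-row frame (the column names and values here are assumptions; check the actual .dta file):

```python
import pandas as pd

# Hypothetical toy frame standing in for Python1-3data.dta
df = pd.DataFrame({
    "childid": [1096.0, 1100.0],
    "race": ["African American", "Hispanic"],
    "bsl_stdcog": [0.26, -1.15],
    "bsl_stdncog": [-0.02, -0.70],
})

# One row per child, keyed by ID, with the remaining columns as an inner dict
d = df.drop_duplicates(subset="childid").set_index("childid").to_dict("index")
print(d[1096.0]["race"])  # → African American
```

The `"index"` orientation of to_dict() is what produces the nested {ID: {column: value}} shape shown in the expected output.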

In all subsequent questions except the last, your answer to each question may not exceed 120 characters, excluding comments.

We want to produce summary statistics for our paper using this data. We are interested in finding the average cognitive scores for each cohort (a cohort is defined by the first digit in their ID) and the racial categories of the children. Pay close attention to the Dictionary Comprehension part of the second reading.

Question 2:

Using dictionary comprehension, print all racial categories of all children in a single line of code.

Question 2a:

Modify your answer to Question 2 by changing the curly braces to square brackets: you are now transforming a set/dictionary comprehension into a list comprehension. Print the result. What does that tell you about how Python interprets lists vs. sets? Comment your answer in the code.
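The bracket swap in Question 2a can be seen on toy data (the dictionary below is made up, not the assignment data):

```python
races = {1: "Hispanic", 2: "Other", 3: "Hispanic"}

as_set = {v for v in races.values()}   # set comprehension: duplicates collapse
as_list = [v for v in races.values()]  # list comprehension: duplicates and order kept

print(as_set)   # e.g. {'Hispanic', 'Other'} (unordered)
print(as_list)  # → ['Hispanic', 'Other', 'Hispanic']
```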

Question 3:

Using dictionary comprehension, clean the dictionary’s cognitive scores (look for nan) in a single line of code. (Hint: What does {x : y} mean?)

Question 4:

Children in the third year (3xxx) who participated in a special program have ‘5’ as the second digit of their ID. We want to know if these children have better scores than others. Using comprehension, print the combined cognitive score (bsl_stdcog + bsl_stdncog) of each child with IDs 3500 to 3599 in a single line of code. (Should you use list or set/dict comprehension?)

Question 5:

While you are producing summary statistics, another research assistant bursts into the room and tells you that the team forgot to add a child to the data. Add the child with ID 4852: their race is ‘Other’, their cognitive score is -0.5, and their non-cognitive score is 0.5. Do this in a single line of code.

Question 6:

We want to see if above-average children in the first year had a higher cognitive score (bsl_stdcog) than those in the second year. Using comprehension, compute the upper quartile (75th percentile) of first-year cognitive scores and store it in the variable ‘first’. Do this in a single line of code, using numpy’s quantile function. Should you use list or set comprehension?
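The comprehension-plus-quantile pattern looks roughly like this, on a toy dictionary in the shape of Question 1’s output (the IDs and scores below are invented):

```python
import numpy as np

# Toy stand-in for the Question 1 dictionary; keys and values are made up
data = {1096.0: {"bsl_stdcog": 0.26},
        1200.0: {"bsl_stdcog": 0.50},
        2100.0: {"bsl_stdcog": -1.15}}

# A list (not set) comprehension, so duplicate scores are kept, feeding np.quantile
first = np.quantile([v["bsl_stdcog"] for k, v in data.items() if 1000 <= k < 2000], 0.75)
print(first)
```

Note the filter on the key range selects the first-year cohort; only 1096.0 and 1200.0 survive here.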

Question 7:

Do the same for the second cohort.

Question 8:

Print whether the first cohort had a higher cognitive score in the 75th percentile than the second cohort.

Question 9:

The last thing we need to do is to investigate the children with missing races. Using comprehension, produce a list of IDs with missing races and print it. Also export it as a .csv.

Submit your task in a single .py script.

P-Set 3

Remember to document your code well. Submit your .ipynb file in a single compressed folder (.7z, .zip, or .rar).

TASK:

You should use a Jupyter Notebook for p-sets focusing on pandas, with the IDE of your choice.

However, a Jupyter Notebook is inappropriate for a production environment. That includes replication packages for a journal, where other researchers need to reproduce your results, and large teams where code must be shared and internal replicability matters.

Problem 1

Download the data here. Load it (use .read_stata()), setting child as the index, and take a quick look at the first 85 rows of the data, with every row displayed.

Problem 2

Select the ‘child’, ‘year’, ‘test’, ‘cog_pre’, and ‘ncog_pre’ variables and view the data.

Problem 3

Select ‘child’, ‘year’, ‘test’, and all variables ending in _500, _1000, …, _20000 (multiples of 500), and view the data. (Hint: use a combination of regex and a lambda)
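One way the regex-plus-lambda idea can play out is sketched below. The frame and its column names (earn_500 etc.) are pure assumptions for illustration; the real data will differ:

```python
import re
import pandas as pd

# Toy frame standing in for the real data; column names are assumptions
df = pd.DataFrame(columns=["child", "year", "test",
                           "earn_500", "earn_750", "earn_1000", "earn_20000"])

# Lambda: does the column end in a number that is a multiple of 500?
is_mult_500 = lambda c: (m := re.search(r"_(\d+)$", c)) is not None and int(m.group(1)) % 500 == 0

selected = df[["child", "year", "test"] + [c for c in df.columns if is_mult_500(c)]]
print(selected.columns.tolist())
# → ['child', 'year', 'test', 'earn_500', 'earn_1000', 'earn_20000']
```

earn_750 is dropped because 750 is not a multiple of 500; a plain regex alone cannot express "multiple of 500", which is why the lambda is needed.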

Problem 4

Print all unique treatments that children with ID >= 3000 underwent. How is this different from the treatment that children with ID <= 2999 underwent? (Try using method chaining).
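Method chaining means each call returns an object the next call operates on, so the whole query reads top to bottom. A hedged sketch on invented data (the ‘treat’ column name is a guess):

```python
import pandas as pd

# Toy frame; real column names may differ
df = pd.DataFrame({"child": [1001, 3001, 3002],
                   "treat": ["control", "program", "control"]})

# Each method returns a new object, so the steps chain without intermediates
result = (df
          .query("child >= 3000")
          .loc[:, "treat"]
          .unique())
print(result)
```

Wrapping the chain in parentheses lets each method sit on its own line, which is the style Modern Pandas Chp 2 advocates.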

Problem 5

Find the average cognitive score (std_cog), grouped by the test period and the year.
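The groupby-then-aggregate shape is roughly as follows, again on a toy frame whose column names are assumptions based on the prompt:

```python
import pandas as pd

# Invented data; 'test' and 'year' define the groups, per the prompt
df = pd.DataFrame({
    "test": ["pre", "pre", "post", "post"],
    "year": [1, 1, 1, 1],
    "std_cog": [0.2, 0.4, 0.6, 0.8],
})

# Mean cognitive score within each (test, year) cell
means = df.groupby(["test", "year"])["std_cog"].mean()
print(means)
```

The result is a Series with a (test, year) MultiIndex; .agg("mean") would work equally well here.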

Remember that for this p-set, you’ll be submitting the .ipynb file, instead of the .py file, in a compressed file.