MODULE 3.1
READING:
For the task, download the .py script here, complete it, and submit it.
TASK:
Question 1:
Create the variable ‘c’ that adds ‘a’ and ‘b’.
Print the variable ‘c’.
Question 2:
Create the variable ‘d’ that takes the square root of ‘c’. (Hint)
Print the variable ‘d’.
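As a minimal sketch of taking a square root, assuming the standard-library math module is acceptable for this task. The names x and y and the value 16 are illustrative only and do not come from the provided script.

```python
import math

# Illustrative value only; the assignment script defines its own variables.
x = 16
y = math.sqrt(x)  # math.sqrt always returns a float
print(y)
```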
Question 3:
Create the variable ‘book_title’ that combines ‘word_1’, ‘word_2’, …, ‘word_5’ and ‘space’.
Print the variable ‘book_title’.
Question 4:
Create the variable ‘length_of_book_title’ that gets the length of the string variable ‘book_title’.
Print the variable ‘length_of_book_title’.
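The string operations in Questions 3 and 4 can be sketched on made-up words. The names first, second, gap, greeting, and greeting_length are hypothetical and are not the ones defined in the script.

```python
# Hypothetical words, not the ones defined in the assignment script.
first = "Hello"
second = "World"
gap = " "

greeting = first + gap + second   # '+' concatenates strings in order
greeting_length = len(greeting)   # len() counts every character, spaces included
print(greeting)
print(greeting_length)
```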
In Questions 5 and 6, you will need to look up relevant Python functions and operators on Google.
Question 5:
The script creates the variables ‘adjacent’ and ‘opposite’ representing the adjacent and opposite sides of a right triangle. Create the variable ‘pythagoras’ that computes the length of the hypotenuse. (Pythagorean Theorem)
Print the variable ‘pythagoras’.
Question 6:
Create the variable ‘angle’ that computes (1) the angle between a side of length ‘adjacent’ and the hypotenuse (‘pythagoras’) on a right triangle, (2) in degrees, and (3) rounded to the nearest integer. The correct trigonometric function is arctan. The image below is an illustration.
Print the variable ‘angle’.
Verify that your answers to Questions 5 and 6 are accurate by running the script with arbitrary side lengths as test inputs. You may find WolframAlpha useful for this purpose.
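One way to sanity-check this kind of computation is a 3-4-5 right triangle, whose hypotenuse is known in advance. The sketch below assumes the standard-library math module; the variable names are illustrative and differ from those in the script, and math.hypot is only one of several ways to apply the Pythagorean Theorem.

```python
import math

# Illustrative 3-4-5 right triangle; the script supplies its own side lengths.
side_adjacent = 3.0
side_opposite = 4.0

# Hypotenuse: sqrt(adjacent**2 + opposite**2).
hypotenuse = math.hypot(side_adjacent, side_opposite)

# Angle between the adjacent side and the hypotenuse: arctan(opposite / adjacent).
angle_radians = math.atan(side_opposite / side_adjacent)

# math.atan returns radians, so convert to degrees before rounding.
angle_degrees = round(math.degrees(angle_radians))

print(hypotenuse)
print(angle_degrees)
```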
Submit the Python script, in .py format.
P-Set 3
Remember to document your code well. Report all results in a LaTeX-rendered PDF document. Submit your .py file and your .PDF file in a single compressed folder (.7z, .zip, or .rar).
TASK:
We’re interested in finding some summary statistics of the data.
Problem 1:
(a) Download the CHECC data here and the code here. Ensure they are located in the same folder. Install the package pandas in order to run the code. pandas is a package that allows us to inspect, manipulate, and analyze data. This is the same data from P-Set 2.
(b) list_data is a list of lists. Which of the following expressions gives the number of observations (rows) in the data, and which gives the number of variables (columns)? Explain why. You may find it useful to print list_data to see its structure.
len(list_data)
len(list_data[0])
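To see how len() behaves on a nested list, here is a small sketch on made-up data. toy_data is hypothetical and is not the CHECC data.

```python
# Hypothetical toy data: each inner list is one record.
toy_data = [
    [1001, "a", 0.5, 1],
    [1002, "b", 0.7, 0],
    [1003, "c", 0.1, 1],
]

n_outer = len(toy_data)      # counts the inner lists
n_inner = len(toy_data[0])   # counts the items inside the first inner list
print(n_outer, n_inner)
```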
Problem 2:
We first wish to clean the data and remove all children with invalid treatment data. You cannot use a for loop or if/else except within a list comprehension for this problem.
(a) Using the in operator, valid_treatments, and a list comprehension, construct a list of booleans, treatment_vars, where each element indicates whether the corresponding value in the first row of the data is in valid_treatments.
(b) The any function returns True if a list of booleans contains at least one True, and False if it contains none. Using any, determine whether the first row of the data has a valid treatment value.
(c) Repeat Problem 2(b), but do not reference the treatment_vars variable this time (though you may reference valid_treatments).
(d) Create a new list of lists, called list_data_cleaned, that equals list_data, but build it using a list comprehension with a for clause (do not write list_data_cleaned = [list_data] or list_data_cleaned = list_data).
(e) Using your insights from Problems 2(c) and 2(d), create a new list of lists, called list_data_cleaned_2, that preserves only the rows/observations with at least one column having a valid treatment value. (Hint: remember that you can use an if in a list comprehension).
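The building blocks named in this problem (in, any, and a list comprehension with an if) can be sketched on made-up data. Everything below is hypothetical: allowed stands in for the role valid_treatments plays, and rows is not the CHECC data.

```python
# Hypothetical toy data, not the problem-set data.
allowed = ["red", "green"]
rows = [
    [1, "red", "blue"],
    [2, "blue", "blue"],
    [3, "green", "red"],
]

# One boolean per value in the first row: is the value in 'allowed'?
flags = [value in allowed for value in rows[0]]

# any() accepts any iterable of booleans, including a generator expression.
first_row_ok = any(value in allowed for value in rows[0])

# A trailing 'if' in a list comprehension keeps only the rows that pass the test.
kept = [row for row in rows if any(value in allowed for value in row)]

print(flags)
print(first_row_ok)
print(kept)
```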
Problem 3:
We next wish to delete all children with the “kinderprep” treatment, as this treatment is a 3-month pre-K program, unlike all other treatments, which were at least 9 months long. You cannot use a for loop or if/else except within a list comprehension for this problem.
(a) Using a similar procedure as in Problem 2, construct a new list of lists that is both cleaned and removes all children with at least one “kinderprep” treatment during any year. (You can simply use the equality operator ‘==’, since this time we’re only comparing against one value).
(b) How many children have valid treatment data and do not belong in kinderprep? How does this compare to the entire data? By deleting children who belong in kinderprep, will we cause a sample bias? (True, False, or Uncertain). If uncertain, under what conditions will a sample bias occur?
Problem 4:
We now want to find the average non-cognitive score of our non-kinderprep data. Note that pandas has converted all NA values into nan. The numpy function np.nanmean() allows us to compute the mean and automatically ignore nan values.
(a) Knowing that child 1096 (the first row) has an ncog_pre score of 0.537742257, find the index of ncog_pre.
(b) Using list comprehension and np.nanmean(), compute the average non-cognitive score of the non-kinderprep (no-KP) data.
(c) Compute the average non-cognitive score of kinderprep-treated children.
(d) Is there a sample bias in the non-cognitive score? Now assume that there is one. Will a sample bias in the non-cognitive score affect our analysis even if the analysis does not use the non-cognitive scores?
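The difference between np.nanmean() and a plain mean can be seen on a small made-up column. toy_rows and its layout are hypothetical and do not reflect the CHECC column order.

```python
import numpy as np

# Hypothetical rows: [id, group, score]; np.nan marks a missing score.
toy_rows = [
    [1, "x", 0.5],
    [2, "x", np.nan],
    [3, "y", 1.5],
]

# Pull one column out of the list of lists.
scores = [row[2] for row in toy_rows]

mean_ignoring_nan = np.nanmean(scores)  # averages 0.5 and 1.5 only
mean_with_nan = np.mean(scores)         # a single nan makes the plain mean nan

print(mean_ignoring_nan)
print(mean_with_nan)
```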
Problem 5:
We now go back to the original list_data, but we need to do some cleaning. Do not assign the cleaned result to a new list for the following steps. You should modify list_data directly.
(a) A research assistant has expressed concern that kinderprep, a treatment that should only occur in cohorts 3 and 4 (indicated by the first digit of the child's ID), is also occurring in cohorts 1 and 2. Print the 2011 treatment (index 5) of child 1102 (index 2) to verify that this is an issue.
(b) Child 2004 has missing data that needs to be inserted. We know that their treatment in 2012 and sibling info are ["preschool", 0, 0], but we do not know the rest of the information. Copy the known list into the code as known_2004_info. Then create a list of 14 np.nan (tip: use the star operator), and add this to known_2004_info. (Be careful: you should not get a nested list).
(c) Find the index of child 2004 (Hint: use .index() on a constructed list using list comprehension). Then insert the known info into index 8 of that row (Hint: use slicing instead of append).
(d) Print child 2004’s data to verify that you have done this correctly (you should not get nested lists within the row).
(e) The insertion caused child 2004 to gain extra nan’s at the end of the list. Remove these. (Hint: the resulting child 2004’s row should have the same length as the first row).
(f) Child 2009 has an extra gender value (they have two “male” values). Remove one of these. (Hint: (e) and (f) should be using two different functions).
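The list operations hinted at in this problem (the star operator, .index(), slice assignment, del, and remove) can be sketched on a made-up table. All names and values below are hypothetical, and None is used in place of np.nan only to keep the sketch free of imports.

```python
# Hypothetical toy rows: [id, value, value, value].
table = [
    [10, "a", "b", "c"],
    [20, "a", "a", "b", "c"],   # one duplicated "a"
    [30, None, None, None],     # all values missing
]

padding = [None] * 2            # star operator repeats the list's contents
patch = ["a"] + padding         # '+' joins lists flat, so nothing is nested

ids = [row[0] for row in table]
pos = ids.index(30)             # position of the row whose id is 30

table[pos][1:1] = patch         # slice assignment inserts the items at index 1
del table[pos][len(table[0]):]  # trim the row back to the length of the first row
table[1].remove("a")            # remove() deletes only the first matching value

print(table)
```

Every step above changes table in place; no cleaned copy is assigned to a new name.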
Problem 6:
Finally, we want to know the percentage of each treatment in 2010 (index 2) among children whose ID has first digit 1. Compute, using list comprehension, the count of children assigned to each treatment and its percentage out of the number of children with first digit 1. Do not do this by referencing the treatments manually. Do not use valid_treatments, as it is different from the treatments present in 2010. Use the following procedure as a hint:
(a) Create a list consisting of only children with IDs 1000 to 1999, using list comprehension. From here you can obtain the number of children with first digit 1.
(b) How is a set different from a list? How can this help you in obtaining the treatments present in 2010 in children 1000 to 1999?
(c) Using the for loop provided in the problem set code, create, for each treatment, the list that keeps only the rows whose value at index 2 equals that treatment.
(d) What do you observe with the treatments? Is there anything unusual? Comment briefly.
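The general pattern of filtering by ID range, collecting distinct labels with a set, and counting each label can be sketched on made-up data. The rows, labels, and column positions below are hypothetical, IDs are assumed to be integers, and the loop here is not the one provided in the problem set code.

```python
# Hypothetical toy rows: [id, label]; not the CHECC data.
toy = [
    [1001, "red"],
    [1002, "blue"],
    [1003, "red"],
    [2001, "green"],
    [1004, "red"],
]

# Keep only rows whose id falls between 1000 and 1999.
subset = [row for row in toy if 1000 <= row[0] <= 1999]

# A set drops duplicates, so each label present appears exactly once.
labels = sorted(set(row[1] for row in subset))

counts = {}
for label in labels:
    matches = [row for row in subset if row[1] == label]
    counts[label] = len(matches)
    share = 100 * len(matches) / len(subset)
    print(label, len(matches), share)
```

Sorting the set is optional; it only makes the printed order reproducible, since sets are unordered.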