MODULE 2.2
READING:
You do not need to understand every single function in the documentation. Make sure you get the gist of what requests can do, and know how to quickly find the functions you need for your tasks.
SCRIPT:
https://drive.google.com/file/d/1uE0ucZIRUV-aZJO1OSl4AF8s-TFSfSgX/view?usp=sharing
TASK:
Part 1. Scrape https://statsapi.mlb.com/api/v1/draft/2022. Use the code provided.
Part 2. Using the generated .csv, print the players who are likely freshmen, based on the blurb and currentage variables. Store these players in freshmen.txt. Each freshman should be written on a new line. Be careful with encoding.
Part 3. Append to freshmen.txt the user's input to the question “Is Gavin Van Kempen in the list” and the number of freshmen. In Part 3 you must not read freshmen.txt; you may only write to it. These two should be written on a new line. (Append means you must not overwrite the existing data in freshmen.txt.)
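The difference between writing and appending can be sketched as follows. This is a minimal illustration, not the provided script: the helper names write_lines and append_lines are hypothetical, and the names passed in are placeholders. The explicit encoding="utf-8" is there because player names may contain accented characters.

```python
def write_lines(path, lines):
    # Mode "w" creates the file or truncates it if it already exists.
    # An explicit encoding keeps accented names intact on every platform.
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def append_lines(path, lines):
    # Mode "a" adds to the end of the file and never reads or
    # overwrites what is already there.
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

For example, write_lines("freshmen.txt", names) followed by append_lines("freshmen.txt", [answer, str(len(names))]) leaves the names in place and adds two more lines after them.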
Submit your task in a .zip or .7z compressed file with the original folder structure.
P-Set 2
Remember to document your code well. Report all results in a LaTeX-rendered PDF document. Submit your .csv file, your .py file and your .PDF file in a single compressed folder (.7z, .zip, or .rar).
TASK:
In this p-set, we strongly recommend using resources such as regex101. Use it to test what you’re doing and build your regexes bit by bit! As a reminder, feel free to work in groups for p-sets.
We want to sort through a large number of UChicago unofficial transcripts and produce a quick summary of each one. To do so, we will read the transcripts with PyPDF2 and use regex to extract the information we need.
Problem 1
Download the folder of transcripts here. These transcripts are anonymized and scrambled, with the grades randomized.
Problem 2
Construct a function that takes a file link as input and returns the transcript’s text data as output. Use the following procedure.
(a) Set up a PdfReader with PyPDF2 that reads each file link.
(b) Extract the text of each page of the read file.
(c) Concatenate the text appropriately.
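Steps (a) to (c) might look like the sketch below. It assumes a PyPDF2 version that provides PdfReader, reader.pages, and page.extract_text(); the function names read_transcript and join_pages are illustrative. Joining pages with a newline is one reasonable choice, so that the last line of one page does not fuse with the first line of the next.

```python
def join_pages(page_texts):
    # Concatenate per-page text with a newline between pages.
    # A page with no extractable text is treated as an empty string.
    return "\n".join(text or "" for text in page_texts)


def read_transcript(path):
    # Imported here so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)  # (a) set up the reader for this file
    pages = (page.extract_text() for page in reader.pages)  # (b) text per page
    return join_pages(pages)  # (c) concatenate
```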
Problem 3
Construct a function that takes a transcript’s text data as input and outputs the undergraduate GPA. Use the following procedure as a hint.
(a) First, consider how GPA is presented in the transcript. How is it formatted?
(b) Next, consider where GPA is presented in the transcript. What is it always next to in the text data?
(c) Use either lookahead or lookbehind to grab the GPA. Do this in a single regex compilation.
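As an illustration of step (c), the sketch below uses a lookbehind so that only the number is matched. The label "Cumulative GPA: " is a made-up assumption, not the real transcript wording; inspect the extracted text to see what actually sits next to the GPA. Note that Python's re module requires a lookbehind to have a fixed width, so variable whitespace cannot go inside it.

```python
import re

# Hypothetical label: replace it with whatever precedes the GPA
# in the extracted transcript text.
GPA_PATTERN = re.compile(r"(?<=Cumulative GPA: )\d\.\d{1,3}")


def find_gpa(text):
    # Return the GPA as a float, or None if the pattern is absent.
    match = GPA_PATTERN.search(text)
    return float(match.group()) if match else None
```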
Problem 4
Construct a function that takes a transcript’s text data as input and outputs all courses the student has taken as a list of tuples, e.g.: [(ECON 10000, A), (ECON 12000, B+), (MATH 15000, IA-)]. Use the following procedure as a hint.
(a) How are course labels (e.g.: ECON 22050) formatted?
(b) How are credits formatted? Why would credits be useful in the regex even if we don’t need that information?
(c) How are grades formatted? Make sure to take into account uncommon grades like IA- (https://registrar.uchicago.edu/records/grading/).
(d) How does whitespace differ across the different PDFs?
(e) Using lookahead/lookbehind and capturing groups, grab all courses a student has taken.
Hint: A setup of three regexes may be ideal: one to grab all the information for a single course (the course code, name, units, grade), then two more to get the course code and the grade.
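The three-regex setup in the hint can be sketched as below. The line layout it assumes (a four-letter department, a five-digit number, a title, three-digit units, then the grade at the end of the line) is a guess for illustration only; the real transcripts may differ, and the patterns will need adjusting. The names COURSE_RE, CODE_RE, GRADE_RE, and find_courses are hypothetical.

```python
import re

# 1) One whole course line: code, title, units, grade (assumed layout).
#    [ \t]+ tolerates runs of spaces or tabs but never crosses a newline.
COURSE_RE = re.compile(
    r"^[A-Z]{4} \d{5}\b.*?\b\d{3}[ \t]+\S+$", re.MULTILINE
)
# 2) The course code at the start of a matched line.
CODE_RE = re.compile(r"^[A-Z]{4} \d{5}")
# 3) The grade: the last run of non-space characters, found via lookbehind.
GRADE_RE = re.compile(r"(?<=\s)\S+$")


def find_courses(text):
    # Return a list of (course code, grade) tuples.
    courses = []
    for line in COURSE_RE.findall(text):
        code = CODE_RE.search(line).group()
        grade = GRADE_RE.search(line).group()
        courses.append((code, grade))
    return courses
```

Matching the units before the grade helps anchor the grade: without it, the last word of a course title could be mistaken for a grade.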
Problem 5
Output a CSV of all parsed transcripts with the following variables:
The student’s name
The student’s undergraduate GPA
Has the student taken ECON 21020 or ECON 21030, Econometrics?
Has the student taken an intro to CS course? (CMSC 12100 or above)
The number of ECON or ECMA courses the student has taken that are numbered 30000 or above.
The average grade the student has in the Elements of Economic Analysis sequence (ECON 20000, 20100, 20200 or ECON 20010, 20110, 20210)
The average grade the student has in ECON classes
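Two of the variables above can be derived from the list of (course code, grade) tuples, as in this sketch. The grade-point mapping is an assumption that should be checked against the registrar's grading page, and skipping grades that are not in the mapping (such as P, W, or IA-) is one possible choice, not a requirement of the p-set. The function names are illustrative.

```python
# Assumed grade points; verify against the registrar's grading page.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D+": 1.3, "D": 1.0, "F": 0.0,
}


def average_grade(courses, wanted_codes):
    # Mean grade points over the wanted courses. Grades missing from
    # GRADE_POINTS are skipped. Returns None when nothing qualifies.
    points = [
        GRADE_POINTS[grade]
        for code, grade in courses
        if code in wanted_codes and grade in GRADE_POINTS
    ]
    return sum(points) / len(points) if points else None


def count_at_or_above(courses, departments, floor):
    # Count courses in the given departments numbered floor or higher.
    count = 0
    for code, _grade in courses:
        department, number = code.split()
        if department in departments and int(number) >= floor:
            count += 1
    return count
```

For instance, count_at_or_above(courses, {"ECON", "ECMA"}, 30000) gives the number of ECON or ECMA courses numbered 30000 or above.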
Remember to submit the .csv, .py, and .PDF file together in a compressed file.