| path | /part-7/4-data-processing |
|---|---|
| title | Data processing |
| hidden | false |
After this section
- You will know how to use a module to process CSV files
- You will know how to use a module to process JSON files
- You will be able to retrieve and read files from the internet
CSV is such a simple format that so far we have accessed CSV files with hand-written code. There is, however, a ready-made module in the Python standard library for working with CSV files: csv. It works like this:
import csv
with open("test.csv") as my_file:
    for line in csv.reader(my_file, delimiter=";"):
        print(line)
The above code reads all lines in the CSV file test.csv, separates the contents of each line into a list using the delimiter ;, and prints each list. So, assuming the contents of the file are as follows:
012121212;5
012345678;2
015151515;4
The code would print out this:
['012121212', '5']
['012345678', '2']
['015151515', '4']
Since the CSV format is so simple, what's the use of having a separate module when we can just as well use the split function? Well, for one, the module also works correctly when the values in the file are quoted strings which themselves contain the delimiter character. If some line in the file looked like this
"aaa;bbb";"ccc;ddd"
the above code would produce this:
['aaa;bbb', 'ccc;ddd']
Using the split function would also split within the strings, which would likely break the data, and our program in the process.
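The difference can be checked with a small sketch. It assumes the quoted line from above is held in an in-memory text buffer (io.StringIO) instead of a real file, and the variable names are only illustrative:

```python
import csv
import io

# An in-memory stand-in for a file containing the quoted line from above
sample = io.StringIO('"aaa;bbb";"ccc;ddd"\n')

# Naive approach: split the raw text at every delimiter
line_text = sample.getvalue().strip()
naive_parts = line_text.split(";")

# csv.reader respects the quotation marks around each value
sample.seek(0)
csv_parts = next(csv.reader(sample, delimiter=";"))

print(naive_parts)  # ['"aaa', 'bbb"', '"ccc', 'ddd"']
print(csv_parts)    # ['aaa;bbb', 'ccc;ddd']
```

The split version cuts each value in two and leaves the quotation marks in place, while csv.reader returns the two intended values.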
CSV is just one of many machine-readable data formats. JSON is another, and it is used often when data has to be transferred between applications.
JSON files are text files with a strict format, which is perhaps a little less accessible to the human eye than the CSV format. The following example uses the file courses.json, which contains information about some courses:
[
{
"name": "Introduction to Programming",
"abbreviation": "ItP",
"periods": [1, 3]
},
{
"name": "Advanced Course in Programming",
"abbreviation": "ACiP",
"periods": [2, 4]
},
{
"name": "Database Application",
"abbreviation": "DbApp",
"periods": [1, 2, 3, 4]
}
]
The structure of a JSON file might look quite familiar to you by now. The JSON file above looks exactly like a Python list, which contains three Python dictionaries.
The standard library has a module for working with JSON files: json. The function loads takes a string in JSON format and transforms it into a Python data structure. So, processing the courses.json file with the code below
import json
with open("courses.json") as my_file:
    data = my_file.read()
courses = json.loads(data)
print(courses)
would print out the following:
[{'name': 'Introduction to Programming', 'abbreviation': 'ItP', 'periods': [1, 3]}, {'name': 'Advanced Course in Programming', 'abbreviation': 'ACiP', 'periods': [2, 4]}, {'name': 'Database Application', 'abbreviation': 'DbApp', 'periods': [1, 2, 3, 4]}]
If we also wanted to print out the name of each course, we could expand our program with a for loop:
for course in courses:
    print(course["name"])

Introduction to Programming
Advanced Course in Programming
Database Application
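JSON values other than lists and dictionaries are converted too. The following sketch uses a made-up JSON string (the fields enabled and teacher are assumptions for illustration, not part of courses.json) to show how true and null appear on the Python side:

```python
import json

# Made-up JSON text; "enabled" and "teacher" are illustrative fields only
text = '[{"name": "ItP", "periods": [1, 3], "enabled": true, "teacher": null}]'

courses = json.loads(text)
first = courses[0]

print(type(courses))     # the JSON array becomes a Python list
print(type(first))       # the JSON object becomes a Python dict
print(first["enabled"])  # JSON true becomes True
print(first["teacher"])  # JSON null becomes None
```

This matters when filtering data: a field holding true in the JSON file can be tested directly in an if statement.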
Let's have a look at a JSON file, which contains some information about students in the following format:
[
{
"name": "Peter Pythons",
"age": 27,
"hobbies": [
"coding",
"knitting"
]
},
{
"name": "Jean Javanese",
"age": 24,
"hobbies": [
"coding",
"rock climbing",
"reading"
]
}
]
Please write a function named print_persons(filename: str), which reads a JSON file in the above format and prints the contents as shown below. The file may contain any number of entries.
Peter Pythons 27 years (coding, knitting)
Jean Javanese 24 years (coding, rock climbing, reading)
The hobbies should be listed in the same order as they appear in the JSON file.
The Python standard library also contains modules for dealing with online content, and one useful function is urllib.request.urlopen. You are encouraged to have a look at the entire module, but the following example should be enough for you to get to grips with the function. It can be used to retrieve content from the internet, so it can be processed in your programs.
The following code would print out the contents of the University of Helsinki front page:
import urllib.request
my_request = urllib.request.urlopen("https://helsinki.fi")
print(my_request.read())
Pages intended for human eyes do not usually look very pretty when their code is printed out. In the following examples, however, we will work with machine-readable data from an online source. Much of the machine-readable data available online is in JSON format.
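The read method returns the content as bytes, not as a string. A minimal sketch of turning such a response into Python data is shown below. The helper names parse_json_bytes and fetch_json are hypothetical, the payload is made up, and the content is assumed to be UTF-8 encoded:

```python
import json
import urllib.request


def parse_json_bytes(raw: bytes):
    """Decode a UTF-8 response body and parse it as JSON."""
    return json.loads(raw.decode("utf-8"))


def fetch_json(url: str):
    """Retrieve a URL and parse the response as JSON (needs a network connection)."""
    with urllib.request.urlopen(url) as response:
        return parse_json_bytes(response.read())


# Offline demonstration with a made-up payload shaped like a response body
example_body = b'[{"name": "docker2019", "year": 2019}]'
parsed = parse_json_bytes(example_body)
print(parsed[0]["name"])  # docker2019
```

Keeping the parsing step separate from the network call makes it possible to try the parsing on its own, without an internet connection.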
At the address https://studies.cs.helsinki.fi/stats-mock/api/courses you will find basic information about some of the courses offered by the University of Helsinki Department of Computer Science, in JSON format.
Please write a function named retrieve_all(), which retrieves the data of all the courses which are currently active (the field enabled has the value true). These should be returned as a list of tuples, in the following format:
[
('Full Stack Open 2020', 'ofs2019', 2020, 201),
('DevOps with Docker 2019', 'docker2019', 2019, 36),
('DevOps with Docker 2020', 'docker2020', 2020, 36),
('Beta DevOps with Kubernetes', 'beta-dwk-20', 2020, 28)
]
Each tuple contains the following fields from the original data:
- the name of the course: fullName
- name
- year
- the sum of the values listed in exercises
NB: It is essential that you retrieve the data with the function urllib.request.urlopen, or the automated tests may not work correctly.
NB2: The tests are designed so that they slightly modify the data retrieved from the internet, to make sure you do not hard-code your return values.
NB3: Some Mac users have come across the following issue:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1124)>
The solution depends on how Python is installed on your machine. In some cases executing the following in a terminal helps:
cd "/Applications/Python 3.8/"
sudo "./Install Certificates.command"
The path used in the cd command above depends on the version of Python you have installed. The path may also be, for example, "/Applications/Python 3.9/".
Various solutions to the problem have been suggested.
One trick some have found useful:
import urllib.request
import json
import ssl # add this library to your import section
def retrieve_all():
    # add the following line to the beginning of all your functions
    # (note: this context skips certificate verification)
    my_context = ssl._create_unverified_context()
    address = "https://studies.cs.helsinki.fi/stats-mock/api/courses"
    # add a second argument to the function call
    request = urllib.request.urlopen(address, context=my_context)
    # the rest of your function
Another potential workaround:
import urllib.request
import certifi # add this library to your import section
import json
def retrieve_all():
    address = "https://studies.cs.helsinki.fi/stats-mock/api/courses"
    # add a second argument to the function call
    # (note: the cafile parameter was removed in Python 3.13)
    request = urllib.request.urlopen(address, cafile=certifi.where())
    # the rest of your function
Each course also has its own URL, where more specific weekly data about the course is available. The URLs follow the format https://studies.cs.helsinki.fi/stats-mock/api/courses/****/stats, where you would replace the stars with the contents of the field name for the course you want to access.
For example, the data for the course docker2019 is at the address https://studies.cs.helsinki.fi/stats-mock/api/courses/docker2019/stats.
Please write a function named retrieve_course(course_name: str), which returns statistics for the specified course, in dictionary format.
For example, the function call retrieve_course("docker2019") would return a dictionary with the following contents:
{
'weeks': 4,
'students': 220,
'hours': 5966,
'hours_average': 27,
'exercises': 4988,
'exercises_average': 22
}
The values in the dictionary are determined as follows:
- weeks: the number of JSON object literals retrieved
- students: the maximum number of students in all the weeks
- hours: the sum of all hour_total values in the different weeks
- hours_average: the hours value divided by the students value (rounded down to the closest integer value)
- exercises: the sum of all exercise_total values in the different weeks
- exercises_average: the exercises value divided by the students value (rounded down to the closest integer value)
NB: See the notices in Part 1 of the exercise, as they apply here, too.
NB2: The Python math module has a useful function for rounding numbers down to an integer.
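As a small illustration of rounding down, the sketch below uses the hours and students values from the example dictionary above. It assumes the function meant is math.floor, and also shows integer division, which gives the same result for positive numbers:

```python
import math

hours = 5966
students = 220

print(hours / students)              # ordinary division gives a float
print(math.floor(hours / students))  # 27
print(hours // students)             # 27, integer division
```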
The file start_times.csv contains individual start times for a programming exam, in the format name;hh:mm. An example:
jarmo;09:00
timo;18:42
kalle;13:23
Additionally, the file submissions.csv contains points and hand-in times for individual exercises. The format here is name;task;points;hh:mm. An example:
jarmo;1;8;16:05
timo;2;10;21:22
jarmo;2;10;19:15
etc...
Your task is to find the students who spent over 3 hours on the exam tasks. That is, any student who handed in any task more than 3 hours after their exam start time is labelled a cheater. There may be more than one submission for the same task for each student. You may assume all times are within the same day.
Please write a function named cheaters(), which returns a list containing the names of the students who cheated.
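One possible way to compare times in the hh:mm format is to convert them into minutes since midnight. The helper to_minutes below is a hypothetical name, and the sketch relies on the assumption stated above that all times fall within the same day:

```python
def to_minutes(clock: str) -> int:
    """Convert a time in the format hh:mm into minutes since midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


start = to_minutes("09:00")
handed_in = to_minutes("16:05")
elapsed = handed_in - start

print(elapsed)           # 425
print(elapsed > 3 * 60)  # True, more than three hours
```

The datetime module in the standard library could be used for the same purpose; plain integers are simply the shortest route here.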
You have the CSV files from the previous exercise at your disposal again. Please write a function named final_points(), which returns the final exam points received by the students, in a dictionary format, following these criteria:
- If there are multiple submissions for the same task, the submission with the highest number of points is taken into account.
- If the submission was made over 3 hours after the start time, the submission is ignored.
The tasks are numbered 1 to 8, and each submission is graded with 0 to 6 points.
In the dictionary returned the key should be the name of the student, and the value the total points received by the student.
Hint: nested dictionaries might be a good approach when processing the tasks and submission times of each student.
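To illustrate the idea of nested dictionaries in general terms, the sketch below keeps the best points per task for each name. The data is made up and the time limit is left out, so this is an outline of the data structure only, not a complete solution:

```python
# Made-up submissions as (name, task, points) tuples
submissions = [
    ("jarmo", 1, 4),
    ("jarmo", 1, 6),
    ("timo", 2, 5),
    ("jarmo", 2, 3),
]

best = {}  # name -> {task -> highest points seen so far}
for name, task, points in submissions:
    tasks = best.setdefault(name, {})
    tasks[task] = max(points, tasks.get(task, 0))

print(best)  # {'jarmo': {1: 6, 2: 3}, 'timo': {2: 5}}

# Summing the inner dictionary values gives one total per name
totals = {name: sum(tasks.values()) for name, tasks in best.items()}
print(totals)  # {'jarmo': 9, 'timo': 5}
```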
The official Python documentation contains information on all modules available in the standard library:
In addition to the standard library, the internet is full of freely available Python modules for different purposes. Some commonly used modules are listed here:
In this exercise you will write an improved version of the Spell checker from the previous part.
Just like in the previous version, the program should ask the user to type in a line of text. Your program should then perform a spell check, and print out feedback to the user, so that all misspelled words have stars around them. Additionally, the program should print out a list of suggestions for the misspelled words.
Please have a look at the following two examples.
write text: We use ptython to make a spell checker
We use *ptython* to make a spell checker
suggestions:
ptython: python, pythons, typhon
write text: this is acually a good and usefull program
this is *acually* a good and *usefull* program
suggestions:
acually: actually, tactually, factually
usefull: usefully, useful, museful
The suggestions should be determined with the function get_close_matches from the Python standard library module difflib.
NB: For the automatic tests to work correctly, please use the function with the "default settings". That is, please pass only two arguments to the function: the misspelled word, and the word list.
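A minimal sketch of calling get_close_matches with two arguments follows. The word list here is tiny and made up; the exercise itself uses a much larger list, so the suggestions for other words will differ:

```python
from difflib import get_close_matches

# A tiny made-up word list for illustration
word_list = ["python", "pythons", "typhon", "program", "checker"]

# Only two arguments: the misspelled word and the word list
suggestions = get_close_matches("ptython", word_list)
print(", ".join(suggestions))  # python, pythons, typhon
```

With the default settings the function returns at most three matches, ordered from the closest match to the least close.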