The goals of this week's lab:
This week we will be analyzing several datasets using entropy. Part 1 walks you through implementing Shannon encoding, which we will use to encode and decode natural-language datasets. Part 2 involves implementing conditional entropy to select informative features (a similar goal to the ROC curve assignment, but now using entropy).
The heart disease and diabetes data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set comes from a Cleveland Clinic study of coronary heart disease: patients were identified as having or not having heart disease, and the resulting feature set is a subset of the original 76 factors studied. The diabetes data set comes from a study of females at least 21 years old of Pima Indian heritage, aimed at understanding risk factors for diabetes. Both data sets were converted to ARFF format by Mark Craven at the University of Wisconsin.
To begin, accept your Lab 6 repository on GitHub Classroom. You should have the following files:

- `encode_decode.py` - your main program executable for Part 1.
- `Shannon.py` - this is where you will implement Shannon encoding for Part 1.
- `best_feature.py` - your main program executable for Part 2.
- `Partition.py` - file for the `Example` and `Partition` classes (similar to previous labs). This is where you will implement conditional entropy and information gain for Part 2.
- `README.md` - for analysis questions and lab feedback.

In this part of the lab you will be working with the files `encode_decode.py` and `Shannon.py`. The high-level goal is to implement the Shannon encoding algorithm discussed in class, then use it on a real-world Twitter dataset.
For this part, work inside the `Shannon.py` file to create a `Shannon` class. The main methods of this class will be `encode` and `decode`; each takes in a string and returns a string. `encode` should take in a string of characters and return a string of 0's and 1's, while `decode` should take in a string of 0's and 1's and return a string of characters.
It is up to you how you design your class, but I would recommend that the constructor take a single argument: a dictionary with the probabilities for each character. Below is an example similar to what we did in class, except "senior", "junior", "sophomore", and "first-year" have been replaced with the single characters "A", "B", "C", "D" for simplicity (although not in the same order). From these probabilities we wish to create an encoding.
```python
char_probs = {"A": 0.25, "B": 0.125, "C": 0.5, "D": 0.125}
shannon = Shannon(char_probs)
```
Inside the constructor you may wish to call helper methods that build another dictionary, this time mapping each character to its encoding. We know that for this example the map should be:

```python
{"A": "10", "B": "110", "C": "0", "D": "111"}
```
You may also want to create a dictionary for the reverse mapping (for decoding). When you have implemented both `encode` and `decode`, you can check your work by encoding a string, then decoding it and making sure you get back the original. For example, you should be able to run something like the following code (put this in your `main` inside `Shannon.py`, which is just for testing).
orig = "ABAACDDBACCDAAABDBBABCDDCBA"
encoded = shannon.encode(orig)
decoded = shannon.decode(encoded)
assert decoded == orig
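Once the constructor has built a codeword map and its reverse, `encode` is a character-by-character lookup, and because Shannon codes are prefix-free, `decode` can scan the bit string left to right. Here is one possible sketch of the two methods inside your `Shannon` class (the attribute names `self.code_map` and `self.reverse_map` are assumptions about your design):

```python
def encode(self, text):
    # Concatenate the codeword for each character.
    return "".join(self.code_map[ch] for ch in text)

def decode(self, bits):
    # Prefix-free property: emit a character as soon as the
    # accumulated bits exactly match a codeword, then start over.
    out, current = [], ""
    for b in bits:
        current += b
        if current in self.reverse_map:
            out.append(self.reverse_map[current])
            current = ""
    return "".join(out)
```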
In this next part you'll use Shannon encoding to encode and decode a variety of files. For this part it's okay to hard-code the filenames in `encode_decode.py` (this program doesn't need to take any command line arguments). The workflow is below:
1. Read in `data/vaccine_tweets_codes_source.csv`, which is a large file containing tweets about the COVID vaccine. Use the provided `read_file` function with encoding `ISO-8859-1` to read this file in as one long string of characters. This is what we will use to develop the character probability table (i.e., what was given in the small example above). Create a function that builds a dictionary mapping each character in this file to a probability (see the sketch after this list). The most frequent character should end up being a "space", followed by "e", etc. Use this probability dictionary to create an instance of the `Shannon` class.

2. Use your `Shannon` class to encode the file `data/vaccine_tweets_to_encode.csv` (read it in first, in a similar manner to the previous file). This file is a smaller subset of tweets. Write the output (a long string of 0's and 1's) to a file called `output/vaccine_tweets_encoded.txt` using the `write_file` function with the default encoding (you'll need to create an `output` directory first).

3. Use your `Shannon` class to decode the secret message in the file `data/secrete_message_to_decode.txt`. Read this file in using the default encoding. Then decode it and write the output to a file called `output/secret_message_decoded.txt`.
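Here is a sketch of one way to build the probability dictionary from step 1 (the function name is hypothetical; it assumes you already have the file contents as one long string):

```python
from collections import Counter

def char_probabilities(text):
    """Map each character in `text` to its relative frequency."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}
```

Printing the few highest-probability entries is a quick sanity check that "space" and "e" come out on top.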
So at the end of this section you should have two files in your `output` folder. Make sure to add, commit, and push these files. The last one should make sense (i.e., not look like random characters)!
In this part of the lab you will be working with the files `best_feature.py` and `Partition.py`. Analyze the code in `best_feature.py` - the reading of the ARFF file (including conversion of continuous features to discrete) has already been done. The only step remaining is to fill in the `main` function as you implement information gain. At the end you should only need this code in `main`:
```python
args = parse_args()
train_partition = read_arff(args.train_filename)
best_f = train_partition.best_feature()
print("best feature:", best_f)
```
Here we will only be using the training data to obtain the best feature. Looking at the code above, the suggestion is to implement the `best_feature` method inside the `Partition` class. That is because the best feature is really a property of the data (we're not creating a model here).
To implement information gain, begin slowly and use several helper methods. Top-down design followed by bottom-up implementation works well here. For example, to compute information gain you will need the entropy of y, which might in turn require a helper method that computes the probability that y takes a specific label (i.e., +1 or -1). Refer to Handout 13 to check your intermediate progress.
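As a reminder of the quantities involved: for a feature X, the information gain is Gain(X) = H(y) - sum over values v of P(X=v) * H(y | X=v). A minimal entropy helper might look like the sketch below (how you gather the probabilities from your `Partition` is part of your design):

```python
import math

def entropy(probs):
    """Entropy in bits: -sum(p * log2 p), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

If the tennis data follows the classic 9-positive/5-negative split (the gains below suggest it does), then H(y) = entropy([9/14, 5/14]) is about 0.940 bits; subtracting each feature's conditional entropy from this should reproduce the gains shown below.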
Here are a few examples so you can check your work. This is what should print when we run your programs:

TENNIS

```
$ python3 best_feature.py data/tennis_train.arff
Info Gain:
Outlook, 0.246750
Temperature, 0.029223
Humidity, 0.151836
Wind, 0.048127
best feature: Outlook
```

MOVIES

```
$ python3 best_feature.py data/movies_train.arff
Info Gain:
Type, 0.306099
Length, 0.306099
Director, 0.557728
Famous_actors, 0.072780
best feature: Director
```
The last step for this part is to create a method inside `Partition` that determines the best feature based on maximum classification accuracy (instead of entropy). In other words, if you use a single feature and, for each feature value, predict the majority label, what is the classification accuracy on the training data? Compute this for all features, then select the one that maximizes this metric.
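Below is a sketch of this accuracy computation (the function name and the `ex.features` / `ex.label` attributes are assumptions; adapt them to however your `Partition` and `Example` classes store the data):

```python
from collections import Counter, defaultdict

def feature_accuracy(examples, feature):
    """Training accuracy when predicting the majority label
    within each value of `feature`."""
    labels_by_value = defaultdict(list)
    for ex in examples:
        labels_by_value[ex.features[feature]].append(ex.label)
    # Within each feature value, majority-vote predictions are correct
    # exactly as many times as the majority label appears.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in labels_by_value.values())
    return correct / len(examples)
```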
Use this method to identify the best feature for each dataset (`tennis`, `movies`, `heart`, and `diabetes`). Then answer the analysis questions below.
Answer the following questions in your `README.md`.
Credit for this lab: Materials by Sara Mathieson, modified from materials created by Ameet Soni and Allison Gong.