The goals of this week's lab:
This week we will be analyzing several datasets using entropy. Part 1 walks you through implementing Shannon encoding, which we will use to encode and decode natural-language datasets. Part 2 involves implementing conditional entropy to select informative features (a similar goal to the ROC curve assignment, but now using entropy).
The heart disease and diabetes data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set comes from a Cleveland Clinic study of coronary heart disease: patients were identified as having or not having heart disease, and the resulting feature set is a subset of the original 76 factors studied. The diabetes data set comes from a study of females at least 21 years old of Pima Indian heritage, aimed at understanding risk factors for diabetes. Both data sets were converted to ARFF format by Mark Craven at the University of Wisconsin.
To begin, accept your Lab 6 repository on GitHub Classroom. You should have the following files:

- `encode_decode.py` - your main program executable for Part 1.
- `Shannon.py` - this is where you will implement Shannon encoding for Part 1.
- `best_feature.py` - your main program executable for Part 2.
- `Partition.py` - file for the `Example` and `Partition` classes (similar to previous labs). This is where you will implement conditional entropy and information gain for Part 2.
- `README.md` - for analysis questions and lab feedback.

In this part of the lab you will be working with the files `encode_decode.py` and `Shannon.py`. The high-level goal is to implement the Shannon encoding algorithm discussed in class, then use it on a real-world Twitter dataset.
For this part, work inside the `Shannon.py` file to create a `Shannon` class. The main methods of this class will be `encode` and `decode`; each takes in a string and returns a string. `encode` should take in a string of characters and return a string of 0's and 1's, while `decode` should take in a string of 0's and 1's and return a string of characters.
It is up to you how you design your class, but I would recommend that the constructor take a single argument: a dictionary with the probabilities for each character. Below is an example similar to what we did in class, except "senior", "junior", "sophomore", and "first-year" have been replaced with the single characters "A", "B", "C", "D" for simplicity (although not in the same order). From these probabilities we wish to create an encoding.
```python
char_probs = {"A": 0.25, "B": 0.125, "C": 0.5, "D": 0.125}
shannon = Shannon(char_probs)
```
Inside the constructor you may wish to call helper methods that build another dictionary, this time mapping each character to its encoding. We know that for this example the map should be:

```python
{"A": "10", "B": "110", "C": "0", "D": "111"}
```
You may also want to create a dictionary for the reverse mapping (for decoding). When you have implemented both `encode` and `decode`, you can check your work by encoding a string, then decoding it and making sure you get back the original. For example, you should be able to run something like the following code (put this in your `main` inside `Shannon.py`, which is just for testing).
orig = "ABAACDDBACCDAAABDBBABCDDCBA"
encoded = shannon.encode(orig)
decoded = shannon.decode(encoded)
assert decoded == orig
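Once the constructor has built a codeword map and its reverse, `encode` is a character-by-character lookup, and because Shannon codes are prefix-free, `decode` can scan the bit string left to right. Here is one possible sketch of the two methods inside your `Shannon` class (the attribute names `self.code_map` and `self.reverse_map` are assumptions about your design):

```python
def encode(self, text):
    # Concatenate the codeword for each character.
    return "".join(self.code_map[ch] for ch in text)

def decode(self, bits):
    # Prefix-free property: emit a character as soon as the
    # accumulated bits exactly match a codeword, then start over.
    out, current = [], ""
    for b in bits:
        current += b
        if current in self.reverse_map:
            out.append(self.reverse_map[current])
            current = ""
    return "".join(out)
```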
In this next part you'll use Shannon encoding to encode and decode a variety of files. For this part it's okay to hard-code the filenames in `encode_decode.py` (this program doesn't need to take any command line arguments). The workflow is below:
1. Read in `data/vaccine_tweets_codes_source.csv`, which is a large file containing tweets about the COVID vaccine. Use the provided `read_file` function with encoding `ISO-8859-1` to read this file in as one long string of characters. This is what we will use to develop the character probability table (i.e., what was given in the small example above). Create a function that builds a dictionary mapping each character in this file to a probability (see the sketch after this list). The most frequent character should end up being a "space", followed by "e", etc. Use this probability dictionary to create an instance of the `Shannon` class.

2. Use your `Shannon` class to encode the file `data/vaccine_tweets_to_encode.csv` (read it in first, in a similar manner to the previous file). This file is a smaller subset of tweets. Write the output (a long string of 0's and 1's) to a file called `output/vaccine_tweets_encoded.txt` using the `write_file` function with the default encoding (you'll need to create an `output` directory first).

3. Use your `Shannon` class to decode the secret message in the file `data/secrete_message_to_decode.txt`. Read this file in using the default encoding. Then decode it and write the output to a file called `output/secret_message_decoded.txt`.
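Here is a sketch of one way to build the probability dictionary from step 1 (the function name is hypothetical; it assumes you already have the file contents as one long string):

```python
from collections import Counter

def char_probabilities(text):
    """Map each character in `text` to its relative frequency."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}
```

Printing the few highest-probability entries is a quick sanity check that "space" and "e" come out on top.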
So at the end of this section you should have two files in your `output` folder. Make sure to add, commit, and push these files. The last one should make sense (i.e., not look like random characters)!
In this part of the lab you will be working with the files `best_feature.py` and `Partition.py`. Analyze the code in `best_feature.py` - the reading of the ARFF file (including conversion of continuous features to discrete) has already been done. The only step remaining is to fill in the `main` function as you implement information gain. At the end you should only need this code in `main`:
```python
args = parse_args()
train_partition = read_arff(args.train_filename)
best_f = train_partition.best_feature()
print("best feature:", best_f)
```
Here we will only be using the training data to obtain the best feature. Looking at the code above, the suggestion is to implement the `best_feature` method inside the `Partition` class. That is because the best feature is really a property of the data (we're not creating a model here).
To implement information gain, begin slowly and use several helper methods. Top-down design followed by bottom-up implementation works well here. For example, to compute information gain you will need the entropy of y, which might in turn require a helper method that computes the probability that y takes a specific label (i.e., +1 or -1). Refer to Handout 13 to check your intermediate progress.
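As a reminder of the quantities involved: for a feature X, the information gain is Gain(X) = H(y) - sum over values v of P(X=v) * H(y | X=v). A minimal entropy helper might look like the sketch below (how you gather the probabilities from your `Partition` is part of your design):

```python
import math

def entropy(probs):
    """Entropy in bits: -sum(p * log2 p), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

If the tennis data follows the classic 9-positive/5-negative split (the gains below suggest it does), then H(y) = entropy([9/14, 5/14]) is about 0.940 bits; subtracting each feature's conditional entropy from this should reproduce the gains shown below.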
Here are a few examples so you can check your work. This is what should print when we run your programs:

TENNIS

```
$ python3 best_feature.py data/tennis_train.arff
Info Gain:
Outlook, 0.246750
Temperature, 0.029223
Humidity, 0.151836
Wind, 0.048127
best feature: Outlook
```

MOVIES

```
$ python3 best_feature.py data/movies_train.arff
Info Gain:
Type, 0.306099
Length, 0.306099
Director, 0.557728
Famous_actors, 0.072780
best feature: Director
```
The last step for this part is to create a method inside `Partition` that determines the best feature based on maximum classification accuracy (instead of entropy). In other words, if you use a single feature and, for each feature value, predict the majority label, what is the classification accuracy on the training data? Compute this for all features, then select the one that maximizes this metric.
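Below is a sketch of this accuracy computation (the function name and the `ex.features` / `ex.label` attributes are assumptions; adapt them to however your `Partition` and `Example` classes store the data):

```python
from collections import Counter, defaultdict

def feature_accuracy(examples, feature):
    """Training accuracy when predicting the majority label
    within each value of `feature`."""
    labels_by_value = defaultdict(list)
    for ex in examples:
        labels_by_value[ex.features[feature]].append(ex.label)
    # Within each feature value, majority-vote predictions are correct
    # exactly as many times as the majority label appears.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in labels_by_value.values())
    return correct / len(examples)
```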
Use this method to identify the best feature for each dataset (`tennis`, `movies`, `heart`, and `diabetes`). Then answer the analysis questions below.
Answer the following questions in your `README.md`.
Credit for this lab: Materials by Sara Mathieson, modified from materials created by Ameet Soni and Allison Gong.