CS260 Lab 7: Logistic Regression and Visualization

GitHub Classroom Assignment Link


Overview

Goals:

Credit: Materials by Sara Mathieson, modified from material created by Ameet Soni. Phoneme dataset and information created by Jessica Wu. Challenger dataset and visualization ideas from Alvin Grissom.


Introduction

In this lab we will be analyzing two datasets:

Accept your Lab 7 repository on GitHub Classroom. You should have the following files:


Usage and I/O

Usage

Your programs should take the same command-line arguments as in Lab 4 (feel free to reuse the argument parsing code), plus a parameter for the learning rate alpha. For example, to run logistic regression on the phoneme dataset:

python3 run_LR.py -r data/phoneme_train.csv -e data/phoneme_test.csv -a 0.02
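If you are not reusing your Lab 4 parser, a minimal sketch using Python's argparse module might look like the following. The flag names (-r, -e, -a) come from the example above; the long option names and help strings are our assumptions, so adapt them to your own conventions.

import argparse

def parse_args():
    # Parse the command-line arguments shown in the usage example above.
    parser = argparse.ArgumentParser(description="Logistic regression with SGD")
    parser.add_argument("-r", "--train", required=True,
                        help="path to the training CSV file")
    parser.add_argument("-e", "--test", required=True,
                        help="path to the testing CSV file")
    parser.add_argument("-a", "--alpha", type=float, default=0.02,
                        help="learning rate for SGD")
    return parser.parse_args()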

Program Inputs

To simplify preprocessing, you may assume the following:

Program Outputs


Logistic Regression

You will implement logistic regression for binary classification, as discussed in class.

Model

For now, you should assume that the features are continuous and the response is a discrete, binary output. In the case of binary labels, the probability of a positive prediction is:

p(y = 1 | x, w) = 1 / (1 + e^(-w · x))

To model the intercept term, we will include a bias term. In practice, this is equivalent to adding a feature that is always “on”: for each instance in training and testing, append a 1 to the feature vector. Your logistic regression model then has a weight for this bias feature, which it learns in the same way it learns all other weights.
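As a concrete starting point, here is a minimal sketch of the model and the bias trick, assuming numpy arrays for the weights and features (the function names are ours, not required):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(weights, x):
    # P(y = 1 | x) for one example x that already includes the bias feature.
    return sigmoid(np.dot(weights, x))

def add_bias(X):
    # Append a constant 1 to every row so the intercept is learned as an
    # ordinary weight.
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])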

Training

To learn the weights, you will apply stochastic gradient descent (SGD) as discussed in class until the cost function (nearly) stops changing from one iteration to the next. As a reminder, our cost function is the negative log of the likelihood function:

J(w) = - sum_{i=1}^{n} [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]

where h(x_i) = p(y_i = 1 | x_i, w) is the model's predicted probability for example i.

Our goal is to minimize the cost using SGD. We will use the same idea as in Lab 3 for linear regression. Pseudocode:

initialize weights to 0's
while not converged:
    shuffle the training examples
    for each training example xi:
        calculate the derivative of the cost with respect to the weights, using example xi
        weights = weights - alpha*derivative
    compute and store current cost

The SGD update for each weight w_j, using a single training example x_i, is:

w_j <- w_j - alpha * (h(x_i) - y_i) * x_ij

or, in vector form, w <- w - alpha * (h(x_i) - y_i) * x_i.

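Below is a minimal, hedged sketch of this training loop, assuming numpy arrays X (with the bias feature already appended) and binary labels y, and reusing the sigmoid idea from the Model section. The convergence tolerance and epoch cap are arbitrary choices, not part of the assignment specification.

import numpy as np

def train_sgd(X, y, alpha, tol=1e-4, max_epochs=1000):
    # Learn logistic regression weights with SGD until the cost stabilizes.
    n, p = X.shape
    weights = np.zeros(p)            # initialize weights to 0's
    prev_cost = float("inf")

    for epoch in range(max_epochs):
        # Shuffle the training examples each epoch.
        for i in np.random.permutation(n):
            h = 1.0 / (1.0 + np.exp(-np.dot(weights, X[i])))
            gradient = (h - y[i]) * X[i]          # derivative of the cost at example i
            weights = weights - alpha * gradient  # SGD update

        # Compute and store the current cost (negative log likelihood).
        # (You may want to guard against log(0) here.)
        probs = 1.0 / (1.0 + np.exp(-X.dot(weights)))
        cost = -np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost

    return weights
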
The hyper-parameter alpha (the learning rate) should be passed as a parameter to your SGD routine and used in training. A few notes for the above:

Prediction

For binary prediction, apply the equation:

p(y = 1 | x, w) = 1 / (1 + e^(-w · x))

and output the more likely binary outcome. You may use Python's math.exp() function to compute the exponential in the denominator. Alternatively, you may use the weights directly as a decision boundary, as discussed in class.

The choice of hyper-parameters will affect the results, but you should be able to obtain at least 90% accuracy on the testing data. Make sure to print a confusion matrix.
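Here is a hedged sketch of the prediction and evaluation step, assuming numpy arrays and the same bias-augmented features used in training; the helper names are ours, and the exact output formatting is up to you as long as it reports accuracy and a confusion matrix.

import numpy as np

def predict(weights, X):
    # Predict 1 exactly when P(y = 1 | x) >= 0.5, i.e. when w · x >= 0.
    return (X.dot(weights) >= 0).astype(int)

def evaluate(y_true, y_pred):
    # Report accuracy and a 2x2 confusion matrix (rows = true, cols = predicted).
    correct = int(np.sum(y_true == y_pred))
    print("Accuracy: %f (%d out of %d correct)" % (correct / len(y_true), correct, len(y_true)))
    confusion = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    print(confusion)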

Example Output

Here is an example run for testing purposes. You should be able to get a better result than this, but it will let you check whether your algorithm is working. The details are:

$ python3 run_LR.py -r data/phoneme_train.csv -e data/phoneme_test.csv -a 0.02
Accuracy: 0.875000 (70 out of 80 correct)

   prediction
      0  1
    ------
  0| 36  4
  1|  6 34

Labels and Debugging Output

Comparison with sklearn

After implementing logistic regression by yourself, try using the LogisticRegression model implemented in the sklearn library. If you do not already have this library, install it using pip3:

pip3 install scikit-learn

If this doesn’t work, it may be that you have multiple versions of Python 3. In that case run:

python3 -m pip install scikit-learn

Train and evaluate the LogisticRegression model on both datasets and compare the results with your own implementation.
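A minimal sketch of this comparison is below; it assumes X_train, y_train, X_test, and y_test are numpy arrays loaded the same way as for your own implementation (sklearn fits its own intercept by default, so the appended bias feature is not needed here).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def compare_sklearn(X_train, y_train, X_test, y_test):
    # Fit sklearn's logistic regression on the same training data.
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Evaluate on the test data and print a confusion matrix for comparison.
    y_pred = clf.predict(X_test)
    print("sklearn accuracy:", clf.score(X_test, y_test))
    print(confusion_matrix(y_test, y_pred))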


Visualization

For just the Challenger dataset, create a visualization of both:

Make sure to think about visualization practices discussed in class including:

Submit your figure as part of your GitHub repo using the filename challenger.pdf.
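As a starting point only, here is a hedged matplotlib sketch for producing and saving the figure. The variable names (temperatures, binary labels, fitted weights) are assumptions about how you load and model the Challenger data; adapt the plot to whatever you decide to visualize.

import numpy as np
import matplotlib.pyplot as plt

def plot_challenger(temps, labels, weights, filename="challenger.pdf"):
    # Scatter the raw data and overlay the fitted probability curve.
    plt.figure()
    plt.scatter(temps, labels, label="observed data")

    # Fitted probabilities over a smooth grid of temperatures, assuming a
    # single temperature feature plus an appended bias term.
    grid = np.linspace(temps.min() - 5, temps.max() + 5, 200)
    X_grid = np.column_stack([grid, np.ones_like(grid)])
    probs = 1.0 / (1.0 + np.exp(-X_grid.dot(weights)))
    plt.plot(grid, probs, label="fitted probability")

    plt.xlabel("Temperature (F)")
    plt.ylabel("Probability of accident")
    plt.legend()
    plt.savefig(filename)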


Analysis

(See README.md to answer these questions)

  1. Are there any differences between your implementation of logistic regression and the sklearn model? How did the sklearn logistic regression model perform on the datasets?

  2. So far, we have run logistic regression and Naive Bayes on different types of datasets. Why is that? Discuss the types of features and labels that we worked with in each case.

  3. Regarding the Challenger data specifically, what was your final probability of an accident after fitting the model? Is this what you expected? Based on this value, would you have recommended the Challenger be launched on 1/28/86? What data and/or modeling might have helped increase the confidence of our prediction?