CMSC H260: Foundations of Data Science

(Fall 2024)

CMSC H260 Data Science - Materials adapted from Sara Mathieson

Course Information

Lecture: MW 1-2:30pm in Stokes 10
Professor: Thao Nguyen (she/her)
Office: KINSC L303
Office hours: Mondays 10:30am-12pm in L303 (or schedule an appointment with me)
TAs: Edgar Leon, Darshan Mehta
TA hours: Mondays 7-9pm (Zubrow Commons), Wednesdays 11am-1pm (Zubrow), Thursdays 9-11pm (H110)
Piazza: CS260

The prerequisites for this course are Calculus I, Data Structures, and Discrete Mathematics (Discrete Mathematics may be taken as a co-requisite).

This course will introduce core principles of learning from data. More and more decisions are being made by algorithms that operate on large datasets, and this course will give students the tools to understand and contribute to this process. Throughout, we will emphasize the ethical use of data and analyze case studies of how data science has intersected with society. This course will have a significant theory component, covering introductory linear algebra, probability, statistics, modeling, information theory, and optimization. However, we will also implement these ideas (in Python) and apply them to concrete datasets from a variety of fields (including images, video, text, DNA, music, art, etc.).

The language for this course is Python 3.

Textbook:

You do not need to purchase a textbook for this course. We will draw from several online textbooks, as well as supplemental online readings and research papers.


See the Schedule for each week's reading assignment. The schedule is tentative and subject to change throughout the semester.

Schedule (Tentative)

WEEK | DAY | TOPIC & READING | LECTURES | LAB & MEMOS
1 Sep 03

Introduction to Data Science and Python

  • What can we learn from data?
  • Representing data
  • Crash course on Python
  • Numpy
  • Matplotlib (plotting in Python)
  • Classes and objects in Python
  • Dictionaries

Reading:

  • MML Chap 1

Tue:

Lab 1: Computing and plotting in Python (due Sep 10)

Sep 04

Wed:

2

Sep 09

Introduction to Modeling

  • What is a model?
  • Linear models
  • Polynomial models
  • Assessing model fit and complexity
  • Using models for prediction

Reading:

  • MML Chap 8.1, 8.2.1-8.2.3

Mon:

Lab 2: Modeling climate change (due Sep 16)

Sep 11

Wed:

3

Sep 16

Applied Linear Algebra and Optimization

  • Matrices and vectors
  • Representing data as a matrix
  • Matrix operations including dot products
  • Analytic solution for linear regression
  • Model fitting as a numerical optimization problem
  • Introduction to gradients
  • Gradient descent
  • Application to linear regression
  • Discussion of optimization in other contexts

Reading:

  • Daumé Chap 7.1-7.6
  • (optional) MML Chap 2.1-2.2, 2.5, 2.7.1
  • (optional) MML Chap 9-9.2
  • (optional) MML Chap 7-7.1

Mon:

Lab 3: Gradient Descent (due Sep 23)

Last day to drop (Sep 20)

Sep 18

Wed:

4

Sep 23

Evaluation Metrics

  • Precision and recall
  • Specificity and sensitivity
  • Confusion matrices
  • ROC curves
  • Introduction to probability

Reading:

Mon:

Lab 4: Evaluation Metrics (due Sep 30)

Sep 25

Wed:

5

Sep 30

Probabilistic modeling I (+ review)

  • Introduction to probability
  • Bayes rule
  • Midterm I review

Reading:

  • MML Chap 6.1-6.3
  • Daumé Chap 9.1-9.4

Mon:

Study Guide 1

Midterm 1

Oct 02

Tue+Wed:

6

Oct 07

Ethics: Disparate Impact

  • Naive Bayes algorithm
  • Probability in clinical trials
  • Introduction to algorithmic bias
  • Redundant encoding of protected features

Reading:

Mon:

Lab 5: Naive Bayes (due Oct 22)

Oct 09

Wed:

 

Oct 14

Fall Break

Oct 16

7

Oct 21

Information theory

  • Introduction to information theory
  • Entropy
  • Coding theory
  • Discuss applications of entropy in machine learning
  • Continuous features

Reading:

Mon:

Lab 6: Information Theory (due Oct 28)

Oct 23

Wed:

8

Oct 28

Probabilistic modeling II + Visualization

  • Logistic regression
  • Principles of data visualization
  • Discrete vs. continuous data
  • Types of graphs (bar chart, scatter plot, heatmap, etc.)
  • Visualizing graphs
  • Principal components analysis (PCA)
  • Interactive visualization basics

Reading:

Mon:

Lab 7: Logistic Regression and Visualization (due Nov 4)

Final Project proposal (due Nov 8)

Oct 30

Wed:

9

Nov 04

Introduction to statistics

  • Introduction to statistics
  • Normal distributions
  • Hypothesis testing
  • p-values
  • Permutation testing

Reading:

Mon:

Nov 06

Wed:

10

Nov 11

Introduction to statistics II (+ review)

  • Bootstrap, bagging
  • Random forests
  • Midterm II review

Reading:

Tue:

Lab 8: Statistics and Visualization (due Nov 18)

Study Guide 2

Nov 13

Wed:

11

Nov 18

Midterm II review

  • Review
  • Begin unsupervised learning

Reading:

Mon:

Midterm 2

Nov 20

Wed:

12

Nov 25

Unsupervised learning

  • Clustering (K-means and Gaussian Mixture Models)
  • Dimensionality Reduction (PCA and t-SNE)
  • Kernel Density Estimation (KDE)

Reading:

  • Daumé Chap 15 (K-means)
  • MML Chap 11 (GMM)
  • (optional) GMM tutorial

Mon:

Nov 27

13

Dec 02

Intro to neural networks

  • Missing data
  • Neural networks
  • Deep learning
  • Applications

Reading:

Mon:

Final Project Presentation and Deliverables (GitHub repos due 12pm Dec 20)

Dec 04

Wed:

14

Dec 09

Project Presentations

  • Final project presentations

Last day to pass/fail (Dec 13)

Dec 11


Grading Policies

Grades will be weighted as follows:
35% Lab assignments
20% Midterm I
20% Midterm II
15% Final Project (including presentation)
10% Participation (including attendance and note-taking)
1% Extra Credit (sharing at least 3 alternative resources)
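
As a rough illustration of how these weights combine, here is a minimal Python sketch. The category names, the `weighted_grade` helper, and the example scores are hypothetical, and treating extra credit as one percentage point added on top of the total is an assumption. The sketch does not model the separate rule that you must pass at least one exam.

```python
# Weights from the grading policy above, as fractions of the final grade.
WEIGHTS = {
    "labs": 0.35,
    "midterm_1": 0.20,
    "midterm_2": 0.20,
    "final_project": 0.15,
    "participation": 0.10,
}
# Assumption: extra credit adds up to 1 percentage point on top of the total.
EXTRA_CREDIT = 0.01


def weighted_grade(scores, earned_extra_credit=False):
    """Return the course percentage given per-category scores in [0, 100]."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if earned_extra_credit:
        total += 100 * EXTRA_CREDIT
    return total


# Hypothetical scores, for illustration only.
example = {
    "labs": 90,
    "midterm_1": 80,
    "midterm_2": 85,
    "final_project": 95,
    "participation": 100,
}
print(round(weighted_grade(example), 2))
```

With these example scores the weighted total works out to 88.75 before extra credit.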

Quizzes and Exams

In lieu of reading quizzes this semester, we will have short exercises during class (to work on and discuss, not turn in). Be ready to work on these exercises by completing the weekly reading before class on Wednesdays.

There will be two midterms (with limited time, but you will have several days to choose a window). In lieu of a final exam, there will be a final project and associated presentation. You must pass at least one exam to pass the course overall.

Labs

Our labs are on Tuesdays. Lab assignments will generally be released Monday night and due the following Monday at midnight. There will be an introduction to the assignment on Tuesday during lab. Lab attendance is required, and missing labs will quickly affect your participation grade. There will sometimes be pair-programming warm-up exercises as part of the lab, and lab in general is a time to build community around the course and the material. Please note that I will often be off campus on Thursdays and Fridays, so make use of office hours (both mine and the TAs') and Piazza for questions.

Weekly Lab Sessions
Lab A 1:30-2:30pm Tuesdays Nguyen H110
Lab B 2:30-3:30pm Tuesdays Nguyen H110

Handing in labs: Lab assignments are submitted electronically and managed using GitHub Classroom. You may submit your assignment multiple times, but each submission overwrites the previous one and only the final submission will be graded. Some of the programming/lab assignments may be in pairs. There may also be some written assignments that will have specific instructions for handing in.


Late Policy: Each individual will be given 4 late days for the semester. A late day is a 24-hour extension from the original deadline. You can use at most one late day on any one assignment. Late days cover any reason: illness, interviews, many midterms in the same week, etc. Beyond these days, late assignments will not be accepted. You should budget your days to account for future illnesses or assignment deadlines for other courses. Even if you do not fully complete a lab assignment, you should submit what you have done to receive partial credit. Late days count against both partners in a group lab.
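
To make the late-day arithmetic concrete, here is a minimal Python sketch of the policy above. The function names and the example deadline are hypothetical, and counting any partial day as a full late day is an assumption rather than a stated rule.

```python
from datetime import datetime, timedelta

LATE_DAYS_PER_SEMESTER = 4  # from the policy above
MAX_LATE_DAYS_PER_LAB = 1   # from the policy above


def late_days_needed(deadline, submitted):
    """Number of 24-hour late days a submission would consume (0 if on time)."""
    if submitted <= deadline:
        return 0
    overdue = submitted - deadline
    # Assumption: round up, so any part of a 24-hour block costs a full late day.
    return -(-overdue // timedelta(days=1))


def is_accepted(deadline, submitted, late_days_used):
    """True if the submission fits within the per-lab and per-semester limits."""
    needed = late_days_needed(deadline, submitted)
    remaining = LATE_DAYS_PER_SEMESTER - late_days_used
    return needed <= MAX_LATE_DAYS_PER_LAB and needed <= remaining


# Hypothetical deadline: a Monday at 11:59pm.
deadline = datetime(2024, 9, 16, 23, 59)
print(is_accepted(deadline, deadline + timedelta(hours=3), late_days_used=0))
print(is_accepted(deadline, deadline + timedelta(hours=30), late_days_used=0))
print(is_accepted(deadline, deadline + timedelta(hours=3), late_days_used=4))
```

The first call is accepted (one late day, with days remaining). The second is rejected because it would need two late days on one lab, and the third because no late days remain.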

For extensions beyond these 4 late days (in the case of an emergency or ongoing personal issue), please contact your Class Dean. If your Class Dean notifies me of the issues, then we can arrange an accommodation.


Academic Integrity

From the faculty:

In a community that thrives on relationships between students and faculty that are based on trust and respect, it is crucial that students understand a professor's expectations and what it means to do academic work with integrity. Plagiarism and cheating, even if unintentional, undermine the values of the Honor Code and the ability of all students to benefit from the academic freedom and relationships of trust the Code facilitates. Plagiarism is using someone else's work or ideas and presenting them as your own without attribution. Plagiarism can also occur in more subtle forms, such as inadequate paraphrasing, failure to cite another person's idea even if not directly quoted, failure to attribute the synthesis of various sources in a review article to that author, or accidental incorporation of another's words into your own paper as a result of careless note-taking. Cheating is another form of academic dishonesty, and it includes not only copying, but also inappropriate collaboration, exceeding the time allowed, and discussion of the form, content, or degree of difficulty of an exam. Please be conscientious about your work, and check with me if anything is unclear.

Please also note the CS Department Collaboration Policy.

More details for this course:

Under no circumstances may you hand in work done with (or by) someone else under your own name. Your code should never be shared with anyone; you may not examine or use code belonging to someone else, nor may you let anyone else look at or make a copy of your code. This includes, but is not limited to, obtaining solutions from students who previously took the course or code that can be found online. You may not share solutions after the due date of the assignment.

Discussing ideas and approaches to problems with others on a general level is fine (in fact, we encourage you to discuss general strategies with each other), but you should never read anyone else's code or let anyone else read your code. All code you submit must be your own with the following permissible exceptions: code distributed in class, code found in the course textbook, and code worked on with an assigned partner. In these cases, you should always include detailed comments that indicate on which parts of the assignment you received help, and what your sources were.

GitHub Copilot (or any other software for automatically generating code) is not allowed for this course until the final project. The reasoning behind this decision is that code generation tools often produce code that the user does not fully understand, and that code often turns out to be incorrect in the larger context of the program. However, for the final project you are welcome to use GitHub Copilot, and you'll be asked to reflect on your experience.


Piazza

This semester we'll be using Piazza, an online Q&A forum for class discussion, to help with labs, clarifications, and announcements. You will receive an email invitation to join CMSC H260 on Piazza. If you don't, please let me know.

Piazza is meant for questions outside of regular meeting times such as office hours, class, and lab. Please do not hesitate to ask and answer questions on Piazza, but keep in mind the following guidelines:

  1. Piazza should be used for ALL content and logistics questions outside of class, lab, and office hours. Please do not email me your code or extended questions about the assignments.
  2. If there is a personal issue that relates only to you, please email me.
  3. We encourage non-anonymous posts, but you may post anonymously (to your classmates, not the instructors).
  4. Do not post long blocks of code on Piazza - if you can distill the problem to 1-2 lines of code and an error message, that's fine, but try to avoid giving out key components of your work.
  5. By the same token, when answering a question, try to give some guiding help but do not post code fixes or explicit solutions to the problem.
  6. Posting on Piazza counts toward your participation grade, both asking and answering!

Haverford Academic Accommodations Statement

For details about the accommodations process, visit the Access and Disability Services website.

We are committed to partnering with you on your academic and intellectual journey and recognize that you bring many strengths, perspectives and strategies as you navigate this journey. We also recognize that your ability to thrive academically can be impacted by your personal well-being and that stressors may impact you over the course of the semester. If the stressors are academic, we welcome the opportunity to discuss and address those stressors with you in order to find solutions together. If you are experiencing challenges or questions related to emotional health, finances, physical health, relationships, learning strategies or differences, or other related topics, we hope you will consider reaching out to the many resources available on campus. These resources include CAPS (free and unlimited counseling is available), Office of Academic Resources, Writing Center, Student Diversity Equity and Access Team, Health Services, Professional Health Advocate, Religious and Spiritual Life, the Office of Multicultural Affairs, the GRASE Center, and the Dean's Office. Additional information can be found here.

Additionally, Haverford College is committed to creating a learning environment that meets the needs of its diverse student body and providing equitable access to students with disabilities. If you have (or think you may have) a disability related to mental health, chronic health, neurological state, and/or physical condition – please contact the Office of Access and Disability Services (ADS) at hc-ads@haverford.edu. It is never too late to request ADA accommodations – our bodies and circumstances are continuously changing. Please know that all inquiries and health-related information are handled in a sensitive and confidential manner.

Students who have already been approved to receive academic ADA accommodations and want to use these in this course should share their accommodation letter and make arrangements to meet with me as soon as possible to discuss how their accommodations will be implemented in this course. Please note that accommodations are not retroactive and require advance notice in order to successfully implement.

If, at any point in the semester, a disability or personal circumstances affect your learning in this course or if there are ways in which the overall structure of the course and general classroom interactions could be adapted to facilitate full participation, please do not hesitate to reach out to us.

It is a state law in Pennsylvania that individuals must be given advance notice that they may be recorded. Therefore, any student who has a disability-related need to audio record this class must first be approved for this ADA accommodation by Access and Disability Services and then must communicate approval to me. I will then make a general announcement to the class that audio recording may occur while respecting students’ right to privacy by not identifying the individual(s).

Haverford Title IX Statement

Haverford College is committed to fostering a safe and inclusive living and learning environment where all can feel secure and free from harassment. All forms of sexual misconduct, including sexual assault, sexual harassment, stalking, domestic violence, and dating violence are violations of Haverford's policies, whether they occur on or off campus. Haverford faculty are committed to helping to create a safe learning environment for all students and for the College community as a whole. If you have experienced any form of gender or sex-based discrimination, harassment, or violence, know that help and support are available. Staff members are trained to support students in navigating campus life, accessing health and counseling services, providing academic and housing accommodations, and more.

The College strongly encourages all students to report any incidents of sexual misconduct. Please be aware that all Haverford employees (other than those designated as confidential resources such as counselors, clergy, and healthcare providers) are required to report information about such discrimination and harassment to the Bi-College Title IX Coordinator.

Information about the College's Sexual Misconduct policy, reporting options, and a list of campus and local resources can be found on the College's website here.


Official Python style guide
Python 3 Documentation