DataSci W207:
Applied Machine Learning

Session 3: Th, 4:00-5:30pm PT

Session 99: Th, 6:30-8:00pm PT

Office hours: Wed, 8:00-8:45am PT

Description

The goal of this course is to provide a broad introduction to the key ideas in machine learning. The emphasis will be on intuition and practical examples rather than theoretical results. Through a variety of lecture examples and programming projects, you will learn how to apply powerful machine-learning techniques to new problems, how to run evaluations and interpret results, and how to think about scaling up from thousands of data points to billions.

This class meets for one 90-minute class period each week. It includes four guided programming projects and one more open-ended final project.

All materials in this course are posted on GitHub in the form of Jupyter notebooks.

Announcements
  • Please fill out this PRE-COURSE survey so I can get to know a bit more about you and your programming background.
  • We WILL NOT be using ISVC for communication. We will be using it only for assignment submissions.
Class Logistics

Course Prerequisites

  • Core data science courses: research design, storing and retrieving data, exploring and analyzing data.

  • Undergraduate-level probability and statistics. Linear algebra is recommended.

Programming Prerequisites

  • Python (v3). We will be primarily using NumPy and scikit-learn.

  • Jupyter and JupyterLab notebooks. You can install them on your computer using pip or Anaconda. More information here.

  • Git(Hub), including clone/commit/push from the command line. You can sign up for an account here.

OS

  • Mac/Windows/Linux are all acceptable to use.

Textbook

  • Check readings posted on the iSchool Virtual Platform.

Assignments

  • The four guided programming projects are due in week 3 (Sept 12), week 5 (Sept 26), week 9 (Oct 24), and week 12 (Nov 21).
  • Code submitted via GitHub (see notes below).

Final Project

  • You are allowed to work in teams. You will present your final project in class during the final session (Dec 9). The presentation should not exceed 15-20 min.
  • Code submitted via GitHub (see notes below).

Live Session Plan


Week    Lecture                                 Lecture Materials  Deadlines (Sunday of the week, 11:59 pm PT)

Supervised Learning
08/26   Introduction                            Week 1
09/02   Nearest neighbors                       Week 2
09/09   Naive Bayes                             Week 3             Project 1
09/16   Decision trees                          Week 4
09/23   Cross-validation and Ensemble learning  Week 5             Project 2
09/30   Regression analysis                     Week 6             Final project: group and dataset
10/07   Neural networks                         Week 7
10/14   Support vector machines                 Week 8

Unsupervised Learning
10/21   Cluster analysis                        Week 9             Project 3
10/28   Gaussian mixture models                 Week 10
11/04   Dimensionality reduction                Week 11            Final project: baseline presentation
11/11   [Fall Break]                            -

Other Topics
11/18   Network analysis                        Week 12            Project 4
11/25   [Thanksgiving Break]                    -
12/02   Recommender systems                     Week 13
12/09   Wrap-up                                 Week 14            Final project: code and presentation

Communication channel

We will use Slack to communicate throughout the semester. Questions/comments related to your projects (NO CODE) are strongly encouraged.


Section Slack channel
6 #datasci-207-2021-fall-section-99-3

Final Project

For the final project you will form a group (3-4 people is ideal; 2-5 people are allowed; one-person groups are not allowed). Grades will be calibrated by group size. Your group can only include members from the section in which you are enrolled.

Do not just re-run an existing code repository; at the minimum, you must demonstrate the ability to perform thoughtful data preprocessing and analysis (e.g., data cleaning, model training, hyperparameter selection, model evaluation).
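
As a rough illustration of those steps, here is a minimal sketch using scikit-learn. The dataset (the bundled iris data), the k-nearest-neighbors model, and the parameter grid are placeholders chosen for this example, not project requirements. The imputation step stands in for data cleaning and is a no-op on iris, which has no missing values.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; a real project would load and inspect its own data.
X, y = load_iris(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and model in one pipeline, so each cross-validation fold
# fits the imputer and scaler on its own training portion only.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stand-in for data cleaning
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameter selection: 5-fold cross-validation over a small grid.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Model evaluation on the held-out test set, using the refit best pipeline.
test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {test_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline is a design choice that avoids leaking information from validation folds into the fitted scaler.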

The topic of your project is totally flexible (see the project ideas below).

Deadlines to remember:

  • week 6: inform me [here] about your group and the dataset you plan to use.
  • week 11: prepare a baseline presentation of your project. You will present in class (no more than 10 min).
  • week 14 (final session, Dec 9): code submission and final presentation in class (no more than 15-20 min).

A few project ideas:

Baseline presentation. Your slides should include:

  • Title, Authors
  • What is the question you will be working on? Why is it interesting?
  • What is the data you will be using? Include data source, size of dataset, main features to be used. Please also include summary statistics of your data.
  • What prediction algorithms do you plan to use? Please describe them in detail.
  • How will you evaluate your results? Please describe your chosen performance metrics and/or statistical tests in detail.

Final presentation. Your slides should include:

  • Title, Authors
  • (15%) Motivation: Introduce your question and why the question is interesting. Explain what has been done before in this space. Describe your overall plan to approach your question. Provide a summary of your results.
  • (15%) Data: Describe in detail the data that you are using, including the source(s) of the data and relevant statistics.
  • (15%) Approach: Describe in detail the model that you use in your approach.
  • (30%) Experiments: Compare the performance of your chosen model with other baselines or models. Provide insight into the effect of different hyperparameter choices. Please include tables, figures, and graphs to illustrate your experiments.
  • (10%) Conclusions: Summarize the key results, what has been learned, and avenues for future work.
  • (15%) Code submission: Provide link to your GitHub repo. The code should be well commented and organized.
  • Contributions: Specify the contributions of each author (e.g., data processing, algorithm implementation, slides, etc.).
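
For the Experiments item, a comparison table might be produced along the following lines. This is only a sketch: the iris dataset, the majority-class baseline, and the particular values of k are assumptions made for illustration, and your own project will call for its own baselines and hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder dataset for the illustration.
X, y = load_iris(return_X_y=True)

# A trivial baseline plus the same model under several hyperparameter values.
models = {"majority-class baseline": DummyClassifier(strategy="most_frequent")}
for k in (1, 5, 15, 45):
    models[f"k-NN (k={k})"] = KNeighborsClassifier(n_neighbors=k)

# 5-fold cross-validated accuracy (mean and standard deviation) per model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())

# Plain-text table suitable for pasting into slides or a notebook.
print(f"{'model':<26}{'mean acc':>10}{'std':>8}")
for name, (mean, std) in results.items():
    print(f"{name:<26}{mean:>10.3f}{std:>8.3f}")
```

Reporting the spread across folds alongside the mean helps show whether differences between hyperparameter settings are meaningful.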

Project Submission Guidelines
  • Step 1: Create separate GitHub repos for Projects 1-4 and Final Project
  • Step 2: Submit your GitHub link, ipynb file(s), and slides [if Final Project] in ISVC
Grading

Final grades will be determined by computing the weighted average of programming projects, final group project, and participation.

Baseline grading range for this course is: A for 93 or above, A- for 90 or above, B+ for 87 or above, B for 83 or above, B- for 80 or above, C+ for 77 or above, C for 73 or above, C- for 70 or above, D+ for 67 or above, D for 63 or above, D- for 60 or above, and F for 59 and below.

  • Participation: 5%
  • Programming projects: 15% each (x4)
  • Final project: 35%
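
To make the arithmetic concrete, the sketch below computes a weighted average from these weights and maps it to the baseline letter ranges. The example scores are hypothetical. The syllabus does not say how fractional scores are rounded, so the code assumes no rounding and treats anything below 60 as an F.

```python
# Weights from the syllabus: participation 5%, four projects at 15% each,
# final project 35%.
WEIGHTS = {
    "participation": 0.05,
    "project_1": 0.15,
    "project_2": 0.15,
    "project_3": 0.15,
    "project_4": 0.15,
    "final_project": 0.35,
}

# Lower bounds of the baseline letter-grade ranges, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def final_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter (assumption: no rounding is applied)."""
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
example = {
    "participation": 100,
    "project_1": 90,
    "project_2": 85,
    "project_3": 95,
    "project_4": 80,
    "final_project": 92,
}
total = final_score(example)
print(f"{total:.1f} -> {letter_grade(total)}")
```
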
Late Policy
Late submissions will be accepted up to one week past the deadline with a 10% penalty, but you need to let me know if you will be submitting late.
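
One possible reading of this policy, sketched in code: the 10% is taken off the earned score, and a submission more than one week late is not accepted. Both points are assumptions, since the policy text does not spell them out. The function name and the example score are hypothetical; the deadline used is the Project 1 due date (Sept 12, 11:59 pm PT) in the Fall 2021 term.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(weeks=1)  # late submissions accepted up to one week
LATE_FACTOR = 0.90                # assumption: 10% is taken off the earned score


def apply_late_policy(score, deadline, submitted):
    """Return the adjusted score, or None if outside the late window."""
    if submitted <= deadline:
        return score                  # on time: no penalty
    if submitted - deadline <= LATE_WINDOW:
        return score * LATE_FACTOR    # within one week: 10% penalty
    return None                       # more than one week late: not accepted


# Project 1 deadline; times are naive datetimes read as Pacific Time.
deadline = datetime(2021, 9, 12, 23, 59)
on_time = apply_late_policy(90, deadline, datetime(2021, 9, 12, 20, 0))
two_days_late = apply_late_policy(90, deadline, datetime(2021, 9, 14, 20, 0))
too_late = apply_late_policy(90, deadline, datetime(2021, 9, 21, 20, 0))
print(on_time, two_days_late, too_late)
```
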
Equity and Inclusion

Integrating a diverse set of experiences is important for a more comprehensive understanding of machine learning. I will make an effort to read papers and hear from a diverse group of practitioners; still, limits exist on this diversity in the field of machine learning. I acknowledge that there may be both overt and covert biases in the material due to the lens with which it was created. I would like to nurture a learning environment that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, veteran status, etc.) in the spirit of the UC Berkeley Principles of Community.

To help accomplish this, please contact me or submit anonymous feedback through I School channels if you have any suggestions to improve the quality of the course. If you have a name and/or set of pronouns that you prefer I use, please let me know. If something was said in class (by anyone) or you experience anything that makes you feel uncomfortable, please talk to me about it. If you feel like your performance in the class is being impacted by experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. Also, anonymous feedback is always an option, and may lead me to make a general announcement to the class, if necessary, to address your concerns.

As a participant in teamwork and course discussions, you should also strive to honor the diversity of your classmates.

If you prefer to speak with someone outside of the course, MICS Academic Director Lisa Ho, I School Assistant Dean of Academic Programs Catherine Cronquist Browning, and the UC Berkeley Office for Graduate Diversity are excellent resources. Also see the following link.