DataSci 207:
Applied Machine Learning

Lecture: We

Office hours: Th, 4-5 pm PT

Description

This course provides a practical introduction to the rapidly growing field of machine learning: training predictive models to generalize to new data. We start with linear and logistic regression and implement gradient descent, the core engine for training, for these algorithms. With these key building blocks, we work our way to understanding widely used neural network architectures, focusing on intuition and implementation with TensorFlow/Keras. While the course centers on neural networks, we will make sure to cover key ideas in unsupervised learning and nonparametric modeling.

Along the way, weekly short coding assignments and a midterm exam will connect lectures with concrete data and real applications. A more open-ended final project will tie together crucial concepts in experimental design and analysis with models and training.

This class meets for one 90-minute period each week.

All materials for this course are posted on GitHub in the form of Jupyter notebooks.

Announcements
  • Please fill out this PRE-COURSE survey so I can get to know a bit more about you and your programming background.
  • Due to a large number of private Slack inquiries, I encourage you to first read this website for commonly asked questions.
  • Any questions regarding course content and organization (including assignments and final project) should be posted on my Slack channel. You are strongly encouraged to answer other students' questions when you know the answer.
  • If there are private matters specific to you (e.g., special accommodations), please contact me directly.
  • If you miss a class, watch the recording and inform me here.
  • If you want to stay up to date with recent work in AI/ML, start by looking at the conferences NeurIPS and ICML.
  • ML study guidelines: Stanford's super cheatsheet.
Class Logistics

Course Prerequisites

  • Core data science courses: research design, storing and retrieving data, exploring and analyzing data.

  • Undergraduate-level probability and statistics. Linear algebra is recommended.

Programming Prerequisites

  • Python (v3).

  • Jupyter and JupyterLab notebooks. You can install them on your computer using pip or Anaconda. More information here.

  • Git(Hub), including clone/commit/push from the command line. You can sign up for an account here.

  • If you have an M1 Mac, this .sh script will install everything for you (credit goes to one of my former students, Kevin Stallone).

OS

  • Mac/Windows/Linux are all acceptable to use.

Textbook

  • Raschka & Mirjalili (RM), Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2.

Assignments

  • Weekly coding assignments, submitted via Gradescope (see notes below).

Midterm exam

  • The URL link for the exam will be added to bCourses on the exam date. More information to follow.

Final Project

  • You will present your final project in class during the final session. You are allowed to work in teams.
  • You will submit your code and presentation slides via GitHub (see notes below).

Live Session Plan


Week | Lecture | Lecture Materials | Readings | Deadlines (Sunday of the week, 11:59 pm PT)
Supervised and Unsupervised Learning
Aug 28-Sept 03 | Introduction and Framing | Week 01 | |
Sept 04-10 | Linear Regression - Gradient Descent | Week 02 | RM (10, 13 - intro to TensorFlow only), feature scaling, more math (1) | Assignment 1
Sept 11-17 | Linear Regression - Feature Engineering | Week 03 | RM (4, 2), Ilin et al. (2021) | Assignment 2
Sept 18-24 | Logistic Regression - Binary | Week 04 | RM (3, 6 (p.211-219)), more math (2) | Assignment 3; Group, question, and dataset for final project
Sept 25-Oct 01 | Logistic Regression - Multiclass | Week 05 | RM (3, 6 (p.211-219)), more intuition | Assignment 4
Oct 02-08 | Feedforward Neural Networks | Week 06 | RM (12, 13, 14), activation functions, regularization | Assignment 5
Oct 09-15 | KNN, Decision Trees, and Ensembles | Week 07 | RM (3, 7), Psaltos et al (2022) | Assignment 6; Midterm exam
Oct 16-22 | Unsupervised Learning: K-Means and PCA; Project: baseline presentation | Week 08 | RM (11) | Assignment 7; Baseline presentation: slides
Oct 23-29 | Embeddings for Text | Week 09 | RM (8, 16) | Assignment 8
Oct 30-Nov 05 | Convolutional Neural Networks | Week 10 | RM (15), 1D CNN intuition, Yoon Kim (2014) | Assignment 9
Nov 06-12 | Fall Break | | |
Nov 13-19 | Network Architecture and Debugging ML algorithms | Week 11 | Andrew Ng's advice for Applying ML | Assignment 10
Nov 20-26 | Thanksgiving Break | | |
Nov 27-Dec 03 | Fairness in ML | Week 12 | Suresh and Guttag (2021) |
Dec 04-10 | Advanced Topics: RNN/LSTMs, Transformers, BERT | Week 13 | Raschka et al, ch. 16 (2022) |
Dec 11-17 | Project: final presentation | | | Final presentation: slides and code

Communication channel

We will use Slack to communicate throughout the semester. Questions/comments related to your projects (NO CODE) are strongly encouraged.


Slack channel for sections 5, 6: #datasci-207-2023-fall-ci

Midterm Exam

How do I take the exam?

  • The exam is available in Gradescope from Oct 09, 8 a.m. PT to Oct 15, 11:59 p.m. PT. You can take it at any time within this window. Once started, the exam is limited to 60 minutes.

What is the best way to prepare for the exam?

  • Going over the async material, live sessions, homework, and assigned readings covered in weeks 1-6 is all you need to do. You are allowed to prepare a cheat sheet.

Can I use ChatGPT?

  • My advice is not to rely on ChatGPT for answers during the exam, since some of its responses are wrong. Please rely on your own understanding of the material.

How can I see my grade?

  • We aim to post exam grades in bCourses one week after the deadline.

What is the best way to access the exam solutions?

  • Attend any of the office hours in the week after the exam grades are posted; instructors will discuss midterm solutions based on your questions.

Final Project

For the final project you will form a group (3-4 people are ideal). Grades will be calibrated by group size and individual contributions. Your group can only include members from the section in which you are enrolled.

Do not just re-run an existing code repository; at a minimum, you must demonstrate the ability to perform thoughtful data preprocessing and analysis (e.g., data cleaning, model training, hyperparameter selection, model evaluation).

The topic of your project is entirely flexible (see some project ideas below).

Deadlines to remember:

  • week 04: inform me here about your group, your question, and the dataset you plan to use.
  • week 08: prepare the baseline presentation of your project. You will present in class (no more than 12 min).
  • week 16: prepare the final presentation of your project. You will present in class (no more than 12 min).

A few project ideas (from my Summer 2022 students):

Baseline presentation. Your slides should include:

  • Title, Authors
  • What is the question you will be working on? Why is it interesting?
  • What is the data you will be using? Include data source, size of dataset, main features to be used. Please also include summary statistics of your data.
  • What prediction algorithms do you plan to use? Please describe them in detail.
  • How will you evaluate your results? Please describe your chosen performance metrics and/or statistical tests in detail.

Final presentation. Your slides should include:

  • Title, Authors
  • (15%) Motivation: Introduce your question and why the question is interesting. Explain what has been done before in this space. Describe your overall plan to approach your question. Provide a summary of your results.
  • (15%) Data: Describe in detail the data that you are using, including the source(s) of the data and relevant statistics.
  • (15%) Modeling: Describe in detail the models (baseline + improvement over baseline) that you use in your approach.
  • (30%) Experiments: Provide insight into the effect of different hyperparameter choices. Please include tables, figures, and graphs to illustrate your experiments.
  • (10%) Conclusions: Summarize the key results, what has been learned, and avenues for future work.
  • (15%) Code submission: Provide link to your GitHub repo. The code should be well commented and organized.
  • Contributions: Specify the contributions of each author (e.g., data processing, algorithm implementation, slides etc.).

Assignments and Final Project Submission Guidelines
  • Part 1: Create a GitHub repo for Assignments 1-10. Upload the homework .ipynb file to Gradescope each week before the deadline.
  • Part 2: Create a team GitHub repo for the Final Project. This repo will contain your code as well as PowerPoint slides for the baseline and final presentations. Add me as a contributor if your repo is private (my username is corneliailin), and add the link to your repo here.
Grading

Final grades will be determined by computing the weighted average of homework, midterm exam, final group project, and participation.

The baseline grading ranges for this course are: A for 93 or above, A- for 90 or above, B+ for 87 or above, B for 83 or above, B- for 80 or above, C+ for 77 or above, C for 73 or above, C- for 70 or above, D+ for 67 or above, D for 63 or above, D- for 60 or above, and F for 59 or below.

Participation: 5%
Assignments: 45%
Midterm: 20%
Final project: 30%
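
To illustrate how the weights listed above combine into a final grade, here is a minimal sketch in Python. The function names and the sample scores are hypothetical. The sketch assumes that each component is scored on a 0-100 scale, that the baseline cutoffs are applied to the unrounded average, and that anything below 60 maps to F; actual grades may be calibrated differently.

```python
# Weights from the grading breakdown (fractions of the final grade).
WEIGHTS = {
    "participation": 0.05,
    "assignments": 0.45,
    "midterm": 0.20,
    "final_project": 0.30,
}

# Baseline letter-grade cutoffs, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def weighted_average(scores):
    """Combine component scores (each on a 0-100 scale) using the course weights."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(average):
    """Map a weighted average to a baseline letter grade (assumption: below 60 is F)."""
    for cutoff, letter in CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
example = {"participation": 100, "assignments": 92, "midterm": 85, "final_project": 90}
avg = weighted_average(example)
print(round(avg, 1), letter_grade(avg))  # 90.4 A-
```

In this example the components contribute 5 + 41.4 + 17 + 27 = 90.4 points, which falls in the A- range.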
Late Policy
Weekly homework assignments can be submitted up to 3 days late with a 10% (absolute) penalty per day. The lowest grade will be dropped.
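
As a rough illustration of the late policy, here is a minimal sketch in Python. It rests on several assumptions that the policy text does not spell out: that "absolute" means 10 percentage points are subtracted from the score for each day late, that submissions more than 3 days late receive no credit, and that the lowest score is dropped after penalties are applied. The function names are hypothetical.

```python
def penalized_score(raw_score, days_late):
    """Apply the late penalty to a raw score on a 0-100 scale."""
    if days_late <= 0:
        return raw_score
    if days_late > 3:
        # Assumption: work is not accepted after the 3-day window.
        return 0.0
    # Assumption: "absolute" means 10 points off per day, floored at zero.
    return max(0.0, raw_score - 10.0 * days_late)


def assignment_average(scores):
    """Average the assignment scores after dropping the single lowest one."""
    if not scores:
        return 0.0
    if len(scores) == 1:
        return float(scores[0])
    kept = sorted(scores)[1:]  # drop the lowest score
    return sum(kept) / len(kept)


print(penalized_score(90, 2))              # 70.0
print(assignment_average([100, 80, 60]))   # 90.0
```

Under these assumptions, a submission scored at 90 and turned in 2 days late would count as 70.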
Equity and Inclusion

Integrating a diverse set of experiences is important for a more comprehensive understanding of machine learning. I will make an effort to read papers and hear from a diverse group of practitioners; still, limits exist on this diversity in the field of machine learning. I acknowledge that there may be both overt and covert biases in the material due to the lens with which it was created. I would like to nurture a learning environment that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, veteran status, etc.) in the spirit of the UC Berkeley Principles of Community.

To help accomplish this, please contact me or submit anonymous feedback through I School channels if you have any suggestions to improve the quality of the course. If you have a name and/or set of pronouns that you prefer I use, please let me know. If something was said in class (by anyone) or you experience anything that makes you feel uncomfortable, please talk to me about it. If you feel like your performance in the class is being impacted by experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. Also, anonymous feedback is always an option, and may lead me to make a general announcement to the class, if necessary, to address your concerns.

As a participant in teamwork and course discussions, you should also strive to honor the diversity of your classmates.

If you prefer to speak with someone outside of the course, MICS Academic Director Lisa Ho, I School Assistant Dean of Academic Programs Catherine Cronquist Browning, and the UC Berkeley Office for Graduate Diversity are excellent resources. Also see the following link.