You’ll learn how to start with a question and/or a dataset, and use machine learning to turn them into insights.
Lessons 1-4: Supervised Classification
Naive Bayes: We jump in headfirst, learning perhaps the world’s greatest algorithm for classifying text.
Support Vector Machines (SVMs): One of the top 10 algorithms in machine learning, and a must-try for many classification tasks. What makes it special? The kernel trick, which lets an SVM work with new, nonlinear combinations of your features on the fly, without you having to construct them by hand.
Decision Trees: Extremely straightforward, often just as accurate as an SVM but (usually) way faster. The launch point for more sophisticated methods, like random forests and boosting.
Lesson 5: Datasets and Questions
Behind any great machine learning project is a great dataset that the algorithm can learn from. We were inspired by a treasure trove of email and financial data from the Enron corporation, which would normally be strictly confidential but became public when the company went bankrupt in a blizzard of fraud. Follow our lead as we wrestle this dataset into a machine-learning-ready format, in anticipation of trying to predict cases of fraud.
Lessons 6-7: Regressions and Outliers
Regressions are some of the most widely used machine learning algorithms, and rightly share prominence with classification. What’s a fast way to make mistakes in regression, though? Have troublesome outliers in your data. We’ll tackle how to identify and clean away those pesky data points.
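To make the outlier idea concrete, here is a minimal sketch of one common cleaning strategy: fit a line, discard the points with the largest residuals, and fit again. It is an illustration only, not the course's own code; the function names `fit_line` and `drop_largest_residuals`, the 10% cutoff, and the toy data are all assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_largest_residuals(xs, ys, fraction=0.1):
    """Fit once, then drop the given fraction of points with the largest error.

    The 10% default is an arbitrary choice for this sketch.
    """
    slope, intercept = fit_line(xs, ys)
    # Sort points from smallest to largest absolute residual.
    points = sorted(
        zip(xs, ys),
        key=lambda p: abs(p[1] - (slope * p[0] + intercept)),
    )
    keep = len(points) - int(len(points) * fraction)
    kept = points[:keep]
    return [p[0] for p in kept], [p[1] for p in kept]


# Toy data (made up): y = 2x + 1, with one corrupted point.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 100.0  # the outlier

slope_before, intercept_before = fit_line(xs, ys)
clean_xs, clean_ys = drop_largest_residuals(xs, ys)
slope_after, intercept_after = fit_line(clean_xs, clean_ys)

print("slope with outlier:   ", round(slope_before, 3))
print("slope after cleaning: ", round(slope_after, 3))
```

A single bad point is enough to pull the fitted slope far from the true value of 2; after the point with the largest residual is removed, the refit recovers it.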
Lesson 8: Unsupervised Learning
K-Means Clustering: The flagship algorithm when you don’t have labeled data to work with, and a quick method for pattern-searching when approaching a dataset for the first time.
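As a rough sketch of what K-Means does under the hood, the example below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. It is not the course's implementation: the function name `kmeans`, the choice to seed centroids with the first `k` points, and the toy data are assumptions for illustration (real libraries typically use smarter initialization).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iterations=20):
    # Assumption for this sketch: seed with the first k points so the
    # result is deterministic.
    centroids = [points[i] for i in range(k)]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments will no longer change
        centroids = new_centroids
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Toy data (made up): two visibly separate groups, no labels given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that no labels are supplied anywhere: the grouping emerges from the distances alone, which is what makes this an unsupervised method.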
Lessons 9-12: Features, Features, Features
Feature Creation: Taking your human intuition about the world and turning it into data that a computer can use.
Feature Selection: As the saying often attributed to Einstein goes: make everything as simple as possible, but no simpler. In this case, that means identifying the most important features of your data.
Principal Component Analysis: A more sophisticated relative of feature selection that combines your existing features into a smaller set of new ones, and one of the crown jewels of unsupervised learning.
Feature Scaling: Simple tricks for making sure your data and your algorithm play nicely together.
Learning from Text: A huge share of the world’s information is stored as text, and there are some simple but effective tools for extracting it.
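As one illustration of the feature scaling item above, the sketch below shows min-max rescaling, which maps a feature onto the range 0 to 1 so that features measured in very different units can be compared. The function name `min_max_scale` and the sample values are made up for this example; other scaling schemes exist and this is only one simple option.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical feature: yearly salaries in dollars.
salaries = [20000.0, 50000.0, 110000.0]
scaled = min_max_scale(salaries)
print(scaled)
```

Scaling matters most for algorithms that rely on distances between points, where an unscaled feature with large numeric values would otherwise dominate.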
Lessons 13-14: Validation and Evaluation
Training/testing data split: How do you know that what you’re doing is working? You don’t, unless you validate. The train-test split is simple to do, and the gold standard for understanding your results.
Cross-validation: Take the training/testing split and put it on steroids. Validate your machine learning results like a pro.
Precision, recall, and F1 score: After all this data-driven work, quantify your results with metrics tailored to what is most important to you.
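The validation and evaluation ideas above can be sketched in a few lines. The example below builds k-fold cross-validation splits by hand and computes precision, recall, and F1 from a list of predictions. It is a simplified illustration, not the course's code: the function names `k_fold_indices` and `precision_recall_f1` and the label lists are assumptions, and a real workflow would usually shuffle the data before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Three folds over ten samples.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("test on", test_idx)

# Made-up labels: 1 marks the class of interest, 0 everything else.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Precision asks how many of the flagged cases were real, while recall asks how many of the real cases were flagged; which one matters more depends on the cost of each kind of mistake.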
Lesson 15: Wrapping It All Up
We take a step back and review what we’ve learned, and how it all fits together.
Mini-project at the end of each lesson
Final project: searching for signs of corporate fraud in Enron data