Large Scale Machine Learning
- e.g. Census data, Website traffic data
- Can we train on 1,000 examples instead of 100,000,000? Plot learning curves to decide (see the sketch below)
- If high variance, add more examples
If high bias, add extra features
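A quick sketch of that plot in Octave; trainModel and computeCost are hypothetical placeholder functions, not course code:

    % Hypothetical sketch: learning curves to decide whether more data helps.
    sizes = round(linspace(100, 1000, 10));
    for i = 1:length(sizes)
      m = sizes(i);
      theta = trainModel(Xtrain(1:m,:), ytrain(1:m));   % hypothetical helper
      Jtrain(i) = computeCost(Xtrain(1:m,:), ytrain(1:m), theta);
      Jcv(i)    = computeCost(Xcv, ycv, theta);
    end
    plot(sizes, Jtrain, sizes, Jcv);   % persistent gap => high variance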
Gradient Descent with Large Datasets
- G.D. = batch gradient descent
- Stochastic Gradient Descent
- cost(theta, (x^(i), y^(i))) = (1/2) * (h_theta(x^(i)) - y^(i))^2 = the cost of theta w.r.t. a single example (x^(i), y^(i)). Measures how well the hypothesis works on that example.
- May need to loop over the entire dataset 1-10 times
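A minimal Octave sketch of stochastic gradient descent for linear regression, assuming a design matrix X with a bias column, labels y, and an initial theta; alpha is illustrative:

    % Stochastic gradient descent (sketch): update theta one example at a time.
    m = size(X, 1);
    alpha = 0.01;
    for pass = 1:3                    % 1-10 passes over the data
      idx = randperm(m);              % shuffle the examples first
      for k = idx
        h = X(k,:) * theta;           % hypothesis on a single example
        theta = theta - alpha * (h - y(k)) * X(k,:)';
      end
    end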
Mini-Batch Gradient Descent
- Batch gradient descent: Use all m examples in each iteration
- Stochastic gradient descent: Use 1 example in each iteration
- Mini-batch gradient descent: Use b examples in each iteration
- typical range for b is 2-100 (b = 10 is a common choice)
- Mini-batch Gradient Descent allows a vectorized implementation
  Can partially parallelize the computation (sketch below)
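A hedged sketch of the mini-batch update, assuming the same X, y, theta and alpha as in the sketch above:

    % Mini-batch gradient descent (sketch): the b examples of each batch
    % are processed with one vectorized matrix product instead of a loop.
    b = 10;
    m = size(X, 1);
    for i = 1:b:(m - b + 1)
      Xb = X(i:i+b-1, :);
      yb = y(i:i+b-1);
      grad  = Xb' * (Xb * theta - yb) / b;
      theta = theta - alpha * grad;
    end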
Stochastic G.D. convergence
- every 1000 iterations we can plot the cost averaged over the last 1000 examples
- Learning rate: a smaller learning rate means smaller oscillations around the minimum (plot)
  averaging over more examples (e.g. 5000) may give a smoother but more delayed curve
- If the curve is increasing, use a smaller learning rate
- Learning rate can be slowly decreased over time for better convergence:
  alpha = const1 / (iterationNumber + const2)
Online Learning
- continuous stream of data; learn from each example, then discard it
- e.g. 1. shipping service: given origin and destination, optimize the price we offer
- x = feature vector (price, origin, destination)
  y = whether the user chose to use our service or not
- e.g. 2. product search
- input: "Android phone 1080p camera"
- we want to show 10 phones per query
- learn the predicted click-through rate (CTR) for each phone (sketch below)
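A minimal sketch of the online update for the CTR example, assuming logistic regression; getExample and extractFeatures are hypothetical placeholders for the data stream:

    % Online learning (sketch): one gradient step per streamed example.
    alpha = 0.1;
    while true
      [query, phone, clicked] = getExample();   % hypothetical stream source
      x = extractFeatures(query, phone);        % column vector with bias term
      h = 1 / (1 + exp(-theta' * x));           % sigmoid hypothesis
      theta = theta - alpha * (h - clicked) * x;
      % the example is then discarded rather than stored
    end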
Map Reduce and Data Parallelism
- Use each machine's local CPU to process its local data
- Massive data parallelism
- e.g. free text, unstructured data
  - sentiment analysis
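A sketch of the map-reduce idea for batch gradient descent, assuming X, y, theta and alpha as before: split the m examples across (say) 4 machines, let each compute a partial gradient sum, then combine. Here the split is only simulated in one process:

    % Map-reduce for the gradient sum (sketch, simulated locally).
    m = size(X, 1);
    splits = round(linspace(0, m, 5));        % boundaries of 4 chunks
    grad = zeros(size(theta));
    for w = 1:4                               % each "machine" w ...
      rows = (splits(w)+1):splits(w+1);
      Xw = X(rows, :);
      yw = y(rows);
      grad = grad + Xw' * (Xw * theta - yw);  % ... contributes a partial sum
    end
    theta = theta - alpha * grad / m;         % the reduce step combines them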
Anomaly Detection
- e.g. aircraft engines, features: heat generated, vibration intensity
- e.g. servers: memory usage, CPU load, CPU load / network traffic
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
Anomaly Detection vs. Supervised Learning
What Features to Use
Multivariate Gaussian Distribution
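A minimal sketch of anomaly detection with the multivariate Gaussian: fit mu and Sigma on a matrix X (m x n) of normal examples, then flag a new example x (n x 1) as anomalous when p(x) < epsilon; the epsilon value here is illustrative and would be chosen on a cross-validation set:

    % Multivariate Gaussian anomaly detection (sketch).
    m = size(X, 1);
    n = size(X, 2);
    mu    = mean(X)';                         % n x 1
    Sigma = (X - mu')' * (X - mu') / m;       % n x n covariance matrix
    p = @(v) (2*pi)^(-n/2) * det(Sigma)^(-1/2) ...
             * exp(-0.5 * (v - mu)' * inv(Sigma) * (v - mu));
    epsilon = 0.02;                           % illustrative threshold
    isAnomaly = p(x) < epsilon;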
Content-Based Recommendations
- r(i, j) = 1 if user j has rated movie i (0 otherwise)
- y(i, j) = rating given by user j to movie i (defined only where r(i, j) = 1)
Low-rank Matrix Factorization
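A hedged sketch of what the low-rank factorization buys us: with learned movie features X and user parameters Theta, every predicted rating comes from a single low-rank matrix product:

    % Low-rank matrix factorization (sketch).
    % X:     num_movies x n matrix of learned movie feature vectors
    % Theta: num_users  x n matrix of learned user parameter vectors
    Ypred = X * Theta';   % Ypred(i,j) = predicted rating of movie i by user j
    % collaborative filtering learns X and Theta jointly by minimizing
    % the squared error over the entries with r(i,j) = 1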
Unsupervised Learning & K-means
- Clustering algorithms, the K-means algorithm (sketch below)
- K-means for non-separated clusters
- Random initialisation (run K-means several times, keep the lowest-cost result)
- Elbow method for choosing the number of clusters K
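A minimal Octave sketch of one K-means run, assuming a data matrix X (m x n); the two alternating steps are the whole algorithm (empty-cluster handling omitted for brevity):

    % K-means (sketch): assign points to the nearest centroid, then move centroids.
    m = size(X, 1);
    K = 3;
    centroids = X(randperm(m, K), :);   % random initialisation from the data
    for iter = 1:10
      for i = 1:m                       % cluster assignment step
        [~, c(i)] = min(sum((centroids - X(i,:)).^2, 2));
      end
      for k = 1:K                       % move-centroid step
        centroids(k,:) = mean(X(c == k, :), 1);
      end
    end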
Dimensionality Reduction & PCA
- 2D -> 1D: project the data onto a lower-dimensional surface
- Data compression to speed up training, as well as visualization of complex datasets
- Indexes (e.g. GDP, Human Development Index) as compressed summaries of many features
- Principal Component Analysis (PCA): projection onto the directions of largest variance
- Data preprocessing: feature scaling, mean normalization
- [U, S, V] = svd(Sigma), where Sigma = (1/m) * X' * X is the covariance matrix
- the columns of U are the principal components (eigenvectors of Sigma); take the first k columns as Ureduce
- Reconstruction from the compressed representation (sketch below)
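A minimal PCA sketch in Octave following those steps, assuming X is already mean-normalized and feature-scaled:

    % PCA (sketch): compress X from n dimensions to k, then reconstruct.
    m = size(X, 1);
    Sigma = (X' * X) / m;          % n x n covariance matrix
    [U, S, V] = svd(Sigma);
    k = 2;
    Ureduce = U(:, 1:k);           % first k principal components
    Z = X * Ureduce;               % compressed representation, m x k
    Xapprox = Z * Ureduce';        % reconstruction back in n dimensions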
This week is about the Support Vector Machine (SVM).
First we will learn about large margin classification: the SVM chooses the decision boundary with the largest minimum distance (margin) to any of the training samples.
We will study kernels, which adapt the SVM to non-linear classifiers.
Choosing landmarks will also be covered (in practice, the training examples themselves serve as the landmarks).
The C parameter will be studied (a large C gives lower bias but higher variance, a small C the opposite).
Similarity functions and the Gaussian kernel are also main keywords of this session.
We will get good advice about choosing between SVMs, logistic regression, and neural networks.
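A minimal sketch of the Gaussian kernel: the similarity between an example x and a landmark l, close to 1 when x is near l and close to 0 when far (sigma controls the width):

    % Gaussian kernel similarity (sketch) between x and a landmark l.
    sigma = 1.0;
    f = exp(-sum((x - l).^2) / (2 * sigma^2));   % value in (0, 1]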
Advice for Applying Machine Learning
What to try next? More samples? Smaller feature sets? More complex features? Decreasing lambda?
We will learn to use a test set and a cross-validation set to evaluate these choices.
This lesson presents the powerful diagnostic of bias vs. variance.
Machine Learning System Design
Building a spam classifier: with this example we will learn to prioritize what to work on and make the best use of our time.
Plotting learning curves will help us grade our work and decide when to change our working path.
We will also learn a method for error analysis: developing intuition by manually examining the examples the algorithm misclassified. A single numerical evaluation metric will be an important tool for us.
For skewed classes we will use precision and recall; the F1 score, F1 = 2 * P * R / (P + R), measures the trade-off between them (sketch below).
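A minimal sketch computing these metrics, assuming ypred and ytest are 0/1 vectors of predictions and true labels:

    % Precision, recall and F1 score (sketch) for 0/1 labels.
    tp = sum((ypred == 1) & (ytest == 1));   % true positives
    fp = sum((ypred == 1) & (ytest == 0));   % false positives
    fn = sum((ypred == 0) & (ytest == 1));   % false negatives
    P  = tp / (tp + fp);                     % precision
    R  = tp / (tp + fn);                     % recall
    F1 = 2 * P * R / (P + R);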
The fifth week of this course is about neural networks.
The initial topic is a detailed study of the cost function.
Forward propagation and back propagation are also explained.
The second part of this session is about back-propagation in practice.
The lesson covers unrolling parameters into vectors, using reshape in MATLAB.
Gradient checking is explained, and it is recommended to turn it off before training, since it is far too slow to run on every iteration.
Random initialization is the method used for symmetry breaking.
The last part of the session is about putting all of this together (see the sketch below).
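A minimal sketch of symmetry breaking and parameter unrolling, assuming illustrative layer sizes of 400-25-10:

    % Random initialization for symmetry breaking (sketch).
    INIT_EPSILON = 0.12;     % illustrative value
    Theta1 = rand(25, 401) * 2 * INIT_EPSILON - INIT_EPSILON;
    Theta2 = rand(10, 26)  * 2 * INIT_EPSILON - INIT_EPSILON;
    % Unroll into one vector for the optimizer, then reshape back.
    thetaVec = [Theta1(:); Theta2(:)];
    Theta1 = reshape(thetaVec(1:25*401), 25, 401);
    Theta2 = reshape(thetaVec(25*401+1:end), 10, 26);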
Week 4 was about neural networks.
We started by reviewing neurons and the brain.
The model representation covered the input layer, hidden layers, and the output layer.
Units can be found in each layer.
Multi-class classification is an application of NNs (see the sketch below).
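A minimal forward-propagation sketch for one example x with a single hidden layer, reusing the illustrative Theta1 (25 x 401) and Theta2 (10 x 26) shapes from above:

    % Forward propagation (sketch) for a 3-layer network.
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    a1 = [1; x];                     % input layer plus bias unit (401 x 1)
    a2 = [1; sigmoid(Theta1 * a1)];  % hidden layer activations plus bias (26 x 1)
    a3 = sigmoid(Theta2 * a2);       % output layer, one unit per class (10 x 1)
    [~, predictedClass] = max(a3);   % multi-class prediction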
In the third week we started with logistic regression, used for classification problems.
In the context of the model representation we studied the cost function and gradient descent.
Around multi-class classification we reviewed one-vs-all.
To solve the problem of overfitting, we used regularization (see the sketch below).
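A minimal sketch of the regularized logistic regression cost, assuming a design matrix X with a bias column, labels y in {0, 1}, parameters theta, and a regularization strength lambda:

    % Regularized logistic regression cost (sketch).
    m = length(y);
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    h = sigmoid(X * theta);
    J = (-y' * log(h) - (1 - y)' * log(1 - h)) / m ...
        + lambda / (2 * m) * sum(theta(2:end).^2);   % theta(1) is not regularized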
In the second week we started exploring MATLAB.
Multivariate Linear Regression
We studied multiple features for linear regression. Feature scaling is necessary for a better-behaved regression, and adjusting the learning rate is another key action.
Polynomial regression is a more complex model type.
Parameters can also be computed analytically using the normal equation (sketch below).
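A minimal sketch of the normal equation in Octave; pinv handles the case where X' * X is non-invertible:

    % Normal equation (sketch): closed-form parameters, no iteration
    % and no feature scaling needed.
    theta = pinv(X' * X) * X' * y;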
Beyond basic operations, plotting data is a good way to understand models, learning curves, and algorithm behaviour.
Vectorization is also a necessary technique for good algorithm efficiency.
In the first week, supervised and unsupervised learning were presented. Regression and classification were studied as applications of supervised learning. Cluster detection was mentioned as an unsupervised learning application.
The model function and the cost function were explained.
Parameter learning was then carried out with gradient descent (see the sketch below).
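A minimal batch gradient descent sketch for linear regression, assuming X, y and an initial theta; alpha and the iteration count are illustrative:

    % Batch gradient descent (sketch) for linear regression.
    m = length(y);
    alpha = 0.01;
    for iter = 1:1500
      grad  = X' * (X * theta - y) / m;   % uses all m examples per step
      theta = theta - alpha * grad;       % simultaneous update of all parameters
    end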
Linear Algebra Review
The review covered vectors and matrices, and operations such as scalar multiplication, matrix-matrix multiplication, and the inverse and transpose of a matrix (examples below).
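A few of those operations in Octave, as a quick reference:

    % Matrix operations from the review (sketch).
    A = [1 2; 3 4];
    B = [5 6; 7 8];
    C = 3 * A;       % scalar multiplication
    D = A * B;       % matrix-matrix multiplication
    At = A';         % transpose
    Ai = inv(A);     % inverse (A must be square and non-singular)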