This lecture draws heavily on the following sources.
High-level software focuses on a user-friendly interface for specifying and training models.
Keras, scikit-learn, …
Lower-level software focuses on developer tools for implementing deep learning models.
TensorFlow, PyTorch, Theano, CNTK, Caffe, Torch, …
Most tools are developed in Python plus a low-level language (C/C++, CUDA).
Developed by the Google Brain team for internal Google use. Formerly DistBelief.
Open sourced in Nov 2015.
OS: Linux, macOS, and Windows (since Nov 2016).
GPU support: NVIDIA CUDA.
TPU (tensor processing unit), built specifically for machine learning and tailored for TensorFlow.
Mobile device deployment: TensorFlow Lite (May 2017) for Android and iOS.
When you have a hammer, everything looks like a nail.
R users can access Keras and TensorFlow via the keras and tensorflow packages.
# install.packages("keras")          # first-time install of the R package
library(keras)
install_keras()                      # installs the Python Keras/TensorFlow backend
# install_keras(tensorflow = "gpu")  # if an NVIDIA GPU is available
On the teaching server, it may be necessary to run
library(reticulate)
virtualenv_create("r-reticulate")
to create a virtual environment ~/.virtualenvs/r-reticulate in which to install Keras locally.
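To verify the installation, a quick sanity check (is_keras_available() ships with the keras R package):
library(keras)
# TRUE if the Python Keras backend is reachable from R
is_keras_available()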
Source: CrowdFlower
The most time-consuming but also the most creative job: it takes \(>80\%\) of the time and requires experience and domain knowledge.
Determines the upper limit for the goodness of DL: garbage in, garbage out.
For structured/tabular data.
Data prep for special DL tasks.
Image data: pixel scaling, train-time augmentation, test-time augmentation, convolution and flattening (see the first sketch after this list).
Data tokenization: break sequences into units, map units to vectors, align and pad sequences.
Data embedding: sparse to dense, merge diverse data, preserve relationships, dimension reduction, Word2Vec; embeddings can be learned as part of model training (see the second sketch after this list).
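A minimal sketch of image prep in the R keras interface, assuming a TensorFlow >= 2.6 backend (the preprocessing layers shown are one of several ways to do this):
library(keras)
# Rescale pixels to [0, 1] and apply train-time augmentation as model layers
preprocess <- keras_model_sequential() %>%
  layer_rescaling(scale = 1 / 255) %>%        # pixel scaling
  layer_random_flip(mode = "horizontal") %>%  # train-time augmentation
  layer_random_rotation(factor = 0.1)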
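And a minimal sketch of tokenization, padding, and embedding (texts is a hypothetical character vector of documents):
library(keras)
texts <- c("deep learning is fun", "garbage in garbage out")
tokenizer <- text_tokenizer(num_words = 10000) %>%  # break text into units, map units to integer ids
  fit_text_tokenizer(texts)
seqs <- texts_to_sequences(tokenizer, texts)
x <- pad_sequences(seqs, maxlen = 8)                # align and pad sequences
# The embedding layer maps sparse ids to dense vectors, learned during training
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 16, input_length = 8)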
Source: https://www.asimovinstitute.org/neural-network-zoo/
Regression loss: MSE/quadratic loss/L2 loss, mean absolute error/L1 loss.
Classification loss: cross-entropy loss, …
Customized losses (see the sketch below).
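In the R keras interface, a custom loss is just a function of y_true and y_pred built from backend ops. A minimal sketch (the pseudo-Huber form and delta = 1 are assumptions; model stands for an already-defined keras model):
library(keras)
# Pseudo-Huber loss: quadratic near zero, linear in the tails
pseudo_huber <- function(y_true, y_pred) {
  delta <- 1.0  # assumed transition point between quadratic and linear regimes
  err <- y_true - y_pred
  k_mean(delta^2 * (k_sqrt(1 + (err / delta)^2) - 1))
}
model %>% compile(optimizer = "adam", loss = pseudo_huber, metrics = "mae")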
Choose an optimization algorithm: generalization (SGD) vs convergence rate (adaptive methods). The sketch after this list shows how an optimizer is selected in keras.
Stochastic GD.
Adding momentum: classical momentum, Nesterov acceleration.
Adaptive learning rate: AdaGrad, AdaDelta, RMSprop.
Combining acceleration and adaptive learning rate: ADAM (the default in many libraries).
Beyond ADAM: Lookahead, RAdam, AdaBound/AMSBound, Ranger, AdaBelief.
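In keras, the optimizer is chosen at compile time. A minimal sketch (model is a hypothetical keras model; the hyperparameters shown are common defaults):
library(keras)
# SGD with Nesterov momentum: often generalizes better
opt_sgd <- optimizer_sgd(learning_rate = 0.01, momentum = 0.9, nesterov = TRUE)
# ADAM: adaptive learning rates, fast convergence, a common default
opt_adam <- optimizer_adam(learning_rate = 0.001)
model %>% compile(optimizer = opt_adam, loss = "mse", metrics = "mae")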
A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam) by Lili Jiang: https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Source: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks