Deep Learning Models


  1. blog post about AE’s - regular, deep, sparse, regularized, cnn, variational

    1. A duplicate post, but it explains AEs quite nicely.

  2. Hinton’s Coursera course on PCA vs AE: covers what PCA does (maximizing variance and projecting) and what an AE does and can do to achieve similar, but non-linear, dense representations.

  3. Another great presentation on PCA vs AE, summarized in the KPCA section of this notebook. +another one +StackExchange

  4. BART: a denoising AE with sequence-to-sequence pre-training for natural language generation, translation, and comprehension.

AE for anomaly detection, fraud detection

Variational AE

  1. Disentangled VAE - improves VAE


  1. Implementation and faces: intuition for each node and what it represents in vision, i.e., each face resembles one of K clusters.

  2. SOM on iris: explains inference (averaging) and the cons of the method.



NEAT stands for NeuroEvolution of Augmenting Topologies. It is a method for evolving artificial neural networks with a genetic algorithm.

NEAT implements the idea that it is most effective to start evolution with small, simple networks and allow them to become increasingly complex over generations.

That way, just as organisms in nature increased in complexity since the first cell, so do neural networks in NEAT.

This process of continual elaboration allows finding highly sophisticated and complex neural networks.

A great article about NEAT


HyperNEAT computes the connectivity of its neural networks as a function of their geometry.

HyperNEAT is based on a theory of representation that hypothesizes that a good representation for an artificial neural network should be able to describe its pattern of connectivity compactly.

The encoding in HyperNEAT, called compositional pattern producing networks, is designed to represent patterns with regularities such as symmetry, repetition, and repetition with variation.

(WIKI) Compositional pattern-producing networks (CPPNs) are a variation of artificial neural networks (ANNs) that have an architecture whose evolution is guided by genetic algorithms.

A great HyperNeat tutorial on Medium.

Radial Basis Function Network (RBFN)

+ RBF layer in Keras.

The RBFN approach is more intuitive than the MLP.

  • An RBFN performs classification by measuring the input’s similarity to examples from the training set.

  • Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set.

  • When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype.

  • Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.
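The bullets above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the Gaussian activation `exp(-beta * d^2)`, the value of `beta`, the toy prototypes, and the rule of summing activations per class are all choices made here (a trained RBFN would learn output weights instead of summing directly).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rbf_activation(x, prototype, beta=1.0):
    """Gaussian similarity: 1.0 when x equals the prototype, decaying with distance."""
    d = euclidean(x, prototype)
    return math.exp(-beta * d ** 2)

def classify(x, prototypes, beta=1.0):
    """Sum the activations of each class's prototypes and pick the largest."""
    scores = {}
    for proto, label in prototypes:
        scores[label] = scores.get(label, 0.0) + rbf_activation(x, proto, beta)
    return max(scores, key=scores.get), scores

# Hypothetical prototypes: (training example, class label)
PROTOTYPES = [
    ((0.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((3.0, 3.0), "B"),
    ((4.0, 3.0), "B"),
]

label, scores = classify((0.5, 0.5), PROTOTYPES)
print(label, scores)
```

The input `(0.5, 0.5)` sits close to the class A prototypes, so their activations dominate the score.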

Bayesian Neural Network (BNN)

BNN (what is it?) - according to Uber, a Bayesian neural network (BNN) is an architecture that more accurately forecasts time series predictions and uncertainty estimations at scale. The article describes “how Uber has successfully applied this model to large-scale time series anomaly detection, enabling [us to] better accommodate rider demand during high-traffic intervals.”

Under the BNN framework, prediction uncertainty can be categorized into three types:

  1. Model uncertainty captures our ignorance of the model parameters and can be reduced as more samples are collected.

  2. Model misspecification arises when the test samples come from a different population than the training set.

  3. Inherent noise captures the uncertainty in the data generation process and is irreducible.

Note: in a series of articles, Uber explains its time series work, leading up to a BNN architecture.

  1. Neural networks - training on multi-signal raw data; the training X and Y are window-based, and the window size (lag) is determined in advance.

Vanilla LSTM did not work properly, therefore an architecture of

Regarding point 1: ‘run prediction with dropout 100 times’

Medium post with code showing how to do it.

Why do we need a confidence measure when we have a softmax probability layer? The blog post explains, for example, that with a CNN trained on apples, oranges, cats and dogs, an unrelated example such as a frog image may still lead the network to decide it’s an apple, so we can’t rely on the softmax probability as a confidence measure. The ‘run prediction with dropout 100 times’ approach should give us a confidence measure because it draws each weight from a Bernoulli distribution.

“By applying dropout to all the weight layers in a neural network, we are essentially drawing each weight from a Bernoulli distribution. In practice, this mean that we can sample from the distribution by running several forward passes through the network. This is referred to as Monte Carlo dropout.”

Taken from Yarin Gal’s blog post. In this figure we see how erratic the signal from a single forward pass (black line) is compared to the much cleaner signal from 100 dropout passes.

Is it applicable to time series? In the figure below he tries to predict the missing signal between each pair of dotted lines. A is a bad estimate, but with a dropout layer we can see that in most cases the signal is predicted better.

Going back to Uber: they actually use this idea to predict time series with an LSTM in an encoder-decoder framework.

Note: this is probably applicable in other types of networks.

PhD thesis by Yarin Gal: he talks about uncertainty in neural networks and using BNNs. He may have proved this claim there, but I did not read it. This blog post links to his full PhD thesis.

Old note: The idea behind uncertainty (paper here) is that in order to trust your network’s classification, you drop some of the neurons during prediction, repeat this ~100 times and average the results. Intuitively this gives you confidence in your classification and may increase your classification accuracy, because only a random part of your network participates in each of the 100 predictions. Please note that softmax doesn't give you certainty.

Medium post on prediction with drop out

The solution for Keras is to call every dropout layer with training=True and to add another dropout layer at the end of the model. Thanks, Sam.

    import keras

    inputs = keras.Input(shape=(10,))
    x = keras.layers.Dense(3)(inputs)
    # training=True keeps dropout active at prediction time
    outputs = keras.layers.Dropout(0.5)(x, training=True)
    model = keras.Model(inputs, outputs)
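To make the ‘run prediction with dropout 100 times’ idea concrete, here is a minimal framework-free sketch. It is not the Keras or Uber implementation: the tiny linear "network", its `WEIGHTS`, the drop rate and the seed are all hypothetical, chosen only to show that repeated stochastic passes yield both a mean prediction and a spread that can serve as an uncertainty estimate.

```python
import random
import statistics

WEIGHTS = [0.8, -0.5, 0.3, 1.2]  # hypothetical fixed weights of a one-layer model
DROP_RATE = 0.5                  # assumed dropout probability

def forward(x, rng, drop_rate=DROP_RATE):
    """One stochastic forward pass with dropout left ON at prediction time."""
    keep = 1.0 - drop_rate
    total = 0.0
    for w, xi in zip(WEIGHTS, x):
        if rng.random() < keep:      # Bernoulli keep mask per connection
            total += w * xi / keep   # inverted-dropout scaling keeps the expectation
    return total

def mc_dropout_predict(x, n_passes=100, seed=0):
    """Run n_passes stochastic passes; return the mean and standard deviation."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_passes)]
    return statistics.mean(samples), statistics.stdev(samples)

mean, std = mc_dropout_predict([1.0, 2.0, 3.0, 4.0])
print(f"prediction ~ {mean:.2f}, uncertainty (std) ~ {std:.2f}")
```

The mean plays the role of the averaged prediction, while the standard deviation is the confidence signal that a single softmax output does not provide.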


(an excellent and thorough explanation about LeNet) -

  • The convolution layer’s primary purpose is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data.

  • ReLU (more in the activation chapter) - The purpose of ReLU is to introduce non-linearity in our ConvNet

  • Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

  • Dense / Fully Connected - a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer to classify. The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.

The overall training process of the Convolutional Network may be summarized as below:

  • Step1: We initialize all filters and parameters / weights with random values

  • Step2: The network takes a single training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.

    • Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]

    • Since weights are randomly assigned for the first training example, output probabilities are also random.

  • Step3: Calculate the total error at the output layer (summation over all 4 classes)

    • (L2) Total Error = ∑ ½ (target probability – output probability) ²

  • Step4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.

    • The weights are adjusted in proportion to their contribution to the total error.

    • When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].

    • This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.

    • Parameters like number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during training process – only the values of the filter matrix and connection weights get updated.

  • Step5: Repeat steps 2-4 with all images in the training set.
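The error in Step 3 can be checked by hand with the two probability vectors quoted in the steps above. The sketch below only evaluates the formula; the function name is made up for illustration and no actual training happens here.

```python
def total_error(target, output):
    """Total Error = sum over classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

target = [0.0, 0.0, 1.0, 0.0]   # the boat image belongs to the third class
before = [0.2, 0.4, 0.1, 0.3]   # output with random initial weights
after = [0.1, 0.1, 0.7, 0.1]    # output after a weight update

error_before = total_error(target, before)  # 0.5 * (0.04 + 0.16 + 0.81 + 0.09), about 0.55
error_after = total_error(target, after)    # 0.5 * (0.01 + 0.01 + 0.09 + 0.01), about 0.06

print(error_before, error_after)
```

The drop from roughly 0.55 to roughly 0.06 is what "the output error is reduced" means in Step 4.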

The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.

Illustrated 10 CNNS architectures

A study on class imbalance in CNNs: “we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue”:

  1. Over sampling

  2. Undersampling

  3. Thresholding probabilities (ROC?)

  4. Cost-sensitive classification - different costs for different misclassifications

  5. One class - novelty detection. This is a concept learning technique that recognizes positive instances rather than discriminating between two classes

The results indicate (loosely) that oversampling is the better choice in most cases, and that it doesn't cause overfitting in CNNs.


1x1 CNN

  1. Mastery on 1x1 cnn, for dim reduction, decreasing feature maps and other usages.

    1. “This is the most common application of this type of filter and in this way, the layer is often called a feature map pooling layer.”

    2. “In the paper, the authors propose the need for an MLP convolutional layer and the need for cross-channel pooling to promote learning across channels.”

    3. “the 1×1 filter was used explicitly for dimensionality reduction and for increasing the dimensionality of feature maps after pooling in the design of the inception module, used in the GoogLeNet model”

    4. “The 1×1 filter was used as a projection technique to match the number of filters of input to the output of residual modules in the design of the residual network “
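A 1×1 convolution is easiest to see as a per-pixel linear map across channels. The sketch below is a plain-Python illustration, not a framework API: `conv1x1`, the toy input `x` and the weights `w` are hypothetical, and bias and activation are left out.

```python
def conv1x1(feature_maps, weights):
    """Apply a 1x1 convolution.

    feature_maps[c][i][j] is the input with C channels;
    weights[k][c] holds K filters, each with one weight per input channel.
    Returns out[k][i][j]: every pixel is a weighted sum over channels.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    out = []
    for kernel in weights:
        plane = [
            [sum(w * fmap[i][j] for w, fmap in zip(kernel, feature_maps))
             for j in range(width)]
            for i in range(height)
        ]
        out.append(plane)
    return out

# 3 input channels of size 2x2, reduced to 2 output channels
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w = [
    [1.0, 0.0, 0.5],   # filter 0: channel 0 + 0.5 * channel 2
    [0.0, 1.0, -1.0],  # filter 1: channel 1 - channel 2
]
y = conv1x1(x, w)
print(y)
```

The spatial size (2×2) is untouched while the channel count drops from 3 to 2, which is the dimensionality-reduction use described in the quotes above; using more filters than input channels would increase the depth instead.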


1. Using Mask R-CNN for object detection

Invariance in CNN

  1. Making CNNs shift-invariant - “Small shifts -- even by a single pixel -- can drastically change the output of a deep network (bars on left). We identify the cause: aliasing during downsampling. We anti-alias modern deep networks with classic signal processing, stabilizing output classifications (bars on right). We even observe accuracy increases (see plot below).”


Intuitions to the differences between max and average pooling:

  1. A max-pool layer compresses by taking the maximum activation in a block. If you have a block with mostly small activations, but a small bit of large activation, you will lose the information on the low activations. I think of this as saying "this type of feature was detected in this general area".

  2. A mean-pool layer compresses by taking the mean activation in a block. If large activations are balanced by negative activations, the overall compressed activations will look like no activation at all. On the other hand, you retain some information about low activations in the previous example.

  3. Max pooling, in other words: max pooling roughly means that only those features that are most strongly triggering outputs are used in the subsequent layers. You can look at it a little like focusing the network’s attention on what’s most characteristic for the image at hand.

  4. Global average pooling: In the last few years, experts have turned to global average pooling (GAP) layers to minimize overfitting by reducing the total number of parameters in the model. Similar to max pooling layers, GAP layers are used to reduce the spatial dimensions of a three-dimensional tensor. However, GAP layers perform a more extreme type of dimensionality reduction.
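The intuitions in this list can be reproduced on two toy 2×2 blocks. The helper names and the numbers are invented for illustration; `global_average_pool` assumes the usual definition of GAP, where each feature map is reduced to its single mean value.

```python
def max_pool(block):
    """Maximum activation in a block."""
    return max(v for row in block for v in row)

def mean_pool(block):
    """Mean activation in a block."""
    values = [v for row in block for v in row]
    return sum(values) / len(values)

def global_average_pool(feature_maps):
    """Reduce every feature map to one number: its mean."""
    return [mean_pool(fmap) for fmap in feature_maps]

mostly_small = [[0.1, 0.2], [0.1, 9.0]]   # one large activation among small ones
balanced = [[5.0, -5.0], [-5.0, 5.0]]     # large positives cancelled by negatives

print(max_pool(mostly_small), mean_pool(mostly_small))  # max keeps only the 9.0
print(max_pool(balanced), mean_pool(balanced))          # the mean looks like "no activation"
print(global_average_pool([mostly_small, balanced]))    # one value per feature map
```

Max pooling reports "the feature fired somewhere in this block", while mean pooling retains a trace of the small activations but can cancel out to zero.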

Dilated CNN

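  1. ResNet, DenseNet, U-Net - the trick behind them is the skip connection, which combines a block's output f(x) with its input x (ResNet adds them, f(x) + x; DenseNet and U-Net concatenate feature maps).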

Graph Convolutional Networks

Explanation here, with some examples


  1. The solution to CNN’s shortcomings: a CNN can identify features without regard to their relations to each other in an image, i.e. changing the location of body parts will not affect the classification, while changing the orientation of the image will. The promise of capsule nets is that these two issues are solved.

  2. Understanding capsule nets - part 2, there are more parts to the series

Transfer Learning using CNN

  1. To add: Keras book, chapter 5 (I think)

  2. Mastery on TL using CNN

    1. Classifier: The pre-trained model is used directly to classify new images.

    2. Standalone Feature Extractor: The pre-trained model, or some portion of the model, is used to pre-process images and extract relevant features.

    3. Integrated Feature Extractor: The pre-trained model, or some portion of the model, is integrated into a new model, but layers of the pre-trained model are frozen during training.

    4. Weight Initialization: The pre-trained model, or some portion of the model, is integrated into a new model, and the layers of the pre-trained model are trained in concert with the new model.


Recurrent Neural Net (RNN)


A basic NN node with a loop: the previous output (hidden state) is combined with the current input, typically through a tanh activation, in order to remember history. For time series, it is used to predict the next X based on the previous Y.
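A scalar version of this loop makes the "memory" visible. The sketch assumes the common vanilla-RNN update `h_t = tanh(w_x * x_t + w_h * h_(t-1) + b)`; the weights are arbitrary illustrative values, not taken from any of the linked sources.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h is merged with the current input x through tanh.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single impulse followed by silence: the state keeps echoing the first input.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

The second and third inputs are zero, yet the hidden state stays positive because it still carries the first input forward. The full list corresponds to an N to N use, and taking only `states[-1]` corresponds to N to 1.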

(What is RNN?) by Andrej Karpathy - The Unreasonable Effectiveness of Recurrent Neural Networks: basically a lot of information about RNNs and their use cases:

  • 1 to N = frame captioning

  • N to 1 = classification

  • N to N = predict frames in a movie

  • N\2 with time delay to N\2 = predict supply and demand

  • Vanishing gradient is 100 times worse.

  • Gated networks like LSTM address the vanishing gradient problem.

(how to initialize?) Benchmarking RNN networks for text - don't worry about initialization, use normalization and GRU for big networks.

Experimental improvements:

Ref - ”Simplified RNN, with pytorch implementation” - changes the underlying mechanism in RNNs in order to parallelize the calculation; seems to work nicely in terms of speed, not sure about state-of-the-art results. There is controversy regarding this work: another author claims he had already published these ideas (QRNN) a year earlier, although his ideas have in turn been reviewed as incremental (PixelRNN). It’s probably best to read all 3 papers in chronological order and use the best solution.

RNNCELLS - recurrent shop, enables you to build complex rnns with keras. Details on their significance are inside the link

Masking for RNNs - the idea is simple: we want to use variable-length inputs, but although RNNs can process them, batched training requires fixed-size inputs, so shorter sequences are padded. A mask of 1’s and 0’s tells the network the real length, i.e. where the information is in the input. Motivation: padded inputs would otherwise contribute to our loss, and we don't want that.
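A small sketch of the motivation, in plain Python rather than any framework's masking layer: the helper names are hypothetical, padding is appended at the end, and the sequence is assumed to be no longer than `max_len`.

```python
def pad(sequence, max_len, pad_value=0.0):
    """Pad a sequence to max_len and return it with a 1/0 mask of real positions."""
    n_pad = max_len - len(sequence)   # assumes len(sequence) <= max_len
    padded = list(sequence) + [pad_value] * n_pad
    mask = [1] * len(sequence) + [0] * n_pad
    return padded, mask

def masked_mse(predictions, targets, mask):
    """Mean squared error over real (mask == 1) positions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(predictions, targets, mask))
    return total / sum(mask)

targets, mask = pad([1.0, 2.0, 3.0], max_len=5)
predictions = [1.0, 2.0, 2.0, 7.0, 7.0]   # the last two values fall on padding

loss = masked_mse(predictions, targets, mask)
unmasked = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
print(loss, unmasked)
```

Without the mask, whatever the model outputs on the two padded positions dominates the loss; with the mask, only the three real time steps count.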

Source 1, source 2,

Visual attention RNNS - Same idea as masking but on a window-based cnn. Paper


  • LSTM - what is? the first reference for LSTM on the web, but you should know the background before reading.

  • Hidden state vs cell state - you have to understand this concept before you dive in, i.e., the hidden state is the overall state of what we have seen so far, while the cell state is a selective memory of the past. The hidden state (h) carries the information about what an RNN cell has seen over time and supplies it to the present time step, so that the loss function depends not just on the data seen at this time instant but also on data seen historically.

  • Paper - a comparison of many LSTMs variants and they are pretty much the same performance wise

  • Paper - comparison of lstm variants, vanilla is mostly the best, forget and output gates are the most important in terms of performance. Other conclusions in the paper..

  • Mastery on under/over fitting lstms - but makes sense for all types of networks

  • Mastery on return_sequences and return_state in Keras LSTM

    • return_sequences returns the hidden state output for each input time step.

    • return_state returns the hidden state output and cell state for the last input time step.

    • return_sequences and return_state can be used at the same time.

  • Mastery on timedistributed layer and seq2seq

    • TimeDistributed layer - used to connect 3D outputs from LSTMs to dense layers, in order to utilize the time element. Otherwise the output gets flattened when the connection is direct, nullifying the LSTM's purpose. Note: a nice trick that doesn't multiply the size of the dense layer by the number of time steps. It loops over each time step! I.e., TimeDistributed achieves this trick by applying the same Dense layer (same weights) to the LSTM's outputs for one time step at a time. In this way, the output layer only needs one connection to each LSTM unit (plus one bias).

For this reason, the number of training epochs needs to be increased to account for the smaller network capacity. I doubled it from 500 to 1000 to match the first one-to-one example

Stateful vs Stateless: crucial for understanding how to leverage LSTM networks:

  1. Philippe remy on stateful vs stateless, intuition mostly with code, but not 100% clear

Machine Learning mastery:

A good tutorial on LSTM: important notes:

1. Scale to -1,1, because the internal activation in the lstm cell is tanh.

2. stateful - True means you need to reset the internal states yourself; False = stateless. Great info & results HERE, with seeding, with training resets (and without) and predicting resets (and without) - note: empirically matching the shampoo input, network config, etc.

Another explanation/tutorial about stateful lstm, should be thorough.

3. What return_sequences and return_state are, and how to use each one and both at the same time.

return_sequences is needed for stacked LSTM layers.

4. Stacked LSTM - each layer represents a higher level of abstraction in TIME!

Keras Input shape - a good explanation about differences between input_shape, dim, and what is. Additionally about layer calculation of inputs and output based on input shape, and sequence model vs API model.

A comparison of LSTM/GRU/MGU with batch normalization and various initializations; GRU/Xavier/batch normalization are the best and recommended for RNNs.

Benchmarking LSTM variants: - it looks like LSTM and GRU are competitive with the mutated architectures (I believe it's only in pytorch). Adding a bias to LSTM works (a bias of 1, as recommended in the paper). Generally speaking there is no conclusive empirical evidence that one type of network is better than the others on all tests, but the mutated networks tend to win over LSTM\GRU variants.

BIAS 1 in keras - unit_forget_bias: Boolean. If True, add 1 to the bias of the forget gate at initialization. Setting it to true will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.

Validation_split arg - the validation_split argument in Keras is a value between 0 and 1. Keras splits your training set proportionally by this value: the first part is used for training and the second part for validation after each epoch.

This is a nice helper add-on by Keras; in most other Keras examples you have seen, the training and test sets were passed into the fit method after you made the split manually. Having a validation set is valuable and is a vital step in understanding how well your model is training. Ideally, you want your training accuracy curve to stay close to your validation curve; the moment your validation curve falls below your training curve, the alarm bells should go off, since your model is probably over-fitting.

Keras is a wonderful framework for deep learning, and there are many different ways of doing things with plenty of helpers.

Return_sequence: unclear.

Sequence.pad_sequences - using maxlen, it will either pad a sequence with zeros if it is shorter than maxlen, or truncate it if it is longer.
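The pad-or-truncate behaviour can be mimicked on a single list. This is a plain-Python sketch, not the Keras function itself: `pad_or_truncate` is a hypothetical helper whose `"pre"` defaults are meant to mirror the Keras defaults of padding and truncating at the front of the sequence.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly maxlen items, padding with value or cutting items."""
    seq = list(seq)
    if len(seq) > maxlen:
        if truncating == "pre":
            seq = seq[len(seq) - maxlen:]   # drop items from the front
        else:
            seq = seq[:maxlen]              # drop items from the end
    missing = maxlen - len(seq)
    if padding == "pre":
        return [value] * missing + seq      # zeros go in front
    return seq + [value] * missing          # zeros go at the end

print(pad_or_truncate([1, 2, 3], 5))                     # shorter: padded
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))            # longer: truncated
print(pad_or_truncate([1, 2, 3], 5, padding="post"))     # pad at the end instead
```

Whether to pad at the front or the back matters for RNNs, since front padding keeps the real data next to the final time step.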

Using batch size for LSTM in Keras

Imbalanced classes? Use class_weight; another explanation here about class_weight and sample_weight.

SKlearn Formula for balanced class weights and why it works, example
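The "balanced" heuristic is `n_samples / (n_classes * count_of_class)`. The sketch below recomputes it in plain Python rather than calling scikit-learn; the function name and the 90/10 toy labels are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * number of samples in class c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 90 + [1] * 10          # 90% majority class, 10% minority class
weights = balanced_class_weights(labels)
print(weights)

# Why it works: every class ends up with the same total weight in the loss.
total_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(total_per_class)
```

Each minority sample counts about nine times as much as a majority sample, so both classes contribute equally (50 each here) to the weighted loss.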

number of units in LSTM

Calculate how many params are in an LSTM layer?
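For a standard LSTM layer (four gate blocks, with bias, no peephole connections, which is what this sketch assumes), the count follows directly from the shapes: each block has an input kernel, a recurrent kernel and a bias. The function name is hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Parameters of a standard LSTM layer.

    Each of the 4 blocks (input gate, forget gate, cell candidate, output gate) has:
      - an input kernel of shape (input_dim, units)
      - a recurrent kernel of shape (units, units)
      - a bias of shape (units,)
    """
    per_block = units * (input_dim + units) + units
    return 4 * per_block

print(lstm_param_count(input_dim=10, units=32))  # 4 * (32 * 42 + 32) = 5504
```

Note that the sequence length does not appear in the formula: the same weights are reused at every time step.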

Understanding TimeDistributed in Keras, but with focus on LSTM one to one, one to many and many to many - here TimeDistributed applies a dense layer to each output of the LSTM, which is created with return_sequences=True for that purpose.

This tutorial clearly shows how to manipulate input construction, lstm output neurons and the target layer for the purpose of those three problems (1:1, 1:m, m:m).


(what is?) Wiki - The basic idea of BRNNs is to connect two hidden layers of opposite directions to the same output. By this structure, the output layer can get information from past and future states.

BRNN are especially useful when the context of the input is needed. For example, in handwriting recognition, the performance can be enhanced by knowledge of the letters located before and after the current letter.

Another explanation- It involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.

.. It allows you to specify the merge mode, that is how the forward and backward outputs should be combined before being passed on to the next layer. The options are:

  • ‘sum‘: The outputs are added together.

  • ‘mul‘: The outputs are multiplied together.

  • ‘concat‘: The outputs are concatenated together (the default), providing double the number of outputs to the next layer.

  • ‘ave‘: The average of the outputs is taken.

The default mode is to concatenate, and this is the method often used in studies of bidirectional LSTMs.
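The four merge modes amount to simple element-wise operations on the forward and backward outputs. The sketch below shows them on plain lists; `merge` is a hypothetical helper for illustration, not the Keras wrapper.

```python
def merge(forward, backward, mode="concat"):
    """Combine the outputs of the forward and backward layers."""
    if mode == "sum":
        return [f + b for f, b in zip(forward, backward)]
    if mode == "mul":
        return [f * b for f, b in zip(forward, backward)]
    if mode == "ave":
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    if mode == "concat":
        return list(forward) + list(backward)   # doubles the number of outputs
    raise ValueError(f"unknown merge mode: {mode}")

forward_out = [1.0, 2.0]    # output of the layer that read the sequence as-is
backward_out = [3.0, 4.0]   # output of the layer that read the reversed copy

for mode in ("sum", "mul", "concat", "ave"):
    print(mode, merge(forward_out, backward_out, mode))
```

Only `concat` changes the output width, which is why the next layer sees twice as many features in the default mode.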

Another simplified example


A great Slide about back prop, on a simple 3 neuron network, with very easy to understand calculations.



A tutorial about GRU - To solve the vanishing gradient problem of a standard RNN, GRU uses, so called, update gate and reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago, without washing it through time or remove information which is irrelevant to the prediction.

  1. Update gate: helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future.

  2. Reset gate: essentially, this gate is used by the model to decide how much of the past information to forget.
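A single-unit GRU step shows how the two gates interact. This is a sketch under assumptions: scalar weights stored in a made-up parameter dict, and the convention `h_t = z * h_(t-1) + (1 - z) * h_candidate`, in which the update gate `z` is the share of the old state that is kept (some texts swap the roles of `z` and `1 - z`).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU time step for a single unit with scalar weights p."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])   # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])   # reset gate
    # The reset gate scales how much of the past enters the candidate state.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the old state with the candidate.
    return z * h_prev + (1.0 - z) * h_cand

# Hypothetical parameters, all weights 1.0 and biases 0.0
base = {"w_z": 1.0, "u_z": 1.0, "b_z": 0.0,
        "w_r": 1.0, "u_r": 1.0, "b_r": 0.0,
        "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}

keep_past = dict(base, b_z=20.0)     # update gate saturated near 1: keep the old state
overwrite = dict(base, b_z=-20.0)    # update gate near 0: replace with the candidate

print(gru_step(1.0, 0.7, keep_past))   # stays close to 0.7
print(gru_step(1.0, 0.7, overwrite))   # close to the candidate state
```

With the update gate saturated near 1, the unit carries its state forward almost unchanged, which is how information from long ago can survive many time steps.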


What is it? A type of cell that converges to higher accuracy faster than LSTM.

it implements attention into the recurrent neural network:

1. the keras implementation is available at

2. the whitepaper is at


Potential competitor to the transformer


  1. (amazing) Why I am lukewarm about GNNs - really good insight into what they do (compressing data, vs adjacency graphs, vs graphs, high dim relations, etc.)

  2. Octavian on Medium on graphs: a really good intro to graph networks, too long to summarize; clever, mcgraph, regression, classification, embedding on graphs.

  3. Recommender systems using GNN, w2v, pytorch w2v, networkx, sparse matrices, matrix factorization, dictionary optimization, part 1 here (how to find product relations, important: creating negative samples)

  4. Michael Bronstein’s Central page for Graph deep learning articles on Medium (worth reading)

  5. GAT (graph attention networks), paper, examples - The graph attentional layer utilised throughout these networks is computationally efficient (does not require costly matrix operations, and is parallelizable across all nodes in the graph), allows for (implicitly) assigning different importances to different nodes within a neighborhood while dealing with different sized neighborhoods, and does not depend on knowing the entire graph structure upfront, thus addressing many of the theoretical issues with previous spectral-based approaches.

  6. Struc2vec, youtube: Learning Node Representations from Structural Identity- The struc2vec algorithm learns continuous representations for nodes in any graph. struc2vec captures structural equivalence between nodes.

GNN courses

  1. Graph deep learning course - graphs, sets, groups, GNNs. youtube

Deep walk



SDNE - structural deep network embedding



Git, paper: “Is a Single Embedding Enough? Learning Node Representations that Capture Multiple Social Contexts”

Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph, a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.

16. Self clustering graph embeddings

17. Walklets, similar to deep walk with node skips. - lots of improvements, works in scale due to lower size representations, improves results, etc.



  1. Fourier Transform - decomposing frequencies

  2. WAVELETS On youtube (4 videos):

    1. used for denoising, compression, detect edges, detect features with various orientation, analyse signal power, detect and localize transients, change points in time series data and detect optimal signal representation (peaks etc) of time freq analysis of images and data.

    2. Can also be used to reconstruct time and frequencies, analyse images in space, frequencies, orientation, identifying coherent time oscillation in time series

    3. Analyse signal variability and correlation
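For the Fourier transform item in the list above, "decomposing frequencies" can be shown with a naive discrete Fourier transform. This is an O(n²) teaching sketch, not an FFT; the test signal (two sine waves with 3 and 7 cycles per window) is invented for the example.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

N = 32
# Sum of two sine waves: 3 cycles (amplitude 1.0) and 7 cycles (amplitude 0.5)
signal = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N)
    for t in range(N)
]

magnitudes = [abs(c) for c in dft(signal)]
# Only the first half of the bins is unique for a real-valued signal.
dominant = sorted(range(N // 2), key=lambda k: magnitudes[k], reverse=True)[:2]
print(dominant)
```

The two strongest bins recover the two frequencies that were mixed into the signal, and their magnitudes are proportional to the amplitudes. Wavelets extend this idea by also localizing where in time each frequency occurs.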


NN-Sequence Analysis

(did not read) A causal framework for explaining the predictions of black-box sequence-to-sequence models - can this be applied to other time series prediction?


