1. Lilian Weng on attention: self-attention, soft vs. hard, global vs. local, neural Turing machines, pointer networks, transformers, SNAIL, self-attention GAN.

  2. Transformer NN by Google - faster, better, more accurate

  1. Medium on comparing CNN / RNN / HAN - results will change on other data; my impression is that the data in this article is too good

  2. Mastery on rnn vs attention vs global attention - a really unclear intro

  3. Mastery on attention - this makes the whole process clear: scoring each encoder hidden state against the decoder state, normalizing the scores with softmax (annotation weights), summing the encoder states weighted by those weights (i.e., the context vector), and then decoding from the context vector.

    1. Soft (above) and hard crisp attention

    2. Dropping the hidden output - HAN or AB BiLSTM

    3. Attention concat to input vec

    4. Global vs local attention

  4. Mastery on attention with LSTM encoding / decoding - a theoretical discussion of many attention architectures. This helps make sense of everything above.

    1. Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.

    2. Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

    3. A problem with the architecture is that performance is poor on long input or output sequences. The reason is believed to be the fixed-sized internal representation used by the encoder.

      1. Enc-decoder

      2. Recursive
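The attention steps in the Mastery notes above (score, softmax, weighted sum) can be sketched as follows. This is a minimal pure-Python illustration, assuming dot-product scoring (the linked articles also cover other scoring functions); `soft_attention` and the toy vectors are hypothetical names for illustration, not taken from any of the articles.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(decoder_state, encoder_states):
    """Return (annotation weights, context vector) for one decoder step.

    Assumption: dot-product scoring between the decoder state and each
    encoder hidden state.
    """
    # 1. Score every encoder hidden state against the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    # 2. Normalize the scores into annotation weights.
    weights = softmax(scores)
    # 3. Context vector = weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy example: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = soft_attention(decoder_state, encoder_states)
```

Because the context vector is recomputed at every decoder step, the decoder is no longer limited to a single fixed-length summary of the input, which is the weakness of the plain encoder-decoder noted above.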

note: word level then sentence level embeddings.



  1. Do attention heads in BERT/RoBERTa track syntactic dependencies? - tl;dr: The attention weights between tokens in BERT/RoBERTa bear similarity to some syntactic dependency relations, but the results are less conclusive than we’d like as they don’t significantly outperform linguistically uninformed baselines for all types of dependency relations. In the case of MAX, our results indicate that specific heads in the BERT models may correspond to certain dependency relations, whereas for MST, we find much less support for “generalist” heads whose attention weights correspond to a full syntactic dependency structure.

In both cases, the metrics do not appear to be representative of the extent of linguistic knowledge learned by the BERT models, based on their strong performance on many NLP tasks. Hence, our takeaway is that while we can tease out some structure from the attention weights of BERT models using the above methods, studying the attention weights alone is unlikely to give us the full picture of BERT’s strength in processing natural language.


  2. Lilian Weng on the transformer family (seems like it is constantly updated)

  3. Large memory layers with product keys - This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer in a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice faster at inference time.

  4. Adaptive sparse transformers - This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter -- which controls the shape and sparsity of α-entmax -- allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets.
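To see how a softmax replacement can assign exactly zero weight, here is a minimal sketch of sparsemax, which is the α = 2 special case of α-entmax (α = 1 recovers softmax). It only illustrates the fixed α = 2 case in pure Python; it does not implement the general α-entmax or the learned α from the paper, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Dense baseline: every score gets a strictly positive weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the simplex.

    Equivalent to alpha-entmax with alpha = 2.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    # Find the largest k with 1 + k * z_(k) > sum of the top-k scores;
    # the threshold tau is computed from that support size.
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        if 1 + k * z > cumsum:
            tau = (cumsum - 1) / k
    # Scores at or below the threshold get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


scores = [1.0, 0.8, -1.0]
dense = softmax(scores)     # all three weights are positive
sparse = sparsemax(scores)  # the low-scoring entry is exactly 0.0
```

On this toy input the lowest score keeps a small positive weight under softmax but is dropped entirely by sparsemax, which is the "focused" behavior the abstract describes.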


  1. Vidhya on elmo - everything you want to know with code

  2. Mean elmo, batches, with code and linear regression i


  1. Vidhya on ULMFiT + colab - “The one cycle policy provides some form of regularisation”; if you wish to know more about the one cycle policy, refer to this excellent paper by Leslie Smith: “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay”.


  1. (amazing) Deconstructing bert

    1. I found some fairly distinctive and surprisingly intuitive attention patterns. Below I identify six key patterns and for each one I show visualizations for a particular layer / head that exhibited the pattern.

    2. Part 1 - attention to next/previous/identical/related words (in the same and other sentences), to other words predictive of a word, and to delimiter tokens

    3. (good) Deconstructing bert part 2 - looking at the visualization and attention heads, focusing on Delimiter attention, bag of words attention, next word attention - patterns.

  2. Bert demystified (read this first!)

  3. Read this after: the most coherent explanation of BERT, covering 15% masked word prediction and next sentence prediction. Also RoBERTa, XLM BERT, ALBERT, DistilBERT.

  4. A thorough tutorial on bert, fine tuning using hugging face transformers package. Code

YouTube: ep1, 2, 3, 3b

  1. How to train bert from scratch using TF, with [CLS] [SEP] etc

  2. BERT tutorial on fine-tuning, with some discussion of training from scratch; it probably does not discuss using embeddings as input

  3. Bert on logs, feature names as labels, finetune bert, predict.

  4. sparse bert, paper - When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

  5. Bert with keras, blog post, colaboratory

  6. Finetuning - Better sentiment analysis with BERT, claims 94% on IMDB. Official code here: “it creates a single new layer that will be trained to adapt BERT to our sentiment task (i.e. classifying whether a movie review is positive or negative). This strategy of using a mostly trained model is called fine-tuning.”

  7. Explain bert - bert visualization tool.

  8. sentenceBERT paper

  9. Tabert - TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data.

Pruning - Removes unnecessary parts of the network after training. This includes weight magnitude pruning, attention head pruning, layers, and others. Some methods also impose regularization during training to increase prunability (layer dropout).
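As a rough illustration of the weight magnitude pruning mentioned above, the sketch below zeroes out the smallest-magnitude entries of a toy weight matrix. It is a simplified, unstructured variant written in pure Python; `magnitude_prune` and its `sparsity` argument are hypothetical names, and real BERT pruning methods differ in how they pick thresholds and schedules.

```python
def magnitude_prune(weights, sparsity):
    """Zero out (at least) the fraction `sparsity` of smallest-|w| weights.

    `weights` is a list of rows; ties at the threshold are also pruned.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]  # nothing to prune, return a copy
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


weights = [[0.9, -0.05, 0.3],
           [0.01, -0.7, 0.2]]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Here half of the six weights (those with the smallest absolute value) are set to zero while the large ones are left untouched.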

Weight Factorization - Approximates parameter matrices by factorizing them into a multiplication of two smaller matrices. This imposes a low-rank constraint on the matrix. Weight factorization can be applied to both token embeddings (which saves a lot of memory on disk) or parameters in feed-forward / self-attention layers (for some speed improvements).
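A quick parameter count shows why factorizing the token embedding matrix saves memory. The sizes below (30,000-token vocabulary, hidden size 768, bottleneck 128) are illustrative assumptions, not figures from the text above.

```python
def embedding_params(vocab_size, hidden_size, bottleneck=None):
    """Count embedding parameters, optionally factorized through a bottleneck.

    Unfactorized: one (vocab_size x hidden_size) matrix.
    Factorized:   (vocab_size x bottleneck) times (bottleneck x hidden_size).
    """
    if bottleneck is None:
        return vocab_size * hidden_size
    return vocab_size * bottleneck + bottleneck * hidden_size


full = embedding_params(30_000, 768)           # single large matrix
factored = embedding_params(30_000, 768, 128)  # two smaller matrices
```

With these assumed sizes the factorized version needs 3,938,304 parameters instead of 23,040,000, at the cost of constraining the embedding matrix to rank at most 128.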

Knowledge Distillation - Aka “Student Teacher.” Trains a much smaller Transformer from scratch on the pre-training / downstream-data. Normally this would fail, but utilizing soft labels from a fully-sized model improves optimization for unknown reasons. Some methods also distill BERT into different architectures (LSTMs, etc.) which have faster inference times. Others dig deeper into the teacher, looking not just at the output but at weight matrices and hidden activations.
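The "soft labels" idea can be sketched as a cross-entropy between the teacher's output distribution and the student's. The temperature parameter below is an assumption borrowed from common distillation practice rather than something stated in the text above, and `distillation_loss` is a hypothetical helper, not any library's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values give softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs))


teacher_logits = [3.0, 1.0, 0.2]
loss_match = distillation_loss([3.0, 1.0, 0.2], teacher_logits)
loss_mismatch = distillation_loss([0.2, 1.0, 3.0], teacher_logits)
```

The loss is lowest when the student reproduces the teacher's full distribution, so the student also learns how the teacher ranks the wrong classes, not just which class is correct.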

Weight Sharing - Some weights in the model share the same value as other parameters in the model. For example, ALBERT uses the same weight matrices for every single layer of self-attention in BERT.

Quantization - Truncates floating point numbers to only use a few bits (which causes round-off error). The quantization values can also be learned either during or after training.
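A minimal sketch of post-training quantization, assuming a simple uniform scheme that maps each float to one of 2^bits evenly spaced levels. It rounds to the nearest level rather than truncating, and the function names are illustrative; learned quantization values as mentioned above are not covered.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1].

    Returns the codes plus the scale and offset needed to decode them.
    """
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


values = [-1.0, -0.2, 0.0, 0.37, 1.0]
codes, scale, lo = quantize(values, num_bits=8)
restored = dequantize(codes, scale, lo)
# Round-off error introduced by storing only 8 bits per value.
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

The round-off error per value is bounded by half the step size (`scale / 2`), so fewer bits mean a coarser grid and a larger error.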

Pre-train vs. Downstream - Some methods only compress BERT w.r.t. certain downstream tasks. Others compress BERT in a way that is task-agnostic.

  1. What does BERT look at, analysis of attention - We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT’s attention.

  2. BertViz - BertViz is a tool for visualizing attention in the Transformer model, supporting all models from the transformers library (BERT, GPT-2, XLNet, RoBERTa, XLM, CTRL, etc.). It extends the Tensor2Tensor visualization tool by Llion Jones and the transformers library from HuggingFace.

  3. PMI-masking paper, post - Joint masking of correlated tokens significantly speeds up and improves BERT's pretraining

  4. (really good) Examining BERT raw embeddings - TL;DR BERT’s raw word embeddings capture useful and separable information (distinct histogram tails) about a word in terms of other words in BERT’s vocabulary. This information can be harvested from both raw embeddings and their transformed versions after they pass through BERT with a Masked language model (MLM) head.


  1. the GPT-2 small algorithm was trained on the task of language modeling — which tests a program’s ability to predict the next word in a given sentence — by ingesting huge numbers of articles, blogs, and websites. By using just this data it achieved state-of-the-art scores on a number of unseen language tests, an achievement known as zero-shot learning. It can also perform other writing-related tasks, such as translating text from one language to another, summarizing long articles, and answering trivia questions.

  2. Medium code for GPT-2 - big algo


  1. GPT3 on medium - language models can be used to produce good results on zero-shot, one-shot, or few-shot learning.


  1. XLNet is transformer and BERT combined - actually, it's quite good at explaining it

  2. CLIP

  3. (keras) Implementation of a dual encoder model for retrieving images that match natural language queries. - The example demonstrates how to build a dual encoder (also known as two-tower) neural network model to search for images using natural language. The model is inspired by the CLIP approach, introduced by Alec Radford et al. The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe.

    1. Adversarial methodologies

  4. What are label flipping and label smoothing, and how are they used to make a model more robust against adversarial methodologies?

Label flipping is a training technique where one selectively manipulates the labels in order to make the model more robust against label noise and associated attacks - the specifics depend a lot on the nature of the noise. Label flipping bears no benefit only under the assumption that all labels are (and will always be) correct and that no adversaries exist. In cases where noise tolerance is desirable, training with label flipping is beneficial.

Label smoothing is a regularization technique (and then some) aimed at improving model performance. Its effect takes place irrespective of label correctness.
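The two techniques can be contrasted with a small sketch. It assumes the common uniform label-smoothing formula (mix the one-hot target with a uniform distribution using weight epsilon) and random label flipping as training-time noise; `smooth_labels` and `flip_labels` are hypothetical helper names, and the flip strategy in practice depends on the kind of noise or attack being modeled.

```python
import random


def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def flip_labels(labels, num_classes, flip_rate=0.1, seed=0):
    """Randomly replace a fraction of integer labels with a different class."""
    rng = random.Random(seed)
    flipped = []
    for y in labels:
        if rng.random() < flip_rate:
            # Pick any class other than the current one.
            y = rng.choice([c for c in range(num_classes) if c != y])
        flipped.append(y)
    return flipped


soft_target = smooth_labels([0.0, 0.0, 1.0, 0.0], epsilon=0.1)
labels = [0, 1, 2, 1, 0, 2]
unchanged = flip_labels(labels, num_classes=3, flip_rate=0.0)
all_flipped = flip_labels(labels, num_classes=3, flip_rate=1.0)
```

Smoothing changes every target by a small, fixed amount regardless of whether the label is correct, while flipping changes only some labels but changes them completely.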

  1. Paper: When does label smoothing help? - Smoothing the labels in this way prevents the network from becoming overconfident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition...Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective.

  1. Label sanitization against label flipping poisoning attacks - In this paper we propose an efficient algorithm to perform optimal label flipping poisoning attacks and a mechanism to detect and relabel suspicious data points, mitigating the effect of such poisoning attacks.

  2. Adversarial label flips attacks on svm - To develop a robust classification algorithm in the adversarial setting, it is important to understand the adversary’s strategy. We address the problem of label flips attack where an adversary contaminates the training set through flipping labels. By analyzing the objective of the adversary, we formulate an optimization framework for finding the label flips that maximize the classification error. An algorithm for attacking support vector machines is derived. Experiments demonstrate that the accuracy of classifiers is significantly degraded under the attack.

  3. GAN

  4. Great advice for training GANs, such as label flipping, batch norm, etc. Read!
