Attention

Illustrated attention - AMAZING
Illustrated self attention - great
Jay Alammar on attention, the first one is better.
Attention is all you need (paper)
The annotated transformer - reviewing the paper
Lilian Weng on attention: self, soft vs hard, global vs local, neural Turing machines, pointer networks, transformers, SNAIL, self-attention GAN.
Understanding attention in RNNs
Another good intro with gifs to attention
Clear insight to what attention is, a must read!
Transformer NN by Google - faster, better, more accurate
Intuitive explanation to attention
Attention by vidhya
Augmented RNNs - including Turing / attention / adaptive computation time etc. A general overview, not as clear as the one below.

A really good REVIEW on attention and its many forms, historical changes, etc.
Medium on comparing CNN / RNN / HAN - results will change on other data; my impression is that the data is too good in this article.
Mastery on RNN vs attention vs global attention - a really unclear intro
Mastery on attention - this makes the whole process clear: scoring each encoder output against the decoder state, normalizing the scores using softmax (annotation weights), multiplying each encoder output by its weight and summing over all of them (i.e., the context vector), and then decoding the context vector.
1. Soft (above) and hard crisp attention
2. Dropping the hidden output - HAN or AB BiLSTM
3. Attention concat to input vec
4. Global vs local attention
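The score / softmax / weighted-sum steps described above can be sketched in a few lines. This is only a minimal illustration: dot-product scoring is an assumption (the tutorials cover other scoring functions), and the function names and toy vectors are made up for the example.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(encoder_states, decoder_state):
    """Soft attention for one decoder step (dot-product scoring is assumed)."""
    # 1. Score every encoder state against the current decoder state.
    scores = [sum(h_j * s_j for h_j, s_j in zip(h, decoder_state))
              for h in encoder_states]
    # 2. Normalize the scores into annotation weights that sum to 1.
    weights = softmax(scores)
    # 3. The context vector is the weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return context, weights

# Toy example: three encoder time steps with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 1.0]
context, weights = attention_context(encoder_states, decoder_state)
```

The third encoder state has the highest score here, so it gets the largest weight and dominates the context vector that the decoder then reads.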
Mastery on attention with LSTM encoding / decoding - a theoretical discussion about many attention architectures. This adds context that makes sense of everything above.
1. Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
2. Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.
3. A problem with the architecture is that performance is poor on long input or output sequences. The reason is believed to be the fixed-sized internal representation used by the encoder.
  1. Enc-decoder
  2. Recursive
  3. Enc-dec with recursive
Code on GIT:
1. HAN - GIT, paper
2. Non penalized self attention
3. LSTM, BiLSTM attention, paper
4. Tushv89, Keras layer attention implementation
5. Richliao, hierarchical Attention code for document classification using keras, blog, group chatter

note: word level then sentence level embeddings.


BERT/ROBERTA

Do attention heads in BERT / RoBERTa track syntactic dependencies? - tl;dr: The attention weights between tokens in BERT/RoBERTa bear similarity to some syntactic dependency relations, but the results are less conclusive than we’d like as they don’t significantly outperform linguistically uninformed baselines for all types of dependency relations. In the case of MAX, our results indicate that specific heads in the BERT models may correspond to certain dependency relations, whereas for MST, we find much less support for “generalist” heads whose attention weights correspond to a full syntactic dependency structure.

In both cases, the metrics do not appear to be representative of the extent of linguistic knowledge learned by the BERT models, based on their strong performance on many NLP tasks. Hence, our takeaway is that while we can tease out some structure from the attention weights of BERT models using the above methods, studying the attention weights alone is unlikely to give us the full picture of BERT’s strength processing natural language.

TRANSFORMERS
Jay Alammar on transformers (amazing)
J.A on BERT and ELMo (amazing)
Jay Alammar, a visual guide to using BERT for the first time
J.A on GPT-2
Super fast transformers
A survey of long-term context in transformers.
Lilian Weng on the transformer family (seems like it is constantly updated)
Hugging Face, encoders and decoders in transformers for seq2seq
The annotated transformer
Large memory layers with product keys - This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer in a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice faster at inference time.
Adaptive sparse transformers - This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter -- which controls the shape and sparsity of α-entmax -- allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets.
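To see how a softmax replacement can assign exactly zero weight, here is a minimal sketch of sparsemax, which is the α = 2 special case of α-entmax (α = 1 recovers softmax). This is not the paper's adaptive method: learning α per head is not shown, and the function name and toy scores are assumptions for illustration.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex (alpha-entmax with alpha = 2)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, z_j in enumerate(z, start=1):
        cumsum += z_j
        # The support is the largest prefix of the sorted scores
        # for which this condition still holds.
        if 1 + j * z_j > cumsum:
            tau = (cumsum - 1) / j
    # Everything at or below the threshold tau gets exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]

weights = sparsemax([1.0, 0.8, -1.0])
```

With these toy scores the two high-scoring items share the probability mass (about 0.6 and 0.4) and the low-scoring one receives exactly 0, whereas softmax would still give it a small positive weight.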

ELMO

ULMFIT

Tutorial and code by vidhya, medium
Paper
Ruder on transfer learning
Medium on how - unclear
Fast NLP on how
Paper: ulmfit
Fast.ai on ulmfit, this too
Vidhya on ulmfit using fastai
Medium on ulmfit
Building blocks of ulm fit
Applying ULMFiT to entity-level sentiment analysis using business news articles
Understanding language modelling using ULMFiT, fine tuning etc
Vidhya on ULMFiT + colab - “The one cycle policy provides some form of regularisation”; if you wish to know more about the one cycle policy, refer to this excellent paper by Leslie Smith: “A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay”.

BERT

The BERT PAPER
1. Prerequisite about transformers and attention - this is not enough
2. Embeddings using bert in python - using bert as a service to encode 1024 vectors and do cosine similarity
3. Identifying the right meaning with bert - the idea is to classify the word duck into one of three meanings using bert embeddings, which promise contextualized embeddings. I.e., to duck, the Duck, etc
4. Google neural machine translation (attention) - too long
What is bert
(amazing) Deconstructing bert
1. I found some fairly distinctive and surprisingly intuitive attention patterns. Below I identify six key patterns and for each one I show visualizations for a particular layer / head that exhibited the pattern.
2. Part 1 - attention to the next / previous / identical / related words (same and other sentences), other words predictive of a word, delimiter tokens
3. (good) Deconstructing bert part 2 - looking at the visualization and attention heads, focusing on Delimiter attention, bag of words attention, next word attention - patterns.
Bert demystified (read this first!)
Read this after: the most coherent explanation of BERT, 15% masked word prediction and next sentence prediction. RoBERTa, XLM BERT, ALBERT, DistilBERT.
A thorough tutorial on bert, fine tuning using hugging face transformers package. Code

YouTube ep1, 2, 3, 3b

How to train bert from scratch using TF, with [CLS] [SEP] etc
Extending a vocabulary for bert, another kind of transfer learning.
Bert tutorial on fine tuning, with some talk on training from scratch; it probably does not discuss using embeddings as input
Bert for summarization thread
Bert on logs, feature names as labels, finetune bert, predict.
Bert scikit wrapper for pipelines
What is bert not good at, also refer to the cited paper (is/is not)
Jay Alammar on BERT
Jay Alammar on using DistilBERT
Sparse BERT, paper - When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
Bert with keras, blog post, colaboratory
Bert with TF-Hub
Bert on medium with code
Bert on git
Finetuning - Better sentiment analysis with BERT, claims 94% on IMDB. Official code here. “it creates a single new layer that will be trained to adapt BERT to our sentiment task (i.e. classifying whether a movie review is positive or negative). This strategy of using a mostly trained model is called fine-tuning.”
Explain bert - bert visualization tool.
sentenceBERT paper
Bert question answering on covid19
Codebert
Bert multilabel classification
Tabert - TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data.
All the ways that you can compress BERT

Pruning - Removes unnecessary parts of the network after training. This includes weight magnitude pruning, attention head pruning, layers, and others. Some methods also impose regularization during training to increase prunability (layer dropout).
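As a rough illustration of weight magnitude pruning only (head and layer pruning are not shown), the sketch below zeroes out the smallest-magnitude fraction of a flat weight list. The function name, the `sparsity` parameter and the toy weights are assumptions for the example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.5, -0.05, 1.2, 0.01, -0.9, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Half of the toy weights (the three closest to zero) are removed and the large-magnitude ones are left untouched.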

Weight Factorization - Approximates parameter matrices by factorizing them into a multiplication of two smaller matrices. This imposes a low-rank constraint on the matrix. Weight factorization can be applied to both token embeddings (which saves a lot of memory on disk) or parameters in feed-forward / self-attention layers (for some speed improvements).
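A back-of-the-envelope sketch of why factorizing a matrix into two smaller ones saves parameters. The sizes (a 30,000 x 768 token embedding and rank 128) are illustrative assumptions, not numbers from the article.

```python
def dense_params(rows, cols):
    # Parameter count of a full rows x cols matrix.
    return rows * cols

def factorized_params(rows, cols, rank):
    # Parameter count of (rows x rank) times (rank x cols).
    return rows * rank + rank * cols

vocab_size, hidden_size, rank = 30_000, 768, 128  # assumed sizes
full = dense_params(vocab_size, hidden_size)
low_rank = factorized_params(vocab_size, hidden_size, rank)
ratio = low_rank / full
```

Under these assumed sizes the factorized embedding needs roughly 17% of the parameters of the full matrix, at the cost of restricting it to rank 128.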

Knowledge Distillation - Aka “Student Teacher.” Trains a much smaller Transformer from scratch on the pre-training / downstream-data. Normally this would fail, but utilizing soft labels from a fully-sized model improves optimization for unknown reasons. Some methods also distill BERT into different architectures (LSTMs, etc.) which have faster inference times. Others dig deeper into the teacher, looking not just at the output but at weight matrices and hidden activations.
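A minimal sketch of the "soft labels" idea: the student is trained to match the teacher's temperature-softened output distribution. The cross-entropy form, the temperature value and the toy logits are assumptions; real recipes usually combine this with the ordinary hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)  # soft labels
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [3.0, 1.0, 0.2]
close_student = [2.9, 1.1, 0.1]   # roughly agrees with the teacher
far_student = [0.2, 1.0, 3.0]     # ranks the classes the other way round
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

The loss is lower for the student whose output distribution is closer to the teacher's, which is the signal the student is optimized on.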

Weight Sharing - Some weights in the model share the same value as other parameters in the model. For example, ALBERT uses the same weight matrices for every single layer of self-attention in BERT.

Quantization - Truncates floating point numbers to only use a few bits (which causes round-off error). The quantization values can also be learned either during or after training.
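A small sketch of uniform quantization to make the round-off error concrete. It maps floats onto 2^bits evenly spaced levels between the min and max value and rounds to the nearest level (an assumption; other schemes truncate or learn the levels). All names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    # Recover approximate floats from the integer codes.
    return [c * scale + lo for c in codes]

values = [-1.0, -0.25, 0.0, 0.3, 1.0]
codes, scale, lo = quantize(values, num_bits=8)
restored = dequantize(codes, scale, lo)
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the round-off error mentioned above.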

Pre-train vs. Downstream - Some methods only compress BERT w.r.t. certain downstream tasks. Others compress BERT in a way that is task-agnostic.

Bert and nlp in 2019
HeBert - BERT for Hebrew sentiment and emotions
KDnuggets on visualizing BERT
What does BERT look at, analysis of attention - We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT’s attention.
BertViz - BertViz is a tool for visualizing attention in the Transformer model, supporting all models from the transformers library (BERT, GPT-2, XLNet, RoBERTa, XLM, CTRL, etc.). It extends the Tensor2Tensor visualization tool by Llion Jones and the transformers library from HuggingFace.
PMI-masking paper, post - Joint masking of correlated tokens significantly speeds up and improves BERT's pretraining
(really good) Examining BERT raw embeddings - TL;DR BERT’s raw word embeddings capture useful and separable information (distinct histogram tails) about a word in terms of other words in BERT’s vocabulary. This information can be harvested from both raw embeddings and their transformed versions after they pass through BERT with a Masked language model (MLM) head.

GPT2

the GPT-2 small algorithm was trained on the task of language modeling — which tests a program’s ability to predict the next word in a given sentence — by ingesting huge numbers of articles, blogs, and websites. By using just this data it achieved state-of-the-art scores on a number of unseen language tests, an achievement known as zero-shot learning. It can also perform other writing-related tasks, such as translating text from one language to another, summarizing long articles, and answering trivia questions.
Medium code for GPT-2 - big algo

GPT3

GPT3 on medium - language models can be used to produce good results on zero-shot, one-shot, or few-shot learning.
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

XLNET

XLNet is transformer and BERT combined - actually it's quite good at explaining it
git
CLIP
(keras) Implementation of a dual encoder model for retrieving images that match natural language queries. - The example demonstrates how to build a dual encoder (also known as two-tower) neural network model to search for images using natural language. The model is inspired by the CLIP approach, introduced by Alec Radford et al. The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe.
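Once both encoders project into the same embedding space, retrieval reduces to a nearest-neighbour search by similarity. The sketch below shows only that last step with cosine similarity; the embeddings are hand-made toy vectors and the file names are hypothetical, since training the two encoders is what the Keras example covers.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, image_embeddings, top_k=1):
    """Return the names of the images closest to the query embedding."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy image embeddings standing in for the vision encoder's output.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}
# Toy caption embedding standing in for the text encoder's output.
caption_embedding = [1.0, 0.2, 0.0]
best = retrieve(caption_embedding, image_embeddings, top_k=2)
```

The caption vector points in nearly the same direction as the first image's vector, so that image is ranked first.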
1. Adversarial methodologies
What is label flipping and smoothing, and how are they used to make a model more robust against adversarial methodologies?

Label flipping is a training technique where one selectively manipulates the labels in order to make the model more robust against label noise and associated attacks - the specifics depend a lot on the nature of the noise. Label flipping bears no benefit only under the assumption that all labels are (and will always be) correct and that no adversaries exist. In cases where noise tolerance is desirable, training with label flipping is beneficial.
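A minimal sketch of label flipping for binary labels. Flipping a uniformly random fraction is an assumption made for simplicity; as noted above, which labels to flip depends on the kind of noise you want robustness against. The function and parameter names are hypothetical.

```python
import random

def flip_labels(labels, flip_rate, seed=0):
    """Flip a random `flip_rate` fraction of binary (0/1) labels."""
    rng = random.Random(seed)  # fixed seed so the flips are reproducible
    n_flip = int(len(labels) * flip_rate)
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

labels = [0, 1] * 10
noisy_labels = flip_labels(labels, flip_rate=0.2)
n_changed = sum(a != b for a, b in zip(labels, noisy_labels))
```

Training on `noisy_labels` instead of `labels` exposes the model to a controlled amount of label noise.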

Label smoothing is a regularization technique (and then some) aimed at improving model performance. Its effect takes place irrespective of label correctness.

Paper: When does label smoothing help? - Smoothing the labels in this way prevents the network from becoming overconfident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition...Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective.
Label smoothing, python code, multi class examples
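For reference, the usual multi-class formulation mixes the one-hot target with a uniform distribution over the K classes: y_smooth = (1 - ε) * y + ε / K. A minimal sketch, with ε = 0.1 as an assumed value and a hypothetical function name:

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the classes."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
```

With four classes the true class gets a target of about 0.925 and each other class about 0.025, so the target still sums to 1 but never asks the network for a fully confident prediction.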

Label sanitization against label flipping poisoning attacks - In this paper we propose an efficient algorithm to perform optimal label flipping poisoning attacks and a mechanism to detect and relabel suspicious data points, mitigating the effect of such poisoning attacks.
Adversarial label flips attacks on svm - To develop a robust classification algorithm in the adversarial setting, it is important to understand the adversary’s strategy. We address the problem of label flips attack where an adversary contaminates the training set through flipping labels. By analyzing the objective of the adversary, we formulate an optimization framework for finding the label flips that maximize the classification error. An algorithm for attacking support vector machines is derived. Experiments demonstrate that the accuracy of classifiers is significantly degraded under the attack.
GAN
Great advice for training GANs, such as label flipping, batch norm, etc. Read!
Intro to GANs
A fantastic series about GANs; the following two, what are GANs and applications, are part of it
1. What are GANs?, and cool applications
2. Comprehensive overview
3. CycleGAN - transferring styles
4. Super-resolution GAN - super-res images
5. Why GANs are so hard to train - good for critique
6. And how to improve GAN performance
7. DCGAN - good as a starting point in new projects
8. Labels to improve GANs, CGAN, InfoGAN
9. Stacked - labels, gan adversarial loss, entropy loss, conditional loss - divide and conquer
10. Progressive gans - mini batch discrimination
11. Using attention to improve gan
12. Least square gan - lsgan
13. Unread:
“GAN” using xgboost and gmm for density sampling
Reverse engineering
