Embedding

Intro

  1. 1.
    ​Faiss - a library for efficient similarity search
  2. 2.
    ​Benchmarking - complete with almost everything imaginable
  3. 3.
    ​Singlestore​
  4. 4.
    Elasticsearch - dense vector
  5. 5.
    Google Cloud Vertex AI Matching Engine - nearest-neighbour (NN) search
    1. 1.
      Similarity search use cases
      1. 1.
        Recommendation engines
      2. 2.
        Search engines
      3. 3.
        Ad targeting systems
      4. 4.
        Image classification or image search
      5. 5.
        Text classification
      6. 6.
        Question answering
      7. 7.
        Chat bots
    2. 2.
      Features
      1. 1.
        Low latency
      2. 2.
        High recall
      3. 3.
        Managed
      4. 4.
        Filtering
      5. 5.
        Scale
  6. 6.
    Pinecone - managed vector similarity search - Pinecone is a fully managed vector database that makes it easy to add vector search to production applications. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search.
  7. 7.
    ​Nmslib (benchmarked - Benchmarks of approximate nearest neighbor libraries in Python) is a Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.
  8. 8.
    ScaNN
  9. 9.
    ​Vespa.ai - Make AI-driven decisions using your data, in real time. At any scale, with unbeatable performance
  10. 10.
    ​Weaviate - Weaviate is an open source vector search engine and vector database. Weaviate uses machine learning to vectorize and store data, and to find answers to natural language queries, or any other media type.
  11. 11.
    ​Neural Search with BERT and Solr - Indexing BERT vector data in Solr and searching with full traversal
  12. 12.
    ​Fun With Apache Lucene and BERT Embeddings - This post goes much deeper -- to the similarity search algorithm on Apache Lucene level. It upgrades the code from 6.6 to 8.0
  13. 13.
    ​Speeding up BERT Search in Elasticsearch - Neural Search in Elasticsearch: from vanilla to KNN to hardware acceleration
  14. 14.
    ​Ask Me Anything about Vector Search - In the Ask Me Anything: Vector Search! session Max Irwin and Dmitry Kan discussed major topics of vector search, ranging from its areas of applicability to comparing it to good ol’ sparse search (TF-IDF/BM25), to its readiness for prime time and what specific engineering elements need further tuning before offering this to users.
  15. 15.
    ​Search with BERT vectors in Solr and Elasticsearch - GitHub repository used for experiments with Solr and Elasticsearch using DBPedia abstracts comparing Solr, vanilla Elasticsearch, elastiknn enhanced Elasticsearch, OpenSearch, and GSI APU
  16. 16.
    ​Not All Vector Databases Are Made Equal - A detailed comparison of Milvus, Pinecone, Vespa, Weaviate, Vald, GSI and Qdrant
  17. 17.
    ​Vector Podcast - Podcast hosted by Dmitry Kan, interviewing the makers in the Vector / Neural Search industry. Available on YouTube, Spotify, Apple Podcasts and RSS
  18. 18.
    ​Players in Vector Search: Video -Video recording and slides of the talk presented on London IR Meetup on the topic of players, algorithms, software and use cases in Vector Search
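
All of the libraries and services listed above answer the same basic question: given a query vector, which stored vectors are closest to it? As a rough point of reference, here is a minimal brute-force sketch in plain Python. It is exact rather than approximate, it uses none of the listed libraries, and the names `cosine_similarity`, `top_k` and the toy `index` are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query, index, k=2):
    """Return the k (id, score) pairs most similar to the query."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real ones usually have hundreds of dimensions.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.05, 0.0], index, k=2)
print(results)
```

This scan costs one similarity computation per stored vector, which is the cost that approximate nearest-neighbour indexes are designed to avoid at scale.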

TOOLS

FLAIR

  1. 1.
    Named-Entity Recognition (NER): recognises whether a word in the text represents a person, a location or another named entity.
  2. 2.
    Parts-of-Speech Tagging (PoS): Tags all the words in the given text as to which “part of speech” they belong to.
  3. 3.
    Text Classification: Classifying text based on the criteria (labels)
  4. 4.
    Training Custom Models: Making our own custom models.
  5. 5.
    It comprises popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, Character Embeddings, etc. These are very easy to use thanks to the Flair API.
  6. 6.
    Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn can lead to a significant uptick in results.
  7. 7.
    ‘Flair Embedding’ is the signature embedding provided within the Flair library. It is powered by contextual string embeddings.
  8. 8.
    Flair supports a number of languages – and is always looking to add new ones

HUGGING FACE

  1. 1.
    ​Git​
  3. 4.
    1. 1.
      how to make a custom pyTorch LSTM with custom activation functions,
    2. 2.
      how the PackedSequence object works and is built,
    3. 3.
      how to convert an attention layer from Keras to pyTorch,
    4. 4.
      how to load your data in pyTorch: DataSets and smart Batching,
    5. 5.
      how to reproduce Keras weights initialization in pyTorch.
  4. 5.
    A thorough tutorial on BERT fine-tuning using the Hugging Face transformers package. Code, YouTube ep. 1, 2, 3, 3b.

LANGUAGE EMBEDDINGS

History

  1. 2.
    1. 1.
      RNNs are sequential: the same word in a different position gets a different encoding, because the input carried over from the previous words differs.
    2. 2.
      Attention without positional encoding: the same word gets the same encoding wherever it appears, so word order is lost.
    3. 3.
      Relative positional encoding looks at a window around each word and adds a distance vector describing how many words lie before and after it, which fixes the problem.
    6. 6.
      The authors hypothesized that precise relative position information is not useful beyond a certain distance.
    7. 7.
      Clipping the maximum distance enables the model to generalize to sequence lengths not seen during training.
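
The relative-position notes above (a distance per word pair, clipped beyond a maximum) can be sketched in a few lines of plain Python. This is only an illustration of the clipping idea, not the code of any particular paper or library; `clipped_relative_positions` and `max_distance` are names chosen here.

```python
def clipped_relative_positions(seq_len, max_distance):
    """Matrix R where R[i][j] = clip(j - i, -max_distance, max_distance)."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

rel = clipped_relative_positions(seq_len=5, max_distance=2)
for row in rel:
    print(row)

# Every entry lies in [-max_distance, max_distance], so only
# 2 * max_distance + 1 distinct offsets ever need their own embedding.
distinct_offsets = sorted({offset for row in rel for offset in row})
```

Because the set of offsets does not grow with the sequence length, a model that looks up one vector per offset can be applied to sequences longer than those it was trained on.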

Embedding Foundation Knowledge

Language modeling

  1. 1.
    Ruder on language modelling as the next ImageNet - Language modelling, the last approach mentioned, has been shown to capture many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and sentiment. Compared to related unsupervised tasks such as skip-thoughts and autoencoding, language modelling performs better on syntactic tasks even with less training data.
  2. 2.
    A tutorial about w2v skipthought - with code!, specifically language modelling here is important - Our second method is training a language model to represent our sentences. A language model describes the probability of a text existing in a language. For example, the sentence “I like eating bananas” would be more probable than “I like eating convolutions.” We train a language model by slicing windows of n words and predicting what the next word will be in the text
  3. 4.
    ELMO - medium​
  4. 5.
    BERT ([python git](https://github.com/CyberZHG/keras-bert)) - We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks.
  5. 6.
    OpenAI on language modelling - We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training.
  6. 7.
    Scikit-learn inspired model finetuning for natural language processing.
  1. 1.
    Did not read - The annotated Transformer - jupyter on transformer with annotation
  2. 2.
    Medium on Dissecting Bert, appendix​
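
The language-modelling description in the list above (slice windows of n words, predict the next word) can be made concrete with a tiny count-based model. Note the swap: the tutorial trains a neural model, whereas this sketch only counts, and the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def next_word_counts(tokens, n=2):
    """Map each window of n consecutive words to counts of the word that follows."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        counts[context][tokens[i + n]] += 1
    return counts

def next_word_probability(counts, context, word):
    """Relative frequency of `word` after `context` (0.0 if the context is unseen)."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

corpus = "i like eating bananas and i like eating apples".split()
counts = next_word_counts(corpus, n=2)
p_bananas = next_word_probability(counts, ["like", "eating"], "bananas")
p_convolutions = next_word_probability(counts, ["like", "eating"], "convolutions")
print(p_bananas, p_convolutions)
```

On this toy corpus "bananas" is a likely continuation of "like eating" and "convolutions" is not, which is the same intuition as the banana example quoted above.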

Embedding spaces

  2. 5.
    ​Sent2vec by gensim - sentence embedding is defined as the average of the source word embeddings of its constituent words. This model is furthermore augmented by also learning source embeddings for not only unigrams but also n-grams of words present in each sentence, and averaging the n-gram embeddings along with the words
  3. 7.
    ​Wordrank vs fasttext vs w2v comparison - the better word similarity algorithm
  4. 10.
    ​Doc2vec tutorial by gensim - Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents. - Most importantly this tutorial has crucial information about the implementation parameters that should be read before using it.
  5. 11.
    ​Lbl2Vec, medium, is an algorithm for unsupervised document classification and unsupervised document retrieval. It automatically generates jointly embedded label, document and word vectors and returns documents of categories modeled by manually predefined keywords.
  6. 13.
    Skip-thought ([git](https://github.com/ryankiros/skip-thoughts)) - Where word2vec attempts to predict surrounding words from certain words in a sentence, skip-thought vector extends this idea to sentences: it predicts surrounding sentences from a given sentence. NOTE: Unlike the other methods, skip-thought vectors require the sentences to be ordered in a semantically meaningful way. This makes this method difficult to use for domains such as social media text, where each snippet of text exists in isolation.
  7. 14.
    ​Fastsent - Skip-thought vectors are slow to train. FastSent attempts to remedy this inefficiency while expanding on the core idea of skip-thought: that predicting surrounding sentences is a powerful way to obtain distributed representations. Formally, FastSent represents sentences as the simple sum of its word embeddings, making training efficient. The word embeddings are learned so that the inner product between the sentence embedding and the word embeddings of surrounding sentences is maximized. NOTE: FastSent sacrifices word order for the sake of efficiency, which can be a large disadvantage depending on the use-case.
  8. 15.
    Weighted sum of words - In this method, each word vector is weighted by the factor $\frac{a}{a + p(w)}$, where $a$ is a hyperparameter and $p(w)$ is the (estimated) word frequency. This is similar to tf-idf weighting, where more frequent terms are weighted down. NOTE: Word order and surrounding sentences are ignored as well, limiting the information that is encoded.
  9. 16.
    ​Infersent by facebook - paper InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks. ABSTRACT: we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks.
  10. 17.
    ​Universal sentence encoder - google - notebook, git The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
  11. 19.
    Pair2vec - paper - paper proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. I.e., using p2v information with existing models to increase performance. Experiments show that our pair embeddings can complement individual word embeddings, and that they are perhaps capturing information that eludes the traditional interpretation of the Distributional Hypothesis
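
For the weighted sum of words entry above, a small sketch of the a / (a + p(w)) weighting in plain Python. The toy corpus, the 2-dimensional vectors and the value a = 0.1 are assumptions picked so the numbers are easy to follow; real setups typically use a much smaller a and pretrained vectors.

```python
from collections import Counter

def frequency_weights(corpus_tokens, a=1e-3):
    """Weight a / (a + p(w)) for every word, with p(w) its relative corpus frequency."""
    total = len(corpus_tokens)
    return {w: a / (a + c / total) for w, c in Counter(corpus_tokens).items()}

def weighted_sum(sentence, vectors, weights):
    """Sum the word vectors of a sentence, each scaled by its frequency weight."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for word in sentence:
        if word in vectors and word in weights:
            for d in range(dim):
                out[d] += weights[word] * vectors[word][d]
    return out

corpus = "the the the the banana".split()
weights = frequency_weights(corpus, a=0.1)
vectors = {"the": [1.0, 0.0], "banana": [0.0, 1.0]}
sentence_vector = weighted_sum(["the", "banana"], vectors, weights)
print(weights, sentence_vector)
```

The frequent word "the" ends up with a smaller weight than the rare word "banana", so the rare word dominates the sentence vector.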

Embedding Models

Cat2vec

  1. 2.
    ​Part2: cat2vec using w2v, and entity embeddings for categorical data

ENTITY EMBEDDINGS

  1. 2.
    ​Using embeddings on tabular data, specifically categorical - introduction, using fastai without limiting ourselves to pytorch - the material from this post is covered in much more detail starting around 1:59:45 in the Lesson 3 video and continuing in Lesson 4 of our free, online Practical Deep Learning for Coders course. To see example code of how this approach can be used in practice, check out our Lesson 3 jupyter notebook. Perhaps Saturday and Sunday have similar behavior, and maybe Friday behaves like an average of a weekend and a weekday. Similarly, for zip codes, there may be patterns for zip codes that are geographically near each other, and for zip codes that are of similar socio-economic status. The jupyter notebook doesn't seem to have the embedding example they are talking about.
  2. 3.
    Rossmann on Kaggle, used entity embeddings: here, github, paper
  3. 5.
    ​Embedder - git code for a simplified entity embedding above.
  4. 6.
    Finally, what they do is label-encode each feature with LabelEncoder into an int-based feature, then feed each feature into its own embedding layer with input length 1 and an embedding dimension defined by a rule of thumb (so it seems), merge all the layers, train a synthetic regression/classification task and grab the weights of the corresponding embedding layer.
  5. 7.
    ​Entity2vec​
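
The entity-embedding recipe described above (label-encode a categorical feature, then give it its own embedding table) can be sketched without any deep-learning framework. The training step is left out on purpose, and the sizing rule `min(50, (n + 1) // 2)` is an assumption used here for illustration, since the notes do not say which rule of thumb is meant.

```python
import random

def label_encode(values):
    """Map each distinct category to an integer id, in first-seen order."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return mapping

def embedding_size(n_categories, cap=50):
    # Assumed rule of thumb, not taken from the notes above.
    return min(cap, (n_categories + 1) // 2)

def init_embedding_table(n_categories, dim, seed=0):
    """One small random vector per category; training would update these rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(n_categories)]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Sat", "Mon"]
day_to_id = label_encode(days)
dim = embedding_size(len(day_to_id))
table = init_embedding_table(len(day_to_id), dim)
encoded = [day_to_id[d] for d in days]
saturday_vector = table[day_to_id["Sat"]]
print(encoded, dim)
```

After training on a downstream task, the rows of `table` are the vectors one would extract, and days that behave alike would be expected to end up close together.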

ALL2VEC EMBEDDINGS

  2. 2.
    Fast.ai post regarding embeddings for tabular data, i.e., continuous and categorical data
  3. 3.
    ​Entity embedding for categorical data + notebook​
  4. 8.
    ​Tweet2Vec - code in theano, paper.
  5. 9.
    ​Clustering of tweet2vec, paper​
  6. 11.
    Diff2vec - might be useful on social network graphs, paper, code​
  7. 12.
    emoji 2vec (below)
  8. 13.
    Char2vec [Git](https://github.com/IntuitionEngineeringTeam/chars2vec), similarity measure for words with typos. [arXiv:1708.00524](https://arxiv.org/abs/1708.00524)

EMOJIS

  1. 1.
  2. 2.
    1. 1.
      how to make a custom pyTorch LSTM with custom activation functions,
    2. 2.
      how the PackedSequence object works and is built,
    3. 3.
      how to convert an attention layer from Keras to pyTorch,
    4. 4.
      how to load your data in pyTorch: DataSets and smart Batching,
    5. 5.
      how to reproduce Keras weights initialization in pyTorch.
  3. 5.
    Learning from millions of emoji occurrences for sentiment and sarcasm detection, medium
  4. 6.
    ​EMOJI2VEC - medium article with keras code, another paper on classifying tweets using emojis​
  5. 7.
    Group2vec git and medium, which is a multi-input embedding network using steps 1-6 below, plus two other methods that involve groupby and applying entropy and join/countvec per class. Really interesting.
    1. 1.
      Initialize embedding layers for each categorical input;
    2. 2.
      For each category, compute dot-products among other embedding representations. These are our ‘groups’ at the categorical level;
    3. 3.
      Summarize each ‘group’ adopting an average pooling;
    4. 4.
      Concatenate ‘group’ averages;
    5. 5.
      Apply regularization techniques such as BatchNormalization or Dropout;
    6. 6.
      Output probabilities.

WORD2VEC

  1. 1.
    Monitor train loss using callbacks for word2vec
  2. 2.
    Cleaning datasets using weighted w2v sentence encoding, then pca and isolation forest to remove outlier sentences.
  3. 4.
    ​KPCA w2v approach on a very small dataset, similar git for correspondence analysis, paper​
  4. 6.
    Chris McCormick on w2v, [post #2](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) - negative sampling: “Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them. With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example). The “negative samples” (that is, the 5 output words that we’ll train to output 0) are chosen using a “unigram distribution”. Essentially, the probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.”
  5. 7.
    Chris McCormick on negative sampling and hierarchical softmax training, i.e., a Huffman binary tree over the vocabulary, learning the internal tree nodes (i.e., the path) as the probability vector instead of having len(vocabulary) output neurons.
  6. 8.
    ​Great W2V tutorial​
  7. 9.
    Another gensim-based w2v tutorial, with starter code and some usage examples of similarity
  8. 12.
    Mean w2v
  9. 13.
    Sequential w2v embeddings.
  10. 19.
    W2v analogies using predefined analogy lists of the form x : y :: a : b, plus code, plus insights into why it works and why it doesn't. Examples: presence : absence :: happy : unhappy; absence : presence :: happy : proud; abundant : scarce :: happy : glad; refuse : accept :: happy : satisfied; accurate : inaccurate :: happy : disappointed; admit : deny :: happy : delighted; never : always :: happy : Said_Hirschbeck; modern : ancient :: happy : ecstatic
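
The negative-sampling quote above says negatives are drawn from a unigram distribution, with frequent words picked more often. A minimal sketch of that sampling step follows. The 0.75 exponent comes from the original word2vec paper rather than from these notes, and the corpus, function names and fixed seed are assumptions made for the example.

```python
import random
from collections import Counter

def negative_sampling_distribution(tokens, power=0.75):
    """Unigram counts raised to `power`, normalised into probabilities."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(distribution, positive_word, k=5, seed=0):
    """Draw k negative words (with replacement), never the positive word itself."""
    rng = random.Random(seed)
    candidates = [w for w in distribution if w != positive_word]
    probs = [distribution[w] for w in candidates]
    return rng.choices(candidates, weights=probs, k=k)

tokens = "the quick brown fox jumps over the lazy dog the end".split()
dist = negative_sampling_distribution(tokens)
negatives = sample_negatives(dist, positive_word="quick", k=5)
print(negatives)
```

Only the output weights of the positive word and of these few sampled negatives would be updated for a training pair, instead of the weights of the whole vocabulary.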

GLOVE

  1. 2.
    ​W2v against glove performance comparison - glove wins in % and time.
  2. 3.
    ​How glove and w2v work, but the following has a very good description - “GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.”
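
The quoted GloVe description (dot product of two word vectors should equal the log of their co-occurrence count) starts from a co-occurrence table. Here is a sketch of building that table and the log targets on a toy sentence. The window size and corpus are arbitrary choices for the example, and the full GloVe objective additionally uses bias terms and a weighting function that are not shown.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered word pairs that occur within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = "the dog chased the cat and the cat chased the dog".split()
counts = cooccurrence_counts(tokens, window=3)

# Target value that vec(w1) . vec(w2) is trained towards for each observed pair.
targets = {pair: math.log(c) for pair, c in counts.items()}
print(counts[("cat", "dog")], targets[("cat", "dog")])
```

Pairs that never co-occur have no entry in `targets`, which sidesteps log(0). Training would then adjust the word vectors so their dot products approach these targets.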

FastText

  1. 4.
    ​Docs​
  2. 5.
    ​Medium: word embeddings with w2v and fast text in gensim , data cleaning and word similarity
  3. 6.
    Gensim - fasttext docs, similarity, analogies
  4. 7.
    ​Alternative to gensim - promises speed and out of the box support for many embeddings.
  5. 10.
    ​A comparison of w2v vs ft using gensim - “Word2Vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. Makes sense, since fastText embeddings are trained for understanding morphological nuances, and most of the syntactic analogies are morphology based.
    1. 1.
      ​Syntactic means syntax, as in tasks that have to do with the structure of the sentence, these include tree parsing, POS tagging, usually they need less context and a shallower understanding of world knowledge
    2. 2.
      ​Semantic tasks mean meaning related, a higher level of the language tree, these also typically involve a higher level understanding of the text and might involve tasks s.a. question answering, sentiment analysis, etc...
    3. 3.
      As for analogies, he is referring to the mathematical operator-like properties exhibited by word embeddings. In this context a syntactic analogy would relate to plurals, tense or gender, and a semantic analogy would be a word-meaning relationship such as king - man + woman ≈ queen. See for instance this article (and many others)
  6. 11.
    ​Skip gram vs CBOW​
  1. 1.
    ​Paper on fasttext vs glove vs w2v on a single DS, performance comparison. Ft wins by a small margin
  2. 5.
    Medium on universal Sentence encoder, w2v, Fast text for sentiment with code.
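
The analogy arithmetic mentioned in the comparison above can be illustrated with hand-made vectors. The usual textbook form is king - man + woman ≈ queen. The 2-dimensional vectors below are invented so that the arithmetic works out exactly; with real embeddings the match is only approximate, and `analogy` is a name chosen here, not a library call.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? via the word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy axes: first coordinate ~ "royalty", second ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.1, 0.0],
}
answer = analogy("man", "king", "woman", vectors)
print(answer)
```

The three input words are excluded from the candidates, which is the common convention when evaluating analogies, since otherwise one of the inputs is often the nearest neighbour.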

SENTENCE EMBEDDING

Sense2vec

  1. 1.
    Blog, github: Using spaCy or not, with w2v using POS/ENTITY tags to find similarities. Based on Reddit. “We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
  2. 2.
    >>> model.similarity('fair_game|NOUN', 'game|NOUN')
    0.034977455677555599
    >>> model.similarity('multiplayer_game|NOUN', 'game|NOUN')
    0.54464530644393849

SENT2VEC aka “skip-thoughts”

  1. 1.
    ​Gensim implementation of sent2vec - usage examples, parallel training, a detailed comparison against gensim doc2vec
  2. 2.
    ​Git implementation​

USE - Universal sentence encoder

BERT+W2V


PARAGRAPH2Vec

Doc2Vec

  1. 1.
    ​Shuffle before training each epoch in d2v in order to fight overfitting
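
A sketch of the per-epoch shuffling idea from the note above, kept independent of any library. The generator name, the fixed seed and the placeholder comment where training would happen are assumptions; how the shuffled documents are handed to a doc2vec implementation depends on that implementation.

```python
import random

def shuffled_epochs(documents, n_epochs, seed=0):
    """Yield a freshly shuffled copy of the documents for every epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(documents)  # copy so the caller's list is left untouched
        rng.shuffle(order)
        yield order

docs = ["doc_%d" % i for i in range(6)]
seen_orders = []
for epoch, order in enumerate(shuffled_epochs(docs, n_epochs=3)):
    # A real run would pass `order` to the model's training step here.
    seen_orders.append(order)
    print(epoch, order)
```

Each epoch sees every document exactly once, only in a different order, so the model cannot latch onto the original ordering of the corpus.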