TF-IDF - how important a word is to a document in a corpus

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Frequency of the word in the document / total words in the document (normalized because documents have different sizes)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

IDF measures how informative a term is: a term that appears in nearly every document scores close to zero, while a rare term scores high.
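The two formulas above can be checked by hand on a toy corpus (a minimal sketch; the corpus and whitespace tokenization are made up for illustration):

```python
import math

def tf(term, doc_tokens):
    # term frequency: count of term / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # inverse document frequency: log_e(N / number of docs containing term)
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]

print(tf("cat", corpus[0]))   # 1/3
print(idf("the", corpus))     # log(3/3) = 0 -> appears everywhere, no signal
print(idf("dog", corpus))     # log(3/1) -> rare, high weight
```

Note that "the" gets an IDF of exactly zero, which is why TF-IDF down-weights stopwords automatically.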


Data sets:

Sparse textual content

  1. mean(IDF(i) * w2v word vector(i)), with or without removing the first principal component (PC1) from the averaged w2v vectors (Amir Pupko)

def mean_weighted_embedding(model, words, idf=1.0):
    if words:
        # IDF-weighted average of the word vectors for `words`
        return np.mean(idf * model[words], axis=0)
    print('we have an empty list')
    return []

idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(lambda x: [idf_mapping[token] for token in splitter(x)])

logs_sequences_df['mean_weighted_idf_w2v'] = [mean_weighted_embedding(ft, splitter(logs_sequences_df['message'].iloc[i]), 1 / np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1,1)) for i in range(logs_sequences_df.shape[0])]
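The "reducing PC1" variant mentioned in point 1 can be sketched as removing each document vector's projection onto the first principal direction of the embedding matrix (similar in spirit to the common-component removal in SIF, Arora et al. 2017). A minimal sketch with toy data; all names are illustrative:

```python
import numpy as np

def remove_pc1(embeddings):
    # Compute the first principal direction of the embedding matrix
    # and remove each row's projection onto it.
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]  # unit-norm first principal direction
    cleaned = embeddings - np.outer(embeddings @ pc1, pc1)
    return cleaned, pc1

# toy stand-in: 10 "document embeddings" of dimension 5
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 5))
cleaned, pc1 = remove_pc1(doc_vecs)
```

After removal, every row of `cleaned` is orthogonal to `pc1`, so the dominant shared direction (often a frequency/syntax artifact) no longer swamps document similarities.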

  2. Enriching using an LSTM next-word model (char- or word-wise)

  3. Using external Wiktionary/Wikipedia data for certain words and phrases

  4. Finding clusters of relevant data and figuring out whether you can enrich based on the content of the clusters
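The clustering idea in the last point could be prototyped roughly as follows: cluster the document embeddings, then inspect each cluster's members for enrichment candidates. A minimal pure-NumPy k-means (Lloyd's algorithm) sketch, with toy blobs standing in for embedding vectors; all names and data are illustrative:

```python
import numpy as np

def kmeans(points, k, iters=20):
    # Minimal Lloyd's algorithm; init with k evenly spaced points.
    centers = points[np.linspace(0, len(points) - 1, k, dtype=int)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster emptied
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# two well-separated toy blobs -> two clusters
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (5, 2)),
                 rng.normal(10.0, 0.1, (5, 2))])
labels, centers = kmeans(pts, k=2)
```

In practice you would run this (or scikit-learn's KMeans) on the mean-weighted embeddings computed above and eyeball each cluster's messages to decide what enrichment fits.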
