TF-IDF
TF-IDF - how important a word is to a document in a corpus
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
Frequency of the word in the document divided by the total number of words in the document (normalized because documents have different sizes)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
measures how rare, and therefore how informative, a term is across the corpus; a term that appears in every document gets an IDF of zero
TF-IDF(t) = TF(t) * IDF(t); see the worked example below
Data sets:
Sparse textual content
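A minimal worked example of the formulas above using scikit-learn's TfidfVectorizer; note that scikit-learn uses a smoothed IDF and L2 row normalization instead of dividing TF by document length, so the numbers differ slightly from the raw definitions (and get_feature_names_out is called get_feature_names in older scikit-learn versions, as in the snippet further down):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse (n_docs, n_terms) TF-IDF matrix

# Terms unique to one document ("cat", "dog") get a higher IDF than
# terms shared by both ("the", "sat", "on").
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(term, round(idf, 3))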
mean(IDF(i) * w2v word vector(i)), with or without removing PC1 from the whole w2v average (Amir Pupko); the weighted mean is implemented below and the PC1 removal is sketched after it
import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # IDF-weighted mean of the word vectors for `words`; `model` is a
    # word-embedding lookup (e.g. gensim/fastText) and `idf` is a scalar
    # or an (n_words, 1) array of per-word weights.
    if words:
        return np.mean(idf * model[words], axis=0)
    else:
        print('we have an empty list')
        return []
idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))  # term -> IDF weight
# Per-token IDF weights for each message; `splitter` is the same tokenizer the vectorizer used.
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(lambda x: [idf_mapping[token] for token in splitter(x)])
# IDF-weighted mean embedding per message (`ft` is the word-embedding model), i.e. mean(IDF(i) * w2v(i)).
logs_sequences_df['mean_weighted_idf_w2v'] = [mean_weighted_embedding(ft, splitter(logs_sequences_df['message'].iloc[i]), np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1, 1)) for i in range(logs_sequences_df.shape[0])]
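The snippet above covers the IDF-weighted mean but not the optional PC1 removal; a minimal sketch of that step (in the spirit of SIF sentence embeddings), assuming no message produced an empty embedding:

import numpy as np
from sklearn.decomposition import TruncatedSVD

embeddings = np.vstack(logs_sequences_df['mean_weighted_idf_w2v'].tolist())

# Fit the first principal component of the sentence embeddings and
# subtract each embedding's projection onto it.
svd = TruncatedSVD(n_components=1)
svd.fit(embeddings)
pc1 = svd.components_                                    # shape (1, dim)
embeddings_no_pc1 = embeddings - (embeddings @ pc1.T) @ pc1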
Enriching using an LSTM next-word model (character- or word-level); see the sketch below
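A minimal character-level sketch of that idea in PyTorch; the model, corpus, and hyperparameters are all illustrative, and the trained hidden states or next-token predictions are what you would feed back in as enrichment features:

import torch
import torch.nn as nn

class CharLM(nn.Module):
    # Predicts the next character from the previous ones.
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))   # h: (batch, seq_len, hidden_dim)
        return self.head(h)               # next-char logits at each position

text = "the quick brown fox jumps over the lazy dog"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([stoi[c] for c in text]).unsqueeze(0)

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    logits = model(ids[:, :-1])           # inputs: all chars but the last
    # targets: the same sequence shifted one step to the left
    loss = loss_fn(logits.reshape(-1, len(chars)), ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()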
Using external Wiktionary/Wikipedia data for certain words and phrases; see the sketch below
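A minimal sketch against Wikipedia's public REST summary endpoint (Wiktionary exposes a similar MediaWiki API); the page title below is illustrative and error handling is kept minimal:

import requests

def wiki_summary(title):
    # Lead-section summary for a Wikipedia page title (underscores for spaces).
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, timeout=10)
    return resp.json().get("extract", "") if resp.ok else ""

print(wiki_summary("Natural_language_processing")[:200])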
Finding clusters of relevant data and checking whether the content of each cluster can be used for enrichment; see the sketch below
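A minimal sketch of the clustering idea, reusing `embeddings`, `splitter`, and `logs_sequences_df` from the snippets above (k is illustrative): cluster the messages, then summarize each cluster by its most common tokens to judge whether it is coherent enough to enrich from.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

# Most common tokens per cluster, as a quick human-readable summary.
for c in range(k):
    idxs = np.where(labels == c)[0]
    tokens = [t for i in idxs for t in splitter(logs_sequences_df['message'].iloc[i])]
    print(c, Counter(tokens).most_common(5))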