TF-IDF

TF-IDF measures how important a word is to a document in a corpus.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Frequency of the word in the document divided by the number of words in the document (normalized because documents differ in length).

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

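Measures how informative a term is across the corpus: a term that occurs in few documents gets a high IDF, while a term that occurs in every document gets an IDF of 0.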

TF-IDF is TF*IDF
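
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`tf`, `idf`, `tf_idf`) and the toy corpus are made up for the example, and tokenization is a simple whitespace split.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Count of the term divided by the total number of terms in the document.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Natural log of (number of documents / number of documents containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Assumption for this sketch: a term unseen in the corpus is an error.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (hypothetical data), tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

score_cat = tf_idf("cat", corpus[0], corpus)  # rare term: appears in 1 of 3 documents
score_the = tf_idf("the", corpus[0], corpus)  # common term: appears in 2 of 3 documents
print(score_cat, score_the)
```

Although "the" occurs twice in the first document and "cat" only once, "cat" ends up with the higher score because it is rarer across the corpus. Libraries may use a smoothed variant of IDF; scikit-learn's `TfidfVectorizer`, for instance, defaults to ln((1 + N) / (1 + df)) + 1, so its values differ from the plain formula.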

Data sets:

Sparse textual content

Average the word2vec vectors of a text weighted by IDF, i.e. mean(IDF(i) * w2v word vector(i)), with or without removing the first principal component (PC1) from the averaged w2v vectors (Amir Pupko)

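import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # Average the word vectors of `words`, scaling each vector by its weight in `idf`.
    if words:
        return np.mean(idf * model[words], axis=0)
    else:
        print('we have an empty list')
        return []

# Map each vocabulary term to its IDF (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2).
idf_mapping = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# Per-message list of IDF values, one per token.
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(lambda x: [idf_mapping[token] for token in splitter(x)])

# One weighted mean vector per message. The weight passed here is 1 / IDF (as a column vector); pass the IDF array itself to weight by IDF as in the formula above.
logs_sequences_df['mean_weighted_idf_w2v'] = [
    mean_weighted_embedding(ft, splitter(msg), 1 / np.array(idfs).reshape(-1, 1))
    for msg, idfs in zip(logs_sequences_df['message'], logs_sequences_df['idf_vectors'])]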

Multiply by TF-IDF
Enrich using LSTM next-word prediction (character-level or word-level)
Use external Wiktionary/Wikipedia data for certain words and phrases
Find clusters of relevant data and figure out whether you can enrich based on the content of the clusters
Apply deep NLP methods that work without big data, i.e., under sparseness
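
The first idea in this list, IDF-weighted averaging with optional PC1 removal, can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the method used in the notes: the function names, the toy vectors, and the IDF values are hypothetical, and "PC1" is taken here to mean the first singular direction of the uncentered matrix of text embeddings, which is one common way to remove a shared component.

```python
import numpy as np

def idf_weighted_mean(vectors, idf_weights):
    # Mean of word vectors, each scaled by its IDF weight.
    vectors = np.asarray(vectors, dtype=float)                       # (n_words, dim)
    weights = np.asarray(idf_weights, dtype=float).reshape(-1, 1)    # (n_words, 1)
    return np.mean(weights * vectors, axis=0)

def remove_first_pc(embeddings):
    # Subtract from every row its projection onto the first singular direction.
    # Assumption: the direction is computed on the uncentered matrix.
    X = np.asarray(embeddings, dtype=float)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc1 = vt[0]
    return X - np.outer(X @ pc1, pc1)

# Hypothetical word vectors and IDF values, for illustration only.
word_vectors = {
    "disk":  [1.0, 0.2, 0.0],
    "full":  [0.8, 0.1, 0.3],
    "login": [0.1, 1.0, 0.2],
    "error": [0.5, 0.5, 0.5],
}
idf_values = {"disk": 2.0, "full": 1.5, "login": 2.2, "error": 0.3}

texts = [["disk", "full", "error"], ["login", "error"], ["disk", "error"]]

# One IDF-weighted mean vector per text, then optional PC1 removal.
doc_embeddings = np.vstack([
    idf_weighted_mean([word_vectors[w] for w in tokens],
                      [idf_values[w] for w in tokens])
    for tokens in texts
])
cleaned = remove_first_pc(doc_embeddings)
print(doc_embeddings.shape, cleaned.shape)
```

Whether removing the first component helps depends on the data, so it is worth comparing both variants on the downstream task.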

