NLP intro
TF-IDF
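Before turning to scikit-learn, it helps to see tf-idf computed by hand: tf is how often a term occurs in a document (normalized by document length), and idf down-weights terms that occur in many documents. A minimal sketch with a made-up three-document corpus (the documents and helper name `tfidf` here are illustrative, not part of the library):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: in how many documents each term appears
df = Counter()
for doc in tokenized:
    df.update(set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # term frequency, length-normalized
    idf = math.log(N / df[term])      # inverse document frequency
    return tf * idf

# "the" appears in two of three documents, so its idf is low;
# "cat" appears in only one, so it scores higher
print(tfidf("the", tokenized[0]), tfidf("cat", tokenized[0]))
```

Note that scikit-learn's TfidfVectorizer uses a smoothed variant (by default, idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each row), so its numbers differ from this textbook formula, but the ranking intuition is the same.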
# Define the vectorizer parameters.
# TfidfVectorizer builds the tf-idf matrix for us:
# max_df: ignore terms that appear in more than this fraction of the documents
# min_df: ignore terms that appear in fewer than this many documents
# max_features: keep at most this many terms (the most frequent ones)
# use_idf: if False, only the tf part is computed
# stop_words: use the built-in English stop-word list
# tokenizer: callable used to split each document into tokens
# ngram_range: (min_n, max_n); e.g. (1, 3) includes unigrams, bigrams, and trigrams
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_model = TfidfVectorizer(max_df=0.8, min_df=0, max_features=2000,
                              stop_words='english', use_idf=True,
                              tokenizer=tokenization_and_stemming,
                              ngram_range=(1, 3))
tfidf_matrix = tfidf_model.fit_transform(synopses)  # fit the vectorizer to the synopses
print("In total, there are " + str(tfidf_matrix.shape[0]) +
      " synopses and " + str(tfidf_matrix.shape[1]) + " terms.")
Latent Dirichlet Allocation (LDA)
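LDA models each document as a mixture of topics and each topic as a distribution over terms. A minimal sketch using scikit-learn's LatentDirichletAllocation; note that LDA expects raw term counts (CountVectorizer), not tf-idf weights. The four-document corpus and the choice of two topics are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "space ship crew mission space",
    "crime detective murder case",
    "ship mission launch orbit",
    "detective case evidence trial",
]
counts = CountVectorizer().fit_transform(corpus)   # raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)

# each row is one document's distribution over the 2 topics (rows sum to 1)
print(doc_topics.shape)
```

The fitted `lda.components_` array (shape n_topics x n_terms) gives each topic's term weights; combining its argsort with the vectorizer's `get_feature_names_out()` yields the top words per topic.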