Term Frequency: (count of word A in doc B) / (total word count in doc B). It is simply how often the word occurs in that document, normalized by the document's length.
Inverse Document Frequency: log(total number of documents in the corpus / (number of documents containing word A + 1)). It reflects how many documents the word appears in; the more documents contain the word, the less informative it is.
TF-IDF(wordA in docB) = TF(wordA in docB) * IDF(wordA)
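To make the three definitions concrete, here is a minimal hand-rolled sketch that computes them on a made-up toy corpus; the documents and the helper names tf, idf, and tf_idf are purely illustrative and not part of scikit-learn.

import math

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf(word, doc):
    # Term frequency: count of the word in the doc / total words in the doc.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency with the +1 in the denominator, as in the formula above.
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (n_containing + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "cat" appears in 2 of 3 documents, so its IDF (and tf-idf) is lower than that of
# "mat", which appears in only 1 of 3 documents.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("mat", docs[0], docs))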
# TfidfVectorizer builds the tf-idf matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

# Define vectorizer parameters:
# max_df: maximum document frequency for a word (words appearing in more than 80% of documents are dropped)
# min_df: minimum document frequency for a word
# max_features: maximum number of terms to keep
# use_idf: if False, only tf is calculated
# stop_words: use the built-in English stop-word list
# tokenizer: how to tokenize each document
# ngram_range: (min_n, max_n), e.g. (1, 3) includes 1-grams, 2-grams, and 3-grams
tfidf_model = TfidfVectorizer(max_df=0.8, max_features=2000, min_df=0,
                              stop_words='english', use_idf=True,
                              tokenizer=tokenization_and_stemming,
                              ngram_range=(1, 3))

# Fit the vectorizer to the synopses and transform them into the tf-idf matrix.
tfidf_matrix = tfidf_model.fit_transform(synopses)
print("In total, there are " + str(tfidf_matrix.shape[0]) +
      " synopses and " + str(tfidf_matrix.shape[1]) + " terms.")
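After fitting, it is often useful to look at which terms carry the most weight in a given synopsis. This is a small sketch, assuming the tfidf_model and tfidf_matrix built above and a scikit-learn version that provides get_feature_names_out() (1.0+); on older versions the method is get_feature_names().

import numpy as np

# Vocabulary learned by the vectorizer (terms correspond to matrix columns).
terms = tfidf_model.get_feature_names_out()

# Top 5 highest-weighted terms for the first synopsis.
row = tfidf_matrix[0].toarray().ravel()
top_idx = np.argsort(row)[::-1][:5]
for i in top_idx:
    print(terms[i], round(row[i], 3))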
Finally, we can compute cosine similarities on the tf-idf matrix to measure how similar the texts are to each other.
# Compute pairwise cosine similarity between all synopses.
from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print(cos_matrix)
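As a quick usage example, the similarity matrix can be scanned for the pair of distinct synopses that are most alike; this is a sketch assuming the cos_matrix computed above, and the variable names are illustrative only.

import numpy as np

sim = cos_matrix.copy()
np.fill_diagonal(sim, 0)  # ignore self-similarity, which is always 1.0
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print("Most similar pair of synopses:", i, j, "similarity:", round(sim[i, j], 3))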