Notes by Louisa
Notes by Louisa
Notes by Louisa
  • Introduction
  • Chapter1 Python Cheatsheet
    • Reference, Deep Copy and Shallow Copy
    • Iterators
    • List Comprehensions
    • Numpy
    • Pandas
    • Data Visualization
    • DateTime
    • Python Good to knows
  • Chapter2 Java Cheatsheet
    • Fundamentals to Java
    • Interface, Abstract Class, Access Modifier, Exceptions
    • Linked List and Java List
    • Java Queue, Stack and Deque
    • Binary Tree
    • Heap in Java
    • Map/Set/Hash
    • OOD
  • Chapter3 Algorithm
    • Fundamental Knowledge
    • Binary Search
    • Basic Sorting
    • Advanced Sorting
    • Linked List
    • Recursion 1
    • HashTable
    • Queue
    • Sliding Window
    • Stack
    • Binary Tree
    • Binary Search Tree
    • Heap
    • String
    • Graph Search DFS1 (Back Tracking)
    • Recursion II and Memoization
    • Dynamic Programming
    • Complete Binary Tree, Segment Tree, Trie Tree
    • Graph Search BFS
    • Graph Search BFS 2
    • Graph Search DFS2
    • Problems from 'JianZhi Offer'
    • Problems Categorized
    • Bit Operations
  • Chapter4 Model
    • Linear Regression
    • Logistic Regression
    • Regularization and Feature Selection
    • Model Evaluation
    • Nonlinear Models
    • PCA
    • Unsupervised Learning
    • Gradient Descent and Gradient Boosting
    • XG Boost and Light GBD
    • Deep Learning
    • Tensorflow/Keras
    • RNN
  • Chapter5 Statistics and A/B Testing
    • Inference about independence
    • Probability, Sampling and Randomization with Python
    • A/B Testing
    • Stats Interview Review
    • Statistics Glossary
  • Chapter6 SQL
    • Student Scores Query
    • Order Query
    • Movie Rating Query
    • Social-Network Query
    • LeetCode SQL题目总结
    • Spark SQL
  • Chapter7 Big Data and Spark
    • Introduction to Pyspark
    • Data Cleaning with Apache Spark
    • Feature Engineering with Pyspark
    • Building Recommendation Engines with Pyspark
    • Building Data Engineering Pipelines in Python
    • Hadoop MapReduce
    • Big Data Related Paper
  • Chapter8 Code Walk-Throughs
    • Python
    • R
    • Shell
  • Chapter9 Special Topics
    • Anomaly Detection
    • E Commerce
    • Supply Chain
    • Social Network Analysis
    • NLP intro
    • Time Series
    • Challenge Prophet with LSTM models
  • Project: The Winning Recipes to an Oscar Award
  • Project: A Crime Analysis of the Last Decade NYC
  • Project: Predict User Type Based on Citibike Data
  • GeoSpark/GeoSparkVis for Geospatial Big Data
  • Single Scattering Albedo
  • Sea Ice Albedo Retrievals
  • Lidar Project
Powered by GitBook
On this page
  • TF-IDF
  • Ngram
  • Late Dirichlet Allocation (LDA)
  1. Chapter9 Special Topics

NLP intro

TF-IDF

Term Frequency: count of word A in doc B/total words count in doc B. 词频,出现了几次就是几

Inverse Document Frequency: log(total number of documents in the corpus/ (number of documents where word A appears +1)) 词出现在了几个文档里;出现次数越多,词在文档中越不重要

TF-IDF(wordA in docB)=TF(wordA_in_docB)*IDF(wordA)

# define vectorizer parameters
# TfidfVectorizer will help us to create tf-idf matrix
# max_df : maximum document frequency for the given word
# min_df : minimum document frequency for the given word
# max_features: maximum number of words
# use_idf: if not true, we only calculate tf
# stop_words : built-in stop words
# tokenizer: how to tokenize the document
# ngram_range: (min_value, max_value), eg. (1, 3) means the result will include 1-gram, 2-gram, 3-gram
tfidf_model = TfidfVectorizer(max_df=0.8, max_features=2000,
                                 min_df=0, stop_words='english',
                                 use_idf=True, tokenizer=tokenization_and_stemming, ngram_range=(1,3))

tfidf_matrix = tfidf_model.fit_transform(synopses) #fit the vectorizer to synopses

print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " synoposes and " + str(tfidf_matrix.shape[1]) + " terms.")

最后也可以cosine similarities看相似度 文字和文字之间的相似度

from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print (cos_matrix)

Ngram

1 gram: 1个词为单位分

2 gram, bigram: 2个词为单位,有词组的概念了,所以词的前后顺序就重要了

3 gram:3个词为单位

Late Dirichlet Allocation (LDA)

LDA是一个topic model 文档内暗含了主题

PreviousSocial Network AnalysisNextTime Series

Last updated 5 years ago