Spark SQL
Window Function
# Load trainsched.txt
df = spark.read.csv("trainsched.txt", header=True)
# Create temporary table called table1
df.createOrReplaceTempView('table1')
# Inspect the columns in the table df
spark.sql("DESCRIBE schedule").show()
# window function
# LEAD function can query more than 1 row in a table without having to self join
query = """
SELECT train_id, station, time
LEAD(time, 1) OVER (ORDER BY time) AS time_next
FROM sched
WHERE train_id=324 """
spark.sql(query).show()
# Add col running_total that sums diff_min col in each group
query = """
SELECT train_id, station, time, diff_min,
SUM(diff_min) OVER (PARTITION BY train_id ORDER BY time) AS running_total
FROM schedule
"""
query = """
SELECT *
ROW_NUMBER() OVER (PARTITION BY train_id ORDER BY time) AS id
FROM schedule
"""
#using dot notation
from pyspark.sql import Window
from pyspark.sql.functions import row_number
df.withColumn(
"id", row_number()
.over(Window.partitionBy('train_id').orderBy('time'))
)
#using a WindowSpecDot notation and SQL
Load natural language text
Caching:
Eviction Policy Least Recently Used (LRU) Caching is a lazy operation. It requires an action to trigger it.
eg. - spark.sql("select count(*) from text").show() - partitioned_df.count()
Spark UI
- Spark Task is a unit of execution that runs on a single cpu - Spark Stage a group of tasks that perform the same computation in parallel, each task typically running on a different subset of the data - Spark Job is a computation triggered by an action, sliced into one or more stages. - Jobs, Stages, Storage, Environment, Executors, SQL -Storage: in memory, or on disk, across the cluster, at a snapshot in time.
如果是local,从http://[DRIVER_HOST]:4040开始依次往下。
Logging primer
如果用一个timer来对logging计时,
要注意的是即使是在info的level,debug里的操作其实还是执行了,比如
比较好的做法是,disable action
(行吧,这章没学懂)
Query Plans
Extract, Transform, Select
Creating feature data for classification
CountVectorizer is a Feature Extractor Its input is an array of strings Its output is a vector
Text Classification
Predicting and evaluating
prediction : 1.0 label : 1 endword : him doc : ['and', 'pierre', 'felt', 'that', 'their', 'opinion', 'placed', 'responsibilities', 'upon', 'him'] probability : [0.28537355252312796,0.714626447476872]
Last updated