Spark SQL

Window Function

# Load trainsched.txt
df = spark.read.csv("trainsched.txt", header=True)

# Create a temporary view called schedule (the name the queries below use)
df.createOrReplaceTempView('schedule')

# Inspect the columns of the schedule view
spark.sql("DESCRIBE schedule").show()

# window function 
# LEAD function can query more than 1 row in a table without having to self join
query = """
SELECT train_id, station, time,
LEAD(time, 1) OVER (ORDER BY time) AS time_next
FROM schedule
WHERE train_id=324 """
spark.sql(query).show()
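To make the LEAD semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what `LEAD(time, 1) OVER (ORDER BY time)` computes: each row is paired with the `time` of the row that follows it in sort order, and the last row gets `None` (NULL in SQL). The sample rows and the helper name `lead` are made up for illustration; they are not the contents of trainsched.txt.

```python
# Plain-Python illustration of LEAD(time, 1) OVER (ORDER BY time).
# Hypothetical sample rows, deliberately out of order.
rows = [
    {"train_id": 324, "station": "B", "time": "08:03"},
    {"train_id": 324, "station": "A", "time": "07:59"},
    {"train_id": 324, "station": "C", "time": "08:16"},
]


def lead(rows, column, offset=1, order_by=None):
    """Sort rows by `order_by` and add `<column>_next` from the row `offset` positions later."""
    ordered = sorted(rows, key=lambda r: r[order_by])
    result = []
    for i, row in enumerate(ordered):
        j = i + offset
        # Past the end of the window there is no following row, so use None (SQL NULL).
        next_value = ordered[j][column] if j < len(ordered) else None
        result.append({**row, f"{column}_next": next_value})
    return result


with_next = lead(rows, "time", offset=1, order_by="time")
for r in with_next:
    print(r["station"], r["time"], r["time_next"])
```

No self join is needed: one pass over the sorted rows is enough, which is the point of the window function.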

# Add col running_total that sums diff_min col in each group
query = """
SELECT train_id, station, time, diff_min,
SUM(diff_min) OVER (PARTITION BY train_id ORDER BY time) AS running_total
FROM schedule
"""
spark.sql(query).show()

# Add col id that numbers the rows within each train_id, ordered by time
query = """
SELECT *,
ROW_NUMBER() OVER (PARTITION BY train_id ORDER BY time) AS id
FROM schedule
"""
spark.sql(query).show()
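As a rough illustration of what `PARTITION BY train_id ORDER BY time` does for `SUM(...)` and `ROW_NUMBER()`, the following plain-Python sketch groups rows by `train_id`, orders each group by `time`, and restarts both the running total and the row number at every new group. The sample tuples and the function name `window_columns` are hypothetical, and the sketch assumes the times within a train are distinct (with ties, SQL's default window frame would add all tied rows at once).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (train_id, time, diff_min)
rows = [
    (324, "07:59", 4.0),
    (217, "06:01", 9.0),
    (324, "08:03", 13.0),
    (217, "06:10", 6.0),
    (324, "08:16", 8.0),
]


def window_columns(rows):
    """Return (train_id, time, diff_min, running_total, id) for every row."""
    out = []
    # Sort by partition key, then by order key, so each partition is contiguous.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for train_id, group in groupby(ordered, key=itemgetter(0)):
        running_total = 0.0  # restarts for every partition
        for row_number, (_, time, diff_min) in enumerate(group, start=1):
            running_total += diff_min
            out.append((train_id, time, diff_min, running_total, row_number))
    return out


result = window_columns(rows)
for row in result:
    print(row)
```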

# Using dot notation
from pyspark.sql import Window
from pyspark.sql.functions import row_number
df.withColumn(
    "id",
    row_number().over(Window.partitionBy('train_id').orderBy('time'))
)

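# Using a WindowSpec: define the window once, then reuse it
window = Window.partitionBy('train_id').orderBy('time')
df.withColumn("id", row_number().over(window))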

Dot notation and SQL

Load natural language text

Caching:

Eviction policy: Least Recently Used (LRU).

Caching is a lazy operation: it requires an action to trigger it.

e.g.
- spark.sql("select count(*) from text").show()
- partitioned_df.count()
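The LRU eviction policy itself can be sketched in a few lines of plain Python. This is only an illustration of the policy (when the cache is full, the entry that was used longest ago is dropped); it is not how Spark implements its storage layer, and the class name `LRUCache` and the keys are invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: the least recently used key is evicted when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, most recent last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("df_a", "cached partitions of A")
cache.put("df_b", "cached partitions of B")
cache.get("df_a")                            # df_a becomes the most recently used
cache.put("df_c", "cached partitions of C")  # cache is full, so df_b is evicted
print(cache.keys())
```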

Spark UI

- A Spark Task is a unit of execution that runs on a single CPU.
- A Spark Stage is a group of tasks that perform the same computation in parallel, each task typically running on a different subset of the data.
- A Spark Job is a computation triggered by an action, sliced into one or more stages.
- Tabs in the Spark UI: Jobs, Stages, Storage, Environment, Executors, SQL.
- Storage: shows what is cached in memory or on disk, across the cluster, at a snapshot in time.

When running locally, look for the Spark UI at http://[DRIVER_HOST]:4040 first, then try the following ports in order.

Logging primer

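If you use a timer to time the logging calls,

note that even when the level is set to INFO, the operation inside a debug call is still executed, for example

A better practice is to disable the action.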
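The behaviour described here can be reproduced with the standard `logging` module alone. In the sketch below, `expensive_action` is a hypothetical stand-in for a Spark action such as `df.count()`. The guard with `isEnabledFor` is one common way to avoid the hidden work; it is an assumption about what "disable the action" means in these notes, not necessarily the approach the original material used.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("logging_primer")
logger.setLevel(logging.INFO)

calls = []  # records every time the expensive function actually runs


def expensive_action():
    """Hypothetical stand-in for a Spark action such as df.count()."""
    calls.append("ran")
    return 42


# Python evaluates the argument before logger.debug() is entered, so the
# expensive call runs even though DEBUG messages are discarded at INFO level.
logger.debug("row count: %d", expensive_action())
calls_after_unguarded = len(calls)

# Guarded version: the action is skipped entirely when DEBUG is not enabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("row count: %d", expensive_action())
calls_after_guarded = len(calls)

print(calls_after_unguarded, calls_after_guarded)
```

Nothing is printed by either `debug` call, yet the first one still paid for the action, which is why timing code around logging statements can be misleading.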

(Okay, I did not really understand this chapter.)

Query Plans

Extract, Transform, Select

Creating feature data for classification

CountVectorizer is a Feature Extractor. Its input is an array of strings. Its output is a vector.
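The idea of count vectorization (array of strings in, vector of term counts out) can be sketched without Spark. The functions `fit_vocabulary` and `transform` and the sample documents are hypothetical. Spark's CountVectorizer produces sparse vectors and has further parameters; this sketch uses dense lists, and the alphabetical tie-breaking is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each term to an index, most frequent terms first (ties broken alphabetically)."""
    totals = Counter(token for doc in docs for token in doc)
    ordered = sorted(totals, key=lambda t: (-totals[t], t))
    return {term: idx for idx, term in enumerate(ordered)}


def transform(doc, vocabulary):
    """Turn one array of strings into a dense count vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in doc:
        idx = vocabulary.get(token)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] += 1
    return vector


docs = [
    ["the", "train", "left", "the", "station"],
    ["the", "station", "was", "empty"],
]
vocabulary = fit_vocabulary(docs)
vectors = [transform(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```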

Text Classification

Predicting and evaluating

prediction : 1.0
label : 1
endword : him
doc : ['and', 'pierre', 'felt', 'that', 'their', 'opinion', 'placed', 'responsibilities', 'upon', 'him']
probability : [0.28537355252312796,0.714626447476872]
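To read the row above: the `probability` vector holds one entry per class, and the `prediction` is the index of the larger entry. The short sketch below checks this with the values copied from that row. It assumes a binary classifier using the default decision rule (pick the most probable class), which the notes do not state explicitly.

```python
# Values copied from the example row above.
probability = [0.28537355252312796, 0.714626447476872]
label = 1

# Pick the class index with the highest probability.
predicted_class = max(range(len(probability)), key=probability.__getitem__)
prediction = float(predicted_class)

# The prediction is correct when it matches the label.
is_correct = prediction == float(label)
print(prediction, is_correct)
```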
