Chapter7 Big Data and Spark
Last updated
Last updated
Data Warehousing--OLAP
ACID (Atomicity, Consistency, Isolation, Durability)
DataBase--OLTP
้ข่ฏไธญ่่ฟๆฐๆฎๅบไธญ็ๅธธ่งjoin๏ผSort-based Merge JoinใHash-based Join
BigTable๏ผ้ฝๆฏkey-value pair๏ผๅ ไธบvalue็ๅผไธไธๅฎๆฐ็ฎ๏ผไธ่ฝ็จๅ ณ็ณปๅๆฐๆฎๅบไบใ
HBase๏ผcolumn familyๅtimestamp๏ผๆนไพฟๅๆฐๆฎ็ๆ็ดขๆทปๅ ๆฅๆพ๏ผ่ฟๆๆถ้ดๆณ
้ข่ฏ้ข๏ผmerge nไธชsorted list็ๅคๆๅบฆ
k่ฏปๅ ฅใkๆฅheapifyใnlogkๆฅpushๅpopkๆฌก
้ข่ฏ้ข๏ผๅฏนๆฐๆฎ้ไธๅคง็ๆๆกฃๅword count๏ผๆฐๆฎ้้ๅธธๅคง็๏ผๅ่ฏๆ่ๅด็ใๅ่ฏๆฒกๆ่ๅด็... ๅฆไฝๅๅฐๆๅฟซ
ๅฏไปฅhashmapๆ่ external sort๏ผmerge sort๏ผ็ๆๆณ๏ผ ๆฐๆฎ้้ๅธธๅคงๅฐฑexternal sort๏ผๅ่ฏๆ่ๅดๅฏไปฅhashmap๏ผๅ่ฏๆฒกๆ่ๅดๅช่ฝexternal sortใ ๅคไธชmachine๏ผcounting in parallel๏ผๆๅreduceๆฑๆป็ปๆใๆปไน๏ผmap๏ผshuffle๏ผreduce ๅฏนterabyte็ๆฐๆฎ่ฟ่กๆๅบ 1. sampleๆฐๆฎ ๅพๅฐhistogram 2. ๅฐฝ้่ฎฉๆฏไธชreducerๆไธๆ ทๆฐ้็data ๅๆถไนๆฏ่ตทๅๆไบ่ฟไนไธไธชๅฐ็ๆๅบๅๅฐreducer็ๅทฅไฝ followup๏ผๅฆไฝๅจไธๅ bucket้ๆพๆๆฒกๆๆไธชๆฐๅญ๏ผ ็ฑปไผผbinary search็ๆๅทง
้ข่ฏ้ข๏ผๅพฎๅ็ญๆ็top10ๆฏไธๅๅบๅ้ๅบtop10 ๆฏไธชๆบๅจๅค็ไธไธชๅญๆฏinitial ๆๅๆฑๆป
้ข่ฏ้ข๏ผๅฆๆdata skewๅฆไฝไผๅparallel่ฟ็จ ็ปhistogram
ๅฅฝๅค๏ผๆ่ฟนๅฏๅพช๏ผๅญไบไธญ้ด็ปๆ
ๅๅค๏ผๆฏๆฌก่ฏปๅ ฅๅๅ ฅ็ฃ็ๅพๆ ข
One Computation Engine to fit all
From disk to memory
From single job to DAG
From 1 pass to multiple pass
ๆๅๆ ็ฏๅพDAGๆนไพฟไผๅๆญฅ้ชค๏ผไธญ้ดๆฒกๆ่ฏปๅ
ๅฆๆๆ100GB data on disk๏ผ20G in memory๏ผSparkไผmemory and disk๏ผๆไธ้จๅๆพๅจmemory๏ผไธ้จๅๆพๅจdisk๏ผ้่ฟๆฅๅswap็ๆนๅผ๏ผๅจๆฐๆฎๆพไธไธmemory็ๆถๅๅไธไบๅฐdisk๏ผ้ข่ฏไธๅคชไผ่ไบ๏ผใ
Spark session๏ผ้ ็ฝฎไธไบๅบๆฌไฟกๆฏ๏ผๆฏๅฆapplication name๏ผmemoryๅCPU sparksessionไธๆไธไบfunction๏ผๆฏๅฆspark.read.format("csv")๏ผload่ฟๆฅๅฐฑๆฏspark dataframe
Spark contextๆฏspark RDDๆไฝ็ๆฅๅฃ๏ผ็ฑปๆฏๅฐฑๅๆฏSparkContextๆฏๅฎๅบ็ๅคง้จ๏ผ้้ข็ปRDDๅผไบไธไบๅฐ้จ๏ผ้่ฟdata=sc.textFile()ไน็ฑป็function
Spark SQL is data warehouse Spark ML support deep learning Spark Streaming is able to process stream data Spark SQL can query Hive Table
Organization data flow of RDD Optimize network communication cost One job is divided into different stages based on data shuffle
RDDๆฏๅๅธๅผๅ ๅญ็ฏๅข็ๆฝ่ฑก๏ผๆ่ ่ฏด๏ผๅๅธๅผๅ ๅญ็ไธไธชๅฏน่ฑก๏ผ ้่ฟRDD่ฎฟ้ฎๆฏไธไธชๆบๅจๅ ๅญ้็ๆฐๆฎ ไฝๆฏๅจ็จๆทไฝฟ็จๆถๆฏๆๅไธๅฐๅฎๆๅคไธชๅๅ็ RDD is an immutable collection of objects that can be operated on in parallel (้ข่ฏๅธธ่)
Sparkไธญๅฎ็ฐไบๅ ๅญ็บงๅซ็ๅคไปฝ๏ผuse lineage information, userไธ่ฝ่ฎฟ้ฎ็๏ผ๏ผๅ็กฌ็็บงๅซ็ๅคไปฝ๏ผcheckpoint๏ผ
Resilient: RDD keeps its lineage information; it can be recreated from parent RDDs
Distributed: partitions can be distributed across multiple nodes in the cluster; each RDD is composed of 1 or more partitions.
Big data object abstraction
Contains multiple data partitions
Can persist into disk
Spark RDD caching
save the iterative computation cost
caching data into memory function is : RDD.cache()
Store data into memory
Reusable for different stages
Reduce overhead to read data from disk
ๅฝๅ ๆๅ ฌๅธ้ข่ฏ้ฎๅฐ๏ผgroupbykeyๅreducebykeyๆไปไนๅบๅซ๏ผ
reducebykeyๅญไบไธไธชไธญ้ด็ปๆ๏ผไธญ้ดๆไธไธชC2๏ผ่ฟๅฐฑๅๅฐไบไธญ้ดๆฐๆฎ็shuffleใreducebykey็ธๅฝไบๆaggregation็ๆไฝๆพๅจไบๅ้ขๆฅๅใ
Spark data lineageๅฏไปฅfault tolerance๏ผtrack back to the data source
่พๅ ฅๆฏkey-value๏ผ่พๅบๆฏ the partition id
ๅ ๅชๆRDD๏ผๆฒกๆdataframe็ๆฆๅฟต๏ผRDD is a key-value pairs object, ๆไปฅๆฐๆฎๆฒกๆschemaใ่็จdataframe็ๆถๅ๏ผ่ฝ็ดๆฅassessๆไธๅ๏ผๅ ไธบๆฐๆฎๆไบschemaใ
่ณไบSQL Query๏ผๅ ถๅฎๅฏไปฅๅdataframe่ตฐไธๆ ท็้ป่พๅผๆ๏ผ่ฟๅฐฑๆฏไธบไปไนๅฎไปฌไน้ดๅฏไปฅไบ็ธ่ฝฌๅ๏ผdataframe็็ปๆๅฏไปฅ่ฝฌๆขๆquery๏ผๅไนไบฆ็ถใ
ML-lib API: ้ๅฏน็ๆฏRDD็ๆไฝ
Spark-ML: ้ๅฏน็ๆฏdata frame็ๆไฝ
Spark is memory based Hive is based on MapReduce Hive Spark can execute Hive table
็จไธไธชๅ จๅฑๅ้ๅญๆพไธไธๆฌก่ฟญไปฃๅ็่ดจๅฟ
map้๏ผ่ฎก็ฎๆฏไธช่ดจๅฟไธๆ ทๆฌไน้ด็่ท็ฆป๏ผๅพๅฐไธๆ ทๆฌ่ท็ฆปๆ็ญ็่ดจๅฟ๏ผไปฅ่ฟไธช่ดจๅฟไฝไธบkey๏ผๆ ทๆฌไฝไธบvalue๏ผ่พๅบ
reduce้๏ผ่พๅ ฅ็keyๆฏ่ดจๅฟ๏ผvalueๆฏๅ ถไป็ๆ ทๆฌ๏ผ่ฟๆถ้ๆฐ่ฎก็ฎ่็ฑปไธญๅฟ๏ผๅฐ่็ฑปไธญๅฟputๅฐไธไธชๅ จ้จๅ้tไธญใ
ๅจmain้ๆฏ่พๅไธๆฌก็่ดจๅฟๅๆฌๆฌก็่ดจๅฟๆฏๅฆๅ็ๅๅ๏ผๅฆๆๅๅ๏ผๅ็ปง็ปญ่ฟญไปฃ๏ผๅฆๅ้ๅบใ
่็ๅๆณ๏ผๆๆฐๆฎๅๆๅๅ๏ผ็ถๅๆฏไธๅๅญฆw๏ผๆๅๆฑๆปๅนณๅ๏ผไฝ้ฎ้ขๅจไบ่ฟไธไธๅฎๆฏๆๅฅฝ็็ปๆ๏ผๆฏๅฆๅฏนไบskewed dataๅฐฑไธ่ก๏ผ
็ฐๅจ็ๅๆณ๏ผparameter server๏ผๆฐๆฎๅๆๅคไธชๅญๅจ๏ผๅๆฐไนๅๅๅญๅจ
Sampling ๅฆๆๆฏfraud๏ผๅฏ่ฝsample็ๆถๅๅฎๅ จไธไผ็ๅฐfraud cases๏ผ sampleไธ่ฝ็จไบๅalerting็๏ผ ๅฆๅค ๅพๅฐ็้ฝๆฏapproximation
Sketching ๅฐฝ้ๅฐ็จๅ ๅญ๏ผๅฏนไบๆฐไธๅ ๆฐ็ๆถๅๆ็จ ๅพๅฐ็้ฝๆฏaccurate็
Concept Drift ๆฏๅ ถไธญๆ้่ฆ็๏ผ้่ฆๅ ณๆณจๆฏๅฆๆdrifting
ๅaggregate็ธๆฏ๏ผๅไธชdata็้่ฆๆง่ฆๅคงๅพๅค๏ผๅ ไธบ่ฟๅฐฑๆฏๅๆ็ๅฏน่ฑก
Approaches๏ผ
Reservoir Sampling๏ผ sample first m items, ็ผบ็นๆฏไธ่ฝparallelize
Min-Wise Sampling๏ผ generate a random number uniformly distributed between 0 and 1,
"Smart" Sampling: domain knowledge sampling, ๆฏๅฆsecurity็ไบบไผ็ฅ้ๅ ๅ ณๆณจsmall packets data