Building Data Engineering Pipelines in Python
The landing zone contains raw data, the clean zone contains cleaned data, and the business zone contains domain-specific data, usually prepared to solve a particular business problem.

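A file that is supposed to be Parquet may actually be a CSV; run file filename in the shell to see the file's actual type.
type(catalog["diaper_reviews"].read()) shows which data structure the file's contents are loaded into.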
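When the file command is not available, a similar check can be done from Python. The sketch below relies on the fact that Parquet files begin and end with the four magic bytes PAR1; the function name looks_like_parquet and the example path are made up for illustration and are not part of the course material.

```python
import os

# Parquet files start and end with these four magic bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file at `path` carries the Parquet magic bytes.

    This only inspects the first and last four bytes; it does not
    validate the footer or the data pages.
    """
    # The smallest possible layout is magic + 4-byte footer length + magic,
    # plus a non-empty footer, so anything under 12 bytes cannot be Parquet.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # Hypothetical path, used only to show the call.
    path = "landing/diaper_reviews.parquet"
    if os.path.exists(path):
        print(path, "is Parquet:", looks_like_parquet(path))
```

A file that fails this check but has a .parquet extension is a good candidate for opening in a text editor to see whether it is really a CSV.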
Singer’s core concepts
Aim: “The open-source standard for writing scripts that move data”
Singer is a specification
data exchange format: JSON
extract with taps and load with targets => language independent, since both only exchange JSON
taps and targets communicate over streams, using three kinds of messages: schema (metadata), state (process metadata), record (data)
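
To make the three message kinds concrete, here is a minimal sketch of a tap that writes Singer-style messages as one JSON object per line, using only the standard library. The message layout ("type" set to SCHEMA, RECORD or STATE, plus "stream", "schema", "key_properties", "record" and "value") follows the Singer specification; the stream name "products", its fields, and the helper names are assumptions made up for this example.

```python
import json
import sys


def schema_message(stream, schema, key_properties):
    # SCHEMA: metadata describing the records of a stream (JSON Schema).
    return {"type": "SCHEMA", "stream": stream,
            "schema": schema, "key_properties": key_properties}


def record_message(stream, record):
    # RECORD: the actual data, one message per row.
    return {"type": "RECORD", "stream": stream, "record": record}


def state_message(value):
    # STATE: process metadata, e.g. how far the extraction has progressed.
    return {"type": "STATE", "value": value}


def emit(message, out=None):
    # Each message is a single line of JSON; a target reads these lines from stdin.
    out = sys.stdout if out is None else out
    out.write(json.dumps(message) + "\n")


def run_tap(out=None):
    # Hypothetical stream and rows, for illustration only.
    schema = {"properties": {"id": {"type": "integer"},
                             "name": {"type": "string"}}}
    rows = [{"id": 1, "name": "diapers"}, {"id": 2, "name": "wipes"}]

    emit(schema_message("products", schema, ["id"]), out)
    for row in rows:
        emit(record_message("products", row), out)
    # Remember the last id seen so a later run could resume from there.
    emit(state_message({"last_id": rows[-1]["id"]}), out)


if __name__ == "__main__":
    run_tap()
```

Because the output is plain JSON lines on stdout, a target written in any language can consume it through a shell pipe, which is what makes the tap/target split language independent.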
