Building Data Engineering Pipelines in Python

The landing zone contains raw data, the clean zone contains cleaned data, and the business zone contains domain-specific data, usually prepared to solve business problems.

A file presented as Parquet may actually be a CSV; running the shell command "file filename" shows the file's actual type.
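A similar check can be sketched in Python. The example below relies on the fact that Parquet files start and end with the four magic bytes PAR1; the function name looks_like_parquet is made up for illustration, and the check only inspects those bytes rather than validating the whole file.

```python
import os
from pathlib import Path

# Parquet files begin and end with this 4-byte marker.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file has the Parquet magic bytes at both ends.

    Hypothetical helper: a quick sanity check, not a full validation.
    """
    data_path = Path(path)
    # A file too short to hold both markers cannot be Parquet.
    if data_path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with data_path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        # Jump to the last four bytes of the file.
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A CSV saved under a .parquet name fails this check, because it starts with ordinary text instead of the marker.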

type(catalog["diaper_reviews"].read()) shows what kind of object the catalog entry is read into, which reveals the file's internal structure.

Singer’s core concepts

Aim: “The open-source standard for writing scripts that move data”

Singer is a specification

  • data exchange format: JSON

  • extract and load with taps and targets => language independent

  • communicate over streams using three kinds of JSON messages: schema (metadata), state (process metadata), record (data)
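The three message kinds can be sketched with the standard library alone. This is a minimal illustration, not the singer-python library: the helper names, the "products" stream and its fields are all made up, while the JSON keys (type, stream, schema, key_properties, record, value) follow the Singer specification as I understand it.

```python
import json
import sys


def schema_message(stream, schema, key_properties):
    """Describe the shape of the records in a stream (metadata)."""
    return {
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    }


def record_message(stream, record):
    """Wrap one row of actual data."""
    return {"type": "RECORD", "stream": stream, "record": record}


def state_message(value):
    """Carry process metadata, for example how far the extraction got."""
    return {"type": "STATE", "value": value}


def emit(message, out=None):
    """Write one message as a single JSON line (stdout by default)."""
    out = out if out is not None else sys.stdout
    out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    # Hypothetical stream with two fields, used only for illustration.
    schema = {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        }
    }
    emit(schema_message("products", schema, ["id"]))
    emit(record_message("products", {"id": 1, "name": "diapers"}))
    emit(state_message({"last_id": 1}))
```

Because a tap only writes JSON lines like these to standard output and a target only reads them from standard input, the two sides can be written in different languages and connected with a shell pipe.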
