For more than a decade, the technology industry has been searching for optimal ways to store and analyze vast amounts of data while handling the variety, volume, latency, resilience, and varying data access requirements demanded by organizations.

Historically, organizations have implemented siloed, separate architectures: data warehouses to store structured, aggregated data primarily for BI and reporting, and data lakes to store large volumes of unstructured and semi-structured data primarily for ML workloads. This approach often resulted in extensive data movement, processing, and duplication, requiring complex ETL pipelines. Operationalizing and governing such an architecture was challenging, costly, and reduced agility. As organizations move to the cloud, they want to break down these silos.

To address these issues, a new architectural choice has emerged: the data lakehouse, which combines key benefits of data lakes and data warehouses.

At Google Cloud, we believe in providing choice to our customers. Organizations that want to build their data lakehouse using only open source technologies can do so with low-cost object storage from Google Cloud Storage, data stored in open formats like Parquet, processing engines like Spark, and frameworks like Delta Lake, Apache Iceberg, or Apache Hudi running on Dataproc to enable transactions. This architecture offers low-cost storage in an open format accessible by a variety of processing engines like Spark, while also providing powerful management and optimization features.
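As a concrete illustration of the open source path described above, the following is a minimal sketch of the Spark configuration needed to use Delta Lake as the transaction layer over Parquet files in a Cloud Storage bucket. The bucket name, table path, and application name are hypothetical placeholders, not values from this article; the two `spark.sql.*` settings are the standard ones documented by the Delta Lake project.

```python
# Sketch: enabling Delta Lake on a Spark cluster (e.g. Dataproc) so that
# Parquet data in a GCS bucket gains ACID transactions. Assumes the
# pyspark and delta-spark packages (or the Dataproc Delta connector)
# are available at runtime.

# Standard Delta Lake settings: register Delta's SQL extensions and
# its catalog implementation with Spark.
delta_spark_conf = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

# Hypothetical GCS location where the Delta table lives
# (Parquet data files plus the _delta_log transaction log).
table_path = "gs://example-lakehouse-bucket/tables/events"

def build_spark_session(conf):
    """Build a SparkSession with the given config applied.

    Deferred import so this sketch can be inspected without Spark installed.
    """
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("lakehouse-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Usage on a cluster with the Delta connector available:
# spark = build_spark_session(delta_spark_conf)
# spark.range(10).write.format("delta").mode("overwrite").save(table_path)
```

Because the table is just Parquet files plus a transaction log in object storage, any engine with a Delta (or Iceberg/Hudi) connector can read it, which is what keeps the architecture open rather than tied to one processing engine.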