Microsoft Fabric Updates Blog

Optimizing Spark Compute for Medallion Architectures in Microsoft Fabric

Guidance to Maximizing Productivity and Efficiency for your Data Engineering Workloads

Data engineering teams often grapple with the complexities of planning and configuring compute resources for their data platforms. This is especially true when working with large-scale, complex datasets and demanding downstream SLAs. A one-size-fits-all approach is rarely effective, as different data layers and datasets have unique requirements that necessitate tailored compute configurations.

To address these challenges, organizations often adopt a layered data architecture pattern. This pattern emphasizes modularity and scalability, allowing for better data management and governance. One popular approach is the Medallion architecture.

What is a Medallion Architecture?

Medallion architecture is a popular data engineering approach that emphasizes modularity. It organizes the data platform into three distinct layers: Bronze, Silver, and Gold. This layered structure aims to improve data quality and accessibility as data moves from left to right across the layers.

Enterprise data platforms often employ similar layered structures to ingest data from various on-premises and cloud sources, build entities, and ultimately serve these data assets through reports or applications for broader organizational insights. This common approach aligns well with the modular philosophy of the Medallion pattern.

Tailoring Compute for Medallion Layers

Its crucial to make sure that the compute configurations are custom fit for these different layers for achieving better performance and efficiency in your data engineering workloads.

Bronze Layer: The Bronze layer serves as the raw data landing zone, where unstructured data is ingested “as-is” from various sources (on-premises or cloud). This layer typically handles full or delta loads and acts as an archive of the original data. To minimize delays in subsequent processing stages, it’s essential to maintain a write-optimized configuration for the Bronze layer so the data ingestion jobs complete within your sla requirements.

Spark Configurations for Bronze Layer:

spark.sql("ALTER TABLE $tableName SET TBLPROPERTIES ('delta.checkpointInterval' = '25')"

delta.checkpointInterval: Checkpointing helps Delta Lake maintain data consistency during writes. Setting a higher interval reduces the frequency of checkpoints, allowing for faster data ingestion.

spark.databricks.delta.collect.stats: This collects statistics on Delta tables for faster future queries. However, for the bronze layer with frequent writes with potentially changing data and scenarios where there are no join, filter or aggregate queries on these tables, disabling stats would make the jobs more optimized.

spark.conf.set("spark.sql.parquet.vorder.enabled","false")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","false")
spark.conf.set("spark.databricks.delta.collect.stats","false")

spark.sql.parquet.vorder.enabled and spark.databricks.delta.optimizeWrite.enabled configurations are primarily beneficial for sorting and data analysis (more relevant to Gold layer which is which is read-heavy). Disabling them in the Bronze layer avoids unnecessary overhead during writes (bypassing steps to reorganize data), which translates to faster data landing.

Silver Layer: In the Silver layer, data is cleaned, enriched, and subjected to data quality checks. While not directly consumable for business reporting, this layer is vital for applying business requirements and standardizing data into a common format.

The Silver layer serves as a valuable resource for data engineers, data scientists, and analysts who create data entities and business aggregates for the Gold layer. However, in the silver layer, where frequent data writes occur alongside occasional sorting, vorder can introduce some overhead due to additional steps involved to reorganize data during writes.

Disabling vorder prioritizes write performance for the silver layer. This is crucial because data transformation and cleaning in the silver layer often involve numerous write operations.

Spark Configurations for Silver Layer:

spark.conf.set("spark.sql.parquet.vorder.enabled","false")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","false") 
                                  or 
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize","157286400")

Disabling delta.optimizeWrite helps eliminate pre-shuffle step, bypassing the effort to reorganize data during write leading to quicker write operations. Alternatively, you can fine-tune the optimization for Silver layers that are used for querying and data exploration, by enabling the optimizeWrite property and setting the binSize value between 150-250 MB can enhance parallelism for Spark jobs that create aggregates in the Gold layer.

Gold Layer: The Gold layer houses aggregated and curated data entities designed for business consumption. These entities are primarily accessed by business users and analysts to derive key insights. To ensure a low-latency experience for concurrent queries from multiple users, the Gold layer should be read-optimized.

Spark Configurations for Gold Layer:

spark.conf.set("spark.sql.parquet.vorder.enabled","true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize","1073741824")

Gold layer tables are configured with vorder and optimizeWrite enabled, and a binSize of 1GB to minimize the number of files and optimize for read performance. Since the Gold layer is heavily queried, vorder can dramatically improve query execution times, especially for complex queries involving sorting and with optimizeWrite enabled Delta Lake automatically reorganizes data during writes, making it more efficient for queries to scan and filter data in the Gold layer to provide faster reads for Power BI reports accessed by business users.

You can configure all the above-mentioned properties using the environment item and attach them to your notebook or spark job definitions.

You can also use magic commands (%%configure) for notebook jobs in your Fabric workspace.

We hope these tips help you tailor your Spark compute configurations to meet the needs of your Medallion architecture and achieve optimal performance.

To learn more about how to create, configure and use an environment, please refer to our documentation Create, configure, and use an environment in Fabric – Microsoft Fabric | Microsoft Learn

To learn more about how to configure your sessions using magic commands, please refer to our documentation Develop, execute, and manage notebooks – Microsoft Fabric | Microsoft Learn

Bài đăng blog có liên quan

Optimizing Spark Compute for Medallion Architectures in Microsoft Fabric

tháng 10 29, 2024 của Dandan Zhang

Managed private endpoints allow Fabric experiences to securely access data sources without exposing them to the public network or requiring complex network configurations. We announced General Availability for Managed Private Endpoint in Fabric in May of this year. Learn more here: Announcing General Availability of Fabric Private Links, Trusted Workspace Access, and Managed Private Endpoints. … Continue reading “APIs for Managed Private Endpoint are now available”

tháng 10 28, 2024 của Estera Kot

We’re thrilled to announce that the Native Execution Engine is now available at no additional cost, unlocking next-level performance and efficiency for your workloads. What’s New?  The Native Execution Engine now supports Fabric Runtime 1.3, which includes Apache Spark 3.5 and Delta Lake 3.2. This upgrade enhances Microsoft Fabric’s Data Engineering and Data Science workflows, … Continue reading “Native Execution Engine available at no additional cost!”