Public Preview of Native Execution Engine for Apache Spark on Fabric Data Engineering and Data Science
We are excited to announce that the Native Execution Engine for Fabric Runtime 1.2 is now available in public preview. The Native Execution Engine showcases our dedication to innovation and performance, transforming data processing in Microsoft Fabric.
The Native Execution Engine leverages technologies such as a columnar format and vectorized processing to boost query execution performance. In internal tests based on the TPC-DS 1TB benchmark, it delivered a 4x speedup over OSS Spark. This leap in performance is achieved by translating Spark SQL code into optimized C++ code, coupled with Microsoft's internal query optimizations embedded in Apache Spark on Fabric.

One of the standout features of the Native Execution Engine is its seamless integration with Apache Spark™ APIs. Whether you are working with PySpark, Scala, R, or Spark SQL, enabling this engine requires no code alterations and raises no concerns about vendor lock-in. It works with both Parquet and Delta formats, ensuring your data is processed efficiently whether it resides in OneLake or is accessed via shortcuts. Users benefit from accelerated processing times, which in turn lead to significant operational cost savings.
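To make the columnar idea concrete, the sketch below contrasts row-oriented and column-oriented layouts in plain Python. It is purely illustrative of the data-layout principle behind engines like Velox, not the engine's actual C++ implementation.

```python
# Illustrative only: the layout idea behind columnar, vectorized engines.

# Row-oriented: each record is a dict; an aggregation touches every field
# of every record, even the ones it does not need.
rows = [
    {"id": 1, "amount": 10.0, "region": "EU"},
    {"id": 2, "amount": 25.5, "region": "US"},
    {"id": 3, "amount": 4.5,  "region": "EU"},
]
row_total = sum(r["amount"] for r in rows)

# Column-oriented: each column is a contiguous sequence, so the same
# aggregation scans one tight array - exactly the access pattern that
# vectorized (SIMD-friendly) kernels exploit.
columns = {
    "id": [1, 2, 3],
    "amount": [10.0, 25.5, 4.5],
    "region": ["EU", "US", "EU"],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 40.0
```

Both layouts yield the same answer; the difference is how much memory the aggregation has to walk through to get it.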
Powered by innovations
At the heart of the Native Execution Engine are two key open-source components: Velox, a community-sustained C++ database acceleration library created and open sourced by Meta, and Apache Gluten (incubating), initiated by Intel, a middle layer responsible for offloading the execution of JVM-based SQL engines to native engines. Gluten offers seamless integration with existing Spark applications and employs modern C++ for accelerated native code execution. Utilizing CPU vectorization and advanced memory management, Gluten provides features such as native query execution, columnar-based shuffle, and compatibility with Spark’s Catalyst optimizer. This combination enables the Native Execution Engine to leverage the native capabilities of underlying data sources, minimizing the overhead associated with data movement and serialization in traditional Spark workloads. This high-performance solution is poised to revolutionize big data analytics by significantly improving the efficiency of Spark SQL’s operations.
Query optimization improvements
Microsoft has enhanced Apache Spark query optimizations and integrated these into Apache Spark on Fabric, including:
- Split Block Bloom Filters to reduce false positives when verifying the existence of elements in large datasets.
- Parquet Footer Caching to reduce I/O operations by caching metadata, which speeds up data retrieval in large-scale analytics projects.
- Smart Shuffle Optimizations to improve data distribution across nodes, minimizing network overhead.
- Optimized Sorting for Window Functions to accelerate data sorting within partitions for more efficient processing.
For further details on these native optimizations, refer to the article and previous blog posts: New Query Optimization Techniques in the Spark Engine of Azure Synapse, Apache Spark in Azure Synapse – Performance Update, and Speed up your data workloads with performance updates to Apache Spark 3.1.2 in Azure Synapse.
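To illustrate the first of these optimizations, the toy Bloom filter below shows the core membership-test idea: a query can quickly skip data that definitely does not contain a key, at the cost of occasional false positives. Real split-block Bloom filters divide the bit array into cache-line-sized blocks for locality; this simplified sketch uses a single flat bit array.

```python
import hashlib

# Minimal Bloom filter sketch: no false negatives, occasional false
# positives. A simplified stand-in for the split-block variant used in
# the engine, for illustration only.

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions from k seeded hashes of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["order-1", "order-2", "order-3"]:
    bf.add(key)

assert bf.might_contain("order-2")  # added keys always test positive
# Absent keys usually test negative; a True here would be a false positive.
print(bf.might_contain("order-999"))
```

Reducing the false-positive rate, as the split-block variant does for a given filter size, means fewer wasted reads of data blocks that turn out not to contain the key.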
The Native Execution Engine architecture
Within the Spark framework, the Catalyst optimizer remains a pivotal component of query planning. Spark begins by parsing a query, applying numerous optimization rules to forge an optimized logical plan. This plan is subsequently transformed into a physical plan slated for execution. Gluten plays a crucial role here, converting this physical plan into a Substrait plan and utilizing JNI (Java Native Interface) calls to initiate execution in Velox. Velox, recognized as a robust C++ database acceleration library, offers reusable, extensible, and high-performance data processing components. In instances where Velox does not support a particular operator, the system reverts to the existing Spark JVM-based operator. This fallback mechanism, however, brings with it the overhead of converting data between columnar and row formats, impacting performance.
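The offload-and-fallback behavior described above can be sketched conceptually as follows. The operator names and the "natively supported" set are illustrative assumptions for the sketch, not the engine's actual operator coverage; the point is that each boundary between a native segment and a JVM segment implies a columnar-to-row (or row-to-columnar) conversion.

```python
# Conceptual sketch of Gluten-style operator offload with JVM fallback.
# Operator names and NATIVE_SUPPORTED are illustrative assumptions.

NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}

def plan_segments(physical_plan):
    """Split a linear physical plan into alternating native/JVM segments
    and count the columnar<->row conversions needed at the boundaries."""
    segments, conversions = [], 0
    for op in physical_plan:
        target = "native" if op in NATIVE_SUPPORTED else "jvm"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)  # extend the current segment
        else:
            if segments:
                conversions += 1  # segment boundary => format conversion
            segments.append((target, [op]))
    return segments, conversions

# One unsupported operator in the middle forces a fallback to the JVM,
# costing a conversion on the way out and another on the way back.
plan = ["Scan", "Filter", "CustomUDFExec", "Project", "HashAggregate"]
segments, conversions = plan_segments(plan)
print(segments)
print(conversions)  # 2
```

This is why fully offloaded queries see the largest gains: a plan with no fallbacks stays in the columnar format end to end.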
Microsoft’s recent rollout of a Native Execution Engine marks a significant enhancement to Apache Spark on Fabric, building upon open-source software while also contributing innovations back to the Gluten and Velox projects. This initiative introduces several pivotal improvements:
- the integration of the Azure Blob Filesystem (ABFS) storage adapter;
- advanced operators such as Expand, BroadcastNestedLoopJoin, and CartesianProduct;
- a suite of new functions, including uuid, date_from_unix_date, and from_utc_timestamp;
- support for INT96/INT64 timestamps in Velox parquet scans;
- the capability to handle metadata columns within Gluten;
- significant reliability enhancements achieved by addressing over 300 unit tests across 40 suites for Apache Spark versions 3.3, 3.4, and 3.5.
Scenarios for accelerated performance
The current release of the Native Execution Engine performs particularly well when:
- working with data in Parquet and Delta formats;
- handling queries that involve complex transformations and aggregations, benefiting from the engine’s columnar processing and vectorization capabilities;
- running computationally intensive queries, rather than simple or I/O-bound operations.
Enabling the Native Execution Engine
To use the Native Execution Engine, no code changes are required. During the preview stage, you have the flexibility to activate the Native Execution Engine either through your environment settings or selectively for an individual notebook or job. Enabling this feature within your environment item ensures that all subsequent jobs and notebooks associated with that environment will automatically inherit this configuration. For comprehensive guidance on how to activate this feature, please visit our documentation.
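As an example, enabling the engine for a single notebook session during the preview is a configuration change rather than a code change. The exact property names below reflect the preview documentation and may evolve, so please confirm them against the current documentation:

```
%%configure
{
    "conf": {
        "spark.native.enabled": "true",
        "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"
    }
}
```

All subsequent cells in the session then run with native execution where supported, with no changes to the queries themselves.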
Looking ahead, we are actively developing mechanisms to enable the Native Execution Engine at the tenant, workspace, and environment levels, ensuring seamless integration with the UI.
Results from early adopters
Miles Cole (Director, Data & AI, Hitachi Solutions America) and Sandeep Pawar (Senior Power BI Architect, Hitachi Solutions America) tested the Native Execution Engine on various Data Engineering and Data Science workloads in Fabric. Miles focused on testing a wide array of Spark SQL functionality representative of typical analytical queries, including complex joins, aggregations, windowing, and other expressions. He saw an average 1.6x improvement in execution duration at the 100-million-row scale. Sandeep focused on testing data science workflows involving NLP applications, such as text processing and transformations, and training ML models. He saw on average a 1.3-2x speedup. Hitachi Solutions is excited about the Native Execution Engine and anticipates using it for all Spark workloads, since it provides a significant improvement in execution time without any code changes.
Better together – acknowledging our partners
We want to recognize the invaluable contributions of our partners in developing the components of the Native Execution Engine:
- Meta – for their active contributions to Velox.
- IBM and members of the Presto foundation – for their active contributions to Velox.
- Intel – for their active contributions to Apache Gluten (incubating) and over the past year of close-knit collaboration with the Microsoft team, seamlessly incorporating the Native Execution Engine from Gluten and Velox into Apache Spark on Microsoft Fabric.
Additionally, we extend our gratitude to the more than 150 contributors to Velox and Gluten. Their commitment and ongoing efforts to accelerate big data query execution empower every individual and organization that processes big data to achieve more, thanks to elevated performance. This collaboration has also been essential in bringing the Native Execution Engine into the hands of our customers.
Learn more, and help us with your feedback
We encourage you to share your feedback directly to our product team by using this form. We look forward to your valuable input and are eager to discuss your findings in detail.