Microsoft Fabric Updates Blog

Semantic Link: Data validation using Great Expectations

Great Expectations Open Source (GX OSS) is a popular Python library that provides a framework for describing and validating the acceptable state of data. It helps data engineers and data scientists ensure that their data meets specific quality standards before using it for analysis, machine learning, or other data-driven tasks. With the recent integration of Microsoft Fabric semantic link, GX can now access semantic models, further enabling seamless collaboration between data scientists and business analysts.

Semantic link is a feature in Microsoft Fabric that establishes a connection between semantic models (aka Power BI datasets) and Synapse Data Science. It facilitates data connectivity, enables the propagation of semantic information, and seamlessly integrates with established tools used by data scientists, such as notebooks. Semantic link helps preserve domain knowledge about data semantics in a standardized way that can speed up data analysis and reduce errors.

Ensuring data quality in the semantic model, also known as the diamond layer, is crucial for organizations to make informed decisions based on accurate and reliable data. Data scientists play a vital role in this process by validating, cleaning, and transforming raw data into meaningful insights. With the integration of Microsoft Fabric semantic link and Great Expectations, data scientists can now leverage the power of both platforms to ensure the highest quality of data assets in the semantic model.

Here’s why ensuring data quality in the semantic model is important:

1. Trustworthy insights: high-quality data assets in the semantic model lead to more accurate and reliable insights, enabling organizations to make better-informed decisions. Data scientists can use GX to define and validate data quality standards, ensuring that the data used in the semantic model is consistent, complete, and accurate.

2. Improved collaboration: the integration of semantic link and GX allows data scientists and business analysts to work together seamlessly, sharing a common understanding of data quality standards. This collaboration ensures that both parties can efficiently and effectively use the data in the semantic model, maximizing the potential of their data-driven insights.

3. Reduced errors: by validating data quality in the semantic model, data scientists can identify and address potential issues before they impact downstream processes, such as reporting and analytics. This proactive approach helps reduce errors and minimize the risk of making decisions based on inaccurate or incomplete data.

In this blog post, we explore the core concepts of GX: Data Sources, Data Assets, Expectations, and Checkpoints. We also discuss the new integration with Microsoft Fabric semantic link, which lets you access semantic models and apply the large library of Expectations provided by GX and its community.

Core GX Concepts: Data Sources and Assets

GX revolves around four core components:

  1. Data Sources: Connect to your data, regardless of its format or location, and organize it for future use.
  2. Data Assets: Collections of records within a Data Source that can be further partitioned into Batches.
  3. Expectations: Verifiable assertions about your data that describe the standards it should conform to.
  4. Checkpoints: Validate a set of Expectations against a specific set of data.
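
To make the relationship between these four components concrete, here is a toy model in plain Python. It is not the GX API: the `Expectation` class, the `run_checkpoint` function, and the sample records are all hypothetical and exist only to illustrate how an assertion is checked against a collection of records.

```python
# Toy model of the four GX concepts in plain Python (NOT the GX API).
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expectation:
    """A verifiable assertion that every record in a batch should satisfy."""
    name: str
    check: Callable[[dict], bool]


def run_checkpoint(records, expectations):
    """Validate each expectation against the records and report the outcome."""
    report = {}
    for exp in expectations:
        failed = [r for r in records if not exp.check(r)]
        report[exp.name] = {"success": not failed, "unexpected_count": len(failed)}
    return report


# "Data Source": where the data lives. "Data Asset": a named collection of records.
# The data below is made up for illustration.
data_source = {
    "stores": [
        {"name": "Store A", "open_year": 2005},
        {"name": None, "open_year": 2010},
    ]
}

expectations = [
    Expectation("name_not_null", lambda r: r["name"] is not None),
    Expectation("open_year_in_range", lambda r: 1999 <= r["open_year"] <= 2012),
]

report = run_checkpoint(data_source["stores"], expectations)
print(report)
```

In GX itself, Expectations are library classes rather than lambdas, and a Checkpoint can also trigger follow-up actions, but the flow of data, assertion, and validation result is the same.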

GX provides a vast library of Expectations, which are classes that implement specific validations. These Expectations can be used to ensure that your data meets the required quality standards.

New Integration: Accessing Semantic Models with Semantic Link

The new integration lets GX access Power BI datasets in Microsoft Fabric through semantic link. It adds new methods to the GX API, including add_fabric_powerbi for the Data Source and add_powerbi_table_asset, add_powerbi_measure_asset, and add_powerbi_dax_asset for Assets based on tables, measures, and DAX queries.

In the example below, we first create a GX Data Context and add a GX Data Source for a semantic model. We then add a GX Asset for a Power BI table, a GX Asset for Power BI measures, and a GX Asset for Power BI DAX queries.

[Screenshot: notebook cell creating the Data Context, the semantic model Data Source, and the three Assets]
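
A minimal sketch of these steps, using the method names listed earlier. The keyword arguments (`dataset`, `table`, `measure`, `groupby_columns`, `dax_string`) follow our recollection of the Fabric tutorial and should be checked against your installed GX version. Every Data Source, Asset, table, and measure name is a placeholder.

```python
def add_semantic_model_assets(context):
    """Register a semantic model and three kinds of Assets on a GX Data Context.

    Assumes a GX release that ships the Fabric integration and a Fabric
    notebook session. All names below are placeholders, not real objects.
    """
    # Data Source backed by a semantic model (Power BI dataset).
    ds = context.sources.add_fabric_powerbi(
        "Retail Data Source",            # hypothetical Data Source name
        dataset="Retail Sample Model",   # hypothetical semantic model name
    )

    # Asset that reads a whole table of the semantic model.
    ds.add_powerbi_table_asset("Store Asset", table="Store")

    # Asset that evaluates a measure, grouped by the given columns.
    ds.add_powerbi_measure_asset(
        "Total Units Asset",
        measure="TotalUnits",
        groupby_columns=["Time[FiscalYear]", "Time[FiscalMonth]"],
    )

    # Asset defined by an arbitrary DAX query.
    ds.add_powerbi_dax_asset(
        "Units By Year Asset",
        dax_string="""
        EVALUATE SUMMARIZECOLUMNS(
            'Time'[FiscalYear],
            "Total Units", [TotalUnits]
        )
        """,
    )
    return ds
```

In a Fabric notebook you would typically obtain the context with `import great_expectations as gx` followed by `context = gx.get_context()`, then call `add_semantic_model_assets(context)`.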

We are now ready to define our Expectations, which verify our assertions about the data:

[Screenshot: notebook cell defining the Expectations]
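
As a sketch of what defining Expectations can look like, the block below lists a few Expectations as plain (type, kwargs) pairs and adds them to a suite. The expectation types are standard GX names. The column names, value set, and row-count bounds are hypothetical, and the suite calls assume a GX 0.18-style API.

```python
# (expectation type, kwargs) pairs. Column names and bounds are hypothetical.
STORE_EXPECTATIONS = [
    ("expect_column_values_to_not_be_null", {"column": "Name"}),
    ("expect_column_values_to_be_in_set",
     {"column": "Chain", "value_set": ["Chain A", "Chain B"]}),
    ("expect_table_row_count_to_be_between", {"min_value": 1, "max_value": 1000}),
]


def add_suite(context, suite_name, specs):
    """Create an Expectation Suite from (type, kwargs) pairs and persist it."""
    # Imported lazily so the spec list can be reviewed without GX installed.
    from great_expectations.core.expectation_configuration import (
        ExpectationConfiguration,
    )

    suite = context.add_expectation_suite(suite_name)
    for expectation_type, kwargs in specs:
        suite.add_expectation(
            ExpectationConfiguration(expectation_type=expectation_type, kwargs=kwargs)
        )
    # Save the suite so a Checkpoint can later look it up by name.
    context.add_or_update_expectation_suite(expectation_suite=suite)
    return suite
```

Keeping the specifications as data makes it easy to review them with business analysts before they are registered, for example with `add_suite(context, "Retail Store Suite", STORE_EXPECTATIONS)`.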

Expectations are reusable across Data Assets; thus, we need to specify which Expectations we want to apply to which Asset:

[Screenshot: notebook cell assigning Expectations to Assets]
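
One way to express this pairing is a Checkpoint configuration in which each validation entry names an Expectation Suite and the Asset it applies to. The sketch below builds such a configuration as a plain dictionary. The key names follow the GX 0.18-style Checkpoint format as we understand it, and the helper function and all Data Source, Asset, and suite names are hypothetical.

```python
def build_checkpoint_config(name, datasource_name, suite_by_asset):
    """Build a Checkpoint config that pairs each Asset with an Expectation Suite."""
    return {
        "name": name,
        "validations": [
            {
                "expectation_suite_name": suite_name,
                "batch_request": {
                    "datasource_name": datasource_name,
                    "data_asset_name": asset_name,
                },
            }
            for asset_name, suite_name in suite_by_asset.items()
        ],
    }


# Placeholder names: one suite per Asset.
config = build_checkpoint_config(
    "Retail Checkpoint",
    "Retail Data Source",
    {
        "Store Asset": "Retail Store Suite",
        "Total Units Asset": "Retail Measure Suite",
    },
)
print(config["validations"][0])
```

With a GX Data Context at hand, the dictionary can be registered with `checkpoint = context.add_checkpoint(**config)`. Because the mapping is explicit, the same suite can be listed against several Assets.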

Finally, we run the Checkpoint and inspect the results or use them for further automation steps.
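
For automation, it is often enough to extract the failed Expectations from the Checkpoint result. The sketch below works on the dictionary form of a result. It assumes the layout we recall from `to_json_dict()` on a GX 0.18 Checkpoint result (`run_results`, then `validation_result`, then `results`), so verify it against your version. The `failed_expectations` helper and the sample result are hypothetical.

```python
def failed_expectations(result_dict):
    """Return (expectation type, column) for every failed Expectation."""
    failures = []
    for run in result_dict.get("run_results", {}).values():
        for outcome in run["validation_result"]["results"]:
            if not outcome["success"]:
                cfg = outcome["expectation_config"]
                failures.append((cfg["expectation_type"], cfg["kwargs"].get("column")))
    return failures


# Hand-written sample in the assumed layout, for illustration only.
sample_result = {
    "success": False,
    "run_results": {
        "validation-1": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "Name"},
                        },
                    },
                    {
                        "success": False,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_be_in_set",
                            "kwargs": {"column": "Chain", "value_set": ["Chain A"]},
                        },
                    },
                ],
            }
        }
    },
}

failures = failed_expectations(sample_result)
print(failures)
```

In a notebook, the input would come from something like `checkpoint.run().to_json_dict()`, and a non-empty list could be used to fail a pipeline step or raise an alert.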

You can find the tutorial notebook with additional examples in our GitHub samples repository.

Conclusion

The new integration between Great Expectations and Microsoft Fabric unlocks the potential of semantic models for data validation and quality assurance. By enabling seamless access to Power BI data in the familiar GX environment, data scientists and business analysts can collaborate more effectively, ensuring that their data-driven insights are based on high-quality, reliable data.

Start leveraging the power of semantic models in Great Expectations today and unlock the full potential of your data-driven insights.
