Demystifying Data Ingestion in Fabric: Fundamental Components for Ingesting Data into a Fabric Lakehouse using Fabric Data Pipelines
✎ Co-Author – Abhishek Narain
Overview
Building an effective Lakehouse starts with establishing a robust ingestion layer. Ingestion refers to the process of collecting, importing, and processing raw data from various sources into the data lake. Data ingestion is fundamental to the success of a data lake as it enables the consolidation, exploration, and processing of diverse and raw data. It forms the backbone for downstream analytics, machine learning, and reporting activities, equipping organizations with the flexibility and agility to extract meaningful insights from their data. In this blog, we will walk through the key components of a data Ingestion framework based on experience working with different customers while building a lakehouse in Fabric.
Data ingestion can be divided into two types –
- One-time ingestion – the initial load of historical data into the data lake.
- Incremental ingestion – after the one-time load, only new or changed data is ingested on a recurring schedule.
Please note that we have only considered batch ingestion; real-time ingestion is out of scope for this blog.
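The difference between the two load types usually comes down to a watermark filter on the source query. Below is a minimal Python sketch of that idea; the table name, watermark column, and cutoff value are hypothetical placeholders, not part of any Fabric API.

```python
# Minimal sketch of full vs. incremental load queries, assuming the source
# table has a modified_at column and the control table stores the last
# watermark. Table, column, and watermark values here are hypothetical.

def build_incremental_query(table: str, watermark_col: str, last_watermark: str) -> str:
    """Build a source query that reads only rows changed since the last load."""
    return f"SELECT * FROM {table} WHERE {watermark_col} > '{last_watermark}'"

# One-time (full) load: no watermark filter.
full_query = "SELECT * FROM dbo.orders"

# Incremental load: filter on the cutoff stored by the previous run.
incr_query = build_incremental_query("dbo.orders", "modified_at", "2024-01-01T00:00:00")
```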
Basic components of Ingestion framework
The graphic below represents the basic building blocks (components) of the ingestion framework.
The table below details each component and its purpose.
# | Component Name | Purpose |
---|---|---|
1 | Control Table | Control table is used to control and select an item from the source system to be moved into a data lake. |
2 | Data Ingestion | Data Copy component is used to copy the data from source system to data lake. Typically, same format is used for ingesting data from Source to Bronze Layer/Staging layer but we have the flexibility to change the source type during ingestion. |
3 | Auditing | Auditing component is used to audit the records ingested into data lake. Auditing includes identifying any errors or issues that may have occurred during the process, as well as reviewing performance metrics to identify areas for improvement. |
4 | Notification | Notification component is used to notify in case of success or failure events. This is super critical to ensure that the operation team is notified during all critical events. |
5 | Config Management | Config Management component is used for managing source and target configurations, such as connection details. |
6 | Reporting | This component is used for visual reporting of different ingestion metrics. |
1. Control Table – This component is used to control and select an item to be moved into a Lakehouse, and is typically stored as a table in a Fabric Data Warehouse. Below are the typical columns used to control ingestion into the lakehouse:
- source_type – the type of source (Blob, ADLS Gen2, Database, AWS S3).
- source_details – the source location, which could be container details in the case of blob storage, or database and table details in the case of a relational data source.
- target_details – the lakehouse or data warehouse location where the data needs to be ingested.
Below is a sample Fabric data pipeline that looks up this control table to drive ingestion into the lakehouse.
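To make the control-table columns concrete, here is a small Python sketch of what the pipeline's Lookup step conceptually does: read the control rows and return the active items to ingest. All row values and the `is_active` flag are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical control-table rows, mirroring the columns described above.
control_rows = [
    {
        "source_type": "ADLS Gen2",
        "source_details": {"container": "raw", "folder": "sales/"},
        "target_details": {"lakehouse": "bronze_lh", "path": "Files/sales/"},
        "is_active": 1,
    },
    {
        "source_type": "Database",
        "source_details": {"database": "erp", "schema": "dbo", "table": "orders"},
        "target_details": {"lakehouse": "bronze_lh", "table": "orders"},
        "is_active": 1,
    },
    {
        "source_type": "Blob",
        "source_details": {"container": "archive", "folder": "legacy/"},
        "target_details": {"lakehouse": "bronze_lh", "path": "Files/legacy/"},
        "is_active": 0,  # deactivated items are skipped by the pipeline
    },
]

def items_to_ingest(rows):
    """Mimic the pipeline Lookup activity: return only the active items."""
    return [r for r in rows if r["is_active"] == 1]
```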
2. Data Ingestion – This component is essential for copying data from the source. A key consideration is datatype mapping from source to target, as Microsoft Fabric stores data in Delta Lake format. For more information on the datatypes supported by Microsoft Fabric, refer here. There may be scenarios where unsupported data types are encountered. For example, in SQL Server, data types such as geography, money, and datetime2(7) do not have corresponding equivalents in Delta Lake. This must be handled carefully to avoid loss of information when ingesting data into the Microsoft Fabric Lakehouse or warehouse.
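One way to handle the unsupported types mentioned above is an explicit type-override table applied before ingestion. The target types below are assumptions chosen for illustration; verify them against the official Fabric data-type documentation before relying on them.

```python
# Illustrative overrides for SQL Server types without direct Delta equivalents.
# The chosen target types are assumptions, not an official mapping.
TYPE_OVERRIDES = {
    "geography": "string",       # e.g. serialize to WKT text before ingesting
    "money": "decimal(19,4)",    # preserve scale explicitly
    "datetime2(7)": "timestamp", # note: sub-microsecond precision may be lost
}

def map_source_type(sql_type: str) -> str:
    """Return a Delta-compatible type, falling back to the source type name."""
    return TYPE_OVERRIDES.get(sql_type.lower(), sql_type)
```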
3. Auditing – The auditing component is used to inspect the records ingested into the lakehouse. It helps in tracking data lineage, ensuring transparency in data transformations. It logs errors and anomalies, aiding in issue detection and correction. Performance metrics are captured to optimize the transformation processes. It also supports compliance by maintaining historical records of data activities. Additionally, it provides an audit trail for accountability and security.
Fabric’s data warehouse can be used to store all the audit information. Some of the key fields that can be audited are –
Field Name | Purpose |
---|---|
source_type | Type of source (Blob,ADLS Gen2,Database,AWS S3) |
event_run_id | Captured from data pipeline run id (Data pipeline system variable) e.g. 36ff2fde-dd85-4b1f-982e-651530df7c90. |
item_name | Actual Item Captured from source e.g. For files Container/folder name & for Database it will be Database.schemaname.tablename |
data_read | Total amount of data retrieved from source data store, including data copied in this run and data resumed from the last failed run. |
data_written | The actual amount of data written/committed to the sink. The size may differ from data_read, as it relates to how each data store stores the data. |
files_read | Total number of files copied from source data store, including files copied in this run and files resumed from the last failed run. Note: In case of database this field will be NA |
files_written | Number of files written/committed to the sink. Note: In case of a database this field will be NA |
rows_read | Number of rows read from the source. |
rows_written | Number of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. |
data_consistency_verification | Indicates whether the data has been verified to be consistent between source and destination store after being copied. |
copy_duration | Total time taken to copy the data. |
event_start_time | Time when the pipeline started. |
event_end_time | Time when the pipeline ended. |
source_cutoff_time | The incremental pipeline execution time, updated on each incremental load. In case of a full load it will be the same as the source_cutoff_time in the control table. |
load_type | Load type: either Full or Incr (Incremental). |
status | Status: either success or failure. |
event_triggered_by | Trigger type: either Manual or Scheduled. |
error_details | Contains error codes and messages that pinpoint the location and nature of the problem. |
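Many of the fields above can be assembled from the Copy activity's output before being written to the audit table. The sketch below assumes output keys in the style of the Data Factory Copy activity (`dataRead`, `rowsCopied`, etc.); treat the exact key names as assumptions to confirm against your pipeline's actual output.

```python
from datetime import datetime, timezone

def build_audit_record(run_id: str, item_name: str, copy_output: dict) -> dict:
    """Assemble an audit row from a Copy-activity-style output dict.

    Key names (dataRead, dataWritten, rowsRead, rowsCopied, copyDuration,
    errors) follow the usual Copy activity output shape but are assumptions.
    """
    return {
        "event_run_id": run_id,
        "item_name": item_name,
        "data_read": copy_output.get("dataRead", 0),
        "data_written": copy_output.get("dataWritten", 0),
        "rows_read": copy_output.get("rowsRead", 0),
        "rows_written": copy_output.get("rowsCopied", 0),
        "copy_duration": copy_output.get("copyDuration", 0),
        "status": "failure" if copy_output.get("errors") else "success",
        "event_end_time": datetime.now(timezone.utc).isoformat(),
    }
```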
4. Notification – The notification component alerts in the event of success or failure. This is crucial to ensure the operations team is informed during all critical events. In Microsoft Fabric, the Office 365 Outlook activity in Data Factory allows you to send an email with your Office 365 account. There is also a Teams activity in Data Factory that allows you to send a message to a Teams channel or group chat. The message in both the Office 365 Outlook and Teams activities can include dynamic expressions to be customized as much as necessary. Below is a sample Teams notification.
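The message body itself is just templated text. Here is a small Python sketch of composing one; in the pipeline this would instead be a dynamic expression (for example using `@pipeline().Pipeline` and `@pipeline().RunId`), and the function and field names below are purely illustrative.

```python
def build_notification(pipeline: str, run_id: str, status: str, error: str = "") -> str:
    """Compose a notification body for a Teams/Outlook activity.

    Illustrative only: in a real pipeline this templating is done with
    Data Factory dynamic expressions rather than Python.
    """
    msg = f"Pipeline {pipeline} (run {run_id}) finished with status: {status}."
    if error:
        msg += f" Error: {error}"
    return msg
```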
5. Config Management – Config management is used to manage configurations, which are essentially the connection details of the source.
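A common approach is to keep these connection details in a JSON document outside the pipeline and look them up by source name at run time. The sketch below uses entirely hypothetical source names and fields.

```python
import json

# Hypothetical config document holding connection details per source.
SAMPLE_CONFIG = """
{
  "sources": {
    "erp_db": {"type": "Database", "server": "erp.contoso.com", "database": "erp"},
    "raw_files": {"type": "ADLS Gen2", "account": "contosolake", "container": "raw"}
  }
}
"""

def load_source_config(name: str, raw: str = SAMPLE_CONFIG) -> dict:
    """Return the connection details for one named source."""
    cfg = json.loads(raw)
    return cfg["sources"][name]
```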
6. Reporting – The ingestion reporting dashboard is a single pane of glass displaying all the information captured by the auditing component. One of the advantages of Fabric is that all audit information captured in the data warehouse can be visualized directly with Power BI. Below is a sample ingestion report.
List of key design considerations
When designing an ingestion framework, adhere to several fundamental principles to ensure its effectiveness, maintainability, and adaptability:
- Prioritize simplicity to reduce errors and facilitate understanding and maintenance.
- Embrace modularity by dividing the system into smaller, independent components for ease of development and maintenance, while maintaining high cohesion within each module to focus on specific tasks.
- Minimize coupling to allow independent modifications, ensuring scalability to handle growing data volumes, and support incremental loading for efficient updates.
- Implement comprehensive monitoring and logging, along with robust metadata management, for tracking data and system health.
- Aim for high performance through parallel ingestion and maintain extensive documentation to aid understanding, usage, collaboration, and troubleshooting.
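The parallel-ingestion principle above can be sketched in a few lines of Python: run independent copy operations concurrently instead of one after another. The `ingest_item` stub stands in for a real copy operation and is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_item(item: str) -> str:
    """Placeholder for a single copy operation (stubbed for illustration)."""
    return f"ingested {item}"

def ingest_all(items, max_workers: int = 4):
    """Run independent copy operations in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ingest_item, items))
```

In a Fabric pipeline the same effect is achieved by running ForEach iterations in parallel rather than sequentially.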
Conclusion
I hope this blog helped you understand how to design an effective ingestion framework for a lakehouse using the Fabric platform.
Special shout out to Gyani, Winnie, Polly, Hideo, and Dharmendra for their continuous support!