Microsoft Fabric Updates Blog

Announcing improvements to CSV data ingestion in Synapse Data Warehouse in Microsoft Fabric

CSV files are widely used for data exchange and data ingestion into data warehouses, but they often pose challenges on performance. In accordance to a study from Microsoft Research, up to 90% of the total time spent in data ingestion occurs in parsing non-binary data such as JSON when using conventional file parsers. This is illustrated in Figure 1.

A bar chart comparing percentage of time spent on parsing versus query processing, for four different queries. The chart shows that more than 90% of the cost comes from parsing for queries one, two, and three, and over 80% of the cost comes from parsing for query number four.
Figure 1: Parsing vs. Query processing Cost
Twitter Dataset, Queries from [30], Spark+Jackson
Source: “Mison: A Fast JSON Parser for Data Analytics”

Today, we’re excited to announce a new, faster way to ingest data from CSV files into Data Warehouse in Microsoft Fabric: introducing CSV file parser version 2.0 for COPY INTO. The new CSV file parser builds on innovation from Microsoft Research’s Data Platform and Analytics group to make CSV file ingestion blazing fast on Data Warehouse.

Benefits

The performance benefits you will enjoy with the new CSV file parser vary depending on the number of files you have in the source, the size of these files, and the data layout. Our testing revealed an overall improvement of 38% in ingestion times on a diverse set of scenarios, and in some cases, more than 4 times faster when compared to the legacy CSV parser.  

How it works

To use the new CSV file parser, we have introduced a new option to the COPY INTO statement: PARSER_VERSION. When this option is used with the value ‘2.0’, the new CSV file parser is used. For example:

COPY INTO mytable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    PARSER_VERSION = '2.0' --this parameter is optional, and is the new default
)

The performance of the new CSV file parser is so great that we have decided to make it the default option for COPY INTO, so you don’t even have to specify that option to enjoy the benefits of the new file parser. In some rare cases, however, the new CSV parser is not supported, so you may need to use the legacy CSV file parser by specifying the option PARSER_VERSION = ‘1.0’ with COPY INTO. For more information on unsupported scenarios and full syntax, refer to our documentation.

Next steps

The new CSV file parser is now globally available. As mentioned, it is the new default file parser for CSV files during ingestion, so you do not need to do anything to enjoy its benefits.

To learn more about the Mison parser, visit Mison: A Fast JSON Parser for Data Analytics. Even though this research has been published with JSON formats at its origin, this work has since expanded to other file formats, such as CSV.

Bài đăng blog có liên quan

Announcing improvements to CSV data ingestion in Synapse Data Warehouse in Microsoft Fabric

tháng 10 31, 2024 của Jovan Popovic

Fabric Data Warehouse is a modern data warehouse optimized for analytical data models, primarily focused on the smaller numeric, datetime, and string types that are suitable for analytics. For the textual data, Fabric DW supports the VARCHAR type that can store up to 8KB of text, which is suitable for most of the textual values … Continue reading “Announcing public preview of VARCHAR(MAX) and VARBINARY(MAX) types in Fabric Data Warehouse”

tháng 10 29, 2024 của Dandan Zhang

Managed private endpoints allow Fabric experiences to securely access data sources without exposing them to the public network or requiring complex network configurations. We announced General Availability for Managed Private Endpoint in Fabric in May of this year. Learn more here: Announcing General Availability of Fabric Private Links, Trusted Workspace Access, and Managed Private Endpoints. … Continue reading “APIs for Managed Private Endpoint are now available”