Database Reference
In-Depth Information
Figure 5-12 . Data Flow processing logic following a DQS Cleansing transform
Note Although the Confidence column's output by the DQS Cleansing trans-
forms are numeric, they are output as DT_WSTR(100) columns (strings). To check the
confidence level against a minimum threshold, you'll need to cast the value to a DT_R4
(float) or DT_R8 (double).
Performance Considerations
Data cleansing can be a CPU and memory intensive operation and may take some time
to complete. Domains that rely on online reference data services may round trip incom-
ing data to the Azure Data Marketplace, which will have a further impact on the time it
takes to cleanse your data. As a result, when processing large amounts of data, you will
typically want to reduce your dataset before passing it through the DQS Cleansing
transform.
The DQS Cleansing transform sends incoming data to the DQS server (running
within a SQL Server instance), where the actual cleansing operations are performed.
Although this may offload a lot of the work being done by the SSIS machine, there
may be some overhead in sending the data across the network to another server. Anoth-
er thing to note is that the DQS Cleansing transform is an asynchronous component,
which means it makes copies of data flow buffers at runtime. This can further impact
the performance of your data flow and is another reason for only passing through the
rows that need to be cleansed.
The following sections describe some package design tips that can be used to im-
prove overall performance when cleansing data with the DQS Cleansing transform.
Parallel Processing
The DQS Cleansing transform sends its rows to the DQS server one batch at a time.
This single threaded approach isn't ideal if you have a lot of spare CPU power on your
system, so designing your packages in a way that allows DQS to send multiple batches
to the server in parallel will give you a performance boost. You have two main options
for parallel processing. First, you can split the incoming rows down multiple paths and
have a separate DQS Cleansing transform on each path performing the same set of
work. If your data set has a key or row that can be easily split using SSIS Expressions,
you can use a Conditional Split transform. Otherwise, you can consider using a third
Search WWH ::




Custom Search