Database Reference
In-Depth Information
party component like the Balanced Data Distributor. The second approach is to design
your data flow in such a way that multiple instances of your package can be run in par-
allel. For this approach to work, you will need to partition your source query so that it
pulls back a certain key range, and each instance of the package will work on a differ-
ent range. This approach gives you a bit more flexibility, because you can dynamically
control how many package instances you run in parallel by playing with the key
ranges.
Note You might find that the DQS Client performs its cleansing operations faster
than the DQS Cleansing transform in SSIS. This is because the client processes multiple
batches in parallel by default, whereas the DQS Cleansing transform processes them
one at a time. To get the same performance in SSIS as you do in the DQS Client, you'll
need to add your own parallelism.
Tracking Which Rows Have Been Cleansed
You can track which rows have already been cleansed and when the cleansing opera-
tion was performed. This allows you to filter out rows that have already been cleansed
so you don't need to process them a second time. By using a date value for this marker,
you can also determine which rows need to be reprocessed if your knowledge base gets
updated. Remember, as your knowledge base changes and your cleansing rules im-
prove, you will get more accurate results each time data is processed by the DQS
Cleansing transform.
To track when a row has been cleansed, add a new datetime column to your des-
tination table ( DateLastCleansed ). A NULL or very early date value can be used
to indicate that a row has never been processed. Alternatively, you can track dates in a
separate table, linked to the original row with a foreign key constraint. Your SSIS pack-
age will contain the following logic:
1. Retrieve the date the DQS knowledge base was last updated using an
Execute SQL task. This value should be stored in a package variable
( @[User::DQS_KB_Date] ).
2. Inside of a Data Flow task, retrieve the data to be cleansed with the
appropriate source component. The source data should contain a
DateLastCleansed column to track when the row was last pro-
cessed with the DQS Cleansing transform.
Search WWH ::




Custom Search