sizes such as 36, 38, 40, and so on. We will need to transform these into the
North American equivalents.
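The size conversion itself is a simple lookup. The Studio job would do this with a mapping component, but the logic can be sketched in plain Python; note that the mapping values below are illustrative assumptions, since the real chart would come from the retailer:

```python
# Hypothetical lookup converting European dress sizes to North American
# equivalents; the actual mapping must come from the retailer's sizing
# chart, not from the supplier's data.
EU_TO_NA_SIZES = {36: 6, 38: 8, 40: 10, 42: 12, 44: 14}

def to_north_american(eu_size: int) -> int:
    """Return the North American size for a European size, failing
    loudly on sizes the chart does not cover."""
    try:
        return EU_TO_NA_SIZES[eu_size]
    except KeyError:
        raise ValueError(f"No North American equivalent for EU size {eu_size}")
```

Rejecting unknown sizes, rather than passing them through, surfaces bad supplier data at load time instead of on the website.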
5. Third and fourth, we have product data from another of the retailer's
suppliers, Runway Collections Ltd. This supplier's product data comes in
two files, one containing the main product's content: names, descriptions,
and so on, and the other containing SKUs and prices only (this allows the
supplier to send price changes without having to send all of the product
data). To make this integration even more interesting, there's no guarantee
that a product file and a price file will arrive at the same time, and, even
if they do, they might contain data for different products and SKUs. This
presents a challenge for us. There's a constraint on the website such that it
can accept products without prices, but not prices without products, so we'll
need to figure out how we can work around this issue.
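One workable approach is to split each incoming price file into records whose products already exist on the website (safe to load) and records that must wait for product data. A minimal sketch of that partitioning, with the record layout and field names assumed for illustration:

```python
def split_prices(price_records, known_skus):
    """Partition price records into those whose SKU already exists on
    the website (loadable now) and those that must be deferred until
    the matching product file arrives."""
    loadable, deferred = [], []
    for record in price_records:
        if record["sku"] in known_skus:
            loadable.append(record)
        else:
            deferred.append(record)
    return loadable, deferred
```

Deferred records would simply be fed back through the same check when the next Runway Collections product file arrives.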
6. All three data sources described previously can send multiple datafiles per
day, and there's no fixed time for each file to be sent. Furthermore, the
source systems will FTP the data onto the server hosting the Studio and the
website into some nominated directories.
7. There is no connection between the three systems supplying the data and it
is possible that they may use the same product and SKU IDs, so we'll need
some way of making the SKUs unique across the website platform.
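A common way to guarantee uniqueness is to namespace each SKU with an identifier for its source system. A one-line sketch (the separator and source codes are assumptions, not platform requirements):

```python
def namespaced_sku(source: str, sku: str) -> str:
    """Prefix a SKU with its data-source identifier so that identical
    SKUs arriving from different suppliers cannot collide."""
    return f"{source}_{sku}"
```

With this scheme, SKU `1001` from the ERP system and SKU `1001` from Fabulous Fashions become distinct keys on the website.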
8. The datafiles are presented with filenames of a similar format, namely:
[data_source]_[yyyyMMddhhmmss].[file_extension]
Examples of filenames are:
° erp_20120930120000.xml
° fabulous_fashions_20120930142524.csv
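Because the filename carries both the source system and a timestamp, each job can identify and order its input files by name alone. A sketch of parsing that convention (the function name is mine; the pattern follows the format above):

```python
import re
from datetime import datetime

# Matches [data_source]_[yyyyMMddhhmmss].[file_extension]
FILENAME_RE = re.compile(r"^(?P<source>.+)_(?P<ts>\d{14})\.(?P<ext>\w+)$")

def parse_datafile_name(filename: str):
    """Split a datafile name into its source system, timestamp, and
    extension, raising on names that do not follow the convention."""
    match = FILENAME_RE.match(filename)
    if not match:
        raise ValueError(f"Unexpected filename format: {filename}")
    return (match.group("source"),
            datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S"),
            match.group("ext"))
```

Sorting files by the parsed timestamp ensures that, when several files from one source are waiting, they are processed in the order they were produced.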
Our next task is to pick out the key information from the previous scenarios and use
this to define the high-level job requirements.
1. The scenario described previously is quite complex, and it often makes sense
to break down complex requirements into smaller, simpler requirements. An
obvious way to do this here is to define four separate jobs, one for each data
source, rather than trying to combine the requirements into one mega job.
Sometimes this will not be appropriate, but in this case, we'll go with
this approach.
2. The website has a standard import process and it requires a file named
catalog.xml. As we have four data sources feeding into this process on
an undefined schedule, we need some way of checking that a file has not
already been presented to the website import process before we try to
present another, otherwise files will be overwritten.
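The guard can be as simple as waiting until the previous catalog.xml has been consumed before moving the next one into place. A minimal sketch, with the polling interval, timeout, and directory handling all chosen for illustration:

```python
import os
import shutil
import time

def present_when_clear(src_path, import_dir, target_name="catalog.xml",
                       poll_seconds=5, timeout_seconds=300):
    """Wait until the website's import directory no longer contains
    the target file, then move the next file into place. Timing values
    are illustrative, not prescribed by the platform."""
    target = os.path.join(import_dir, target_name)
    waited = 0
    while os.path.exists(target):
        if waited >= timeout_seconds:
            raise TimeoutError(f"{target} was not consumed in time")
        time.sleep(poll_seconds)
        waited += poll_seconds
    shutil.move(src_path, target)
```

This serializes the four jobs' outputs through the single catalog.xml handover point, so no job overwrites a file the website has not yet imported.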