Difficulty: Medium
Correct Answer: By using change capture logic such as timestamps or change data capture stages to filter only new and updated rows since the last successful run
Explanation:
Introduction / Context:
Incremental load is a core concept in extract, transform, load (ETL) and data integration projects built with IBM InfoSphere DataStage. Instead of loading all records from the source every time, an incremental load job picks up only new or changed rows and applies them to the target data store. This question checks your understanding of how to design such a job in DataStage so that it is both efficient and reliable.
Given Data / Assumptions:
The job runs on a recurring schedule against a source that exposes some way to detect changes, such as audit timestamp columns or a change data capture feed, and a control table is available to persist state between runs.
Concept / Approach:
The core idea of incremental load is to identify which rows have changed since the last successful run and to move only those rows. In DataStage this can be achieved by using change capture stages, comparing source and target key sets, or filtering on audit columns such as last update date. The design usually includes storing a watermark, for example the maximum processed timestamp, and using that value on the next run to pull only newer records. This approach reduces load time, network traffic, and impact on the source system.
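As a concrete illustration of the watermark-driven extract, here is a minimal Python sketch using an in-memory SQLite database as a stand-in for the source and control tables. The table and column names (etl_control, src_orders, last_update_ts) are hypothetical; in a real job the equivalent filter would live in a DataStage connector stage.

```python
import sqlite3

# In-memory stand-ins for the source table and the ETL control table
# (hypothetical names: src_orders, etl_control, last_update_ts).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL, last_update_ts TEXT);
    CREATE TABLE etl_control (job_name TEXT PRIMARY KEY, watermark TEXT);
    INSERT INTO src_orders VALUES
        (1, 10.0, '2024-01-01 09:00:00'),
        (2, 20.0, '2024-01-02 12:30:00');
    INSERT INTO etl_control VALUES ('orders_load', '2024-01-01 23:59:59');
""")

# Read the watermark recorded by the last successful run.
(watermark,) = conn.execute(
    "SELECT watermark FROM etl_control WHERE job_name = ?", ("orders_load",)
).fetchone()

# Pull only rows newer than the watermark -- the core of the incremental extract.
delta = conn.execute(
    "SELECT id, amount, last_update_ts FROM src_orders WHERE last_update_ts > ?",
    (watermark,),
).fetchall()
print(delta)  # only the 2024-01-02 row qualifies
```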
Step-by-Step Solution:
1. Identify change detection columns in the source, such as last_update_timestamp or a change flag.
2. Store the watermark from the last successful run, such as the maximum last_update_timestamp in a control table.
3. In the DataStage job, read only rows where last_update_timestamp is greater than the stored watermark.
4. Use a Change Capture and Change Apply pattern, or a lookup against the target, to decide whether each row is an insert or an update.
5. After a successful run, update the control table watermark so the next run starts from the correct point. A sketch of steps 3 through 5 follows this list.
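The following minimal Python sketch walks through steps 3 to 5, using plain dictionaries as stand-ins for the control table and the target lookup. The names (control, target, orders_load) are hypothetical; a real DataStage job would express the same logic with Change Capture/Change Apply or Lookup stages.

```python
from datetime import datetime

# Dictionaries as stand-ins for the control table and the target lookup
# (hypothetical names: control, target, orders_load).
control = {"orders_load": datetime(2024, 1, 1)}
target = {1: {"amount": 10.0}}  # target rows keyed by business key
source = [
    {"id": 1, "amount": 15.0, "last_update_ts": datetime(2024, 1, 3)},   # changed
    {"id": 2, "amount": 20.0, "last_update_ts": datetime(2024, 1, 2)},   # new
    {"id": 3, "amount": 5.0,  "last_update_ts": datetime(2023, 12, 1)},  # old
]

# Step 3: keep only rows changed since the last successful run.
watermark = control["orders_load"]
delta = [r for r in source if r["last_update_ts"] > watermark]

# Step 4: a lookup against the target decides insert vs. update.
for row in delta:
    if row["id"] in target:
        target[row["id"]]["amount"] = row["amount"]    # update
    else:
        target[row["id"]] = {"amount": row["amount"]}  # insert

# Step 5: advance the watermark only after the load has succeeded.
if delta:
    control["orders_load"] = max(r["last_update_ts"] for r in delta)
print(target, control)
```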
Verification / Alternative check:
You can verify that the job is truly incremental by running it twice in a row without changing the source and confirming that the second run processes zero rows. Another check is to insert or update a few test records in the source, then confirm that only those records appear in the DataStage log and in the target after the next run. If you see full table scans and large numbers of unchanged rows, then the job is not configured as a proper incremental load.
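The run-twice check can be expressed as a simple test. The sketch below, with a hypothetical extract_delta helper and integer timestamps for brevity, asserts that an immediate re-run with no source changes yields zero rows.

```python
def extract_delta(source_rows, watermark):
    """Hypothetical helper: rows changed since the watermark."""
    return [r for r in source_rows if r["last_update_ts"] > watermark]

source = [{"id": 1, "last_update_ts": 5}, {"id": 2, "last_update_ts": 8}]
watermark = 3

first_run = extract_delta(source, watermark)             # picks up both rows
watermark = max(r["last_update_ts"] for r in first_run)  # advance after success

# With no source changes in between, a correct incremental job
# must process zero rows on the immediate re-run.
second_run = extract_delta(source, watermark)
assert second_run == [], "job is reprocessing unchanged rows"
```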
Why Other Options Are Wrong:
Common Pitfalls:
Common mistakes include forgetting to store and update the watermark, using an incorrect comparison operator on timestamps, or not handling late-arriving updates correctly. Another pitfall is ignoring deleted records; true incremental patterns often need a way to detect and propagate deletes. Designing a robust incremental load in DataStage means thinking carefully about change detection, control tables, error handling, and restartability, not just about raw performance.
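To see why the comparison operator matters, consider this toy illustration (integer timestamps for brevity): rows sharing the exact watermark value are reprocessed under a greater-than-or-equal filter, so the filter must either be strictly greater-than or the apply step must be idempotent.

```python
# Rows that share the exact watermark value expose the operator choice.
rows = [{"id": 1, "ts": 10}, {"id": 2, "ts": 10}, {"id": 3, "ts": 12}]
watermark = 10  # set by a previous run that already processed ids 1 and 2

strict = [r for r in rows if r["ts"] > watermark]       # only id 3
inclusive = [r for r in rows if r["ts"] >= watermark]   # ids 1, 2, and 3 again

print(strict)     # [{'id': 3, 'ts': 12}]
print(inclusive)  # reprocesses rows 1 and 2; safe only if the apply is idempotent
```

Note the trade-off: a strict greater-than avoids duplicates but can miss rows committed with the same timestamp after the watermark was read, which is why some designs deliberately use greater-than-or-equal together with idempotent upserts.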
Final Answer:
The correct approach is to use change capture logic, such as timestamps or change data capture stages, to filter only new and updated rows since the last successful run, because this implements a true incremental load pattern in DataStage rather than a full refresh.