In the context of data quality for warehousing, what best describes “data scrubbing” (also called data cleansing)?

Difficulty: Easy

Correct Answer: Upgrading data quality before it is moved into the warehouse

Explanation:


Introduction / Context:
Data scrubbing (cleansing) removes or corrects errors, inconsistencies, and duplicates to improve reliability. Performing cleansing before loading prevents polluting the warehouse and downstream analytics.
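To make this concrete, here is a minimal pre-load scrubbing sketch in Python (the record layout and rules are assumptions for illustration, not taken from any specific ETL tool): whitespace is trimmed, case is standardized, and duplicate rows are dropped so only clean records reach the warehouse.

    def scrub(records):
        """Trim whitespace, standardize case, and drop duplicates before load."""
        seen = set()
        clean = []
        for rec in records:
            name = rec["name"].strip().title()    # correct formatting errors
            email = rec["email"].strip().lower()  # standardize representation
            key = (name, email)
            if key in seen:                       # remove duplicate rows
                continue
            seen.add(key)
            clean.append({"name": name, "email": email})
        return clean

    rows = [{"name": "  alice SMITH ", "email": "A@X.COM"},
            {"name": "Alice Smith",    "email": "a@x.com"}]
    print(scrub(rows))  # one record survives: the duplicate is merged pre-load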



Given Data / Assumptions:

  • ETL pipelines include validation, standardization, deduplication, and reference-data conformance (see the conformance sketch after this list).
  • High-quality data is essential for trustworthy BI and ML.
  • Indexes are unrelated to cleansing.
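As an illustration of reference-data conformance, incoming codes can be checked against a canonical reference table and non-conforming rows routed aside. The table and alias map below are hypothetical values for this sketch:

    # Hypothetical reference table of canonical country codes.
    COUNTRY_REF = {"US", "GB", "DE", "IN"}
    ALIASES = {"USA": "US", "UK": "GB"}  # common variants mapped to canon

    def conform_country(code):
        """Return the canonical code, or None if the value cannot be conformed."""
        code = code.strip().upper()
        code = ALIASES.get(code, code)
        return code if code in COUNTRY_REF else None

    print(conform_country(" usa "))  # 'US'  -> conformed
    print(conform_country("XX"))     # None  -> route to repair/reject queue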


Concept / Approach:
Scrubbing is typically part of the “T” in ETL, prior to the load step. It may leverage rules, reference tables, postal standardization, and fuzzy matching to unify entities (e.g., customers).
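A common building block for such fuzzy matching is string similarity. The sketch below uses Python's standard-library difflib to decide whether two customer names likely refer to the same entity; the 0.85 threshold is an assumed tuning value, not a standard:

    from difflib import SequenceMatcher

    def same_customer(a, b, threshold=0.85):
        """Heuristic entity match: normalize, then compare similarity ratio."""
        a, b = a.strip().lower(), b.strip().lower()
        return SequenceMatcher(None, a, b).ratio() >= threshold

    print(same_customer("Jon Smith", "John Smith"))  # True  -> unify records
    print(same_customer("Jon Smith", "Jane Doe"))    # False -> keep separate

In practice the threshold is tuned against known duplicates, and borderline matches are often sent to a human review queue rather than merged automatically.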



Step-by-Step Solution:

1. Identify timing: scrubbing happens pre-load to keep the warehouse clean.
2. Eliminate options about indexes; indexing is a separate physical design task.
3. Choose the option that upgrades quality before the data lands in the DW.


Verification / Alternative check:
Most ETL toolchains and best practices recommend data quality gates prior to load (reject/repair/route records accordingly).
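As a sketch of such a gate (the validation rules and routing labels here are assumptions), each record is checked and routed before the load step: clean rows load, fixable rows go to a repair queue, and unrecoverable rows are rejected:

    def quality_gate(record):
        """Route a record to 'load', 'repair', or 'reject' before loading."""
        email = record.get("email", "").strip()
        if not email:
            return "reject"   # unrecoverable: no contact key at all
        if "@" not in email:
            return "repair"   # fixable: send to a data-steward queue
        return "load"         # passes the gate: safe to load

    for rec in [{"email": "a@x.com"}, {"email": "ax.com"}, {"email": ""}]:
        print(rec, "->", quality_gate(rec))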



Why Other Options Are Wrong:
After load: possible but suboptimal; fixes should prevent bad data from entering the warehouse in the first place.
Index creation: a physical design task, unrelated to cleansing.



Common Pitfalls:
Deferring quality fixes until after loading, which increases rework and can corrupt aggregates.



Final Answer:
Upgrading data quality before it is moved into the warehouse
