Design from existing data: are inconsistent values a common problem?

Difficulty: Easy

Correct Answer: Valid (inconsistent codes and formats are common in legacy data)

Explanation:


Introduction / Context:
When building a database from existing datasets, one of the first obstacles is data quality. Inconsistent values (multiple spellings for the same concept, irregular date formats, mixed units, or ad hoc code systems) can derail a clean relational design. They must be identified during profiling and corrected before constraints are applied.


Given Data / Assumptions:

  • Legacy sources may include spreadsheets, CSVs, or lightly constrained tables.
  • Validation rules may have been enforced inconsistently, if at all.
  • Multiple systems may have contributed data, each with different conventions.


Concept / Approach:
Recognize and standardize synonymous or conflicting values (for example, “NY”, “N.Y.”, “New York”). Create reference tables and enforce foreign keys. Apply CHECK constraints for formats and ranges. Normalize units and use consistent data types. These steps reduce heterogeneity so the model can enforce integrity reliably.
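
The standardization idea can be sketched in Python as follows. This is a minimal illustration, not a complete cleansing routine: the mapping table CANONICAL_STATE and the helper names normalize_key and canonical_state are assumptions made for the example, and a real project would load the mapping from a reference table.

```python
import re
from typing import Optional

# Hypothetical mapping from normalized variants to one canonical code.
CANONICAL_STATE = {
    "ny": "NY",
    "new york": "NY",
    "ca": "CA",
    "california": "CA",
}

def normalize_key(raw: str) -> str:
    """Drop punctuation, collapse whitespace, and ignore case."""
    no_punct = re.sub(r"[^\w\s]", "", raw)
    return " ".join(no_punct.split()).casefold()

def canonical_state(raw: str) -> Optional[str]:
    """Return the canonical code, or None when the variant is unmapped."""
    return CANONICAL_STATE.get(normalize_key(raw))

variants = ["NY", "N.Y.", " new  york ", "Nueva York"]
print([canonical_state(v) for v in variants])
# → ['NY', 'NY', 'NY', None]
```

Returning None for unmapped variants, rather than guessing, leaves them visible for manual review.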


Step-by-Step Solution:

  • Profile each column for distinct values, patterns, and outliers.
  • Map variants to canonical codes (for example, use ISO country/state codes).
  • Introduce reference data tables and constraints to enforce consistency going forward.
  • Backfill and correct legacy rows to meet the new standards.
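
The profiling step can be sketched in Python as follows. The sample date column and the helper names shape and profile are hypothetical. Reducing each value to a pattern (digits become 9, letters become A) is one common way to expose mixed formats, not the only one.

```python
from collections import Counter

def shape(value: str) -> str:
    """Reduce a value to its pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile(values):
    """Return (distinct value counts, pattern counts) for one column."""
    cleaned = [v.strip() for v in values]
    return Counter(cleaned), Counter(shape(v) for v in cleaned)

# Hypothetical legacy date column with mixed formats.
dates = ["2021-03-04", "03/04/2021", "2021-03-05", "4 Mar 2021", "2021-03-04"]
distinct, patterns = profile(dates)
print(patterns.most_common())
# → [('9999-99-99', 3), ('99/99/9999', 1), ('9 AAA 9999', 1)]
```

Three patterns in a single date column signal that at least two formats need to be mapped to the canonical one before a constraint can be applied.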


Verification / Alternative check:
Once the data are cleansed and the constraints are in place, an attempt to insert an inconsistent value should fail, which confirms that the database now enforces the standard. Profiling reports should also show fewer distinct values in each standardized domain.
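
This check can be tried with Python's built-in sqlite3 module, as in the sketch below. The table and column names (state, customer, state_code, weight_kg) and the helper try_insert are assumptions for illustration. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.executescript("""
    CREATE TABLE state (
        code TEXT PRIMARY KEY CHECK (length(code) = 2)
    );
    CREATE TABLE customer (
        id         INTEGER PRIMARY KEY,
        state_code TEXT NOT NULL REFERENCES state(code),
        weight_kg  REAL CHECK (weight_kg > 0)
    );
    INSERT INTO state (code) VALUES ('NY'), ('CA');
""")

def try_insert(state_code, weight_kg):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (state_code, weight_kg) VALUES (?, ?)",
                (state_code, weight_kg),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert("NY", 70.5))    # canonical code: accepted
print(try_insert("N.Y.", 70.5))  # variant missing from the reference table: rejected
print(try_insert("CA", -1))      # out-of-range value: rejected by CHECK
```

A query such as SELECT COUNT(DISTINCT state_code) FROM customer then gives the distinct-value count mentioned above.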


Why Other Options Are Wrong:

  • Dismissing inconsistencies as rare is contradicted by practice.
  • Presence of primary keys or file format does not prevent inconsistent domain values.


Common Pitfalls:
Overlooking subtle variations (case, whitespace, punctuation); failing to convert units; neglecting historical data that do not meet current rules.
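
The unit-conversion pitfall can be illustrated with a small Python sketch. The factor table TO_KG and the function to_kilograms are hypothetical, and the choice of kilograms as the canonical unit is an assumption for the example.

```python
# Hypothetical conversion factors to the canonical unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def to_kilograms(amount: float, unit: str) -> float:
    """Convert an amount to kilograms; reject units that are not mapped."""
    key = unit.strip().casefold().rstrip(".")  # tolerate " KG", "lb.", and similar
    if key not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return amount * TO_KG[key]

print(to_kilograms(2, " KG "))
print(to_kilograms(10, "lb."))
```

Raising an error on an unknown unit keeps unconverted values from slipping silently into a column that is supposed to hold a single unit.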


Final Answer:
Valid (inconsistent codes and formats are common in legacy data)
