When analytics or machine learning results look “off”, the root cause is often data quality, not modelling. Data quality assessment gives you repeatable checks for whether data is usable, while statistical profiling creates a baseline so defects and sudden changes become visible. If you are building practical habits through a data scientist course in Delhi, these techniques help you move from intuition (“something seems wrong”) to evidence (“this field’s missingness doubled overnight”).
1) Statistical profiling: establish what “normal” looks like
Statistical profiling is a structured summary of a dataset at a point in time. The goal is to capture descriptive metrics that are quick to compute and easy to compare across runs.
For numerical columns, profile:
- Row count and non-null count
- Min, max, mean, median
- Spread (standard deviation) and tail values (p95/p99)
For categorical columns, profile:
- Distinct count and cardinality ratio (distinct/rows)
- Top values with frequencies and a rare-value share
- Null/blank share plus placeholders like “NA” or “unknown”
For timestamps, profile earliest/latest values, gaps, and out-of-order events.
Store these profiles per batch/day and compare them to a recent baseline. This immediately surfaces issues like a volume drop after an upstream deployment, a category disappearing, or a numeric feature collapsing to a constant.
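The profiling metrics above can be sketched in a few lines of pandas. This is a minimal illustration, not a full profiler; the column names, placeholder set, and sample data are assumptions for the example.

```python
import pandas as pd

def profile_numeric(s: pd.Series) -> dict:
    """Descriptive metrics for one numeric column."""
    return {
        "rows": len(s),
        "non_null": int(s.notna().sum()),
        "min": s.min(), "max": s.max(),
        "mean": s.mean(), "median": s.median(),
        "std": s.std(),
        "p95": s.quantile(0.95), "p99": s.quantile(0.99),
    }

def profile_categorical(s: pd.Series) -> dict:
    """Descriptive metrics for one categorical column."""
    n = len(s)
    placeholders = {"NA", "unknown", ""}  # assumed placeholder tokens
    return {
        "distinct": int(s.nunique(dropna=True)),
        "cardinality_ratio": s.nunique(dropna=True) / n,
        "top_values": s.value_counts(dropna=False).head(3).to_dict(),
        "null_or_placeholder_share": float((s.isna() | s.isin(placeholders)).mean()),
    }

# Illustrative batch
df = pd.DataFrame({
    "amount": [10.0, 12.5, 11.0, None, 250.0],
    "status": ["new", "active", "active", "unknown", "active"],
})
num = profile_numeric(df["amount"])
cat = profile_categorical(df["status"])
```

Persisting the two dictionaries per batch (for example, as one row per column per day) gives you the baseline that the comparisons below rely on.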
2) Completeness and validity: are the right values present and acceptable?
Completeness answers: “Do we have the data we expect?” Validity answers: “Do the values conform to the allowed rules?”
Completeness can be measured with:
- Field completeness = 1 − (missing_count / total_rows)
- Record completeness = % of rows with all mandatory fields present
- Coverage completeness = % of expected entities represented (e.g., customers with at least one event)
Use conditional rules when business logic requires it, such as “delivery_date must exist if order_status = delivered”. Also define thresholds: you might tolerate 0.5% missingness in an optional field but require 99.9% completeness for join keys.
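Here is one way these completeness metrics and the conditional rule might look in pandas. The table, column names, and the delivered-order rule follow the examples above; the sample rows are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, None, 12],
    "order_status": ["delivered", "new", "delivered", "delivered"],
    "delivery_date": ["2024-01-05", None, None, "2024-01-07"],
})

# Field completeness = 1 - (missing_count / total_rows)
field_completeness = 1 - df["customer_id"].isna().mean()

# Record completeness: % of rows with all mandatory fields present
mandatory = ["order_id", "customer_id"]  # assumed mandatory set
record_completeness = df[mandatory].notna().all(axis=1).mean()

# Conditional rule: delivery_date must exist if order_status = delivered
delivered = df["order_status"] == "delivered"
violations = df[delivered & df["delivery_date"].isna()]
```

Comparing `field_completeness` against a per-field threshold (0.5% tolerated missingness for optional fields, 99.9% required for join keys, as above) turns these numbers into pass/fail checks.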
Validity rules are usually deterministic:
- Type/parse checks (dates parse; numbers are not stored as strings with commas)
- Range checks (age 0–120; discount 0–1)
- Set membership (status in {new, active, churned})
- Format checks using regex (email, phone, PIN code)
In many course projects, including those in a data scientist course in Delhi, teams implement these as automated tests so every new batch is evaluated the same way. Validity also benefits from profiling: if the “unknown” category suddenly jumps, or a numeric median collapses, that often signals a broken mapping even if type checks still pass.
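The deterministic rules listed above are straightforward to encode as boolean masks, one per rule, so every batch is scored identically. A sketch, with an illustrative email regex and sample data that are assumptions, not part of any real schema:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 150, 28],
    "status": ["new", "active", "paused"],
    "email": ["a@example.com", "not-an-email", "b@example.com"],
})

checks = {
    # Range check: age 0-120
    "age_in_range": df["age"].between(0, 120),
    # Set membership: status in {new, active, churned}
    "status_in_set": df["status"].isin({"new", "active", "churned"}),
    # Format check with a simple (not RFC-complete) email regex
    "email_format": df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

# Pass rate per rule, evaluated the same way on every batch
pass_rates = {name: float(mask.mean()) for name, mask in checks.items()}
```

Each mask also identifies the offending rows (`df[~mask]`), which is what you need when an alert fires.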
3) Consistency and uniqueness: do values agree, and are keys duplicated?
Consistency checks ensure the dataset is internally coherent and aligns with related tables.
Common consistency checks include:
- Cross-field rules (start_date ≤ end_date; paid_amount ≤ invoice_amount)
- Referential integrity (order.customer_id exists in customers)
- Standardisation (currency codes, units, casing, trimming)
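The cross-field and referential-integrity rules above can be checked with simple filters. The table and column names mirror the examples in the list but are otherwise assumptions:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 99],  # 99 has no master record
    "start_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05"]),
    "end_date":   pd.to_datetime(["2024-01-02", "2024-01-01", "2024-01-06"]),
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Cross-field rule: start_date <= end_date
date_violations = orders[orders["start_date"] > orders["end_date"]]

# Referential integrity: every order.customer_id exists in customers
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
```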
Uniqueness checks focus on duplicates. Useful metrics are:
- Duplicate rate = 1 − (distinct_key_count / total_rows)
- Duplicate cluster sizes (how many keys repeat 2×, 3–5×, 6+×)
- Near-duplicates after normalisation (case/whitespace differences)
Uniqueness must be defined per entity. In a data scientist course in Delhi, this is a common lesson when students compare event logs with master data tables. A user_id repeating in an event table is expected, but email might need to be unique in a master user table. In multi-source data, basic normalisation plus clear keys reduces accidental duplication.
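The duplicate-rate and cluster-size metrics, plus near-duplicate detection after basic normalisation, might look like this. The lower/strip normalisation and the sample emails are assumptions for illustration:

```python
import pandas as pd

emails = pd.Series([
    "a@x.com", "A@x.com ", "b@x.com", "c@x.com", "c@x.com",
])

# Near-duplicates surface only after case/whitespace normalisation
normalised = emails.str.strip().str.lower()

# Duplicate rate = 1 - (distinct_key_count / total_rows)
duplicate_rate = 1 - normalised.nunique() / len(normalised)

# Cluster sizes: how many times each key repeats
cluster_sizes = normalised.value_counts()
repeated_keys = cluster_sizes[cluster_sizes > 1]
```

Bucketing `cluster_sizes` (2x, 3-5x, 6+x) then gives the cluster-size distribution mentioned above.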
4) Anomaly detection: catch outliers and sudden shifts automatically
Once you have baseline profiles, anomaly detection turns data quality into continuous monitoring. Start with methods that are transparent:
- Outliers via z-score, or robust z-score using median and MAD (better for skew)
- IQR rules for heavy-tailed distributions
- Volume anomalies (unexpected spikes/drops in row counts)
- Distribution shifts using bucketed comparisons (PSI-style) or changes in median/p95
If you use more advanced detectors (for example, multivariate outlier models), pair them with diagnostics that pinpoint what changed, so alerts lead to action.
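Two of the transparent detectors above, sketched with numpy: a robust z-score using median and MAD, and a PSI-style bucketed comparison. The 1.4826 MAD scaling constant and the common "PSI > 0.2 means investigate" rule of thumb are conventions I am assuming here, not prescriptions from this article; the data is synthetic.

```python
import numpy as np

def robust_z(x: np.ndarray) -> np.ndarray:
    """Robust z-score: (x - median) / (1.4826 * MAD). Better for skewed data."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (1.4826 * mad)

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI-style shift score: bucket by baseline quantiles, compare shares."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside baseline range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
shifted = rng.normal(1.0, 1, 5000)  # simulated upstream shift
scores = robust_z(np.array([0.0, 0.1, -0.2, 8.0]))  # last value is an outlier
```

Because both detectors are simple, a triggered alert can immediately point at the offending rows or buckets, which is exactly the diagnostic pairing argued for above.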
Conclusion
Statistical profiling gives you the baseline, and data quality checks give you the rules. Combine completeness, validity, consistency, and uniqueness metrics with lightweight anomaly detection to detect defects and shifts early. The result is more trustworthy analysis, more stable models, and faster debugging when something changes. These habits matter whether you are in a data scientist course in Delhi or maintaining production pipelines in a real organisation.
