Stop Plotting Randomly: A Systematic Framework for Exploratory Data Analysis

LDS Team
Let's Data Science

A colleague once handed me a dataset with 47 columns, 120,000 rows, and column names like col_23_v2_final. He'd already trained an XGBoost model and was confused by terrible performance. Ten minutes of structured exploration revealed the problem: a date column had silently loaded as a string, a zip code column was being treated as a continuous integer, and 18% of the target variable was missing. No model can fix data it doesn't understand.

Exploratory data analysis is an interrogation, not a slideshow. Random histograms might produce a pretty notebook, but they won't catch the data leakage hiding in your time column or the class imbalance that makes your 97% accuracy meaningless. This framework breaks EDA into four repeatable phases: Structure, Uniqueness, Relationships, and Anomalies. Every phase has specific checks, specific tools, and a clear "done" condition. We'll build every example around a single housing dataset so the concepts compound as you go.

[Figure: Four-phase EDA pipeline showing the Structure, Uniqueness, Relationships, and Anomalies phases]

The Running Example: A Housing Dataset

Throughout this article, we'll work with a synthetic housing dataset containing 500 properties. It includes continuous features (price, square footage), discrete features (bedrooms, garage), a categorical feature (property type), and intentional missing values in the garage column. This mirrors the messy reality of data you'd pull from a production database or a Kaggle competition.

Every code example regenerates this dataset with np.random.seed(42), so each run is reproducible.
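The original generator is not shown here, so the sketch below is an assumed reconstruction: the distribution parameters (log-normal price noise, the 600-3500 sqft range, the 50/30/20 type split, the ~30% garage missingness) are choices made to match the article's description, not the exact code behind the printed outputs.

```python
import numpy as np
import pandas as pd

# Assumed generator for the running housing example. All parameters below
# are guesses chosen to match the article's description of the data.
np.random.seed(42)
n = 500

df = pd.DataFrame({
    "sqft": np.random.randint(600, 3500, n),
    "bedrooms": np.random.randint(1, 6, n),                        # 1..5
    "type": np.random.choice(["House", "Condo", "Apartment"], n,
                             p=[0.5, 0.3, 0.2]),
    "year_built": np.random.randint(1950, 2019, n),
    "garage": np.random.randint(0, 3, n).astype(float),
})
# Log-normal noise gives the right-skewed price distribution described later
df["price"] = df["sqft"] * 100 * np.random.lognormal(0.0, 0.4, n)
# Inject ~30% missing garage values
df.loc[np.random.rand(n) < 0.30, "garage"] = np.nan
df = df[["price", "sqft", "bedrooms", "type", "year_built", "garage"]]

print(df.shape)  # (500, 6)
```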

Phase 1: The Structural Health Check

The structural health check answers one question: does the data match what you expect? Before computing a single statistic, you need to verify that Python has interpreted your columns correctly. A price column loaded as a string, a boolean loaded as float64, or a date column stored as an integer will silently poison every downstream step.

Dimensions, Types, and Null Counts

The first command on any new dataset should confirm shape, data types, and missing value counts. This takes five seconds and prevents hours of debugging later.
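A minimal version of that first pass might look like the sketch below. A small stand-in frame with assumed values is generated inline so the snippet runs on its own; swap in your real DataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in for the housing frame so the snippet is self-contained;
# the column values here are illustrative assumptions.
np.random.seed(42)
df = pd.DataFrame({
    "price": np.random.lognormal(12.2, 0.5, 500),
    "garage": np.where(np.random.rand(500) < 0.3, np.nan,
                       np.random.randint(0, 3, 500).astype(float)),
})

print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print("\nFirst 5 rows:")
print(df.head())
print("\nDtypes:")
print(df.dtypes)
print("\nNull counts per column:")
print(df.isnull().sum().to_string())
```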

Expected Output:

Rows: 500, Columns: 6

First 5 rows:
        price  sqft  bedrooms      type  year_built  garage
242482.703539  2263         2     Condo        2016     NaN
188093.466317  2754         2     House        1962     2.0
257577.368781  1100         3     House        1997     NaN
365569.592119  2081         4     House        1974     0.0
181015.628190  1474         1 Apartment        1993     1.0

Null counts per column:
  price: 0
  sqft: 0
  bedrooms: 0
  type: 0
  year_built: 0
  garage: 154

Common Pitfall: If a column like price appears with dtype object instead of float64, Pandas detected a non-numeric character somewhere in that column. Common culprits include dollar signs ($250,000), commas, or stray text like "N/A". You must clean and coerce these types with pd.to_numeric(df['price'], errors='coerce') before any analysis.

[Figure: Data type diagnosis decision tree for identifying column treatment in EDA]

The Cardinality Check

Cardinality is the number of unique values in a column. This single number reveals several common problems: a zip code column with 40,000 unique values is categorical data pretending to be numeric; a "status" column with only 2 unique values might belong in a boolean type; a supposedly continuous feature with only 5 unique values is actually ordinal.
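A sketch of the cardinality scan, using stand-in columns and the 5%/50% ratio thresholds the article's heuristic describes; the cutoffs and the LOW/MED/HIGH labels are assumptions you can tune.

```python
import numpy as np
import pandas as pd

# Stand-in columns; the 5% / 50% thresholds follow the ratio heuristic
# described in the Key Insight and are adjustable assumptions.
np.random.seed(42)
n = 500
df = pd.DataFrame({
    "price": np.random.lognormal(12.2, 0.5, n),
    "bedrooms": np.random.randint(1, 6, n),
    "type": np.random.choice(["House", "Condo", "Apartment"], n),
})

print("Cardinality analysis:")
print(f"{'Column':<12} | {'Unique':>6} | {'Ratio':>6} | {'Level':>5}")
print("-" * 42)
for col in df.columns:
    unique = df[col].nunique()
    ratio = unique / n
    level = "LOW" if ratio < 0.05 else ("MED" if ratio < 0.5 else "HIGH")
    print(f"{col:<12} | {unique:>6} | {ratio:>5.1%} | {level:>5}")

print("\nProperty type breakdown:")
print(df["type"].value_counts().to_string())
```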

Expected Output:

Cardinality analysis:
Column       | Unique |  Ratio |  Level
------------------------------------------
price        |    500 | 100.0% |   HIGH
sqft         |    430 | 86.0% |   HIGH
bedrooms     |      5 |  1.0% |    LOW
type         |      3 |  0.6% |    LOW
year_built   |     69 | 13.8% |    MED
garage       |      3 |  0.6% |    LOW

Property type breakdown:
type
House        239
Condo        163
Apartment     98

The cardinality ratio (unique values / total rows) immediately flags price and sqft as continuous variables, bedrooms and type as categoricals, and year_built as something in between. When you encounter high-cardinality categoricals (hundreds or thousands of unique values, like city names or product IDs), one-hot encoding explodes your feature space. That's where Frequency Encoding becomes essential.

Key Insight: The cardinality ratio is your quickest heuristic for deciding how to treat a column. Below 5%: treat as categorical. Between 5% and 50%: investigate further. Above 50%: almost certainly continuous or an identifier column that shouldn't be a feature at all.

Phase 2: Univariate Analysis

Univariate analysis examines each variable in isolation to understand its distribution shape. You're answering three questions per feature: Is it symmetric or skewed? Is it unimodal or multimodal? Are there obvious impossibilities (negative prices, ages over 200)?

Distribution Shape and Skewness

For continuous variables, skewness is the key diagnostic. It determines whether your data needs transformation before feeding it to models that assume normality (linear regression, LDA, many statistical tests).

\text{Skewness} = \frac{E[(X - \mu)^3]}{\sigma^3}

Where:

  • E is the expected value (average)
  • X is the observed data point
  • μ is the population mean
  • σ is the population standard deviation
  • The cubic power preserves the direction of deviation (left vs. right)

In Plain English: Skewness measures how lopsided your data is. A value near 0 means a symmetric bell curve. Positive skew (like our housing prices) means a long tail stretching to the right: most homes are affordable, but a few mansions pull the average way above the median. Negative skew is the opposite, like exam scores where most students do well but a few bomb the test.
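A sketch of the skew check using a log-normal stand-in for the price column (the specific lognormal(12.2, 0.5) parameters are assumptions, so the exact numbers will differ from the output below):

```python
import numpy as np
import pandas as pd

# Log-normal stand-in for the price column (assumed parameters).
np.random.seed(42)
price = pd.Series(np.random.lognormal(12.2, 0.5, 500), name="price")

print(f"mean   = {price.mean():>12,.1f}")   # expect mean > median for right skew
print(f"median = {price.median():>12,.1f}")
print(f"skew   = {price.skew():.2f}")

log_price = np.log1p(price)                 # log1p stays defined at zero
print(f"skew after log transform = {log_price.skew():.2f}")
```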

Expected Output:

Column       |         Mean |       Median |   Skew | Action
-----------------------------------------------------------------
price        |     215713.9 |     199809.3 |   1.95 | Log transform
sqft         |       1816.1 |       1814.0 |   0.08 | OK
bedrooms     |          3.1 |          3.0 |  -0.00 | OK
year_built   |       1987.9 |       1987.0 |   0.09 | OK

Price skewness of 1.95 confirms right-skew from log-normal generation.
After log transform: skewness = 0.18

The price column has a skewness of 1.95, well above the 1.0 threshold. The mean ($215,714) sits noticeably above the median ($199,809), a classic sign of right-skew. A log transformation collapses the skewness from 1.95 to 0.18, which is close enough to symmetric for most parametric models.

Skewness Range  | Interpretation          | Action
----------------|-------------------------|----------------------------------------------
|s| < 0.5       | Approximately symmetric | No transformation needed
0.5 ≤ |s| < 1.0 | Moderate skew           | Monitor; transform if model demands normality
|s| ≥ 1.0       | High skew               | Log, square root, or Box-Cox transform

Pro Tip: Always compare mean and median side by side. When they diverge significantly, the mean is being dragged by outliers. For skewed features, report the median and IQR rather than the mean and standard deviation.

Categorical Class Balance

For categorical variables, the equivalent of skewness is class imbalance. If 96% of your target is one class, a model that predicts the majority class for every input achieves 96% accuracy while being completely useless. You need to know the proportions before choosing an evaluation metric.

In our housing dataset, the type column breaks down as 47.8% House, 32.6% Condo, and 19.6% Apartment. That's not problematic for a feature, but if this were a classification target, you'd want stratified splits and an F1 or PR-AUC metric instead of raw accuracy.
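A minimal balance check might look like this; the stand-in type column uses the article's approximate 50/30/20 proportions as an assumption, and the "imbalance ratio" summary is one convention among several.

```python
import numpy as np
import pandas as pd

# Stand-in type column with the article's approximate proportions.
np.random.seed(42)
ptype = pd.Series(
    np.random.choice(["House", "Condo", "Apartment"], 500, p=[0.5, 0.3, 0.2]),
    name="type",
)

shares = ptype.value_counts(normalize=True)   # sorted majority-first
for label, share in shares.items():
    print(f"{label:<10} {share:6.1%}")
print(f"Imbalance ratio (majority / minority): "
      f"{shares.iloc[0] / shares.iloc[-1]:.1f}x")
```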

For encoding these categories before modeling, see Categorical Encoding.

Phase 3: Bivariate Relationships

Bivariate analysis tests whether variables move together. This phase answers the modeling question: which features actually carry predictive signal? A feature with zero correlation to the target is dead weight. Two features with 0.95 correlation with each other create multicollinearity that destabilizes coefficient estimates in linear models.

The Pearson Correlation Coefficient

Pearson's r is the standard measure for linear relationships between two continuous variables.

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}

Where:

  • x_i, y_i are individual data points from the two variables
  • x̄, ȳ are their respective means
  • n is the number of observations
  • The numerator measures how much x and y move together (covariance)
  • The denominator normalizes by each variable's individual spread

In Plain English: Think of the correlation coefficient as a score for how synchronized two variables are. In our housing dataset, if every time square footage goes up by 100, price goes up by $15,000, that's a strong positive correlation near +1. If price bounced randomly regardless of square footage, r would hover near 0. The denominator just rescales the result so it always lands between -1 and +1.
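A sketch of the matrix computation; the stand-in generator (price driven mostly by sqft, with assumed coefficients and noise) is an illustration, so the values will not match the output below exactly.

```python
import numpy as np
import pandas as pd

# Stand-in features where price depends mostly on sqft (assumed generator).
np.random.seed(42)
n = 500
df = pd.DataFrame({
    "sqft": np.random.randint(600, 3500, n),
    "bedrooms": np.random.randint(1, 6, n),
    "year_built": np.random.randint(1950, 2019, n),
})
df["price"] = (100 * df["sqft"] + 15_000 * df["bedrooms"]
               + np.random.normal(0, 30_000, n))

corr = df.corr(numeric_only=True)
print("Correlation matrix:")
print(corr.round(3))

print("\nStrongest correlations with price:")
target = corr["price"].drop("price").sort_values(key=np.abs, ascending=False)
for name, r in target.items():
    strength = "Strong" if abs(r) >= 0.7 else ("Moderate" if abs(r) >= 0.4
                                               else "Weak")
    print(f"  {name:<12}: {r:+.3f} ({strength})")
```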

Expected Output:

Correlation matrix:
            price   sqft  bedrooms  year_built
price       1.000  0.891     0.281       0.047
sqft        0.891  1.000     0.013       0.018
bedrooms    0.281  0.013     1.000       0.062
year_built  0.047  0.018     0.062       1.000

Strongest correlations with price:
  sqft        : +0.891 (Strong)
  bedrooms    : +0.281 (Weak)
  year_built  : +0.047 (Weak)

Square footage dominates with r = 0.891. Bedrooms contribute a modest +0.281, and year built has almost no linear relationship with price. If sqft and bedrooms had been highly correlated with each other (above 0.8), we'd need to drop one or combine them to avoid multicollinearity. They're at 0.013 here, so both carry independent signal.

Common Pitfall: Pearson's r only captures linear relationships. A parabolic pattern (U-shaped) can produce r ≈ 0 while being perfectly predictable. Always validate the correlation matrix with scatter plots. Anscombe's Quartet, published in 1973, famously shows four datasets with identical correlation statistics but wildly different shapes. For non-linear relationships, consider Spearman's rank correlation instead. For a deeper treatment, see Correlation Analysis.

Categorical vs. Numerical: Grouped Comparisons

When you need to test whether a category affects a numeric variable ("Do Houses cost more than Condos?"), grouped boxplots and group-level statistics are the right tools. If the interquartile ranges of each group overlap heavily, the categorical variable likely has weak predictive power for that target. If the medians are clearly separated, you've found a useful feature.

For our housing dataset, comparing price distributions across property types reveals whether type deserves a place in the model or adds noise.
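A sketch of the grouped comparison; the per-type base prices and noise level are invented effect sizes for illustration, and the median/IQR summary stands in for the boxplot.

```python
import numpy as np
import pandas as pd

# Stand-in: property type shifts the price level (assumed effect sizes).
np.random.seed(42)
n = 500
ptype = np.random.choice(["House", "Condo", "Apartment"], n, p=[0.5, 0.3, 0.2])
base = {"House": 250_000, "Condo": 180_000, "Apartment": 140_000}
price = np.array([base[t] for t in ptype]) + np.random.normal(0, 40_000, n)
df = pd.DataFrame({"type": ptype, "price": price})

# Clearly separated medians suggest a useful feature; heavy IQR overlap
# suggests weak signal.
summary = df.groupby("type")["price"].agg(
    median="median",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
)
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary.round(0))
```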

Phase 4: Missingness and Anomaly Detection

Missing data and outliers are where data science projects actually break. A model trained on incomplete data makes biased predictions. A model trained with uncapped outliers gives disproportionate weight to a handful of extreme cases. This phase investigates both.

Missing Data Patterns

Not all missing data is the same. The mechanism behind the missingness determines which imputation strategy works and which introduces bias.

[Figure: Missingness pattern classification showing the MCAR, MAR, and MNAR decision tree]

Pattern | Definition                                  | Example                                        | Safe Strategies
--------|---------------------------------------------|------------------------------------------------|------------------------------------------
MCAR    | Missing completely at random                | Sensor glitch drops random readings            | Drop rows, mean/median impute
MAR     | Missing depends on observed variables       | High-income respondents skip income question   | Conditional imputation, MICE
MNAR    | Missing depends on the missing value itself | Sick patients too ill to report health status  | Domain expertise, model-based approaches

The critical test: compare the distribution of other variables between rows where the suspect column is missing and rows where it's present. If the distributions differ significantly, the data is NOT missing at random.
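That test can be sketched as below. Missingness is injected completely at random here, so this is MCAR by construction; the `scipy.stats.ttest_ind(..., equal_var=False)` call is the Welch t-test the article's output reports.

```python
import numpy as np
from scipy import stats

# MCAR by construction: the missing mask is independent of price, so the
# Welch t-test should usually find no difference between the groups.
np.random.seed(42)
n = 500
price = np.random.lognormal(12.2, 0.5, n)
garage_missing = np.random.rand(n) < 0.30

t_stat, p_value = stats.ttest_ind(price[garage_missing],
                                  price[~garage_missing],
                                  equal_var=False)   # Welch's t-test
print(f"Mean price (garage missing): ${price[garage_missing].mean():,.0f}")
print(f"Mean price (garage present): ${price[~garage_missing].mean():,.0f}")
print(f"Welch t-test p-value: {p_value:.4f}")
print("Verdict:", "No evidence against MCAR" if p_value > 0.05
      else "Missingness appears related to price")
```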

Expected Output:

Missing data summary:
Column       | Count |    Pct
------------------------------
price        |     0 |   0.0%
sqft         |     0 |   0.0%
bedrooms     |     0 |   0.0%
type         |     0 |   0.0%
year_built   |     0 |   0.0%
garage       |   154 |  30.8%

Is garage missingness related to price?
  Mean price (garage missing):  $213,245
  Mean price (garage present):  $216,813
  Welch t-test p-value: 0.6786
  Verdict: No evidence against MCAR (safe to drop or simple impute)

The garage column is 30.8% missing. That's a lot, but the Welch t-test shows no significant difference in price between properties with and without garage data (p = 0.68). This suggests MCAR, meaning median imputation or simply dropping garage rows won't bias the model. If the p-value had been below 0.05, we'd need a more sophisticated approach.

For a complete toolkit of imputation methods, from simple median fill to iterative MICE, see Missing Data Strategies.

Outlier Detection: IQR vs. Z-Score

Outliers distort means, inflate variance, and confuse gradient-based optimization. But not all extreme values are errors. A $2 million mansion in a dataset of $200K homes is a real data point, not a typo. The question is always: does this point represent genuine variation, or is it corrupted data?

Two classic methods detect univariate outliers:

The IQR Method (non-parametric, works on any distribution):

\text{Outlier if } x < Q_1 - 1.5 \cdot IQR \text{ or } x > Q_3 + 1.5 \cdot IQR

Where:

  • Q_1 is the 25th percentile (first quartile)
  • Q_3 is the 75th percentile (third quartile)
  • IQR = Q_3 - Q_1 is the interquartile range (the middle 50% spread)
  • 1.5 is Tukey's conventional multiplier (use 3.0 for "extreme" outliers)

In Plain English: The IQR method builds a fence around the middle 50% of your data, then extends it by 1.5 times the width of that middle zone in each direction. Any point outside these fences is flagged. For our housing prices, the fence runs from roughly $-12K to $423K. Since negative prices are impossible, only the upper fence matters here.

The Z-Score Method (parametric, assumes approximate normality):

Z = \frac{x - \mu}{\sigma}

Where:

  • x is the data point being tested
  • μ is the sample mean
  • σ is the sample standard deviation
  • |Z| > 3 flags the point as an outlier (≈0.3% probability under a normal distribution)

In Plain English: The Z-score converts every data point into "how many standard deviations away from average is this?" A Z-score of 3 means the value sits three standard deviations above the mean, which should happen in only 0.15% of cases if the data is normally distributed. For our right-skewed house prices, this assumption is violated, which is why the Z-score method catches fewer outliers than IQR.
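Both detectors can be sketched side by side on a right-skewed stand-in column (assumed log-normal parameters, so the counts will not match the output below exactly):

```python
import numpy as np

# Right-skewed stand-in prices; both detectors run on the same column.
np.random.seed(42)
price = np.random.lognormal(12.2, 0.5, 500)

# IQR method: Tukey fences at 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (price < lower) | (price > upper)

# Z-score method: |z| > 3 under an (approximate) normality assumption
z = (price - price.mean()) / price.std()
z_outliers = np.abs(z) > 3

print(f"IQR fences: [{lower:,.0f}, {upper:,.0f}]")
print(f"IQR outliers:     {iqr_outliers.sum()}")
print(f"Z-score outliers: {z_outliers.sum()}")
print(f"Flagged by both:  {(iqr_outliers & z_outliers).sum()}")
```

On skewed data like this, expect the IQR count to exceed the Z-score count, for the reason the article gives: the skew inflates the standard deviation.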

Expected Output:

IQR Method:
  Q1 = $151,252, Q3 = $260,111, IQR = $108,859
  Lower fence = $-12,036, Upper fence = $423,399
  Outliers found: 23

Z-Score Method (|z| > 3):
  Mean = $226,060, Std = $128,272
  Outliers found: 8

Overlap: 8 flagged by both

The IQR method is more aggressive: it flags 15 more points
because it does not assume a normal distribution.

The IQR method flags 23 outliers; the Z-score method catches only 8. All 8 Z-score outliers are also caught by IQR. The gap exists because the Z-score assumes normality, and our prices are right-skewed. In skewed distributions, the inflated standard deviation makes extreme values look less extreme in Z-score terms.

[Figure: Outlier detection method selection guide for choosing between IQR, Z-score, Isolation Forest, and LOF]

For multivariate outlier detection where points might look normal in each individual feature but are anomalous when features are considered together, you need density-based methods. Isolation Forest works well for high-dimensional data, while Local Outlier Factor excels at detecting cluster-boundary anomalies. For a comprehensive comparison, see Statistical Outlier Detection.

Method               | Assumes Normality | Multi-Feature  | Scalability          | Best For
---------------------|-------------------|----------------|----------------------|------------------------------
IQR                  | No                | Single feature | Any dataset size     | Skewed data, initial scan
Z-Score              | Yes               | Single feature | Any dataset size     | Roughly normal data
Isolation Forest     | No                | Multi-feature  | Scales to millions   | High-dimensional, mixed types
Local Outlier Factor | No                | Multi-feature  | Moderate (KNN-based) | Density-varying clusters

When to Use This Framework (and When NOT to)

Use the four-phase framework when:

  • You're starting with a dataset you've never seen before
  • You're building a model and need to choose features, transformations, and imputation strategies
  • You're auditing a teammate's work and want a systematic checklist
  • You've inherited a production pipeline and need to understand what the data actually looks like

Skip the full framework when:

  • You're working with a well-documented dataset you've used many times before (still run Phase 1 as a sanity check)
  • Time constraints demand automated profiling tools like ydata-profiling or skrub for a quick first pass
  • The dataset is tiny enough (under 20 rows) to inspect visually in a spreadsheet

The EDA Checklist

After completing all four phases, you should be able to answer every question on this checklist. If you can't, go back and investigate.

Phase         | Check                        | Question Answered
--------------|------------------------------|-----------------------------------------------
Structure     | Dtypes correct               | Any numeric columns loading as strings?
Structure     | Cardinality classified       | Which columns are continuous vs. categorical?
Structure     | Duplicates checked           | Are there exact duplicate rows?
Uniqueness    | Skewness computed            | Which features need transformation?
Uniqueness    | Class balance verified       | Is the target variable imbalanced?
Relationships | Correlation matrix built     | Which features carry signal for the target?
Relationships | Scatter plots validated      | Are correlations real or artifacts?
Relationships | Multicollinearity checked    | Are any features redundant with each other?
Anomalies     | Missingness mechanism tested | Is the data MCAR, MAR, or MNAR?
Anomalies     | Outliers quantified          | How many exist and by which method?

Pro Tip: Save this checklist as a reusable function. In my projects, I keep an eda_report(df) utility that prints the entire structural check in one call. Over time, it becomes second nature. For cleaning issues you discover during EDA, follow a systematic approach from Data Cleaning Workflow.
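One possible shape for such a utility is sketched below; the name `eda_report` comes from the tip above, but everything inside it (which checks run, the column layout) is a personal-preference assumption to adapt to your own workflow.

```python
import numpy as np
import pandas as pd

def eda_report(df: pd.DataFrame) -> None:
    """Print the Phase 1 structural check in one call (sketch, not a spec)."""
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
    print(f"Exact duplicate rows: {df.duplicated().sum()}")
    print(f"\n{'Column':<15} {'dtype':<10} {'nulls':>6} {'unique':>7} {'ratio':>7}")
    for col in df.columns:
        unique = df[col].nunique()
        print(f"{col:<15} {str(df[col].dtype):<10} "
              f"{df[col].isnull().sum():>6} {unique:>7} {unique / len(df):>6.1%}")
    numeric = df.select_dtypes(include=np.number)
    if not numeric.empty:
        print("\nSkewness (numeric columns):")
        print(numeric.skew().round(2).to_string())

# Tiny demo frame with one duplicate row and one missing value
eda_report(pd.DataFrame({
    "price": [200_000.0, 250_000.0, 250_000.0, np.nan],
    "type": ["House", "Condo", "Condo", "House"],
}))
```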

Conclusion

The difference between productive EDA and wasted time comes down to structure. Random plotting feels like progress but rarely produces actionable findings. The four-phase framework forces you to start with the boring stuff (data types, cardinality) because that's where the worst bugs hide. By the time you reach Phase 4's outlier analysis, you already understand the distribution of every feature and can make informed decisions about what's genuinely anomalous versus what's just heavy-tailed.

Your EDA should produce a concrete artifact: a list of transformations needed (log-transform price), features to drop (highly correlated pairs), an imputation strategy for each missing column, and a decision on outlier treatment (cap, remove, or keep). That artifact becomes the spec for your Feature Engineering pipeline.

If you're dealing with too many candidate features after EDA, Feature Selection helps you narrow down to the set that actually improves model performance. And for turning the issues you find during EDA into a systematic repair process, see Data Cleaning Workflow.

The framework John Tukey outlined in his 1977 book Exploratory Data Analysis (Addison-Wesley) still holds: look at data before you model it, question every assumption, and let the data tell you what's wrong before you ask it what's right.

Frequently Asked Interview Questions

Q: You receive a new dataset with 200 features. Walk me through your first 30 minutes with the data.

Start with df.shape and df.info() to understand dimensions and types. Check for type mismatches (numeric columns loaded as strings). Run df.isnull().sum() to map missing values. Compute df.nunique() to classify features into continuous, discrete, and categorical. Then examine the target variable's distribution to decide on evaluation metrics. This structural pass takes about 10 minutes and prevents wasting time on downstream analysis built on corrupted inputs.

Q: What is the difference between MCAR, MAR, and MNAR? Why does it matter for imputation?

MCAR (Missing Completely at Random) means the probability of a value being missing is unrelated to any variable. MAR (Missing at Random) means missingness depends on other observed variables but not the missing value itself. MNAR (Missing Not at Random) means the missingness is related to the unobserved value. MCAR allows simple imputation (mean, median). MAR requires conditional approaches like MICE. MNAR needs domain expertise or model-based correction because the data is systematically biased.

Q: Your correlation matrix shows two features with r = 0.95. What do you do?

High correlation indicates multicollinearity. For linear models, this inflates coefficient standard errors and makes individual feature importance unreliable. I'd compute the VIF (Variance Inflation Factor) for both features, then either drop the one with lower correlation to the target, combine them into a single derived feature (like a ratio or PCA component), or use a regularized model (Ridge, Elastic Net) that handles multicollinearity natively.

Q: A feature has skewness of 3.2. How do you handle it?

Skewness above 1.0 indicates severe asymmetry. I'd apply a log transform first (np.log1p to handle zeros). If log doesn't bring it below 0.5, try Box-Cox or Yeo-Johnson transforms. For tree-based models (XGBoost, Random Forest), skewness matters less because they split on rank order, not absolute values. But for linear regression, neural networks, and KNN, skewness causes gradient instability and distance metric distortion.

Q: You find that 40% of a column's values are missing. Should you drop the column?

Not automatically. First, test whether the missingness is random by comparing distributions of other features between missing and non-missing groups (Welch t-test or chi-squared test). If the column has strong predictive signal for the target, 40% missingness is recoverable with conditional imputation or adding a binary "is_missing" indicator feature. Only drop if the column has both weak signal and non-random missingness that you cannot model.

Q: Why can Pearson correlation be misleading? Give an example.

Pearson's r only measures linear relationships. A perfect quadratic relationship (y = x^2) can have r near 0 because the positive and negative halves cancel out. Anscombe's Quartet demonstrates this with four datasets that have identical means, variances, and correlations but completely different shapes. Always pair the correlation matrix with scatter plots to visually confirm the relationship matches the statistic.

Q: What's the difference between the IQR method and Z-score method for outlier detection?

The IQR method is non-parametric. It uses the 25th and 75th percentiles to build fences, making it appropriate for any distribution shape. The Z-score method assumes approximate normality and measures how many standard deviations a point is from the mean. For skewed data, the Z-score method underdetects outliers because the inflated standard deviation absorbs extreme values. In practice, I use IQR for initial screening and Z-scores only when I've already confirmed the data is roughly normal.

Q: How would you detect data leakage during EDA?

Look for features with suspiciously perfect or near-perfect correlation with the target. Check if any feature encodes information from the future (like a "claim_amount" column available before the claim is filed). Examine whether the data was sorted by time and split randomly instead of chronologically. Also verify that derived features (rolling averages, cumulative sums) were computed correctly within each split, not across train and test sets.

Hands-On Practice

Following the systematic framework outlined in the article (Structure, Uniqueness, Relationships, and Anomalies), we will perform a structured exploratory data analysis on the Customer Analytics dataset. This process moves beyond random plotting to rigorously check data types, distribution shapes, correlations, and potential outliers using pure Python and Matplotlib.

Dataset: Customer Analytics (Data Analysis). A rich customer dataset with 1,200 rows designed for EDA, data profiling, correlation analysis, and outlier detection. It contains intentional correlations (strong, moderate, and non-linear), ~5% missing values, ~3% outliers, varied distributions, and business context for storytelling.

By systematically checking the structure, distributions, relationships, and anomalies, we have validated the dataset's quality before attempting any complex modeling. This process revealed the distribution of spending, confirmed potential relationships between income and purchase habits, and identified outliers that could skew linear models.