Automating EDA & Handling Schema Drift: From Excel to Python
If you've ever imported a dataset only to find that column names have changed, critical fields are missing, or dates are stored in different formats, you've experienced schema drift. Manually inspecting these differences in Excel is fine for a few files, but what if you have hundreds of datasets spanning multiple years?
In this post, we'll explore:
Why Exploratory Data Analysis (EDA) is crucial for data integrity.
How to detect schema drift using Excel.
How to automate schema drift handling with Python.
A free Python script you can download to streamline your workflow.
Why EDA is Critical Before Importing Data
Before pushing data into a production database, you need to analyze it for inconsistencies. Without proper EDA, you risk:
✅ Schema Drift – Column names, data types, or entire fields changing over time.
✅ Missing Values – Some years might be missing critical data fields.
✅ Duplicates – The same transactions might exist across multiple files.
✅ Inconsistent Formats – Date fields may switch formats (YYYY-MM-DD vs. MM/DD/YYYY).
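Mixed date formats like these can usually be normalized in pandas before the data ever reaches SQL. A minimal sketch (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical column mixing YYYY-MM-DD and MM/DD/YYYY across years
s = pd.Series(["2020-01-15", "03/27/2021", "2022-11-02"])

# Parsing element by element lets pandas infer each format independently;
# errors="coerce" turns anything unparseable into NaT instead of raising
dates = s.apply(lambda x: pd.to_datetime(x, errors="coerce"))

# Everything is now a real timestamp, ready to format consistently
print(dates.dt.strftime("%Y-%m-%d").tolist())
```

Coercing failures to NaT also gives you a quick way to count how many rows had an unrecognized format.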
💡 What is Schema Drift?
Schema drift happens when the structure of your data changes over time—whether it's new columns, removed fields, or even different data types for the same column.
📌 Example:
2019-2020: Column is named sales_date (stored as TEXT).
2021-2023: Column is now date_sold (stored as TIMESTAMP).
2023: A new column discount_applied appears.
If you append these files without checking schema drift, you’ll end up with broken queries, incorrect joins, and type conversion errors in SQL.
Identifying Schema Drift Using Excel
If you're working with historical CSV files, Excel is a quick way to manually inspect changes.
Steps in Excel:
✅ Open multiple CSV files side by side.
✅ Compare column headers manually (spot differences).
✅ Use COUNTIF or LEN formulas to detect missing values & inconsistencies.
✅ Identify data type mismatches and inconsistent date formats.
📌 Pain Point: While Excel works for small datasets, it becomes time-consuming and error-prone when dealing with dozens or hundreds of files. That’s where Python can save hours of work.
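As a taste of that, the COUNTIF/LEN-style checks above collapse into a couple of pandas calls. A small sketch with hypothetical sales data standing in for one yearly CSV:

```python
import pandas as pd

# Hypothetical slice of one yearly file
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "sales_date": ["2020-01-15", None, "2020-03-01", None],
    "amount": [19.99, 5.00, None, 12.50],
})

# COUNTIF-over-blanks equivalent: missing values per column
print(df.isna().sum())

# LEN equivalent: spot suspiciously short or empty string values
print(df["sales_date"].str.len().describe())
```

One script loop over a folder of CSVs replaces opening each file by hand.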
Automating EDA & Schema Drift Handling with Python
Instead of manually checking files, we can use Python to automatically detect schema drift and standardize datasets.
Step 1: Detect Schema Drift Dynamically
✅ Load all files, extract column headers, and compare across years.
✅ Identify missing columns, renamed fields, and data type mismatches.
✅ Display a schema drift table (automatically generated).
📜 Code Snippet:
```python
import pandas as pd
from glob import glob

# Get all CSV files
file_paths = glob("data/*.csv")

# Detect schema drift: compare column headers across all files
def detect_schema_drift(file_paths):
    schema_dict = {}
    for file in file_paths:
        # nrows=0 reads only the header row -- fast, even for huge files
        df = pd.read_csv(file, nrows=0)
        schema_dict[file] = set(df.columns)
    all_columns = set().union(*schema_dict.values())
    schema_comparison = pd.DataFrame(index=file_paths, columns=sorted(all_columns))
    for file, columns in schema_dict.items():
        for col in all_columns:
            schema_comparison.loc[file, col] = "✅" if col in columns else "❌"
    return schema_comparison

schema_drift_df = detect_schema_drift(file_paths)
print(schema_drift_df)
```
Step 2: Standardize & Merge the Files
✅ Rename inconsistent columns.
✅ Add missing fields to older files.
✅ Ensure consistent data types (e.g., order_id as string).
✅ Merge everything into one clean dataset ready for SQL insertion.
📜 Code Snippet:
```python
# Canonical renames: legacy column name -> standard name
RENAMES = {"sales_date": "date_sold"}

def standardize_schema(file_paths, standard_schema):
    standardized_dfs = []
    for file in file_paths:
        df = pd.read_csv(file)
        df.rename(columns=RENAMES, inplace=True)
        # Add any columns this file is missing so all files line up
        for col in standard_schema:
            if col not in df.columns:
                df[col] = None
        df = df[standard_schema]  # enforce a consistent column order
        df["order_id"] = df["order_id"].astype(str)
        standardized_dfs.append(df)
    return pd.concat(standardized_dfs, ignore_index=True)

# Build the target schema from the union of all headers, applying the same
# renames so a legacy name and its replacement don't both end up as columns
all_cols = set().union(*[set(pd.read_csv(f, nrows=0).columns) for f in file_paths])
standard_schema = sorted({RENAMES.get(c, c) for c in all_cols})
cleaned_data = standardize_schema(file_paths, standard_schema)
print(cleaned_data.head())
```
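With the files merged, it's also worth handling the duplicate transactions flagged earlier before inserting into SQL. One sketch, assuming `order_id` uniquely identifies a transaction (adjust the key columns to your data):

```python
import pandas as pd

# Hypothetical merged data where one order appears in two yearly files
cleaned_data = pd.DataFrame({
    "order_id": ["100", "101", "101", "102"],
    "date_sold": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15"],
})

# Keep the first occurrence of each order_id; sort beforehand if one
# file should take precedence over another
deduped = cleaned_data.drop_duplicates(subset="order_id", keep="first")
print(len(cleaned_data), "->", len(deduped))
```

Comparing row counts before and after is a cheap sanity check on how much overlap your yearly files actually have.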
🔥 Key takeaway: "Instead of manually checking columns in Excel every time new data arrives, this script does it for you. Run it once, and it automatically adapts to whatever files you throw at it!"
Download the Free Python Script
Why waste hours manually checking data when you can automate the whole process? Download the free Python script below and start automating your historical data imports today!
📥 Download the Schema Drift Handling Script
💬 Question: What’s the worst case of schema drift you’ve ever seen? Drop a comment below!
Final Thoughts
✅ Start with Excel for quick manual EDA.
✅ Use Python for scalable, automated schema drift detection.
✅ Download the script to speed up your workflow.
If you found this useful, make sure to subscribe to my YouTube channel for more data engineering content!