5 Common Pitfalls Every New Data Engineer Faces

(and How to Avoid Them!)

Starting your journey as a data engineer can feel exciting and overwhelming at the same time. You're learning new skills, building your first pipelines, and discovering just how messy real-world data can be. But don’t worry—you're not alone! I've been there, and today I'll walk you through five common pitfalls that new data engineers often face, along with practical advice on how to avoid them.

Pitfall #1: Overlooking Data Profiling

It's tempting to dive straight into writing code when you're assigned a new project. However, skipping the data profiling step can lead to major headaches later. Early in my career, I learned this the hard way—I assumed a field was unique, only to discover duplicates later, causing hours of unnecessary work.

How to Avoid: Always begin your project with thorough data profiling. Use tools like ydata-profiling (formerly pandas-profiling) or run exploratory SQL queries to identify potential data issues upfront.
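If you want something lighter than a full profiling library, a quick first pass with plain pandas can catch the same kinds of surprises. This is a minimal sketch; the file name and column names are placeholders, not from a real project:

import pandas as pd

# Placeholder file and column names for illustration
df = pd.read_csv("daily_sales.csv")

# Shape and data types: are the columns what you expect?
print(df.shape)
print(df.dtypes)

# Null counts per column
print(df.isnull().sum())

# Is the field you assume is unique actually unique?
print("order_id unique?", df["order_id"].is_unique)
print("duplicate order_ids:", df["order_id"].duplicated().sum())

# Value ranges and basic stats for numeric columns
print(df.describe())

Ten minutes of this kind of checking up front would have saved me those hours of rework on duplicates.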

Pitfall #2: Ignoring Scalability

Your pipeline might run perfectly fine with a few thousand records, but what happens when your data grows to millions of records? Many beginners overlook scalability, leading to performance bottlenecks down the road.

How to Avoid: Design your solutions with scalability in mind from the beginning. Simple practices like proper indexing, batching processes, and selecting appropriate data types can dramatically improve performance.
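As one example of batching, pandas can read a large file in chunks and use smaller numeric types so memory stays flat as the data grows. This is a rough sketch using assumed file and column names (adjust them to your own data):

import pandas as pd

# Assumed file, columns, and chunk size for illustration
reader = pd.read_csv(
    "daily_sales.csv",
    chunksize=100_000,  # process 100k rows at a time
    dtype={"store_id": "int32", "sales_total": "float32"},  # smaller types cut memory use
)

running_total = 0.0
for chunk in reader:
    # Do per-batch cleaning/aggregation here instead of loading everything at once
    running_total += chunk["sales_total"].sum()

print(f"Total sales processed: {running_total:,.2f}")

The same idea carries over to SQL: filter and aggregate in the database, and let indexes do the heavy lifting rather than pulling entire tables into memory.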

Pitfall #3: Hardcoding Values

Hardcoded paths, connection strings, dates, or constants seem harmless initially, but they become maintenance nightmares over time. I've personally experienced late-night emergencies caused by hardcoded values that unexpectedly broke production pipelines.

How to Avoid: Store configurations in external files or environment variables. This makes your code more flexible, easier to maintain, and significantly reduces the risk of production issues.
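A minimal version of this pattern in Python just reads settings from environment variables (or a config file) with explicit defaults. The variable names and defaults below are examples only, not a convention you have to follow:

import os
from datetime import date

# Example environment variables; names and defaults are illustrative
DB_SERVER = os.environ.get("SALES_DB_SERVER", "localhost")
DB_NAME = os.environ.get("SALES_DB_NAME", "sales_dev")
INPUT_PATH = os.environ.get("SALES_INPUT_PATH", "./data/incoming")

# Default the run date to today instead of hardcoding a literal date
RUN_DATE = os.environ.get("RUN_DATE", date.today().isoformat())

print(f"Loading {INPUT_PATH} for {RUN_DATE} into {DB_SERVER}/{DB_NAME}")

Now promoting a pipeline from dev to production is a matter of changing environment variables, not editing code at 2 a.m.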

Pitfall #4: Poor Naming Conventions

Ever stumbled across variables named 'temp1', 'data_final_FINAL', or 'stuff'? Poorly named resources lead to confusion, wasted time, and increased frustration—especially when you or someone else revisits your code later.

How to Avoid: Follow clear, descriptive naming conventions consistently. Name variables, tables, and pipeline tasks so that someone else can follow your logic without any additional explanation.
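As a tiny illustration (names and data made up), compare a throwaway name with one that reads like the business logic it implements:

import pandas as pd

# Made-up data for illustration
daily_sales_df = pd.DataFrame({"sales_total": [125.0, -10.0, 80.5]})

# Hard to follow six months from now:
temp1 = daily_sales_df[daily_sales_df["sales_total"] > 0]

# Self-explanatory at a glance:
sales_with_positive_totals = daily_sales_df[daily_sales_df["sales_total"] > 0]
print(sales_with_positive_totals)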

Pitfall #5: Not Asking Questions

New data engineers sometimes hesitate to ask questions for fear of appearing inexperienced. But not asking questions can slow your learning and impact the success of your projects.

How to Avoid: Always ask questions. Clarify requirements, check assumptions, and engage stakeholders frequently. Asking questions demonstrates initiative, curiosity, and commitment to getting it right.

Pitfall Alert: Fix Your Documentation Before It’s Too Late!

Here’s a pitfall we all fall into: a lack of documentation. It might not feel as exciting as building a pipeline or debugging a tricky error, but trust me, good documentation will save your sanity down the line.

Developing strong documentation habits early in your career makes maintaining them far easier as your workload grows. Initially, you might even lean toward over-documentation, and that’s perfectly fine. Over time, you'll naturally learn what’s necessary and what's redundant.

Why Document?

Poor documentation doesn’t just waste your time; it impacts your whole team's productivity, makes onboarding new hires difficult, and turns troubleshooting into a nightmare.

Things to document (that many engineers miss):

  • Python scripts:
    Include not just comments, but a header with author, creation date, purpose, and change log. For example:

"""
Script Name: Daily_Sales_ETL.py
Author: Chris Gambill
Created: 2024-03-31
Purpose: Loads daily sales data from HubSpot to Azure SQL DB.

Change Log:
- 2024-04-02: Fixed null value issue in sales totals (Chris Gambill)
"""
  • Pipeline Tasks:
    Give every pipeline task or node clear, descriptive names. Avoid generic names like "Copy Data" or "Task 1." Make it intuitive and straightforward.

  • Jira / ADO / Ticket Systems:
    Clearly document what changes you made, why, and include the exact scripts or pipeline names affected. Future-you will thank you.

  • Git Commits:
    Never use vague commit messages. Avoid just your initials or "updated file." Clearly state what changed and why. Example:
    "Updated Daily_Sales_ETL.py to handle null values in sales totals (ticket #DE-1047)."

Consider documentation tools like Confluence, Notion, or even GitHub Wiki to simplify and standardize your documentation efforts.

Share your documentation experiences in the comments!

  • What did I miss?

  • What’s your favorite documentation trick or tool?

Quick Wins for Immediate Improvement

Here are six habits you can immediately adopt to sidestep these pitfalls:

  1. Always profile your data first.

  2. Plan for scalability from day one.

  3. Avoid hardcoded values; use configurations.

  4. Consistently use clear naming conventions.

  5. Ask questions proactively and frequently.

  6. Document, Document, Document!

What's your experience with these pitfalls? Do you have other tips you'd like to share? Drop your thoughts in the comments below—I’d love to hear from you!

Keep engineering smarter, and remember—every great data engineer was once a beginner!
