AWS Data Engineer Roadmap 2026: From Beginner to AWS Certified Data Engineer

AWS data engineering looks substantially different in 2026 than it did in 2022. The Data Analytics Specialty certification (DAS-C01) was retired in April 2024 and replaced by the Data Engineer Associate (DEA-C01), which is now the canonical credential. The service mix that hiring managers expect has shifted as well — Glue has replaced EMR for most ETL workloads, Iceberg has overtaken proprietary table formats, and the convergence between AWS data services and the rest of the modern data stack (dbt, Airflow, Snowflake on AWS) means the 2026 AWS data engineer is rarely a pure-AWS specialist.

This roadmap is for someone who wants to be hireable as an AWS-leaning data engineer in 6-12 months. It assumes basic SQL and some programming experience. It does not assume prior AWS experience. The path lands you at certified, portfolio-equipped, and ready to interview.

What an AWS data engineer does in 2026
The certification landscape post-DAS retirement
Phase 1 (Months 1-2): AWS fundamentals + cloud basics
Phase 2 (Months 2-4): Storage and ingestion services
Phase 3 (Months 3-5): Compute and ETL services
Phase 4 (Months 4-6): Analytics and warehousing
Phase 5 (Months 5-7): Streaming and real-time
Phase 6 (Months 6-8): Security, monitoring, governance
Phase 7 (Months 7-9): The DEA-C01 exam
Phase 8: The portfolio that gets interviews
Salary and market expectations 2026
Common mistakes that waste months
Beyond entry-level: where to specialize
Closing: AWS-first vs cloud-agnostic career strategy

What an AWS data engineer does in 2026

The role title “AWS data engineer” covers a recognizable set of responsibilities in 2026 organizations:

30-40% of the work is building and maintaining data pipelines (ingestion, transformation, scheduling) using AWS-native tools (Glue, Step Functions, MWAA) plus increasingly Airflow or dbt
15-25% is data modeling — schema design in S3 (with Iceberg or Delta), Redshift modeling, dimensional modeling for analytics consumption
15-20% is operational work — incident response, pipeline reliability, cost optimization
10-15% is collaboration — working with analysts, data scientists, software engineers on data needs
10-15% is architecture and design — picking the right service for new data sources, evaluating vendor options
5-10% is documentation and governance — schema documentation, data dictionary maintenance, lineage updates

What an AWS data engineer is not in 2026:

Not a data scientist (no modeling work)
Not a BI developer (limited dashboard building)
Not a software engineer (writes SQL and Python primarily, not Java or backend services)
Not a DBA (operates managed services, doesn’t tune a self-hosted RDBMS)

The AWS data engineer is the person who keeps the pipelines running and the warehouses fresh. The work is essential, the salaries are good, and the path has clear progression to senior data engineer, staff DE, or platform engineering.

The certification landscape post-DAS retirement

April 2024 brought a meaningful change: AWS retired the Data Analytics Specialty (DAS-C01) and replaced it with the Data Engineer Associate (DEA-C01). The DEA-C01 is now the canonical AWS data engineering credential. Roadmaps written before mid-2024 referencing DAS-C01 are out of date.

The 2026 cert hierarchy for AWS data engineering:

AWS Cloud Practitioner (CLF-C02) — entry-level, optional but reasonable warmup. Skip if you have any cloud experience.
AWS Solutions Architect Associate (SAA-C03) — strongly recommended foundation. Covers the architecture vocabulary you’ll need.
AWS Data Engineer Associate (DEA-C01) — the target credential. Replaces the retired DAS-C01.
AWS Database Specialty (DBS-C01) — optional, useful if you go deep on database internals
AWS Machine Learning Specialty (MLS-C01) — optional, if you head toward ML engineering

The DEA-C01 exam covers four domains: data ingestion and transformation, data store management, data operations and support, data security and governance. The exam is multiple choice plus multiple response, 65 questions in 130 minutes, $150 USD.

Practical certification advice for 2026:

Don’t get certified before you have hands-on practice. Cert without experience is recognizable to interviewers and weakly signals.
The SAA-C03 → DEA-C01 path takes most candidates 4-6 months of focused study.
Free credits via AWS Free Tier are sufficient for almost all hands-on practice; budget $20-50/month for any beyond-tier work.
The exam emphasizes service-selection reasoning: “given this scenario, which service?” Practice that mode of thinking, not service trivia.

Phase 1 (Months 1-2): AWS fundamentals + cloud basics

Foundational concepts that everything else builds on.

Cloud computing basics:

IaaS vs PaaS vs SaaS, and where AWS data services fit
Regions and availability zones
Shared responsibility model
Pay-as-you-go pricing (and how it goes wrong)

AWS account setup:

Create a sandbox AWS account
Set up billing alerts (budget $20-30/month max)
Configure IAM users (don’t use root for daily work)
Set up MFA on the root account
Install AWS CLI and configure profiles

IAM (Identity and Access Management) basics:

Users, groups, roles, policies
Inline vs managed policies
Trust relationships
Service-linked roles
The principle of least privilege (in practice)

Networking essentials:

VPC, subnets (public vs private)
Security groups vs NACLs
VPC endpoints for service access
NAT gateways

Cost basics:

Reading the cost explorer
Cost-saving tactics: lifecycle policies on S3, instance scheduling, Reserved Instances vs Savings Plans
The most common ways to surprise yourself with a bill: leaving an EC2 running, leaving a Redshift cluster on, accidentally enabling expensive features

Recommended resources:

AWS Skill Builder (free) — solid foundational courses
Adrian Cantrill’s AWS courses (paid, but the gold standard for hands-on quality)
A Cloud Guru / Pluralsight (paid, alternative)

Time investment: 8 weeks at 10-15 hours per week.

Phase 2 (Months 2-4): Storage and ingestion services

The data layer foundation.

S3 (Simple Storage Service) — the most important service to master:

Buckets, objects, prefixes
Storage classes: Standard, Intelligent-Tiering, Standard-IA, Glacier, Glacier Deep Archive
Lifecycle policies
Versioning and replication
Bucket policies vs IAM policies vs ACLs (and why ACLs are deprecated)
Server-side encryption (SSE-S3, SSE-KMS, SSE-C)
Object Lock for compliance/WORM
S3 Select and S3 Object Lambda
Performance: prefix design, request rates, multi-part upload

Modern table formats on S3:

Apache Iceberg — now the default in 2026 for new lakes
Delta Lake — Databricks ecosystem, less common in pure-AWS environments
Apache Hudi — niche but supported
The differences and when to choose each

S3 best practices:

Layered architecture: raw → bronze → silver → gold (or raw → conformed → curated)
Compaction strategies for Iceberg/Delta
Schema evolution patterns
Partitioning strategies that don’t backfire

Other storage:

DynamoDB — operational data, key-value, document workloads
DocumentDB / RDS — for OLTP workloads that feed analytics
Aurora — managed Postgres/MySQL with cloud-native architecture
ElastiCache (Redis/Memcached) — caching layer

Ingestion services:

AWS DMS (Database Migration Service) — for replicating from operational databases
AWS DataSync — for bulk transfer from on-premises
AWS Transfer Family — SFTP/FTPS/FTP into S3
Kinesis Data Firehose — managed delivery to S3, Redshift, OpenSearch

Time investment: 8 weeks.

Phase 3 (Months 3-5): Compute and ETL services

The processing layer.

AWS Glue — the central ETL service in 2026:

Glue Data Catalog (the metadata store underneath everything else)
Glue ETL Jobs (Spark and Python)
Glue Crawlers
Glue Studio (visual development)
Glue DataBrew (no-code data prep)
Workflow orchestration (less commonly used; people prefer Step Functions or Airflow)
Cost considerations — Glue can be expensive if you don’t manage worker counts

EMR (Elastic MapReduce) — for the heavy workloads:

Persistent vs transient clusters
EMR Serverless (the modern recommendation for most non-edge cases)
EMR on EKS (for cost-conscious large-scale)
Spark on EMR — knowing when this is the right tool
Hive, Presto/Trino on EMR (less common in 2026)

In 2026 most teams use Glue for the smaller workloads and EMR Serverless for the bigger ones. EMR with persistent clusters is mostly legacy.

Step Functions — workflow orchestration:

State machines and the workflow language
Standard vs Express workflows
Integration with Lambda, Glue, ECS
Error handling and retries

Lambda for data engineering:

Event-driven processing patterns (S3 event → Lambda)
Lambda’s 15-minute execution limit and what it means
When Lambda is the wrong tool

MWAA (Managed Workflows for Apache Airflow):

The 2026 default for serious orchestration on AWS
Airflow vs Step Functions trade-offs
DAG patterns for data pipelines

dbt on AWS:

dbt Core on Glue, EMR, or Redshift
dbt Cloud as managed alternative
dbt + Airflow patterns

Time investment: 10 weeks, in parallel with continued S3/Glue work.

Phase 4 (Months 4-6): Analytics and warehousing

The serving layer.

Amazon Redshift — the warehouse:

Provisioned vs Serverless (Serverless is the recommendation in 2026 for most teams)
Cluster sizing and node types (RA3 nodes are now standard)
Distribution and sort keys
Materialized views and AQUA
Federated queries (querying RDS/Aurora directly from Redshift)
Spectrum (querying S3 from Redshift)
Data sharing (cross-account, cross-region)
Concurrency scaling
Workload management (WLM)

Athena — serverless SQL on S3:

Querying Iceberg, Parquet, ORC, JSON, CSV
Athena Engine v3 (Trino under the hood)
Apache Iceberg support
Federated queries to RDS, DynamoDB, etc.
Cost optimization (partition pruning, columnar formats, query result reuse)

OpenSearch (formerly Elasticsearch):

Use cases: log analytics, full-text search, observability
Cluster sizing and shard design
OpenSearch Serverless (newer, still maturing)

QuickSight — AWS-native BI:

Reasonable for AWS-native shops, less common as primary BI tool in 2026
SPICE engine for in-memory speed
Embed APIs for app integration

Choosing the right warehouse pattern:

Pure Redshift: simple, fast, AWS-native, costs more
Athena + S3 + Iceberg: serverless, cheap for moderate volumes, more pieces to assemble
Snowflake on AWS: hybrid path that many teams choose
Databricks on AWS: hybrid path for ML-heavy teams

Time investment: 10 weeks.

Phase 5 (Months 5-7): Streaming and real-time

Where many junior data engineers underinvest. Streaming knowledge sets you apart.

Amazon Kinesis ecosystem:

Kinesis Data Streams (the backbone)
Kinesis Data Firehose (managed delivery)
Kinesis Data Analytics for Apache Flink (managed Flink)
Kinesis Video Streams (specialized)

MSK (Managed Streaming for Apache Kafka):

When to choose Kafka over Kinesis
MSK Serverless
MSK Connect

Stream processing patterns:

Real-time aggregations
CDC (Change Data Capture) from operational databases via DMS or Debezium
Stream-to-warehouse (Firehose to Redshift)
Stream-to-data-lake (Firehose to Iceberg on S3)

Practical streaming projects:

A simple producer/consumer using Kinesis or Kafka
A CDC pipeline replicating from RDS to Redshift
A real-time aggregation using Flink

Time investment: 6-8 weeks.

Phase 6 (Months 6-8): Security, monitoring, governance

The make-or-break-in-production phase.

Security:

IAM at depth (policies, conditions, ABAC vs RBAC)
KMS — keys, key rotation, cross-account access
Secrets Manager and Parameter Store
VPC endpoints to keep traffic private
Lake Formation for fine-grained access control on the lake
Network isolation patterns

Monitoring and observability:

CloudWatch metrics, logs, alarms
CloudTrail for audit logs
AWS X-Ray for tracing
AWS Config for compliance monitoring
Cost anomaly detection

Data governance:

AWS Glue Data Catalog as a metadata foundation
Lake Formation for permissions and tag-based policies
Data lineage (DataZone, custom solutions)
AWS DataZone (newer, AWS’s answer to a unified data marketplace)
Integration with broader data governance programs
See also: data governance framework for the non-AWS-specific governance picture

Cost control:

Cost Explorer, Cost and Usage Reports
Budgets and alerts
Tagging strategy (tag everything; reports depend on it)
The most expensive AWS data services and how to spot runaway costs

Time investment: 6-8 weeks.

Phase 7 (Months 7-9): The DEA-C01 exam

Once you’ve done the hands-on work above, the cert exam is a ~6-8 week additional sprint.

Preparation strategy:

Take an end-to-end course. Stephane Maarek (Udemy) and Adrian Cantrill both have DEA-C01 courses; Cantrill is more thorough, Maarek is more efficient.
Use TutorialsDojo or Whizlabs practice tests. These are the highest-quality practice tests for AWS exams; if you can score 80%+ consistently, you’re ready.
Read the exam guide thoroughly. AWS publishes the official exam guide listing every domain and sub-objective.
Practice service-selection thinking. Most exam questions are scenarios where you pick the right service. Practice this mode.
Schedule the exam in advance. Having a date forces studying. $150 USD; book through Pearson VUE.

Common pitfalls on DEA-C01:

Out-of-date study material referencing DAS-C01 (the exam being replaced)
Missing recent service updates (Iceberg support, Glue updates, Redshift Serverless)
Underweighting security/governance domain (it’s 18% of the exam)
Memorizing service trivia instead of understanding service trade-offs

Pass rate is decent (around 50-65% of people who take it pass first time). Most failures are people who under-prepared the security/governance domain.

Phase 8: The portfolio that gets interviews

Cert plus portfolio is the winning combination. Cert alone is not enough.

Strong portfolio projects for AWS data engineering:

End-to-end data lake — ingest a public dataset, land in S3, transform with Glue, register in Iceberg, query with Athena, optional Redshift Spectrum, dashboard with QuickSight or Tableau. Document every architectural decision in the README.
CDC pipeline — set up a small RDS Postgres database, configure DMS for change data capture, land changes in S3, materialize a current-state table in Redshift Serverless. Show schema evolution handling.
Streaming pipeline — Kinesis Data Streams ingestion of a synthetic event source, real-time aggregation with Kinesis Data Analytics for Flink, output to S3 and OpenSearch.
Cost-optimized ETL — same workload built two different ways (Glue PySpark vs Athena CTAS), with cost analysis showing the trade-offs. This is interview gold.
Production-grade pipeline — orchestration in Step Functions or MWAA, with monitoring, alerting, retry logic, lineage documentation, IaC (CloudFormation or Terraform).

For each project:

Code on GitHub with clean commit history
README explaining what the pipeline does, design choices, results
Architecture diagram (drawn in draw.io or similar)
IaC where reasonable (Terraform or CloudFormation)
Cost analysis section

Avoid:

Tutorial copy-paste with no original content
Projects that don’t show the trade-offs (why Glue vs EMR? Why Iceberg vs Delta?)
Skipping the IaC — for senior roles, Terraform fluency is expected

Salary and market expectations 2026

US base salaries for AWS data engineering roles (Glassdoor, levels.fyi, Robert Half):

Junior (0-2 years): $90,000-115,000 mid-cost / $105,000-140,000 high-cost
Mid-level (2-5 years): $115,000-150,000 / $135,000-180,000
Senior (5+ years): $150,000-200,000 / $180,000-250,000
Staff/Principal: $200,000-275,000 / $250,000-360,000

Total comp at FAANG: senior data engineer roles at AWS, Amazon retail, Google, Meta, Microsoft typically clear $300K-450K total comp.

Market notes:

Strong demand in healthcare, financial services, e-commerce, supply chain — all data-heavy verticals running on AWS
Pure data engineering roles are slightly more stable than data science roles in 2026 (less ML-hype-cycle exposure)
AWS-specific data engineering pays a small premium over generalist DE in companies that are AWS-committed
Remote work is still common in DE roles; many companies have settled on flexible/hybrid

The hiring loop typically:

Recruiter screen
Technical phone screen — SQL + Python, sometimes AWS service trivia
Take-home or live coding (build a small pipeline)
Onsite/virtual onsite (3-5 hours): system design, deeper coding, architectural discussion, behavioral
Reference checks

Plan 5-10 loops to offer.

Common mistakes that waste months

Studying for cert before doing hands-on. Cert knowledge without service experience is recognizable to interviewers. Build first, certify second.

Memorizing service catalogs without understanding trade-offs. “Glue is for ETL” is not the same as “I’d choose Glue over EMR for this workload because…” The latter wins interviews.

Underinvesting in IAM. Senior roles get filtered by IAM depth. Spend more time on it than feels comfortable.

Skipping IaC. Building everything via console is fine for learning. Production work uses Terraform or CloudFormation. Get fluent in one.

Ignoring cost. Senior data engineers know what their pipelines cost. Junior candidates who don’t are an obvious flag.

Treating “AWS data engineer” as the destination. AWS-specific specialization is a starting point. Most senior data engineers are cloud-agnostic by year 5.

Beyond entry-level: where to specialize

After year 2-3, paths open up:

Senior Data Engineer — broader pipeline ownership, more architectural decisions, mentoring juniors

Analytics Engineer — overlap with DE but deeper on dbt, modeled data products, business semantics

ML Platform Engineer — bridges DE and ML engineering, owns feature stores, model serving infrastructure

Data Platform Engineer — owns the underlying platform (warehouse selection, catalog, orchestration), often a higher-impact role at scale

Cloud Solutions Architect — pivots from DE specialist to multi-discipline cloud architect

Specialist verticals — healthcare data engineer, fintech DE, gaming DE — domain knowledge plus DE skills pays well

Closing: AWS-first vs cloud-agnostic career strategy

A strategic question worth thinking about: should you specialize in AWS or build cloud-agnostic skills?

Pure AWS specialization wins when:

You’re targeting AWS-committed enterprises (large healthcare, financial services, government)
You want a clear, well-paid niche with predictable demand
You’re early career and need a focused identity

Cloud-agnostic generalist wins when:

You want to be senior at a company-of-the-future that picks platforms based on workload
You’re at year 3-5+ and want to broaden
The vendor consolidation through 2030 makes single-vendor expertise riskier

The honest 2026 answer: start AWS-specialized for a clear identity, broaden by year 3 to include at least one of GCP or Azure, dbt fluency, and at least one general open-source compute framework (Spark or Flink). The candidates who do best long-term are AWS-fluent generalists, not AWS-only specialists.

The DEA-C01 cert and a strong portfolio gets you in the door. Year 2-5 of deliberate skill building gets you to senior. The data engineer market in 2026 is solid; the path is clear if you follow it.