
Google Professional Data Engineer (GCP-PDE) Exam Prep


Master GCP-PDE with domain-mapped lessons, labs, and exam-style practice.

Beginner · gcp-pde · google · professional-data-engineer · bigquery

Prepare for the Google Professional Data Engineer (GCP-PDE) exam

This beginner-friendly exam-prep course is built specifically around the official Google Professional Data Engineer exam domains. You’ll learn how Google expects you to think: start from business requirements, choose the right Google Cloud services, design dependable pipelines, and operate them safely at scale. The course emphasizes the tools most commonly tested in real scenarios—BigQuery, Dataflow (Apache Beam), Pub/Sub, Cloud Storage, and ML/analytics workflows—while keeping the focus on exam objectives rather than product trivia.

What this course covers (mapped to official exam domains)

  • Design data processing systems: translate requirements into architectures, choose batch vs streaming, and design for security, reliability, and cost.
  • Ingest and process data: ingestion patterns for files, databases, and events; Dataflow streaming (windows/triggers) and batch choices (Dataflow vs Dataproc vs BigQuery).
  • Store the data: pick the right storage service and model data effectively for BigQuery performance, governance, and access control.
  • Prepare and use data for analysis: BigQuery SQL patterns, optimization, and how analytics connects to BI and ML workflows.
  • Maintain and automate data workloads: monitoring, alerting, orchestration, CI/CD, and operational playbooks aligned to exam scenarios.

How the 6-chapter “book” is structured

Chapter 1 orients you to the exam: registration, remote testing readiness, scoring expectations, and a practical study plan for beginners. Chapters 2–5 are the core of the program, each mapped to one or two official domains and designed to build the mental models you need for scenario-based questions. Chapter 6 is a full mock exam with rationales and a final review so you can identify weak spots and tighten your strategy before test day.

Practice in the style of the real exam

The GCP-PDE exam is heavily scenario-driven. This course therefore emphasizes trade-offs: when to choose BigQuery vs Cloud Storage, Dataflow vs Dataproc, streaming vs micro-batch, and how to meet SLAs while controlling cost and risk. Each core chapter includes exam-style practice prompts designed to reinforce objective-level decisions (security, correctness, performance, governance, and operations), not just “how-to” steps.

Why this helps you pass (even as a beginner)

If you’re new to certification exams, the hardest part is learning how questions are framed and what details matter. This course teaches a repeatable approach: clarify requirements, eliminate mismatched services, validate with reliability/security/cost checks, then choose the simplest architecture that meets the constraints. You’ll also learn common anti-patterns Google tests for—like poor partitioning choices, incorrect windowing semantics, insufficient IAM boundaries, and brittle orchestration designs.

Get started on Edu AI

You can begin immediately and follow the chapter sequence as a guided plan. To create your learner account, use the Register free option. If you want to compare this with other certification paths, you can also browse all courses.

By the end of the course, you’ll be able to map any exam scenario to the official domains, pick the right Google Cloud services with confidence, and approach the GCP-PDE exam with a clear timing strategy and a tested checklist.

What You Will Learn

  • Design data processing systems aligned to reliability, scalability, security, and cost constraints
  • Ingest and process batch and streaming data using Google Cloud patterns (Pub/Sub, Dataflow, Dataproc)
  • Store the data using the right storage and modeling choices across BigQuery, Cloud Storage, and operational stores
  • Prepare and use data for analysis with BigQuery SQL, optimization, governance, and ML/BI integration
  • Maintain and automate data workloads with monitoring, CI/CD, orchestration, and incident response best practices

Requirements

  • Basic IT literacy (networking, files, command line basics)
  • Comfort using a web browser and cloud consoles
  • No prior certification experience required
  • Helpful (not required): basic SQL familiarity and a general understanding of ETL/ELT concepts

Chapter 1: GCP-PDE Exam Orientation and Study Plan

  • Understand the exam format, domains, and question styles
  • Registration, scheduling, and remote testing readiness
  • Build a beginner-friendly 4-week study strategy
  • Set up a hands-on practice environment on Google Cloud

Chapter 2: Design Data Processing Systems (Domain 1)

  • Translate business requirements into data architecture
  • Choose batch vs streaming designs and patterns
  • Design for security, compliance, and governance
  • Domain 1 practice set: scenario-based architecture questions

Chapter 3: Ingest and Process Data (Domain 2)

  • Build ingestion strategies for files, databases, and events
  • Implement streaming pipelines with Pub/Sub and Dataflow
  • Implement batch pipelines with Dataflow and Dataproc
  • Domain 2 practice set: pipeline debugging and correctness questions

Chapter 4: Store the Data (Domain 3)

  • Select storage services for analytics, lakes, and serving
  • Model data for BigQuery performance and governance
  • Manage lifecycle, partitioning, and cost controls
  • Domain 3 practice set: storage and modeling scenario questions

Chapter 5: Prepare & Use Data for Analysis + Maintain & Automate (Domains 4-5)

  • Analyze and optimize with BigQuery SQL and BI patterns
  • Operationalize ML pipelines with BigQuery ML and Vertex AI integration
  • Automate workflows with orchestration and CI/CD
  • Domains 4-5 practice set: troubleshooting, monitoring, and automation questions

Chapter 6: Full Mock Exam and Final Review

  • Mock Exam Part 1
  • Mock Exam Part 2
  • Weak Spot Analysis
  • Exam Day Checklist

Priya Nandakumar

Google Cloud Certified: Professional Data Engineer (Instructor)

Priya Nandakumar is a Google Cloud Certified Professional Data Engineer who has designed and delivered exam-prep programs for analytics, streaming, and ML data platforms. She specializes in translating Google exam objectives into practical architecture decisions and test-taking strategies.

Chapter 1: GCP-PDE Exam Orientation and Study Plan

This chapter sets your trajectory for the Google Cloud Professional Data Engineer (GCP-PDE) exam by clarifying what the exam is really testing, how questions are written, and how to prepare efficiently with a 4-week plan and a hands-on lab environment. The PDE exam rewards architectural judgment more than memorization: you must repeatedly choose designs that balance reliability, scalability, security, and cost—often under constraints like “minimal operational overhead,” “regulatory requirements,” or “near-real-time analytics.”

You will see scenarios spanning batch and streaming ingestion (Pub/Sub, Dataflow, Dataproc), storage and modeling decisions (BigQuery, Cloud Storage, and operational stores), analytics enablement (BigQuery SQL, optimization, governance, ML/BI integration), and operational excellence (monitoring, CI/CD, orchestration, incident response). In other words: the exam expects you to think like an on-call data engineer who can ship durable systems, not just write pipelines.

Exam Tip: When two answers both “work,” the correct one usually best matches Google Cloud’s managed-service bias: prefer serverless/managed options (Dataflow, BigQuery, Pub/Sub) when the prompt emphasizes reduced ops, elasticity, and reliability—unless a requirement explicitly forces cluster control (custom libraries, HDFS, Spark tuning), which points toward Dataproc.
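The Dataflow-vs-Dataproc heuristic above can be sketched as a tiny decision rule. This is an illustrative study aid only, not an official Google decision tree; the signal keywords are shorthand invented here:

```python
# Toy rule of thumb: prefer serverless/managed processing unless the
# prompt explicitly demands cluster-level control.
CLUSTER_SIGNALS = {"custom libraries", "hdfs", "spark tuning"}

def pick_processing_service(signals: set) -> str:
    if signals & CLUSTER_SIGNALS:
        return "Dataproc"   # cluster control is explicitly required
    return "Dataflow"       # managed-service bias: reduced ops, elasticity

print(pick_processing_service({"minimal ops", "streaming"}))  # Dataflow
print(pick_processing_service({"spark tuning"}))              # Dataproc
```

Encoding your decision rules this explicitly is also a useful note-taking habit: each rule becomes a testable "if requirement X, prefer Y" statement.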

The rest of this chapter walks you through the exam format and question styles, registration and remote testing readiness, a beginner-friendly 4-week study strategy, and how to set up a safe practice environment on Google Cloud with IAM and cost controls.

Practice note for Understand the exam format, domains, and question styles: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Registration, scheduling, and remote testing readiness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build a beginner-friendly 4-week study strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set up a hands-on practice environment on Google Cloud: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 1.1: Certification overview and role expectations

The Google Professional Data Engineer certification validates that you can design, build, operationalize, secure, and monitor data processing systems on Google Cloud. On the exam, you are rarely asked “what is X?” Instead, you are asked to choose the best end-to-end approach given business goals and constraints. Your mental model should be “responsible engineer”: you need to pick services, configure them correctly, and anticipate failure modes and operating realities.

Role expectations map closely to the course outcomes: (1) design reliable, scalable, secure, cost-aware data systems; (2) ingest and process both batch and streaming data using common patterns (Pub/Sub → Dataflow; GCS → Dataproc/Dataflow; CDC into BigQuery); (3) store data with correct format and lifecycle (GCS for raw/archival, BigQuery for analytics, operational stores when low-latency point lookups are required); (4) enable analytics with BigQuery SQL, partitioning/clustering, governance and access controls, and integration with BI/ML; (5) maintain pipelines with monitoring, alerting, orchestration, and CI/CD.

Common exam trap: treating the problem as a single-service decision. Many questions are about the seams—identity boundaries, schema evolution, late-arriving events, replay, backfills, and cost growth. If you only optimize for one dimension (e.g., lowest latency) and ignore operational burden or governance, you’ll pick an answer that feels “powerful” but is not “professional.”

Exam Tip: Look for requirement keywords. “Exactly-once,” “deduplicate,” “late data,” “replay,” and “event time” usually imply Dataflow streaming with windowing/triggers plus idempotent writes. “Minimal ops” points away from self-managed clusters. “Auditable access” points to IAM + logs + governance features (e.g., BigQuery authorized views, policy tags).
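To make the "exactly-once" keyword concrete, here is a minimal sketch of the underlying pattern: at-least-once delivery made safe by an idempotent, deduplicating sink. The message shape and IDs are hypothetical:

```python
# At-least-once delivery means redeliveries happen; dedup by message ID
# makes the sink's effect exactly-once.
def apply_once(sink: dict, seen: set, message: dict) -> None:
    msg_id = message["id"]
    if msg_id in seen:          # duplicate redelivery: ignore
        return
    seen.add(msg_id)
    sink[message["key"]] = sink.get(message["key"], 0) + message["value"]

sink, seen = {}, set()
events = [
    {"id": "m1", "key": "clicks", "value": 3},
    {"id": "m2", "key": "clicks", "value": 2},
    {"id": "m1", "key": "clicks", "value": 3},  # redelivered duplicate
]
for e in events:
    apply_once(sink, seen, e)
print(sink)  # {'clicks': 5} — the duplicate did not double-count
```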

Section 1.2: Exam domains breakdown and weighting mindset

Even if the published domain weights shift over time, the PDE exam consistently clusters around a few durable skill areas: designing data processing systems; building/operationalizing pipelines; choosing storage systems and data models; analyzing and enabling ML/BI; and ensuring security, governance, and reliability. Your study strategy should reflect “weighting mindset,” not rote percentages: master the patterns that appear across many scenarios.

A practical breakdown for studying is: (1) architecture decisions (batch vs streaming, managed vs self-managed, reliability patterns); (2) ingestion and processing (Pub/Sub, Dataflow templates, Dataproc/Spark, schema and serialization choices); (3) storage and modeling (BigQuery partitioning/clustering, GCS formats like Parquet/Avro, data lakes vs warehouses, operational stores for serving); (4) analytics optimization and governance (SQL performance, materialized views, access boundaries, data quality); (5) operations (monitoring, SLOs, alerting, CI/CD, orchestration with Cloud Composer/Workflows, incident response).

Common exam trap: over-indexing on service trivia instead of “fit.” For instance, knowing every BigQuery feature name is less important than recognizing when to use partitioning to reduce scan cost, clustering to improve selective queries, or separation of raw/curated datasets to simplify governance. Another trap: defaulting to “BigQuery for everything.” The exam expects you to place data in the right tier: GCS for low-cost storage and reprocessing, BigQuery for analytic queries, and operational stores for low-latency reads/writes when required.
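A toy cost model shows why partition pruning matters under bytes-scanned pricing. All table sizes and dates below are hypothetical:

```python
# Under on-demand pricing you pay for bytes scanned; a partition filter
# lets the engine skip whole partitions instead of scanning the table.
def bytes_scanned(partitions: dict, date_filter=None) -> int:
    """partitions maps 'YYYY-MM-DD' -> partition size in bytes."""
    if date_filter is None:
        return sum(partitions.values())    # full table scan
    return partitions.get(date_filter, 0)  # prune to one partition

table = {
    "2024-01-01": 50 * 10**9,
    "2024-01-02": 50 * 10**9,
    "2024-01-03": 50 * 10**9,
}
print(bytes_scanned(table))                 # 150 GB: no filter
print(bytes_scanned(table, "2024-01-02"))   # 50 GB: filter prunes 2/3
```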

Exam Tip: When a prompt emphasizes “analytics at scale with minimal management,” BigQuery is often central. When it emphasizes “stream processing with transformations and windowing,” Dataflow is a strong default. When it emphasizes “Spark/Hadoop migration” or “custom Spark jobs,” Dataproc becomes relevant.

Section 1.3: Registration, scheduling, ID checks, and policies

Registration and scheduling are not just administrative—they reduce test-day risk. Plan your exam delivery method (test center vs online proctoring) early so you can align your practice routine with the actual environment. Online proctoring adds constraints: a clean desk, stable internet, and strict identity and room checks. Test centers reduce home-tech uncertainty but require travel and timing buffers.

For ID checks, assume strict matching: your legal name on the registration must match your government ID, and you typically need an approved photo ID. If you use remote proctoring, you may also need to show the room with your camera, remove additional monitors, and keep your phone out of reach. Read policy details ahead of time—rescheduling windows, prohibited items, and rules about breaks. Many candidates lose time and focus due to avoidable logistics.

Common exam trap: treating “remote readiness” as a last-minute checklist. Your goal is zero surprises: run a system check on the same machine, same network, same room you’ll use on exam day; confirm you can connect without corporate VPN restrictions; and practice a full 2-hour seated session to simulate fatigue.

Exam Tip: Schedule the exam for a time when your energy is highest and interruptions are least likely. Your score is strongly correlated with focus and time discipline, not cramming the night before.

Section 1.4: Scoring approach, case studies, and time management

The PDE exam is scenario-driven: you will face multi-step narratives, sometimes resembling mini case studies, and you must select the “best” answer, not merely a correct one. Scoring is not about perfection; it’s about consistent good judgment across domains. That means your strategy should minimize unforced errors: misreading constraints, missing a keyword, or choosing an over-engineered design.

Expect question styles such as: selecting an architecture for streaming ingestion; choosing storage formats and partitioning; improving reliability (retries, dead-letter, backpressure); implementing least-privilege access; diagnosing performance or cost issues; and deciding among Dataflow/Dataproc/BigQuery based on requirements. Case study-like questions often embed constraints like “must support GDPR deletion,” “must be auditable,” “must process late events,” or “must minimize cost.” Each constraint rules out some otherwise attractive options.

Time management approach: do a first pass to answer “confident” questions quickly, mark ambiguous ones, and return. On complex prompts, spend the first 10–15 seconds extracting constraints and writing a mental checklist: latency target, volume, schema evolution, governance, and ops burden. Then evaluate answers against that checklist. If two options remain, pick the one that uses managed services and simplest operations while meeting requirements.
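The checklist pass above can be sketched as a small elimination filter: drop any option that violates a stated constraint, then prefer the lowest operational burden among survivors. The option names and fields are invented for illustration:

```python
# Score candidate answers against the constraints extracted from the prompt.
def choose_answer(constraints: set, options: list) -> str:
    viable = [o for o in options if constraints <= o["satisfies"]]
    viable.sort(key=lambda o: o["ops_burden"])  # simplest ops wins ties
    return viable[0]["name"] if viable else "none"

options = [
    {"name": "self-managed Spark",
     "satisfies": {"latency", "late data"}, "ops_burden": 3},
    {"name": "Dataflow streaming",
     "satisfies": {"latency", "late data", "auditable"}, "ops_burden": 1},
]
print(choose_answer({"latency", "auditable"}, options))  # Dataflow streaming
```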

Exam Tip: Beware of distractor answers that are “more complex = more correct.” The exam rewards fit-for-purpose. If the prompt says “simple” or “minimal ops,” a self-managed Spark cluster with custom orchestration is usually a trap.

Section 1.5: Study plan, note-taking, and spaced repetition

A beginner-friendly 4-week plan should mix concept learning with hands-on labs and systematic review. The goal is to build pattern recognition: when you see “streaming with late data,” you immediately think windowing + triggers; when you see “optimize BigQuery cost,” you think partitioning/clustering, predicate pushdown, and avoiding SELECT * on wide tables.
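The windowing reflex can be made concrete with a simplified event-time sketch. Real Beam semantics add triggers and accumulation modes; the timestamps below are arbitrary seconds:

```python
# Assign an event timestamp to a fixed (tumbling) window, and decide
# whether an event is "late" relative to a watermark and allowed lateness.
def tumbling_window(event_ts: int, size: int) -> tuple:
    start = (event_ts // size) * size
    return (start, start + size)

def is_late(event_ts: int, watermark: int, size: int,
            allowed_lateness: int) -> bool:
    _, end = tumbling_window(event_ts, size)
    return watermark > end + allowed_lateness  # window already closed

print(tumbling_window(125, 60))  # (120, 180)
# Event at t=100 falls in window [60, 120), which closes at 120 + 30 = 150:
print(is_late(100, watermark=200, size=60, allowed_lateness=30))  # True
print(is_late(100, watermark=140, size=60, allowed_lateness=30))  # False
```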

  • Week 1 (Foundations and architecture): learn core services and when to use them: Pub/Sub, Dataflow, Dataproc, BigQuery, Cloud Storage, IAM, and basic networking/security concepts. Build a simple pipeline end-to-end, even if it’s small.
  • Week 2 (Processing patterns): practice batch vs streaming, windowing, retries, dead-letter topics, idempotent sinks, schema evolution, and data formats (Avro/Parquet).
  • Week 3 (Analytics and governance): focus on BigQuery modeling, partitioning, clustering, query optimization, authorized views, policy tags, and data lifecycle.
  • Week 4 (Operations and exam readiness): monitoring/alerting, orchestration, CI/CD concepts, incident response, and timed practice sets. In the final week, do at least two full-length timed sessions and review mistakes deeply.

Note-taking should be “decision rules,” not lecture transcripts. Create a living cheat sheet: “If requirement is X, prefer Y; unless Z.” Spaced repetition: convert frequent traps and decision rules into flashcards and review them daily (5–15 minutes). Track weak areas in a log: service confusion (Dataflow vs Dataproc), governance (IAM boundaries), or optimization (partitioning/clustering).
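The daily flashcard routine can follow a simple Leitner-style scheduler: correct answers promote a card to a longer interval, misses reset it. The intervals below are assumptions, not a prescribed schedule:

```python
# Minimal Leitner-box spaced-repetition scheduler.
INTERVALS = [1, 3, 7, 14]  # days between reviews per box (assumed values)

def review(card: dict, correct: bool) -> dict:
    box = min(card["box"] + 1, len(INTERVALS) - 1) if correct else 0
    return {"name": card["name"], "box": box, "next_in_days": INTERVALS[box]}

card = {"name": "Dataflow vs Dataproc", "box": 0}
card = review(card, correct=True)   # promoted to box 1: review in 3 days
card = review(card, correct=False)  # missed: back to box 0, review tomorrow
print(card)  # {'name': 'Dataflow vs Dataproc', 'box': 0, 'next_in_days': 1}
```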

Exam Tip: Every incorrect practice answer must end with a rule you can reuse. If your takeaway is only “I got it wrong,” you won’t improve fast enough.

Section 1.6: Lab setup: projects, IAM basics, and cost controls

Your hands-on practice environment should be safe, reproducible, and cheap. Create a dedicated Google Cloud project (or a small set: dev + sandbox) so experiments don’t pollute work resources and so you can reset easily. Enable billing, then immediately set guardrails: budgets, alerts, and cleanup habits. Most PDE labs involve BigQuery, Cloud Storage, Pub/Sub, and Dataflow; those can incur surprise costs if you leave streaming jobs running or store large data indefinitely.

IAM basics for labs: use least privilege even in practice. Create a dedicated user or service account for pipeline components and grant roles at the project or dataset level as narrowly as possible. Learn the difference between primitive roles (Owner/Editor) and predefined roles (e.g., BigQuery Data Editor, Pub/Sub Publisher). The exam frequently tests security posture implicitly: if an answer requires broad Owner access to “make it work,” it’s likely wrong. Also learn where auditability comes from: Cloud Audit Logs and resource-level permissions.

Cost controls: set a monthly budget with alert thresholds, and prefer small regions/datasets. For BigQuery, understand on-demand vs flat-rate (conceptually) and practice scanning less data using partition filters and selecting only needed columns. For Dataflow, always stop jobs when done; for Pub/Sub, expire unused subscriptions. Keep raw datasets in GCS with lifecycle rules to transition or delete.
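The budget guardrail reduces to simple threshold arithmetic. The 50/90/100 percent thresholds below mirror a common alerting style but are assumptions, as is the budget amount:

```python
# Report which budget alert thresholds have fired for the current spend.
def fired_alerts(budget: float, spend: float,
                 thresholds=(0.5, 0.9, 1.0)) -> list:
    return [t for t in thresholds if spend >= budget * t]

# With a $50 monthly budget:
print(fired_alerts(budget=50.0, spend=46.0))  # [0.5, 0.9]: stop Dataflow jobs
print(fired_alerts(budget=50.0, spend=10.0))  # []: no alerts yet
```

Note that budget alerts notify you; they do not stop resources, so the cleanup habits above (stopping jobs, expiring subscriptions, lifecycle rules) still matter.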

Exam Tip: In cost-related scenarios, the correct answer often combines architecture and hygiene: partitioning + lifecycle policies + managed services that autoscale. “Just buy bigger machines” is almost never the best choice on this exam.

Chapter milestones
  • Understand the exam format, domains, and question styles
  • Registration, scheduling, and remote testing readiness
  • Build a beginner-friendly 4-week study strategy
  • Set up a hands-on practice environment on Google Cloud
Chapter quiz

1. You are creating a 4-week study plan for a colleague who is new to Google Cloud but has general data engineering experience. They want the highest score with the least wasted effort. Which approach best aligns with what the Professional Data Engineer exam is primarily testing?

Show answer
Correct answer: Prioritize scenario-based design practice that weighs reliability, scalability, security, and cost, and validate choices with hands-on labs using managed services.
The PDE exam emphasizes architectural judgment and tradeoffs under constraints (reliability, scalability, security, cost). A managed-service, design-first approach with hands-on validation mirrors real exam scenarios. Memorizing feature lists/quota values is rarely the deciding factor in certification questions, and the exam is not primarily a coding/debugging test—implementation details matter, but usually in the context of design decisions.

2. A company is choosing between Dataflow and Dataproc for an upcoming exam-style scenario. The requirement states: "minimal operational overhead" and "elastic scaling" for near-real-time ingestion and transformation. No special cluster-level tuning or custom HDFS integration is required. Which option most closely matches the expected solution pattern?

Show answer
Correct answer: Use Cloud Dataflow with Pub/Sub for ingestion to minimize operations and scale automatically.
When prompts emphasize reduced ops and elasticity without requiring cluster control, the managed/serverless bias points to Dataflow (often with Pub/Sub for streaming ingestion). Dataproc is better when you need explicit cluster control (custom libraries, HDFS, Spark tuning). Custom VM-based streaming increases operational burden and is typically not favored when the scenario calls out minimal overhead.

3. Your team will take the PDE exam via online proctoring. One engineer has had issues with remote exams in the past. Which preparation step best reduces the risk of being unable to start or complete the exam due to environment issues?

Show answer
Correct answer: Run the official system check on the same device/network you will use, verify ID and room requirements, and schedule a buffer window to handle check-in.
Remote testing readiness is largely about compatibility and compliance: validating the proctoring environment ahead of time on the same setup reduces failure risk. Last-minute study does not address technical blockers. VPNs/remote desktops are commonly restricted or destabilizing for proctored exams and can cause disqualification or technical issues.

4. You are setting up a Google Cloud practice environment for a beginner-friendly 4-week plan. The goal is hands-on experience while preventing unexpected charges and limiting risk. Which setup is most appropriate?

Show answer
Correct answer: Create a dedicated project, apply a budget with alerts, use least-privilege IAM, and prefer serverless managed services where possible to reduce always-on costs.
A dedicated project with budgets/alerts and least-privilege IAM supports safe experimentation and cost control, which aligns with best practices for hands-on learning. Using production with broad Owner access increases security and operational risk. Always-on clusters/VMs are a common source of unexpected cost and contradict the cost-control objective.

5. During practice, a learner struggles because many questions have multiple plausible solutions. In an exam scenario, the prompt includes: "must meet regulatory requirements," "near-real-time analytics," and "minimal operational overhead." How should they choose between two solutions that both technically work?

Show answer
Correct answer: Select the option that best satisfies all stated constraints and aligns with managed-service patterns unless a requirement explicitly demands cluster-level control.
Certification questions often differentiate answers by how well they match stated constraints (compliance, latency, ops burden) and Google Cloud best-practice patterns (managed services for reliability and reduced ops). Using more services is not inherently better and can add complexity and cost. Personal familiarity is not a scoring criterion; the exam expects choices grounded in requirements and cloud architecture principles.

Chapter 2: Design Data Processing Systems (Domain 1)

Domain 1 of the Google Professional Data Engineer exam is where architecture decisions get tested: can you translate business outcomes into a cloud-native data design that is reliable, scalable, secure, and cost-aware? Expect scenario prompts that hide key constraints (latency, governance, regionality, recovery objectives) inside business language. Your job is to identify those constraints, select the simplest set of managed components that meets them, and avoid anti-patterns like over-engineering with self-managed clusters or mixing incompatible storage paradigms.

This chapter follows the same workflow you should use on exam questions: (1) extract requirements and explicit/implicit SLAs, (2) map them to a reference architecture (lake/warehouse/lakehouse), (3) select components for ingestion and processing (batch vs streaming patterns), (4) apply security-by-design, then (5) validate reliability and cost trade-offs. The final section highlights how to recognize correct answers and eliminate distractors without getting lost in product trivia.

As you read, keep a mental checklist: Who consumes the data (BI analysts, ML, operational apps)? What latency is needed (seconds vs hours)? What is the system of record and where is it located? What compliance regime applies (PCI, HIPAA, GDPR)? These are the signals the exam uses to guide you to the right architecture.

Practice note for Translate business requirements into data architecture: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose batch vs streaming designs and patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design for security, compliance, and governance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domain 1 practice set: scenario-based architecture questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 2.1: Requirements gathering: SLAs, SLOs, RPO/RTO, and constraints

Most Domain 1 questions are won before you pick a service. The exam expects you to translate vague business requirements into measurable targets: SLAs (what you promise externally), SLOs (your internal reliability/latency targets), and error budgets (the tolerated failure window). In data systems, the most common SLOs are end-to-end freshness (time from event creation to availability in BigQuery), pipeline success rate, and correctness guarantees (exactly-once vs at-least-once semantics and how duplicates are handled).

Disaster recovery objectives show up as RPO (how much data you can lose) and RTO (how fast you must recover). A “no data loss” requirement typically implies an RPO near zero and pushes you toward durable ingestion (for example, Pub/Sub with retention and replay, or writing raw files to Cloud Storage) and idempotent processing. A “recover within 1 hour” requirement affects orchestration choices, runbooks, and whether you rely on managed services vs self-managed clusters that take time to rebuild.

Constraints are often implicit: data residency (must stay in EU), security (CMEK required), networking (private IP only), cost ceilings, and operational constraints (“small team,” “no on-call,” “must be managed”). The correct answer usually uses managed services (Pub/Sub, Dataflow, BigQuery) when the prompt emphasizes limited ops capacity.

Exam Tip: When a scenario mentions “auditable,” “regulated,” or “least privilege,” write down governance constraints first. Many distractor answers satisfy performance but violate compliance (e.g., exporting regulated data to an unmanaged environment).

Common trap: conflating SLA/SLO with batch schedule. If the business says “reports must be updated every 15 minutes,” that is a freshness SLO, not necessarily a streaming requirement. You could meet it with micro-batch (scheduled Dataflow flex template, or BigQuery scheduled queries) if upstream systems can deliver data predictably. The exam rewards fitting the simplest design to the SLO.
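The freshness arithmetic behind that judgment can be sketched as a quick check. This is a back-of-the-envelope model, not an official formula; the interval and runtime figures are hypothetical:

```python
def worst_case_freshness_min(batch_interval_min: float,
                             pipeline_runtime_min: float,
                             upstream_delay_min: float = 0.0) -> float:
    """Worst case: an event lands just after a run starts, waits a full
    interval for the next run, then rides that run to completion."""
    return batch_interval_min + pipeline_runtime_min + upstream_delay_min

def meets_freshness_slo(slo_min: float, *, interval: float, runtime: float,
                        upstream_delay: float = 0.0) -> bool:
    return worst_case_freshness_min(interval, runtime, upstream_delay) <= slo_min

# A 15-minute freshness SLO is NOT met by a 15-minute schedule alone:
# worst case is 15 (wait) + 4 (run) = 19 minutes. A 5-minute schedule works.
print(meets_freshness_slo(15, interval=15, runtime=4))  # False
print(meets_freshness_slo(15, interval=5, runtime=4))   # True
```

The point mirrors the exam's logic: only when the worst-case micro-batch freshness exceeds the SLO do you need to reach for streaming.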

Section 2.2: Reference architectures: lake, warehouse, lakehouse on Google Cloud

The exam uses “lake,” “warehouse,” and “lakehouse” as shorthand for data organization and governance patterns. On Google Cloud, a data lake is commonly Cloud Storage (raw/bronze, cleaned/silver, curated/gold zones) plus metadata/governance (Dataplex, Data Catalog concepts) and processing engines (Dataflow/Dataproc). A data warehouse typically centers on BigQuery as the governed analytics store with modeled tables, performance controls, and BI integration.

A lakehouse combines lake storage with warehouse-like governance and performance. In GCP terms, that often means Cloud Storage as the landing zone and BigQuery as the serving layer, sometimes with BigLake/BigQuery external tables for unified access patterns. The exam objective is not to memorize labels, but to choose the right pattern given consumers, query needs, and governance.

Use a warehouse-first design when the workload is primarily analytics/BI with strong SQL modeling needs, many concurrent users, and requirements like row-level security and straightforward cost controls through reservations. Use a lake-first design when data types are heterogeneous (logs, images, semi-structured), schema evolves frequently, or you need inexpensive long-term retention. Use a lakehouse approach when you want raw retention and flexible processing, but still need a single governed SQL interface for most users.

Exam Tip: In scenario questions, look for phrases like “data scientists need raw event history” (lake) versus “executives need consistent KPIs with a semantic layer” (warehouse). If both appear, the best answer usually lands raw data in GCS and curates into BigQuery.

Common trap: treating Cloud Storage as an “analytics database.” GCS is an object store; it is excellent for durability and cost, but not for interactive BI unless paired with an engine (BigQuery external tables/BigLake, Dataproc, or Dataflow). Another trap is over-normalizing BigQuery like an OLTP database; the exam expects dimensional modeling or denormalized patterns for analytics, balanced with partitioning/clustering for performance.

Section 2.3: Component selection: Pub/Sub, Dataflow, Dataproc, BigQuery, GCS

Domain 1 tests whether you can match ingestion and processing patterns to requirements. Pub/Sub is the default choice for event ingestion when you need decoupling, elastic throughput, and replay. Dataflow is the default for unified batch/stream processing with managed autoscaling, windowing, and exactly-once processing semantics (when using supported sources/sinks and appropriate design). Dataproc is typically chosen when you need Spark/Hadoop ecosystem compatibility, custom libraries, or lift-and-shift of existing jobs; it carries more operational responsibility than Dataflow.

BigQuery is the primary analytical warehouse: use it for interactive SQL, managed storage/compute separation, governance features, and integration with BI/ML (for example, BigQuery ML). Cloud Storage is the durable landing and archival zone, and often the “system of record” for raw files. The exam frequently expects a pattern like: ingest (Pub/Sub or GCS) → transform (Dataflow/Dataproc) → serve (BigQuery) with GCS as raw retention.

Batch vs streaming is about latency and correctness needs. Streaming fits user-facing metrics, alerting, and operational dashboards requiring seconds-to-minutes freshness. Batch fits daily finance closes, periodic reporting, and large backfills. Hybrid designs are common: stream into a staging dataset, then periodic batch compaction/curation into partitioned BigQuery tables.

Exam Tip: If you see “late data,” “out-of-order events,” “sessionization,” or “sliding windows,” that is a strong signal for Dataflow streaming with windowing and triggers—not a simple batch ETL.

Common traps: (1) using Dataproc for simple transformations that Dataflow handles with less ops, (2) ignoring idempotency—streaming pipelines must handle duplicates and retries, and (3) writing to BigQuery in a way that causes hot partitions (e.g., constantly updating a single partition). The exam likes patterns that append to partitioned tables, use clustering for selective filters, and perform merges in controlled batch jobs when needed.
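The "merge in controlled batch jobs" pattern can be illustrated by generating a BigQuery MERGE statement keyed on a stable identifier, which makes retried loads idempotent. A minimal sketch with hypothetical table and column names (string templating, not an official client API):

```python
def build_merge_sql(target: str, staging: str, key_cols: list[str],
                    update_cols: list[str]) -> str:
    """Render a BigQuery MERGE that upserts staged rows by a stable key,
    so re-running the same batch load does not create duplicates."""
    on = " AND ".join(f"T.{c} = S.{c}" for c in key_cols)
    sets = ", ".join(f"T.{c} = S.{c}" for c in update_cols)
    cols = ", ".join(key_cols + update_cols)
    vals = ", ".join(f"S.{c}" for c in key_cols + update_cols)
    return (f"MERGE `{target}` T USING `{staging}` S ON {on} "
            f"WHEN MATCHED THEN UPDATE SET {sets} "
            f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})")

sql = build_merge_sql("proj.ds.orders", "proj.ds.orders_staging",
                      key_cols=["order_id"],
                      update_cols=["status", "updated_at"])
print(sql)
```

Streaming appends land in a staging table; a scheduled job then merges into the curated partitioned table, keeping hot-partition pressure off the serving layer.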

Section 2.4: Security-by-design: IAM, service accounts, CMEK, VPC-SC basics

Security design is not an afterthought in Domain 1. The exam expects least privilege with IAM: grant roles at the smallest scope (project, dataset, topic, bucket) and prefer predefined roles over broad primitive roles. Service accounts should represent workloads (Dataflow worker service account, CI/CD deployer account) and be granted only the permissions required to read sources and write sinks.

Customer-managed encryption keys (CMEK) appear in regulated scenarios. Know the implication: you must manage Cloud KMS keys, grant the relevant service accounts permission to use the key, and plan for key rotation and availability (KMS access affects workloads). For BigQuery and GCS, CMEK is a common requirement for compliance, and the correct option often combines CMEK with audit logging and controlled sharing.

VPC Service Controls (VPC-SC) is frequently tested at a “basic concept” level: it creates service perimeters that reduce data exfiltration risk by restricting access to supported Google APIs from outside the perimeter. In exam scenarios involving highly sensitive data and concerns about exfiltration, adding VPC-SC around BigQuery and GCS is a strong architectural move, especially combined with Private Google Access and restricted egress.

Exam Tip: If a question mentions “prevent data exfiltration” or “limit access from the public internet,” consider VPC-SC and private connectivity patterns before jumping to network firewalls alone.

Common trap: confusing IAM authorization with network isolation. IAM answers “who can access,” while VPC-SC and private service access patterns address “from where” access can occur and reduce the blast radius of stolen credentials. Another trap is using a single shared service account for multiple pipelines; on the exam, that usually violates least privilege and auditability.

Section 2.5: Reliability and cost design: quotas, autoscaling, reservations, budgets

Reliability and cost are joint constraints, and the exam expects you to design with both. Start with quotas and limits: Pub/Sub throughput, Dataflow worker scaling behavior, BigQuery concurrent jobs and slot capacity, and API rate limits. When a scenario includes bursty traffic, the correct design often uses managed autoscaling (Pub/Sub + Dataflow streaming) and backpressure handling rather than fixed-size clusters.

For BigQuery cost control and performance predictability, reservations (slot reservations) are a common answer when workloads are steady and mission critical. On-demand is simpler for variable or exploratory workloads. Partitioning and clustering are cost levers too: partition by event date to minimize scanned bytes; cluster by high-cardinality filter columns used in queries.
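The partitioning and clustering levers above map directly to table DDL. A sketch that renders the DDL as a string (table and column names are hypothetical):

```python
def build_partitioned_table_ddl(table: str, schema: dict[str, str],
                                partition_col: str,
                                cluster_cols: list[str]) -> str:
    """Render BigQuery DDL that partitions by an event date column (to
    prune scanned bytes) and clusters by frequently filtered columns."""
    cols = ", ".join(f"{name} {typ}" for name, typ in schema.items())
    return (f"CREATE TABLE `{table}` ({cols}) "
            f"PARTITION BY {partition_col} "
            f"CLUSTER BY {', '.join(cluster_cols)}")

ddl = build_partitioned_table_ddl(
    "proj.analytics.events",
    {"event_date": "DATE", "user_id": "STRING", "payload": "JSON"},
    partition_col="event_date",
    cluster_cols=["user_id"])
print(ddl)
```

Queries that filter on `event_date` then scan only matching partitions, which is the cost story the exam expects you to tell.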

Budgets and alerts (Cloud Billing budgets) address financial governance. The exam may include a requirement like “notify when spend exceeds threshold” or “prevent runaway costs.” The right response is budgets with alerts plus technical controls: query cost governance (authorized views, limiting who can run ad hoc queries), materialized views where appropriate, and lifecycle policies on GCS to move cold data to cheaper storage classes.
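A GCS lifecycle policy like the one mentioned above is just a small configuration document. A sketch of the rule shape (the day thresholds are hypothetical; confirm the exact JSON schema against the Cloud Storage lifecycle documentation before use):

```python
def lifecycle_rules(archive_after_days: int, delete_after_days: int) -> dict:
    """Build a Cloud Storage lifecycle configuration that moves cold
    objects to a cheaper storage class, then deletes them later."""
    return {
        "rule": [
            {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
             "condition": {"age": archive_after_days}},
            {"action": {"type": "Delete"},
             "condition": {"age": delete_after_days}},
        ]
    }

config = lifecycle_rules(archive_after_days=90, delete_after_days=365)
```

Paired with billing budgets and alerts, this is the "technical control" half of cost governance: spend alerts tell you something is wrong, lifecycle rules quietly prevent one class of waste.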

Exam Tip: When “predictable performance” appears, think reservations and workload separation (e.g., separate projects/datasets for dev vs prod) rather than just “add more resources.”

Common traps: (1) relying on manual scaling for streaming pipelines, (2) ignoring BigQuery slot contention between teams (a reliability issue presented as a performance issue), and (3) designing without backfill strategy—reliability includes the ability to reprocess data (store raw data in GCS, keep Pub/Sub retention/replay where applicable, or maintain immutable append-only logs).

Section 2.6: Exam-style practice: architecture trade-offs and anti-patterns

Domain 1 scenarios typically present multiple “mostly correct” architectures; the exam tests trade-offs. Your selection should match the highest-priority constraints first (compliance, correctness, latency), then optimize for cost and operational simplicity. A reliable approach to eliminate distractors is to flag any option that (a) introduces unnecessary self-managed infrastructure, (b) violates data residency or security requirements, or (c) cannot meet freshness/throughput targets without heroic tuning.

Anti-patterns the exam frequently punishes include using Dataproc for a small ETL that could be a managed Dataflow job (ops burden), dumping all data into a single unpartitioned BigQuery table (cost/performance), and building direct point-to-point integrations instead of Pub/Sub (tight coupling and poor scalability). Another common anti-pattern is treating streaming as “faster batch” without handling late events, deduplication, and schema evolution.

Trade-offs to recognize: Dataflow vs Dataproc (managed vs flexible ecosystem), streaming vs micro-batch (latency vs simplicity), BigQuery native tables vs external tables (performance/governance vs storage flexibility), and centralized vs decentralized projects (governance vs autonomy). The exam also tests whether you can design for operations: monitoring, retries, and reprocessing pathways are part of architecture, not an implementation detail.

Exam Tip: If two answers both satisfy functional needs, choose the one with fewer moving parts and more managed features (autoscaling, managed checkpoints, built-in security controls). The PDE exam strongly favors operationally sustainable designs.

Finally, tie back to the lessons: translate business requirements into architecture, choose batch vs streaming patterns intentionally, and bake in governance from the start. When you read a prompt, underline the non-negotiables (SLO/RPO/compliance), then pick the reference architecture and components that satisfy them with the simplest, most managed implementation.

Chapter milestones
  • Translate business requirements into data architecture
  • Choose batch vs streaming designs and patterns
  • Design for security, compliance, and governance
  • Domain 1 practice set: scenario-based architecture questions
Chapter quiz

1. A retailer wants to build a new analytics platform on Google Cloud. Business stakeholders need daily sales dashboards with consistent definitions of metrics (e.g., net revenue) and strong SQL governance. Data sources include Cloud SQL (orders) and CSV files delivered to Cloud Storage. Latency requirements are hours, not seconds. Which architecture best meets the requirements with minimal operational overhead?

Show answer
Correct answer: Use BigQuery as the central data warehouse, load data from Cloud Storage and Cloud SQL on a scheduled basis (e.g., Dataflow/Dataproc batch or BigQuery load jobs), and manage curated datasets with IAM and authorized views
BigQuery is the managed, SQL-first warehouse designed for governed BI use cases with curated datasets, views, and fine-grained access controls, and batch ingestion matches the stated hours-level latency. Cloud Bigtable is optimized for low-latency key/value access patterns, not governed analytical SQL dashboards, so it increases complexity and limits BI semantics. Streaming Pub/Sub + Dataflow is an over-engineered pattern when stakeholders do not need seconds-level freshness; it can add cost and operational considerations (e.g., streaming semantics, late data handling) without meeting a stated requirement.

2. A logistics company needs to detect shipment temperature excursions from IoT sensors and alert operations within 10 seconds. Sensors send readings every second. They also want to store the raw events for later analysis and model training. Which design best satisfies both the low-latency alerting and long-term analytics requirements?

Show answer
Correct answer: Ingest events into Pub/Sub, process in streaming Dataflow to compute threshold breaches and send alerts, and write both raw and processed results to BigQuery (and/or Cloud Storage) for analytics
Pub/Sub + streaming Dataflow is the canonical managed pattern for seconds-level event processing and alerting, while simultaneously persisting raw/derived data for analytics in BigQuery/Cloud Storage. Nightly batch processing cannot meet the 10-second SLA. Cloud SQL is not designed for massive high-frequency IoT ingestion at scale and periodic queries introduce latency and operational bottlenecks compared to streaming processing.

3. A healthcare provider is designing a data platform for clinical analytics. The data contains PHI and must meet strict access control and auditability requirements. Analysts should only access de-identified datasets, while a small compliance team can access identified data. Which approach best aligns with security-by-design and governance expectations on Google Cloud?

Show answer
Correct answer: Store identified data in BigQuery with least-privilege IAM, create de-identified curated tables/views in separate datasets/projects, and restrict analyst access using authorized views/policy controls while enabling audit logs
The exam expects least privilege, separation of duties, and enforceable controls (not guidance). BigQuery supports dataset/project separation, authorized views, and auditing to ensure analysts cannot access identified data while enabling controlled access for a compliance group. Sharing raw files broadly in Cloud Storage increases the risk of uncontrolled access and makes fine-grained, query-time governance harder. Relying on analysts to self-enforce column restrictions is a governance anti-pattern because it is not an enforced technical control.

4. A media company processes ad impressions to produce billing reports. Their SLA is to deliver finalized daily aggregates by 8 AM, but they sometimes receive late-arriving events up to 6 hours after midnight. Which processing pattern is most appropriate?

Show answer
Correct answer: Use a batch pipeline with a defined window (e.g., daily) and a late-data handling strategy (e.g., reprocessing/backfill for the prior day) to produce corrected aggregates
A daily SLA with known late-arriving data is well-suited to batch-oriented pipelines with backfill/reprocessing to ensure correctness of billing outputs. Discarding late events violates data correctness requirements for billing and is unlikely to be acceptable. Bigtable is not a fit for large-scale analytical aggregation queries and pushing billing computation to ad hoc analysis reduces reliability and repeatability expected in production billing systems.

5. A global SaaS company wants a new data platform for product analytics. Requirements include: (1) data residency in the EU for EU customer data, (2) a governed dataset for analysts, and (3) the ability to recover from a regional outage while minimizing operational work. Which design best meets these requirements?

Show answer
Correct answer: Create separate BigQuery datasets in EU locations for EU data, use managed replication/DR patterns such as multi-region or cross-region copies where appropriate, and enforce access with IAM and governed views; keep processing in-region
EU residency implies storing and processing EU customer data in EU locations; BigQuery supports location constraints and governed analytics with IAM and views, and managed services reduce operational burden while enabling resiliency strategies. Centralizing in a US location violates data residency requirements. Self-managed Hadoop replication increases operational complexity and is a common exam distractor; the PDE exam emphasizes choosing managed, cloud-native components unless a requirement explicitly forces self-management.

Chapter 3: Ingest and Process Data (Domain 2)

Domain 2 of the Professional Data Engineer exam focuses on how you move data into Google Cloud and transform it reliably—under constraints like throughput, latency, correctness, security, and cost. The exam is less interested in memorizing product descriptions and more interested in whether you can choose the right ingestion and processing pattern, anticipate failure modes, and implement operational controls (monitoring, retries, DLQs, backfills) that keep pipelines correct over time.

This chapter ties directly to the exam’s recurring decision points: batch vs streaming, managed vs self-managed, “exactly-once” expectations vs practical delivery guarantees, and how schema/data quality issues propagate downstream. You should be able to read a scenario (e.g., CDC from a database, file drops from partners, clickstream events) and justify the ingestion tool (Storage Transfer Service, Datastream, Pub/Sub, connectors, APIs), processing engine (Dataflow, Dataproc, BigQuery), and correctness/observability plan (validation, dead-lettering, replays, idempotency).

Exam Tip: When multiple choices “can work,” the exam usually rewards the option that is (1) most managed, (2) natively integrated, and (3) meets stated SLOs with the fewest moving parts—especially around autoscaling, retries, and monitoring.

Practice note for Build ingestion strategies for files, databases, and events: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement streaming pipelines with Pub/Sub and Dataflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement batch pipelines with Dataflow and Dataproc: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domain 2 practice set: pipeline debugging and correctness questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Ingestion options: Storage Transfer, Datastream, connectors, APIs

On the PDE exam, ingestion questions often hide the “right” answer in the source type and change rate: files moving in bulk, databases requiring CDC, SaaS sources needing managed connectors, or custom applications emitting events. Your job is to select the simplest reliable ingestion mechanism that matches the data’s nature and timeliness requirements.

Storage Transfer Service is the go-to for moving objects at scale (from on-prem S3-compatible storage, AWS S3, Azure Blob, or another GCS bucket) into Cloud Storage. It supports scheduled transfers and incremental sync. This is commonly tested for partner file drops and large historical migrations. It is not a streaming service; if the scenario requires near-real-time event processing, Storage Transfer is usually a trap.

Datastream is Google’s managed change data capture (CDC) service for databases (commonly MySQL, PostgreSQL, Oracle) into Google Cloud. It captures inserts/updates/deletes with low overhead and streams them to destinations (often into Cloud Storage and then into BigQuery via Dataflow/Dataproc/BigQuery ingest patterns). Datastream is tested when the prompt says “replicate database changes continuously” or “minimize load on the OLTP system.”

Connectors (for example, Dataflow templates/connectors, BigQuery Data Transfer Service for some SaaS sources, and managed integrations) appear in scenarios where you need quick, supported ingestion with minimal code. The exam typically favors managed connectors over custom scraping if the data source is standard and the requirement is operational simplicity.

APIs (custom ingestion) are best when you control the producer or the source is unique. In that case, a common pattern is application → Pub/Sub (events) or application → GCS (files) followed by processing. API-based ingestion should include authentication (OAuth/service accounts), quotas/backoff, and idempotency if retries can create duplicates.
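The backoff-and-idempotency requirements for API-based ingestion can be sketched with two small helpers. These are generic patterns, not a specific Google client API; the key derivation scheme is an illustrative assumption:

```python
import hashlib
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 32.0) -> float:
    """Exponential backoff with full jitter: delays grow as base * 2^attempt,
    capped, with randomness so retrying clients do not synchronize."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def idempotency_key(source: str, record_id: str, payload: bytes) -> str:
    """Stable key derived from the record itself, so a retried publish of
    the same record can be deduplicated downstream."""
    h = hashlib.sha256()
    for part in (source.encode(), record_id.encode(), payload):
        h.update(part)
    return h.hexdigest()
```

The producer attaches the idempotency key as a message attribute; the consumer (or sink) deduplicates on it, which is what makes retries safe.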

Exam Tip: If the scenario mentions “database replication/CDC,” think Datastream first; if it says “bulk file movement/scheduled sync,” think Storage Transfer; if it says “events, decoupling, many consumers,” think Pub/Sub (often feeding Dataflow).

Section 3.2: Pub/Sub fundamentals: topics, subscriptions, ordering, delivery semantics

Pub/Sub is the backbone of many streaming designs on GCP and is heavily tested. Conceptually: publishers write messages to a topic; consumers receive messages via subscriptions. The exam expects you to reason about delivery semantics, scaling, ordering needs, and replay strategy.

Subscriptions come in pull and push forms. Pull is common for Dataflow and custom consumers; push is common for HTTPS endpoints and Cloud Run services when you want Pub/Sub to deliver messages. Know that push requires an endpoint that can handle retries and authenticate requests. In both types, Pub/Sub uses an ack mechanism: messages are redelivered if not acknowledged before the ack deadline.

Delivery semantics: Pub/Sub provides at-least-once delivery by default, meaning duplicates are possible. Exam questions often try to trick you into claiming Pub/Sub is "exactly once." Pub/Sub does offer an exactly-once delivery option for pull subscriptions, but the safe design assumption remains at-least-once: end-to-end exactly-once processing is achieved by downstream design, that is, idempotent writes, deduplication keys, or storage layers that can upsert/merge.
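Downstream deduplication can be as simple as a bounded seen-key cache. A minimal sketch (the cache size is a hypothetical tuning knob; production systems often use a keyed state store or a sink-side MERGE instead):

```python
from collections import OrderedDict

class Deduplicator:
    """Bounded seen-key cache: drops messages whose key was already
    processed, making an at-least-once stream effectively exactly-once
    within the cache horizon."""
    def __init__(self, max_keys: int = 100_000):
        self._seen: OrderedDict = OrderedDict()
        self._max = max_keys

    def is_duplicate(self, key: str) -> bool:
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency
            return True
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict oldest key
        return False
```

Note the trade-off the exam cares about: the cache horizon bounds how old a duplicate can be and still be caught, which is why durable dedup usually lands in the storage layer.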

Ordering: Pub/Sub supports ordering keys for ordered delivery within a key. This is essential when per-entity ordering matters (e.g., events per user/account). The trade-off is potential throughput constraints per ordering key; the exam may test whether you can preserve ordering without forcing global ordering (a common scalability trap).
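One way to avoid the per-key throughput trap is to hash entities into a bounded set of ordering keys: per-entity order is preserved (same entity always maps to the same key) while load spreads across many keys instead of one global sequence. A sketch of the key derivation only; the shard count is a hypothetical tuning choice:

```python
import hashlib

def ordering_key_for(entity_id: str, shards: int = 64) -> str:
    """Map an entity to one of `shards` ordering keys. Events for the
    same entity share a key (ordered); different entities spread out."""
    bucket = int(hashlib.md5(entity_id.encode()).hexdigest(), 16) % shards
    return f"shard-{bucket}"
```

Two events for `user-42` always publish under the same ordering key, so they arrive in order; unrelated users rarely contend for the same key.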

Retention and replay: Pub/Sub can retain acknowledged messages for a configured period (message retention), enabling replay in some cases. However, relying on Pub/Sub as a long-term event log is usually not the recommended design; durable replay commonly uses Cloud Storage or BigQuery as an immutable landing zone plus reprocessing via Dataflow/Dataproc.

Exam Tip: When you see “multiple independent consumers” or “decouple producers and consumers,” Pub/Sub is the intended answer. When you see “must handle duplicates,” the correct solution is almost always “design idempotent processing/deduplicate downstream,” not “turn on exactly-once in Pub/Sub.”

Section 3.3: Dataflow concepts: Beam model, windowing, triggers, watermarks

Dataflow (Apache Beam on Google’s managed runner) is the exam’s centerpiece for stream and batch transformations. You are expected to understand the Beam programming model at a conceptual level and how it affects correctness for streaming analytics.

Beam model: Pipelines are built from transforms applied to PCollections. The exam often checks whether you can separate source (Pub/Sub, GCS, BigQuery) from transforms (ParDo, GroupByKey, Combine, enrichment joins) and sinks (BigQuery, GCS, Pub/Sub). For production, know that Dataflow is managed: autoscaling workers, built-in monitoring, and advanced features like templates and flex templates for repeatable deployments.

Event time vs processing time: Streaming correctness depends on event time. Late data is normal (mobile devices offline, network delays). Beam uses watermarks to estimate event-time progress. A common exam trap is choosing processing-time windows for metrics that must align to when events actually occurred; event-time windows with allowed lateness are usually the correct approach.

Windowing: Fixed windows (e.g., 1 minute), sliding windows (e.g., 10-minute window every minute), and session windows (based on activity gaps) are frequently tested. You must align window type with business intent: “per-minute counts” implies fixed; “rolling average” implies sliding; “user sessions” implies session windows.
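Window assignment itself is simple arithmetic, which is worth internalizing for the exam. A plain-Python sketch of the fixed and sliding cases (Beam does this internally; this is an illustration, not Beam code):

```python
def fixed_window(ts: int, size_s: int) -> tuple:
    """Assign a timestamp to its fixed event-time window [start, end)."""
    start = ts - (ts % size_s)
    return (start, start + size_s)

def sliding_windows(ts: int, size_s: int, period_s: int) -> list:
    """All sliding windows of length size_s (one starting every period_s)
    that contain ts. A point belongs to size_s / period_s windows."""
    wins = []
    start = ts - (ts % period_s)
    while start > ts - size_s:
        wins.append((start, start + size_s))
        start -= period_s
    return wins

print(fixed_window(125, 60))            # (120, 180)
print(len(sliding_windows(125, 600, 60)))  # 10
```

Session windows are the odd one out: they are data-driven (activity gaps), so they cannot be computed from a timestamp alone, which is exactly why "user sessions" in a prompt signals session windowing rather than fixed or sliding.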

Triggers: Triggers determine when to emit results for a window (e.g., after watermark, early firings for low latency, late firings for late data). The exam cares about the trade-off: early triggers reduce latency but may create multiple updates per window; your sink must handle updates (e.g., BigQuery MERGE/upserts or storing intermediate aggregates).

Exam Tip: If the prompt says “update dashboards in near real time but still be correct with late events,” the best answer usually includes event-time windowing + allowed lateness + appropriate triggers + idempotent sink/upserts.

Section 3.4: Batch processing patterns: Dataproc (Spark) vs Dataflow vs BigQuery

Batch pipeline scenarios appear as backfills, nightly transformations, ETL from Cloud Storage to BigQuery, or heavy joins/feature generation. The exam expects you to choose the right engine based on operational overhead, code portability, and performance/cost constraints.

BigQuery is often the best batch “processor” when the data is already in BigQuery (or can be loaded there) and transformations are SQL-friendly: joins, aggregations, partitioned incremental loads, and ELT patterns. It is fully managed and scales well. A common best-practice pattern is landing raw data in Cloud Storage/BigQuery, then using scheduled queries or orchestration to produce curated tables. The trap is using BigQuery for workflows that require complex custom libraries or non-SQL processing; then Dataflow/Dataproc may be better.

Dataflow (batch) is appropriate when you need the same Beam pipeline to run in batch and streaming, or when transformations involve complex parsing, enrichment from external services, or advanced IO patterns. Dataflow’s managed scaling and template deployment reduce operational burden. The exam often rewards Dataflow when the prompt emphasizes “minimal cluster management” and “unified streaming+batch.”

Dataproc (Spark/Hadoop) is a managed cluster service best when you need the Spark ecosystem (MLlib, GraphX, custom JVM/Python dependencies), existing Spark jobs, or specialized Hadoop tooling. Dataproc introduces cluster lifecycle and tuning considerations: sizing, autoscaling policies, initialization actions, and job orchestration. The exam commonly tests that Dataproc is not the default for new pipelines if Dataflow/BigQuery can meet requirements with less ops overhead.

Exam Tip: If the scenario says “existing Spark code” or “needs Spark libraries,” choose Dataproc; if it says “serverless, autoscaling, minimal ops,” choose Dataflow or BigQuery. When data is already in BigQuery and the transformation is SQL, BigQuery is usually the simplest and most cost-effective.

Section 3.5: Data quality and schema evolution: validation, dead-lettering, replays

Correctness and recoverability are core Domain 2 themes. The exam frequently frames failures as “bad records,” “schema changed,” “downstream sink outage,” or “duplicate messages.” You should respond with patterns that preserve data, isolate failures, and allow reprocessing.

Validation: Perform lightweight validation early (required fields, type checks, range checks) and separate invalid from valid flows. In Dataflow, this is often implemented as side outputs (tagged outputs) so you can keep the pipeline running while routing problematic data elsewhere for inspection.

Dead-lettering: A dead-letter queue (DLQ) is a durable place for failed records: commonly a Pub/Sub topic for streaming errors or a Cloud Storage bucket/BigQuery table for batch rejects. The key exam point: do not drop data silently. DLQs enable later triage and reprocessing after fixes.
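The validate-then-route pattern behind side outputs and DLQs can be sketched in plain Python. The required-field set and error format are hypothetical; in Dataflow this logic would live in a ParDo with tagged outputs:

```python
import json

REQUIRED = {"event_id", "event_ts", "type"}

def route(record_bytes: bytes) -> tuple:
    """Validate early and route: good records go to the 'main' output,
    bad ones to the 'dlq' output with the failure reason attached, so
    nothing is dropped silently."""
    try:
        rec = json.loads(record_bytes)
    except ValueError as exc:
        return ("dlq", {"raw": record_bytes.decode("utf-8", "replace"),
                        "error": f"bad json: {exc}"})
    if not isinstance(rec, dict):
        return ("dlq", {"raw": rec, "error": "not an object"})
    missing = REQUIRED - rec.keys()
    if missing:
        return ("dlq", {"raw": rec, "error": f"missing: {sorted(missing)}"})
    return ("main", rec)
```

The DLQ payload keeps the raw record plus the reason, which is what makes later triage and replay practical.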

Schema evolution: Streaming pipelines break when producers add fields, change types, or send unexpected JSON. Strategies include using schema registries/contract testing, encoding formats that support evolution (Avro/Protobuf with compatible changes), and designing sinks to tolerate additive changes (e.g., BigQuery nullable new columns). The trap is assuming you can freely change field types without impact; type changes often require a new field + backfill.
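The additive-change-versus-type-change rule can be illustrated with a toy compatibility check, with schemas simplified to `{field: type}` dicts:

```python
def is_compatible_change(old_schema: dict, new_schema: dict) -> bool:
    """Toy model of the evolution rule described above: added fields
    are tolerated (treated as nullable/additive), but changing an
    existing field's type is incompatible."""
    for field, old_type in old_schema.items():
        if field in new_schema and new_schema[field] != old_type:
            return False  # type change: needs a new field + backfill
    return True           # additive changes are tolerated here
```

Real schema registries (Avro/Protobuf) apply richer rules, but the exam-relevant intuition is the same: additive is safe, type mutation is not.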

Replays and backfills: For streaming, replay can come from Pub/Sub retention (limited) or—more robustly—an immutable raw landing zone in Cloud Storage/BigQuery. For batch, keep raw inputs and versioned code so you can rerun deterministically. Reprocessing also requires idempotent writes: BigQuery MERGE on a stable key, or writing to partitioned tables with overwrite semantics for the target partition.
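Keyed upserts are what make replays safe. A miniature model of MERGE-on-a-stable-key semantics shows why reprocessing the same batch is harmless:

```python
def merge_events(table: dict, events: list[dict]) -> dict:
    """Idempotent writes in miniature: upsert on a stable key
    (event_id), so replaying the same events changes nothing."""
    for e in events:
        table[e["event_id"]] = e  # last write wins per key
    return table
```

A BigQuery `MERGE` keyed on the same stable identifier gives the equivalent guarantee at the warehouse layer.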

Exam Tip: When asked how to “ensure no data loss,” look for: durable raw storage, DLQs, and replay/backfill procedures. When asked how to “ensure correctness with duplicates,” look for idempotency and deduplication keys rather than assuming the messaging system prevents duplicates.

Section 3.6: Exam-style practice: pipeline choices, latency, and failure handling

This domain’s questions often read like production incidents or design reviews: “latency spiked,” “pipeline is stuck,” “cost is too high,” “data is missing,” or “metrics don’t match.” Your scoring advantage comes from systematically mapping symptoms to the layer: ingestion (Pub/Sub/subscription/backlog), processing (windowing/watermarks/worker scaling), or sink (BigQuery quotas, hot partitions, retries).

Pipeline choice reasoning: If near-real-time processing and multiple consumers are required, the intended pattern is Pub/Sub → Dataflow streaming → BigQuery/GCS. If the task is a nightly heavy join across curated tables, BigQuery SQL is usually preferred. If you must reuse Spark code or rely on Spark-specific libraries, Dataproc is appropriate. For CDC replication needs, Datastream is the key ingestion building block, often followed by Dataflow to transform and load.

Latency diagnosis: In streaming, high end-to-end latency can come from Pub/Sub backlog (insufficient subscriber throughput), Dataflow underprovisioning (worker limits, autoscaling disabled, skewed keys causing hot spots), or windowing choices (waiting for watermark). The exam commonly tests the distinction between “data is delayed because windows wait for completeness” vs “data is delayed because the system is overloaded.”

Failure handling: Expect scenarios where some records fail parsing or a sink rejects writes. The correct design keeps the pipeline processing: route invalid records to a DLQ, retry transient sink failures with backoff, and use idempotent outputs so retries don’t corrupt results. For BigQuery sinks, streaming inserts are subject to quota/throughput limits; batch loads or file loads may be more cost-effective for high-volume, append-only batch patterns.
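The retry-then-dead-letter pattern can be sketched in a few lines. Delays are computed but not slept here, to keep the example self-contained; in production you would sleep (with jitter) between attempts:

```python
def write_with_retries(write_fn, record, max_attempts=4, base_delay=0.1):
    """Retry transient sink failures with exponential backoff, then
    fall through to dead-lettering. Returns ("ok", None) on success
    or ("dlq", last_error) after exhausting attempts."""
    delay = base_delay
    last_error = None
    for _ in range(max_attempts):
        try:
            write_fn(record)
            return "ok", None
        except Exception as err:  # in practice: catch transient errors only
            last_error = str(err)
            delay *= 2            # back off before the next attempt
    return "dlq", last_error
```

The important property is that a permanently failing record ends up in the DLQ with its error, rather than blocking the pipeline or vanishing.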

Common traps: (1) Claiming “exactly once” without discussing idempotency/dedup; (2) ignoring late data and using processing-time windows for event-time metrics; (3) choosing Dataproc for new pipelines when Dataflow/BigQuery fits with lower ops; (4) dropping bad records instead of dead-lettering; (5) forgetting replay/backfill planning.

Exam Tip: When you must pick between two plausible architectures, choose the one that explicitly addresses: duplicates, late data, backpressure, and reprocessing. Those are the correctness and operability signals the PDE exam looks for.

Chapter milestones
  • Build ingestion strategies for files, databases, and events
  • Implement streaming pipelines with Pub/Sub and Dataflow
  • Implement batch pipelines with Dataflow and Dataproc
  • Domain 2 practice set: pipeline debugging and correctness questions
Chapter quiz

1. A media company ingests clickstream events (~200k events/sec) from multiple regions. They need end-to-end latency under 5 seconds into BigQuery, automatic scaling, and the ability to handle occasional malformed messages without breaking the pipeline. Which approach best meets the requirements with the fewest operational moving parts?

Correct answer: Publish events to Pub/Sub and use a Dataflow streaming pipeline with windowing and a dead-letter output for bad records, writing to BigQuery
Pub/Sub + Dataflow streaming is the most managed, natively integrated pattern for high-throughput, low-latency ingestion, with built-in autoscaling, retries, and structured error handling (e.g., dead-letter outputs). Batching files to Cloud Storage and processing them with Dataproc cannot meet the <5s latency SLO due to hour-level batching and higher ops overhead. Self-managed Kafka + custom consumers increases operational burden (cluster management, scaling, reliability) and typically provides fewer built-in pipeline correctness controls than Dataflow.

2. A retailer needs near-real-time change data capture (CDC) from a Cloud SQL (MySQL) database into BigQuery for analytics. They want minimal maintenance and a solution that can continue through transient failures while preserving change order. What should they implement?

Correct answer: Use Datastream to capture CDC from Cloud SQL and stream changes into BigQuery (optionally through Dataflow for transformations)
Datastream is the managed Google Cloud CDC service designed for capturing database changes with low operational overhead and reliable delivery semantics, commonly used to land changes in BigQuery (directly or via Dataflow). Periodic full exports are inefficient, increase cost, and risk missing low-latency needs and introducing correctness issues (duplicates/partial reloads). Debezium on GKE can work, but it is significantly more self-managed (cluster ops, connector lifecycle, scaling, upgrades) and is not the exam-preferred choice when a managed native service meets requirements.

3. A logistics company receives nightly partner file drops (CSV) into a Cloud Storage bucket. Files can arrive late and sometimes contain rows that fail schema/quality checks. They need a cost-effective pipeline that can backfill and reprocess specific dates without reprocessing the entire history. What is the best approach?

Correct answer: Trigger a Dataflow batch pipeline (e.g., via Cloud Scheduler/Workflows) that reads only the target date partition from Cloud Storage, validates records, writes valid data to partitioned BigQuery tables, and writes invalid rows to a quarantine bucket/table
A Dataflow batch pipeline is well-suited for file-based ingestion with predictable schedules and supports reprocessing/backfills by parameterizing the input path/date range and writing to partitioned BigQuery for targeted reruns. A streaming pipeline polling Cloud Storage adds unnecessary cost/complexity and is a poorer fit for late-arriving nightly files; also writing to an unpartitioned table makes backfills and cost control harder. A long-running Dataproc cluster increases operational overhead and cost compared to managed batch Dataflow, and 'manual' handling of bad rows does not meet robust correctness/operability expectations.

4. A Dataflow streaming pipeline reads from Pub/Sub and writes to BigQuery. Operations reports occasional duplicate rows in BigQuery after worker restarts. The pipeline currently uses at-least-once processing and does not include any deduplication logic. What change best addresses duplicates while maintaining streaming performance?

Correct answer: Implement idempotent writes by adding a stable event_id and using BigQuery Storage Write API with a primary key-style dedup strategy (or de-dup in Dataflow using event_id + window), ensuring retries don’t create duplicates
In streaming systems, retries/restarts commonly produce duplicates under at-least-once delivery. The exam expects a correctness fix: design idempotency (stable event_id) and apply deduplication/write semantics appropriate for BigQuery (e.g., Storage Write API with dedup capabilities, or Dataflow-side dedup keyed by event_id within an appropriate window). Tuning retention/autoscaling may reduce frequency but does not solve the underlying correctness issue. Switching to daily batch changes the product requirement (near-real-time) and trades correctness symptoms for unacceptable latency.

5. A team runs a Dataflow pipeline that transforms events from Pub/Sub. After a new producer release, the pipeline starts failing with deserialization errors due to an added field and occasional type changes. They must keep the pipeline running, prevent data loss, and make failures observable for remediation. What is the best design update?

Correct answer: Add schema validation/versioning in the pipeline, route invalid/unparseable messages to a dead-letter topic/table with error metadata, and monitor error rates with alerts while allowing valid records to continue
A robust ingestion/processing pattern uses explicit validation and a dead-letter path (DLQ/quarantine) so the pipeline continues processing good data while preserving bad records for later analysis and replay. Silent drops violate correctness and data-loss requirements and reduce observability. Pausing the pipeline and manipulating ack deadlines is operationally fragile and risks message redelivery/backlog growth; it also does not provide a sustainable approach to schema evolution and ongoing monitoring.

Chapter 4: Store the Data (Domain 3)

Domain 3 of the Google Professional Data Engineer exam tests whether you can choose storage services and data models that meet reliability, scalability, security, and cost constraints—then prove it through practical decisions such as partitioning, lifecycle management, and access control. In scenario-based questions, you are rarely asked “what is BigQuery?”; instead, you’re asked to diagnose a bottleneck (slow queries, exploding costs, governance gaps) and pick the combination of storage and modeling patterns that resolves it with minimal operational burden.

This chapter maps directly to common exam objectives: selecting storage for analytics vs. lakes vs. serving, modeling for BigQuery performance and governance, and applying lifecycle controls (partition expiration, object lifecycle rules, and cost-aware layouts). Expect distractors that sound plausible but ignore one key constraint—like choosing Cloud SQL for terabyte-scale analytics, or using Bigtable when you actually need relational constraints and cross-row transactions.

As you read, practice translating each scenario into (1) access pattern (OLAP vs. OLTP vs. key-value), (2) latency and concurrency needs, (3) consistency/transaction requirements, (4) data volume and growth, (5) governance/security controls, and (6) cost model. Those six signals usually reveal the correct storage layout.

Practice note for Select storage services for analytics, lakes, and serving: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Model data for BigQuery performance and governance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Manage lifecycle, partitioning, and cost controls: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domain 3 practice set: storage and modeling scenario questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Storage decision matrix: GCS, BigQuery, Bigtable, Spanner, Cloud SQL

On the exam, storage selection is an “identify the workload” problem. Start by classifying the primary access pattern: analytics (scan/aggregate), lake (cheap object storage + schema-on-read), or serving (low-latency lookups/transactions). Then map it to the service.

BigQuery is the default for analytics: columnar storage, elastic execution, and strong SQL/BI integration. It wins when queries scan large datasets, aggregate, join, and support many analysts without capacity planning. It is not ideal as a low-latency key-value store for per-user millisecond reads.

Cloud Storage (GCS) is your data lake substrate: durable, inexpensive, and flexible for raw/bronze data, archival, and interchange formats (Parquet/Avro). Use it when you need cheap storage, file-based ingestion, or to decouple storage from compute (Dataproc/Spark, Dataflow, BigQuery external tables). It is not a database: it lacks indexing, concurrency controls, and query-native governance on its own.

Bigtable is for high-throughput, low-latency, wide-column lookups on a row key: time-series, IoT telemetry, user activity streams, and serving features with predictable key access. Bigtable is a common trap: it does not support SQL joins and is not meant for ad-hoc analytics. If the scenario emphasizes “scan a full table and aggregate,” Bigtable is usually wrong.

Spanner fits global, horizontally scalable relational OLTP with strong consistency and SQL, including transactions across rows/tables. If the prompt mentions global availability, high write rates, and relational constraints, Spanner is a contender. It’s overkill for simple departmental apps or analytics-only needs.

Cloud SQL is managed MySQL/PostgreSQL for traditional OLTP with moderate scale. Use it for application backends with relational needs that don’t require Spanner’s global scale. A classic exam distractor is recommending Cloud SQL for multi-terabyte analytic reporting; BigQuery is the correct direction.

Exam Tip: When a scenario says “data scientists run ad-hoc queries over billions of rows,” default to BigQuery. When it says “serve per-user data with predictable primary-key reads under 10 ms,” consider Bigtable (or Spanner/Cloud SQL if relational/transactional requirements are explicit).
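The decision matrix above can be condensed into an illustrative chooser. The workload keys below are hypothetical labels for the six signals, not an official taxonomy:

```python
def choose_storage(workload: dict) -> str:
    """Illustrative exam heuristic mapping the primary access pattern
    (plus a global-scale flag) to a storage service."""
    pattern = workload.get("pattern")
    if pattern == "analytics":
        return "BigQuery"       # scan/aggregate, ad-hoc SQL, BI
    if pattern == "key_value":
        return "Bigtable"       # ms-latency lookups on a row key
    if pattern == "relational_oltp":
        # relational constraints + transactions: scale decides the tier
        return "Spanner" if workload.get("global") else "Cloud SQL"
    return "Cloud Storage"      # lake substrate: raw files, interchange
```

Real scenarios add the other signals (consistency, volume, governance, cost), but this first cut eliminates most distractors.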

Section 4.2: BigQuery table design: partitioning, clustering, and denormalization

BigQuery performance and cost are dominated by bytes scanned. The exam expects you to know how partitioning and clustering reduce scanned data, and when denormalization beats heavy joins.

Partitioning splits a table into segments, typically by ingestion time or a DATE/TIMESTAMP column. Queries that filter on the partition column scan fewer partitions, lowering cost and improving latency. Use partitioning for time-based event data, logs, and facts that are queried by date ranges. A frequent trap is partitioning on a high-cardinality field (e.g., user_id), which can create too many partitions or provide little pruning benefit. Another trap: partitioning exists, but queries don’t filter on the partition column—so costs remain high.
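Partition pruning is easy to quantify with a toy model of bytes scanned, which is the quantity BigQuery on-demand pricing charges for:

```python
def bytes_scanned(partitions: dict, date_filter=None) -> int:
    """Toy pruning model: `partitions` maps partition date -> bytes.
    With no filter on the partition column every partition is scanned;
    with one, only the matching partitions are read."""
    if date_filter is None:
        return sum(partitions.values())  # full scan: the "cost spiked" case
    return sum(b for d, b in partitions.items() if d in date_filter)
```

This is why `require_partition_filter` is a common fix: it turns the expensive unfiltered case into a query error instead of a surprise bill.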

Clustering organizes data within partitions (or within the table) by up to four columns, improving pruning for equality/range filters and joins on those columns (e.g., customer_id, product_id). Clustering helps when queries often filter on non-partition columns. It is not a substitute for partitioning on time when you routinely query date ranges.

Denormalization (including nested/repeated fields) is often preferred in BigQuery to reduce joins and improve query speed. Star schema patterns are common, but BigQuery can handle joins well—so the exam is about tradeoffs: denormalize to reduce repeated joins for high-volume queries, but avoid uncontrolled duplication that inflates storage and complicates updates. For slowly changing dimensions, consider maintaining dimension tables and joining, or using materialized views/derived tables when repeated joins dominate.

Exam Tip: If the question includes “cost spiked” and “queries scan entire table,” the likely fix is partitioning + enforcing partition filters (e.g., require_partition_filter), then clustering on common predicates. If the issue is “too many joins and slow dashboards,” consider denormalized tables, BI Engine (in other domains), or summary tables—without sacrificing governance.

Section 4.3: Data formats and compression: Avro, Parquet, ORC, JSON, CSV

File format choices show up in lake and ingestion scenarios, especially when moving between GCS, Dataflow/Dataproc, and BigQuery. The exam tests whether you understand schema evolution, columnar benefits, and cost/performance impacts.

Avro is row-based with embedded schema and good support for schema evolution. It is a strong choice for streaming/batch pipelines that append records and need robust schema handling (often paired with Pub/Sub or Dataflow sinks to GCS). It’s also commonly used as an interchange format for BigQuery loads.

Parquet and ORC are columnar formats optimized for analytic scans, predicate pushdown, and efficient compression. For data lakes queried by engines like Spark/Trino/BigQuery external tables, Parquet is a frequent best answer. Columnar formats reduce I/O when only a subset of columns is needed—exactly the pattern in analytics.
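A toy comparison of row versus columnar reads makes the I/O difference concrete. String lengths stand in for bytes, and the layouts are simplified models, not real Parquet internals:

```python
def row_store_bytes(rows, needed_cols):
    """Row-oriented read: every row is read whole, even when only
    a subset of columns is needed (needed_cols is ignored)."""
    return sum(len(str(v)) for r in rows for v in r.values())

def column_store_bytes(rows, needed_cols):
    """Column-oriented read: only the requested columns are touched."""
    return sum(len(str(r[c])) for r in rows for c in needed_cols if c in r)
```

When a query needs one column out of many wide ones, the columnar read shrinks proportionally; add predicate pushdown and per-column compression and the gap widens further.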

JSON is flexible but verbose and expensive to store/parse at scale. It can be appropriate for semi-structured ingestion or when producers are heterogeneous, but it is rarely the best long-term storage format for cost and performance. A common trap is choosing JSON “because it’s easy,” ignoring downstream query costs.

CSV is human-readable and broadly compatible, but lacks embedded types/schema, is prone to parsing issues, and compresses less effectively for analytics workloads. On the exam, CSV is often a distractor for large-scale analytic storage.

Exam Tip: If a scenario highlights “optimize scan performance” or “reduce cost of querying files in the lake,” pick Parquet/ORC. If it emphasizes “schema evolution and compatibility across producers,” Avro is frequently safer. Reserve JSON/CSV for interchange, small volumes, or unavoidable upstream constraints.

Section 4.4: Metadata, catalogs, and governance: Dataplex basics and tagging

Governance is not just policy documents; it is operational metadata that makes data discoverable, classifiable, and controllable. The PDE exam increasingly expects you to recognize Dataplex as Google Cloud’s data fabric layer for organizing lakes and warehouses with consistent metadata and governance.

At a high level, Dataplex helps you group data across GCS and BigQuery into logical domains (lakes/zones/assets), apply consistent organization, and surface metadata for discovery. This becomes critical when the scenario includes “multiple teams,” “shared datasets,” “need to find trusted sources,” or “regulatory classification.”

Tagging (business and technical metadata) is a common exam focus: you classify data elements (e.g., PII, PCI, HIPAA-like sensitivity), identify data owners/stewards, and distinguish certified vs. raw datasets. Tag-based organization supports downstream controls and audit readiness. The exam may frame this as “improve discoverability and governance without rewriting pipelines.” Metadata and tagging is often the least disruptive fix compared to building new storage systems.

Common trap: Confusing governance with only IAM. IAM controls access, but governance also includes cataloging, lineage/ownership, and classification. If the scenario mentions “users don’t know which dataset to trust,” IAM changes won’t solve it; a catalog and tags will.

Exam Tip: When you see keywords like “data mesh,” “domains,” “data products,” “classification,” “discoverability,” or “standardize governance across BigQuery and GCS,” Dataplex-oriented solutions are likely being tested.

Section 4.5: Access controls: dataset/table policies, row/column security, views

Security scenarios in Domain 3 often require fine-grained access inside BigQuery, not just project-level IAM. You should be able to choose between dataset/table permissions, authorized views, and row/column-level controls based on the requirement.

Dataset and table IAM is the first layer: grant roles (e.g., BigQuery Data Viewer) at the dataset level for broad access. This is appropriate when all users can see all rows/columns in the dataset. It is a trap when the prompt requires restricting subsets of data to different groups.

Row-level security (row access policies) restricts which rows a user can see based on conditions (often tied to user/group membership). Use this for multi-tenant datasets, region-based restrictions, or “analysts can see only their business unit.”

Column-level security (policy tags) restricts access to sensitive columns such as SSN, email, or payment identifiers. This is frequently paired with governance classification: columns tagged as PII can be visible only to specific roles. If the prompt says “mask or restrict sensitive fields while keeping the rest queryable,” column-level controls are the right tool.
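Row access policies and policy-tag column masking can be modeled together in a few lines. The predicate, role names, and masking behavior below are illustrative, not BigQuery's actual implementation:

```python
def apply_policies(rows, user, row_policy, masked_cols, allowed_roles):
    """Sketch of layered access control: first filter rows by a
    predicate (row-level security), then mask tagged columns unless
    the user holds an authorized role (column-level security)."""
    visible = [r for r in rows if row_policy(user, r)]
    if user.get("role") in allowed_roles:
        return visible  # authorized: sees sensitive columns in clear
    return [{k: ("<masked>" if k in masked_cols else v)
             for k, v in r.items()}
            for r in visible]
```

Note the layering matches the exam tip that follows: rows answer "which records", columns answer "which fields", and views wrap both behind a curated interface.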

Views and authorized views are a classic exam pattern: expose a curated interface (filtered/aggregated) while keeping base tables locked down. Views are also used to enforce consistent filtering logic, reduce accidental full-table scans (paired with partition filters), and provide stable schemas to BI tools.

Exam Tip: If the scenario needs different users to see different slices of the same table, think row-level security. If it’s “same rows, but hide sensitive fields,” think column-level security via policy tags. If it’s “share a curated dataset without exposing raw tables,” think authorized views.

Section 4.6: Exam-style practice: optimize cost/perf with the right storage layout

Domain 3 scenarios frequently combine cost pressure with performance complaints. The exam expects a layered answer: (1) put data in the correct system, (2) structure it for pruning, (3) control lifecycle and retention, and (4) enforce governance and access patterns.

For analytics tables that grow without bound, the most test-relevant cost controls are: partitioning (so queries scan less), clustering (so predicates prune within partitions), and retention controls (partition expiration for BigQuery; lifecycle rules for GCS). If a scenario says “keep 7 years in archive but only query last 90 days,” a common layout is: recent/hot data in partitioned BigQuery tables; older/cold data in GCS (compressed columnar files) or BigQuery partitions with expiration + separate archival storage, depending on compliance query needs.
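On the GCS side of that layout, the "query last 90 days, keep 7 years" scenario maps to a lifecycle configuration. The sketch below shows the gsutil-style JSON shape as a Python dict; the exact day counts are illustrative:

```python
# GCS lifecycle configuration (the structure you would set via
# `gsutil lifecycle set` / Terraform / the console). Ages are in days.
lifecycle_config = {
    "rule": [
        # after 90 days, objects leave the "hot" query window: archive them
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 90}},
        # after ~7 years (2555 days), retention ends: delete automatically
        {"action": {"type": "Delete"},
         "condition": {"age": 2555}},
    ]
}
```

The BigQuery half of the layout uses the analogous native control, partition expiration, so neither side needs a custom cleanup scheduler.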

When ingesting raw data, a frequent best practice is a multi-zone lake on GCS (raw/bronze) using Parquet/Avro, with curated/silver data in BigQuery for interactive analysis. This separation supports governance (raw is restricted; curated is shared), and it prevents analysts from directly querying messy JSON/CSV at scale.

Common traps: (a) choosing a serving store (Bigtable) to “speed up dashboards” when the real issue is BigQuery table design; (b) storing everything as JSON in GCS and expecting cheap analytics; (c) adding clustering but forgetting partition filters—so full partitions are still scanned; (d) solving cost by moving data out of BigQuery even though the requirement is frequent ad-hoc analytics.

Exam Tip: In optimization questions, look for the “biggest lever” first: reduce bytes scanned (partition + filters), then reduce repeated computation (summary tables/materialized views), then adjust storage format (Parquet/ORC in lake), and finally tighten lifecycle/retention. The correct answer is usually the minimal change that meets both performance and governance constraints.

Chapter milestones
  • Select storage services for analytics, lakes, and serving
  • Model data for BigQuery performance and governance
  • Manage lifecycle, partitioning, and cost controls
  • Domain 3 practice set: storage and modeling scenario questions
Chapter quiz

1. A media company ingests 20 TB/day of clickstream logs. Data scientists run ad-hoc SQL analytics over months of history, while raw files must also be retained for reprocessing. The team wants minimal operations and strong IAM controls. Which storage design best meets these requirements?

Correct answer: Land raw data in Cloud Storage (data lake) and load/stream curated data into BigQuery for analytics; control access with IAM and dataset/table policies
Cloud Storage is the standard low-cost data lake on GCP for raw retention and reprocessing, while BigQuery is the managed OLAP warehouse for ad-hoc SQL over large history. Cloud SQL is not designed/costed for petabyte-scale analytics and would be an operational and performance bottleneck. Bigtable is optimized for low-latency key/value access patterns, not exploratory SQL analytics with complex joins/aggregations; using it for ad-hoc analytics typically leads to poor fit and higher engineering burden.

2. A retail company has a 12 TB BigQuery table of orders queried primarily by date range and region. Query costs are increasing because analysts often scan large portions of the table. You need to reduce scan cost and improve performance without changing query semantics. What is the best approach?

Correct answer: Partition the table by order_date and cluster by region to reduce scanned data for common filters
BigQuery partitioning by date and clustering by a frequently filtered dimension (region) are canonical cost/performance controls: they enable partition pruning and better data locality to reduce bytes scanned. Bigtable is not a drop-in replacement for OLAP SQL workloads and would require application-side query logic and different access patterns. Disabling partitioning removes pruning benefits and generally increases scanned bytes and cost; denormalization alone does not address scan volume when filters are selective on date/region.

3. A security team requires that analysts can see only rows for their assigned country in a shared BigQuery dataset. Analysts should still be able to query the same tables without maintaining separate copies. What should you implement?

Correct answer: BigQuery row-level security policies (row access policies) or authorized views to restrict rows per user/group
Row-level security (row access policies) and authorized views are built-in BigQuery governance features to restrict access to subsets of data while keeping a single source of truth. Duplicating datasets per country increases storage cost and operational overhead and risks inconsistent data. Cloud Storage downloads via signed URLs bypass BigQuery’s managed governance/query controls and do not provide a scalable, auditable, SQL-based access pattern for analysts.

4. A team stores daily exports in a Cloud Storage bucket. Compliance requires keeping objects for 400 days, then deleting them automatically. The team wants the lowest operational burden and to avoid building a scheduler. What should they do?

Correct answer: Configure a Cloud Storage lifecycle rule to delete objects older than 400 days
Cloud Storage lifecycle management is the native, low-ops way to enforce time-based retention and deletion policies on objects. A custom Cloud Run job adds operational complexity, scheduling, retries, and permission management for no added benefit. BigQuery table expiration applies to BigQuery tables/partitions, not arbitrary objects in Cloud Storage; moving data just to manage retention increases cost and changes the storage layer without necessity.

5. A fintech application needs single-digit millisecond reads for a user’s recent transactions by user_id, with sustained high write throughput. Queries are simple lookups and range reads per user; cross-user joins and ad-hoc analytics are not required. Which storage service best fits?

Correct answer: Cloud Bigtable with a row key design based on user_id (and time component for recency/range reads)
Bigtable is designed for low-latency, high-throughput key/value and wide-column workloads with predictable access patterns (point lookups and range scans by row key), matching user_id-centric reads/writes. BigQuery is optimized for OLAP analytics with higher latency and scan-based pricing; it is not intended for transactional-style millisecond serving. Cloud SQL provides relational features and transactions but is typically harder to scale for very high write throughput/large scale serving compared to Bigtable, and it introduces more operational constraints for this access pattern.

Chapter 5: Prepare & Use Data for Analysis + Maintain & Automate (Domains 4-5)

Domains 4 and 5 of the Google Professional Data Engineer exam focus on whether you can turn “data in storage” into “data that drives decisions,” and then keep that system healthy over time. Expect scenario questions that combine BigQuery SQL, governance, performance, ML enablement, and operational rigor (monitoring, orchestration, CI/CD). The exam is not testing obscure syntax; it’s testing whether you can pick the right tool and pattern under constraints like cost, latency, reliability, and security.

This chapter connects two mindsets you’ll need on exam day: (1) analyst-facing outcomes (fast, consistent, governed metrics; BI integration; repeatable transformations), and (2) operator-facing outcomes (detect issues quickly, recover safely, automate changes). Many wrong answers sound plausible because they optimize one axis (speed) while violating another (cost, correctness, or maintainability). Use the “business requirement → data shape → engine choice → operations” chain to eliminate traps.

You’ll also see the exam blend “analyze and optimize with BigQuery SQL and BI patterns,” “operationalize ML pipelines with BigQuery ML and Vertex AI integration,” and “automate workflows with orchestration and CI/CD.” Your goal is to recognize where each belongs—and what breaks first if you choose incorrectly.

Practice note for Analyze and optimize with BigQuery SQL and BI patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Operationalize ML pipelines with BigQuery ML and Vertex AI integration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Automate workflows with orchestration and CI/CD: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domains 4-5 practice set: troubleshooting, monitoring, and automation questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: BigQuery analysis essentials: joins, UDFs, materialized views, BI Engine fit
Section 5.2: Performance tuning: query plans, slot reservations, caching, and limits
Section 5.3: Feature engineering and analytics readiness: sampling, labeling, leakage
Section 5.4: ML workflows: BigQuery ML vs Vertex AI pipelines and when to use each
Section 5.5: Operations: monitoring, logging, data freshness, and incident response
Section 5.6: Automation: Composer/Workflows/Scheduler, IaC, and deployment patterns

Section 5.1: BigQuery analysis essentials: joins, UDFs, materialized views, BI Engine fit

In Domain 4, BigQuery is the default analytical engine, and the exam expects you to design queries and semantic layers that scale. Start with join strategy: BigQuery performs best when joins are selective and keys are well-distributed. A common design is a star schema (fact table with dimension tables), but the exam often hides the real issue: overly wide dimension tables or unfiltered joins that explode intermediate results.

User-defined functions (UDFs) appear as a maintainability tool. SQL UDFs are great for reusable transformations (standardizing strings, parsing IDs) and keep logic centralized; JavaScript UDFs are more flexible but can be slower and harder to govern. Prefer SQL UDFs when possible and reserve JS UDFs for logic not expressible in SQL.
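To make the "centralize reusable transformations" idea concrete, here is a Python analog of the kind of logic you would put in a SQL UDF so that every query shares one definition. The customer-ID format (`CUST-123456`) is a made-up example, not anything from the exam or GCP docs:

```python
import re

# Python analog of a reusable transformation you would centralize in a
# BigQuery SQL UDF: standardize raw customer IDs so every query applies
# one shared definition instead of copy-pasted string logic.
# The ID format (CUST-123456) is a hypothetical illustration.

def standardize_customer_id(raw):
    """Uppercase, strip punctuation/whitespace noise, zero-pad the numeric part."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    match = re.fullmatch(r"(?:CUST)?(\d{1,6})", cleaned)
    if not match:
        return None  # unparseable IDs surface as NULLs, as a SQL UDF would
    return f"CUST-{int(match.group(1)):06d}"

print(standardize_customer_id("  cust-42 "))  # CUST-000042
print(standardize_customer_id("123"))         # CUST-000123
```

The same normalization written as a SQL UDF stays inside BigQuery's governance and query planner, which is why the exam prefers SQL UDFs over JavaScript when the logic is expressible in SQL.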

Materialized views are frequently the correct answer when the prompt says “speed up repeated aggregates while staying in BigQuery.” Materialized views can auto-refresh and reduce compute for consistent aggregation patterns. They are not a universal cache: they have definition constraints and won’t help when query predicates vary widely or when you need complex non-deterministic logic.

BI Engine questions test whether you can identify “interactive dashboard latency” needs. BI Engine accelerates Looker/Looker Studio-style queries by caching data in memory, but it shines when dashboards query relatively stable datasets and repeated dimensions/metrics. If the dataset changes constantly or the query pattern is ad hoc, BI Engine may not be the best lever.

  • Exam Tip: If the scenario emphasizes “trusted metrics,” “consistent KPIs,” and “shared definitions,” think in terms of views/materialized views and a curated dataset rather than letting every dashboard hit raw tables.
  • Common trap: Choosing materialized views to “speed up everything.” If the workload is highly exploratory with many unique predicates, partitioning/clustering or pre-aggregated tables may be more appropriate.

On the exam, correct answers usually align with: minimize scanned bytes, reuse logic safely (UDFs/views), and keep BI latency low without duplicating too much data.

Section 5.2: Performance tuning: query plans, slot reservations, caching, and limits

Performance tuning questions typically blend cost and speed. BigQuery pricing is primarily about bytes processed (on-demand) or slots (reservations). The exam expects you to interpret a slow/expensive workload and choose the least disruptive fix. Start with query plan thinking: filter early, select only needed columns, and avoid cross joins or unbounded joins. Partitioning and clustering are core “reduce bytes scanned” levers—partition by time for time-series facts, cluster on high-cardinality filter/join keys used frequently.
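A back-of-the-envelope model makes the "reduce bytes scanned" lever tangible. This sketch uses hypothetical table sizes to show why partition pruning plus column projection, not more compute, is usually the first fix for on-demand cost:

```python
# Rough model of BigQuery on-demand cost: you pay per byte scanned, so
# partition pruning and column projection are the primary levers.
# All sizes below are illustrative, not measurements.

def bytes_scanned(row_count, bytes_per_row,
                  partitions_total, partitions_read,
                  columns_total, columns_read):
    """Estimate bytes scanned for a columnar, partitioned table."""
    partition_fraction = partitions_read / partitions_total
    column_fraction = columns_read / columns_total
    return row_count * bytes_per_row * partition_fraction * column_fraction

# A 2-year daily-partitioned fact table: 10B rows, ~100 bytes/row, 50 columns.
full_scan = bytes_scanned(10_000_000_000, 100, 730, 730, 50, 50)
pruned = bytes_scanned(10_000_000_000, 100, 730, 7, 50, 5)  # last 7 days, 5 columns

print(f"full scan: {full_scan / 1e12:.1f} TB, pruned: {pruned / 1e9:.2f} GB")
```

The pruned query touches roughly three orders of magnitude less data, which is why "partition by time, cluster on common filters, select only needed columns" beats "buy more slots" as a first answer on cost questions.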

BigQuery caching is a frequent distractor. The query results cache can make repeat queries appear fast “for free,” but it’s not a reliability strategy. If the data changes or the query text changes, cache benefits disappear. Use caching as an incidental benefit, not as your primary performance plan.

Slot reservations show up when the prompt mentions “multiple teams,” “predictable performance,” or “avoid noisy neighbors.” Reservations can provide stable throughput and cost predictability (capacity-based pricing, formerly called flat-rate), and you can use assignments to route workloads (prod vs dev). However, reservations don’t fix a fundamentally inefficient query; they just throw compute at it.

Limits appear in scenario questions: too many concurrent queries, memory errors from large shuffles, or timeouts. When you see these, look for hints about reducing intermediate data (pre-aggregate, filter sooner, use approximate functions like APPROX_COUNT_DISTINCT where acceptable) or breaking work into staged tables.

  • Exam Tip: If the requirement is “reduce cost,” the best answer usually reduces bytes processed (partitioning/clustering, pruning columns, staged aggregates) rather than adding slots.
  • Common trap: Picking “increase slot capacity” when the root cause is scanning unpartitioned historical data for every daily report.

To identify the correct choice, match the bottleneck to the lever: bytes scanned → partition/cluster/projection; concurrency/SLA → reservations; repeated dashboard aggregates → materialized views/BI Engine; sporadic speedups → cache (but not guaranteed).

Section 5.3: Feature engineering and analytics readiness: sampling, labeling, leakage

The PDE exam increasingly expects ML-aware data preparation, even if you aren’t building models daily. “Prepare and use data for analysis” includes making datasets analysis-ready: stable definitions, correct labels, and sampling that doesn’t bias results. Sampling questions often revolve around representativeness: random sampling can help exploration and reduce cost, but stratified sampling may be required when classes are imbalanced (fraud, churn).
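A minimal sketch of stratified sampling shows the failure mode the exam is probing: with 1% positives (fraud, churn), a small uniform sample can contain almost no positive examples, while per-class sampling preserves the ratio. The data and fraction here are invented for illustration:

```python
import random
from collections import Counter

# Why stratified sampling matters for imbalanced classes: sample each class
# independently so the rare class survives at its true proportion.
# Population and fraction are made-up illustration values.

def stratified_sample(rows, label_key, fraction, seed=7):
    """Sample `fraction` of rows from each class independently."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample

population = ([{"label": "fraud"}] * 100) + ([{"label": "ok"}] * 9900)
sample = stratified_sample(population, "label", 0.05)
print(Counter(r["label"] for r in sample))  # ~5 fraud, ~495 ok: ratio preserved
```

A 500-row uniform sample from the same population would often contain only a handful of fraud rows by chance; stratification makes the class balance deterministic.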

Labeling is a common weak point in pipelines. The exam may describe a dataset where labels are derived from future events (e.g., “churned within 30 days”) and then accidentally joined back using data not available at prediction time. That’s leakage. Leakage leads to unrealistically high training performance and poor real-world results, and the exam wants you to prevent it by enforcing time-aware joins and point-in-time feature generation.
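The time-aware join can be sketched in a few lines. This is a leakage-safe, point-in-time labeling pattern under assumed column names (`user_id`, `event_time`) and an invented 30-day churn horizon; the key is that a label may only use outcomes strictly after the event, within the horizon:

```python
from datetime import datetime, timedelta

# Leakage-safe, point-in-time label join: an event may only be labeled with
# outcomes that occur AFTER its event time, within the prediction horizon.
# Column names, users, and dates are hypothetical.

def label_churn(events, churn_dates, horizon_days=30):
    """Label each event 1 if the user churned within `horizon_days` after it."""
    labeled = []
    for ev in events:
        churned_at = churn_dates.get(ev["user_id"])
        cutoff = ev["event_time"] + timedelta(days=horizon_days)
        # time-aware condition: churn must fall inside (event_time, cutoff]
        label = int(churned_at is not None and ev["event_time"] < churned_at <= cutoff)
        labeled.append({**ev, "label": label})
    return labeled

events = [
    {"user_id": "u1", "event_time": datetime(2024, 1, 1)},
    {"user_id": "u2", "event_time": datetime(2024, 1, 1)},
]
churn_dates = {"u1": datetime(2024, 1, 20), "u2": datetime(2024, 3, 1)}
print([e["label"] for e in label_churn(events, churn_dates)])  # [1, 0]
```

The "convenience join" trap is the version of this without the cutoff: joining any future churn record back to the event, which inflates training metrics and fails in production.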

Analytics readiness also includes handling late-arriving data and defining freshness expectations. A “gold” table used for ML features or BI should be built from “silver” cleansed data with clear event-time logic. If the prompt mentions “backfills” or “late events,” the right pattern usually includes windowing logic (for streaming) or scheduled backfills (for batch) with idempotent transformations.

  • Exam Tip: When you see “predict” or “serve,” assume you must restrict features to those available at inference time; anything derived from the future is suspect.
  • Common trap: Treating a convenience join (joining outcomes back to events without a cutoff timestamp) as harmless. On the exam, this is often the hidden reason the model fails in production.

Practical exam framing: if the scenario is BI-focused, sampling is about cost and speed; if it’s ML-focused, sampling is about bias/imbalance and evaluation validity. If you can explain “why this dataset would mislead decision-makers,” you can usually eliminate half the answer choices.

Section 5.4: ML workflows: BigQuery ML vs Vertex AI pipelines and when to use each

Domain 4 also tests your ability to operationalize ML with the right boundary between SQL-native modeling and full ML platforms. BigQuery ML (BQML) is the best fit when the data is already in BigQuery, the model types supported by BQML meet requirements (e.g., linear/logistic regression, boosted trees, matrix factorization, time-series, some deep learning integrations), and the team wants fast iteration using SQL with minimal infrastructure.

Vertex AI is typically the correct choice when you need custom training code, complex feature pipelines, managed online prediction endpoints, model monitoring, hyperparameter tuning at scale, or MLOps workflows spanning multiple steps. The exam often frames this as “data scientists require custom TensorFlow/PyTorch” or “need continuous training with approvals and reproducibility.” That points to Vertex AI Pipelines and managed training jobs.

Integration patterns matter. A common, exam-friendly pattern is: curate and validate features in BigQuery, export training data to Vertex AI (or read directly), train in Vertex, and write predictions back to BigQuery for downstream analytics. Alternatively, keep simpler models in BQML and schedule retraining using orchestration tools, storing evaluation metrics and model versions in a governed dataset.

  • Exam Tip: If the prompt highlights “SQL analysts,” “keep everything in BigQuery,” or “minimal ops,” BQML is often correct. If it highlights “custom model,” “online serving,” or “full MLOps,” Vertex AI is usually correct.
  • Common trap: Choosing Vertex AI for a straightforward regression where the real requirement is quick, governed analytics and retraining—BQML would be cheaper and simpler.

The exam is not asking which tool is “better,” but which tool minimizes operational burden while meeting constraints (latency, governance, reproducibility, and team skill set).

Section 5.5: Operations: monitoring, logging, data freshness, and incident response

Domain 5 evaluates whether you can keep data products reliable. Monitoring in GCP typically involves Cloud Monitoring metrics (latency, error rates, backlog), Cloud Logging (structured logs), and alerting tied to SLOs. The exam frequently describes “dashboards show stale data” or “pipeline succeeded but data is wrong.” That’s your cue to think beyond job success: you need data quality and freshness checks.

Data freshness is an operational contract: define acceptable lag per table or dashboard, then alert when breached. For streaming, monitor Pub/Sub subscription backlog and Dataflow watermark/processing-time delay. For batch, monitor scheduled job duration, downstream table partition completeness, and row-count/aggregate sanity checks.
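The "freshness as a contract" idea can be expressed as a small check: each table declares its maximum acceptable lag, and a breach is what should page someone. Table names and SLA values here are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Data freshness as an operational contract: each table declares a maximum
# acceptable lag; a breach should fire an alert. Names/SLAs are made up.

FRESHNESS_SLA = {
    "sales_daily": timedelta(hours=26),            # nightly batch + buffer
    "clickstream_events": timedelta(minutes=15),   # streaming table
}

def check_freshness(latest_row_time, table, now=None):
    """Return (is_fresh, lag) given the table's newest event timestamp."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_row_time
    return lag <= FRESHNESS_SLA[table], lag

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh, lag = check_freshness(
    datetime(2024, 6, 1, 11, 50, tzinfo=timezone.utc), "clickstream_events", now)
print(fresh, lag)  # True 0:10:00
stale, lag = check_freshness(
    datetime(2024, 5, 30, 12, 0, tzinfo=timezone.utc), "sales_daily", now)
print(stale)       # False: a 48h lag breaches the 26h SLA
```

In GCP this check would typically run as a scheduled query or a small job that pushes a custom metric to Cloud Monitoring, so the alert fires even when the pipeline itself reports success.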

Incident response questions focus on safe recovery: roll back to last known good dataset, re-run idempotent jobs, and communicate impact. The best answers usually include: clear ownership, runbooks, and postmortems. If the incident is caused by schema changes or upstream contract breaks, the long-term fix is often schema enforcement, versioning, and automated validation at ingestion.

  • Exam Tip: “Job succeeded but data is missing/wrong” implies you must monitor outcomes (freshness, completeness, distribution drift), not just infrastructure metrics.
  • Common trap: Only adding more retries. Retries help transient failures, but they do not fix logical data issues, duplicates, or broken joins.

Look for answers that combine detection (alerts), diagnosis (logs/lineage), and remediation (backfill/replay), and that respect cost (avoid reprocessing petabytes when only a partition is affected).

Section 5.6: Automation: Composer/Workflows/Scheduler, IaC, and deployment patterns

Automation is where many exam candidates overcomplicate. The exam tests whether you can pick the simplest orchestrator that meets dependency, retry, and auditing needs. Cloud Scheduler is best for time-based triggers (kick off a daily query, call an HTTP endpoint) with minimal dependencies. Workflows is best for lightweight service orchestration (calling APIs, conditional branching, retries) without managing a full Airflow environment. Cloud Composer (managed Airflow) is best when you need complex DAGs, rich operators, cross-system dependencies, and a mature operations model for pipelines.

CI/CD and Infrastructure as Code (IaC) appear in Domain 5 as maintainability requirements. Use Terraform or similar IaC to define datasets, tables, IAM bindings, Pub/Sub topics, Dataflow jobs, Composer environments, and alerting policies consistently across environments. For SQL transformations, treat queries as versioned artifacts (code review, automated tests, promotion from dev → prod). Deployment patterns often include: separate projects for dev/test/prod, least-privilege service accounts, and artifact-based releases (container images, templates).

Idempotency is a key exam theme: scheduled backfills and retries must not duplicate data. The “correct” automation answer usually includes write patterns like partition overwrite, MERGE upserts, or exactly-once semantics where available, plus checkpointing for streaming.
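The MERGE-upsert idea reduces to a simple invariant: replaying the same batch leaves the table unchanged. This sketch uses an in-memory dict as a stand-in for a keyed table; column names are illustrative:

```python
# Why MERGE-style upserts keyed on a natural key make reruns safe: replaying
# the same batch twice leaves the table unchanged, whereas a blind append
# would duplicate rows. The dict stands in for a keyed table; the `order_id`
# column name is a hypothetical example.

def merge_upsert(table, batch, key="order_id"):
    """Insert new rows, overwrite existing ones by key (idempotent)."""
    for row in batch:
        table[row[key]] = row
    return table

table = {}
batch = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 25},
]
merge_upsert(table, batch)
merge_upsert(table, batch)  # retry / backfill rerun: no duplicates
print(len(table))  # 2
```

The same invariant holds for partition overwrite (rewriting a whole date partition) and is what makes "re-run the failed day" a safe remediation step in incident response.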

  • Exam Tip: If the scenario says “simple schedule, few steps,” don’t pick Composer. If it says “many dependencies, branching, backfills, SLAs,” Composer is often justified.
  • Common trap: Automating everything with ad-hoc scripts on a VM. The exam prefers managed services with auditability, IAM integration, and repeatable deployments.

To choose correctly, map requirements to capabilities: dependency graph complexity → Composer; API choreography → Workflows; cron-like triggers → Scheduler; environment reproducibility → IaC; safe releases → CI/CD with staged promotion and rollback.

Chapter milestones
  • Analyze and optimize with BigQuery SQL and BI patterns
  • Operationalize ML pipelines with BigQuery ML and Vertex AI integration
  • Automate workflows with orchestration and CI/CD
  • Domains 4-5 practice set: troubleshooting, monitoring, and automation questions
Chapter quiz

1. A retail company serves executive dashboards from BigQuery. Multiple Looker/BI users repeatedly query the same 30-day sales metrics throughout the day. Data is updated once nightly. Costs are rising due to repeated scans of a large fact table, but the business requires consistent KPI definitions and sub-second dashboard load times. What should you do?

Show answer
Correct answer: Create authorized views for the KPI logic and build a scheduled query that materializes the KPI results into a denormalized table (or materialized view where applicable) used by BI; partition/cluster the table for common filters.
Domain 4 emphasizes BI patterns, governed metrics, and performance/cost optimization in BigQuery. Materializing commonly used, stable aggregates (via scheduled queries or materialized views when supported) reduces repeated full-table scans and improves latency, while authorized views enforce consistent KPI logic and controlled access. B is wrong because snapshots preserve a point-in-time table but do not reduce per-query scan cost if BI continues scanning a large snapshot table for each dashboard interaction; it also doesn’t create pre-aggregated KPI tables. C is wrong because BI querying Parquet in GCS is not a standard pattern for Looker/BI with BigQuery-managed governance and performance; it shifts complexity, typically increases operational burden, and does not address centralized KPI definitions in BigQuery.

2. A media company trains a churn prediction model. Feature engineering is performed in BigQuery, and analysts want to iterate quickly using SQL. The ML team wants to deploy the trained model to Vertex AI for online predictions and monitoring, while minimizing custom code. What is the best approach?

Show answer
Correct answer: Use BigQuery ML to train the model on BigQuery-managed features, then register/export the model to Vertex AI for deployment and online serving, keeping feature computation in BigQuery.
Domain 4 (use data for analysis/ML enablement) and Domain 5 (operationalization) expect you to choose managed patterns that reduce glue code. BigQuery ML supports SQL-based training where features live, and integration paths exist to deploy models into Vertex AI for online predictions and operational monitoring without rewriting the whole workflow. B is wrong because it introduces significant custom work and data movement, slowing analyst iteration; it may be valid for complex deep learning but is not the minimal-code, SQL-first pattern asked for. C is wrong because Dataflow is not a typical managed online prediction endpoint; it can run batch/streaming inference, but using it as an online serving layer is operationally complex and not aligned with Vertex AI’s purpose-built serving/monitoring.

3. A company runs a nightly data pipeline: ingest files, transform in BigQuery, then publish curated tables for BI. Failures sometimes occur mid-pipeline and engineers need retries, dependency management, and clear observability of each step. The company also wants to automate deployment of pipeline changes through CI/CD. Which solution best meets these requirements?

Show answer
Correct answer: Orchestrate the steps with Cloud Composer (managed Airflow) using separate tasks for ingest, BigQuery jobs, and publication; integrate DAG deployment with Cloud Build (or similar) for CI/CD and use Airflow/Cloud Logging for task-level visibility.
Domain 5 focuses on reliability, automation, and operational rigor: orchestration with retries, dependencies, and monitoring plus CI/CD for changes. Cloud Composer provides task-level state, retries, dependency management, and integrates cleanly with Cloud Logging/Monitoring. CI/CD via Cloud Build (or other pipelines) supports controlled promotion of DAG/code changes. B is wrong because it weakens orchestration semantics (limited step-level retries/branching/visibility) and fails the CI/CD requirement (manual promotion is error-prone). C is wrong because it increases operational risk (key handling, single-host reliability), provides poor observability, and does not align with managed GCP patterns expected on the exam.

4. A data engineer is troubleshooting a BigQuery workload. A dashboard query suddenly became slower and more expensive after a schema change added new columns to a large partitioned fact table. Most dashboard filters are on customer_id and event_date. What should the engineer do FIRST to improve performance and cost while preserving correctness?

Show answer
Correct answer: Review the query plan and table design; ensure partition pruning on event_date and add/adjust clustering on customer_id (and other common filters), then test with representative queries.
Domain 4 expects optimization via query patterns and physical design: partitioning/clustering, pruning, and reading fewer bytes are primary levers to reduce cost and latency. Reviewing the execution plan and ensuring partition pruning and appropriate clustering addresses the likely regression caused by schema/query changes. B is wrong because buying capacity can reduce runtime but does not directly address bytes scanned/cost per query and can mask inefficient design; it’s usually not the first corrective action. C is wrong because disabling cache typically increases cost and latency and does not fix the underlying scan/plan issue; it also conflicts with common BI performance patterns.

5. A fintech company must enforce least-privilege access to curated BigQuery datasets used by analysts. Analysts should see only approved columns (mask PII) and only rows for their business unit. The solution must be maintainable and compatible with BI tools. What should you implement?

Show answer
Correct answer: Use authorized views for curated access, combined with column-level security (policy tags) and row-level security policies; grant analysts access only to the views/datasets they are allowed to query.
Domains 4-5 include governance and secure, consistent consumption patterns. Authorized views provide a governed interface for BI, while BigQuery column-level security (policy tags) and row-level security enforce access controls at the data platform layer, not in the client. This is maintainable and auditable. B is wrong because BI-layer hiding/filtering is not an enforcement boundary; users could still query underlying tables directly and access PII or other units’ data. C is wrong because file distribution is brittle, increases operational overhead, complicates lineage/auditing, and diverges from centralized BigQuery governance expected for certification-style solutions.

Chapter 6: Full Mock Exam and Final Review

This chapter is your capstone: you will run a full-length mock exam in two parts, diagnose weak areas with a repeatable rubric, and finish with a final review that aligns directly to the Google Professional Data Engineer objectives. The goal is not to “learn more,” but to convert what you already know into consistent exam performance: correctly interpreting requirements, eliminating distractors, and choosing designs that balance reliability, scalability, security, and cost.

You will complete a domain-mixed set (to simulate real randomness), then a set of case-study style scenarios (to simulate the long-form thinking the PDE exam expects). After that, you will map misses to the objective areas and build a last-week plan that targets high-yield gaps: Dataflow streaming semantics, BigQuery performance and governance, IAM and encryption defaults, and operations (monitoring, orchestration, CI/CD, and incident response).

Exam Tip: Your score is less important than your “error signature.” Track why you missed: misunderstanding requirements, not knowing a service feature, or getting trapped by a plausible-but-wrong option. Your remediation depends on the miss type.

Practice note for Mock Exam Part 1: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Mock Exam Part 2: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Weak Spot Analysis: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Exam Day Checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Mock exam instructions, timing, and scoring rubric
Section 6.2: Mock Exam Part 1 (Domain-mixed set)
Section 6.3: Mock Exam Part 2 (Case-study style scenarios)
Section 6.4: Answer rationales and domain mapping to objectives

Section 6.1: Mock exam instructions, timing, and scoring rubric

Run the mock exam like production: quiet environment, single sitting, no notes, and strict timing. The PDE exam rewards sustained attention and requirement parsing under time pressure. Split your mock into two sessions if needed, but keep each session realistic: uninterrupted, timed blocks that force decisions. Aim to practice the core skill the exam tests—choosing the best design under constraints—not perfect recall of every product detail.

Timing plan: allocate a fixed average time per question and use “mark and move” aggressively. If you cannot quickly eliminate down to two options, you risk burning time you won’t get back on later multi-step scenarios. Use a three-pass approach: on Pass 1, answer what you can confidently; on Pass 2, revisit marked questions; on Pass 3, do a final check for requirement mismatches.

  • Scoring rubric (self-grade): 1 point for correct selection. Add a second label for each miss: (A) requirement misread, (B) service capability gap, (C) architecture tradeoff gap, (D) operations/governance oversight.
  • Confidence tracking: for each answer, record High/Medium/Low. Your best improvement comes from “High confidence but wrong”—these reveal persistent misconceptions.
  • Domain tagging: tag each question to an objective area: ingestion (Pub/Sub, Dataflow, Dataproc), storage/modeling (BigQuery, GCS, operational DB), analysis/optimization/governance, or operations (monitoring, CI/CD, orchestration).
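The rubric above can be treated as data so the "error signature" falls out of a tally rather than impressions. This sketch invents a four-question result log to show the summary; the domains, results, and labels are illustrative:

```python
from collections import Counter

# The self-grading rubric as data: log each question with domain tag,
# correctness, confidence, and (for misses) the miss label, then summarize
# the "error signature". Sample results are made up for illustration.

MISS_LABELS = {
    "A": "requirement misread",
    "B": "service capability gap",
    "C": "architecture tradeoff gap",
    "D": "operations/governance oversight",
}

results = [
    {"domain": "storage",    "correct": True,  "confidence": "High", "miss": None},
    {"domain": "ingestion",  "correct": False, "confidence": "High", "miss": "A"},
    {"domain": "operations", "correct": False, "confidence": "Low",  "miss": "D"},
    {"domain": "analysis",   "correct": False, "confidence": "High", "miss": "A"},
]

score = sum(r["correct"] for r in results)
signature = Counter(MISS_LABELS[r["miss"]] for r in results if not r["correct"])
high_conf_misses = [r["domain"] for r in results
                    if not r["correct"] and r["confidence"] == "High"]

print(f"score: {score}/{len(results)}")
print(signature.most_common(1))  # the dominant miss type to remediate first
print(high_conf_misses)          # misconceptions, not knowledge gaps
```

Here the signature says "requirement misread" dominates, so remediation is slower question parsing rather than more product study; the high-confidence misses mark domains where a belief, not a gap, needs correcting.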

Exam Tip: Always restate the constraints before you answer (e.g., “must be real-time,” “must be exactly-once,” “PII,” “lowest ops overhead,” “multi-region DR”). Most wrong answers violate one explicit constraint while sounding generally “cloud best practice.”

Section 6.2: Mock Exam Part 1 (Domain-mixed set)

Part 1 is designed to feel like the real exam’s domain-mixed distribution: a question on BigQuery partitioning may be followed by a streaming pipeline reliability question, then an IAM/governance decision. Your job is to quickly identify which “lens” the question is testing: architecture selection, correctness/semantics, security, cost control, or operational excellence.

When you review your performance in this part, pay attention to recurring pattern errors:

  • Streaming semantics: confusion between at-least-once vs exactly-once, event time vs processing time, and how windowing/triggers affect results. In Dataflow, correctness depends on watermarking and allowed lateness; the exam frequently checks whether you understand late data handling and deduplication.
  • Batch vs interactive tradeoffs: choosing Dataproc (Spark/Hadoop) when Dataflow (managed Beam) or BigQuery is simpler and lower-ops, or choosing BigQuery for heavy transactional workloads where an operational store is expected.
  • BigQuery design: partitions/clustering, materialized views, and slot/cost controls. Watch for traps where a proposed solution “works” but explodes cost due to full-table scans or repeated transformations.
  • Security defaults: many services encrypt at rest by default; the exam often wants IAM, CMEK/KMS, VPC Service Controls, and data access controls (row-level / column-level security) tied to governance needs, not “turn on encryption” as a generic answer.

Exam Tip: In domain-mixed questions, the correct answer is usually the one that meets the requirement with the fewest moving parts (managed services, minimal custom code) while still respecting constraints like private connectivity, compliance, and SLOs.

As you complete Part 1, write a one-line justification per answer. If you cannot justify it in one line using the question’s stated constraints, you likely chose based on familiarity rather than fit.

Section 6.3: Mock Exam Part 2 (Case-study style scenarios)

Part 2 simulates the longer, case-study style thinking: multi-step pipelines, multiple stakeholders (security, analytics, operations), and evolving requirements. The PDE exam often embeds the “real” constraint in a single clause: “no data may traverse the public internet,” “must support backfills,” “must isolate PII,” or “must meet a 15-minute SLA with unpredictable spikes.” Your technique is to extract constraints first, then map to patterns.

Use a consistent scenario framework:

  • Ingest: identify sources, frequency, ordering, and failure modes. For streaming, Pub/Sub is commonly the entry point; decide on attributes for routing and keys for ordering/dedup.
  • Process: choose Dataflow for managed streaming/batch Beam pipelines; Dataproc when you truly need Spark/Hadoop ecosystem control; BigQuery when SQL-based ELT is enough.
  • Store/model: choose BigQuery for analytics warehouse; Cloud Storage for data lake landing and archival; operational stores (e.g., Cloud Bigtable/Spanner/Firestore) for low-latency serving patterns. The exam expects you to separate serving needs from analytical needs.
  • Govern/secure: IAM least privilege, service accounts, DLP/PII handling, and boundary controls (VPC-SC) for exfiltration risk.
  • Operate: monitoring (Cloud Monitoring/Logging), data quality checks, orchestration (Cloud Composer/Workflows), CI/CD for pipelines, and incident response runbooks.
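
The "extract constraints first, then map to patterns" technique can be drilled with a simple rubric. The keyword-to-pattern table below is a personal study aid under assumed trigger phrases, not an official Google mapping; extend it with your own notes as you review misses.

```python
# Illustrative constraints-first rubric for case-study questions.
# Trigger phrases and mapped patterns are assumptions for study purposes.
RUBRIC = {
    "no public internet": "Private Google Access / VPC Service Controls",
    "unpredictable spikes": "Serverless: Pub/Sub + Dataflow autoscaling",
    "must support backfills": "Replayable source or GCS landing + reprocessing",
    "isolate pii": "DLP de-identification + policy tags / separate dataset",
    "spark": "Dataproc (only when the Spark/Hadoop ecosystem is required)",
}

def extract_constraints(scenario):
    """Return the rubric entries whose trigger phrase appears in the scenario."""
    text = scenario.lower()
    return {k: v for k, v in RUBRIC.items() if k in text}

scenario = ("Ingest clickstream with unpredictable spikes; no public internet; "
            "analysts need to isolate PII before reporting.")
hits = extract_constraints(scenario)
```

The point is the habit, not the code: every clause in the stem should land on a pattern before you read the answer options.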

Exam Tip: In case-study scenarios, avoid “tool switching” midstream. A common trap is proposing a second processing system (e.g., Dataflow + Dataproc + custom GKE jobs) when the scenario only needs one. Extra components increase operational burden, which the exam treats as a negative unless justified.

After Part 2, re-evaluate whether you consistently choose designs that can be automated and monitored. The PDE role is accountable for reliable data products, not just one-time data movement.

Section 6.4: Answer rationales and domain mapping to objectives

This section is where improvement happens: you will convert wrong answers into objective-aligned action items. For every miss, write (1) the violated constraint, (2) the overlooked service feature, and (3) the “tell” in the distractor option. Then map each miss to the course outcomes: system design under reliability/scalability/security/cost; ingestion/processing (Pub/Sub, Dataflow, Dataproc); storage/modeling (BigQuery, GCS, operational stores); analytics optimization/governance/ML/BI; and operations (monitoring, CI/CD, orchestration, incident response).

Common rationale patterns the exam expects:

  • Reliability: regional vs multi-regional choices, replay/backfill strategy, idempotency, dead-letter handling, and SLO-aware alerting. A “reliable” answer explains how failures are detected and recovered from, not just that a service is managed.
  • Cost: BigQuery cost control via partitioning/clustering and query patterns; avoiding unnecessary egress; choosing serverless when utilization is spiky; minimizing persistent clusters unless required.
  • Security/governance: least privilege IAM, separation of duties, auditability, and fine-grained access controls. The exam often prefers native governance controls over homegrown masking logic.
  • Performance: avoiding full scans, choosing appropriate storage formats in GCS (Parquet/Avro), and leveraging BigQuery features (materialized views, BI Engine where appropriate) rather than over-engineering.
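
The cost bullet above is worth quantifying once. The back-of-envelope sketch below assumes an on-demand rate of $5 per TiB scanned and a 10 TiB table; both numbers are illustrative, so check current BigQuery pricing before relying on them.

```python
# Back-of-envelope: how partition pruning changes on-demand query cost.
# PRICE_PER_TIB and table sizes are assumed values for illustration.
PRICE_PER_TIB = 5.00          # assumed on-demand rate, USD per TiB scanned
TIB = 1024 ** 4

def query_cost(bytes_scanned):
    return bytes_scanned / TIB * PRICE_PER_TIB

table_bytes = 10 * TIB                  # ~5 years of transactions (assumed)
daily_partition = table_bytes / (5 * 365)

full_scan = query_cost(table_bytes)     # unfiltered query scans everything
pruned = query_cost(daily_partition * 30)  # date-partitioned, last 30 days
```

Under these assumptions the unfiltered query costs about $50 while the partition-pruned 30-day query costs under $1, which is the kind of order-of-magnitude gap exam rationales expect you to recognize.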

Exam Tip: When two answers both “work,” pick the one that better matches the objective domain emphasized by the question. If the stem is about governance, an answer that adds IAM controls and auditing will beat an answer that only improves throughput.

Finish this section by producing a ranked “fix list” of your top three weak domains, each with one hands-on review task (e.g., re-derive Dataflow windowing semantics; practice BigQuery partition/clustering decisions; rewrite an IAM policy with least privilege and service accounts).

Section 6.5: Final review: top services, patterns, and common traps

Your final review should be pattern-based, not product-brochure-based. The PDE exam tests whether you can assemble proven GCP reference patterns under constraints. Rehearse these “top of mind” mappings:

  • Pub/Sub → Dataflow → BigQuery: canonical streaming analytics pattern; requires thinking about schemas, deduplication, late data, and BigQuery write patterns.
  • GCS landing zone → ELT in BigQuery: common batch pattern; emphasize data formats, partitioning, and incremental loads.
  • Dataproc: best when you need Spark/Hive compatibility, custom libraries, or tight control; trap: choosing it when serverless alternatives reduce ops.
  • BigQuery optimization: partitions/clustering, pruning, avoiding SELECT *, using approximate aggregates appropriately, materialized views, and slot/reservation awareness when relevant.
  • Governance/security: IAM least privilege, service accounts per workload, CMEK/KMS when required, VPC Service Controls for boundaries, policy tags/row-level security, and audit logs for traceability.
  • Operations: monitoring metrics that align to SLAs (latency, backlog, error rate), orchestration with Composer/Workflows, CI/CD with testing of SQL/pipelines, and documented incident response.
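
The operations bullet, "metrics that align to SLAs," can be rehearsed as a tiny threshold check. Metric names and limits here are assumptions; in practice they would come from Cloud Monitoring signals such as Pub/Sub backlog or Dataflow system lag.

```python
# Minimal sketch of SLA-aligned health checks for a streaming pipeline.
# Metric names and thresholds are illustrative assumptions.
THRESHOLDS = {
    "backlog_seconds": 900,   # 15-minute freshness SLA (assumed)
    "error_rate": 0.01,       # <= 1% failed elements (assumed)
    "p95_latency_ms": 5000,
}

def evaluate(metrics):
    """Return the names of metrics that breach their thresholds."""
    return sorted(name for name, limit in THRESHOLDS.items()
                  if metrics.get(name, 0) > limit)

alerts = evaluate({"backlog_seconds": 1200, "error_rate": 0.002,
                   "p95_latency_ms": 4100})
```

The design choice to rehearse: alert on the SLA-facing symptom (backlog, freshness) rather than only on infrastructure metrics, so a breach maps directly to a user-visible promise.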

Common traps to actively avoid:

  • Answering with “more services” instead of the simplest managed architecture that meets constraints.
  • Ignoring data freshness/latency requirements (batch proposed when streaming is required, or vice versa).
  • Proposing a warehouse as an operational store, or forcing an operational DB to serve analytical scans.
  • Forgetting governance requirements: PII isolation, access boundaries, and auditability.

Exam Tip: If an option sounds “enterprisey” but adds manual processes (hand-run jobs, unmanaged clusters, bespoke security layers), it is often a distractor. The exam values automation and managed controls.

Section 6.6: Exam day readiness: strategy, pacing, and last-week plan

Exam-day success is execution: pacing, attention control, and consistent requirement parsing. Use a deliberate strategy: read the last line first (what is being asked), then scan for constraints (latency, compliance, cost, operational overhead), then evaluate options against those constraints. Do not solutioneer beyond what is asked; the PDE exam rewards best-fit decisions, not maximal designs.

Pacing strategy: commit to a mark-and-return workflow. If you are stuck after eliminating to two options, mark it and move on. Many candidates lose 10–15 minutes early and never recover, which increases error rate later. Keep a mental checkpoint schedule and ensure you leave time for a full review pass of marked items.

  • Last-week plan: focus on your weak-spot list from Section 6.4. Do short, targeted refreshers: one topic per day with concrete artifacts (a sample BigQuery table design with partitioning; a Dataflow pipeline correctness checklist; an IAM policy review).
  • Day-before plan: light review only—patterns and traps. Rehearse “constraint to service” mappings and governance defaults.
  • Operations mindset: be ready to answer how you monitor, alert, backfill, and recover. The exam frequently hides operational requirements inside performance or reliability statements.

Exam Tip: When you feel rushed, slow down for 10 seconds and restate the constraints. Most late-stage mistakes come from answering the “general topic” rather than the specific requirement (e.g., choosing a fast system that violates data residency or choosing a secure system that misses the SLA).

When you finish, do a final pass that checks for “single-constraint violations”: public egress, missing least privilege, lack of replay/backfill, and BigQuery cost pitfalls. These are the easiest points to recover through disciplined review.

Chapter milestones
  • Mock Exam Part 1
  • Mock Exam Part 2
  • Weak Spot Analysis
  • Exam Day Checklist
Chapter quiz

1. You are running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline uses event-time windowing, but your results occasionally contain duplicates after worker restarts. The business requirement is: "No duplicates in BigQuery for each logical event, even during retries," while keeping end-to-end latency low. What should you do?

Show answer
Correct answer: Write to BigQuery using insertId-based streaming inserts derived from a stable event UUID, and ensure the pipeline uses exactly-once source semantics with checkpointing
BigQuery streaming inserts support best-effort deduplication when insertId is stable per logical row; combining a stable event UUID with Dataflow’s checkpointing/replay behavior addresses duplicates caused by retries. Processing-time windowing (B) changes semantics and does not prevent duplicates on retries. Staging to Cloud Storage then batch loading (C) can reduce streaming insert duplication but increases latency and does not inherently guarantee deduplication unless you add additional merge/upsert logic.

2. A retail company stores 5 years of transaction data in BigQuery. Analysts frequently query the last 30 days by date and store_id. Queries are getting expensive and slow. You need to optimize performance and cost with minimal operational overhead. What should you do?

Show answer
Correct answer: Partition the table by transaction_date and cluster by store_id
Partitioning by date and clustering by store_id aligns with common filtering patterns and reduces bytes scanned, improving performance and cost with low overhead. Splitting into one table per store (B) increases management complexity and often worsens performance due to many table scans/unions. External tables on Cloud Storage (C) can be useful for certain use cases but typically have worse query performance and still scan large amounts of data if not carefully partitioned; it also doesn’t address the core optimization as directly as native partitioning/clustering.

3. Your organization requires that all datasets containing PII are encrypted with customer-managed encryption keys (CMEK) and that access is restricted to a small security group. You need a solution that is enforceable and auditable across multiple projects. What should you do?

Show answer
Correct answer: Use BigQuery dataset-level CMEK with Cloud KMS keys, grant dataset access only to the security group, and enforce policy via Organization Policy/controls and centralized logging
BigQuery supports CMEK at the dataset level using Cloud KMS; combining this with least-privilege IAM on the dataset and org-level guardrails (plus audit logs) best meets enforceable, auditable requirements. Default encryption (B) does not satisfy the explicit CMEK requirement; authorized views help with column-level exposure but don’t change encryption controls. External tables over CMEK-encrypted GCS objects (C) can still be queried by principals with BigQuery permissions and appropriate GCS access; it adds complexity and does not replace BigQuery CMEK controls for datasets containing PII.

4. You completed a full mock exam and noticed a pattern: many missed questions involve choosing an overly complex architecture when a simpler managed option would meet requirements. You want a repeatable process to improve exam performance in the final week. What is the best next step?

Show answer
Correct answer: Classify each miss by error type (requirements misread, service feature gap, or distractor trap), map it to the PDE objective domain, and create targeted drills for the highest-frequency categories
The PDE exam rewards interpreting requirements and selecting balanced designs; a rubric that tags miss types and maps to objective areas builds a targeted remediation plan and reduces repeated distractor mistakes. Memorization via repeated retakes (B) can inflate scores without improving reasoning on new questions. Unstructured documentation review (C) may help knowledge gaps but does not address the recurring decision-pattern issue (over-architecting) or the requirement interpretation errors.

5. On exam day, you want to minimize risk of losing time due to long case-study questions and reduce the chance of making avoidable mistakes under pressure. Which strategy best aligns with certification exam best practices?

Show answer
Correct answer: Do a fast pass to answer high-confidence questions first, mark time-consuming case-study items for review, and use remaining time to validate requirements and eliminate distractors
A two-pass approach is effective for timed, mixed-format certification exams: it protects time, builds momentum, and reserves cognitive bandwidth for complex scenarios and requirement re-checks. Leading with the longest items (B) increases time-pressure and can force rushed guesses later. Not reviewing at all (C) increases the chance of unforced errors, especially misread requirements and overlooked constraints.