Google Professional Data Engineer Prep (GCP-PDE): BigQuery

AI Certification Exam Prep — Beginner

A focused, domain-mapped plan to pass GCP-PDE on your first try.

Beginner · gcp-pde · google · professional-data-engineer · bigquery

Prepare with a domain-mapped blueprint for the Google GCP-PDE exam

This course is a structured exam-prep blueprint for the Google Cloud Professional Data Engineer certification (exam code GCP-PDE). It is designed for beginners who are new to certification exams but have basic IT literacy and want a clear, goal-driven path to exam readiness. You’ll learn how Google expects a Professional Data Engineer to think: designing end-to-end data platforms that are secure, reliable, cost-aware, and aligned to real business requirements.

The official exam domains are covered explicitly throughout the curriculum:

  • Design data processing systems
  • Ingest and process data
  • Store the data
  • Prepare and use data for analysis
  • Maintain and automate data workloads

What makes this prep different

GCP-PDE questions are scenario-heavy and rarely ask for trivia. This blueprint emphasizes decision-making: selecting the right GCP services (BigQuery, Dataflow, Pub/Sub, Dataproc, Cloud Storage, and more), justifying trade-offs, and avoiding common architecture and operations anti-patterns. Each content chapter includes exam-style practice milestones so you constantly connect concepts to how they are tested.

6 chapters that mirror how you’ll be tested

Chapter 1 gets you set up before you dive into the technical content: registration and scheduling steps, exam timing strategy, question types (including multiple-select), and a realistic study plan you can follow week by week.

Chapters 2–5 map directly to the official domains. You’ll work from architecture design principles into ingestion and processing (batch and streaming), then into storage and governance (with BigQuery performance design), and finally into analytics and ML pipeline usage plus the operational skills needed to keep workloads healthy in production.

Chapter 6 is your capstone: a full mock exam split into two parts, followed by weak-spot analysis and an exam-day checklist. You’ll leave with a final review routine and a plan to convert practice results into points on the real exam.

How to use this course on Edu AI

Follow the chapters in order for a guided path, or jump to a domain you want to strengthen. Keep notes on service trade-offs and failure modes, and revisit the practice milestones after each chapter to reinforce exam-style reasoning.

Outcome: exam-ready thinking, not memorization

By the end, you’ll be able to interpret a scenario, choose an architecture that satisfies SLAs and governance requirements, implement ingestion and transformations using the right tools, and operationalize analytics and ML workloads with monitoring and automation—exactly the capabilities assessed across the Google GCP-PDE exam domains.

What You Will Learn

  • Design data processing systems: choose GCP services and architectures that meet SLA, security, and cost goals
  • Ingest and process data: build batch and streaming pipelines with Dataflow, Pub/Sub, and Dataproc patterns
  • Store the data: model and manage data in BigQuery, Cloud Storage, and operational stores for reliability and governance
  • Prepare and use data for analysis: enable BI/SQL, feature engineering, and ML workflows with BigQuery and Vertex AI patterns
  • Maintain and automate data workloads: monitor, troubleshoot, test, and automate pipelines with CI/CD and orchestration

Requirements

  • Basic IT literacy (files, networking basics, command line comfort helpful)
  • No prior Google Cloud certification experience required
  • Willingness to learn core GCP concepts (projects, IAM, billing) as part of the course
  • Access to a computer with a modern browser; optional: a GCP free trial account for hands-on practice

Chapter 1: GCP-PDE Exam Orientation and Study Strategy

  • Understand the GCP-PDE exam format, domains, and question styles
  • Registration, scheduling, and test-center/online proctoring checklist
  • How scoring works, time management, and pass-focused strategy
  • Set up your study plan: labs, docs, and revision cadence

Chapter 2: Design Data Processing Systems (Domain: Design data processing systems)

  • Translate business requirements into GCP reference architectures
  • Select services for batch, streaming, and hybrid designs
  • Design for security, privacy, and governance from day one
  • Domain practice set: architecture scenarios and trade-offs

Chapter 3: Ingest and Process Data (Domain: Ingest and process data)

  • Build ingestion patterns for batch and streaming sources
  • Implement transformations with Dataflow and SQL-first approaches
  • Handle data quality, schema drift, and late-arriving events
  • Domain practice set: pipeline debugging and performance scenarios

Chapter 4: Store the Data (Domain: Store the data)

  • Design storage layers for analytics and operational needs
  • Master BigQuery table design, partitioning, and clustering
  • Plan governance: metadata, lineage, retention, and sharing
  • Domain practice set: storage decisions and BigQuery performance

Chapter 5: Prepare and Use Data for Analysis + Maintain and Automate Workloads (Domains: Prepare and use data for analysis; Maintain and automate data workloads)

  • Enable analytics: semantic layers, BI patterns, and SQL optimization
  • Operationalize ML pipelines with BigQuery ML and Vertex AI patterns
  • Automate and orchestrate workloads with CI/CD and scheduling
  • Domain practice set: monitoring, incident response, and ML/analytics scenarios

Chapter 6: Full Mock Exam and Final Review

  • Mock Exam Part 1
  • Mock Exam Part 2
  • Weak Spot Analysis
  • Exam Day Checklist

Maya Rios

Google Cloud Certified Professional Data Engineer Instructor

Maya Rios is a Google Cloud–certified Professional Data Engineer who has trained teams to design and operate production data platforms on GCP. She specializes in BigQuery, Dataflow, and ML-enabled analytics pipelines aligned to official exam objectives.

Chapter 1: GCP-PDE Exam Orientation and Study Strategy

This course prepares you for the Google Cloud Professional Data Engineer (PDE) exam with a BigQuery-centered lens, but the exam itself is broader: it evaluates whether you can design, build, secure, operate, and troubleshoot end-to-end data systems on GCP. Expect scenario-driven questions that force trade-offs across SLA, cost, governance, and operational reliability. In other words, the test is not checking whether you can recite product definitions—it’s checking whether you can choose the right architecture under constraints and explain (implicitly, through your answer) why the alternative options are weaker.

In this first chapter, you’ll orient yourself to the exam format and logistics, understand how domains map to real data projects, learn how to read questions like an examiner, and set a pass-focused study cadence. BigQuery will show up everywhere: as a storage layer, a serving engine, a governance boundary, and increasingly as an orchestration point (scheduled queries, reservations, authorized views, BI Engine, and ML integration). Treat this chapter as your operating manual: it tells you how to spend your limited time for maximum score impact.

Practice note for each milestone in this chapter (exam format, domains, and question styles; registration, scheduling, and proctoring; scoring, time management, and pass-focused strategy; study plan setup): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: Certification overview—Professional Data Engineer role and expectations

The Professional Data Engineer certification validates that you can design and operationalize data solutions on Google Cloud. The exam expects “production thinking”: reliability, security, cost controls, and maintainability—not just building a pipeline that works once. In BigQuery-heavy environments, that means understanding not only SQL, but also dataset design, partitioning/clustering, governance (IAM, authorized views, row/column-level security), and operational patterns like monitoring jobs, controlling costs with reservations/quotas, and handling schema evolution.

The PDE role spans the full lifecycle aligned to the course outcomes: (1) design data processing systems that meet SLAs and compliance; (2) ingest/process data in batch and streaming (Pub/Sub, Dataflow, Dataproc); (3) store data reliably (BigQuery, Cloud Storage, operational stores); (4) prepare/use data for BI, feature engineering, and ML (BigQuery + Vertex AI patterns); and (5) maintain/automate workloads (orchestration, CI/CD, testing, incident response). The exam’s scenarios commonly combine several of these outcomes in a single question.

Exam Tip: When a scenario mentions “regulatory requirements,” “least privilege,” “data residency,” or “PII,” assume the correct answer must address governance explicitly (IAM boundaries, encryption, audit logs, data access controls) even if the question seems performance-focused.

Common trap: over-indexing on BigQuery as the answer to everything. BigQuery is powerful, but the exam rewards appropriate separation of concerns: use Pub/Sub for ingestion buffering, Dataflow for streaming transformations, Cloud Storage for raw immutable landing zones, and BigQuery for analytics serving and managed warehousing. A correct design typically includes more than one service, with clear responsibilities.

Section 1.2: Exam logistics—registration, eligibility, ID, and environment requirements

Logistics can cost you points if mishandled. Registration and scheduling are straightforward through Google’s certification portal and approved testing providers, but you should treat it like an engineering change window: verify requirements early and remove uncertainty. Schedule your exam for a time of day when you reliably sustain attention for 2+ hours (not after a long on-call shift or travel day). Confirm your name matches your ID exactly; mismatches can result in denial at check-in.

For test center delivery, arrive early with acceptable government-issued ID and understand personal item policies. For online proctoring, your environment is part of the “system design”: stable internet, a quiet room, cleared desk, and a compatible machine. Disable corporate VPNs and aggressive endpoint security that can interfere with secure browsers; test the provider’s system check in advance. Close background apps and ensure you have power connected (proctors may require a room scan, and laptops running on battery can fail mid-exam).

Exam Tip: Do a “dry run” 24–48 hours before: reboot, run the compatibility check, verify webcam/microphone, and confirm you can log into the testing portal without MFA surprises on a different device.

Common traps include assuming you can use scratch paper (rules vary), expecting to copy/paste notes, or planning to reference docs. The exam is closed-resource: your preparation must include memorizing key decision criteria (e.g., when to choose Dataflow vs Dataproc, or partitioning vs clustering in BigQuery) and being able to reason from first principles under time pressure.

Section 1.3: Exam domains walkthrough—how objectives map to real projects

The exam domains map closely to real project phases. Start by recognizing that questions are rarely “domain-pure.” A single prompt might require architecture (domain: design), implementation choice (domain: ingest/process), and operations (domain: maintain). Your job is to identify which objective is being tested and answer at that level of abstraction.

Design domain: expect trade-offs among latency, throughput, cost, and governance. BigQuery-specific signals include: reservations vs on-demand pricing, multi-region vs region for residency, authorized views for data sharing, and partitioning/clustering for predictable query cost and performance.
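
To make those BigQuery signals concrete, here is a minimal sketch (project, dataset, and column names are hypothetical) that creates a day-partitioned, clustered table with the google-cloud-bigquery Python client; requiring a partition filter is one way to make scan costs predictable by construction:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project id
    schema = [
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ]
    table = bigquery.Table("my-project.analytics.orders", schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_ts",                        # partition on the filtered column
    )
    table.clustering_fields = ["customer_id"]    # cluster on common filter/join keys
    table.require_partition_filter = True        # queries must prune partitions
    client.create_table(table)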

Ingest/process domain: batch vs streaming is constant. Pub/Sub + Dataflow for streaming; Dataproc (Spark/Hadoop) for lift-and-shift or complex distributed processing; BigQuery for ELT where SQL transformations are sufficient. The examiner often hides the key constraint in a single phrase like “exactly-once,” “late-arriving events,” or “backfill two years of data.”

Store domain: differentiate raw landing (Cloud Storage), curated warehouse (BigQuery), and operational/serving stores (Cloud Bigtable, Cloud SQL, Spanner) based on access patterns. Governance and lifecycle (retention, time travel, table expiration) are frequent BigQuery topics.

Prepare/use domain: the test expects you to support analysts and ML teams. Know when to use BigQuery views/materialized views, BI Engine, BigQuery ML for in-warehouse modeling, and when to export to Vertex AI pipelines. Feature engineering often combines BigQuery SQL transformations with reproducible pipelines and dataset versioning.
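
As an illustration of in-warehouse modeling (dataset, table, and column names are hypothetical), a BigQuery ML sketch shows why "SQL-first team, no separate training pipeline" often points here; both training and batch prediction stay inside the warehouse:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Train a simple classifier directly over a feature table.
    client.query(
        """
        CREATE OR REPLACE MODEL `analytics.churn_model`
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, support_tickets, monthly_spend, churned
        FROM `analytics.customer_features`
        """
    ).result()

    # Batch prediction also runs as SQL, with no data leaving BigQuery.
    rows = client.query(
        "SELECT * FROM ML.PREDICT(MODEL `analytics.churn_model`, "
        "TABLE `analytics.customer_features`)"
    ).result()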

Maintain/automate domain: monitoring, alerting, troubleshooting, CI/CD, and orchestration. Think Cloud Monitoring, logging/audit trails, Dataflow job health, BigQuery job history, and orchestration with Cloud Composer/Workflows. The exam rewards operationally safe patterns: idempotent jobs, retries, dead-letter queues, and rollback plans.

Exam Tip: When two answers both “work,” choose the one that is operationally simplest while still meeting requirements (managed services, fewer moving parts, built-in autoscaling). The PDE exam frequently favors managed, serverless defaults unless a constraint demands otherwise.

Section 1.4: Question anatomy—multiple choice vs multiple select and distractor patterns

PDE questions are scenario-heavy. Your first task is to classify the question type: multiple choice (one best answer) versus multiple select (choose all that apply). Multiple select questions often penalize guessing because partially correct selections become incorrect. Read the prompt carefully for qualifiers like “best,” “most cost-effective,” “minimize operational overhead,” or “meet compliance.” Those words define the scoring rubric.

Distractors (wrong options) are designed to be plausible. Common distractor patterns include: (1) correct service, wrong configuration (e.g., BigQuery partitioning on a non-filtered column); (2) correct architecture, wrong emphasis (e.g., recommending Dataproc when the requirement is minimal ops and a simple transform); (3) security theater (suggesting broad IAM roles rather than least privilege or authorized views); and (4) latency mismatch (batch solution for real-time SLA).

Train a disciplined approach: identify the requirement list (functional + nonfunctional), identify the dominant constraint (latency, cost, governance, simplicity), then eliminate options that violate any constraint. In BigQuery scenarios, watch for “cost predictability” (reservations/slot commitments, partition pruning, avoiding SELECT *), “data sharing across teams” (authorized views, dataset IAM, Analytics Hub), and “streaming semantics” (streaming inserts vs Dataflow streaming into partitioned tables, handling late data).
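
A practical way to internalize the cost-predictability signal is a dry run, which reports the bytes a query would scan without executing it. A minimal sketch with the Python client (table name hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        """
        SELECT customer_id, SUM(amount) AS total
        FROM `analytics.orders`
        WHERE event_ts >= TIMESTAMP '2024-06-01'  -- partition filter enables pruning
        GROUP BY customer_id
        """,
        job_config=job_config,
    )
    # Dry-run jobs complete immediately with the scan estimate.
    print(f"Estimated bytes scanned: {job.total_bytes_processed:,}")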

Exam Tip: If an option introduces extra infrastructure (self-managed clusters, custom schedulers, bespoke auth) without a clear requirement, it’s often a distractor. The exam typically rewards using native managed capabilities (Dataflow, BigQuery scheduled queries, IAM conditions) unless there is a stated limitation.

Another trap is answering based on personal preference rather than the question’s constraints. The exam isn’t asking “what do you like?” It’s asking “what is defensible given this context?” Practice summarizing the scenario in one sentence before selecting an answer; if you can’t do that, you’re likely missing the true objective.

Section 1.5: Study workflow—hands-on labs, flashcards, and spaced repetition plan

A pass-focused study plan blends three modes: (1) hands-on building (labs), (2) concept consolidation (docs + notes), and (3) retrieval practice (flashcards/spaced repetition). For a BigQuery-centered PDE prep, your labs should repeatedly touch the same core tasks under different constraints: loading data (batch and streaming), designing tables (partitioning/clustering), securing access (IAM, authorized views, row-level security), optimizing cost/performance (slot usage, query patterns), and operational monitoring (job history, error diagnosis).

Structure your weeks around a cadence: two build days, one review day, one mixed-practice day. On build days, complete a lab and then write a brief “design rationale” note: what requirement drove each service choice. On review day, convert those rationales into flashcards (e.g., “When would you prefer Dataflow over Dataproc?” “What BigQuery feature supports sharing without granting base table access?”). On mixed-practice day, do timed scenario drills and focus on eliminating distractors quickly.

Exam Tip: Don’t create flashcards for trivia. Create flashcards for decision rules and failure modes (e.g., “Streaming + late events → windowing/watermarks in Dataflow,” “Cost spikes → partition pruning + avoid cross-join blowups + reservations”). These are the facts you must retrieve under stress.

Common trap: reading documentation passively. The exam measures application, so every doc session should end with an action: a tiny implementation (one query, one IAM policy, one partitioned table), or a written comparison (two services, when to choose each). Also schedule periodic “revision sprints” (every 10–14 days) where you re-do earlier labs faster and with fewer hints—speed and confidence matter for exam time management.

Section 1.6: Baseline assessment—skill gap checklist aligned to official objectives

Before you invest heavily, establish a baseline aligned to the official PDE objectives and the course outcomes. Your goal is not to score yourself perfectly; it’s to identify the weakest links that will collapse multi-domain questions. Use the checklist below to mark each item as: Confident / Somewhat / Needs work. Then prioritize “Needs work” items that appear in many scenarios (security, streaming semantics, cost controls).

  • Architecture selection: Can you justify Dataflow vs Dataproc vs BigQuery ELT using SLA/ops/cost constraints?
  • Ingestion patterns: Do you understand Pub/Sub buffering, schema evolution, and replay/backfill strategies?
  • BigQuery modeling: Partitioning vs clustering, denormalization trade-offs, and controlling query cost with pruning.
  • BigQuery governance: Dataset/table IAM, authorized views, row/column-level security, audit logging, and data sharing patterns.
  • Reliability/operations: Monitoring pipelines, handling retries/idempotency, diagnosing failures from logs and job metadata.
  • Automation: Orchestration options (Composer/Workflows/schedulers), CI/CD considerations, and safe deployments.
  • Analytics/ML workflows: When BigQuery ML is sufficient vs when to move to Vertex AI; feature engineering reproducibility.

Exam Tip: Your baseline should include time management. Do at least one timed read-through of complex scenarios and measure how long it takes to extract requirements. If it’s slow, practice “requirement parsing” as a dedicated skill—this often yields faster score gains than learning another service feature.

Finally, convert gaps into a sequence of learning goals tied to outcomes. For example, if governance is weak, schedule a week focused on least-privilege patterns in BigQuery (authorized views, IAM conditions, dataset boundaries) and how those decisions affect downstream BI/ML usage. This is how you turn objectives into a study plan that predicts exam performance rather than hoping for it.

Chapter milestones
  • Understand the GCP-PDE exam format, domains, and question styles
  • Registration, scheduling, and test-center/online proctoring checklist
  • How scoring works, time management, and pass-focused strategy
  • Set up your study plan: labs, docs, and revision cadence
Chapter quiz

1. You are mentoring a teammate who is focusing their study plan almost entirely on memorizing BigQuery feature definitions. You want them to align with the Google Cloud Professional Data Engineer exam’s scenario-based style. What is the best guidance?

Correct answer: Prioritize practicing scenario questions that require choosing architectures under constraints (SLA, cost, governance, reliability) and justify trade-offs across GCP services.
The PDE exam emphasizes designing, building, operationalizing, securing, and troubleshooting end-to-end data systems using scenario-driven trade-offs—domain-aligned decision-making, not rote product recall. Option B is wrong because memorization alone won’t prepare you for architecture decisions under constraints. Option C is wrong because although BigQuery is common, the exam is broader and covers multiple services and lifecycle responsibilities.

2. Your manager asks how to approach time management during the PDE exam. You want to recommend a pass-focused strategy aligned with exam realities. Which approach is most appropriate?

Correct answer: Use a two-pass approach: answer confident questions quickly, flag time-consuming ones, then return to flagged items; avoid spending too long on any single question.
A two-pass strategy matches certification exam best practices: maximize points by securing easy wins first, then allocate remaining time to harder trade-off questions. Option B is wrong because it creates uneven pacing and increases the risk of unanswered questions. Option C is wrong because certainty is often impossible in scenario questions; over-investing time in one item reduces overall score potential.

3. A company is preparing for the PDE exam and asks what types of questions to expect. They want to tailor their study activities accordingly. What is the most accurate expectation?

Correct answer: Primarily scenario-based questions that require selecting the best design/operations choice among plausible alternatives.
The PDE exam typically uses scenario-based multiple-choice questions that test judgment across design, security, reliability, cost, and operations—reflecting official domain skills. Option B is wrong because exact CLI/UI recall is not the primary assessment method. Option C is wrong because the exam is not a live lab exam; it is multiple-choice/selection-based assessment.

4. You are building a study plan for a team with 4 weeks until the PDE exam. They can do labs, read docs, and do question practice, but time is limited. Which plan best matches a pass-focused cadence?

Correct answer: Blend hands-on labs with targeted documentation reading and spaced revision (regular review cycles) while practicing scenario questions to expose knowledge gaps.
A pass-focused strategy balances hands-on experience (labs), authoritative references (docs), and periodic review to retain concepts, while using practice questions to learn exam-style trade-offs—aligned with the exam’s applied nature. Option B is wrong because delaying labs reduces the ability to build intuition for operational and architectural choices. Option C is wrong because scenario questions often require practical understanding of how services behave and interact.

5. During prep, a candidate notes that BigQuery appears in many study scenarios (scheduled queries, reservations, authorized views, BI Engine, ML integration). They conclude the exam is essentially a BigQuery exam. How should you correct this assumption in an exam-aligned way?

Correct answer: Explain that BigQuery is a frequent component, but the exam evaluates end-to-end data system design and operations across GCP, including security, reliability, and troubleshooting beyond a single product.
BigQuery is commonly used as storage/serving/governance, but the PDE exam domains cover the full lifecycle of data solutions on GCP: design, build, security, operationalization, monitoring, and troubleshooting. Option B is wrong because it ignores the broader service ecosystem and cross-domain trade-offs. Option C is wrong because governance and operations are core to the PDE role and appear frequently in exam scenarios.

Chapter 2: Design Data Processing Systems (Domain: Design data processing systems)

This domain of the Google Professional Data Engineer exam is less about memorizing product features and more about proving you can translate business requirements into a defensible GCP reference architecture. Expect prompts that include a mix of SLA language (availability, freshness), data characteristics (volume, velocity, schema volatility), and constraints (security, residency, cost). Your job is to pick the smallest set of managed services that meet the stated goals while avoiding operational overhead and anti-patterns.

The exam frequently hides the “real requirement” in a single line: e.g., “needs near real-time dashboards,” “replay events for 7 days,” “must isolate PII,” or “jobs run once per day but must complete in 30 minutes.” In this chapter you’ll practice recognizing those signals, selecting batch/streaming/hybrid patterns (Dataflow, Pub/Sub, Dataproc), and landing data into the right storage layer (BigQuery, Cloud Storage, operational stores) with governance and security designed in from day one.

As you read, keep anchoring decisions to: (1) business outcome, (2) measurable SLO/SLA targets, (3) failure modes (what happens when a dependency fails), and (4) cost drivers (storage, slots, shuffle, egress). Many “wrong” answers on the exam are plausible technically but violate a constraint (latency, residency, access boundaries) or add avoidable operational risk.

Practice note for each milestone in this chapter (translating requirements into reference architectures; selecting services for batch, streaming, and hybrid designs; designing security, privacy, and governance in from day one; the architecture practice set): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Requirements and constraints—SLA/SLO, RPO/RTO, latency, throughput, and cost

Design questions start with requirements. Translate vague statements into measurable constraints: SLA (promised availability) and SLO (internal target). For data systems, common SLOs include data freshness (e.g., “events available in BigQuery within 2 minutes”), job completion time (“daily pipeline completes by 06:00”), and correctness (“exactly-once billing records”). Map disaster recovery requirements using RPO (maximum tolerable data loss) and RTO (maximum tolerable downtime). A streaming telemetry pipeline might need RPO≈0 and low RTO; a nightly batch enrichment might tolerate hours of RTO and a nonzero RPO if source data can be reloaded.
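
A freshness SLO is easiest to reason about once you can state it as a query. Here is a minimal monitoring sketch (the analytics.events table and its event_ts column are hypothetical) that measures lag against a 2-minute target:

    from google.cloud import bigquery

    FRESHNESS_SLO_SECONDS = 120  # "events available in BigQuery within 2 minutes"

    client = bigquery.Client()
    row = next(iter(client.query(
        "SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_ts), SECOND) AS lag_s "
        "FROM `analytics.events`"
    ).result()))

    if row.lag_s > FRESHNESS_SLO_SECONDS:
        print(f"SLO breach: newest event is {row.lag_s}s old")  # alert in a real system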

Latency and throughput drive architecture. “Near real-time” typically implies Pub/Sub + Dataflow streaming into BigQuery, whereas “hours is fine” can be Cloud Storage + batch load + SQL transforms. Throughput affects partitioning, batching, and concurrency: millions of events per second pushes you toward managed ingestion with Pub/Sub and Dataflow autoscaling, and careful BigQuery partitioning/clustering to control query scan costs.

Cost is always a constraint even when not explicitly stated. Identify cost levers: BigQuery on-demand bytes scanned vs editions/slots, Dataflow worker sizing and streaming engine, Dataproc cluster lifecycle, and storage class choices in Cloud Storage. The exam tests whether you can choose a design that meets SLOs without overprovisioning (e.g., permanent Dataproc cluster for intermittent ETL).

Exam Tip: When a prompt includes “minimize operational overhead” or “managed service preferred,” favor serverless/managed options (BigQuery, Dataflow, Pub/Sub) over VM-managed approaches (self-managed Kafka/Spark) unless a hard requirement forces the latter.

Common trap: confusing RPO/RTO with latency. Low-latency analytics does not automatically imply strict DR; conversely, strict RPO can be satisfied with durable log storage (Pub/Sub retention, Cloud Storage) even if analytics are batch.

Section 2.2: Service selection matrix—BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Storage, Spanner/Bigtable

The PDE exam expects you to choose services based on workload shape. Build a mental selection matrix. BigQuery is the analytic warehouse: SQL at scale, managed storage, strong BI integration, and native ML/feature engineering patterns. Cloud Storage is the durable landing zone and data lake: cheap, scalable, ideal for raw files, reprocessing, and long-term retention. Pub/Sub is the ingestion backbone for event streams with fan-out and replay (within retention). Dataflow (Beam) is the managed pipeline engine for both batch and streaming, with windowing, state, and exactly-once patterns where applicable. Dataproc is managed Hadoop/Spark: best when you need Spark ecosystem compatibility, custom libraries, or lift-and-shift jobs—especially batch jobs that can run on ephemeral clusters.

Operational databases appear in scenarios where the system serves low-latency reads/writes for applications. Spanner fits globally consistent relational needs (strong consistency, SQL, high availability), while Bigtable fits high-throughput, wide-column, low-latency key/value access (time series, IoT). The key exam move is to avoid using BigQuery as an OLTP store: BigQuery is optimized for analytics, not per-row transactional workloads.

Exam Tip: When you see “stream processing with complex event-time windows,” “deduplication,” or “enrichment with lookups,” Dataflow streaming is usually the intended answer. When you see “Spark jobs, MLlib, existing Scala code,” Dataproc is often the cleanest migration path—especially if the requirement mentions portability or existing Hadoop.

Common trap: picking Dataproc for simple ETL because it “can do everything.” The exam often rewards managed simplicity: Dataflow for pipelines, BigQuery for transformations (SQL) when feasible, and ephemeral Dataproc only when Spark-specific value is required. Another trap: confusing Pub/Sub with storage. Pub/Sub is a message bus with retention, not a long-term data lake; pair it with Cloud Storage or BigQuery for durable analytical storage.

Section 2.3: Data modeling strategy—raw/clean/curated zones, lakehouse patterns, and domain ownership

Strong architectures separate concerns: ingest first, then standardize, then publish. A common exam-friendly pattern is raw/clean/curated zones. “Raw” preserves source fidelity (immutable, append-only where possible) and enables reprocessing; place it in Cloud Storage (files) or BigQuery raw datasets with minimal transformation. “Clean” applies schema normalization, deduplication, and PII handling; this is often Dataflow or BigQuery SQL. “Curated” (or “serving”) is modeled for analytics: dimensional models, wide tables for BI, aggregates, and governed views.
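
As a sketch of the raw-to-clean step (table and column names are hypothetical), a SQL-first dedup keeps the newest record per business key while leaving the raw zone immutable and reprocessable:

    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(
        """
        CREATE OR REPLACE TABLE `clean.orders` AS
        SELECT * EXCEPT (rn)
        FROM (
          SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id      -- business key
            ORDER BY ingest_ts DESC    -- newest copy wins
          ) AS rn
          FROM `raw.orders`
        )
        WHERE rn = 1
        """
    ).result()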

Lakehouse patterns blend Cloud Storage as the low-cost lake with BigQuery as the query/warehouse layer. On the exam, this shows up when requirements mention both “retain raw files for compliance” and “fast SQL analytics.” The correct design often lands raw in Cloud Storage, then loads or externalizes into BigQuery for analysis, with curated BigQuery tables partitioned and clustered for performance/cost control.

Domain ownership matters for governance. If the prompt mentions multiple business units, shared datasets, or “data mesh,” look for designs that allow dataset-level boundaries: separate BigQuery datasets per domain, clear contracts (schemas), and centrally governed shared dimensions via authorized views. This reduces accidental coupling and supports least-privilege access.

Exam Tip: When asked to “enable reprocessing” or “support schema evolution,” ensure the architecture keeps raw data immutable and versioned. Answers that overwrite raw inputs or only store transformed outputs often fail hidden reprocessing requirements.

Common trap: modeling everything as one giant BigQuery dataset with broad permissions “for simplicity.” The exam tends to penalize this because it breaks governance and least privilege. Another trap: treating curated outputs as the only source of truth; your source of truth should be raw (or a contractually defined upstream system), with curated being reproducible.

Section 2.4: Security architecture—IAM, service accounts, VPC Service Controls, CMEK, DLP patterns

Security is not a bolt-on; the exam tests “secure-by-design.” Start with IAM boundaries: grant least privilege at the right level (project vs dataset vs table). In BigQuery, dataset IAM plus authorized views are core tools for row/column exposure control. Use separate service accounts per pipeline component (ingest, transform, publish) to limit blast radius, and avoid using user credentials for production pipelines.
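
A minimal sketch of the authorized-view pattern with the Python client (project and dataset names are hypothetical): consumers query a view in a shared dataset and never receive access to the restricted base table.

    from google.cloud import bigquery

    client = bigquery.Client()

    # 1. A view in the shared dataset exposes only aggregated, non-sensitive data.
    view = bigquery.Table("my-project.shared.orders_summary")
    view.view_query = """
        SELECT customer_region, SUM(amount) AS total
        FROM `my-project.restricted.orders`
        GROUP BY customer_region
    """
    client.create_table(view)

    # 2. Authorize the view on the restricted source dataset.
    src = client.get_dataset("my-project.restricted")
    entries = list(src.access_entries)
    entries.append(bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={"projectId": "my-project",
                   "datasetId": "shared",
                   "tableId": "orders_summary"},
    ))
    src.access_entries = entries
    client.update_dataset(src, ["access_entries"])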

For data exfiltration risk, VPC Service Controls is a common “best answer” when the prompt mentions protecting sensitive data from credential theft or restricting access from the public internet. Service perimeters around BigQuery, Cloud Storage, and Pub/Sub can prevent data access from outside allowed networks/projects. If the requirement includes customer-managed encryption keys or regulatory mandates, choose CMEK for BigQuery/Storage where supported, and be prepared to mention Cloud KMS key rotation and separation of duties.
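
Where CMEK is mandated, the key is attached at table (or dataset) creation time; a minimal sketch with hypothetical key and table names:

    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table("my-project.restricted.claims")
    table.encryption_configuration = bigquery.EncryptionConfiguration(
        kms_key_name="projects/my-project/locations/eu/keyRings/dw/cryptoKeys/bq"
    )
    client.create_table(table)  # BigQuery encrypts the table with your KMS key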

DLP patterns show up for PII: tokenization, masking, or classification. The exam doesn’t require you to implement a full policy engine, but it expects you to know that sensitive fields may need detection/classification and then access controls (views, policy tags) or transformation (hashing/tokenization) before wide sharing. A robust pattern is: land raw in restricted zone, run DLP inspection/classification, write clean zone with sensitive fields transformed, then publish curated datasets with authorized views.

Exam Tip: Watch for phrasing like “prevent data exfiltration,” “only allow access from corporate network,” or “compliance requires encryption key control.” Those are strong signals for VPC Service Controls and CMEK. Pure IAM alone is often an incomplete answer in such prompts.

Common trap: assuming private IP alone solves exfiltration. Private connectivity helps, but VPC Service Controls addresses stolen credentials and copy-to-external-project scenarios. Another trap: using one overly privileged service account for all pipelines—this is operationally easy but exam-unfriendly due to least-privilege violations.

Section 2.5: Reliability and scalability—autoscaling, backpressure, quotas, and multi-region considerations

Reliability is measured at the system boundary: can you meet freshness and availability targets when components fail or load spikes? Dataflow autoscaling is a common lever for variable throughput; for streaming pipelines, understand backpressure: if BigQuery streaming inserts or downstream sinks slow, Dataflow must buffer and apply flow control. Design includes retry policies, dead-letter queues (often Pub/Sub), and idempotent writes/deduplication strategies to handle at-least-once delivery.
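
In Beam, a dead-letter branch is typically a tagged side output. A minimal runnable sketch (the in-memory source stands in for a Pub/Sub read):

    import json
    import apache_beam as beam

    class ParseEvent(beam.DoFn):
        def process(self, raw):
            try:
                yield json.loads(raw)                       # happy path -> main output
            except Exception:
                yield beam.pvalue.TaggedOutput("dlq", raw)  # poison message -> DLQ

    with beam.Pipeline() as p:
        results = (
            p
            | beam.Create(['{"id": 1}', "not-json"])
            | beam.ParDo(ParseEvent()).with_outputs("dlq", main="parsed")
        )
        results.parsed | "Good" >> beam.Map(print)
        # In production, write the dlq branch to a Pub/Sub topic or
        # Cloud Storage for inspection and replay.
        results.dlq | "Bad" >> beam.Map(lambda m: print("DLQ:", m))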

Quotas frequently appear as hidden constraints. BigQuery has quotas on API requests, load jobs, and streaming inserts; Pub/Sub has throughput and message size constraints; Dataflow has worker limits per region/project. A “best” answer might avoid hitting streaming insert limits by using batch loads (files to Cloud Storage then load jobs) when latency allows, or by using partitioned tables and efficient batching. For Dataproc, reliability often means ephemeral clusters, autoscaling policies, and using managed services (e.g., storing intermediates in Cloud Storage) so cluster loss doesn’t lose state.

Multi-region considerations tie to availability and data residency. BigQuery datasets can be multi-region or region; you generally must keep data processing in the same location to avoid egress and latency issues. If the prompt stresses disaster tolerance, multi-region BigQuery and multi-region Cloud Storage are strong options, but they can conflict with strict residency requirements.

Exam Tip: When a scenario mentions “spiky traffic” or “unpredictable volume,” prefer services with built-in horizontal scaling (Pub/Sub, Dataflow, BigQuery) and designs that decouple producers/consumers. Tight coupling (direct writes from apps into BigQuery without buffering) is often brittle under spikes.

Common trap: designing only for the happy path. The exam likes answers that mention replay (raw retention), DLQ for poison messages, and safe rollbacks. Another trap: ignoring location alignment—cross-region processing can silently break SLAs via latency/egress costs.

Section 2.6: Exam-style design questions—trade-offs, anti-patterns, and “best answer” reasoning

The exam is a “best answer” test: several options may work, but only one most directly meets constraints with minimal risk and operations. Your reasoning should be explicit: identify the primary requirement (latency, governance, cost), eliminate options that violate it, then choose the simplest managed architecture that satisfies the remaining needs.

Trade-offs are the heart of this domain. Batch vs streaming: streaming costs more to run continuously but meets low-latency SLOs; batch is cheaper and simpler when freshness can be minutes/hours. BigQuery transforms vs external compute: SQL in BigQuery reduces moving data and operational burden, but Dataflow/Dataproc is better for complex event-time logic, heavy custom code, or non-SQL transformations. Cloud Storage vs BigQuery for raw: Storage is cheap and flexible for any format; BigQuery raw tables make exploration easier but can increase warehouse costs if abused as a landing zone without lifecycle controls.

Recognize common anti-patterns the exam punishes: using BigQuery as a transactional datastore for per-user app reads/writes; building a permanent Dataproc cluster for a once-a-day job; skipping raw retention so you cannot reprocess; granting broad project-level IAM when dataset-level controls are needed; and designing a streaming pipeline without a buffering layer or replay strategy.

Exam Tip: When an option adds components not required by the prompt (extra queues, custom clusters, complex networking) it’s often a distractor. Prefer architectures that are “just enough” while still addressing reliability (replay/DLQ), security (least privilege, perimeters), and cost (serverless where appropriate).

How to identify correct answers quickly: underline non-negotiables (residency, encryption, freshness, retention), then map them to reference patterns: Pub/Sub→Dataflow→BigQuery for near real-time analytics; Cloud Storage→BigQuery load→SQL for batch; Dataproc for Spark compatibility; Spanner/Bigtable for serving workloads. This disciplined mapping is exactly what the domain “Design data processing systems” is assessing.

Chapter milestones
  • Translate business requirements into GCP reference architectures
  • Select services for batch, streaming, and hybrid designs
  • Design for security, privacy, and governance from day one
  • Domain practice set: architecture scenarios and trade-offs
Chapter quiz

1. A retail company needs near real-time dashboards in BigQuery from clickstream events (~50k events/sec). They must be able to replay events for up to 7 days in case downstream processing fails. They want minimal operational overhead. Which architecture best meets the requirements?

Correct answer: Publish events to Pub/Sub, process with Dataflow streaming (with dead-letter handling), and write to BigQuery; retain events using Pub/Sub message retention (or export to Cloud Storage) to enable 7-day replay
Pub/Sub + Dataflow streaming is the managed reference pattern for low-latency pipelines into BigQuery and supports operationally simple replay when paired with retention/backlog handling. Option B is wrong because BigQuery time travel applies to table data changes, not as a durable event buffer for replaying raw source events; direct streaming inserts also shift retry/replay complexity to the client. Option C is wrong because batch landing and daily processing cannot meet near real-time dashboard freshness requirements.

2. A healthcare provider ingests daily claim files (300 GB/day) and must complete transformations and load to BigQuery within 30 minutes of file arrival. The schema changes occasionally. The team wants to minimize cluster management and prefers serverless. Which solution is most appropriate?

Correct answer: Use Cloud Storage as the landing zone and trigger a Dataflow batch pipeline (Flex Template) to validate/transform and load into partitioned BigQuery tables, handling schema evolution in the pipeline
Serverless Dataflow batch fits time-bounded batch SLAs with minimal ops and can encode schema handling/validation before loading to BigQuery. Option B can work but adds avoidable operational overhead (cluster lifecycle, patching, sizing) that the scenario explicitly wants to avoid. Option C is wrong because federated queries over Cloud Storage are not ideal for repeated production transformations under tight SLAs and do not provide the same controlled validation/loading pattern expected for governed pipelines.

3. A global company must keep EU customer PII data in the EU and allow US analysts to query only aggregated metrics. They are building a BigQuery-based platform and want governance designed in from day one. What is the best approach?

Correct answer: Store EU PII in EU-located BigQuery datasets and use policy tags (Data Catalog) and row/column-level security; expose only authorized aggregated tables or authorized views to US analysts
Using EU datasets enforces data residency, while policy tags plus row/column-level security and authorized views are standard BigQuery governance controls to isolate PII and safely share aggregates. Option B is wrong because it violates the residency requirement (EU PII stored in the US) and IAM alone is a coarse control compared to column/row security and governed sharing patterns. Option C is wrong because querying external data across regions can violate residency/egress constraints and still risks exposing raw PII rather than enforcing controlled, least-privilege access in BigQuery.

4. An IoT company needs a hybrid design: devices stream telemetry continuously, but business reporting uses curated daily snapshots in BigQuery. They want one design that supports low-latency alerting and cost-efficient historical reporting. Which architecture is most appropriate?

Correct answer: Ingest telemetry into Pub/Sub, process with Dataflow streaming for real-time alerts and write raw/processed data to BigQuery; also schedule a daily BigQuery transformation (or Dataform) to produce curated snapshot tables for reporting
Pub/Sub + Dataflow streaming is the canonical managed pattern for continuous ingestion and low-latency processing, while scheduled BigQuery transformations create cost-efficient, query-optimized daily snapshot tables for reporting. Option B is wrong because direct-to-BigQuery from devices increases client complexity and the on-prem nightly ETL adds operational overhead and latency risk. Option C is wrong because a once-per-day batch pipeline cannot meet real-time alerting requirements.

5. A media company is choosing between batch and streaming for log processing. Requirements: logs arrive continuously; dashboards must be updated within 2 minutes; occasional late events (up to 10 minutes) must be correctly counted; operations team is small. Which design best fits?

Correct answer: Use Pub/Sub ingestion and Dataflow streaming with event-time windowing and allowed lateness; write results to BigQuery for dashboards
Dataflow streaming supports event-time processing, windowing, and handling late data—key for correct counts with late arrivals—while keeping ops overhead low. Option B is wrong because a micro-batch load pattern without late-data handling will produce incorrect metrics and may not consistently meet the 2-minute freshness target. Option C is wrong because it introduces avoidable operational burden (cluster management and Kafka operations) compared to managed Pub/Sub + Dataflow for the stated constraints.

Chapter 3: Ingest and Process Data (Domain: Ingest and process data)

This domain is where the Professional Data Engineer exam most often blends architecture with operational reality: you must choose ingestion and processing patterns that meet SLAs (latency and freshness), reliability (retries, idempotency, backfill), security (least privilege, data boundaries), and cost (batch vs streaming, managed vs self-managed). In practice, you’ll be asked to reason from requirements to a design: “near real-time analytics,” “daily batch,” “CDC from Cloud SQL,” “files arriving from partners,” “schema changes,” or “late events.”

The exam tests whether you can match the right managed service to the job and then apply the critical details: how BigQuery loads differ from streaming inserts, how Dataflow’s Beam model handles time, and how to build pipelines that survive messy data (duplicates, partial failures, drift). You should be prepared to justify your selection using three axes: latency, operability (monitoring, backfills, DLQs), and cost predictability.

Exam Tip: When two answers both “work,” prefer the option that is (1) managed, (2) simpler to operate, and (3) aligns to the required latency. The PDE exam rewards “right-sized” choices—don’t pick Dataflow streaming for a once-per-day file drop unless latency truly requires it.

Practice note for each milestone in this chapter (ingestion patterns for batch and streaming sources; transformations with Dataflow and SQL-first approaches; data quality, schema drift, and late-arriving events; the pipeline debugging practice set): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Ingestion services—Pub/Sub, Storage Transfer, BigQuery load jobs, Datastream concepts

This section maps directly to the exam objective “build batch and streaming pipelines,” and it starts with choosing an ingestion front door. For streaming events (application logs, IoT telemetry, clickstream), Pub/Sub is the default ingestion buffer: it decouples producers from consumers, scales horizontally, and provides retention for replay within its configured window. For batch file movement (SFTP partners, cross-cloud buckets), Storage Transfer Service is often the best managed option; it handles scheduling, retries, and incremental transfers without you writing orchestration glue.
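
For context, producing into Pub/Sub is deliberately simple, which is what makes it a good decoupling point; a minimal publisher sketch (project and topic names hypothetical):

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "clickstream")
    # Payload is opaque bytes; attributes can carry routing metadata.
    future = publisher.publish(topic, b'{"event": "page_view"}', origin="web")
    print(future.result())  # message id, once the service acknowledges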

For landing data into BigQuery, the exam expects you to distinguish two ingestion modes. BigQuery load jobs are batch-oriented (from Cloud Storage, or from other sources via connectors), cost-effective, and ideal for micro-batch patterns (e.g., every 5 minutes or hourly). Each load job commits atomically, and load jobs integrate well with partitioned tables for efficient refresh. BigQuery streaming ingestion (the Storage Write API, or legacy streaming inserts) supports low-latency ingestion, but the exam frequently probes the operational and cost implications (streaming pricing, eventual-consistency concerns for some operations, and the need for deduplication when retries occur).
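
A minimal load-job sketch (table and bucket names are hypothetical) using the GoogleSQL LOAD DATA statement, which runs as a batch load job and commits atomically:

```sql
-- Hypothetical micro-batch load: pull the day's files from Cloud Storage
-- into a native table as a single atomic load job (no streaming cost).
LOAD DATA INTO analytics.raw_pos_events
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://example-landing/pos/2024-06-01/*.csv']
);
```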

Datastream is conceptually different: it’s change data capture (CDC) from operational databases (e.g., MySQL/PostgreSQL/Oracle) into analytics destinations (often landing in Cloud Storage/BigQuery via downstream processing). On the test, look for phrases like “minimize load on OLTP,” “capture inserts/updates/deletes,” “near real-time replication,” or “keep analytics in sync” to trigger Datastream. You may still need a processing step to apply changes (MERGE in BigQuery, Dataflow CDC pipelines) depending on the destination format.
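
A sketch of the downstream apply step Datastream often needs (all table and column names here are assumptions): fold change records into the analytics table with a deterministic MERGE.

```sql
-- Hypothetical CDC apply: keep only the newest change per key, then
-- upsert/delete into the analytics table idempotently.
MERGE analytics.customers AS t
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY change_ts DESC) AS rn
    FROM staging.customers_cdc
  )
  WHERE rn = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED AND s.op != 'DELETE' THEN
  INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email);
```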

Exam Tip: If the source is “files in a bucket once per day,” BigQuery load jobs (possibly orchestrated) usually beat streaming inserts on cost and simplicity. If the source is “database row-level changes,” look for CDC (Datastream) rather than periodic full exports.

  • Common trap: Picking Pub/Sub for large file transfers. Pub/Sub is for messages/events, not bulk file movement.
  • Common trap: Treating load jobs as “real-time.” They can be frequent, but they are still batch with job latency and scheduling considerations.
Section 3.2: Dataflow fundamentals—Beam model, windowing, triggers, stateful processing, watermarking

Dataflow (Apache Beam) is the exam’s centerpiece for managed processing, and you must know the conceptual model, not just “it runs pipelines.” Beam transforms are applied to PCollections, and the pipeline can be bounded (batch) or unbounded (streaming) with the same programming model. The exam tends to test time semantics and correctness under out-of-order data—this is where windowing, triggers, and watermarks matter.

Windowing defines how unbounded data is grouped for aggregation: fixed windows (e.g., 5 minutes), sliding windows (overlapping), and session windows (activity gaps). Triggers decide when to emit results: after watermark, after processing-time delay, after count, or repeatedly. Watermarking is Dataflow’s estimate of event-time progress; it determines when the system believes “most data for this window has arrived.” Late data is anything that arrives after the watermark passes the window end, and your pipeline must define allowed lateness and how to handle updates.

Stateful processing lets you keep per-key state (e.g., dedup keys, running totals, sessionization) across elements, often combined with timers. This appears in scenarios like “deduplicate events across retries,” “enforce ordering per user,” or “enrich with slowly changing reference data.” You should also understand that state and timers increase operational complexity and can increase cost; the exam often rewards simpler designs when requirements allow (e.g., using BigQuery MERGE-based idempotency rather than extensive state if latency permits).

Exam Tip: If a question mentions “late-arriving events” or “event time,” your answer must reference windowing + allowed lateness + triggers (or an approach that explicitly tolerates late data, such as writing raw events then doing periodic backfill/compaction).

  • Common trap: Using processing time when the requirement is “based on when the event occurred.” Event time is required for correctness with out-of-order data.
  • Common trap: Assuming the watermark is perfect. It’s an estimate; design for late data explicitly.
Section 3.3: Batch processing—Dataproc/Spark patterns vs BigQuery ELT vs Dataflow batch

Batch processing questions typically ask you to pick between (1) Dataproc (Spark/Hadoop), (2) BigQuery ELT (SQL-first), and (3) Dataflow in batch mode. The best choice depends on workload shape, team skills, and operational expectations.

BigQuery ELT is often the exam’s preferred answer for straightforward transformations: SQL-based cleansing, joins, aggregations, and dimensional modeling. If data is already in BigQuery or easily loaded to BigQuery, ELT reduces moving parts and leverages managed scaling. Look for requirements like “analysts maintain logic in SQL,” “minimize ops,” “ad hoc reprocessing,” or “cost visibility per query.” You’ll often pair this with partitioning, clustering, and scheduled queries/materialized views.
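
To make the pattern concrete, here is a minimal SQL-first ELT sketch (dataset and column names are hypothetical) of the kind you might run as a scheduled query:

```sql
-- Hypothetical curated rebuild: aggregate the raw layer into a
-- partitioned summary table that BI tools query directly.
CREATE OR REPLACE TABLE curated.daily_sales
PARTITION BY order_date AS
SELECT
  DATE(order_ts) AS order_date,
  store_id,
  SUM(amount)    AS revenue,
  COUNT(*)       AS order_count
FROM raw.orders
GROUP BY order_date, store_id;
```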

Dataflow batch is strong when you need advanced transforms (custom parsing, complex file formats, enrichment), large-scale shuffle with managed autoscaling, or when you want a single Beam pipeline to support both batch and streaming variants. It’s also common when ingesting from Cloud Storage and writing to BigQuery with transformations that are awkward in SQL alone.

Dataproc/Spark is most appropriate when you need the Spark ecosystem (existing code, MLlib, GraphX), Hadoop-compatible tooling, or when porting on-prem Hadoop jobs. The exam frequently emphasizes that Dataproc clusters require more operational ownership (cluster lifecycle, sizing, and patching), though you can mitigate this with ephemeral clusters, autoscaling policies, and managed workflows. If the problem statement hints at “existing Spark jobs” or “lift-and-shift,” Dataproc becomes the stronger choice; otherwise, managed serverless options are often favored.

Exam Tip: When the transformation is primarily relational and the destination is BigQuery, default to BigQuery ELT unless the prompt explicitly requires custom code, streaming compatibility, or complex parsing.

  • Common trap: Choosing Dataproc because it “can do anything.” The exam often penalizes unnecessary operational burden.
  • Common trap: Ignoring data locality and formats. If the source is many small files in GCS, consider file compaction/partitioning strategies regardless of engine.
Section 3.4: Streaming processing—exactly-once/at-least-once semantics, deduplication, and ordering

Streaming on the PDE exam is about correctness under retries and disorder. Pub/Sub provides at-least-once delivery, which means duplicates can occur. Dataflow can achieve effectively-once processing for certain sinks and patterns, but you must still design with idempotency in mind. The test often uses wording like “no duplicates in BigQuery,” “handle retries,” or “ensure consistent aggregates,” pushing you toward deduplication strategies.

Deduplication commonly relies on a stable event identifier (event_id) and a bounded time horizon. In Dataflow, you can deduplicate using per-key state with a TTL, or by writing raw events and performing periodic BigQuery dedupe (e.g., ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingest_time) with QUALIFY to keep one row per event). For BigQuery streaming inserts, use insertId (where applicable) as a de facto dedupe key; but don’t assume it solves all cases across pipeline restarts and custom sinks—design explicitly.
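
A minimal batch-dedupe sketch of that query (table and column names are hypothetical), assuming raw rows carry a stable event_id and an ingestion timestamp:

```sql
-- Hypothetical periodic dedupe over a bounded horizon: keep the
-- earliest-ingested row per event_id.
CREATE OR REPLACE TABLE curated.events_dedup AS
SELECT *
FROM raw.events
WHERE DATE(ingest_ts) >= CURRENT_DATE() - 3   -- bounded dedupe window
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingest_ts) = 1;
```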

Ordering is another frequent trap. Pub/Sub does not guarantee global ordering; ordering keys only guarantee order within a key. If the requirement is “process per customer in order,” you need ordering keys and a pipeline design that respects them (often by keying and windowing appropriately). If the requirement is “exact order of all events,” that’s usually unrealistic at scale; the correct answer is typically to redesign the requirement (event-time windows and idempotent updates) rather than attempt total ordering.

Exam Tip: If you see “exactly once,” translate it into “idempotent writes + dedupe key + replay strategy.” The exam rarely expects you to claim true end-to-end exactly-once across arbitrary systems without qualifications.

  • Common trap: Confusing “exactly-once processing” with “exactly-once delivery.” Delivery is usually at-least-once; correctness is achieved through idempotency/dedup.
  • Common trap: Forgetting late data in streaming aggregates. If you update aggregates, define how late events revise prior outputs.
Section 3.5: Data validation—quality checks, schema evolution, error handling, DLQs, reprocessing strategy

Real pipelines fail in predictable ways: malformed records, missing fields, schema drift, and downstream quota/permission issues. The exam objective here is to “handle data quality, schema drift, and late-arriving events” while keeping the system operable. A common best practice is a two-tier model: land raw immutable data (bronze) and then publish curated tables (silver/gold). This makes reprocessing and backfills straightforward when logic changes or when you need to correct bad data.

Validation strategies include schema checks (required fields, data types), range checks, referential integrity checks (when feasible), and anomaly detection on volumes. In Dataflow, invalid records are commonly diverted to a dead-letter queue (DLQ) sink (e.g., Pub/Sub topic or Cloud Storage path) with enough metadata to replay after fixing the producer or parser. In BigQuery ELT, quarantine tables are common: write rejected rows with error reasons, then remediate via SQL and reload.

Schema evolution is a frequent exam scenario: “new field appears,” “field type changes,” “producer adds nested object.” The robust pattern is to ingest flexibly (e.g., JSON with tolerant parsing, or storing raw payload + extracted fields) and evolve curated schemas deliberately. In BigQuery, adding nullable columns is generally safe; changing types is not. When type drift occurs, you may need to land as STRING in raw, then cast/validate in curated layers.
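
One way to sketch the tolerant-ingest pattern (names and field paths are hypothetical), using BigQuery's JSON type so new producer fields land without breaking the pipeline:

```sql
-- Hypothetical raw landing table: the full payload survives schema drift.
CREATE TABLE IF NOT EXISTS raw.events_json (
  ingest_ts TIMESTAMP,
  payload   JSON
)
PARTITION BY DATE(ingest_ts);

-- Curated extraction: promote fields deliberately; bad values become NULL
-- (and can be routed to a quarantine table) instead of failing the load.
SELECT
  ingest_ts,
  JSON_VALUE(payload, '$.event_id')                     AS event_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS NUMERIC) AS amount
FROM raw.events_json
WHERE DATE(ingest_ts) = CURRENT_DATE();
```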

Exam Tip: Choose answers that preserve raw data and enable replay. If an option drops invalid records without retention, it’s usually incorrect unless the prompt explicitly allows loss.

  • Common trap: Treating DLQ as “done.” The exam expects a reprocessing strategy: how records move from DLQ back into the main pipeline after fixes.
  • Common trap: Assuming schema drift is solved by “auto-detect” forever. Auto-detect can help ingestion but curated modeling still needs governance.
Section 3.6: Exam-style ingestion/processing questions—latency, cost, and operability trade-offs

Many PDE questions in this domain are disguised trade-off problems. You’re given constraints (freshness in seconds vs minutes vs hours), data volume variability (bursty streams, seasonal batch spikes), and team constraints (“small ops team,” “SQL skills,” “existing Spark code”). The correct answer is the architecture that meets the SLA with the least operational complexity and predictable cost.

For latency, identify whether the requirement is true streaming (seconds), near real-time (minutes), or batch (hours/day). A seconds-level requirement typically implies Pub/Sub + Dataflow streaming (or equivalent managed streaming processing). Minutes can be micro-batch with load jobs or scheduled transforms. Hours/day often points to Storage Transfer + BigQuery load + ELT. For cost, streaming systems have a “running cost” and can be more expensive than batch if the business doesn’t require continuous freshness. For operability, prefer managed services (Dataflow, BigQuery) over self-managed clusters unless the prompt demands Spark/Hadoop compatibility.

Debugging and performance scenarios on the exam often revolve around bottlenecks and backpressure: hot keys causing skew in streaming aggregations, excessive shuffle due to poor key choice, too-frequent small file loads, or BigQuery slots consumed by inefficient SQL. Correct responses usually include a concrete mitigation: re-keying to reduce hot spots, using combiner/aggregation strategies, batching writes, partitioning and clustering, and separating raw ingestion from heavy transformations.

Exam Tip: When asked to “improve reliability,” look for idempotency, retries with backoff, DLQs, and replay/backfill paths. When asked to “improve performance,” look for reducing shuffle (skew/hot keys), right-sizing windows, and using partitioned/clustering strategies in BigQuery.

  • Common trap: Over-optimizing for one dimension. A low-latency design that is unmonitorable or impossible to backfill is often not the best exam answer.
  • Common trap: Ignoring quotas/limits and operational overhead (cluster management, job orchestration) when a managed alternative meets requirements.
Chapter milestones
  • Build ingestion patterns for batch and streaming sources
  • Implement transformations with Dataflow and SQL-first approaches
  • Handle data quality, schema drift, and late-arriving events
  • Domain practice set: pipeline debugging and performance scenarios
Chapter quiz

1. A retailer needs near real-time dashboards in BigQuery with <60s end-to-end latency from point-of-sale events. Events arrive continuously and may be duplicated during retries. The team wants a managed solution that supports event-time processing and late data handling. What should you implement?

Correct answer: Pub/Sub -> Dataflow streaming pipeline using event-time windowing and allowed lateness, write to BigQuery with a deduplication strategy (e.g., key-based) and dead-letter handling
A is the best match for the exam domain: Pub/Sub + Dataflow streaming is the managed, right-sized pattern for sub-minute latency, and Beam’s event-time/allowed lateness addresses late-arriving events; you can also implement idempotency/dedup by event ID before writing to BigQuery. B is batch-oriented (load jobs) and typically cannot guarantee <60s freshness, and partitioning does not deduplicate or correctly handle event-time semantics by itself. C can work but is more operationally complex (cluster management, state, scaling, failure handling) and is generally disfavored on the PDE exam when a managed Dataflow solution meets requirements.

2. A partner sends CSV files to Cloud Storage once per day. The files are 500 GB and occasionally contain malformed rows. The business needs the data available in BigQuery by 6 AM, but does not require streaming. You want predictable cost and simple operations. What is the best ingestion approach?

Correct answer: Trigger a BigQuery load job from Cloud Storage into a partitioned table, configure error handling (e.g., max bad records) and capture rejected rows separately if needed
A aligns to a batch file-drop requirement: BigQuery load jobs are cost-effective and operationally simple for large daily files, and support handling malformed data through load configuration and quarantine patterns. B is the wrong fit: streaming inserts cost more, add quota/throughput considerations, and provide little benefit when the SLA is hours. C adds unnecessary operational complexity (always-on streaming pipeline) for a daily batch workload, which the exam typically penalizes when simpler managed batch ingestion meets the SLA.

3. You maintain a Dataflow streaming pipeline writing to a BigQuery table. The upstream team adds new optional fields to the JSON payload without notice, causing intermittent BigQuery insert failures. You want the pipeline to tolerate schema drift while preserving new fields for later analysis. What should you do?

Correct answer: Route records to a dead-letter path on insert failure, then land raw events in a BigQuery table with a flexible schema (e.g., store the payload in a JSON/STRING column) and promote new fields through a controlled schema update process
A is the robust, exam-aligned approach: handle schema drift explicitly with a DLQ/quarantine path, preserve raw data for backfill/replay, and update the curated schema intentionally. B is incorrect because BigQuery does not provide a general ‘auto-create columns on streaming insert’ mode; schema changes must be managed (and uncontrolled changes can break downstream consumers). C may stop failures but violates the requirement to preserve new fields and reduces data quality/traceability, which is discouraged in production-grade ingestion designs.

4. A product analytics team uses event-time windows in Dataflow to compute session metrics and writes results to BigQuery. They notice metrics are undercounted because some mobile events arrive up to 2 hours late. They want correct results without reprocessing the entire history daily. What change best addresses this?

Correct answer: Increase allowed lateness and configure triggers in the Dataflow pipeline to update windowed results when late events arrive, ensuring the BigQuery sink supports updates/merges as needed
A is the correct event-time solution: Beam’s allowed lateness and triggers are designed to incorporate late-arriving data into windowed computations with incremental updates, matching the requirement to avoid full reprocessing. B is wrong because processing-time windowing changes the definition of the metric and typically still produces incorrect business results for event-time analytics. C is a batch workaround that increases latency and can still miss very late events unless you expand the recomputation window, which trends toward reprocessing more data than needed.

5. A Dataflow batch pipeline reads from Cloud Storage and writes to BigQuery. It sometimes produces duplicate rows after job retries when workers crash mid-write. The team needs an approach that is resilient to retries and supports safe backfills. What should you recommend?

Correct answer: Write to a staging table and perform a deterministic BigQuery MERGE into the target table using a stable unique key (or hash) to ensure idempotent upserts
A is a common certification-grade pattern: stage then MERGE using a unique key makes the load idempotent across retries and supports controlled backfills. B is incorrect because streaming inserts are not a universal guarantee against duplicates; exactly-once semantics depend on the end-to-end design and may still require deduplication keys. C is operationally risky and violates reliability goals; disabling retries reduces resilience and increases manual intervention, which the exam typically treats as an anti-pattern.

Chapter 4: Store the Data (Domain: Store the data)

This domain tests whether you can pick the right storage layer, model data for reliable analytics, and enforce governance without sacrificing cost or performance. On the Professional Data Engineer exam, “storage” is rarely just “where do I put bytes?”—it’s about SLAs (latency, throughput, concurrency), access patterns (OLAP vs OLTP), retention/regulatory needs, and operational overhead. Expect scenarios where multiple stores coexist: Cloud Storage as the raw landing zone, BigQuery as the analytical warehouse, and an operational database (Bigtable/Spanner/Cloud SQL) for serving low-latency application queries.

The most common trap: choosing a store based on familiarity rather than access pattern. Another trap: believing BigQuery optimization is “optional.” The exam frequently expects you to reduce scanned bytes via partition pruning and clustering, and to manage cost via storage classes, table TTLs, reservations, and controlled sharing. This chapter maps directly to the “Store the data” domain objective: design storage layers, master BigQuery physical design, plan governance (metadata/lineage/retention/sharing), and justify decisions in exam-style prompts.

Practice note for Design storage layers for analytics and operational needs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Master BigQuery table design, partitioning, and clustering: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Plan governance: metadata, lineage, retention, and sharing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domain practice set: storage decisions and BigQuery performance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Storage options overview—Cloud Storage, BigQuery, Bigtable, Spanner, Cloud SQL trade-offs

Exam questions often start with a workload description; your job is to identify whether it is analytical (OLAP) or operational (OLTP), and then choose the storage accordingly. Cloud Storage (GCS) is the default durable object store for data lakes: cheap, virtually unlimited, and ideal for raw files (Avro/Parquet/JSON/CSV) and long-term retention. BigQuery is the managed analytical warehouse: columnar storage, massively parallel execution, and SQL-based access for BI/ML feature creation. Bigtable and Spanner are operational databases for low-latency serving; Cloud SQL is managed relational OLTP for smaller scale or traditional RDBMS constraints.

Cloud Storage fits “land and preserve” patterns (raw + curated zones) and decouples storage from compute. BigQuery fits interactive analytics, scheduled transformations, and cross-dataset joins. Bigtable fits time-series and high-write workloads with predictable row-key access (single-row/scan by key prefix). Spanner fits globally consistent relational workloads (strong consistency, horizontal scale, multi-region), while Cloud SQL fits typical transactional workloads (PostgreSQL/MySQL) with simpler scaling expectations.

Exam Tip: When a prompt says “ad-hoc analysis,” “dashboards,” “large joins,” or “petabyte-scale scan,” the safe default is BigQuery. When it says “single-digit millisecond reads,” “key-value,” “high write rate,” or “serving to an application,” think Bigtable/Spanner/Cloud SQL depending on relational needs and consistency requirements.

  • GCS: cheapest at-rest, great for raw history and batch ingestion; query requires external engines or BigQuery external tables.
  • BigQuery: OLAP, SQL, governance integration; pay for storage + queries/slots.
  • Bigtable: wide-column, low-latency, scale-out; requires careful row-key design.
  • Spanner: relational + horizontal scale + strong consistency; higher cost, schema-driven.
  • Cloud SQL: relational OLTP, easier lift-and-shift; limited horizontal scale.

Common trap: selecting Bigtable for “big data analytics.” Bigtable is for serving/operational access patterns, not large ad-hoc joins. Another trap: selecting Cloud SQL for massive ingestion volumes; it can become a bottleneck without sharding patterns, which the exam does not typically expect you to design.

Section 4.2: BigQuery internals—datasets, tables, views, materialized views, and storage billing basics

BigQuery organization matters for security and billing. Projects contain datasets; datasets contain tables and views. Datasets are the primary IAM boundary in many exam scenarios (though table-level access controls also exist). Tables can be native (managed storage) or external (data stored in GCS). Views are logical (stored SQL), while materialized views precompute results to accelerate repeated queries—useful when prompts mention repeated aggregation over large fact tables.
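
A minimal materialized-view sketch (dataset and column names are hypothetical) for the "repeated aggregation over a large fact table" scenario:

```sql
-- Hypothetical precomputed summary: BigQuery maintains it incrementally
-- and can transparently reroute matching aggregate queries to it.
CREATE MATERIALIZED VIEW marts.daily_revenue_mv AS
SELECT event_date, region, SUM(amount) AS revenue
FROM marts.fact_sales
GROUP BY event_date, region;
```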

Understand storage billing at a high level: BigQuery charges for storage (logical bytes stored) and compute (on-demand bytes processed or capacity/slots via reservations). Partitioned tables can reduce query cost by scanning fewer partitions; clustered tables reduce cost by minimizing the blocks read within partitions. External tables can reduce data duplication but may introduce performance and feature limitations (and can still incur read costs depending on the engine and format). Columnar formats (Parquet/Avro) typically perform better than CSV for external querying.

Exam Tip: If a question emphasizes “cost control,” look for levers like partitioning/clustering, table TTL, limiting who can run queries (IAM), and using materialized views for repeated summaries. If it emphasizes “simplify management,” prefer native BigQuery tables over external unless the prompt explicitly requires keeping data in GCS.

Common traps include confusing views with materialized views (views do not store results), and assuming dataset location is flexible after the fact. Dataset location (US/EU/region) is a design-time choice affecting compliance and cross-region query limitations. Also watch for prompts about “sharing with another team/company”: you may need authorized views or dataset-level sharing rather than exporting data.

Section 4.3: Performance design—partitioning, clustering, pruning, slots/reservations, and query optimization

This section is heavily tested because it connects directly to cost and SLAs. Partitioning splits a table into segments (often by ingestion time or a timestamp/date column). Clustering sorts data within partitions by up to four columns to improve selective filters and aggregations. The exam expects you to choose partition keys that align with common filters (e.g., event_date) and to justify clustering based on high-cardinality columns used in WHERE/JOIN/GROUP BY (e.g., customer_id, region, device_type).

Partition pruning is the mechanism that avoids scanning irrelevant partitions. If users filter on a timestamp but the table is partitioned on ingestion_time, pruning may not help and costs rise—classic exam trap. Clustering helps when predicates are selective; it won’t save you if queries are always full scans with no filters. Another trap is over-partitioning (too granular partitions) or partitioning on a field rarely used in filters.
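
A physical-design sketch tying these levers together (all names are hypothetical); requiring a partition filter protects against accidental full scans:

```sql
-- Hypothetical fact table: partition on the column analysts filter by,
-- cluster on common selective predicates, and force pruning-friendly SQL.
CREATE TABLE analytics.events (
  event_id    STRING,
  event_date  DATE,
  customer_id STRING,
  device_type STRING,
  amount      NUMERIC
)
PARTITION BY event_date
CLUSTER BY customer_id, device_type
OPTIONS (require_partition_filter = TRUE);
```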

Compute management appears as on-demand vs reservations. On-demand bills per bytes processed; reservations allocate slots for predictable workloads and can enforce capacity isolation across teams. If a prompt says “steady workload with strict concurrency SLAs,” reservations are often the best answer. If it says “sporadic, unpredictable,” on-demand may be appropriate.

Exam Tip: In multi-tenant org scenarios, identify whether the problem is query contention (solve with reservations, assignments, and workload management) or bytes scanned (solve with partitioning, clustering, and query rewrites).

  • Optimize queries by selecting only needed columns (columnar advantage), pushing filters early, and avoiding cross joins unless intended.
  • Use approximate aggregation functions when business allows (often a cost/performance hint in prompts).
  • Prefer partition filters that match the partitioning column to ensure pruning.

BigQuery also caches results; however, don’t rely on caching as an architecture answer unless the prompt explicitly mentions repeated identical queries and permissive freshness requirements.

Section 4.4: Data lifecycle—retention policies, time travel, snapshots, backups, and archival strategies

Lifecycle design is governance plus cost management. BigQuery supports table expiration (TTL) at dataset or table level—ideal for ephemeral staging tables or regulated retention windows. Prompts about “keep raw data for 7 years but only query the last 90 days” often imply a two-tier strategy: recent data in BigQuery optimized for performance, and older/rarely accessed data archived to GCS (Nearline/Coldline/Archive) or kept in BigQuery but with careful cost justification.

BigQuery time travel lets you query previous versions of data within a limited window (configurable, up to 7 days). This is tested as a recovery mechanism for accidental deletes/updates. Table snapshots provide a point-in-time, read-only copy that can be used for recovery or reproducibility; they can be part of “backup” answers inside BigQuery. For long-term backups and disaster recovery, exporting to GCS is a common pattern, especially when prompts mention cross-project portability or long retention at low cost.
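
A recovery sketch under assumed table names, combining time travel for short-window reads with a snapshot for longer-lived reproducibility:

```sql
-- Hypothetical: read the table as it was 24 hours ago (time travel)...
SELECT *
FROM analytics.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);

-- ...and preserve that point-in-time state as a read-only snapshot.
CREATE SNAPSHOT TABLE analytics.orders_snap_20240601
CLONE analytics.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
```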

Exam Tip: If the incident is “someone deleted rows yesterday,” time travel or snapshot is likely the simplest correct answer. If the requirement is “retain for years at minimal cost,” archival to GCS storage classes is typically stronger than keeping everything hot in BigQuery.

Common trap: treating BigQuery like an OLTP database with frequent row-level updates and expecting “traditional backups.” BigQuery supports DML, but heavy mutation workloads can be costly and may suggest redesign (append-only + periodic compaction) or using an operational store. Another trap is ignoring legal holds/regulatory retention: TTL is useful, but it must match compliance requirements; sometimes you need to disable expiration and enforce retention via policy and separate archival.

Section 4.5: Governance and sharing—Data Catalog/Dataplex concepts, IAM on datasets, authorized views

Governance is not just permissions; it includes metadata discovery, lineage, and controlled sharing. Dataplex (and the broader Google Cloud governance ecosystem) helps organize data into lakes/zones, apply policies, and centralize discovery. Data Catalog concepts show up as “tagging,” “business metadata,” and “searching datasets.” In exam scenarios, the right move is often to combine: (1) clear dataset boundaries, (2) IAM least privilege, and (3) sharing mechanisms that prevent data leakage.

IAM in BigQuery is commonly applied at the project and dataset level, with roles such as BigQuery Data Viewer, Job User, and Data Owner. A frequent trap: granting users permission to run queries requires both access to data and the ability to create query jobs (often via BigQuery Job User at the project level). If a prompt says “they can see tables but queries fail,” you should suspect missing job permissions.

Authorized views are a key test concept: they let you share a filtered/aggregated view without granting direct access to underlying tables. This is the correct answer when prompts mention “share only a subset of columns/rows,” “PII masking,” or “external partner access” while keeping base tables protected. Row-level and column-level security can also appear, but authorized views are the classic mechanism for controlled sharing across groups.
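
A minimal authorized-view sketch (names are hypothetical). The view itself is plain SQL; the "authorized" part is granting the view access to the source dataset (via the console, bq, or API) so consumers never touch the base table:

```sql
-- Hypothetical shared view: expose only non-sensitive columns and rows.
CREATE OR REPLACE VIEW shared.customer_orders_v AS
SELECT customer_id, order_date, order_total   -- no PII columns exposed
FROM analytics.orders
WHERE region = 'EU';
```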

Exam Tip: When the requirement includes “consumers should not access raw tables,” default to authorized views (or policy-based controls) rather than copying data into a separate dataset—copying often increases risk and cost.

Lineage-related prompts often expect you to identify that governance requires tracing sources and transformations; solutions include using consistent naming conventions, tagging, and integrating orchestration metadata. Don’t answer with “document it in a wiki” when the prompt hints at automated discovery and compliance reporting.

Section 4.6: Exam-style storage questions—choose the right store and justify partition/cluster decisions

On the exam, you will rarely be asked “What is BigQuery?” Instead, you’ll see a scenario: data arrives (streaming or batch), must be stored for analytics and sometimes served to apps, must meet retention rules, and must be shareable. Your scoring hinges on matching requirements to the store and then defending design choices (partition/cluster, retention, and governance).

For store selection, use a quick decision framework: (1) Is it OLAP or OLTP? (2) Do we need SQL joins across large datasets? (3) What are latency and concurrency requirements? (4) What is the retention window and access frequency? (5) Who needs access and at what granularity? If the workload is BI/ML feature building with large scans, BigQuery is the anchor. If it’s raw immutable files, GCS is the landing/archival layer. If it’s low-latency lookups by key, Bigtable or Spanner/Cloud SQL are the candidates depending on relational needs.

For BigQuery physical design, justify partitioning by the most common time-based filter (event_date is usually better than ingestion_time if analysts filter by event time). Add clustering when queries frequently filter/join on specific dimensions (customer_id, product_id, region) and when it will improve pruning within partitions. Watch for the trap where a prompt says “queries filter by user_id and date”—that often implies partition by date and cluster by user_id, not the other way around.

Exam Tip: When asked to “reduce cost,” explicitly mention scanned bytes: partition pruning + selecting fewer columns. When asked to “meet predictable SLAs,” mention reservations/slots and workload isolation. When asked to “share safely,” mention authorized views and least-privilege IAM.

Finally, be careful with “one-store-fits-all” answers. A strong PDE response is layered: GCS for raw/archival, BigQuery for curated analytics, and an operational store for serving when needed. The exam rewards architectures that separate concerns—durable landing, governed warehouse, and fit-for-purpose serving—while keeping governance and lifecycle policies explicit.

Chapter milestones
  • Design storage layers for analytics and operational needs
  • Master BigQuery table design, partitioning, and clustering
  • Plan governance: metadata, lineage, retention, and sharing
  • Domain practice set: storage decisions and BigQuery performance
Chapter quiz

1. A company ingests clickstream events (2 TB/day) into Cloud Storage and runs hourly and daily analytics in BigQuery. Most queries filter on event_date and often include user_id and event_type predicates. The team reports unpredictable query cost due to large scans. You need to improve performance and reduce bytes scanned with minimal operational overhead. What should you do?

Correct answer: Create a BigQuery table partitioned by event_date and clustered by user_id and event_type
Partitioning by event_date enables partition pruning so queries that filter by date scan fewer bytes, which directly reduces cost and improves performance. Clustering by user_id and event_type improves pruning within partitions for common predicates and supports better locality. Bigtable is optimized for low-latency key/value access (operational serving), not large-scale OLAP scans; using it as the primary store for analytics adds complexity and doesn’t inherently reduce BigQuery scanned bytes. Materialized views can help for specific repeated query patterns but do not replace partitioning/clustering and won’t prevent large scans for ad hoc queries on a large unpartitioned base table.

2. A fintech company must retain transaction records for 7 years for compliance. Analysts should only query the most recent 90 days frequently; older data is rarely accessed but must remain available for audits. You want to minimize storage cost and operational work while keeping data queryable in BigQuery. What is the best approach?

Correct answer: Use a BigQuery partitioned table for the full 7 years and apply table/partition labels plus IAM controls; optionally use long-term storage pricing and avoid deleting old partitions
A partitioned BigQuery table supports efficient queries on the last 90 days via partition pruning while keeping older partitions available for audits. BigQuery long-term storage pricing can reduce cost automatically for unchanged partitions, minimizing operational overhead. Setting a partition expiration of 90 days would violate the 7-year retention requirement. Offloading all older data only to Cloud Storage Archive reduces warehouse storage cost but increases operational complexity (you’d need rehydration/loading for audits) and makes the data not immediately queryable in BigQuery, which is typically undesirable for compliance audit workflows.

3. A SaaS platform stores operational customer profiles that require single-row reads/writes with p95 latency under 10 ms. The same data is also used for daily analytics and BI dashboards. Which storage architecture best matches the access patterns and exam best practices?

Correct answer: Use Cloud Spanner (or Cloud SQL) for operational profiles and replicate/stream changes into BigQuery for analytics
Operational workloads with strict low-latency, high-concurrency point lookups are better served by an OLTP database such as Cloud Spanner (global consistency/scale) or Cloud SQL (regional OLTP), while BigQuery is optimized for OLAP analytics. Replicating/streaming into BigQuery provides analytical performance without impacting operational SLAs. BigQuery is not designed to be an OLTP system of record for sub-10 ms point reads/writes at scale. Cloud Storage is object storage and is not appropriate for low-latency transactional reads/writes of individual records.

4. A data platform team needs governance for datasets in BigQuery: business users must be able to discover tables, understand column meaning, and trace where curated tables came from. They also want to share a curated dataset with an external partner without copying data. What combination of features best meets these requirements?

Correct answer: Use Data Catalog/Dataplex for metadata discovery and lineage, use BigQuery policy tags for column-level security, and share via BigQuery authorized views or Analytics Hub
Data Catalog/Dataplex supports metadata management (business glossary, technical metadata) and lineage, improving discoverability and traceability. Policy tags support fine-grained governance such as column-level security. Authorized views or Analytics Hub enable controlled sharing without copying underlying data and without granting excessive permissions. Labels/expiration help with organization and retention but do not provide rich discovery/lineage or secure sharing patterns. Cloud Logging is not a lineage system, and granting BigQuery Admin is overly permissive and violates least-privilege governance expectations.

5. An analytics team complains that BigQuery queries sometimes scan far more data than expected even though tables are partitioned by ingestion time. Many queries filter by a logical event_timestamp field (when the event occurred) rather than load time, and late-arriving data is common. You need to reduce scanned bytes and keep query behavior aligned with how analysts filter data. What should you do?

Correct answer: Recreate the table partitioned by event_timestamp (time-unit column partitioning) and update ingest pipelines to populate it correctly; optionally cluster on common predicate columns
Partitioning by the column analysts filter on (event_timestamp) enables effective partition pruning and keeps semantics consistent even with late-arriving data, as long as ingestion populates the partitioning column correctly. Clustering can further reduce scans within partitions. Relying on caching is not a cost-control strategy and does not address scanning behavior for new or varied queries; SELECT * often increases scanned bytes. Managing separate tables per day increases operational overhead and complicates governance and querying; proper partitioning is the intended BigQuery design pattern.

Chapter 5: Prepare and Use Data for Analysis + Maintain and Automate Workloads

This chapter maps to two high-weight domains on the Professional Data Engineer exam: (1) preparing and using data for analysis and (2) maintaining and automating data workloads. The exam rarely asks you to write code; instead, it tests whether you can choose the right BigQuery modeling patterns, SQL optimization levers, semantic/BI connectivity approaches, and production operations practices (monitoring, orchestration, testing, and CI/CD). Expect scenario questions where multiple answers are “technically possible” but only one meets stated constraints like freshness SLA, governance, cost, and operational burden.

As you read, keep a mental checklist: Where is the curated layer (gold)? How are transforms versioned and reproducible? What’s the orchestration control plane? How will failures be detected and remediated? And for ML: does the scenario require in-warehouse training (BigQuery ML) or a managed ML platform with custom code and endpoints (Vertex AI)?

Practice note for Enable analytics: semantic layers, BI patterns, and SQL optimization: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Operationalize ML pipelines with BigQuery ML and Vertex AI patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Automate and orchestrate workloads with CI/CD and scheduling: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Domain practice set: monitoring, incident response, and ML/analytics scenarios: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Analytics readiness—curated datasets, star schemas, data marts, and BI connectivity patterns

Exam scenarios often begin with “analysts need a dashboard” or “finance needs a trusted KPI.” The correct answer usually hinges on designing a curated analytics layer in BigQuery that is stable, well-modeled, and cost-efficient. A common pattern is raw/landing (bronze) → cleaned/conformed (silver) → curated marts (gold). The exam wants you to separate ingestion concerns from analytics consumption, so that BI tools query predictable tables rather than volatile raw feeds.

In BigQuery, star schemas (fact table with surrounding dimensions) remain a top choice for BI. They reduce join ambiguity, standardize business definitions, and make metrics consistent across dashboards. Data marts can be implemented as separate datasets (e.g., marts_sales, marts_finance) with controlled access and shared dimensions. Consider using partitioned facts (by event_date) and clustered keys (e.g., customer_id, product_id) to improve scan efficiency for common filters.

Semantic layers appear frequently in modern BI patterns. In GCP contexts, this may mean Looker’s LookML model, authorized views, or curated BI views that define KPIs and hide sensitive columns. The exam tests whether you can provide governed access without copying data. Authorized views let you expose derived tables/views while restricting base-table access, supporting least privilege and reducing data duplication.

Exam Tip: When the scenario emphasizes “single source of truth,” “consistent metrics,” or “reduce ad-hoc query cost,” favor curated marts + semantic definitions (views/LookML) over letting every team query raw tables.

BI connectivity patterns: For Looker/Looker Studio/third-party tools, BigQuery is typically queried directly using service accounts and IAM. Materialized views may be appropriate when repeated aggregate queries cause high cost or latency, but remember materialized views have constraints (determinism, supported functions). If the question mentions “near real-time dashboard,” consider whether BI should read from a partitioned table with frequent micro-batches, or use BI Engine acceleration (where applicable) rather than building a separate serving database.

Common trap: Creating many duplicated summary tables for each dashboard “to improve performance.” This often increases governance burden and introduces metric drift. Prefer a small number of well-owned marts and controlled semantic definitions; use partitions/clusters/materialized views selectively for repeatable patterns.

Section 5.2: Feature engineering and preparation—SQL transforms, UDFs, and data versioning concepts

Feature engineering is tested as part of “prepare and use data for analysis,” especially where BigQuery ML or downstream Vertex AI training is involved. The exam looks for reproducible transformations: you should be able to point to SQL that deterministically produces training features from curated sources, with the ability to rerun historically (backfill) and compare versions.

In BigQuery, common feature prep patterns include: (1) window functions for recency/frequency metrics, (2) safe casting and missing value handling, (3) categorical encoding via one-hot-like pivots or hashing, and (4) time-based joins that avoid leakage (e.g., only using events prior to the label timestamp). When the prompt mentions “data leakage,” the correct solution usually involves point-in-time correctness: join features with constraints like event_time <= label_time and avoid future aggregates.
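
A leakage-safe feature query sketch (tables and columns are hypothetical), enforcing point-in-time correctness in the join condition as described above:

```sql
-- Hypothetical training features: aggregate only events that occurred
-- strictly before each label's timestamp (no future information).
SELECT
  l.customer_id,
  l.label_ts,
  COUNT(e.event_id) AS purchases_90d,
  MAX(e.event_ts)   AS last_purchase_ts
FROM ml.labels AS l
LEFT JOIN raw.purchase_events AS e
  ON  e.customer_id = l.customer_id
  AND e.event_ts <  l.label_ts                              -- point-in-time guard
  AND e.event_ts >= TIMESTAMP_SUB(l.label_ts, INTERVAL 90 DAY)
GROUP BY 1, 2;
```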

UDFs can standardize transformations (string cleaning, parsing, bucketing). Use SQL UDFs for portability and governance; use JavaScript UDFs only when SQL cannot express the logic cleanly, because JS UDFs can be slower and harder to secure. Persistent UDFs in a shared dataset help enforce consistent feature definitions across teams.
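
A persistent SQL UDF sketch (function and dataset names are hypothetical) so every team buckets a value identically:

```sql
-- Hypothetical shared transformation, defined once and reused everywhere.
CREATE OR REPLACE FUNCTION udfs.tenure_bucket(days INT64)
RETURNS STRING AS (
  CASE
    WHEN days < 30  THEN 'new'
    WHEN days < 365 THEN 'active'
    ELSE 'veteran'
  END
);
```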

Data versioning concepts show up as “reproducible training” and “auditability.” In BigQuery, you can approximate dataset versioning by stamping outputs with a run_id, snapshot_date, or model_version and writing to partitioned tables (e.g., partition by snapshot_date). For immutable training sets, write once per version and avoid in-place updates. If asked about rerunning training from an exact state, consider using table snapshots (copy/snapshot tables) or exporting a frozen training set to Cloud Storage with a versioned path.

Exam Tip: If the scenario includes “explainability,” “audit,” or “regulatory,” prioritize immutable feature tables and clear lineage (run_id, snapshot partitions) over overwriting a single feature table every day.

Common trap: Building features directly in a dashboard query or a training notebook without productionizing the SQL. The exam expects you to move feature computation into scheduled, governed transformations (views, scheduled queries, Dataform/dbt-style pipelines, or orchestrated jobs) and to manage schema evolution carefully (e.g., adding nullable columns, backfilling partitions).

Section 5.3: ML pipelines overview—BigQuery ML vs Vertex AI training/serving and when to use each

The PDE exam frequently asks you to choose between BigQuery ML (BQML) and Vertex AI. The best answer depends on where the data lives, how custom the model needs to be, and how the model will be served and monitored. BQML is ideal when data is already in BigQuery and you want SQL-native training, evaluation, and batch prediction with minimal operational overhead. It shines for classic supervised learning, forecasting, and anomaly detection supported by BQML, especially when the constraint is “small team” or “fast time to value.”

Vertex AI is the right choice when you need custom training code (TensorFlow/PyTorch/XGBoost beyond what BQML supports), feature stores, hyperparameter tuning at scale, managed endpoints for online prediction, canary deployments, or advanced MLOps (model registry, monitoring, explainability tooling). If the scenario requires low-latency online serving, you typically land on Vertex AI endpoints (possibly with features sourced from BigQuery or a feature store) rather than relying on BigQuery batch predictions alone.

Pipeline thinking: (1) feature prep in BigQuery, (2) training (BQML or Vertex), (3) evaluation, (4) registration/versioning, (5) batch or online inference, (6) monitoring for drift and performance, and (7) retraining triggers. For BQML, steps 2–4 can be inside BigQuery using CREATE MODEL, ML.EVALUATE, ML.EXPLAIN_PREDICT, and model version control via naming/versioning conventions. For Vertex AI, you’ll often orchestrate data extraction, training jobs, and deployment.
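
A BQML flow sketch under assumed dataset and model names, showing how steps 2–4 stay inside BigQuery:

```sql
-- Hypothetical train / evaluate / batch-score cycle, entirely in SQL.
CREATE OR REPLACE MODEL ml.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM ml.churn_features;

-- Evaluate the trained model on its held-out data.
SELECT * FROM ML.EVALUATE(MODEL ml.churn_model);

-- Batch prediction over a scoring input table.
SELECT * FROM ML.PREDICT(MODEL ml.churn_model,
                         (SELECT * FROM ml.churn_scoring_input));
```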

Exam Tip: If the question emphasizes “SQL-only team,” “minimal ops,” “training data in BigQuery,” or “batch scoring,” pick BigQuery ML. If it emphasizes “custom model,” “online predictions,” “A/B testing deployments,” or “GPU/TPU training,” pick Vertex AI.

Common trap: Assuming Vertex AI is always the “enterprise” answer. The exam rewards right-sizing: BQML can be the most correct solution when requirements are met and simplicity/cost are priorities. Conversely, don’t force BQML when the scenario clearly needs custom architectures or real-time endpoints.

Section 5.4: Orchestration—Cloud Composer (Airflow), Workflows, Scheduler, and dependency management

Orchestration is a core “maintain and automate” skill: the exam expects you to know which tool to use and why. Cloud Composer (managed Airflow) is the heavyweight option for complex DAGs with many dependencies, retries, backfills, SLAs, and cross-system operators (BigQuery, Dataproc, Dataflow, Cloud Storage, etc.). Use Composer when you need robust dependency management, rich scheduling, and an established Airflow ecosystem.

Workflows is a serverless orchestration option suited for service-to-service control flow: calling APIs, chaining Cloud Functions/Run, invoking BigQuery jobs, and handling branching logic with minimal infrastructure. It is often the best answer when the workflow is primarily API orchestration and you want lower ops overhead than Composer.

Cloud Scheduler is a cron trigger, not a dependency engine. It’s appropriate to kick off a single job on a schedule (e.g., run a Workflows execution, publish a Pub/Sub message, hit an HTTP endpoint) but it won’t manage complex upstream/downstream relationships by itself. On the exam, if a prompt includes “dependencies across tasks” or “dynamic fan-out,” Scheduler alone is rarely sufficient.

Dependency management concepts include idempotency (safe retries), watermarking (track last processed time), and late data handling. For BigQuery-centric ELT, “scheduled queries” can be tempting, but they become brittle when you need multi-step dependencies and coordinated backfills. The exam often prefers a dedicated orchestrator when there are multiple steps and failure handling needs to be centralized.

Exam Tip: When you see “DAG,” “backfill,” “task-level retries,” or “cross-team standardized orchestration,” Composer is the likely target. When you see “simple serverless steps” and “call a few APIs,” Workflows is often more cost-effective and easier to operate.

Common trap: Picking Composer for a trivial single-step schedule (overkill) or picking Scheduler for a multi-step pipeline with dependencies (insufficient). Match the orchestration tool to the operational complexity and failure modes described.

Section 5.5: Operations—monitoring/alerting, logs/metrics, pipeline SLOs, cost controls, and quota management

This domain tests whether you can keep data products reliable. In practice, reliability is defined by SLOs such as freshness (data available by 8:00 AM), completeness (no missing partitions), correctness (reconciled counts), and latency (stream-to-table within N minutes). The exam expects you to connect these SLOs to monitoring signals: Cloud Monitoring metrics, BigQuery job metadata, Dataflow/Dataproc health, and alerting policies.

For BigQuery, operational visibility often comes from INFORMATION_SCHEMA views and Cloud Logging audit logs. You can detect failures by scanning job histories (failed queries, slot contention), monitoring load job errors, and validating partition arrival. For pipeline components (Dataflow, Pub/Sub), use built-in metrics (backlog, system lag, worker errors) and log-based alerts for repeated failures.
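
A job-health query sketch (the region qualifier and lookback window are assumptions) of the kind you might wrap in a scheduled check or log-based alert:

```sql
-- Hypothetical failure scan: surface recent failed jobs and their errors.
SELECT
  creation_time,
  user_email,
  job_id,
  error_result.message AS error_message,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND error_result IS NOT NULL
ORDER BY creation_time DESC;
```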

Cost controls are a frequent exam angle. BigQuery cost drivers include bytes scanned and slot usage (depending on pricing model). Practical controls: enforce partition filters (require partition filter), cluster on commonly filtered columns, use approximate aggregates when appropriate, set per-user and per-project query quotas (custom cost controls), add billing budgets and alerts, and restrict who can run large ad-hoc queries. Consider reservations/slots for predictable workloads and to isolate critical jobs from ad-hoc contention.
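
One concrete, low-effort control is capping bytes billed per query. A sketch, assuming a roughly 1 GB ceiling and a placeholder table; pick a threshold that fits your workloads.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT event_date, COUNT(*) AS events
FROM `myproject.analytics.events`   -- placeholder table
WHERE event_date = '2024-01-01'     -- partition filter keeps scanned bytes low
GROUP BY event_date
"""

# Optional: estimate cost first with a dry run (no bytes are billed).
estimate = client.query(sql, job_config=bigquery.QueryJobConfig(
    dry_run=True, use_query_cache=False))
print(f"Would process {estimate.total_bytes_processed} bytes")

# Hard cap: the job fails instead of billing more than ~1 GB (assumed threshold).
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
rows = client.query(sql, job_config=job_config).result()
```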

Quota management: scenarios may mention “quota exceeded” or “too many concurrent jobs.” The right answer typically includes smoothing load (batching, scheduling), using reservations, reducing concurrency, or redesigning to fewer, larger jobs. For streaming inserts into BigQuery, the exam sometimes expects you to prefer the Storage Write API (e.g., via Dataflow) for higher throughput and better reliability than the legacy insertAll streaming pattern.

Exam Tip: If the scenario includes “unexpected BigQuery bill,” look for answers that reduce bytes scanned (partitioning/clustering, materialized views, limiting columns, pruning) and add governance (budgets, quotas, and controlled access to raw tables).

Common trap: Treating monitoring as “check dashboards.” The exam wants actionable alerting tied to SLO breaches (freshness, failures, lag) and clear ownership/runbooks, not just metric collection.

Section 5.6: Testing and automation—data unit tests, backfills, CI/CD, and rollback strategies in exam scenarios

Automation separates a prototype from a production pipeline, and the exam tests whether your solution is safe to change. Data testing includes schema checks (expected columns/types), constraint checks (uniqueness, non-null, referential integrity), and business rule validations (counts within tolerance, reconciliations). Implement these as SQL assertions in orchestration, or use a transformation framework (e.g., Dataform/dbt concepts) to run tests as part of the build. The key exam idea: tests must run automatically and fail the pipeline early when quality breaks.
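
A minimal sketch of SQL assertions driven from Python, assuming hypothetical table names: each check must return zero rows, and any hit fails the run before bad data is promoted.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Each assertion is a query that must return zero rows to pass (placeholder tables).
ASSERTIONS = {
    "order_id_not_null": """
        SELECT order_id FROM `myproject.curated.orders`
        WHERE order_id IS NULL LIMIT 10
    """,
    "order_id_unique": """
        SELECT order_id FROM `myproject.curated.orders`
        GROUP BY order_id HAVING COUNT(*) > 1 LIMIT 10
    """,
}

def run_assertions() -> None:
    failed = [name for name, sql in ASSERTIONS.items()
              if client.query(sql).result().total_rows > 0]
    if failed:
        # Fail fast so the orchestrator stops downstream promotion.
        raise RuntimeError(f"Data quality checks failed: {failed}")

run_assertions()
```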

Backfills: you should design pipelines to reprocess historical partitions without rewriting the entire dataset. Partitioned tables enable targeted backfills (rerun only affected dates). Your orchestration should support parameterized runs (start_date/end_date) and idempotent writes (write to a staging table then swap, or write to partitions deterministically). If asked how to correct a bad transform, the best practice is often: (1) fix code, (2) rerun affected partitions, (3) validate with tests, and (4) promote to curated tables.
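
Here is one way to sketch a targeted, idempotent backfill for a single date partition, assuming a hypothetical date-partitioned `curated.orders` table: delete-then-insert inside a transaction means rerunning the same date is always safe.

```python
from google.cloud import bigquery

client = bigquery.Client()

def backfill_date(ds: str) -> None:
    """Recompute exactly one date partition; rerunning the same ds is safe (idempotent)."""
    sql = """
    BEGIN TRANSACTION;

    DELETE FROM `myproject.curated.orders` WHERE event_date = @ds;

    INSERT INTO `myproject.curated.orders`
    SELECT * FROM `myproject.staging.orders_transformed`  -- placeholder staging table
    WHERE event_date = @ds;

    COMMIT TRANSACTION;
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("ds", "DATE", ds)]
    )
    client.query(sql, job_config=job_config).result()

# Parameterized run over the affected range only.
for ds in ("2024-01-14", "2024-01-15"):
    backfill_date(ds)
```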

CI/CD: Expect scenarios about “promote changes safely.” Common patterns include storing SQL/pipeline code in Git, running linting/unit tests in CI, deploying to dev/test datasets first, then promoting to prod with approvals. For BigQuery objects, use infrastructure-as-code or deployment tools that manage views, UDFs, and scheduled jobs consistently across environments. Automation should also handle secrets and IAM through service accounts scoped with least privilege.

Rollback strategies: since data changes can be expensive to undo, favor reversible deployments. Examples: create new versioned views, blue/green datasets for marts, or write outputs to new tables and switch a view pointer. For critical pipelines, keep prior partitions or snapshots for a retention window so you can restore quickly after an incident.
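
A sketch of the “switch a view pointer” pattern with hypothetical versioned tables: consumers always query the stable view, and promotion or rollback is a single atomic CREATE OR REPLACE VIEW.

```python
from google.cloud import bigquery

client = bigquery.Client()

def point_view_at(version: str) -> None:
    """Atomically repoint the stable view at a versioned output table (placeholder names)."""
    client.query(f"""
        CREATE OR REPLACE VIEW `myproject.marts.orders` AS
        SELECT * FROM `myproject.marts.orders_v{version}`
    """).result()

point_view_at("2024_06_01")    # promote the new build
# point_view_at("2024_05_01")  # rollback: repoint at the previous version
```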

Exam Tip: When you see “pipeline broke after a change” or “need zero-downtime schema update,” look for answers that use staging + atomic cutover (views), versioned artifacts, and automated validation—rather than manual edits in the console.

Common trap: Confusing “unit tests” with manual spot checks. On the exam, the more correct answer is the one that bakes tests into the automated workflow and supports repeatable backfills and controlled releases.

Chapter milestones
  • Enable analytics: semantic layers, BI patterns, and SQL optimization
  • Operationalize ML pipelines with BigQuery ML and Vertex AI patterns
  • Automate and orchestrate workloads with CI/CD and scheduling
  • Domain practice set: monitoring, incident response, and ML/analytics scenarios
Chapter quiz

1. A retail company has a BigQuery raw dataset ingested from multiple sources and wants to enable consistent KPI definitions (e.g., revenue, active customers) across Looker and other BI tools. They also need row-level security by region and want to avoid duplicating logic in every dashboard. What should you implement to best meet these requirements with low operational overhead?

Correct answer: Create a curated (gold) BigQuery dataset with standardized views (or authorized views) that encapsulate KPI logic and enforce access controls, and connect BI tools to this semantic layer
A BigQuery-centered semantic layer using curated views/tables (gold layer) centralizes business logic and supports governance patterns like authorized views and row-level security, which is aligned with the 'prepare and use data for analysis' domain. Option B is wrong because documentation does not enforce consistent definitions or security and leads to duplicated logic and drift. Option C is wrong because exporting introduces latency and additional operational burden, and it weakens centralized governance and access control in BigQuery.

2. A media company runs interactive analytics on a 20 TB partitioned BigQuery table. A frequent query filters by event_date but still scans most partitions and is expensive. You need to reduce cost and improve performance without changing the business logic. What is the best next step?

Correct answer: Ensure the query includes a partition filter on the partitioning column and consider adding clustering on commonly filtered/joined columns to improve pruning and reduce bytes scanned
Cost in BigQuery is primarily driven by bytes scanned; partition pruning via correct partition filters and clustering for data locality are core SQL optimization levers tested in the exam. Option B may improve runtime but does not materially reduce bytes scanned/cost and is not the best first move. Option C is wrong because SELECT * often increases scanned bytes and caching is not a reliable cost-control mechanism for varying ad-hoc queries.

3. A company has a churn prediction model. The training data lives in BigQuery and the team currently trains with BigQuery ML. They now need custom feature transformations in Python, GPU-accelerated training, and an online prediction endpoint with canary deployments. Which approach best meets these requirements?

Correct answer: Use Vertex AI for training and deployment, with BigQuery as the feature source (e.g., export or read via BigQuery connectors), and operationalize the pipeline with managed orchestration
The scenario requires custom code, GPUs, and managed online serving with deployment strategies (canary), which aligns with Vertex AI patterns rather than in-warehouse BigQuery ML. Option A is wrong because BigQuery ML is best for SQL-based modeling and batch-style scoring; it is not designed for GPU training and advanced endpoint rollout controls. Option C is wrong due to poor reproducibility, manual operations, and lack of a managed serving/control plane.

4. A data engineering team wants to productionize BigQuery transformations with version control, automated tests, and repeatable deployments across dev/test/prod. They also need to schedule daily runs and be able to roll back changes quickly. What is the best approach?

Correct answer: Use a CI/CD pipeline (e.g., Cloud Build) to deploy versioned SQL (and related artifacts) and use a workflow orchestrator (e.g., Cloud Composer/Workflows) or Scheduler-triggered jobs to run pipelines with environment-specific configuration
The exam expects CI/CD for reproducibility and controlled promotion, plus an orchestration/scheduling control plane to run workloads reliably. Option B is wrong because manual UI changes are not versioned, are hard to test, and rollback is fragile. Option C is wrong because VM-based cron increases operational burden (patching, secrets, availability) and weakens governance and repeatability compared to managed CI/CD and orchestration.

5. A financial services company has a BigQuery pipeline that must meet a 30-minute freshness SLA for dashboards. An upstream change occasionally causes a scheduled query to fail, leaving stale data with no alert until users complain. You need to improve detection and response while minimizing manual effort. What should you do?

Correct answer: Add monitoring and alerting on job failures and data freshness (e.g., last successful load timestamp), and implement automated remediation steps (retry/backoff) in the orchestrator runbook
Monitoring/incident response is a core part of 'maintain and automate data workloads': alert on failed jobs and SLA indicators (freshness/latency) and automate retries or escalation via orchestration. Option B is wrong because higher frequency does not address root cause or ensure alerting; it can increase cost and still leave silent failures. Option C is wrong because it increases manual toil, violates SLAs, and lacks reliable detection and governance.

Chapter 6: Full Mock Exam and Final Review

This chapter is your capstone: two full mock-exam passes, a structured weak-spot analysis, and an exam-day execution plan. The Google Professional Data Engineer exam rewards applied judgment—choosing services and patterns that satisfy SLAs, security, governance, and cost—not memorizing isolated features. Your goal here is to rehearse that judgment under time pressure and then convert misses into an objective-aligned remediation plan.

You will work through Mock Exam Part 1 and Part 2 as if you are in the testing center, then diagnose performance by domain: system design (service selection and architecture), ingestion/processing (batch/streaming), storage/modeling (BigQuery/GCS/operational stores), analytics/ML enablement (BigQuery + Vertex AI patterns), and operations (monitoring, CI/CD, orchestration). The final review reinforces the comparisons and traps that appear repeatedly in PDE scenarios—especially around BigQuery architecture, streaming semantics, data governance, and cost control.

Exam Tip: Treat every question as a mini design review. Before you look at options, restate the constraints in your own words (latency, throughput, governance, regionality, failure tolerance, cost). The “best” answer is usually the one that meets the hard constraints with the fewest moving parts.

Practice note for all four milestones (Mock Exam Part 1, Mock Exam Part 2, Weak Spot Analysis, and the Exam Day Checklist): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 6.1: Mock exam rules—timing plan, marking strategy, and how to review effectively
Section 6.2: Mock Exam Part 1—mixed-domain scenario questions
Section 6.3: Mock Exam Part 2—mixed-domain scenario questions
Section 6.4: Results interpretation—domain-by-domain remediation plan aligned to objectives
Section 6.5: Final review—must-know service comparisons, common traps, and last-week study routine
Section 6.6: Exam day execution—time management, multiple-select tactics, and stress-proof checklist

Section 6.1: Mock exam rules—timing plan, marking strategy, and how to review effectively

Your mock exam is only as valuable as how closely it mirrors the real constraints. Start by setting a strict timing plan. Use a two-pass strategy: Pass 1 is for confident answers and quick eliminations; Pass 2 is for marked items. Avoid “research mode” during the mock—no docs, no notes, no pausing—because the exam tests your ability to choose under uncertainty.

Marking strategy matters. Mark any item where (a) two options both seem plausible, (b) you’re making an assumption not stated, or (c) you’re unsure which constraint is most important. Do not mark items just because they feel unfamiliar—if you can map requirements to a service decisively, answer and move on.

Review effectively by converting each miss into a tagged lesson, not a vague regret. For each incorrect or low-confidence choice, capture: the key constraint you missed, the service feature you misapplied, and the single phrase that would have steered you correctly (for example: “streaming exactly-once isn’t a given—design idempotency”).

Exam Tip: During review, prioritize “near misses” (50/50 decisions) over “I had no idea.” Near misses are the fastest score gains because they usually require one clarifying concept (e.g., BigQuery partitioning vs clustering, Pub/Sub retention, Dataflow windowing, or IAM boundaries).

Common trap: spending too long validating an answer you already know. The PDE exam often includes one distractor that is technically possible but operationally heavier. Your review should train you to prefer the simplest architecture that meets the constraints and aligns with managed services.

Section 6.2: Mock Exam Part 1—mixed-domain scenario questions

Mock Exam Part 1 should feel “mixed-domain”: one scenario can touch ingestion, BigQuery modeling, governance, and operations at once. As you work, practice an explicit requirement-to-service mapping: ingestion (Pub/Sub vs Storage Transfer vs batch loads), processing (Dataflow vs Dataproc vs BigQuery SQL), storage (BigQuery tables, partitions, clustering, materialized views), and governance (IAM, policy tags, DLP, CMEK, VPC-SC).

When a scenario emphasizes low-latency analytics on streaming data, look for BigQuery streaming ingestion patterns and Dataflow transforms with windowing. If the scenario stresses reproducibility, backfills, and cost predictability, expect batch loads, scheduled queries, or Dataform/Composer-managed workflows. If the scenario highlights schema evolution and semi-structured payloads, check whether JSON ingestion, BigQuery schema updates, and Dataflow schema handling are part of the answer.

Exam Tip: In BigQuery-centric scenarios, always decide whether the workload is best served by (1) query optimization (partitioning/clustering/denormalization), (2) ingestion redesign (batch vs streaming), or (3) data modeling/governance changes (row-level security, policy tags, authorized views). Many distractors solve the wrong layer.

Common traps you should actively avoid in Part 1: choosing Dataproc for simple ETL that Dataflow or BigQuery SQL can handle; ignoring regional constraints (datasets, GCS buckets, and Dataflow regions should align); and underestimating operational overhead (managing clusters, custom schedulers, or bespoke retry logic when managed services already provide it).

As you finish Part 1, do not immediately “celebrate” or “panic.” Write down which domains felt slow: BigQuery performance tuning, streaming semantics, or governance. That list becomes the input to your objective-aligned remediation plan in Section 6.4.

Section 6.3: Mock Exam Part 2—mixed-domain scenario questions

Mock Exam Part 2 should pressure-test your end-to-end design instincts. Expect scenarios where you must balance SLA, security, and cost while integrating BigQuery with upstream pipelines and downstream analytics/ML. In these scenarios, the exam often probes whether you understand not just what a service does, but why it is the “best fit” given constraints like data freshness, failure modes, governance boundaries, and team skillset.

When you see multi-team or multi-tenant analytics, immediately think about BigQuery data sharing and least-privilege patterns: authorized views, row-level security, column-level security with policy tags, and dataset-level IAM. When you see “sensitive data,” confirm whether the scenario requires DLP inspection, tokenization, or controlled egress (VPC Service Controls). When you see “minimize cost,” look for BigQuery slot management, query optimization, partition pruning, use of materialized views, and avoiding repeated scans through well-designed tables.
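
For orientation, row-level security in BigQuery is a single DDL statement. A sketch with hypothetical table and group names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Analysts in the EU group see only EU rows; principals not covered by any
# policy on this table see none. Table and group names are placeholders.
client.query("""
    CREATE OR REPLACE ROW ACCESS POLICY eu_only
    ON `myproject.sales.orders`
    GRANT TO ('group:eu-analysts@example.com')
    FILTER USING (region = 'EU')
""").result()
```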

Exam Tip: If an option adds complexity (custom Spark job, self-managed Kafka, bespoke encryption pipeline), ask what constraint forces it. On the PDE exam, complexity must be justified by a requirement—otherwise it’s likely a distractor.

Operational excellence is frequently embedded in Part 2. Strong answers mention monitoring and automation implicitly through service choices: Dataflow metrics and dead-letter patterns, Cloud Monitoring alerts, BigQuery INFORMATION_SCHEMA for job auditing, and orchestration via Cloud Composer/Workflows. Another recurring theme is testability: CI/CD for SQL (Dataform or scripted deployments), infrastructure-as-code, and staged rollouts.

Common trap: assuming BigQuery solves all transformations. BigQuery SQL is powerful, but if you need event-time windowing, complex streaming joins, or exactly-once-like behavior through idempotency, Dataflow is often the more appropriate processing engine—while BigQuery remains the serving layer.

Section 6.4: Results interpretation—domain-by-domain remediation plan aligned to objectives

Your weak spot analysis must translate scores into actions aligned to the course outcomes and the PDE exam objectives. Start by categorizing every miss into one of five domains: (1) design/architecture, (2) ingest/process, (3) storage/modeling/governance, (4) analysis/ML enablement, (5) operations/automation. Then label the root cause: misunderstood requirement, wrong service selection, misapplied limitation, or overlooked cost/security constraint.

For design misses, your remediation should focus on service fit and tradeoffs: when to choose Dataflow vs Dataproc, Pub/Sub vs batch loads, BigQuery vs operational stores, and how to meet SLAs with minimal operational burden. For ingestion/processing misses, revisit streaming fundamentals: event time vs processing time, windowing, late data handling, backpressure, and idempotent writes to BigQuery.

For storage/modeling misses, build a checklist: partitioning strategy (time vs integer range), clustering fields that match common filters, avoiding anti-patterns like over-normalization without need, and selecting appropriate table types (standard tables, external tables, materialized views). Governance misses should map to concrete controls: policy tags for column security, row-level policies, authorized views, CMEK, and audit logging. For analysis/ML misses, emphasize patterns that appear on the exam: feature engineering in BigQuery, exporting to Vertex AI, and ensuring training/serving consistency.

Exam Tip: Your plan should be measurable: “I will redo my marked questions after 48 hours without notes,” and “I will write a one-page service comparison for BigQuery vs Dataflow transforms, including cost and latency implications.” Vague plans don’t move scores.

Common trap: studying only what you got wrong, not what you answered correctly for the wrong reason. Review your low-confidence correct answers—these are fragile points likely to fail under different wording on exam day.

Section 6.5: Final review—must-know service comparisons, common traps, and last-week study routine

Your final review is about consolidating “decision tables” that the PDE exam repeatedly tests. At minimum, be able to compare: Dataflow vs Dataproc (managed streaming/batch pipelines vs cluster-based Spark/Hadoop), Pub/Sub vs direct BigQuery loads (decoupled messaging and buffering vs simpler batch ingestion), BigQuery native tables vs external tables (performance and governance vs convenience and separation), and BigQuery SQL transforms vs pipeline transforms (set-based analytics vs event-driven streaming logic).

For BigQuery specifically, lock in the cost/performance levers: partitioning to reduce scanned bytes, clustering to speed selective queries, materialized views for repeated aggregations, scheduled queries for batch refresh, and slot considerations (on-demand vs reservations) when concurrency and predictability matter. Also know governance primitives: dataset IAM, authorized views, row-level security, policy tags, and auditability via logs and INFORMATION_SCHEMA.
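
Most of those levers can be locked in at table-creation time. A sketch with a hypothetical schema: partitioning, clustering, and a required partition filter protect both performance and cost from day one.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Table and column names are placeholders for illustration.
client.query("""
    CREATE TABLE `myproject.analytics.events`
    (
      event_ts    TIMESTAMP,
      event_date  DATE,
      customer_id STRING,
      event_type  STRING
    )
    PARTITION BY event_date                      -- prune scanned bytes by date
    CLUSTER BY customer_id, event_type           -- speed up selective filters and joins
    OPTIONS (require_partition_filter = TRUE)    -- reject unfiltered full scans outright
""").result()
```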

Exam Tip: When two answers both “work,” pick the one that reduces operational risk: fewer components, managed scaling, clearer IAM boundaries, and built-in monitoring. The exam is biased toward managed services and maintainability.

Common traps in the last week: over-indexing on niche features, and ignoring phrasing like “minimize operational overhead,” “near real-time,” “data residency,” or “least privilege.” These phrases are not filler—they are the scoring key. Another trap is assuming “streaming” implies “exactly once.” In GCP designs, you often achieve effective exactly-once outcomes via idempotency, deduplication keys, and carefully designed sinks.

Suggested last-week routine: alternate between (a) one timed set of scenario reviews (not memorization) and (b) one targeted deep dive on your weakest domain. End each day with a short “service comparison rewrite” from memory to cement tradeoffs.

Section 6.6: Exam day execution—time management, multiple-select tactics, and stress-proof checklist

On exam day, execution matters as much as knowledge. Use disciplined time management: keep a steady pace, avoid getting trapped in one scenario, and rely on your two-pass strategy. For multiple-select items, treat them like constraint satisfaction: each selected option must be necessary and consistent with the scenario. If an option adds a service, ask what requirement forces that extra component.

Exam Tip: For multiple-select, try elimination first. Remove any option that violates a constraint (region, latency, governance), increases operational burden without benefit, or shifts the architecture away from managed services unnecessarily. Then choose the minimal set that fully satisfies requirements.

Stress-proofing is practical: read the last sentence carefully (it often contains the real objective), then scan for “must” constraints (PII, residency, SLAs, cost ceiling, near real-time). Rephrase the question as: “Which design best meets X while minimizing Y?” This keeps you from selecting impressive-but-irrelevant features.

  • Before starting: confirm your pacing plan and commit to marking uncertain items.
  • During: capture keywords (SLA, governance, cost) and map to services (BigQuery, Dataflow, Pub/Sub, Dataproc, GCS, Vertex AI).
  • When stuck: decide which constraint is hardest (often security/residency) and eliminate options that ignore it.
  • Final review: revisit only marked questions; do not churn through everything.

Checklist for readiness: you can explain why BigQuery is the serving layer, when Dataflow is required for streaming semantics, how to apply least privilege with authorized views/policy tags, and how to reduce cost with partitioning/clustering/materialized views. If you can do those under time pressure, you’re executing at a PDE level.

Chapter milestones
  • Mock Exam Part 1
  • Mock Exam Part 2
  • Weak Spot Analysis
  • Exam Day Checklist
Chapter quiz

1. Your team is running through a timed mock exam and notices they consistently miss questions about meeting strict data residency requirements while minimizing operational overhead. A workload must keep all data in the EU, support ad-hoc SQL analytics, and avoid managing servers. Which design best meets these constraints on Google Cloud?

Correct answer: Create BigQuery datasets in an EU multi-region (or specific EU region) and use BigQuery for storage and analytics, ensuring all referenced data sources (e.g., GCS buckets) are also EU-located.
BigQuery is a serverless analytics warehouse and supports dataset location controls (regional/multi-regional), aligning with residency and minimal ops requirements. Option B increases operational overhead (cluster lifecycle, tuning) and does not inherently improve compliance compared to simply using EU-located BigQuery + EU-located sources; it also isn’t the most direct choice for ad-hoc SQL. Option C is incorrect because Bigtable is an operational NoSQL store and is not designed for ad-hoc SQL analytics workloads like BigQuery; using it would complicate analytics and governance.

2. During Mock Exam Part 2, you review a scenario: an e-commerce company streams click events and requires near-real-time dashboards in BigQuery. They can tolerate occasional duplicate events but must not lose events, and they want the simplest managed ingestion with minimal code. Which approach is most appropriate?

Correct answer: Use Pub/Sub to ingest events and a Dataflow streaming pipeline with BigQuery Storage Write API (exactly-once semantics where applicable) and dead-letter handling to prevent data loss.
Pub/Sub + Dataflow streaming into BigQuery is the standard managed pattern for low-latency ingestion with reliability controls (checkpointing, retries, DLQ) and can be configured to avoid data loss while tolerating duplicates if necessary. Option B is not near-real-time (10-minute latency) and increases risk of loss if servers fail before batching. Option C is batch-oriented and fails the near-real-time dashboard requirement.

3. Weak-spot analysis shows you often choose overly complex architectures. A product team needs to let analysts query a 50 TB dataset stored as Parquet in Cloud Storage without loading it into BigQuery, and they want to control cost by limiting unnecessary data scans. What is the best solution?

Correct answer: Create BigQuery external tables over the Parquet files and enforce partitioned/hive-style layouts and required partition filters where appropriate to reduce scanned data.
BigQuery external tables can query Parquet in GCS without loading, and cost control comes from reducing scanned data via partitioning-friendly layouts and query patterns (and using controls like required partition filters when using partitioned tables). Option B adds operational overhead (cluster management) and cost unpredictability; it’s not the simplest managed analytics path. Option C violates the requirement to avoid loading and, as an unpartitioned table, can increase scan cost for common filters.

4. In a mock exam review, a question asks about governance. A healthcare company stores PHI in BigQuery and needs to ensure analysts can only see non-sensitive columns unless explicitly approved, while still allowing broad access to the table for allowed fields. Which BigQuery capability best matches this requirement?

Correct answer: Use BigQuery column-level security with policy tags (Data Catalog) to restrict access to sensitive columns based on IAM.
BigQuery policy tags (column-level security) are designed for fine-grained governance where access can be granted at the column level using IAM, fitting PHI controls. Option B is operationally heavy, increases duplication, and risks inconsistent governance over time. Option C is incomplete: authorized views can help, but granting everyone one view doesn’t address differentiated access for approved vs non-approved users, and misconfiguration can still allow base-table access if permissions are too broad; policy tags are the purpose-built control.

5. Exam Day Checklist practice: You are asked to pick the best operational approach for a BigQuery-centric pipeline. A team runs daily ELT transformations and must ensure failures are detected quickly, retries are managed, and the workflow is version-controlled and repeatable. They want a managed orchestration service with minimal custom code. What should you recommend?

Correct answer: Use Cloud Composer (managed Airflow) to orchestrate BigQuery jobs, implement retries/alerts, and manage DAGs in source control.
Cloud Composer provides managed orchestration aligned with PDE operational requirements: scheduling, retries, dependency management, alerting/monitoring integration, and infrastructure-as-code/source-controlled DAGs. Option B fails basic reliability and monitoring expectations. Option C can work but increases operational burden (VM patching, secrets management, custom retry/alert logic) and is less aligned with managed, repeatable operations typically expected for production pipelines.