DATA-ENG-ASSOC Free Sample Questions

Certified Data Engineer Associate Practice Test
Q1

A data engineering team is migrating its development workflow from the Databricks UI to a local IDE using Databricks Connect. A junior engineer successfully sets up a connection profile but receives a version-compatibility error when trying to initialize a `SparkSession`. The cluster they are connecting to runs Databricks Runtime 14.3, which uses Python 3.10.12. The engineer's local environment runs Python 3.11.5. What is the primary reason for this connection failure?
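
For context, a minimal Databricks Connect session sketch, assuming the `databricks-connect` package and a hypothetical configuration profile named `DEFAULT`:

```python
# Minimal Databricks Connect sketch (profile name is hypothetical). The
# local Python minor version must match the cluster's, or session
# initialization fails.
import sys

from databricks.connect import DatabricksSession

print(f"Local Python: {sys.version_info.major}.{sys.version_info.minor}")

# Builds a Spark Connect session against the cluster named in the profile.
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()
print(spark.range(5).count())
```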

Q2 (multiple answers)

A DevOps team is implementing a CI/CD pipeline to deploy a multi-task Databricks workflow using Databricks Asset Bundles (DAB). The pipeline must handle deployments to development, staging, and production workspaces, each with different compute policies and secret scopes. Which TWO components of the `databricks.yml` file are essential for managing these environment-specific configurations? (Select TWO)
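
For reference, a stripped-down `databricks.yml` sketch (bundle name, hosts, and policy IDs are all hypothetical) showing where per-environment settings typically live:

```yaml
bundle:
  name: sales_workflow          # hypothetical bundle name

variables:
  cluster_policy_id:
    description: Compute policy enforced in each environment

targets:
  dev:
    workspace:
      host: https://dev.example.cloud.databricks.com   # hypothetical URL
    variables:
      cluster_policy_id: "policy-dev-001"              # hypothetical ID
  prod:
    mode: production
    workspace:
      host: https://prod.example.cloud.databricks.com  # hypothetical URL
    variables:
      cluster_policy_id: "policy-prod-001"             # hypothetical ID
```

An environment is then selected at deploy time, for example `databricks bundle deploy -t prod`.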

Q3

A streaming pipeline using Auto Loader is configured to ingest JSON files from a cloud storage location. The pipeline runs successfully for several weeks but suddenly fails. Investigation of the `_rescued_data` column reveals that several recent files contain records where a previously numeric `transaction_amount` field is now a string (e.g., `"100.50"` instead of `100.50`). The desired behavior is to adapt the target Delta table schema to this change automatically, without manual intervention. Which Auto Loader option should have been configured to handle this situation gracefully?
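
As a reference point, a minimal Auto Loader sketch (paths, checkpoint locations, and table names are hypothetical) showing where `cloudFiles` options, including schema-evolution settings, are configured:

```python
# Auto Loader sketch with hypothetical paths; schema tracking, evolution
# mode, and rescued-data handling are all set through cloudFiles options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/txn")    # hypothetical
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolution behavior
    .load("/mnt/raw/txn")                                       # hypothetical
)

(
    raw.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/txn")       # hypothetical
    .option("mergeSchema", "true")                              # let the Delta sink evolve
    .trigger(availableNow=True)
    .toTable("bronze.transactions")                             # hypothetical table
)
```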

Q4

A data architect is designing a Medallion architecture for a financial services company. The raw data (Bronze layer) contains sensitive personally identifiable information (PII). The Silver layer must contain the same records but with all PII columns pseudonymized. The Gold layer will contain aggregated data with no PII. The compliance team requires that only a specific service principal, used by an automated cleansing job, can read the raw PII data from the Bronze layer. All other users and groups should be denied access. How should this security requirement be implemented using Unity Catalog?
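
Unity Catalog is deny-by-default, so the grant pattern the scenario implies can be sketched as follows (catalog, schema, and principal names are hypothetical):

```python
# Hypothetical names throughout. Because Unity Catalog denies access unless
# granted, giving SELECT only to the cleansing job's service principal
# satisfies "everyone else is denied" without explicit deny rules.
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `cleansing-job-sp`")
spark.sql("GRANT USE SCHEMA ON SCHEMA finance.bronze TO `cleansing-job-sp`")
spark.sql("GRANT SELECT ON SCHEMA finance.bronze TO `cleansing-job-sp`")
```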

Q5

True or False: When a Databricks job cluster is configured with a cluster pool, it can start faster because it acquires its driver and worker nodes from the pool of idle instances, reducing the time spent waiting for the cloud provider to provision new virtual machines.
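
As an illustration of the mechanism being described, a sketch using the Databricks SDK for Python (pool and runtime identifiers are hypothetical):

```python
# Hypothetical IDs. Both driver and workers are drawn from idle pool
# instances, which is what shortens cluster start time.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authentication resolved from the environment

cluster = w.clusters.create_and_wait(
    cluster_name="pool-backed-job-cluster",
    spark_version="14.3.x-scala2.12",
    num_workers=2,
    instance_pool_id="pool-0123456789abcdef",         # worker pool (hypothetical)
    driver_instance_pool_id="pool-0123456789abcdef",  # driver from the same pool
)
```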

Q6

During a code review, a senior engineer observes the following PySpark code snippet intended to update customer records based on new transactions. What is the primary issue with this approach for transforming data from a Bronze to a Silver table in a Medallion architecture?

```python
# bronze_df is the raw, unvalidated source DataFrame
# silver_table is the path to the clean, validated Delta table
(bronze_df.write
    .format("delta")
    .mode("overwrite")
    .save(silver_table))
```
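
For contrast, one common Bronze-to-Silver pattern validates records before writing and appends rather than overwriting (the validation rule and column name below are hypothetical):

```python
# Hypothetical validation rule: drop records without a usable key, then
# append the cleaned batch instead of replacing the Silver table's contents.
from pyspark.sql.functions import col

validated_df = bronze_df.filter(col("customer_id").isNotNull())

(validated_df.write
    .format("delta")
    .mode("append")
    .save(silver_table))
```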

Q7

A data engineer is building a Delta Live Tables (DLT) pipeline. They need to define a table that combines streaming data from a Kafka source with a static dimension table from Unity Catalog for enrichment. The pipeline should enforce a quality constraint: the join key from the streaming source must not be null. If a record violates this constraint, it should be dropped, and the pipeline should continue processing valid records. Which DLT function and expectation clause should be used?

```mermaid
graph TD
    A[Kafka Stream] --> C{DLT Pipeline};
    B[UC Dimension Table] --> C;
    C -->|Join & Enrich| D[Silver Table];
    D -->|CONSTRAINT key IS NOT NULL| E{Quality Check};
    E -->|Violation| F[Drop Row];
    E -->|Valid| G[Process Row];
```
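
A sketch of the shape the diagram describes, a stream-static join guarded by a drop-on-violation expectation, with all topic, table, and column names hypothetical:

```python
# DLT sketch with hypothetical names; the expectation drops rows whose
# join key is null and lets the rest of the stream continue.
import dlt
from pyspark.sql.functions import col, get_json_object

@dlt.table(name="silver_enriched")
@dlt.expect_or_drop("key_not_null", "customer_id IS NOT NULL")
def silver_enriched():
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
        .option("subscribe", "transactions")               # hypothetical topic
        .load()
        .select(
            get_json_object(col("value").cast("string"), "$.customer_id")
            .alias("customer_id"),
            col("value").cast("string").alias("payload"),
        )
    )
    dim = spark.table("main.ref.customers")  # static UC dimension (hypothetical)
    return stream.join(dim, on="customer_id", how="left")
```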

Q8

A company wants to provide its external partners with read-only access to a curated sales dataset managed in Unity Catalog. The partners do not have Databricks workspaces. The data engineering team needs to set up a secure sharing mechanism that does not require creating and managing users within their Databricks account. Which technology should be used to achieve this?
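
For reference, Unity Catalog share objects are created and populated with SQL along these lines (share, recipient, and table names are hypothetical):

```python
# Hypothetical names. An open, token-based recipient does not require the
# partner to have a Databricks account or workspace.
spark.sql("CREATE SHARE IF NOT EXISTS partner_sales")
spark.sql("ALTER SHARE partner_sales ADD TABLE sales.gold.curated_sales")
spark.sql("CREATE RECIPIENT IF NOT EXISTS acme_partner")
spark.sql("GRANT SELECT ON SHARE partner_sales TO RECIPIENT acme_partner")
```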

Q9

An organization is looking to optimize its ad-hoc analytics query performance and reduce infrastructure management overhead. The analytics team runs a large number of concurrent, short-running queries against Gold tables throughout the day. The workload pattern is highly variable. Which type of compute resource is the best fit for this scenario?
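
For reference, a SQL warehouse shaped for this kind of workload can be sketched with the Databricks SDK for Python (name and sizing values are illustrative):

```python
# Illustrative sizing only. Multi-cluster autoscaling absorbs concurrency
# spikes; auto-stop releases compute when the variable workload goes idle.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

warehouse = w.warehouses.create_and_wait(
    name="adhoc-analytics",        # hypothetical name
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=5,            # scale out for concurrent queries
    auto_stop_mins=10,             # stop when idle
    enable_serverless_compute=True,
)
```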

Q10

A data engineer needs to write a PySpark DataFrame containing daily sales aggregates to a Delta table. The operation must insert new sales records and update the total sales amount for existing dates that are already in the target table. Which operation should be used to accomplish this combined insert and update logic efficiently?
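
The combined insert-and-update shape the question describes can be sketched with the Delta Lake Python API (table, DataFrame, and column names are hypothetical):

```python
# Upsert sketch with hypothetical names: update totals for dates already in
# the target table, insert rows for new dates.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "gold.daily_sales")

(
    target.alias("t")
    .merge(daily_sales_df.alias("s"), "t.sale_date = s.sale_date")
    .whenMatchedUpdate(set={"total_sales": "s.total_sales"})
    .whenNotMatchedInsertAll()
    .execute()
)
```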