A data architect is designing a petabyte-scale fact table for an e-commerce platform. The table will store transaction data and will be queried by `product_id`, `customer_id`, and `transaction_date`. The `customer_id` column has very high cardinality (over 100 million unique values), while `transaction_date` is used for filtering in 90% of queries, often with range scans. The primary goals are to optimize query performance for a wide variety of analytical queries and to minimize ongoing data layout maintenance. Which data layout strategy is most effective for these requirements?
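For reference, a minimal sketch of one candidate layout, Delta liquid clustering on the commonly filtered columns, issued through `spark.sql`; the table name `transactions_fact`, the column types, and the choice of clustering columns are assumptions rather than an indication of the intended answer:

```python
# Sketch only: a Delta table using liquid clustering (CLUSTER BY) on the
# columns most often used in filters. All names and types are assumed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS transactions_fact (
        transaction_id   BIGINT,
        product_id       BIGINT,
        customer_id      BIGINT,
        transaction_date DATE,
        total_amount     DECIMAL(18, 2)
    )
    USING DELTA
    CLUSTER BY (transaction_date, customer_id)
""")

# Clustering is maintained incrementally by running OPTIMIZE; no ZORDER
# clause is needed and re-clustering does not rewrite the whole table.
spark.sql("OPTIMIZE transactions_fact")
```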
Q2 (Multiple answers)
A financial institution is using Unity Catalog to govern access to a `transactions` table containing sensitive customer data. The following access rules must be enforced:

1. Analysts in the `EU_Analysts` group should only see transactions from European countries.
2. For all analysts, the `customer_full_name` column must be masked, showing only the first initial and last name.
3. Auditors in the `Compliance` group must see all data, unmasked and unfiltered.

Which TWO actions are required to implement this security model directly in Unity Catalog? (Select TWO)
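For reference, a minimal sketch of how a row filter and a column mask are attached in Unity Catalog, issued here as SQL via `spark.sql`; the `country` column, the function names, the list of country codes, and the masking expression are assumptions:

```python
# Sketch only: governance functions plus the ALTER TABLE statements that
# bind them to the table. Group names follow the question; the rest is assumed.
spark.sql("""
    CREATE OR REPLACE FUNCTION eu_row_filter(country STRING)
    RETURN is_account_group_member('Compliance')
        OR (is_account_group_member('EU_Analysts')
            AND country IN ('DE', 'FR', 'ES', 'IT', 'NL'))
""")

spark.sql("""
    CREATE OR REPLACE FUNCTION name_mask(customer_full_name STRING)
    RETURN CASE
        WHEN is_account_group_member('Compliance') THEN customer_full_name
        ELSE concat(left(customer_full_name, 1), '. ',
                    substring_index(customer_full_name, ' ', -1))
    END
""")

spark.sql("ALTER TABLE transactions SET ROW FILTER eu_row_filter ON (country)")
spark.sql("ALTER TABLE transactions ALTER COLUMN customer_full_name SET MASK name_mask")
```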
Q3
A developer is defining a Databricks Asset Bundle (DAB) to deploy a multi-task job. The job includes a task that runs a Python script packaged as a wheel file.
Q4
**Company Background:** Streamlytics, a real-time analytics provider, processes IoT sensor data. They have a Spark Structured Streaming job that joins a stream of sensor readings (`sensor_stream`) with a static dimension table of sensor metadata (`sensor_metadata`). The job performs an aggregation to calculate the average temperature per sensor type every 5 minutes.

**Current Architecture:** The streaming job uses a 10-minute watermark on the event timestamp to handle late-arriving data. The state store is located on cloud object storage, and the pipeline runs on an auto-scaling job cluster. This setup has been running efficiently for several months.

**The Problem:** After a new set of legacy devices was onboarded, the pipeline's performance has severely degraded. Micro-batch processing times have increased from 30 seconds to over 15 minutes, causing the stream to fall behind, and the state store has grown to several terabytes. An investigation reveals that the legacy devices occasionally send data that is several days late. These records are critical and must eventually be processed for auditing purposes.

**Business Requirement:** The engineering lead has tasked you with resolving the performance degradation and the massive state size. The solution must not lose the late-arriving data and should avoid a significant increase in steady-state compute costs.
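A minimal sketch of the pipeline's current shape, as described above; the column names (`event_ts`, `sensor_id`, `sensor_type`, `temperature`), the checkpoint path, and the target table name are assumptions:

```python
from pyspark.sql import functions as F

# Sketch only: stream-static join plus a 5-minute windowed aggregation,
# bounded by the current 10-minute watermark. Names and paths are assumed.
sensor_stream = spark.readStream.table("sensor_stream")
sensor_metadata = spark.read.table("sensor_metadata")    # static dimension table

avg_temp = (
    sensor_stream
    .withWatermark("event_ts", "10 minutes")             # current watermark setting
    .join(sensor_metadata, "sensor_id")                  # stream-static join
    .groupBy(F.window("event_ts", "5 minutes"), "sensor_type")
    .agg(F.avg("temperature").alias("avg_temperature"))
)

query = (
    avg_temp.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/avg_temp_by_type")
    .toTable("avg_temperature_by_type")
)
```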
Q5
A data engineering team has designed a multi-task job in Databricks Workflows to orchestrate their daily ETL process. The job has a linear dependency chain for the main data processing and an independent task for auditing that can run at any time after the start. An engineer needs to add a new task, `generate_reports`, which must run only after both the `validate_data` and `process_data` tasks have successfully completed. Which task dependency configuration correctly implements this requirement?

```mermaid
graph TD
    A[start_job] --> B[ingest_data];
    B --> C[validate_data];
    C --> D[process_data];
    D --> E[load_warehouse];
    A --> F[run_audits];
```
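For reference, a minimal sketch of how upstream dependencies are declared for a task in a Jobs API 2.1 payload, written here as a Python dict; which upstream `task_key` values belong in `depends_on` is what the question tests, so the entry below is a placeholder:

```python
# Sketch only: the general shape of a task definition with dependencies.
# Non-dependency settings (notebook/script, cluster, etc.) are omitted.
generate_reports_task = {
    "task_key": "generate_reports",
    "depends_on": [
        {"task_key": "<upstream_task_key>"},   # one entry per required upstream task
    ],
}
```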
Q6
A data engineer needs to ingest a large volume of continuously arriving JSON files from a cloud storage location. The schema of the JSON files is known to evolve over time, with new columns being added frequently. The pipeline must be resilient to malformed JSON records and should infer and handle schema changes automatically. Which Databricks feature is best suited for this ingestion task?
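For reference, a minimal sketch of an Auto Loader (`cloudFiles`) stream with schema inference and evolution enabled; the source path, schema location, checkpoint path, and target table name are assumptions:

```python
# Sketch only: incremental JSON ingestion with Auto Loader. Malformed or
# unexpected fields land in the automatically added _rescued_data column.
raw_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schemas/json_ingest")    # assumed path
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")      # pick up new columns
    .load("/landing/json/")                                         # assumed source path
)

(
    raw_events.writeStream
    .option("checkpointLocation", "/checkpoints/json_ingest")       # assumed path
    .option("mergeSchema", "true")                                  # let the target evolve
    .toTable("bronze_json_events")                                  # assumed table name
)
```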
Q7
A data pipeline needs to transform a bronze Delta table containing raw customer orders into a silver table. During this transformation, any order record that does not have a valid `order_id` or has a negative `total_amount` should be moved to a quarantine table for later analysis, instead of being loaded into the silver table. The main pipeline should continue processing valid records without interruption. Which Delta Live Tables (DLT) feature should be used to implement this data quality and quarantining process?
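For reference, a minimal sketch of one way to pair DLT expectations with a separate quarantine table; the source table name `bronze_orders` and the target table names are assumptions:

```python
import dlt

# Sketch only: the same quality rules drive both tables. Rows failing a
# rule are dropped from silver and captured by the quarantine filter.
rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "non_negative_amount": "total_amount >= 0",
}

@dlt.table(name="silver_orders")
@dlt.expect_all_or_drop(rules)
def silver_orders():
    return dlt.read("bronze_orders")

@dlt.table(name="quarantine_orders")
def quarantine_orders():
    # Invert the rules to keep only the records that violate at least one.
    return dlt.read("bronze_orders").filter("order_id IS NULL OR total_amount < 0")
```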
Q8
An analytics query on a large, date-partitioned Delta table is running slowly. The query frequently filters on a high-cardinality `user_id` column within specific date ranges. The Spark UI shows that a large number of files are being read even for queries that target a small number of users. The table's files are already compacted to an optimal size (around 1GB). Which optimization technique should be applied to improve data skipping and reduce the number of files scanned?
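For reference, a minimal sketch of Z-ordering on the high-cardinality filter column, one of the techniques typically weighed in this scenario; the table name `events` is an assumption:

```python
# Sketch only: co-locate rows by user_id inside each date partition so the
# per-file min/max statistics become selective enough for data skipping.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```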
Q9
What is the primary function of the metastore in the context of the Databricks Lakehouse Platform?
Q10 (Multiple answers)
An engineering team is building a near-real-time ETL pipeline using Delta Live Tables (DLT) to process a Change Data Capture (CDC) feed from a relational database. The source system provides records with `operation_type` ('INSERT', 'UPDATE', 'DELETE'), a primary key, and a timestamp for ordering. The goal is to efficiently replicate these changes into a target Delta table. Which THREE DLT features or commands are essential for correctly implementing this CDC pipeline? (Select THREE)
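For reference, a minimal sketch of a DLT CDC flow built around `APPLY CHANGES INTO` (the Python `dlt.apply_changes` API); the raw feed name, key column, and timestamp column are assumptions:

```python
import dlt
from pyspark.sql.functions import expr

# Sketch only: a streaming view over the raw CDC feed plus an
# APPLY CHANGES flow into the target table. Names are assumed.
@dlt.view(name="cdc_feed")
def cdc_feed():
    return spark.readStream.table("raw_cdc_feed")        # assumed bronze source

dlt.create_streaming_table("customers_target")

dlt.apply_changes(
    target="customers_target",
    source="cdc_feed",
    keys=["customer_id"],                                # assumed primary key
    sequence_by="event_timestamp",                       # ordering column from the source
    apply_as_deletes=expr("operation_type = 'DELETE'"),  # handle DELETE operations
)
```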