10/210 questions
Q1

A data engineering team is developing a Spark application to process sensitive financial data. They need to pass a large, read-only lookup table (approximately 500MB) containing currency exchange rates to all executor nodes. This table is used in a join operation within multiple tasks. Which Spark feature should be used to distribute this lookup table efficiently and minimize network I/O?

Q2

In the context of the Apache Spark execution hierarchy, which of the following events will always trigger the creation of a new Spark Stage?

Q3 · Multiple answers

A Spark job processing a large dataset is experiencing performance degradation. Analysis of the Spark UI shows that one task in a particular stage is taking significantly longer than all other tasks. The stage involves a `groupBy('user_id')` operation. Which of the following identify the most likely cause of this issue and the most appropriate solution? (Select TWO)

Q4

A developer needs to read a large Parquet dataset partitioned by `year`, `month`, and `day`. To optimize read performance, they only want to load data for the first week of January 2023. Which Spark SQL query correctly applies partition pruning to achieve this? `SELECT * FROM sales_parquet WHERE _______`

Q5

A streaming application needs to calculate a running count of events per user and output the updated count for each user as new data arrives in every micro-batch. Which Structured Streaming output mode is designed for this use case?

Q6

A retail analytics company is building a daily batch processing pipeline using Spark. The pipeline must ingest raw sales transaction data from a CSV file, enrich it with product dimension data from a Parquet file, calculate daily sales aggregates per product category, and write the final report to a Delta table, overwriting the previous day's report. The raw sales data (`sales_df`) contains `product_id`, `sale_amount`, and `transaction_time`. The product dimension data (`products_df`) contains `product_id` and `product_category`. The final report must be partitioned by `product_category` for efficient querying by downstream business intelligence tools. Which sequence of PySpark DataFrame operations correctly and most efficiently implements this logic?

Q7

True or False: Spark Connect allows a Spark application's driver process to run on a separate machine from the Spark cluster, such as a developer's laptop or an IDE, while the execution of Spark jobs occurs on the remote cluster.

Q8

A data scientist is working with a large Spark DataFrame and needs to apply a complex numerical computation that is already implemented and highly optimized in the `scipy` library. Which type of User-Defined Function is best suited for applying this `scipy` function to columns of a Spark DataFrame to maximize performance?

Q9

Which of the following describes the role of the Driver in a Spark application running in cluster mode?

Q10

A developer is writing a DataFrame to a cloud storage location. The requirement is to write the data only if the target location does not already exist. If the location exists, the write operation should fail instead of overwriting or appending data. Which `saveMode` should be used?