Q1
A large financial services company operates multiple Databricks workspaces for different business units. They need to develop a centralized repository of customer features (e.g., credit score, transaction frequency) that can be securely shared and reused across all workspaces, with strict access controls managed by a central governance team. Which approach best meets these requirements?
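For context, a minimal sketch of centrally managed access control on a shared Unity Catalog table; the catalog, schema, and group names are hypothetical:

```python
# A sketch only: grants on a Unity Catalog table are defined once and apply to
# every workspace attached to the metastore. All names here are hypothetical.
spark.sql("""
    GRANT SELECT ON TABLE main.features.customer_features
    TO `bu_retail_data_scientists`
""")
```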
Q2
True or False: In the context of the bias-variance tradeoff, increasing a model's complexity (e.g., adding more layers to a neural network or increasing the depth of a decision tree) will generally decrease its bias but increase its variance.
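For reference, the standard decomposition of expected squared error at a point x, where the last term is the irreducible noise:

```math
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```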
Q3
A data scientist is cleaning a dataset containing employee salary information for a large corporation. They observe that the 'salary' column has a number of extreme outliers, including several C-level executive salaries that are orders of magnitude higher than those of the rest of the workforce. For a feature engineering task, they need to impute a few missing salary values. Which imputation method should they prefer, and why?
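As a sketch of one robust option under these conditions (the data values are hypothetical):

```python
import pandas as pd

# Hypothetical salaries with an extreme executive outlier and a missing value.
df = pd.DataFrame({"salary": [52_000, 61_000, 58_000, 9_500_000, None]})

# The median is insensitive to the extreme value, whereas the mean would be
# pulled sharply upward by it.
df["salary"] = df["salary"].fillna(df["salary"].median())
```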
Q4
A leading e-commerce company wants to deploy a new product recommendation system. The system has several complex requirements:

1. **Real-time Personalization**: The model must provide recommendations within 200ms of a user's action (e.g., viewing a product).
2. **Dynamic User Profiles**: User feature vectors, which are used for inference, must be updated in near real-time based on their clickstream data.
3. **A/B Testing**: The MLOps team must be able to deploy a new 'challenger' recommendation algorithm and route 10% of live traffic to it for evaluation against the current 'champion' model.
4. **Scalability**: The system must handle traffic spikes during holiday seasons, scaling automatically without manual intervention.

Which Databricks deployment architecture best fulfills all these requirements?

```mermaid
graph TD
    subgraph "User Interaction"
        WebApp[Web Application] --> API_GW[API Gateway]
    end
    subgraph "Real-time Processing"
        Clickstream["Kafka: User Events"] --> DLT["Delta Live Tables: User Profile Update"]
        DLT --> OnlineStore[Online Feature Store]
    end
    subgraph "Model Serving"
        API_GW --> Endpoint[Model Serving Endpoint]
        Endpoint -- 90% --> Champion[Champion Model]
        Endpoint -- 10% --> Challenger[Challenger Model]
        Champion --> OnlineStore
        Challenger --> OnlineStore
    end
    subgraph "Batch Processing"
        ProductCatalog["Batch: Product Catalog"] --> OfflineStore[Offline Feature Store]
    end
```
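As one illustration of the A/B-testing requirement, a minimal sketch of a champion/challenger traffic split on a Model Serving endpoint, assuming a recent `databricks-sdk` and hypothetical model and endpoint names:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    Route,
    ServedEntityInput,
    TrafficConfig,
)

w = WorkspaceClient()
w.serving_endpoints.create(
    name="recsys-endpoint",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.recsys.recommender",  # hypothetical UC model
                entity_version="3",
                name="champion",
                workload_size="Small",
                scale_to_zero_enabled=False,
            ),
            ServedEntityInput(
                entity_name="main.recsys.recommender",
                entity_version="4",
                name="challenger",
                workload_size="Small",
                scale_to_zero_enabled=False,
            ),
        ],
        # Route 90% of live traffic to the champion and 10% to the challenger.
        traffic_config=TrafficConfig(
            routes=[
                Route(served_model_name="champion", traffic_percentage=90),
                Route(served_model_name="challenger", traffic_percentage=10),
            ]
        ),
    ),
)
```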
Q5
A machine learning engineer is using the `FeatureEngineeringClient` to create a new feature table in Unity Catalog. The table will store user features, and it's critical that each user is uniquely identified and that features can be looked up efficiently for online serving. Which parameter in the `fe.create_table` method is used to specify the unique identifier column(s) for the entities in the table?
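For reference, a minimal sketch of the call in question, assuming the `databricks-feature-engineering` package and hypothetical names (`user_features_df` is an existing Spark DataFrame with one row per user):

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

fe.create_table(
    name="main.ml.user_features",  # hypothetical three-level Unity Catalog name
    primary_keys=["user_id"],      # column(s) uniquely identifying each entity
    df=user_features_df,
    description="Per-user features for training and online lookup",
)
```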
Q6
A data scientist is using Hyperopt to perform Bayesian hyperparameter optimization for a machine learning model. They need to supply the correct search algorithm to the `algo` parameter of the `fmin` function. Which of the following options implements a Bayesian approach, specifically the Tree-structured Parzen Estimator (TPE)?
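For reference, a minimal runnable sketch with a hypothetical one-parameter objective:

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

def objective(params):
    # Placeholder loss; a real objective would train and evaluate a model.
    loss = (params["x"] - 2.0) ** 2
    return {"loss": loss, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space={"x": hp.uniform("x", -5.0, 5.0)},
    algo=tpe.suggest,  # Tree-structured Parzen Estimator
    max_evals=50,
    trials=Trials(),
)
```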
Q7 (Multiple answers)
An MLOps team is managing a critical fraud detection model registered in Unity Catalog. The current production model is aliased as 'Champion'. A new version has been validated and is ready to be promoted. To minimize risk, the team needs a strategy that allows for an immediate rollback to the previous version if the new model underperforms. Which steps should they perform? (Select TWO)
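As one illustration of alias-based promotion with a fast rollback path, assuming MLflow 2.x, a Unity Catalog registry, and hypothetical model name and version numbers:

```python
from mlflow import MlflowClient

client = MlflowClient()
model_name = "main.fraud.detector"  # hypothetical UC model name

# Promote version 6 to Champion. Version 5 stays registered, so rollback is
# just re-pointing the alias.
client.set_registered_model_alias(model_name, "Champion", version="6")

# Rollback, if the new version underperforms:
# client.set_registered_model_alias(model_name, "Champion", version="5")
```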
Q8
During exploratory data analysis for a demand forecasting model, a data scientist observes that a key feature, `user_daily_logins`, has a strong positive skew, with most users logging in 1-2 times a day but a small number of power users logging in over 50 times. This skew could negatively impact the performance of a linear regression model. Which feature transformation is most appropriate to apply to this feature?
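As a sketch of a typical transform for this kind of skew (the data values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed login counts, including a couple of power users.
df = pd.DataFrame({"user_daily_logins": [1, 1, 2, 1, 2, 3, 55, 60]})

# log1p compresses the long right tail and remains defined at zero.
df["user_daily_logins_log"] = np.log1p(df["user_daily_logins"])
print(df["user_daily_logins"].skew(), df["user_daily_logins_log"].skew())
```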
Q9
A machine learning team is using `GridSearchCV` from scikit-learn to tune a Gradient Boosting model. The parameter grid is defined as follows: `param_grid = {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]}` They are using 5-fold cross-validation (`cv=5`). How many individual models will be trained during this entire hyperparameter tuning process?
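The count can be checked directly: 2 x 3 x 3 = 18 parameter combinations, each fit on 5 folds, giving 18 * 5 = 90 fits during the search.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}
n_combinations = len(ParameterGrid(param_grid))  # 2 * 3 * 3 = 18
print(n_combinations * 5)                        # 90 cross-validation fits
```

Note that with scikit-learn's default `refit=True`, `GridSearchCV` performs one additional fit of the best configuration on the full training set after the search.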
Q10
A logistics company needs to deploy a model for real-time anomaly detection in its package delivery event stream. The pipeline must process a high volume of events with fluctuating loads and requires a solution that simplifies infrastructure management and automatically handles cluster scaling. What is the primary advantage of using Delta Live Tables (DLT) for this streaming inference task compared to a manually configured Structured Streaming job?
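For context, a minimal sketch of what streaming inference in a DLT pipeline can look like, assuming a hypothetical upstream `delivery_events` table and a registered model with a Champion alias:

```python
import dlt
import mlflow.pyfunc
from pyspark.sql import functions as F

# Load the registered model as a Spark UDF (hypothetical UC model name/alias).
predict_udf = mlflow.pyfunc.spark_udf(
    spark, "models:/main.logistics.anomaly_detector@Champion"
)

@dlt.table(name="scored_delivery_events")
def scored_delivery_events():
    events = dlt.read_stream("delivery_events")  # hypothetical upstream table
    # Score each event as it arrives; DLT manages the cluster and autoscaling.
    return events.withColumn("anomaly_score", predict_udf(F.struct(*events.columns)))
```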