[2023] Databricks-Certified-Professional-Data-Engineer Exam Dumps, Test Engine Practice Test Questions [Q131-Q153]

Q131. A new data engineer [email protected] has been assigned to an ELT project. The new data
engineer will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data engineer?

GRANT ALL PRIVILEGES ON TABLE [email protected] TO sales;

GRANT SELECT ON TABLE sales TO [email protected];

GRANT ALL PRIVILEGES ON TABLE sales TO [email protected];

GRANT USAGE ON TABLE sales TO [email protected];

GRANT SELECT CREATE MODIFY ON TABLE sales TO [email protected];

Q132. Which of the following scenarios is the best fit for the AUTO LOADER solution?

Efficiently process new data incrementally from cloud object storage

Incrementally process new streaming data from Apache Kafka into Delta Lake

Incrementally process new data from relational databases like MySQL

Efficiently copy data from data lake location to another data lake location

Efficiently move data incrementally from one delta table to another delta table

Explanation
The answer is: Efficiently process new data incrementally from cloud object storage.
Please note: Auto Loader only works on data files located in cloud object storage such as S3 or Azure Blob Storage; it cannot read other data sources. Although Auto Loader is built on top of Structured Streaming, it only supports files in cloud object storage. If you want to use Apache Kafka, you can use Structured Streaming directly.

Auto Loader and Cloud Storage Integration
Auto Loader supports two ways to ingest data incrementally:
1. Directory listing: Lists the input directory and maintains state in RocksDB; supports incremental file listing.
2. File notification: Uses a trigger and a queue to store file notifications, which can later be used to retrieve the files. Unlike directory listing, file notification can scale up to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
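As a rough illustration of the cloudFiles source described above, the sketch below builds the reader options and passes them to a streaming read. It assumes a Databricks runtime where `spark` is a live SparkSession. The function names, the JSON file format, and the schema path are hypothetical choices made for the example.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build the option map for the cloudFiles (Auto Loader) source."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "false" selects directory listing, "true" selects file notification
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def read_new_files(spark, input_path, options):
    """Return a streaming DataFrame over files arriving under input_path.

    Only meaningful on Databricks, where the cloudFiles source is available.
    """
    return spark.readStream.format("cloudFiles").options(**options).load(input_path)


# Hypothetical schema location; nothing is read until read_new_files is called.
opts = auto_loader_options("json", "/tmp/schemas/sales")
print(opts["cloudFiles.format"])
```

Keeping the options in a plain dictionary makes the choice between directory listing and file notification a single flag, and lets the configuration be inspected without a running cluster.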
When to use Auto Loader instead of COPY INTO?
* You want to load data from a file location that contains files on the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
* You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
For more documentation, see:
https://docs.microsoft.com/en-us/azure/databricks/ingestion/auto-loader

Q133. Which of the following statements can be used in Python to test that the number of rows in the table equals 10?
row_count = spark.sql("select count(*) from table").collect()[0][0]

assert (row_count = 10, "Row count did not match")

assert if (row_count = 10, "Row count did not match")

assert row_count == 10, "Row count did not match"

assert if row_count == 10, "Row count did not match"

assert row_count = 10, "Row count did not match"
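A minimal sketch of the kind of assert-based check the question describes. In a notebook the count would come from `spark.sql(...)`; here a plain list (`rows`, a hypothetical stand-in) supplies the count so the snippet runs without Spark. The helper name `assert_row_count` is also an assumption.

```python
def assert_row_count(row_count, expected=10):
    """Raise AssertionError with a message when the count is not as expected."""
    assert row_count == expected, "Row count did not match"


# Stand-in for spark.sql("select count(*) from table").collect()[0][0]
rows = list(range(10))
row_count = len(rows)

assert_row_count(row_count)  # passes silently because the count is 10

# A mismatching count raises AssertionError carrying the message.
try:
    assert_row_count(9)
    failed = False
    message = ""
except AssertionError as err:
    failed = True
    message = str(err)

print(failed, message)
```

The comparison uses `==`; a single `=` is assignment and is not valid inside an `assert` statement.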

Q134. What is the purpose of the gold layer in a multi-hop architecture?

Optimizes ETL throughput and analytic query performance

Eliminate duplicate records

Preserves grain of original data, without any aggregations

Data quality checks and schema enforcement

Optimized query performance for business-critical data

Q135. Which of the following benefits does Delta Live Tables provide for ELT pipelines over standard data pipelines
that utilize Spark and Delta Lake on Databricks?

The ability to write pipelines in Python and/or SQL

The ability to declare and maintain data table dependencies

The ability to automatically scale compute resources

The ability to access previous versions of data tables

The ability to perform batch and streaming queries

Q136. You have accidentally deleted records from a table called transactions. What is the easiest way to restore the deleted records or the previous state of the table? Before the delete, the table version was 3; after the delete, the table version is 4.

RESTORE TABLE transactions FROM VERSION as of 4

RESTORE TABLE transactions TO VERSION as of 3
INSERT INTO OVERWRITE transactions
SELECT * FROM transactions VERSION AS OF 3
MINUS
SELECT * FROM transactions

INSERT INTO OVERWRITE transactions
SELECT * FROM transactions VERSION AS OF 4
INTERSECT
SELECT * FROM transactions

COPY OVERWRITE transactions from VERSION as of 3
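For reference, Delta Lake provides a RESTORE command that returns a table to an earlier version. The sketch below only builds the statement text for the scenario in the question (version 3, before the delete). The helper name `restore_statement` is hypothetical, and running the statement assumes a Databricks notebook where `spark` is available.

```python
def restore_statement(table, version):
    """Build a Delta Lake RESTORE statement for a given table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


stmt = restore_statement("transactions", 3)
print(stmt)
# In a Databricks notebook this would be executed as: spark.sql(stmt)
```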

Q137. You noticed a colleague manually copying data to a backup folder before running an update command, so that the backup copy can replace the table if the update does not produce the expected outcome. Which Delta Lake feature would you recommend to simplify the process?

Use the time travel feature to refer to old data instead of manually copying

Use DEEP CLONE to clone the table prior to update to make a backup copy

Use SHADOW copy of the table as preferred backup choice

Cloud object storage retains previous version of the file

Cloud object storage automatically backs up the data

Q138. The Delta Live Tables pipeline is configured to run in Production mode using the Continuous pipeline mode.
What is the expected outcome after clicking Start to update the pipeline?

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional testing

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing

All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline (Correct)

Explanation
The answer is: All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline until it is shut down, since the execution mode is set to continuous. It does not matter whether the pipeline mode is development or production; the pipeline mode only matters during pipeline initialization.
A DLT pipeline supports two modes, development and production. You can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
Development:
When you run your pipeline in development mode, the Delta Live Tables system:
* Reuses a cluster to avoid the overhead of restarts.
* Disables pipeline retries so you can immediately detect and fix errors.
Production:
In production mode, the Delta Live Tables system:
* Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
* Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations must be configured as part of pipeline settings and are not affected when switching between modes.
Delta Live Tables supports two different modes of execution:
Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
Continuous pipelines update tables continuously as input data changes. Once an update is started, it continues to run until manually stopped. Continuous pipelines require an always-running cluster but ensure that downstream consumers have the most up-to-date data.
Please review additional DLT concepts using the link below:
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-c
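The two choices discussed above (development vs. production, triggered vs. continuous) are independent settings. As a sketch, they can be pictured as the boolean fields `continuous` and `development` in a pipeline's settings. The pipeline name and the helper function below are hypothetical, and the dictionary is deliberately partial rather than a complete pipeline definition.

```python
import json


def pipeline_settings(name, continuous, development):
    """Build a minimal, partial settings dictionary for a DLT pipeline."""
    return {
        "name": name,
        # True selects continuous execution, False selects triggered execution
        "continuous": continuous,
        # True selects development mode, False selects production mode
        "development": development,
    }


# The scenario from the question: production mode with continuous execution.
settings = pipeline_settings("sales_pipeline", continuous=True, development=False)
print(json.dumps(settings, indent=2))
```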

Q139. Which of the following programming languages can be used to build a Databricks SQL dashboard?

Python

Scala

SQL

R

All of the above

Q140. Which of the following SQL statements can be used to query a table while eliminating duplicate rows from the query results?

SELECT DISTINCT * FROM table_name

SELECT DISTINCT * FROM table_name HAVING COUNT(*) > 1

SELECT DISTINCT_ROWS (*) FROM table_name

SELECT * FROM table_name GROUP BY * HAVING COUNT(*) < 1

SELECT * FROM table_name GROUP BY * HAVING COUNT(*) > 1
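To see how DISTINCT behaves on duplicate rows, the sketch below uses Python's built-in sqlite3 module as a stand-in for a Databricks SQL endpoint. The column names and sample rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(1, "Oslo"), (1, "Oslo"), (2, "Lima")],  # first two rows are duplicates
)

# DISTINCT * collapses rows that are identical in every column.
distinct_rows = conn.execute(
    "SELECT DISTINCT * FROM table_name ORDER BY id"
).fetchall()
conn.close()

print(distinct_rows)
```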

Q141. You are working on a marketing team request to identify customers with the same information in two tables, CUSTOMERS_2021 and CUSTOMERS_2020. Each table contains 25 columns with the same schema. You want to identify rows that match between the two tables across all columns. Which of the following SQL statements can be used?

SELECT * FROM CUSTOMERS_2021
UNION
SELECT * FROM CUSTOMERS_2020

SELECT * FROM CUSTOMERS_2021
UNION ALL
SELECT * FROM CUSTOMERS_2020

SELECT * FROM CUSTOMERS_2021 C1
INNER JOIN CUSTOMERS_2020 C2
ON C1.CUSTOMER_ID = C2.CUSTOMER_ID

SELECT * FROM CUSTOMERS_2021
INTERSECT
SELECT * FROM CUSTOMERS_2020

SELECT * FROM CUSTOMERS_2021
EXCEPT
SELECT * FROM CUSTOMERS_2020
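The difference between matching on a key and matching across all columns can be checked on a small example. The sketch below uses Python's built-in sqlite3 module as a stand-in for Databricks SQL, with three invented columns instead of 25. It shows that INTERSECT returns only rows that are identical in every column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("CUSTOMERS_2020", "CUSTOMERS_2021"):
    conn.execute(f"CREATE TABLE {table} (customer_id INTEGER, name TEXT, city TEXT)")

conn.executemany(
    "INSERT INTO CUSTOMERS_2020 VALUES (?, ?, ?)",
    [(1, "Ana", "Lima"), (2, "Bo", "Oslo")],
)
conn.executemany(
    "INSERT INTO CUSTOMERS_2021 VALUES (?, ?, ?)",
    [(1, "Ana", "Lima"), (2, "Bo", "Rome"), (3, "Cy", "Kyiv")],
)

# Customer 2 exists in both tables but the city differs, so it is excluded.
matching = conn.execute(
    "SELECT * FROM CUSTOMERS_2021 INTERSECT SELECT * FROM CUSTOMERS_2020"
).fetchall()
conn.close()

print(matching)
```

An inner join on CUSTOMER_ID alone would also return customer 2, even though that row does not match in every column.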

Q142. A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue.
Which of the following approaches will ensure that the data returned by queries is always up-to-date?

The tables should be updated before the next query is run

The tables should be converted to the Delta format

The tables should be refreshed in the writing cluster before the next query is run

The tables should be altered to include metadata to not cache

The tables should be stored in a cloud-based external system

Q143. A new data engineer has started at a company. The data engineer has recently been added to the company’s
Databricks workspace as [email protected]. The data engineer needs to be able to query the table
sales in the database retail. The new data engineer already has been granted USAGE on the database retail.
Which of the following commands can be used to grant the appropriate permissions to the new data engineer?

GRANT USAGE ON TABLE sales TO [email protected];

GRANT SELECT ON TABLE [email protected] TO sales;

GRANT SELECT ON TABLE sales TO [email protected];

GRANT CREATE ON TABLE sales TO [email protected];

GRANT USAGE ON TABLE [email protected] TO sales;

Q144. The marketing team is launching a new campaign and wants to monitor its performance for the first two weeks. They would like to set up a dashboard with a refresh schedule that runs every 5 minutes. Which of the steps below can be taken to reduce the cost of this refresh over time?

Reduce the size of the SQL cluster

Reduce the max size of auto scaling from 10 to 5

Setup the dashboard refresh schedule to end in two weeks

Change the spot instance policy from reliability optimized to cost optimized

Always use X-small cluster

Q145. What is the expected output of the query SELECT COUNT (DISTINCT *) FROM user on this table?

3

2 (Correct)

1

0

NULL

Q146. You are currently working on a notebook that will populate a reporting table for downstream process consumption. This process needs to run on a schedule every hour. What type of cluster are you going to use to set up this job?

Since it’s just a single job and we need to run every hour, we can use an all-purpose cluster

The job cluster is best suited for this purpose.

Use Azure VM to read and write delta tables in Python

Use a Delta Live Tables pipeline running in continuous mode

Q147. Your team has hundreds of jobs running, but it is difficult to track the cost of each job run. You are asked to provide a recommendation on how to monitor and track cost across various workloads.

Create jobs in different workspaces, so we can track the cost easily

Use tags during job creation so cost can be easily tracked

Use job logs to monitor and track the costs

Use workspace admin reporting

Use a single cluster for all the jobs, so cost can be easily tracked

Q148. Which of the following approaches can the data engineer use to obtain a version-controllable configuration of the Job’s schedule and configuration?

They can link the Job to notebooks that are a part of a Databricks Repo.

They can submit the Job once on a Job cluster.

They can download the JSON equivalent of the job from the Job’s page.

They can submit the Job once on an all-purpose cluster.

They can download the XML description of the Job from the Job’s page

Q149. If you create a database sample_db with the statement CREATE DATABASE sample_db, what will be the default location of the database in DBFS?

Default location, DBFS:/user/

Default location, /user/db/

Default Storage account

Statement fails “Unable to create database without location”

Default Location, dbfs:/user/hive/warehouse

Q150. You need to drop the customers database along with its associated tables and data. All of the tables inside the database are managed tables. Which of the following SQL commands will help you accomplish this?

DROP DATABASE customers FORCE

DROP DATABASE customers CASCADE

DROP DATABASE customers INCLUDE

All the tables must be dropped first before dropping database

DROP DELTA DATABASE customers

Q151. The data engineering team is using a bunch of SQL queries to review data quality and monitor the ETL job every day. Which of the following approaches can be used to set up a schedule and automate this process?

They can schedule the query to run every 1 day from the Jobs UI

They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.

They can schedule the query to run every 12 hours from the Jobs UI.

They can schedule the query to refresh every 1 day from the SQL endpoint’s page in Databricks SQL.

They can schedule the query to refresh every 12 hours from the SQL endpoint’s page in Databricks SQL

Q152. You were asked to create a notebook that can take department as a parameter and process the data accordingly. Which of the following statements stores the notebook parameter in a Python variable?

SET department = dbutils.widgets.get("department")

ASSIGN department == dbutils.widgets.get("department")

department = dbutils.widgets.get("department")

department = notebook.widget.get("department")

department = notebook.param.get("department")
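In a Databricks notebook, parameters are read through the widgets utility, `dbutils.widgets`. Because `dbutils` exists only inside Databricks, the sketch below passes it in as an argument and uses two hypothetical stand-in classes (`FakeWidgets`, `FakeDbutils`) so the example runs anywhere. The parameter value "finance" is invented.

```python
class FakeWidgets:
    """Hypothetical stand-in for dbutils.widgets, for use outside Databricks."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name):
        return self._values[name]


class FakeDbutils:
    """Hypothetical stand-in exposing only the .widgets attribute."""

    def __init__(self, values):
        self.widgets = FakeWidgets(values)


def read_department(dbutils):
    """Store the notebook's 'department' parameter in a Python variable."""
    department = dbutils.widgets.get("department")
    return department


department = read_department(FakeDbutils({"department": "finance"}))
print(department)
```

Inside a real notebook the stand-ins are unnecessary; the single assignment in `read_department` is the part that carries over.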

Q153. Which of the following techniques can be used to implement fine-grained access control to rows and columns of a Delta table based on the user’s access?

Use Unity Catalog to grant access to rows and columns

Row and column access control lists

Use dynamic view functions

Data access control lists

Dynamic Access control lists with Unity Catalog
