Databricks Data Engineer Professional Certification Sample Questions

The purpose of this Sample Question Set is to provide you with information about the Databricks Certified Data Engineer Professional exam. These sample questions will familiarize you with both the type and the difficulty level of the questions on the Data Engineer Professional certification test. To get familiar with the real exam environment, we suggest you try our Sample Databricks Data Engineer Professional Certification Practice Exam. This sample practice exam gives you a feel for the real test and a sense of the questions asked in the actual Databricks Certified Data Engineer Professional certification exam.

These sample questions are simple, basic questions that resemble the real Databricks Certified Data Engineer Professional exam questions. To assess your readiness and performance with real-time, scenario-based questions, we suggest you prepare with our Premium Databricks Data Engineer Professional Certification Practice Exam. Working through scenario-based questions in practice exposes the difficulties that give you an opportunity to improve.

Databricks Data Engineer Professional Sample Questions:

01. The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data completes each run in 10 minutes.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
 
a) Schedule a job to execute the pipeline once an hour on a new job cluster.
b) Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.
c) Schedule a Structured Streaming job with a trigger interval of 60 minutes.
d) Configure a job that executes every time new data lands in a given directory.
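
For reference, the lowest-cost pattern here pairs an hourly schedule with a job cluster that exists only for the duration of the run. The sketch below shows such a job definition as a payload for the Databricks Jobs API (POST /api/2.1/jobs/create); the job name, notebook path, Spark version, and node type are illustrative assumptions, not values taken from the question.

job_config = {
    "name": "hourly_reporting_pipeline",              # assumed job name
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",      # top of every hour
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED"
    },
    "tasks": [{
        "task_key": "run_pipeline",
        "notebook_task": {"notebook_path": "/Repos/etl/reporting_pipeline"},  # assumed path
        "new_cluster": {                              # ephemeral job cluster: created per run, terminated afterwards
            "spark_version": "13.3.x-scala2.12",      # assumed runtime version
            "node_type_id": "i3.xlarge",              # assumed node type
            "num_workers": 2
        }
    }]
}

Because the cluster exists only for the roughly 10-minute run each hour, this costs less than keeping an interactive cluster or a continuous stream running around the clock.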

02. A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. Currently, during normal execution, each microbatch of data is processed in under 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds.
The streaming write is currently configured with a 10-second trigger interval. Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement? 

a) Use the trigger once option and configure a Databricks job to execute the query every 8 seconds; this ensures all backlogged records are processed with each batch.
b) Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing a spill.
c) Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer-running tasks from previous batches finish.
d) The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
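
For context, the trigger interval in question 02 is set on the streaming write itself. A minimal sketch follows, assuming df is the streaming DataFrame returned by readStream and that the checkpoint path and target table are placeholders.

query = (df.writeStream
    .trigger(processingTime="5 seconds")                          # lowered from the original 10 seconds
    .option("checkpointLocation", "/mnt/checkpoints/peak_stream")  # assumed checkpoint path
    .outputMode("append")
    .table("prod.events_silver"))                                  # assumed target table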

03. A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:

- email STRING, age INT, ltv INT
The following view definition is executed:
CREATE VIEW email_ltv AS
SELECT
  CASE WHEN is_member('marketing') THEN email
       ELSE 'REDACTED'
  END AS email,
  ltv
FROM user_ltv
An analyst who is not a member of the marketing group executes the following query:
- SELECT * FROM email_ltv
What will be the result of this query? 
a) Only the email and ltv columns will be returned; the email column will contain all null values.
b) Three columns will be returned, but one column will be named "REDACTED" and contain only null values.
c) Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
d) The email and ltv columns will be returned with the values in user_ltv.
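
As a companion to question 03, the sketch below shows how group membership can be checked and how read access on the redacting view might be granted; the analysts group name is an assumption, while is_member and the view itself come from the question.

# Returns true only when the current user belongs to the 'marketing' group
spark.sql("SELECT is_member('marketing') AS in_marketing").show()

# Grant read access on the view (not the underlying user_ltv table) to an assumed group
spark.sql("GRANT SELECT ON VIEW email_ltv TO `analysts`")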

04. The security team is exploring whether the Databricks secrets module can be leveraged for connecting to an external database.
After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They modify their code to the following (leaving all other variables unchanged).

password = dbutils.secrets.get(scope="db_creds", key="jdbc_password")
print(password)
df = (spark
    .read
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", tablename)
    .option("user", username)
    .option("password", password)
    .load()
)
What will happen when this code is executed? 
a) The connection to the external table will succeed; the string "REDACTED" will be printed.
b) The connection to the external table will succeed; the string value of the password will be printed in plain text.
c) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed, and the password will be printed in plain text.
d) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed, and the encoded password will be saved to DBFS.
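
For completeness, a short sketch of how the scope and key used in question 04 can be verified from a notebook; the scope and key names come from the question itself.

# List the available secret scopes and the keys stored in db_creds
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("db_creds"))

# Values retrieved with dbutils.secrets.get are redacted when displayed in notebook output
print(dbutils.secrets.get(scope="db_creds", key="jdbc_password"))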

05. A Delta Lake table was created with the following query:
CREATE TABLE prod.sales_by_stor
USING DELTA
LOCATION "/mnt/prod/sales_by_store"
Realizing that the original query had a typographical error, the code below was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command? 
a) All related files and metadata are dropped and recreated in a single ACID transaction.
b) The table name change is recorded in the Delta transaction log.
c) A new Delta transaction log is created for the renamed table.
d) The table reference in the metastore is updated.
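
To see what the rename in question 05 actually touches, the table's storage location can be inspected after the command runs; this sketch assumes a Databricks notebook and uses only the table names from the question.

spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# The metastore entry now points to the new name, but the Delta files and
# transaction log remain at the original location
spark.sql("DESCRIBE DETAIL prod.sales_by_store").select("location").show(truncate=False)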

06. A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT,
latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
a) post_id
b) post_time
c) date
d) user_id
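
A minimal sketch of partitioning the posts table from question 06 by its low-cardinality date column; the source table name and target path are assumptions.

(spark.read.table("prod.user_posts_raw")      # assumed source with the schema above
    .write
    .format("delta")
    .partitionBy("date")                      # date is low-cardinality, so partitions stay reasonably sized
    .mode("overwrite")
    .save("/mnt/prod/user_posts"))            # assumed storage path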

07. The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).
password = dbutils.secrets.get(scope="db_creds", key="jdbc_password")
print(password)
df = (spark
    .read
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", tablename)
    .option("user", username)
    .option("password", password)
    .load()
)
Which statement describes what will happen when the above code is executed? 
a) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
b) The connection to the external table will succeed; the string value of the password will be printed in plain text.
c) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
d) The connection to the external table will succeed; the string "REDACTED" will be printed.

08. A Databricks job has been configured with three tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.
What will be the resulting state if tasks A and B complete successfully but task C fails during a scheduled run? 

a) All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have been completed successfully.
b) Unless all tasks are completed successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
c) All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
d) Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
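
For question 08, the dependency graph can be pictured as the tasks array of a Jobs API payload; the sketch below uses placeholder notebook paths.

tasks = [
    {"task_key": "A",
     "notebook_task": {"notebook_path": "/Jobs/task_a"}},         # assumed path
    {"task_key": "B", "depends_on": [{"task_key": "A"}],
     "notebook_task": {"notebook_path": "/Jobs/task_b"}},
    {"task_key": "C", "depends_on": [{"task_key": "A"}],
     "notebook_task": {"notebook_path": "/Jobs/task_c"}},
]

The scheduler only controls when each task starts; anything a task has already committed to the Lakehouse stays committed, even if that task or a later one fails.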

09. A data ingestion task requires a 1-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.
Which approach will work without rearranging the data? 

a) Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to Parquet.
b) Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to Parquet.
c) Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
d) Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to Parquet.
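
A sketch of the configuration-driven approach from question 09, assuming placeholder input and output paths and that only narrow transformations are applied between read and write.

# Cap each input partition at 512 MB; with only narrow transformations downstream,
# the partition count (and therefore the part-file size) carries through to the writer
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events_json")        # assumed 1-TB JSON source
# ... narrow transformations only (select, filter, withColumn): no shuffle, no repartition ...
df.write.mode("overwrite").parquet("/mnt/curated/events_parquet")   # assumed target path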

10. The marketing team wants to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales organization.
Which solution addresses the situation while emphasizing simplicity?
 
a) Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.
b) Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
c) Use a CTAS statement to create a derivative table from the marketing table, and then configure a production job to propagate the changes.
d) Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
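
A sketch of the view-based approach from question 10; the schema, table, and column names below are invented purely for illustration.

spark.sql("""
  CREATE VIEW sales.campaign_summary AS
  SELECT
    mkt_account_id AS customer_id,   -- aliased to the sales naming convention (assumed names)
    total_spend,
    region
  FROM marketing.campaign_agg        -- marketing-only fields are simply omitted from the view
""")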

Answers:

Question: 01
Answer: a
Question: 02
Answer: b
Question: 03
Answer: c
Question: 04
Answer: a
Question: 05
Answer: d
Question: 06
Answer: c
Question: 07
Answer: d
Question: 08
Answer: a
Question: 09
Answer: c
Question: 10
Answer: a

Note: For any error in the Databricks Certified Data Engineer Professional certification exam sample questions, please update us by writing an email to feedback@certfun.com.
