The purpose of this Sample Question Set is to give you information about the Databricks Certified Data Engineer Professional exam. These sample questions familiarize you with both the type and the difficulty level of the questions on the Data Engineer Professional certification test. To get familiar with the real exam environment, we suggest you try our Sample Databricks Data Engineer Professional Certification Practice Exam. This sample practice exam gives you a realistic feel for the questions asked in the actual Databricks Certified Data Engineer Professional certification exam.
These sample questions are simple, basic questions that resemble the real Databricks Certified Data Engineer Professional exam questions. To assess your readiness and performance with real-time, scenario-based questions, we suggest you prepare with our Premium Databricks Data Engineer Professional Certification Practice Exam. Working through scenario-based questions in practice exposes you to difficulties that give you an opportunity to improve.
Databricks Data Engineer Professional Sample Questions:
Question 01:
A Delta table was registered with the following command:
CREATE TABLE prod.sales_by_stor
USING DELTA
LOCATION "/mnt/prod/sales_by_store"
Realizing that the original query had a typographical error, the following command was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?
a) All related files and metadata are dropped and recreated in a single ACID transaction.
b) The table name change is recorded in the Delta transaction log.
c) A new Delta transaction log is created for the renamed table.
d) The table reference in the metastore is updated.
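For reference, the behavior behind these options can be checked with a short PySpark sketch. This assumes an active SparkSession named spark and that the table above already exists; renaming an external Delta table only changes how the metastore resolves the name, while the files and transaction log under the original LOCATION stay where they are.
# Sketch only: rename the external Delta table registered above.
spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")
# The metastore now resolves the new name; the storage path is unchanged.
spark.sql("DESCRIBE EXTENDED prod.sales_by_store").show(truncate=False)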
Question 02:
Which strategy will yield the best performance without shuffling data?
a) Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
b) Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
c) Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
d) Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
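As a quick illustration of the setting named in option c, the sketch below caps the input partition size at 512 MB before reading and writing; the SparkSession name spark is assumed, the paths and filter column are hypothetical, and only narrow transformations are used so no shuffle is triggered.
# Sketch: ask Spark to split input files into ~512 MB partitions at read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
df = spark.read.load("/mnt/raw/events")              # hypothetical source path
cleaned = df.filter("event_id IS NOT NULL")          # hypothetical narrow transformation
cleaned.write.mode("overwrite").parquet("/mnt/curated/events")  # hypothetical target path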
Question 03:
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
a) Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.
b) Schedule a job to execute the pipeline once an hour on a new job cluster.
c) Schedule a Structured Streaming job with a trigger interval of 60 minutes.
d) Configure a job that executes every time new data lands in a given directory.
Question 04:
What will be the resulting state if tasks A and B complete successfully but task C fails during a scheduled run?
a) All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
b) Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
c) All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
d) Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
Question 05:
The following code is executed in a Databricks notebook:
password = dbutils.secrets.get(scope="db_creds", key="jdbc_password")
print(password)
df = (spark
.read
.format("jdbc")
.option("url", connection)
.option("dbtable", tablename)
.option("user", username)
.option("password", password)
)
Which statement describes what will happen when the above code is executed?
a) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
b) The connection to the external table will succeed; the string value of password will be printed in plain text.
c) An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
d) The connection to the external table will succeed; the string "REDACTED" will be printed.
Question 06:
A Delta table has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT,
latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
a) post_id
b) post_time
c) date
d) user_id
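To illustrate partitioning on the chosen column, here is a minimal PySpark sketch that writes a Delta table partitioned by date; it assumes a DataFrame named posts_df with the schema above, and the output path is hypothetical.
# Sketch: partition the Delta table by the low-cardinality date column.
(posts_df
    .write
    .format("delta")
    .partitionBy("date")       # one folder per calendar day
    .mode("overwrite")
    .save("/mnt/prod/posts"))  # hypothetical path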
Question 07:
During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
a) Use the trigger once option and configure a Databricks job to execute the query every 8 seconds; this ensures all backlogged records are processed with each batch.
b) Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
c) Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
d) The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
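The trigger interval these options discuss is configured on the streaming write. A minimal sketch, assuming a streaming DataFrame named stream_df and hypothetical checkpoint and output paths:
# Sketch: trigger micro-batches every 5 seconds instead of every 10.
(stream_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/example")  # hypothetical path
    .trigger(processingTime="5 seconds")
    .start("/mnt/delta/example"))                              # hypothetical path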
Question 08:
While personally identifiable information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels. The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.
Which statement exemplifies best practices for implementing this system?
a) Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
b) Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
c) Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
d) Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.
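The pattern described in option a can be sketched as follows; database names, storage paths, and the group name are hypothetical, and the GRANT statements assume table access control is enabled on the workspace.
# Sketch: one database per quality tier, each with its own default storage location.
spark.sql("CREATE DATABASE IF NOT EXISTS silver_db LOCATION '/mnt/silver/db'")
spark.sql("CREATE DATABASE IF NOT EXISTS gold_db LOCATION '/mnt/gold/db'")
# Grant read access at the database level (group name is hypothetical).
spark.sql("GRANT USAGE ON DATABASE gold_db TO `analysts`")
spark.sql("GRANT SELECT ON DATABASE gold_db TO `analysts`")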
Question 09:
The user_ltv table has the following schema: email STRING, age INT, ltv INT
The following view definition is executed:
CREATE VIEW email_ltv AS
SELECT
CASE WHEN is_member('marketing') THEN email
  ELSE 'REDACTED'
END AS email,
ltv
FROM user_ltv
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
What will be the result of this query?
a) Only the email and ltv columns will be returned; the email column will contain all null values.
b) Three columns will be returned, but one column will be named "REDACTED" and contain only null values.
c) Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
Question 10:
Which solution addresses the situation while emphasizing simplicity?
a) Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.
b) Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
c) Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.
d) Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
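Option a can be sketched with a single view definition; the database, table, and column names below are hypothetical stand-ins for the marketing and sales schemas.
# Sketch: expose only approved fields, aliased to the sales naming conventions.
spark.sql("""
    CREATE OR REPLACE VIEW sales.customers AS
    SELECT
        customer_id,
        email  AS contact_email,   -- hypothetical alias
        region AS sales_region     -- hypothetical alias
    FROM marketing.customers
""")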
Answers:
Question: 01 Answer: d
Question: 02 Answer: c
Question: 03 Answer: b
Question: 04 Answer: a
Question: 05 Answer: d
Question: 06 Answer: c
Question: 07 Answer: b
Question: 08 Answer: a
Question: 09 Answer: c
Question: 10 Answer: a
Note: If you find any errors in these Databricks Certified Data Engineer Professional certification exam sample questions, please let us know by sending an email to feedback@certfun.com.
