[Q84-Q108] Ensure Success With Updated Verified Professional-Data-Engineer Exam Dumps [2023]

NO.84 You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?

Restrict the Google Cloud Storage bucket so only you can see the files

Grant the Project Owner role to a service account, and run the job with it

Use a service account with the ability to read the batch files and to write to BigQuery

Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery

NO.85 From which of these sources can you not load data into BigQuery?

File upload

Google Drive

Google Cloud Storage

Google Cloud SQL

NO.86 Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

Field promotion

Randomization

Salting

Hashing
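
For reference, the row-key techniques named in the options can be sketched in plain Python. This is only an illustration of how each technique shapes a key, not Bigtable client code; the function names, the `#` delimiter, the bucket count, and the sample device ID are all assumptions made for the example.

```python
import hashlib

def timestamp_first_key(device_id: str, ts: int) -> str:
    # Key begins with the timestamp, so consecutive writes sort next to each other.
    return f"{ts}#{device_id}"

def field_promotion_key(device_id: str, ts: int) -> str:
    # Field promotion: an identifying field moves from the column data
    # into the row key, ahead of the timestamp.
    return f"{device_id}#{ts}"

def salted_key(device_id: str, ts: int, buckets: int = 4) -> str:
    # Salting: prefix the key with a small bucket number derived from a hash.
    digest = hashlib.md5(f"{ts}#{device_id}".encode()).hexdigest()
    salt = int(digest, 16) % buckets
    return f"{salt}#{ts}#{device_id}"

print(timestamp_first_key("dev42", 1700000000))
print(field_promotion_key("dev42", 1700000000))
print(salted_key("dev42", 1700000000))
```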

NO.87 Your neural network model is taking days to train. You want to increase the training speed. What can you do?

Subsample your test dataset.

Subsample your training dataset.

Increase the number of input features to your model.

Increase the number of layers in your neural network.

NO.88 Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

Use the NOW() function in BigQuery to record the event’s time.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

NO.89 You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Manually configure the index in your index config as follows:

Set the following in your entity options: exclude_from_indexes = ‘actors, tags’

Set the following in your entity options: exclude_from_indexes = ‘date_published’

NO.90 An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?

BigQuery

Cloud SQL

Cloud Bigtable

Cloud Datastore

NO.91 You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:
* Executing the transformations on a schedule
* Enabling non-developer analysts to modify transformations
* Providing a graphical tool for designing transformations
What should you do?

Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis

Load each month’s CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query

Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data’s schema changes

Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a DataFrame. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery

NO.92 Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time.
Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?

Cloud Bigtable

Google BigQuery

Google Cloud Storage

Google Cloud Datastore

NO.93 You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

Increase the share of the test sample in the train-test split.

Try to collect more data and increase the size of your dataset.

Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.

Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
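
The metric this question compares across the train and test sets, RMSE, is the square root of the mean squared difference between labels and predictions. A minimal sketch in plain Python follows; the sample values are made up for illustration and do not come from the question.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical labels and predictions, only to show the call.
train_rmse = rmse([3.0, 5.0, 7.0], [1.0, 5.0, 9.0])
print(train_rmse)
```

Computing the same function on the train split and on the test split gives the two numbers the question compares.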

NO.94 Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)

The wide model is used for memorization, while the deep model is used for generalization.

A good use for the wide and deep model is a recommender system.

The wide model is used for generalization, while the deep model is used for memorization.

A good use for the wide and deep model is a small-scale linear regression problem.

NO.95 You’ve migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you’d like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you’d like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

Increase the size of your Parquet files so that each is at least 1 GB.

Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.

Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.

Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.

NO.96 You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
– No interaction by the user on the site for 1 hour
– Has added more than $30 worth of products to the basket
– Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

Use a fixed-time window with a duration of 60 minutes.

Use a sliding time window with a duration of 60 minutes.

Use a session window with a gap time duration of 60 minutes.

Use a global window with a time based trigger with a delay of 60 minutes.
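
The windowing options differ in how they group a user's events over time. As one illustration, gap-based session grouping can be sketched in plain Python. This is not the Dataflow or Apache Beam API; the function name and the sample timestamps are hypothetical, and the 60-minute gap mirrors the inactivity rule in the question.

```python
from datetime import datetime, timedelta

def split_into_sessions(event_times, gap=timedelta(minutes=60)):
    """Group timestamps into sessions; a session ends after `gap` of inactivity."""
    sessions = []
    for t in sorted(event_times):
        # Extend the current session only if this event arrives within the gap.
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Hypothetical activity for one user: two events close together, one much later.
events = [
    datetime(2023, 1, 1, 9, 0),
    datetime(2023, 1, 1, 9, 20),
    datetime(2023, 1, 1, 11, 0),
]
sessions = split_into_sessions(events)
print(len(sessions))
```

By contrast, fixed and sliding windows cut time at boundaries that are independent of when a given user was last active.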

NO.97 You need to choose a database to store time series CPU and memory usage for millions of computers. You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?

Create a table in BigQuery, and append the new samples for CPU and memory to the table

Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second

Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second

Create a wide table in Cloud Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.
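
The two Bigtable layouts described in the options differ in what goes into the row key and what goes into the columns. A rough sketch of both key schemes in plain Python follows; the `#` delimiter, the timestamp formats, the function names, and the machine ID are assumptions for illustration, not a prescribed schema.

```python
from datetime import datetime, timezone

def narrow_row_key(machine_id: str, ts: datetime) -> str:
    # Narrow layout: one row per machine per second.
    return f"{machine_id}#{ts.strftime('%Y%m%d%H%M%S')}"

def wide_row_key_and_column(machine_id: str, ts: datetime):
    # Wide layout: one row per machine per minute; the second within that
    # minute becomes the column qualifier holding the sample.
    row_key = f"{machine_id}#{ts.strftime('%Y%m%d%H%M')}"
    column = f"{ts.second:02d}"
    return row_key, column

sample_time = datetime(2023, 5, 1, 12, 30, 45, tzinfo=timezone.utc)
print(narrow_row_key("vm-001", sample_time))
print(wide_row_key_and_column("vm-001", sample_time))
```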

NO.98 What are two of the benefits of using denormalized data structures in BigQuery?

Reduces the amount of data processed, reduces the amount of storage required

Increases query speed, makes queries simpler

Reduces the amount of storage required, increases query speed

Reduces the amount of data processed, increases query speed

NO.99 You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this? Choose 2 answers.

Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.

Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.

Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.

Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.

Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.

NO.100 You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? (Choose two.)

Publisher throughput quota is too small.

Total outstanding messages exceed the 10-MB maximum.

Error handling in the subscriber code is not handling run-time errors properly.

The subscriber code cannot keep up with the messages.

The subscriber code does not acknowledge the messages that it pulls.

NO.101 What are the minimum permissions needed for a service account used with Google Dataproc?

Execute to Google Cloud Storage; write to Google Cloud Logging

Write to Google Cloud Storage; read to Google Cloud Logging

Execute to Google Cloud Storage; execute to Google Cloud Logging

Read and write to Google Cloud Storage; write to Google Cloud Logging

NO.102 Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?

dataflow.worker

dataflow.compute

dataflow.developer

dataflow.viewer

NO.103 You are planning to use Google’s Dataflow SDK to analyze customer data such as that displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?

ParDo

Sink API

Source API

Data extraction
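
The requirement amounts to applying one small function to every record independently. A plain-Python sketch of that per-element logic, using the three sample lines from the question, is shown here. It is not Dataflow SDK code, and `extract_name` is a hypothetical helper.

```python
def extract_name(line: str) -> str:
    # Everything before the first comma is treated as the customer name.
    return line.split(",", 1)[0].strip()

records = [
    "Tom,555 X street",
    "Tim,553 Y street",
    "Sam, 111 Z street",
]

# Apply the function element by element, as a per-record transform would.
names = [extract_name(record) for record in records]
print(names)
```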

NO.104 Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)

Create a new view over events using standard SQL

Create a new partitioned table using a standard SQL query

Create a new view over events_partitioned using standard SQL

Create a service account for the ODBC connection to use for authentication

Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared “events”

NO.105 Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Export the data into a Google Sheet for visualization.

Create an additional table with only the necessary columns.

Create a view on the table to present to the visualization tool.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

NO.106 You’ve migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you’d like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you’d like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

Increase the size of your Parquet files so that each is at least 1 GB.

Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.

Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.

Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.

NO.107 You are developing a software application using Google’s Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

PCollection

Transform

Pipeline

Sink API

NO.108 You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
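
Whichever option is chosen, the underlying conversion is from an epoch value stored as a string to a timestamp that supports duration arithmetic. A small plain-Python sketch of that conversion follows. It assumes DT holds epoch seconds (the question does not state the unit), and the function name and sample values are hypothetical.

```python
from datetime import datetime, timezone

def epoch_string_to_timestamp(dt: str) -> datetime:
    # Assumption: the string holds whole seconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(int(dt), tz=timezone.utc)

# Two hypothetical click events from the same session.
first_click = epoch_string_to_timestamp("1672531200")
last_click = epoch_string_to_timestamp("1672534800")

# With real timestamps, a session duration is a simple subtraction.
session_duration = last_click - first_click
print(first_click.isoformat(), session_duration.total_seconds())
```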
