
Google Professional Cloud Developer: How to Use BigQuery Storage Write API to Stream Data Without Duplicates?

Learn the simplest way to stream data to BigQuery using the Storage Write API while ensuring no duplicates. Create a write stream in the committed type for exactly-once delivery.


Question

You are developing an application component to capture user behavior data and stream the data to BigQuery. You plan to use the BigQuery Storage Write API. You need to ensure that the data that arrives in BigQuery does not have any duplicates. You want to use the simplest operational method to achieve this. What should you do?

A. Create a write stream in the default type.
B. Create a write stream in the committed type.
C. Configure a Kafka cluster. Use a primary universally unique identifier (UUID) for duplicate messages.
D. Configure a Pub/Sub topic. Use Cloud Functions to subscribe to the topic and remove any duplicates.

Answer

B. Create a write stream in the committed type.

Explanation

When streaming data to BigQuery with the Storage Write API, the two options relevant to this question are:

  1. The default stream
  2. An application-created write stream in committed type

The default stream provides at-least-once delivery semantics. If an append fails with an ambiguous error and the client retries it, the same rows can be written again, so duplicates may reach BigQuery. This protects against data loss, but it does not guarantee the absence of duplicates.

An application-created stream in committed type, on the other hand, supports exactly-once delivery. Rows become visible in BigQuery as soon as each append succeeds, and because the client supplies an explicit offset with every append, the Storage Write API detects when a retry targets an offset it has already written and rejects it instead of inserting the rows again. This keeps duplicates out of the table even when failures force the client to resend data.

So to meet the requirement of streaming data to BigQuery without any duplicates in the simplest way, creating a write stream in the committed type is the best approach. It guarantees exactly-once semantics without the need for additional infrastructure like Kafka or Cloud Functions to handle deduplication.
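For reference, below is a minimal Python sketch of this pattern using the google-cloud-bigquery-storage client. It assumes a hypothetical user_event_pb2 module generated from a .proto file that mirrors the destination table's schema, and the project, dataset, table, and field names are purely illustrative.

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.protobuf import descriptor_pb2

# Hypothetical module generated from a .proto file whose message mirrors the
# destination table's schema, e.g.: protoc --python_out=. user_event.proto
import user_event_pb2


def stream_user_events(project_id: str, dataset_id: str, table_id: str) -> None:
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    parent = write_client.table_path(project_id, dataset_id, table_id)

    # Create an application-created stream in COMMITTED type. Rows become
    # visible in BigQuery as soon as each append succeeds.
    write_stream = types.WriteStream()
    write_stream.type_ = types.WriteStream.Type.COMMITTED
    write_stream = write_client.create_write_stream(
        parent=parent, write_stream=write_stream
    )

    # Template request carrying the stream name and the writer schema.
    request_template = types.AppendRowsRequest()
    request_template.write_stream = write_stream.name

    proto_descriptor = descriptor_pb2.DescriptorProto()
    user_event_pb2.UserEvent.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema = types.ProtoSchema()
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data

    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    # Serialize one batch of rows.
    proto_rows = types.ProtoRows()
    row = user_event_pb2.UserEvent()
    row.user_id = "user-123"
    row.action = "click"
    proto_rows.serialized_rows.append(row.SerializeToString())

    # The explicit offset is what gives exactly-once semantics: if this append
    # is retried at the same offset, the API rejects it as already written
    # instead of inserting the rows a second time.
    request = types.AppendRowsRequest()
    request.offset = 0
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.rows = proto_rows
    request.proto_rows = proto_data

    response_future = append_rows_stream.send(request)
    response_future.result()  # blocks until the server acknowledges the append

    append_rows_stream.close()
    # Unlike pending type, no finalize/commit step is needed here: the rows
    # appended above are already committed and queryable.
```

By contrast, writes to the built-in default stream (the stream named _default under the table) do not carry offsets, which is why the default stream can only offer at-least-once delivery.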

The other options are not ideal for this scenario:
A) Using the default stream type may introduce duplicates.
C) Setting up a Kafka cluster for deduplication adds significant operational complexity.
D) Using Cloud Functions to subscribe to Pub/Sub and remove duplicates is overly complex compared to using committed streams.

In summary, when using the BigQuery Storage Write API, create a committed write stream to ensure data is streamed to BigQuery without any duplicates in the simplest way possible.
