Designing and Building Modern Data Platform on Google Cloud

When it comes to integrating analytics data in the cloud, the traditional method of making the data warehouse the heart of a solution is a bit like putting all your eggs in one basket.

Traditional thinking to solve modern problems rarely provides the best solution.

The advent of the cloud has radically changed the world of analytics data, making it possible to integrate, transform, and analyze any type of data from any source, at scale. But designing modern analytic data platforms can be challenging.

This article brings together our own experiences with over 100 modern data platform design projects, using Google Cloud as the cloud platform. In it, Pythian SVP of Analytics Lynda Partner gives you an introduction to data platforms. Read on to learn about:

Why a data platform and not just a data warehouse
The core elements of a data platform
Ingesting, storing, processing, and serving layers
How to use Google Cloud as an accelerator
Batch and streaming data
Enabling data consumers
How to map to specific Google Cloud components
Details about cloud data warehousing

Read on to learn the best practices for designing cloud data platforms for a modern world.

Now that you understand the value of building a data platform on Google Cloud, you’ll likely want to read more about implementation best practices. In this article, we give an introduction to Google Cloud services, explain why you should use a data platform and not just a warehouse, and show how to map data platform layers to Google Cloud Platform services.

Designing, deploying, and managing a cloud data platform is complicated, but Google Cloud-certified experts like us can help at every stage of your data journey. Learn more about Pythian’s Cloud solutions for Google Cloud.

The advent of the cloud has radically changed the world of analytics–making it possible to integrate, transform, and analyze any type of data from any source at scale. But designing these enterprise data platforms, especially when cloud services are changing at almost warp speed, can be challenging.

We, at Pythian, have been at the forefront of this new world for six years and we’ve been involved in almost 100 projects–enough to say we know what we’re doing. We spend a lot of our time designing cloud data platforms on all of the public clouds and we know that there is a shortage of fully integrated content on how to do it well.

This article aggregates a lot of our thinking on cloud data platform design. While the concepts apply to platforms running on any public cloud, we’ve done more of these projects on Google Cloud, for several excellent reasons, so this article uses Google Cloud examples.

Table of Contents

Why a Data Platform and not just a Data Warehouse?
What exactly is a data platform?
What makes up a data platform?
Ingestion layer
Processing layer
Serving layer
Taking it up a notch
Metadata layer
Orchestration overlay
ETL overlay
Google Cloud as the accelerator
The added value of Google Cloud
Mapping cloud data lake layers to specific GCP components
Batch data ingestion
Streaming data ingestion
Data platform storage
Batch data processing
Real-time data processing and analytics
Cloud warehousing
Direct data platform access
Data Ops overlay
Orchestration layer
Data consumers
Key Takeaways—TLDR

Why a Data Platform and not just a Data Warehouse?

When it comes to integrating analytics data in the cloud, the traditional method of making the data warehouse the heart of a solution is a bit like cutting off your nose to spite your face AND putting all your eggs in one basket–all at once.

This usually stems from misconceptions about how a cloud data warehouse differs from a traditional data warehouse, and how cloud services facilitate data platforms that are so much more than just data warehouses–even as they include data warehouses as part of the overall solution.

A cloud data warehouse is massively scalable and elastic, with pricing based directly on the amount of processing you do in the data warehouse. This can create the mistaken impression that in the cloud, as you now have the biggest, most cost-effective data warehouse ever at your disposal, you can just bring all your data into it, do what you’ve always done, and go from there. But the traditional approach of importing all your data into a monolithic data warehouse is flawed for several reasons, including:

Data warehouses can’t easily accommodate the various types of data businesses now need to collect and analyze. While they’re great for structured data, they don’t provide easy access to semi-structured or unstructured data.
Data warehouses are designed for slow changes, with their output being mostly consistent views over time. That’s at odds with the modern pace of data changes and the agility needs of modern business intelligence (BI) and data analytics.
Data warehouses like Google BigQuery are massively scalable but are most often used to present organized data to business users via SQL queries, not to store and make available raw, unorganized data.
You can do all your processing in a data warehouse, but this has several potential drawbacks. It may negatively impact performance, increase costs, and reduce flexibility, because you may want to process and blend data that is not in the data warehouse, such as unprocessed or unstructured data, or data that data warehouse users didn’t require and that therefore never made its way into the warehouse.

The bottom line is that thanks to some truly awesome cloud services, you can now architect a modern and modular data analytics platform that includes a cloud data warehouse while taking advantage of the elasticity and benefits of the cloud. The platform feeds the data warehouse, but also provides cost-effective storage of any volume and variety of data that you could encounter. It allows direct access to all that raw, ungoverned data for activation in other systems and exploration by advanced users and data scientists.

What exactly is a data platform?

A cloud-native data platform is an analytics infrastructure solution capable of cost-effectively ingesting, integrating, transforming, and managing an almost unlimited amount of data of any type to facilitate actionable analytics outcomes.

In a data platform, a data lake is usually combined with a data warehouse that becomes a destination for analytics, but the data platform can also operate and deliver value in addition to the data warehouse, and in some use cases, without any data warehouse at all.

For more on why a data warehouse alone isn’t as fully featured and flexible as a data platform, see our e-book “The Data Warehouse is Dead, Long Live the Data Platform.”

To achieve maximum performance and return on investment, your data analytics platform should be designed with the following attributes in mind:

Robustness: What happens when we start adding more or different inputs? Is the existing processing fragile, or is it robust enough to handle the many aspects of schema evolution and data changes? Do we still get the correct, intended outputs?
Functionality: Can it accommodate the variety (structured, unstructured, and semi-structured), velocity (batch and streaming), and volume of data that’s required?
Modularity: What happens when we need to fix or upgrade a subcomponent of this system? Can we take advantage of new functionality without throwing away other, well-functioning components?
Scalability: What happens when system demand (whether data volumes or analytics workloads) increases? Can it scale easily and quickly to meet this increased demand?
Extensibility: What about introducing new functionality or a new feature? Is it modular and pluggable enough that it can be augmented with new features?
Cost-Effectiveness: Is it designed to take advantage of the cost-effectiveness of the cloud while optimizing costs across the entire system?
Manageability: Is it designed using fully managed serverless components to eliminate the need for significant platform infrastructure operations?

These are just some of the basic questions you need to ask–and truthfully answer–before designing any system.

What makes up a data platform?

A cloud data analytics platform is a system in and of itself. This means it has inputs and outputs, with the expectation that certain inputs map to certain outputs. But the tasks required to deliver these outputs are complex. At the highest level, the foundational building blocks of a data analytics platform are:

Ingestion: Brings data from external sources into the platform
Storage: Stores data in the optimal formats and required presentations
Processing: Applies required business logic while transforming and validating data
Serving: Delivers analytics outputs to various data consumers from humans to machines

The purpose of a data platform is to ingest, store, process, and make data available for analysis while supporting the appropriate level of organizational governance. To accomplish this while also delivering a platform that’s functional, modular, scalable, extensible, manageable, and cost-effective, data platforms require a layered architecture. These layers are functional components that perform specific tasks in the data lake system.

In practical terms, a layer is either a cloud service, an open-source or commercial tool, or some other application component you’ve implemented yourself. Very often it’s a combination of several such components, but where possible, we recommend using Platform as a Service (PaaS) solutions to reduce support requirements.

The high-level layers can be seen in the diagram below.

Fig. 1: Simplified cloud data platform design

In software development terms, these functional layers should be loosely coupled. This means separate layers must communicate through a well-defined interface, but should not depend on the internal implementation of a specific layer. This approach is critical as it ensures you have the flexibility to mix and match different cloud services and/or other tools to achieve your goals.
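As a minimal sketch of what loose coupling can look like in code, the example below defines a storage interface and has the ingestion and processing layers depend only on that interface. The names `Storage`, `InMemoryStorage`, `ingest`, and `count_raw_objects` are hypothetical and chosen for illustration; in a real platform the in-memory backend would be replaced by a client for a cloud object store.

```python
from typing import Iterable, Protocol


class Storage(Protocol):
    """Interface the other layers rely on; the backing service stays hidden."""

    def write(self, key: str, data: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...
    def list_keys(self, prefix: str) -> Iterable[str]: ...


class InMemoryStorage:
    """Stand-in backend (assumption); a cloud object store could replace it."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str) -> list[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def ingest(storage: Storage, source_name: str, payload: bytes) -> str:
    """Ingestion layer: knows the Storage interface, not its implementation."""
    key = f"raw/{source_name}/part-0"
    storage.write(key, payload)
    return key


def count_raw_objects(storage: Storage, source_name: str) -> int:
    """Processing layer: also depends only on the Storage interface."""
    return len(list(storage.list_keys(f"raw/{source_name}/")))
```

Because neither function refers to `InMemoryStorage` directly, swapping the storage backend would not require changes in the ingestion or processing code.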

Cloud implementations are notorious for constant change. That’s because of the seemingly never-ending stream of new services or service enhancements released by cloud vendors, or new projects available from the open-source community. This is good news. It means feature richness is always growing. But swapping out services can be a challenge, unless your data platform architecture is layered so that the impact of a change stays small within its layer and can be isolated from other layers. This layered approach allows you to respond to changes and upgrades with the least possible impact on the overall platform structure.

Let’s explore each of these high-level layers. We’ll then match each layer to the Google services you can use to bring your design to life.

Ingestion layer

The ingestion layer is all about getting data into the data lake. It communicates with various data sources such as relational or NoSQL databases, file storage, and internal or third-party APIs, and extracts data from them. However, the proliferation of different data sources with which organizations are now feeding their analytics means this layer must be very flexible.

Because of this, the ingestion layer is often implemented with a variety of open-source or commercial tools, each specialized to a specific data source type. One important characteristic of a data platform ingestion layer is that it shouldn’t modify or transform incoming data in any way. This ensures raw, unprocessed data is always available in the platform for data lineage tracking and reprocessing. Google Cloud also provides a Data Loss Prevention (DLP) API, which can help cleanse raw data that may contain personally identifiable information (PII). This enables access to raw data for analytics and machine learning use cases in a protected way, without leaking PII.
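The "don’t modify incoming data" rule can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `land_raw`, the key layout, and the use of a plain dictionary in place of cloud storage are all hypothetical. The payload is written byte for byte, and a checksum is recorded so later stages can verify that what they read is what arrived.

```python
import hashlib
from datetime import datetime, timezone


def land_raw(store: dict, source: str, payload: bytes, received_at: datetime) -> dict:
    """Store the payload unchanged and return a small ingestion record."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical key layout: raw/<source>/<yyyy>/<mm>/<dd>/<checksum>
    key = f"raw/{source}/{received_at:%Y/%m/%d}/{digest}"
    store[key] = payload  # no parsing, cleaning, or re-encoding at this stage
    return {
        "key": key,
        "sha256": digest,
        "bytes": len(payload),
        "received_at": received_at.isoformat(),
    }


if __name__ == "__main__":
    bucket = {}  # stand-in for an object store bucket
    record = land_raw(
        bucket, "crm", b"id;name\r\n1;Ada \r\n",
        datetime(2024, 1, 15, tzinfo=timezone.utc),
    )
    print(record["key"])
```

Note that the odd delimiter, trailing space, and Windows line endings in the sample payload are all preserved; fixing them is a job for the processing layer.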

Processing layer

After data has been saved to cloud storage in its original form, it can now be processed using batch or streaming processing to make it useful. This data processing is arguably the most interesting part of a data platform. While data lakes make it possible to perform analysis directly on raw data, it’s often not the most productive method: usually, data is transformed to some degree to make it more usable for analysts, data scientists, or other data users. Processing data in the data lake typically includes several distinct steps including schema management, data validation, data cleaning, and the production of data products.
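The processing steps named above (schema management, validation, cleaning, and producing a data product) can be sketched in plain Python. The schema, field names, and the per-country total used as the "data product" are assumptions made for illustration; a real pipeline would typically express the same steps in a framework such as Spark or Beam.

```python
from collections import defaultdict

# Hypothetical expected schema for an "orders" data source.
EXPECTED_SCHEMA = {"order_id": int, "country": str, "amount": float}


def clean(record: dict) -> dict:
    """Normalize formatting without changing the meaning of the record."""
    cleaned = dict(record)
    if isinstance(cleaned.get("country"), str):
        cleaned["country"] = cleaned["country"].strip().upper()
    if isinstance(cleaned.get("amount"), (int, str)):
        try:
            cleaned["amount"] = float(cleaned["amount"])
        except ValueError:
            pass  # leave the value as-is; validation will reject it
    return cleaned


def validate(record: dict) -> bool:
    """Check that every expected field is present with the expected type."""
    return all(isinstance(record.get(name), kind)
               for name, kind in EXPECTED_SCHEMA.items())


def process(raw_records: list) -> tuple:
    """Return (total amount per country, records that failed validation)."""
    totals = defaultdict(float)
    rejected = []
    for raw in raw_records:
        record = clean(raw)
        if validate(record):
            totals[record["country"]] += record["amount"]
        else:
            rejected.append(raw)  # keep the original for later reprocessing
    return dict(totals), rejected
```

Keeping rejected records in their original form, rather than dropping them, fits the idea that raw data should stay available for reprocessing.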

With this in mind, a data processing framework that can scale without practical limits, together with cloud compute resources you can tap into at any time, is a key element of the modern data platform. Over the last several years, several data processing frameworks have been developed that combine scalability with support for modern programming languages and integrate well into the overall cloud paradigm. The most notable among these are Apache Spark, Apache Beam, and Apache Flink.

On a high level, all the above frameworks allow you to write data cleaning, transformation, or validation tasks using one of the modern programming languages (usually Java, Scala, or Python). These frameworks then read the data from scalable cloud storage, split it into smaller chunks as data volumes require, and process those chunks using flexible cloud compute resources.

When thinking about the data processing layer in the data platform, we also need to keep in mind the distinction between batch and stream processing. Looking at Fig. 1 we see the ingestion layer saving data to cloud storage, and the processing layer reading data from this storage and saving results back to it. This approach works very well for batch processing because, while cloud storage is inexpensive and scalable, it is not particularly fast: reading and writing data to it can take minutes, even for moderate volumes. Many use cases today require significantly lower processing times (seconds or less); these are generally referred to as stream processing. In this case, the ingestion layer bypasses cloud storage and sends data directly to the processing layer. Cloud storage is then used as an archive to which data is periodically flushed.

Serving layer

The goal of the serving layer is to prepare data to be consumed by end users, whether people or other systems. Demands for access to more data are increasing from a variety of users in most organizations. The challenge for IT is that these users often have different (or even no) technology backgrounds and different preferences as to which tools they want to use to access and analyze the data in the platform.

Business users often want reports and dashboards with rich self-service capabilities. Power users and analysts want to run ad-hoc SQL queries and get responses back in seconds. Data scientists and developers want to use the programming languages they’re most comfortable with to prototype new data transformations or build machine learning models, and then share the results with other team members.

Such diverse requirements put significant pressure on data lake architecture decisions, and you’ll usually have to use different and specialized technologies for different access tasks. But the good news is that Google Cloud makes it easy for all these tools to coexist in a single architecture. For fast SQL access, for example, you can load data from storage into a data warehouse like BigQuery. To provide access to other applications, you can load data from storage into a fast key/value or document store and point the application to that. And a cloud data platform allows data science and engineering teams to work with data directly in cloud storage by using a processing framework like Spark, Beam, or Flink. Cloud services also support managed notebook environments like Jupyter Notebook or Apache Zeppelin. Teams can use these notebooks to share results, perform code reviews, and collaborate in other ways.

The main benefit of the cloud, in this case, is that many of these technologies are offered as Platform as a Service (PaaS), which moves the bulk of operations and support to the cloud provider. It also allows companies to use a pay-as-you-go model, making them accessible for organizations of any size.

At a high level, an approach like this clearly delineates boundaries, roles, and responsibilities between different layers. It also ensures that each layer fulfills the role of a feature. With the proper technology choices behind each of these layers, we can say with confidence we’re well on our way to building a functional, modular, scalable, and extensible data platform.

Taking it up a notch

The diagram below shows a more sophisticated data lake architecture, building on the simpler version:

Cloud data platform layered architecture

We’ve added to or augmented the original ingestion, storage, processing, and serving layers in the following ways:

Ingestion layer: we’re now showing a distinction between batch and streaming ingestion
Storage layer: we’ve introduced the concept of slow and fast storage options
Processing layer: we outline how it works with both batch and streaming data, along with fast and slow storage
Serving layer: has been expanded beyond a data warehouse to include other data consumers such as data scientists, ML models, and other applications

We’ve also added three additional layers:

A metadata layer to enhance our processing layer
An orchestration overlay
An extract, transform, load (ETL) tools overlay

Metadata layer

This layer stores information about the activity status of different data platform layers, while also providing an interface for other layers to fetch, add and update metadata in the metadata store (for clarity, when we use the term “metadata”, we’re referring to technical metadata as opposed to business metadata). This data typically includes, but isn’t limited to, schema information from data sources, the status of ingestion and transformation pipelines, success, failure, error rates, and other statistics about ingested and processed data, and lineage information for data transformation pipelines.

This technical metadata is very important for automation, monitoring and alerting, and developer productivity. Since our design consists of multiple layers that sometimes don’t communicate directly with each other, we need a repository with information on the state of these layers. This allows, for example, the data processing layer to know which data is now available for processing–it can simply check the metadata layer instead of trying to communicate with the ingestion layer directly. This allows the decoupling of different layers from each other, reducing complexities associated with interdependencies.
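A minimal sketch of this decoupling, assuming a hypothetical in-memory `MetadataStore` with made-up status values (`"ingested"`, `"processed"`, `"failed"`): the ingestion layer records what it has landed, and the processing layer asks the metadata layer what is ready instead of calling the ingestion layer.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """In-memory stand-in for a metadata layer (hypothetical interface)."""

    entries: dict = field(default_factory=dict)

    def record(self, dataset: str, partition: str, status: str, rows: int = 0) -> None:
        """Add or update the status of one dataset partition."""
        self.entries[(dataset, partition)] = {"status": status, "rows": rows}

    def partitions_with_status(self, dataset: str, status: str) -> list:
        """Return the partitions of a dataset that currently have this status."""
        return sorted(
            part for (name, part), meta in self.entries.items()
            if name == dataset and meta["status"] == status
        )


def process_ready(metadata: MetadataStore, dataset: str) -> list:
    """Processing layer: find ready partitions via metadata, then update it."""
    done = []
    for part in metadata.partitions_with_status(dataset, "ingested"):
        rows = metadata.entries[(dataset, part)]["rows"]
        # ... the actual transformation of this partition would run here ...
        metadata.record(dataset, part, "processed", rows)
        done.append(part)
    return done
```

Running `process_ready` a second time returns an empty list, because the metadata already shows those partitions as processed.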

Orchestration overlay

As we’ve now seen, a cloud data platform architecture includes multiple loosely coupled components that communicate with each other via a metadata layer. The missing piece, though, in this design is a component that coordinates work across multiple layers. In a modern data platform architecture, this is called the orchestration layer. It’s responsible for coordinating multiple jobs based on when required input data is available from an external source, or when an upstream dependency is met. It also handles job failures and retries. In a large data platform implementation, the dependency graph can contain hundreds and sometimes thousands of dependencies–in such implementations, multiple teams are usually involved in developing and maintaining the data processing pipelines. The logical separation of jobs and the dependency graph makes it easier for these teams to change parts of the system without impacting the larger data lake.

The need for such a layer varies depending on the design objectives. Where all data is streaming data and the data transformation jobs run continuously, the jobs may depend on each other but need not be orchestrated by any external tool. In this use case, orchestration is local within each layer, and orchestration between layers happens automatically through the layers’ exposed interfaces, which include events in and out. Those in and out events are connected implicitly when you stack up the layers, so there is no need for a separate orchestration layer. But there are other cases, especially in batch workloads, where ingestion mechanisms have no way of notifying the processing layer (FTP transfers, for example) or where definitions of completeness are very strict. In a financial use case, for instance, a daily financial transactions report is only complete once ALL data for the day is available: not some of it, not continuously updated, but only after business day close. For such use cases, a “notify me when I should process the data” approach will not work, or will require building more logic into processing jobs, essentially re-implementing a subset of the orchestration layer in the processing layer.
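The core responsibilities of an orchestration layer for batch workloads (running jobs in dependency order, and handling failures and retries) can be sketched with Python’s standard `graphlib` module. The function `run_pipeline`, the job names, and the retry count are hypothetical; a production orchestrator would add scheduling, completeness checks, and alerting on top of this.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies: dict, jobs: dict, max_attempts: int = 3) -> list:
    """Run jobs in dependency order, retrying each failing job a few times.

    dependencies maps a job name to the set of jobs it depends on.
    jobs maps a job name to a zero-argument callable.
    """
    finished = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                finished.append(name)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"{name} failed after {max_attempts} attempts"
                    ) from exc
    return finished


if __name__ == "__main__":
    # Hypothetical daily report: both ingestions must finish before the
    # transformation runs, and the report runs last.
    deps = {"report": {"transform"}, "transform": {"ingest_a", "ingest_b"}}
    jobs = {name: (lambda: None)
            for name in ("ingest_a", "ingest_b", "transform", "report")}
    print(run_pipeline(deps, jobs))
```

Keeping the dependency graph separate from the job definitions is what lets one team change a job without touching the rest of the pipeline.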

ETL overlay

An ETL overlay is a product or suite of products that make the implementation and maintenance of cloud data pipelines easier. These products absorb some of the responsibilities of the various data lake architecture layers and provide a simplified mechanism to develop and manage specific implementations. These tools often have a user interface and allow for data pipelines to be implemented with little or no code.

ETL overlay tools are usually responsible for:

Adding and configuring data ingestions from multiple sources (ingestion layer)
Creating data processing pipelines (processing layer)
Storing some of the pipeline metadata (metadata layer)
Coordinating multiple jobs (orchestration layer)

Together the orchestration layer and DataOps allow configuration of multiple layers to be synchronized and look more like one action or one process to the end-user.

Google Cloud as the accelerator

If the data platform design is the recipe, then Google Cloud has the ingredients you’ll need to create a meal your guests will enjoy–because isn’t serving end-users the whole reason for building a data platform?

Google Cloud offers a rich and growing ecosystem of managed and serverless big data services that align very well with the layered design approach we’ve discussed. So much so, in fact, that if you took the above diagram and selected the corresponding Google services, you’d probably come up with something like this:

Data Platform Architecture on Google Cloud

That’s not to say this is the only possible architecture or correct selection of services–it’s not. Other combinations are possible and always worth experimenting with, but the above example is solid and flexible enough to address a broad number of different use cases.

The added value of Google Cloud

Never forget you’ve got options when it comes to data platform tools: they range from Google Cloud services to serverless, to open source and commercial software.

Data platform tool-type comparison

While there are pros and cons for each solution type, Pythian adheres to the following order of preferences when designing a cloud data platform, assuming that all meet security standards:

Cloud-native PaaS solutions
Serverless solutions
Open-source solutions
Commercial and third-party software-as-a-service (SaaS) offerings

PaaS solutions usually mean you’re not spending time on the mundane tasks associated with managing your own servers, such as ensuring versions of different libraries actually work together. These solutions also automate many time-consuming tasks like managing connections to external systems or keeping track of what data has been already ingested. This can significantly improve team or individual productivity when working on the data lake. PaaS solutions are also a big area of investment from cloud vendors, so there are always new features and improvements being released.

Next on our list of recommended solutions are those of the serverless variety. These solutions allow you to execute custom application code without managing your own servers, or worrying about scalability and fault tolerance. You get all the benefits of a managed cloud environment but with flexibility since you can write your own code. Several serverless services exist on Google Cloud, including data processing services and short-lived lightweight cloud functions.

We recommend open-source solutions when special functionality, not available as PaaS or serverless, is required or when portability across cloud platforms is important. The trade-off is the cost associated with having to operate the software yourself and keep up with product updates.

We recommend commercial and SaaS solutions when special functionality is required but not available as PaaS or serverless or when open-source alternatives are not mature. Commercial products are also a good option when the company has already made a significant investment in a particular product and wants to protect that investment.

Mapping cloud data lake layers to specific GCP components

There are always multiple options for each layer.

It’s always important to remember that there’s no one-size-fits-all solution. Specific implementations depend on many factors such as skills, budgets, timelines, and analytical needs. The reality of a cloud data lake implementation is that you’ll most likely need to mix and match several solutions–that’s why a loosely coupled, layered architecture is so important.

We’ll next outline some proven Google Cloud services that can be mapped to each layer when designing a scalable, layered data platform.

Batch data ingestion

Google Cloud offers several services to perform batch data ingestion, including Data Fusion, Cloud Functions, and Data Transfer Service.

Data Fusion is an ETL overlay service that allows users to construct data ingestion and data processing pipelines using a UI editor, and to execute these pipelines using different data processing engines like Cloud Dataproc and (in the future) Dataflow. Data Fusion supports the ingestion of data from relational databases using JDBC connectors as well as the ingestion of files from Google Cloud Storage (GCS). Data Fusion also has connectors to ingest files from FTP and Amazon S3. Unlike other managed ETL services, Data Fusion is based on an open-source project called CDAP. This means you can implement plugins for various data sources yourself and not be constrained by what’s provided out of the box.

Like AWS with Lambda, Google Cloud provides a serverless execution environment for custom code, called Cloud Functions. Cloud Functions allows you to implement ingestions from sources not currently supported by Data Fusion or BigQuery Data Transfer Service. Because Cloud Functions limits how long each function can run before it’s terminated by Google Cloud, it’s not well-suited to large data ingestion use cases. At the time of this writing, the time limit is nine minutes.

BigQuery Data Transfer Service is another viable choice for ingesting data into the data warehouse. Data Transfer Service allows you to ingest data directly into BigQuery from selected Google-owned and operated SaaS sources like Google Analytics, Google AdWords, and YouTube. Data Transfer Service also supports data from hundreds of other SaaS providers through a partnership with data integration company Fivetran. Google Cloud offers service provisioning for Fivetran connectors via the Google Cloud web console and unified billing, but the integration service itself is provided by Fivetran.

If there is a downside to using BigQuery Data Transfer Service, it’s that data goes directly into the warehouse, which limits the ways it can be accessed and processed later. But if your analytics use cases require the ingestion of data from different SaaS providers like Google Analytics and Salesforce, the simplicity associated with Data Transfer Service may outweigh other architectural considerations.

BigQuery Data Transfer Service is expanding to support ingestion from relational databases, similar to AWS’s Database Migration Service. Currently, the only RDBMS source that’s supported is Teradata–in this case, Data Transfer Service actually saves data to GCS first, making it better suited for a cloud data platform architecture.

Streaming data ingestion

Google’s Cloud Pub/Sub service provides a fast message bus for data that needs to be ingested in a streaming fashion. Cloud Pub/Sub is similar in functionality to AWS Kinesis, but currently supports larger message sizes (10 MB in Pub/Sub versus 1 MB in AWS Kinesis). Cloud Pub/Sub is just a message storage and delivery service and doesn’t offer any pre-built connectors or data transformations. You’ll need to develop the code that publishes messages to Pub/Sub and consumes messages from it. Pub/Sub provides integrations with Cloud Dataflow for real-time data processing and analytics, and with Cloud Functions.

Data platform storage

Google Cloud Storage (GCS) is the primary scalable and cost-efficient storage offering on Google Cloud. GCS supports multiple storage tiers that vary in data access speed and cost. GCS also integrates with many Google Cloud data processing services like Cloud Dataproc, Cloud Dataflow, and BigQuery.

Batch data processing

Google Cloud offers two different ways to process data at scale in batch mode: Cloud Dataproc and Cloud Dataflow.

Cloud Dataproc allows you to launch a fully configured Spark/Hadoop cluster capable of executing Apache Spark jobs. These clusters don’t need to store any data locally and can be ephemeral—meaning that if all data is stored on GCS, a Cloud Dataproc cluster is only required for the duration of the data transformation job and not a second longer. This saves you money.

Cloud Dataflow, on the other hand, is a fully-managed execution environment for the Apache Beam framework. Dataflow can automatically adjust the compute resources required for your job depending on how much data you need to process.

Think of Apache Beam as an alternative to Apache Spark–like Spark, it’s an open-source framework meant for distributed data processing. The main difference between Beam (Cloud Dataflow) and Spark (Cloud Dataproc) is that Beam offers the same programming model for both batch and real-time data processing, while Spark is a more mature technology that’s been tested in multiple production environments.

Real-time data processing and analytics

The primary cloud-native method of real-time data processing or analytics on Google Cloud is to use Pub/Sub in conjunction with Apache Beam jobs running on Cloud Dataflow. Beam provides robust support for real-time pipelines including windows, triggers, and ways of dealing with late-arriving messages. Dataflow currently supports Java and Python Beam jobs. No support exists for SQL yet, but expect it to be added in future releases.

One alternative to the Dataflow/Apache Beam combination is Spark Streaming running on a Cloud Dataproc cluster. The Spark Streaming approach to real-time data processing is usually called “micro-batching”–Spark Streaming doesn’t operate one message at a time, instead, it combines incoming messages into small groups (usually a few seconds long). These micro-batches are processed all at once.
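The micro-batching idea can be illustrated without Spark. The sketch below is plain Python, not Spark Streaming’s API: it groups messages into fixed intervals and hands each group over as one batch. The function name `micro_batches` is hypothetical, and the timestamp attached to each message stands in for its arrival time so that the example is deterministic.

```python
from collections import defaultdict


def micro_batches(messages, interval_seconds):
    """Group (timestamp, payload) pairs into fixed-length time intervals.

    Returns a list of (interval_start, [payloads]) sorted by interval start.
    """
    batches = defaultdict(list)
    for timestamp, payload in messages:
        # Every message in [start, start + interval) lands in the same batch.
        start = int(timestamp // interval_seconds) * interval_seconds
        batches[start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


if __name__ == "__main__":
    stream = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (5.0, "d")]
    for start, payloads in micro_batches(stream, interval_seconds=2):
        # Each batch is processed all at once, not message by message.
        print(start, payloads)
```

The trade-off is visible in the interval length: a longer interval means fewer, larger batches but higher latency for each message.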

Choosing between Apache Beam and Spark Streaming as your real-time data processing engine on Google Cloud usually depends on your investments in Apache Spark, including team skills and your existing codebase. But because Google is making significant investments into the Dataflow/Beam combo, for new developments it may be the best long-term choice. Beam also provides richer semantics when it comes to real-time data processing, making it an ideal choice if most of your pipelines are or will be in real-time.

Cloud warehousing

BigQuery is Google’s managed cloud data warehouse offering. It is a distributed data warehouse with several unique properties, such as automatic compute capacity management and robust support for complex data types. Where other cloud warehouses require you to specify upfront how many and what type of nodes you want in your cluster, BigQuery automatically manages compute capacity for you. For each query, BigQuery decides how much processing power you need and allocates just the right amount of resources. BigQuery also offers a per-query billing model, where you only pay for what you use. This payment model works really well for low-volume analytics workloads or ad-hoc data exploration use cases, but it can make BigQuery costs difficult to predict. BigQuery’s support for complex data types like arrays and nested data structures also makes it a great choice if your data sources are JSON-based.
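Per-query billing is easier to reason about with a little arithmetic. The sketch below assumes on-demand billing based on the volume of data each query processes. The rate is a parameter you supply from the current price list, and the value of 5.0 per TiB in the example is a placeholder, not a quoted price.

```python
BYTES_PER_TIB = 2**40


def estimate_query_cost(bytes_scanned, price_per_tib):
    """Estimate the cost of one query from the bytes it processes.

    price_per_tib is an assumption supplied by the caller; look up the
    current rate for your region and billing model.
    """
    return (bytes_scanned / BYTES_PER_TIB) * price_per_tib


def estimate_total_cost(queries_bytes_scanned, price_per_tib):
    """Sum the estimated cost of several queries."""
    return sum(estimate_query_cost(b, price_per_tib) for b in queries_bytes_scanned)


# Hypothetical workload: two ad-hoc queries scanning 0.5 TiB and 0.25 TiB.
workload = [BYTES_PER_TIB // 2, BYTES_PER_TIB // 4]
print(estimate_total_cost(workload, price_per_tib=5.0))
```

Because the total depends on how much data each query touches, a handful of unexpectedly broad queries can change the bill noticeably, which is where the unpredictability comes from.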

Direct data platform access

There is currently no dedicated Google Cloud service to directly access data in the lake. BigQuery supports external tables, which allow you to create tables physically stored on GCS without having to load the data into BigQuery first. BigQuery also allows the creation of temporary external tables that exist only for the duration of your session. Temporary external tables are well-suited for ad-hoc data exploration on the lake.

The limitation of using BigQuery as a data platform access mechanism, however, is that you currently need to provide a schema for each external table. This can be a big barrier for ad-hoc analysis since the schema is often not known at this stage. An alternative way of working with data in GCS directly is by provisioning a temporary Dataproc cluster and using Spark SQL to query data in the lake. Spark can infer the schema for most of the popular file types automatically, making data discovery easier.
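As a rough illustration of the explicit-schema requirement, the sketch below builds a BigQuery `CREATE EXTERNAL TABLE` statement for files on GCS. The helper function, table name, column list, and bucket path are all hypothetical, and the helper does no escaping or validation of its inputs.

```python
def external_table_ddl(table, columns, uris, file_format="CSV"):
    """Build a CREATE EXTERNAL TABLE statement with an explicit schema.

    table:   "dataset.table" name (hypothetical in the example below)
    columns: list of (column_name, bigquery_type) pairs
    uris:    list of gs:// URIs pointing at the files in the lake
    Note: inputs are inserted as-is, so only use trusted values.
    """
    column_defs = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n"
        f"  {column_defs}\n"
        f")\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}]);"
    )


# Hypothetical table and bucket names, for illustration only.
print(
    external_table_ddl(
        "my_dataset.events_raw",
        [("event_id", "STRING"), ("event_ts", "TIMESTAMP")],
        ["gs://example-bucket/events/*.csv"],
    )
)
```

Every column has to be listed before the first query can run, which is exactly the friction described above when the shape of the data is not yet known.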

Data Ops overlay

We’ve already mentioned Google Cloud Data Fusion, a managed ETL service available on Google Cloud. Data Fusion allows data engineers to construct data processing and analytics pipelines using a UI editor and then have those pipelines translated into a data processing framework to be executed at scale on Google Cloud. Currently, only Apache Spark running on Cloud Dataproc is supported in this manner, with Apache Beam planned for future releases.

The main benefit of using a service like Data Fusion is the mechanisms it provides to search for existing data sets and immediately see which pipelines and transformations affect them. This allows you to perform a quick impact analysis to understand what data will be affected if a given pipeline is changed. Data Fusion also tracks a number of statistics about pipeline execution, such as the number of rows processed and the timings of different stages. This information can be used for monitoring and debugging purposes.
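Impact analysis boils down to walking a lineage graph. The sketch below is not the Data Fusion API. It assumes pipeline metadata is available as a simple dictionary of inputs and outputs, and all pipeline and data set names are made up for illustration.

```python
from collections import deque


def affected_datasets(pipelines, changed_pipeline):
    """Return every data set downstream of the changed pipeline.

    pipelines maps a pipeline name to {"inputs": [...], "outputs": [...]}.
    """
    affected = set()
    queue = deque(pipelines[changed_pipeline]["outputs"])
    while queue:
        dataset = queue.popleft()
        if dataset in affected:
            continue  # Already visited; also guards against cycles.
        affected.add(dataset)
        # Any pipeline that reads this data set passes the impact on.
        for spec in pipelines.values():
            if dataset in spec["inputs"]:
                queue.extend(spec["outputs"])
    return affected


# Hypothetical lineage metadata.
lineage = {
    "ingest": {"inputs": ["raw_orders"], "outputs": ["clean_orders"]},
    "enrich": {"inputs": ["clean_orders", "customers"], "outputs": ["orders_enriched"]},
    "report": {"inputs": ["orders_enriched"], "outputs": ["daily_revenue"]},
    "segment": {"inputs": ["customers"], "outputs": ["customer_segments"]},
}
print(sorted(affected_datasets(lineage, "ingest")))
```

A managed service saves you from collecting and maintaining this metadata yourself, which is usually the hard part.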

Orchestration layer

GCP Cloud Composer is a fully-managed service for complex job orchestration, based on the popular Apache Airflow project. It can execute existing Airflow jobs without modifications. Airflow allows you to author jobs consisting of multiple steps, such as reading a file from GCS, launching a Dataflow job to process it, and sending a notification upon success or failure. Airflow also supports dependencies between jobs, allowing you to re-run either separate steps or full jobs on demand. Cloud Composer makes managing an Airflow environment easier–it takes care of provisioning required virtual machines, installing and configuring software, and other administrative tasks as part of the service.
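The multi-step job described above (read a file, process it, send a notification) can be sketched without any orchestration framework to show what dependency ordering means. This is not Airflow code. The step names and the `run_job` helper are hypothetical, and the sketch relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_job(steps, dependencies):
    """Run step callables in dependency order, stopping at the first failure.

    steps:        maps a step name to a zero-argument callable
    dependencies: maps a step name to the set of steps it depends on
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            steps[name]()
        except Exception as error:
            # A real orchestrator would notify, retry, or allow a re-run here.
            return {"status": "failed", "failed_step": name,
                    "completed": completed, "error": str(error)}
        completed.append(name)
    return {"status": "succeeded", "failed_step": None,
            "completed": completed, "error": None}


# Hypothetical steps; each one just prints what it would do.
steps = {
    "read_file_from_gcs": lambda: print("reading file"),
    "process_with_dataflow": lambda: print("launching processing job"),
    "send_notification": lambda: print("sending notification"),
}
dependencies = {
    "read_file_from_gcs": set(),
    "process_with_dataflow": {"read_file_from_gcs"},
    "send_notification": {"process_with_dataflow"},
}
print(run_job(steps, dependencies)["status"])
```

An orchestrator adds scheduling, retries, and the ability to re-run individual steps on top of this basic ordering.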

Data consumers

BigQuery doesn’t have native support for JDBC/ODBC drivers, but these drivers are available for free from a third party called Simba Technologies Inc. BigQuery’s native data access is done via REST API because BigQuery acts more like a global SaaS than a typical database. JDBC/ODBC drivers from Simba act as a bridge between the JDBC/ODBC API and the BigQuery REST API.

As with any translation from one protocol to another, there are limitations, primarily around response latency and total throughput. These drivers may not be suitable for applications that require low-latency responses or that need to extract large amounts (tens of GBs) of data from BigQuery. Fortunately, several existing BI and reporting tools are starting to implement native BigQuery support, eliminating the need for a JDBC/ODBC driver.

You should always make sure the reporting or BI tools you want to use with your Google Cloud data platform offer support for BigQuery. Most do. For faster report generation, consider BigQuery BI Engine, a fast, in-memory analysis service that integrates with familiar Google tools like Google Data Studio to accelerate data exploration and analysis. With BI Engine, you can build rich, interactive dashboards and reports in Data Studio without compromising performance, scale, security, or data freshness.

When it comes to data consumers who need real-time data access, Google Cloud offers a fast key-value store called Bigtable that can be used as a caching mechanism. You’ll need to implement and maintain application code to load results from your real-time pipelines into Bigtable and then either build a custom API layer on top of Bigtable or use the Bigtable API directly in your applications.

A key concern for data consumption is always data security and management: who has access to what, where, and how. BigQuery supports multiple datasets and projects, so you can apply concepts like data isolation and separation of concerns and enable customized role-based access control (RBAC) for each. Additionally, by deploying data isolation at a project or resource level, you can control and gain good visibility into who is doing what, and how much, in the system. This not only provides the visibility needed to protect the data in the system, but also protects the system itself from overconsumption and blown budgets (by enforcing quota management).

Key Takeaways—TLDR

A data platform on Google Cloud is a native Google Cloud analytics infrastructure solution that can cost-effectively ingest, integrate, transform, and manage an almost unlimited amount of data of any type to facilitate actionable analytics outcomes.
In a modern data platform, a data lake is usually combined with a data warehouse that becomes a destination for analytics, but the data platform can also operate and deliver value in addition to the data warehouse, and in some use cases, without any data warehouse at all.
Google provides a rich product suite of managed and serverless big data and analytics services. These tools make it easier to leverage and integrate highly performant components when building data analytics platforms on Google Cloud.
Building a modern data platform is not an ad-hoc activity. Take your time, define your features, develop a layered feature-oriented architecture, define the corresponding functional role(s) for each layer (per the original feature requirements), and select the technologies that best align with the role of each layer. Fail to plan, plan to fail.
Designing, deploying, and managing a cloud data platform is complicated, but Google Cloud-certified data experts like Pythian can help at every stage of your data journey–from assessment to design to implementation and ongoing transformation support. Learn more about Pythian’s cloud solutions for Google Cloud.

Source: Services and solutions for Google Cloud Platform (GCP)
