ExcelVisibility: Inside Google BigQuery

a large brick warehouse with green shutters

As organizations rush to gather and process as much data as possible, database administrators can struggle to keep pace. Though a number of cloud-based services have sprung up to help organizations quickly scale their computing and storage resources, the challenge of managing data warehouses remains. Google’s BigQuery is a novel data warehouse system that abstracts away many of the technical issues associated with setting up and managing a data warehouse.

To understand what this could mean for your organization, we’ll take a look at some of the challenges associated with conventional data warehouses and how BigQuery solves them.

Why Do You Need Two Databases?

First off, what is a data warehouse? Unlike a transactional database, whose purpose is (unsurprisingly) to help facilitate transactions, data warehouses are meant to collect data for later analysis. This makes it a business intelligence tool more than an operational one.

These different roles also mean that transactional databases and data warehouses have different technical requirements. A transactional database is typically optimized for read-write operations from a single data source. By contrast, a data warehouse is optimized to facilitate complex queries on large data sets that may come from a variety of sources.

Setting Up a Data Warehouse

Setting up a successful data warehouse is a complex task. Since Facebook introduced Hive, running some kind of SQL-like layer of abstraction over a typical MapReduce operation has been a common way for organizations to handle big data analytics. Unfortunately, setting up, managing, and scaling a Hadoop cluster is an ongoing challenge. What’s more, MapReduce isn’t suited to real-time analysis and ad hoc querying.

It’s no surprise then that a number of cloud-based competitors have appeared to tackle these problems. Major players in this area include Amazon’s Redshift and Microsoft’s Azure SQL Data. These services are essentially virtual data warehouses–they let database admins ingest data, provision storage and computing resources, and integrate with other BI tools. As with physical data centers, however, admins occasionally have to perform maintenance and cleaning operations. To be clear, these are powerful, but complex tools. A cottage industry of startups has emerged entirely to help organizations set up and manage these kinds of services.

What Makes BigQuery Different?

A significant part of managing most data warehouses is formatting data and provisioning resources. Even cloud solutions require you to spin up (and wind down) clusters of machines for given tasks. BigQuery dispenses with both of these concepts. It’s all about ad hoc queries.

BigQuery may be the first major cloud-based data warehouse to emphasize querying over administration. Functionally, what that means is that Google handles all provisioning and maintenance operations–all you have to worry about is connecting data sources and executing queries. This can be a game-changer for certain organizations, allowing data teams (or any team that needs to run SQL queries regularly) to potentially set up and run their own super-fast analytics operations without needing a database admin.

How Does It Work?

There are two major components to BigQuery. Essentially, it combines a couple of other Google projects into a package that’s ideal for running ad hoc queries across petabyte-scale databases.

The first is the query engine itself, which is often called Dremel. Similar to Hadoop projects like Hive and Pig, Dremel sits on top of a file management system and translates SQL queries into lower-level instructions for the engine itself. The difference is that where Hive and Pig typically rely on the MapReduce paradigm to execute queries, Dremel uses a technique called aggregator trees that’s designed to speed through tons of read-only data, allowing it to get results in seconds that could take minutes or hours with MapReduce, which makes it well-suited to ad hoc queries.

The other major component is Borg, Google’s large-scale cluster management system. Rather than requiring you to manually provision computing and storage resources, BigQuery relies on Borg to assign server resources to individual jobs. Need 3,000 CPUs to process a massive job right away? When you have Google’s computing resources, that’s not a problem. In fact, you don’t even have to provision those those CPUs–they’ll automatically be assigned by Borg based on your job.

As you might expect, BigQuery is designed to play nicely with other Google cloud services, including Google Analytics, Data Studio, and Cloud Storage. That also includes Nearline, Google’s ultra-cheap cold storage system, meaning you can store less valuable data on cheaper, slower storage while keeping it accessible. (To learn more about cold storage, check out our article “What’s Cool About Cold Storage.”)

Running Queries

Since BigQuery is all about querying, let’s take a quick look at its key features. The main thing to know about BigQuery is that it executes queries using standard SQL (Previously it relied on a non-standard SQL dialect, though as of BigQuery 2.0 you can use either).

On that note, BigQuery sports a couple of interesting features related to querying. First is a nifty feature called query caching that checks to see if the results you’re looking for are already in the cache, preventing BigQuery from having to check the database again, potentially saving you a good deal of time. The other feature allows you to set your query priority. By default, all queries execute as soon as possible, but you can also set them to execute on a batch basis, which means they’ll run as soon as idle resources are available, which could be several minutes to several hours. Why would you want to run a slower query? BigQuery limits the number of concurrent queries you can run at a time, but batch queries don’t count against this limit, allowing you to start more queries at the same time.

When Should You Use BigQuery?

For all of its advantages, BigQuery comes with a couple of downsides. As with other systems built on abstraction, what you gain in ease-of-use you lose in performance. While BigQuery is still likely to be faster than most MapReduce-based setups, it can’t match the fastest speeds possible in systems like Redshift.

That said, there are some use cases where BigQuery really shines. According to a former Google program manager, BigQuery is best suited to running ad hoc queries across extremely large (think petabyte scale) datasets, things like web logs, sensor data, or sales data.

Learn how to write an awesome job post or check out Upwork’s Hiring Guides to get more information about how to find and hire the right freelancer for your project.

Source: B2C

Pages

Saturday 20 May 2017

Inside Google BigQuery