The amount of data a single business collects, processes, and stores is rapidly increasing exponentially by the day. A growing company relies so much on data from various sources that turning to data lakes and larger databases as solutions is the next logical step. Data analysis tools specifically designed to collect, process, and store data are also becoming more advanced and complex driven by machine learning and artificial intelligence features. Still, the challenges around these operations are growing faster than the speed at which conventional data management tools are currently being updated.
Google BigQuery service is a data warehouse in the cloud intended to overcome those challenges. From the outset, Google BigQuery was designed to be serverless, highly scalable, and capable of handling exponential amounts of data. It’s not necessary to spin up dedicated computing engines to take advantage of the processing prowess of BigQuery. More importantly, businesses optimizing the service are no longer limited by factors such as query speed and the skill level of their database administrator. Accordingly, BigQuery is part of Cherre’s CoreConnect and CoreAugment data architecture because it is ideal for processing and normalizing massive amounts of datasets.
BigQuery runs in a serverless state and allows users to collect massive amounts of data without putting too much stress on the underlying cloud environment.
It is also highly scalable. BigQuery storage engines scale up alongside the size of the data pool—we will examine this in more detail later. Even the on-demand computing services are meant to handle and provide control over an infinite amount of data at scale without the usual level of complexity.
Google BigQuery has a number of features and key characteristics that craft the end-to-end data warehouse and processing service. Breaking down these components gives a clearer view of how powerful BigQuery really is compared to other data warehouse solutions.
Before we get to the characteristics that make Google BigQuery so popular among data scientists and corporate users, let’s take a closer look at the three key features offered by BigQuery, starting with BigQuery ML. As the abbreviation suggests, BigQuery ML is the machine learning feature of BigQuery.
BigQuery ML allows users to develop machine learning models using SQL queries. Rather than having to develop an individual model, it’s possible to use a command-line tool or a web-based user interface to run said SQL queries. The feature can also be integrated with business intelligence tools that your team are already familiar with. The aim of this simplified subset of ML is so that it’s easily accessible to data engineers and data analysts. This allows Cherre—and any other business— a lot of flexibility.
BigQuery ML eliminates the natural complexity of machine learning. Data scientists and engineers can develop their own machine learning models using spreadsheets and BI tools. By eliminating the complexity, BigQuery ML gives data scientists the ability to train complex analysis and prediction models out of the box.
The second feature is known as the BigQuery BI Engine. Once again, the name of the feature says it all. This is the part of BigQuery that handles business intelligence. The real power of BigQuery BI Engine lies in its ability to process large chunks of data from multiple datasets at a lightning-fast pace. Since BigQuery BI Engine runs natively, there is no need to move or pre-process data at all. The BI Engine is low cost and you can run without it. However, we wouldn’t recommend it because you do see the performance difference.
BigQuery BI Engine only works with Google Data Studio. You can generate dashboard elements based on data processed by BigQuery. It’s also possible to define custom queries, set filters, and limitations, and even have subqueries fine-tuning the kind of business insights that you might want to display on the dashboard. Custom queries are fully supported and they can all be made using SQL standard language.
However, we’ve found that Google Data Studio has some limitations. So, at Cherre, we offer our clients Looker as an alternative, which is a highly flexible BI tool. Looker is an integral part of our CoreExplore product to help clients swiftly research, run due diligence, and analyze any of the 177 million properties on the Cherre knowledge graph.
Combining the two together helps us provide our clients with custom dashboards, filters, and visualizations of their connected data as well as create flexible, detailed lists and reports. For more on Looker, check out our blog post ‘How to upgrade Looker via bash’ here.
Connecting the BigQuery BI Engine to Google Data Studio is easy. Set custom roles using bigquery.user or bigquery.jobUser to limit access to the data warehouse. From Google Data Studio, simply create a new data source and link both services together. Creating charts and visualizing data becomes easier at this point.
BigQuery GIS is the third feature we are going to discuss in this section. This is the feature that handles geospatial data in BigQuery. BigQuery GIS allows you to achieve a lot, including predicting location-based actions and automating the delivery of packages from certain retail outlets using predictions.
BigQuery GIS includes geospatial functions without forcing the user to learn about how to manage location-sensitive data. Similar to the previous key features, commands are still based on SQL queries, empowering data scientists and analysts in the process. For example, it’s possible to import data about fleet trips and routes, and then use the power of BigQuery to predict load and optimize fleet management.
A common ST_GeogPoint(longitude, latitude) parameter can be added to any portion of an SQL query. You can, for instance, perform a standard SQL SELECT command for fleets whose last known location is in specific coordinates. Since you still benefit from BigQuery’s native performance, location-based analysis becomes more accessible.
Key features aside, Google’s BigQuery has a number of characteristics that make it an instant hit, starting with support for both Legacy SQL and Standard SQL. Enable Standard SQL and migrate to the dialect as soon upon set up of your BigQuery data warehouse.
BigQuery pricing is also very interesting. With BigQuery pricing, users pay for active storage when the data warehouse is populating and processing data, and then switch to passive storage when no more changes are done to the warehouse. The pricing model for storage is similar to that of GCS and hence the platform delivers some of the lowest storage costs in the cloud.
Commands are all very simple. Google actually provides sample datasets for you to play around with—plus you also gain access to public datasets. The Google Analytics sample dataset is a good place to start. You can begin running Standard SQL queries with commands like UNION ALL against the sample dataset.
Queries are fed through a web user interface or via a command-line interface. Either way, you can debug queries before running them and refine queries for maximum performance. Even with advanced queries (i.e. querying the number of products purchased based on the customers who purchased them) against millions of data, BigQuery doesn’t break a sweat.
The Dremel Execution Engine powering BigQuery is an exceptional component on its own. The computing engine runs on a massive multi-tenant computing cluster, providing maximum reliability without the usual high cost of running such a robust system.
Of course, you also benefit from Google’s own Jupiter Network, which bridges the gap between the computing cluster and storage blocks. Jupiter eliminates any lag between the two with its unrivaled Petabit of bisectional traffic handling capacity.
Even the serverless nature of BigQuery becomes a huge advantage on its own. There is no need to worry about manually assigning blocks, creating and managing individual clusters, and doing maintenance work.
That last part is actually very important. BigQuery is completely maintenance-free for one specific reason: it is a managed service. The data warehouse and query engine are offered as services for use out of the box without any additional set up.
Back to our sample datasets: it’s possible to replace the datasets used for certain queries without setting up new jobs. Simply alter the configuration of existing jobs and you are all set. The only real challenge is creating a table and copying data (especially if working with a public dataset) but BigQuery will still handle most of the heavy lifting for you.
BigQuery is an incredibly easy and intuitive tool to get started with so providing a tutorial is really not necessary to define—you can start with any un- or semi-structured data. Google’s BigQuery provides a multitude of possibilities for data processing and manipulation. It works well with visualization tools like Data Studio and Viz. It can also handle large datasets and has the ability to import data from public datasets on the fly. If you know your way around SQL, Google BigQuery provides the tools necessary to turn data into valuable insights to make better informed business decisions.
Stefan Thorpe, VP of Engineering @ Cherre Cloud Solutions Architect ▻ DevOps Evangelist