What do you do when you are the world’s largest search engine and need an urgent solution to manage all your ad-hoc searches? You probably invent one.
And that is exactly what Google did back in 2006, when it came up with Dremel, a query service that lets Google employees run SQL-like queries and get results in a fraction of a second.
Since then, Google has released BigQuery, an external implementation of Dremel that gives developers and technologists access to Dremel's core features. Launched in 2012, BigQuery has caught the attention of industry giants and startups alike as an alternative to complex data warehousing systems.
So, what exactly is BigQuery?
At its core, BigQuery is a serverless, highly scalable, low-cost, logical data warehouse that firms can easily use to create useful data insights. Relieved of the burden of data handling and management, firms can focus solely on how best to analyze the available data.
Third-party developers can access BigQuery through a REST API, a web UI, or a command-line interface, with performance matching that of Dremel itself. With datasets of billions of rows scanned across tens of thousands of servers, developers get responses from BigQuery in a matter of seconds.
So, how does BigQuery work?
Borrowing its architecture from Dremel, BigQuery rests on two main technologies that together create a very efficient data management system:
Columnar Storage – Every new record is broken into its component values, which are stored in separate columns of logical storage.
Tree Architecture – Every query is pushed down through the nodes of a tree, and the response travels back up just as quickly through the same nodes.
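As a rough illustration, the columnar idea can be sketched in a few lines of Python; the records and field names below are invented for the example and do not reflect Dremel's actual storage format:

```python
# Sketch of columnar storage: incoming records are decomposed into
# per-column lists instead of being kept together row by row.
records = [
    {"user": "alice", "country": "US", "clicks": 12},
    {"user": "bob",   "country": "DE", "clicks": 7},
    {"user": "carol", "country": "US", "clicks": 31},
]

columns = {}
for record in records:
    for name, value in record.items():
        # Each field value lands in its column's own storage.
        columns.setdefault(name, []).append(value)

print(columns["country"])   # ['US', 'DE', 'US']
```

The payoff of this layout is that values of the same field, which tend to resemble each other, now sit contiguously.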
This architecture gives us the following advantages:
High Storage Compression – Because similar values sit together in a column, they are much easier to compress and can be stored in a format where performance is not compromised by size.
Also, since it is a logical storage system rather than a physical one, compressing the data is quite feasible. Compression ratios can reach 10:1, whereas about 3:1 is the best a typical record-oriented storage format achieves.
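The effect is easy to demonstrate with a toy example: compressing the same values laid out row-wise versus column-wise. The invented data and the stdlib zlib compressor stand in for BigQuery's real codecs, so the exact ratios will differ, but the column layout should compress noticeably better:

```python
import zlib

# A column of similar values (e.g. country codes) compresses far better
# than the same values interleaved with unrelated fields in row order.
ids = [str(i) for i in range(1000)]
countries = ["US"] * 800 + ["DE"] * 150 + ["FR"] * 50

row_layout = "".join(i + c for i, c in zip(ids, countries)).encode()
column_layout = ("".join(ids) + "".join(countries)).encode()

row_ratio = len(row_layout) / len(zlib.compress(row_layout))
col_ratio = len(column_layout) / len(zlib.compress(column_layout))
print(f"row-oriented: ~{row_ratio:.1f}:1, column-oriented: ~{col_ratio:.1f}:1")
```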
Refined Search – Since records are stored as columns, only the columns a query actually needs are scanned. This saves the time of parsing every complete record, most of which may have nothing to do with the query.
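Here is a minimal sketch of that column pruning, with an invented table: the query below touches only two of the four columns, and the wide "bio" column is never read at all:

```python
# Sketch of column pruning: a query that filters on one column and
# aggregates another never needs to read the remaining columns.
table = {
    "user":    ["alice", "bob", "carol", "dave"],
    "country": ["US", "DE", "US", "US"],
    "clicks":  [12, 7, 31, 4],
    "bio":     ["a very wide text column..."] * 4,  # never touched below
}

def total_clicks_for(country):
    """SELECT SUM(clicks) WHERE country = ? -- reads 2 of 4 columns."""
    return sum(c for c, ctry in zip(table["clicks"], table["country"])
               if ctry == country)

print(total_clicks_for("US"))   # 47
```

In a row store, the same query would have to drag every record, "bio" included, off disk just to look at two fields.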
One handicap both Dremel and BigQuery share is that a record cannot be updated in place; the architecture has no efficient mechanism for it.
But if the intention is only to run queries and get read-only responses, BigQuery is among the fastest and most efficient technologies out there. By combining the tree architecture with columnar storage, Google created an ad-hoc search system that returns amazingly fast responses regardless of dataset size.
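The tree side of the design can be sketched the same way: a root node fans an aggregation out to leaf servers holding shards of a column, and partial results are merged on the way back up. The shard contents below are invented for the example:

```python
# Sketch of the tree architecture: the root splits an aggregation query
# across leaf servers (here, shards of a 'clicks' column); each leaf
# returns a partial result, and the root merges them.
shards = [
    [12, 7, 31],      # leaf server 1's slice of the column
    [4, 25],          # leaf server 2
    [9, 9, 9, 9],     # leaf server 3
]

def leaf_sum(shard):
    # In Dremel, each leaf computes this in parallel over local data.
    return sum(shard)

def root_sum(shards):
    # The root only merges small partial aggregates, never raw rows.
    return sum(leaf_sum(s) for s in shards)

print(root_sum(shards))   # 115
```

Because each level of the tree only passes small partial results upward, response time stays nearly flat as the number of leaves (and hence the dataset) grows.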
How does BigQuery score over the traditional MapReduce/Hadoop architecture?
Today the most common technology for storing large amounts of data is Hadoop, an open-source framework that stores data as files in HDFS (the Hadoop Distributed File System).
The stored data is then processed with MapReduce, a programming model that can be combined with tools like Hive and Pig to run SQL-like queries.
Some of the key differences between the two are:
| Main Differences | BigQuery | MapReduce/Hadoop |
|---|---|---|
| What is it exactly? | A query service for large datasets | A programming model for processing large datasets |
| Common use cases | Ad-hoc, trial-and-error interactive queries for quick analysis | Batch processing of large datasets for time-consuming data conversion or aggregation |
| Sample use cases | | |
| OLAP/BI use case | Yes | No |
| Data mining use case | Partially | Yes |
| Fast response time? | Yes | No (may take days) |
| Easy to use for non-programmers (analysts, tech support, etc.) | Yes | No (requires Hive etc.) |
| Programming complex data processing logic | No | Yes |
| Processing unstructured data | Partially (regular-expression matching) | Yes |
| Handling large results | No | Yes |
| Joining large tables | No | Yes |
So, which one is better?
MapReduce is unsuitable for ad-hoc searches and trial-and-error data analysis, but it works very well with unstructured data.
BigQuery, on the contrary, struggles with unstructured data but can easily run ad-hoc queries and analyses.
The point is not to prove that one is better than the other, but to show how both can be used together for a synergistic effect.
Why is BigQuery perfect for Analytics use?
1. Pay-per-use pricing – As it is very difficult to predict exact storage and compute requirements for analytics, most firms end up paying for more cloud capacity than they need, and so have a very poor resource-utilization ratio.
With a flexible, on-demand pricing model that charges customers only for the storage and queries they actually use, there is no need to forecast requirements. With resource utilization approaching 100%, firms get more value out of their investment.
2. Caching advantage – BigQuery keeps the results of your SQL queries in a cache for up to 24 hours. Run the same query again over unchanged data and you are charged nothing at all, and you save time as well.
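A minimal sketch of this caching behavior, assuming a simple in-process dictionary keyed by the query text (BigQuery's real cache is server-side and also checks that the underlying tables have not changed):

```python
import time

# Sketch of result caching with a 24-hour TTL: a repeated identical
# query is served from cache instead of rescanning the data.
CACHE_TTL_SECONDS = 24 * 60 * 60
_cache = {}

def run_query(sql, execute, now=time.time):
    entry = _cache.get(sql)
    if entry and now() - entry["at"] < CACHE_TTL_SECONDS:
        return entry["result"], True          # cache hit: free and instant
    result = execute(sql)                     # cache miss: actually run it
    _cache[sql] = {"result": result, "at": now()}
    return result, False

result, cached = run_query("SELECT 1", lambda sql: 1)
result, cached = run_query("SELECT 1", lambda sql: 1)
print(cached)   # True: the second, identical query hit the cache
```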
3. Batch ingestion – Usually, while batch ingestion is running, you cannot execute queries or other analytical functions because the same compute resources are in use.
BigQuery keeps compute separate from storage, so ingesting data of any size during the batch-ingestion phase does not compromise querying capacity.
4. Security – By default, BigQuery encrypts all data at rest and in transit. Analytics databases often lose 30–50% of their performance when proper encryption is added; that is never the case with BigQuery.
Business and Technology Use Cases
Real-time Inventory Management – Managing inventory has never been easier: any product can be tracked in real time for accurate updates on stock levels.
Queries on the availability of products or SKUs return accurate product-flow information within tens of seconds. Firms can therefore move beyond the operational issues of the supply chain and focus their sales efforts on customer experience.
IoT-driven Advanced Analytics – IoT sensors capture data at a very high rate, creating the need to analyze large volumes of data in a very short period of time. If the data is sufficiently structured, BigQuery can run ad-hoc queries and analytics on it in near real time.
Even if the data is unstructured, MapReduce can be used in conjunction with BigQuery to deliver near-real-time advanced analytics.
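One way to picture that combination is a MapReduce-style batch step that turns raw log lines into structured rows, which an interactive engine can then query ad hoc. The log format and field names here are made up for illustration:

```python
import re
from collections import defaultdict

# Raw, unstructured sensor/web logs (invented format).
raw_logs = [
    "2024-01-01 GET /cart user=alice",
    "2024-01-01 GET /home user=bob",
    "2024-01-02 GET /cart user=alice",
]

def map_phase(line):
    # Map: parse one unstructured line into a ((user, path), 1) pair.
    user = re.search(r"user=(\w+)", line).group(1)
    path = line.split()[2]
    return (user, path), 1

def reduce_phase(pairs):
    # Reduce: aggregate the mapped pairs into structured counts.
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return counts

structured = reduce_phase(map_phase(l) for l in raw_logs)
# The structured output is now the kind of table an interactive
# engine like BigQuery could query ad hoc:
print(structured[("alice", "/cart")])   # 2
```

The division of labor mirrors the article's point: the batch step does the heavy, schema-imposing conversion, and the interactive engine answers quick questions over the result.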
Public Datasets – Google makes a number of public datasets available in BigQuery. Small firms can easily integrate these datasets with their applications to run queries and analytics. Google pays the storage costs directly, and customers use the query services on a simple pay-per-query model.
Predictive Digital Marketing – Google Analytics 360 is now integrated with BigQuery, which allows ad-hoc analysis of user-generated logs at very large scale. This gives firms the opportunity to track customer activity and get an in-depth analysis in a matter of seconds.
By linking logs of customer activity across registration forms, shopping carts, and social media interactions, firms can answer questions like ‘Who are the customers most likely to purchase a particular product?’ or ‘How likely is a customer to churn within a week or a month?’
For many firms, BigQuery is already revolutionizing the way they handle data; others, which invested in alternative technologies, are quickly catching up.
With Dremel and BigQuery, Google is not aiming to cannibalize its core technologies like MapReduce or BigTable. Instead, BigQuery is here to complement MapReduce for solving some of the most difficult problems faced by developers and data scientists alike.
The only catch in the BigQuery implementation is that you must use Google Cloud Platform to store your data in the cloud. Considering the performance and prices GCP offers, firms will likely be more than willing to work with Google simply for the value delivered by BigQuery.