A few blog posts ago, we spoke about data warehouses and the potential benefits that they can have on your organization. Now that you know all about data warehouses, let's explore the different options that are out there so that you can pick the best warehouse for your use case.
Just as a quick refresher, a data warehouse is a data storage system that stores huge amounts of historical data from multiple sources, and acts as the source of truth for data analysis and reporting.
When picking your data warehouse provider, there are a few things you might want to consider:
Over the course of this blog post, we will talk about the 3 most popular cloud data warehouses and how they differ along multiple dimensions.
All of the solutions that we will discuss today are cloud-based, which will allow you to scale your warehouses as your data needs grow. Additionally, all the solutions that we will discuss today have historically been extremely reliable. Although no one is immune to outages, all of these data warehouses have high uptime.
Of the three data warehouses that we are discussing today, Snowflake is growing the fastest. Snowflake is a cloud data warehouse that can handle both structured and semi-structured data. Snowflake offers fully managed data warehouses with near-unlimited storage. Performance is highly scalable to ensure that your queries stay fast. Snowflake bills for compute usage and storage separately. Compute pricing is based on a credit system, with each credit translating back to compute seconds.
Amazon Redshift is a fully-managed cloud-data warehouse that handles structured data and semi-structured data. Users love Redshift as it integrates with ease into the Amazon Web Services (AWS) ecosystem. Of the warehouses we will discuss today, Redshift has the most customers. Redshift is loved by users who are working with huge amounts of data.
Google BigQuery is the cloud data warehouse component of the Google Cloud Platform (GCP) environment. Unlike Redshift, BigQuery is serverless, meaning that resources are allocated dynamically without upfront provisioning of hardware. BigQuery easily integrates with Google's machine learning tools, which is a huge advantage if you are using your data for artificial intelligence.
Each data warehouse supports different data types.
Snowflake supports the largest variety of data types, including JSON, XML, Avro, and Parquet and can store structured or semi-structured data. The high level of flexibility makes Snowflake great for users who do not have a predefined, rigid data structure.
Redshift only supports JSON data. Redshift recently added support for semi-structured data.
BigQuery supports both JSON and XML in structured and semi-structured formats.
All three warehouses use an ANSI compliant SQL syntax. There are some slight differences between each of them, but nothing that a quick Google search can't solve.
When you are selecting a data warehouse, you will likely want one that can scale, both in terms of performance and storage.
Snowflake separates compute power and storage so that you can easily scale up storage and compute resources independently. By separating storage and compute layers, it is easy for many concurrent users to run queries and work with your data. Snowflake has near unlimited storage capacity and will scale up automatically. Additionally, compute clusters can automatically spin up and down depending on usage, making Snowflake well positioned to handle high variations in demand. Automatic scaling makes operating your warehouse without a data administrator much easier.
Redshift performs well and can deliver fast queries on huge datasets, however, it does not scale as quickly as its competitors. Redshift scales horizontally and has the ability to scale significantly, however, resizing or changing your clusters can take some time. Redshift is capable of handling thousands of concurrent connection with many concurrent queries, but your warehouse needs to be configured to handle that load beforehand.
BigQuery separates storage and compute resources, which is great for maintaining high performance at lower costs. Separating compute and storage allow users to scale processing and memory independently, so users can scale the way they want. Users can also use automatic resource provisioning to scale up compute resources dynamically to handle larger loads.
Snowflake separates storage and compute resources, and prices for each independently. Storage is billed at a rate of $23-$40 per compressed terabyte per month depending on if you pay up front or on demand. Compute resources are priced in terms of credits. Credits are based upon compute time, but also take into account the size of the warehouse that you are operating (since 10 seconds on a small and large warehouse are not the same). Compute costs $0.00056 to $0.0011 per second per credit, depending on the plan that you select.
Although Snowflake bills by the second, the minimum time for a compute operation is 1 minute. As a result, even if you run a 1 second query, you will be charged for 1 minute - this can lead to huge costs if you run a lot of short queries. You can also save significantly by pre-purchasing credits.
Redshift bundles storage and compute resources together. This makes billing a lot simpler, however, it also forces users to scale up storage and compute resources together. Traditionally, Redshift was only billed on a per hour per instance model, no matter what the load was like. Recently, Amazon made it possible for users to manually pause and resume a cluster, allowing users with cyclical workloads to cut their costs. Pricing starts at $0.25 per hour and scales up from there.
Redshift also offers a managed storage system, where users can pay per volume of data. By prepaying for reserved instances, users can obtain significant savings as well.
BigQuery separates storage and compute costs like Snowflake. Google offers on-demand pricing, where users will be charged per queried byte processed, with the first 1 TB per month being free. Alternatively, uses can pre-purchase compute resources with flat-rate pricing to obtain discounts.
Storage costs are separated into active and long-term storage, where active storage includes any tables that have been modified in the last 90 days. After 90 days, storage costs from from $20 per TB to $10 per TB.
Snowflake, Redshift, and BigQuery are all very secure. Security for Snowflake depends on the cloud provider's features. Redshift requires encryption to be explicitly enabled. BigQuery encrypts data at rest and in transit by default. All three data warehouses meet most data protection standards (HIPAA, SOC 1, SOC 2, etc.)
Snowflake is the only warehouse can be run within multiple ecosystems. Snowflake can be deployed on AWS, GCP, or Microsoft Azure.
Amazon Redshift and Google BigQuery are limited to their own company's cloud ecosystem, Redshift only works with AWS and BigQuery with GCP.
Setting up and maintaining Snowflake requires little to no maintenance. Within the setup process, no size needs to be selected, however, users muse select their cloud provider. Snowflake automatically scales and provisions resources and no management is required from users.
Redshift requires the most maintenance of the three warehouses we are discussing today. The setup process involves sizing and assigning clusters as storage or compute and designing a data workflow to match resource size. Vacuuming and analyzing tables on occasion is also needed for maintenance, and Redshift does not self-optimize. Redshift is difficult to manage without an AWS architect.
BigQuery is quite similar to Snowflake in terms of setup, maintenance, and management. No sizing is required and little maintenance is needed from users.
Snowflake is great for handling regular usage with a lot of upscaling and downscaling. It is also an amazing option if you don't want to spend any time maintaining or manually scaling your warehouse.
Redshift is great for handling large-scale data with lots of concurrent users, especially if you need a query response quickly. Companies of this scale can often predict usage in advance, and you can get some big discounts from prepaying for usage.
BigQuery is great for handling spiky workloads and data mining / ML applications. The pricing model can make it cost effective for storing lots of data that isn't queried as often.
As you can probably tell from this blog post, it isn't really possible to say which data warehouse is the best. When you are making a decision, it is important determine which is best for your specific use case.