From your marketing efforts to your customer's purchase behaviour, if you're running a business in 2021 you're likely wrangling massive amounts of data from a variety of sources. Making sense of that data can be a daunting task, especially when you're trying to connect multiple sets of data together - how can you find out if your marketing is affecting the average cart size? Or how can you tell if your social media campaigns are increasing customer retention?
The general solution to these problems is simple: centralize your data. In researching how to centralize your data, you'll inevitably run across a few key terms: databases, data warehouses and data lakes. I'm going to explain what each of these means, the benefits they provide, and how your business can start using them
A database is an organized collection of data, structured to optimize for the creation, querying and storage of electronic information. To further divide databases, they fall into two main groups: relational databases, and non-relational databases.
Relational databases are generally what people envision when they think of databases, and have historically been the more commonly used type. They typically use SQL commands to write, query and read information in a database. Within the database itself, information is stored in a set of tables comprised of columns and rows. This allows information to be structured in a manner to make vertical scalability easy, and the ability to use SQL queries to manipulate data.
Non-relational databases (often referred to as NoSQL databases) have become more popular in recent years due to their relative performance speeds, and ability to remain more adaptive to changing schemas. They don't rely on structured tables and can remain flexible. Some different types of NoSQL databases are key-value, document-based, or graph-based, with each offering unique ways to keep your database organized.
A data warehouse shares many similarities to a database in terms of centralizing your information. If we think of a database like a book, then a data warehouse is more like a bookshelf - it allows businesses to connect multiple different databases and streams of information from different applications, all in one central location. In this way, you're able to centralize all the different platforms that you collect data from and build a more complete view of your business.
Beyond normal database concepts, for data warehouses you also need to understand the architecture and functionality that they offer.
The storage layer is fairly self-explanatory, it's the ability to hold all the data that is brought into the warehouse. With the book analogy, this would be like the actual bookshelf that has the capacity to hold books. If you get too many books you'll need to obtain a larger bookshelf, in the same way that you pay for greater storage in a data warehouse.
The compute layer is responsible for executing data processing tasks required for queries. This would be like you searching for a book on your bookshelf, or the energy required to organize your books by subject material.
The client services layer is comprised of any number of business intelligence, reporting or analysis tools that help you make sense of the data. In our analogy, these tools could be similar to you using Goodreads to identify which books on your shelf are the highest rated.
Data lakes are optimized for realtime data recording, where sources are continuously pouring in from "data streams" into a main "data lake". The data coming into the lake is generally untouched, and no data is filtered out initially. From this raw set of data, analysis and cleaning can be done further down the line if necessary.
The idea is that all information that is recorded should be captured in its entirety, and this lack of rigid structure allows for greater flexibility down the line. If we imagined a data warehouse like a bookshelf, think of a data lake as a giant box where you printed out anything you ever read - whether it was a news article, a textbook, or a tweet. These various data types give you a lossless version of all the information you've gathered, whereas the bookshelf is limited to the entries that fit a certain mold.
To help you answer this question, I'm going to ask you some questions about the use case you have in mind, which will allow you to form a better idea of which solution is correct for you.
If your primary purpose is to aggregate data for a single purpose like recording sales, then a database is the ideal solution for you. Databases are optimized for reading and writing data, and you'll be able to connect your database to many leading business intelligence or analysis tools later on.
If your primary purpose is to join together data you gathered from multiple softwares or sources and use the collective information to draw advanced correlations, then a data warehouse is the correct solution for you. The warehouse will allow you to transform and store information from all your different sources, and allow you to draw conclusions from those connections.
If your primary purpose is to monitor data in realtime, or perform analysis on largely unstructured data, then a data lake would be the optimal solution. The data lake allows you to maintain the largest degree of flexibility, and you don't run the risk of missing out on any crucial data, since you're capturing all data that's recorded.
If you're gathering data from just one source, don't overcomplicate things - just go with a database.
If you're collecting from multiple sources, then either a data warehouse or a data lake will be the ideal solution for you, depending on how these sources are structured. For sources feeding continuous data such as machine readings in a factory, a data lake would be more appropriate. If your sources are a variety of reporting softwares and databases, then a data warehouse would fit your needs best
Once the data is within your structure, how you intend to use or manipulate it can change which solution is best for you. If you're simply looking for a digital ledger to record transactions or a way to store records for basic queries, a database will be a great asset for you to develop.
If you hope to draw deeper insights from your data sources and connect various tools together to forecast, model, or visualize information, then a data warehouse is the ideal solution for you.
If you are trying to monitor data in realtime, or trying to do broad analysis over absolutely all your data (such as sentiment analysis on all customer communication), then a data lake is an excellent solution.
Get started with turning offline files into databases.