Should Spreadsheets and Databases be Mutually Exclusive?

Spreadsheets let you quickly build reports and perform lots of calculations quickly, but they don't work well when your dataset is really big. Databases on the other hand, can handle large datasets, but are harder to set up and use, and don't let you perform calculations using simple formulas. Do their usage and defining features have to be mutually exclusive?

Almost every business has data stored in Excel spreadsheets or CSV files. That's not only because data is originated in one of these formats, but also because data is exported/extracted from other systems and applications into them. We've used these formats for a long time now, but the volume of data we work with has increased significantly, yet the tools remains mostly the same. Excel was launched about 35 years ago in 1986-1987 and the CSV as a data format predates the personal computer, with the IBM Fortran compiler supporting it in 1972.

Input entries are separated by blanks or commas, with successive commas indicating values to be omitted. - IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information

Fortran guide — Screenshot of an IBM Fortran user guide, which makes reference to a CSV as a data format

Spreadsheets and CSVs are usually stored in file systems in your computer, SFTPs and email servers, and increasingly often backed up and synced to the cloud. They are stored as files, and while these files can be moved online, the underlying data is pretty much still “trapped” inside the files, even if they are stored in cloud drives. That's because it's still hard to analyze the data, to centralize it for reporting, or to automate business operations without further processing or technical help. To make this data useful, we still need to clean it, format it, structure it, and validate it.

On the one hand, spreadsheets like Excel, Google Sheets, or Airtable are simple to use, but they aren't built for large volumes of data or scale. On the other, tools that can scale and are built for volume are too complicated for most people to set up and use, and that’s especially true for an average spreadsheet user. There seems to be a gap between these two categories of tools.

Yet the problem isn't necessarily in the tools themselves. We have spreadsheets that are powerful, easy to use, and understand. We have databases and data warehouses that can handle scale, large amounts of data, and can even query a millions rows in a second or less.

So why do we still struggle with data tools?

One reason could be that we while we have tools for both sides of use-case extremes, each side continues to improve in areas that they are already good at. Spreadsheet developers continue making their spreadsheets better, adding more ways to filter, views, group, and pivot data. Database developers continue making their databases better, enhancing speed, reliability, performance, and security.

But what about those who outgrow their spreadsheets, but can't easily access databases?

A solution could be something that brings the power of databases to spreadsheet users. Or alternatively, something that elevates the features of spreadsheet into tools that have the desirable properties of databases for dealing with larger volumes of data. Today, for the most part, spreadsheets are for business users and databases are for developers. But there is an emerging category of users - a power business user. These are users who need to work with more data, who have a basic understanding of databases and even SQL, and want database features, but struggle to set one up and easily import data to it without support from a technical team. These power business users want to work with databases but can't easily do so, because databases aren't built with them in mind.

How could a solution look like?

One way to go about it could be to combine 3 concepts that almost everyone is familiar with: spreadsheets, file managers, and storage drives.

Spreadsheets: Spreadsheets are one of the most universal user interfaces. They are simple, easy to use and understand. You can add formulas, create tables, and filter all your data with a few clicks.
File managers: Most people who use computers are familiar with the concept of file managers. Mac, Windows, Linux all have them. They let you see all of the data in your hard drive. You can organize your data neatly into folders, and rename and delete files.
Hard drives: Hard drives are used to store files. You use a file manager to manage files in your hard drive.

The idea would be something like this: you have a file manager, but instead of managing files, it lets you manage the underlying data in your files. This data is stored in a hard drive and organized by tables. You can access your data through a spreadsheet interface that allows you to view, edit, filter, and clean it. You can drag and drop some data to this file system and you can use this data they way you use a database. Embedded in this system, there’s a an actual database and a corresponding database engine that could be optimized for more analytical workloads (a column oriented database). When you open a table through this file system, it’s actually querying the embedded database. This process is of course abstracted from the user. To the user, opening a table would be like opening up an Excel file, only that they are really opening a database table.

Challenges - what can actually be solved?

One of the surprising challenges of using a database, aside from setting it up, is to actually create tables and import data to it. If the data from the spreadsheet file is perfectly clean, properly formatted, and validated, it’s relatively easy. But most practical data is imperfect, messy, and poorly formatted. So while in theory, importing an Excel file to a database should be easy, in practice it is from super tedious to impossible depending on the user, including even more technical users. Why is this?

Take a look at the following Excel spreadsheet. Do you notice anything that’s out of place?

Common Excel file — A typical business Excel file

The gotcha answer is no, and that’s because a lot Excel spreadsheets actually do look like this. Just ask any of your friends in non-technical roles. If anything, this specific Excel is on the less crazy size of the spectrum. (Feel free to send me a screenshot of the craziest Excel file you’ve seen).

Importing this file to your database is not trivial, even if it should be or you expect it to be.

First, note that the column headers don’t start in the first row. To make it more complicated, the data of interest is not directly adjacent to the column header.

Next, assuming you want to preserve data types and not just import everything as string or text, there are a few additional issues with this file that prevent you from easily importing it to a database. In no particular order:

You’ll want your age column to be an integer column, but row 18 says “fifty” as a string and not an integer. Your database import will either completely fail the data import, stop when it reaches that row, or skip that row and continue, even though the data is not invalid.
The date formatting in the date_subscription column varies a lot. DD-MM-YYYY, YYYY/MM/DD, DD-MMM-YY, MMM D YYYY, etc. Additionally, the date in row 7 and row 10 are ambiguous dates. In the case of row 7, there is no month 14 or 30; in row 10, there is no day or month 100. If you want to keep it as a date or datetime column so you can filter it later or run on SQL on it, your database might not be able to ingest this unless you standardize the formatting first or at least remove/fix the ambiguous dates.
Depending on what you use this data for, some other columns might be problematic e.g. names are not properly capitalized and email formatting is off, but your database will still let you insert them, even though the data might be less useful without further processing.

Other challenges related to importing data is that spreadsheets allow users to create formulas (or pretty much code), pivot tables, and charts. In theory all of these can be recreated with apps/tools that sit on top of a database, but these concepts don’t translate 1-1 to a SQL database without additional work. You can’t just import a spreadsheet along with its formulas and automatically map that data to database tables and SQL statements. In theory it could be possible (or technically possible to a certain extent), but it wouldn’t be easy to do.

Finally, spreadsheets are meant to be “reactive”. By this I mean that changing a constant in one cell should update a calculation that depends on that constant in different cell. This applies not only with constant, cells, and calculations located in the same sheet, but also across sheets. For some use cases in certain industries, there could be spreadsheets with tens of sheets/tabs, thousands of rows, and millions of cell calculations altogether, times a thousand because that’s how many spreadsheets power the entire business. (This is not made up. I’ve seen this first hand when consulting for a huge pension fund). All of this this means that a solution would probably require a new kind of database that would look more like some hybrid between a transactional database, an analytical database, and maybe even a graph database - essentially a kind of database that can operate quickly with large amounts of data, and can still allow for high-performance reactivity, even with lots of calculations required.

There definitely seems to be a gap in between increasingly large spreadsheets and (analytical) databases, but I wonder - Is this a gap that could actually be closed and if so, would it make sense to?

Or should we leave spreadsheets as the ubiquitous, powerful, beautiful, but messy monster they are?

Company

Dropbase just reached 1,000 stars on Github!

July 29, 2024