Working with large CSV and Excel files causes headaches in any data pipeline. In this article, we'll explore five key issues with flat files and how Dropbase can help you solve them.
For some use cases, data only arrives as CSV files that a data entry team has to edit before ingestion. You could be an ecommerce company sifting through user reviews, an accounting firm working with billing information, or a team aggregating user-generated data from a mobile app. Manually editing large CSVs among team members, with its many, many workarounds, has simply become part of your data workflow.
Take, for example, a talent recruitment company that collaboratively edits data on a regular basis. They have to manually update job changes for certain leads and run a data quality team that flags and cleans the data. Ideally, they'd like to see which changes the data entry team requested and why those changes were made.
Doing this entire flow on CSVs alone is painful to imagine. You're used to having multiple data_new and data_new_new files sitting on your desktop after juggling files around with your teammates. But you can't shake the feeling that there must be a better solution out there.
Dropbase allows you to drop a CSV and share it with your team. Your team can then make edits in a spreadsheet-like interface, saving their change requests along the way. When you're ready to commit those changes, the owner/admin of the data just has to click approve. Then, all of your changes are executed in your database and stored safely away.
This saves you from keeping a comical number of copies of the same file and from merging data from different team members by hand. Dropbase empowers you to collaboratively edit your flat files and delight your customers.
CSV and Excel files make up a huge portion of most business data flows, but most cloud data warehouses don't work well with flat files out of the box. Google's BigQuery places many restrictions on how CSV files must be formatted. You then either have to upload the file each time using batch loading or figure out a way to host your CSV somewhere online and use their write API. Mode Analytics, a popular BI tool, has stopped natively supporting CSV uploads and only accepts connections to databases.
So it's annoying to figure out how to combine the data from your flat files with your online sources so that you can share the results with your team. Otherwise, the data is "locked" within those flat files and you'll have two different silos: online and offline sources.
Bringing your flat files online is a breeze in Dropbase. Check out this short video showing how:
If this CSV file comes at you regularly (e.g. daily, weekly, monthly), you can just drag and drop it again to append the new data to your Dropbase data tables.
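Under the hood, "drag and drop to append" amounts to loading each new file into the same database table. Here's a minimal sketch of that idea using pandas and SQLite; Dropbase handles this for you, and the table and column names here are purely illustrative.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for your cloud database

# Two "weekly uploads" of the same schema.
week1 = pd.DataFrame({"lead": ["Ana", "Bo"], "status": ["new", "contacted"]})
week2 = pd.DataFrame({"lead": ["Cy"], "status": ["new"]})

# The first upload creates the table; later uploads append to it.
week1.to_sql("leads", conn, index=False, if_exists="append")
week2.to_sql("leads", conn, index=False, if_exists="append")

total = pd.read_sql("SELECT COUNT(*) AS n FROM leads", conn)["n"][0]
print(total)  # 3 rows accumulated across both uploads
```

Because every upload lands in one table, downstream queries and dashboards keep working without anyone re-merging files by hand.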
Imagine you're a healthcare insurance company that deals with flat files. A lot of them. We're talking CSVs from hospitals, Excel files from some private clinics, and actual text files from a genial, geriatric dentist. You take the data that comes in regularly from these different customers; clean, edit, and validate it; and then drop the cleaned data off in your data warehouse. All of this requires collaboration among multiple team members: at least a data entry team and an ETL team. Every. Single. Day.
Manually doing this kind of data cleaning takes time away from more impactful things: generating insights, winning more customers, taking time to smell the roses, and so on.
In addition to pre-created steps, Dropbase pipelines can also include custom Python steps that can be applied to flat files sharing the same schema:
So instead of repeating the same cleaning steps over and over, a Dropbase pipeline only has to be created once, and you never have to clean the same data twice.
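To make the idea concrete, a custom Python step typically boils down to a function that takes a table in, cleans it, and returns it, so the same logic runs on every file with that schema. The exact hook Dropbase exposes may differ; the function and column names below are just an illustrative sketch.

```python
import pandas as pd

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reusable cleaning step for a user-review schema."""
    df = df.copy()
    # Coerce ratings to numbers; junk like "N/A" becomes NaN.
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    # Drop rows without a usable rating.
    df = df.dropna(subset=["rating"])
    # Trim stray whitespace from free-text fields.
    df["review"] = df["review"].str.strip()
    return df

raw = pd.DataFrame({"rating": ["5", "N/A", "3"],
                    "review": [" great ", "meh", "ok "]})
cleaned = clean_reviews(raw)
print(len(cleaned))  # 2 rows survive the cleaning
```

Each new file of the same shape gets passed through the same function, so the cleaning logic lives in one place instead of being redone by hand.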
You need many tricks up your sleeve to work effectively with large CSVs. A popular one is to not work with CSVs in the first place and instead convert them into a database. But maybe your data workflow is already entrenched in other practices, like using IDEs or code editors, version-tracking software, or an ad-hoc flagging system for your team.
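For reference, the convert-your-CSV-to-a-database trick mentioned above can be done in a few lines of standard-library Python: load the file once into SQLite, then run queries instead of re-scanning the CSV in a spreadsheet. The table name and data below are hypothetical stand-ins.

```python
import csv
import io
import sqlite3

# Stand-in for a large CSV file on disk.
csv_text = "id,weight_kg\n1,80.5\n2,79.9\n3,81.2\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (id INTEGER, weight_kg REAL)")

# csv.DictReader yields one dict per row, matching the named placeholders.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany("INSERT INTO weights VALUES (:id, :weight_kg)", list(reader))

# Ad-hoc questions become one query instead of a full-file scan.
(avg,) = conn.execute("SELECT AVG(weight_kg) FROM weights").fetchone()
print(round(avg, 2))  # 80.53
```

That's the same principle an instant-database tool builds on, minus the manual setup.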
For example, you could be the data team behind a personal weight loss mobile app. App users log what they've eaten, and the entries are uploaded to your server every day. You have some cleaning steps in your in-house data pipeline, but there are invariably thousands of rows that have to be dealt with manually. So your data quality team goes through those rows, and this is where things get s l o w. Opening the huge file, finding each line in question, asking your team members if anyone has already fixed the issue, and then trying to merge all the changes... this slows down your value offering, and you want a faster way to work with data.
You might already have an in-house data pipeline developed by your data engineering team. It does a fine job of importing all your data, aggregating it, and piping it to your BI tools. But if a slight change is required, like accommodating a format change in a supplier's inventory list, you end up waiting for the engineering team to fix it. Because the pipeline is so code-heavy, nobody on the business team can safely make changes to it, and the business team is paralyzed.
Dropbase gives you dozens of pre-made processing steps to apply to your pipelines, ensuring your data is at its best before it enters the database. Creating conditional columns, deleting columns, and adding text and numeric validation rules are a few examples. You can apply as many steps as you want and edit them later as well. You, and your business team, now have the power to update your pipeline instead of waiting on engineering.
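To give a feel for what those kinds of steps do to the data, here is the same sequence written out in pandas: delete a column, add a conditional column, and enforce a numeric validation rule. The step names and columns are illustrative, not Dropbase's exact UI.

```python
import pandas as pd

df = pd.DataFrame({
    "sku": ["A1", "B2", "C3"],
    "price": [19.99, -5.00, 42.00],
    "internal_note": ["x", "y", "z"],
})

# Step 1: delete a column you don't want in the database.
df = df.drop(columns=["internal_note"])

# Step 2: create a conditional column from an existing one.
df["tier"] = df["price"].apply(lambda p: "premium" if p >= 40 else "standard")

# Step 3: numeric validation rule, here requiring a positive price.
df = df[df["price"] > 0]

print(df["sku"].tolist())  # ['A1', 'C3'] pass validation
```

Chaining steps like these once, in order, is what a pipeline amounts to; a no-code tool just lets you configure the chain instead of writing it.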
There are tons of issues that arise when working with CSV files. They're error-prone, hard to combine, repetitive to clean, slow to edit, and sometimes paralyzing to update. Dropbase offers a solution to each of those issues with features like instant databases, pipelines, and cell editing. With all of these at your disposal, you can work quickly and effectively with CSVs to get the job done.