Data tips

How to effectively work with very large CSV and Excel files

Working with CSVs can be error-prone, hard to integrate with online sources, repetitive, slow, and paralyzing to update. Read on for solutions!

Working with large CSV and Excel files leads to many problems in any data pipeline. In this article, we're going to explore five different issues with flat files and how Dropbase can help you solve them.

1/5 Manually editing large CSVs leads to errors and bottlenecks

For some use cases, you can only receive data in CSV format, and a data entry team has to edit it before ingestion. You could be an ecommerce company that has to sift through user reviews, an accounting firm that works with billing information, or a team aggregating user-generated data from your mobile app. Manually editing large CSVs across team members, with all its workarounds, has simply become a part of your data workflow.

Take, for example, a talent recruitment company that collaboratively edits data on a regular basis. They have to manually update job changes for certain leads, and a data quality team adds flags and cleans the data. Ideally, they would like to see the changes requested by the data entry team and why those changes were made.

Doing this entire flow on CSVs alone is painful to imagine. You're used to having multiple data_new and data_new_new files sitting on your desktop after passing files back and forth with your teammates. But you can't shake the feeling that there must be a better solution out there.

Dropbase's Instant Databases with cell editing make collaboratively editing CSVs a breeze.

Dropbase allows you to drop a CSV and share it with your team. Your team can then make edits in a spreadsheet-like interface, saving their change requests along the way. When you're ready to commit those changes, the owner or admin of the data just has to click approve. Then all of the changes are executed in your database and stored safely away.
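Conceptually, that flow is "stage change requests, then apply them to the database in one transaction once approved." Here is a rough, purely illustrative sketch of that pattern in Python with SQLite; it is not Dropbase's implementation, and the table and column names are made up.

```python
# Illustrative "stage edits, apply on approval" pattern (not Dropbase's code).
# Table and column names are placeholders.
import sqlite3

# Each change request records which row and column to edit and the new value.
change_requests = [
    {"lead_id": 42, "column": "company", "new_value": "Acme Corp"},
    {"lead_id": 57, "column": "title", "new_value": "VP of Engineering"},
]

def approve_and_apply(requests, db_path="leads.db"):
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one transaction: all edits commit together, or none do
            for req in requests:
                col = req["column"]  # column names come from the app, not raw user input
                conn.execute(
                    f'UPDATE leads SET "{col}" = ? WHERE lead_id = ?',
                    (req["new_value"], req["lead_id"]),
                )
    finally:
        conn.close()

approve_and_apply(change_requests)
```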

This saves you from keeping a comical number of copies of the same file and from merging changes from different team members by hand. Dropbase empowers you to collaboratively edit your flat files and delight your customers.

2/5 It's frustrating to combine flat files with other online sources for analysis, sharing, and reporting

CSVs and Excel files make up a huge portion of most business data flows, but most cloud data warehouses don't work well with flat files out of the box. Google's BigQuery places many restrictions on how CSV files must be formatted. You then either have to upload the file each time using batch loading or figure out a way to load your CSV somewhere online and use BigQuery's write API. Mode Analytics, a popular BI tool, has stopped natively supporting CSV uploads and only accepts connections to databases.
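For a sense of what the batch-loading route involves, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and file names are placeholders, and it assumes credentials are already configured.

```python
# Minimal sketch: batch-loading a CSV into BigQuery every time a new file arrives.
# Project, dataset, table, and file names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,            # skip the header row
    autodetect=True,                # infer the schema from the file
    write_disposition="WRITE_APPEND",
)

with open("reviews_2022_01.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "my-project.my_dataset.reviews", job_config=job_config
    )

job.result()  # wait for the load job to finish
```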

So figuring out how to combine the data from your flat files with your online sources, and then share the results with your team, becomes a real hassle. Otherwise, the data stays "locked" inside those flat files and you end up with two separate silos: online and offline sources.

With Dropbase, combining flat files with online sources is a drag and drop away

Bringing your flat files online is a breeze in Dropbase. Check out this short video showing how:

If this CSV file comes at you regularly (e.g., daily, weekly, or monthly), then you can just drag and drop it again to append the new data with the help of Dropbase pipelines.
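For a sense of what that append amounts to, here is a rough hand-rolled equivalent using pandas and SQLAlchemy. The connection string, table name, and file name are placeholders for whatever your own setup uses.

```python
# Rough hand-rolled equivalent of appending a recurring, same-schema CSV to a table.
# The connection string, table name, and file name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/analytics")

new_batch = pd.read_csv("reviews_2022_02.csv")
new_batch.to_sql("reviews", engine, if_exists="append", index=False)
```

With Dropbase pipelines, the drag and drop replaces this script entirely.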

Your flat file is then uploaded to a Snowflake data warehouse where getting the access credentials is 2 clicks away:

We give you access to the Snowflake warehouse so you can use the downstream product of your choice and fully utilize your data without silos.
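Once you have those credentials, any downstream tool that can talk to Snowflake can query the data. For example, a quick sanity check from Python with the snowflake-connector-python package might look like the following; the account, user, database, and table names here are placeholders.

```python
# Quick sanity check against the Snowflake warehouse using placeholder credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # taken from the credentials screen
    user="my_user",
    password="my_password",
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM reviews")
print(cur.fetchone()[0])

cur.close()
conn.close()
```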

3/5 No one likes repetitive data cleaning

Imagine you're a health insurance company that deals with flat files. A lot of them. We're talking CSVs from hospitals, Excel files from private clinics, and actual text files from a genial, geriatric dentist. Data from these different customers comes in on a regular basis; you clean, edit, and validate it, then drop the cleaned data into your data warehouse. All of this requires collaboration among multiple team members: at least a data entry team and an ETL team. Every. Single. Day.

Cleaning data by hand like this takes time away from more impactful things: generating insights, getting more customers, taking time to smell the roses, and so on.

Dropbase pipelines automate data cleaning steps

In addition to pre-made steps, Dropbase pipelines can also include custom Python steps that can be applied to flat files of the same schema:
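As a hedged illustration (the exact step format is defined inside Dropbase), a custom Python step is essentially a function that takes a table with a known schema and returns a cleaned version of it. The column names below are invented for the example.

```python
# Illustrative custom cleaning step: a function over a DataFrame with a known schema.
# Column names ("claim_id", "clinic_name", "amount") are made up for this example.
import pandas as pd

def clean_claims(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="claim_id")
    df["clinic_name"] = df["clinic_name"].str.strip().str.title()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df.dropna(subset=["amount"])

# Because every incoming file shares the same schema, the same step can be
# reused on each new drop:
cleaned = clean_claims(pd.read_csv("hospital_claims_january.csv"))
```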

So instead of repeating the same cleaning steps by hand, a Dropbase pipeline only has to be created once, and you never have to clean the same data twice. For an in-depth view of pipelines, check out this article: Dropbase Pipelines and how they solve all your ETL problems.

4/5 Working with large CSVs gets s  l  o  w, really quickly

You need many tricks up your sleeve to work effectively with large CSVs. A popular one is to not work with CSVs in the first place and instead convert them into a database. But maybe your data workflow is already entrenched in other practices, like using IDEs or code editors, version-tracking software, or an ad-hoc flagging system for your team.

For example, you could be the data team behind a personal weight-loss mobile app. App users log what they've eaten, and the entries are uploaded to your server every day. You have some cleaning steps in your in-house data pipeline, but there are invariably thousands of rows that have to be dealt with personally. So your data quality team goes through those rows, and this is where things get s l o w. Opening that HUGE file, finding each line in question, asking your team members if anyone else has already fixed the issue, and then trying to merge all the changes... it slows down your value delivery, and you want a faster way to work with data.
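If you wanted to apply the convert-to-a-database trick yourself, it would usually look something like a chunked load into a local database, as sketched below; the file and table names are placeholders.

```python
# DIY version of "convert your CSV to a database": a chunked load into SQLite.
# File and table names are placeholders.
import sqlite3
import pandas as pd

conn = sqlite3.connect("food_logs.db")

# Stream the large CSV in 100k-row chunks so it never has to fit in memory at once.
for chunk in pd.read_csv("food_logs.csv", chunksize=100_000):
    chunk.to_sql("food_logs", conn, if_exists="append", index=False)

conn.close()
```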

Dropbase doesn't have an upload size limit for CSVs

Even if your CSV is hundreds of megabytes in size, all you have to do is drag and drop it into Dropbase. Your CSV is immediately saved and ready for editing. You no longer have to work with that one huge CSV file you dread opening every week.

5/5 Waiting for the technical team to make a tiny tweak in your data ingestion workflow takes too long

You might already have an in-house data pipeline developed by your data engineering team. It does a good job of importing all your data, aggregating it, and piping it to your BI tools. But when a slight change is required, like accommodating a format change in a supplier's inventory list, you end up waiting for the engineering team to fix it. Because the pipeline is so code-heavy, nobody on the business team can safely make changes to it, and the business team is left paralyzed.

Dropbase pipelines can be easily edited without any coding

Dropbase gives you dozens of pre-made processing steps to apply to your pipelines at no extra charge: creating conditional columns, deleting columns, and concatenating strings, to name a few. You can apply as many steps as you want and edit them later as well. You and your business team now have the power to update the pipeline instead of waiting on engineering.
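To give a sense of what those steps do, here are rough pandas equivalents of the three just named; in Dropbase they are configured as no-code steps rather than written out like this, and the inventory schema below is made up.

```python
# Pandas equivalents of three no-code steps: conditional column, delete column,
# and string concatenation. The inventory schema here is made up.
import pandas as pd

df = pd.read_csv("supplier_inventory.csv")

# Conditional column: flag low-stock items.
df["low_stock"] = df["quantity"] < 10

# Delete a column that downstream tools don't need.
df = df.drop(columns=["internal_notes"])

# Concatenate strings into a single display field.
df["label"] = df["sku"] + " - " + df["product_name"]
```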

Conclusion

There are plenty of issues with working with CSVs: it's error-prone, hard to combine with online sources, repetitive, slow, and sometimes paralyzing to update. Dropbase offers a solution to each of those issues with features like Instant Databases, pipelines, and cell editing. With all of these features at your disposal, you can work with CSVs quickly and effectively to get the job done.

January 12, 2022
by Narayan Subramoniam