We cover using Power Query in Excel for very large CSVs and its drawbacks, followed by a Bring Your Own Database solution and the ease of using Dropbase to solve all of your flat-file conundrums.
Sometimes, Excel's million-row limit is far below what you need for your work. You can also experience sluggishness, formatting errors, and random crashes well below that limit. So what can you do if you have a data file with close to, or more than, 1,048,576 rows? There are many possible solutions, each with its pros and cons. In this article, we'll cover an Excel trick for working with very large CSVs, then look at how to solve the real problem: cleaning the large file in the first place.
How to use Power Query to work with large CSV files
1. Locate your Text/CSV data source file. With this method, you cannot rename or move the source file once you have created your query, so make sure your source file is well named and somewhere you can reliably access it.
2. Open Excel and navigate to Data > Get & Transform Data > From Text/CSV. This opens a file browser where you can select your source file. Click Import.
3. Here's the first important bit: once the data preview window opens, click the small arrow beside Load to open a dropdown menu, then click Load To...
4. This will open an Import Data pop-up window. Since an Excel worksheet cannot hold the entire file, any option besides Table will work. Both PivotTable Report and PivotChart create Pivot summaries of your data, while the Only Create Connection option lets you query the data with Power Query.
5. Importantly, check the Add this data to the Data Model box if you plan on combining different tables in any way.
A Data Model in Excel is a set of tables that appears in a Field List. Adding tables to your Data Model allows you to build relationships between them, much like a relational database.
6. After a bit (or a lot) of loading, you should have all the rows loaded and ready for analysis.
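Before committing to the Power Query route, it can help to confirm that your file actually exceeds Excel's row limit. As a rough illustration, the sketch below streams a CSV row by row in Python so memory use stays flat even for huge files; the file name `demo.csv` and its contents are made up for the demo.

```python
import csv

EXCEL_ROW_LIMIT = 1_048_576  # per-worksheet row limit in modern Excel

def fits_in_excel(path: str) -> bool:
    """Stream the file line by line so memory use stays flat even for huge CSVs."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return rows <= EXCEL_ROW_LIMIT

# Demo with a small generated file (illustrative only).
with open("demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(1_000):
        writer.writerow([i, i * 2])

print(fits_in_excel("demo.csv"))  # a 1,001-line file easily fits
```

Because the reader never holds more than one row in memory, this check works the same on a ten-gigabyte export as on the toy file above.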
The drawbacks of using Excel
Using Excel's Get Data feature is great if you are fine with your underlying data staying as a CSV and want to do some quick n' dirty analyses in Excel. It works best when you only need to handle a large data file once. Beyond that, it falls short for a whole bunch of reasons:
Power Query does not do repeatability well. If you'd like to apply the same transformation steps to a newer source file, you have to either redo the steps from scratch or write a bit of M code to point the query to another file.
Using Power Query locks you into Excel. Your transformation steps exist only as a query on top of the original file, and you can't pipe the results to another tool. Quite a big drawback if you're part of a team!
Excel doesn't store an edited version of the source file. You may have noticed that all of your Power Query transformation steps only created a view of the transformed data; the underlying source file wasn't altered. That's fine if your incoming file is already clean and acts as a source of truth. But what if the file isn't clean or well formatted in the first place?
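For contrast with the repeatability drawback above, here is what a scripted, rerunnable pipeline looks like outside Excel. This is a minimal Python sketch, not Power Query's actual mechanism; the cleaning steps (trim whitespace, drop blank rows) and the file names are hypothetical stand-ins for your own transformations.

```python
import csv

def clean(rows):
    """Hypothetical pipeline steps: strip whitespace and drop blank rows.
    Substitute your own transformations here."""
    for row in rows:
        cleaned = [cell.strip() for cell in row]
        if any(cleaned):
            yield cleaned

def run_pipeline(src: str, dst: str) -> None:
    """Apply the same steps to any source file -- rerun it on every new export."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        csv.writer(fout).writerows(clean(csv.reader(fin)))

# Demo with a deliberately messy file (illustrative only).
with open("raw.csv", "w", newline="") as f:
    f.write("name , city \nAda,London\n,\n")

run_pipeline("raw.csv", "clean.csv")
print(open("clean.csv").read())
```

When next month's export arrives, one call to `run_pipeline` with the new file name reproduces every step, which is exactly what Power Query makes awkward.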
In short: Power Query doesn't do repeatability well, you're locked into Excel, and the source file can never be cleaned from within Excel for downstream connections.
If those drawbacks matter to you, you need to ingest your file into a database and clean it there. There are two ways to do this.
Setting up your own database
The first option is to set up your own database server. There are many, many decisions to make for this option.
The first decision is storage, for which you have three broad choices:
Personal server + database software: In this option, you have (or will have) a dedicated server onto which you install database software. You'll have the security of always knowing where your data is, but this is the most technically demanding and least scalable option.
Cloud server + database software: You can purchase or rent server space from providers such as Cloudware and install database software on it. This is definitely easier to set up than the previous option, but you'll incur ongoing costs.
Cloud database: By far the most popular and easiest option is to use a managed database from a familiar provider such as Amazon Web Services (AWS) or Microsoft Azure. You can find a list of the best ones over on Simplilearn.
The second decision is choosing database server software (if you need it). Each product suits a different level of scale and comes at a different price.
For the first two choices above, you have wide latitude in which server software you install. You could start with free software like MySQL and see if this option is for you.
For cloud databases, the software you have access to depends on the specific service provider, although you usually have many options.
Once you've made those decisions and set up your database, you can import your large source file and query it with SQL to wrangle the data.
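As a rough sketch of that import-then-query step, the example below streams a CSV into a table and then aggregates it with plain SQL. It uses Python's built-in sqlite3 module as a stand-in for whichever database you actually chose, and the file, table, and column names are illustrative.

```python
import csv
import sqlite3

def load_csv(conn: sqlite3.Connection, path: str, table: str) -> None:
    """Stream a CSV into a database table; the same approach works for
    million-row files because rows are fed in one at a time.
    (Table/column names come straight from the file -- trusted input only.)"""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f"CREATE TABLE {table} ({cols})")
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)

# Demo with a tiny made-up file (illustrative only).
with open("sales.csv", "w", newline="") as f:
    f.write("region,amount\nEast,100\nWest,250\nEast,50\n")

conn = sqlite3.connect(":memory:")
load_csv(conn, "sales.csv", "sales")

# Once loaded, wrangle the data with SQL instead of spreadsheet formulas.
total = conn.execute(
    "SELECT region, SUM(CAST(amount AS INTEGER)) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(total)  # [('East', 150), ('West', 250)]
```

A real server (MySQL, Postgres, a managed cloud database) swaps in via its own driver, but the load-then-query shape stays the same.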
Setting up your own database is best if you have the technical know-how to navigate the various options and execute on them. It's the right choice when setup time isn't a constraint and you want to learn database management along the way.
How to use Dropbase to work with large CSV files
But what if you want to work with your large CSVs now, without bothering your technical team? Or if you want to ensure that you never have to clean the same data twice? It's simple: try out Dropbase.
1. Drag and drop your CSV file into Dropbase
2. Apply any transformation steps and save them as pipelines
3. Click Load to Database
That's it! You can find access credentials for the database under Workspace Settings. With these credentials, you can connect your favourite BI tool to create data-driven insights, or download the file. Check out this video for a quick run-through of Dropbase. If you'd like to try out Dropbase's ability to work with large CSVs, sign up for a free trial here.
Sign up for a free trial of Dropbase if you'd like to try out these features and more!