As the need for accurate modelling and data-driven decisions grows, so does the number of organizations using machine learning. With this increased reliance on algorithms and machines to predict important business metrics comes significant risk. If the data you feed into an algorithm is flawed or uncleaned, you can get results that are far from accurate and dangerously misleading to your prediction efforts. To reduce the risk of creating a "Garbage In, Garbage Out" model, this article will share 7 key data cleaning steps that you should complete before applying datasets to machine learning and artificial intelligence models.
We'll cover two ways of completing these essential data cleaning steps: first, a general explanation of which Pandas function to use to clean the data, and second, how you can streamline the process using Dropbase's built-in and custom Python functions.
One of the most obvious things that can ruin the predictive validity of a dataset is outliers that don't belong in the set. Let's say, for example, that you're trying to predict the profitability of concerts, and the revenue figure for each concert is entered in the dataset in thousands of dollars ($637,000 of revenue is represented as 637). However, someone erroneously entered a number of concerts with their real dollar amounts. If you run the dataset with these values included, the profitability of certain factors might be overstated by a factor of hundreds.
To solve this issue, you can apply a standard formula based on the interquartile range of your data to identify outliers. Subtract the value at the 25th percentile (Q1) from the value at the 75th percentile (Q3); the result is your interquartile range (IQR). Any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.
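In Pandas, the IQR rule above can be applied with `quantile()` and a boolean mask. A minimal sketch, using made-up concert revenue figures where one row was mistakenly entered in raw dollars:

```python
import pandas as pd

# Hypothetical concert data: revenue is recorded in thousands of dollars,
# but one row was erroneously entered as a real dollar amount.
df = pd.DataFrame({"revenue": [637, 512, 480, 705, 590, 637000]})

q1 = df["revenue"].quantile(0.25)
q3 = df["revenue"].quantile(0.75)
iqr = q3 - q1

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = df["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]
```

After filtering, the erroneous 637000 row is gone while the correctly scaled rows remain.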
With Dropbase, the removal of outliers can be automatically applied with the click of a button. Simply select the column with outliers, and use the prebuilt outlier removal function. As you can see in the above example, the outliers in the revenue column are removed. Removing outliers ensures that your predictions are not skewed by recording anomalies, and can offer a more accurate view of trends across the dataset.
For all intents and purposes, columns that contain a single value are useless in the field of machine learning. Known as zero-variance predictors, these columns will not affect your models in any positive way, and may have unintended negative effects on your modelling attempts.
To remove these columns, you can either inspect them manually (if your dataset has a limited number of columns), or identify them programmatically if you want to streamline this task for the future. Dropbase supports both approaches.
Using the prebuilt "Delete Column" function, you can choose columns that you don't need in your dataset, and remove them. This step can also be saved as part of a data flow, which allows the steps to be run again with new data in the future.
You could also create a custom function in Dropbase to programmatically identify columns with only one unique entry, using the nunique() Pandas function. If you wish to do this, check out the function below:
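A sketch of such a function, using hypothetical column names, might look like the following. It checks each column's count of distinct values with `nunique()` and drops any column where that count is 1:

```python
import pandas as pd

# Hypothetical table where "country" holds a single repeated value.
df = pd.DataFrame({
    "concert_id": [1, 2, 3],
    "revenue": [637, 512, 705],
    "country": ["Canada", "Canada", "Canada"],
})

# Columns with only one unique value are zero-variance predictors.
single_value_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=single_value_cols)
```

Here the "country" column is removed because every row holds the same value, while the varying columns are kept.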
After running the function, the resulting table is seen below:
Oftentimes your dataset may come to you in a format that has dates stored as strings, or a numerical field stored as text. To properly apply certain data manipulations, you'll need to ensure that your data is stored as the right type.
Doing so is quite simple with Pandas: you can convert data from one type to another using the astype() Pandas function. Alternatively, with Dropbase, you can convert data types simply by selecting a column and changing its type.
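A brief sketch with made-up columns: `astype()` handles the string-to-integer case, while dates are more reliably parsed with `pd.to_datetime()` than with `astype()`:

```python
import pandas as pd

# Hypothetical data where numbers and dates arrive as strings.
df = pd.DataFrame({
    "tickets_sold": ["1200", "950", "1800"],
    "show_date": ["2021-06-01", "2021-06-08", "2021-06-15"],
})

# Convert the numeric column from strings to integers.
df["tickets_sold"] = df["tickets_sold"].astype(int)

# Dates are best parsed with pd.to_datetime for robust handling.
df["show_date"] = pd.to_datetime(df["show_date"])
```

Once converted, numeric operations (sums, means) and date arithmetic work as expected on these columns.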
Inevitably when you're dealing with massive training sets for machine learning and artificial intelligence, you'll have data that is incomplete. When this occurs, you generally have three options: leave it as is, fill the null values, or remove the null values.
If you leave the values as is, you risk harming the data model that you've created, and reducing the predictive validity of the model. If the data that's missing provides crucial insights into the results you're trying to predict, leaving it as is will most certainly lead to imperfect predictions. For this reason, we recommend you either fill or remove your null values.
If you can reasonably determine the value that should be inputted in the null cells, then this is the best solution. For example, if you know that a null value in the "score" column means that no score was recorded, you can simply replace the null with a 0. By doing this, you preserve the integrity of the dataset, and allow predictions about scores to be made accurately. To do this, use the fillna() Pandas function. You can either replace nulls with a static value, or fill them with a statistical value such as the column mean. Alternatively, with Dropbase, you can use the prebuilt Fill Nulls function to do this without needing to code anything!
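Both filling strategies can be sketched with `fillna()`. The column name here is hypothetical; the two new columns just show the static-value and mean-value approaches side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical "score" column where a null means no score was recorded.
df = pd.DataFrame({"score": [88.0, np.nan, 75.0, np.nan]})

# Fill with a static value (0 = "no score")...
df["score_zero"] = df["score"].fillna(0)

# ...or fill with the column mean instead.
df["score_mean"] = df["score"].fillna(df["score"].mean())
```

Which strategy fits depends on what a null actually means in your data: a true zero favours the static fill, while a missing measurement often favours the mean.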
If the data cannot be reasonably predicted, then your best option is to remove it from the dataset. By doing this, you ensure that only fully inputted data is being tested. To do this, you can use the dropna() Pandas function, which by default drops any row that contains at least one null value. With Dropbase, there is once again a prebuilt function that you can use, Drop Nulls, to execute this in a code-free manner. Just be sure to complete this step after you've checked that the values cannot be filled predictably. Otherwise, you run the risk of excluding every "score" of 0 (as in the last example), which would skew your predictions.
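A minimal sketch of the row-dropping approach, with hypothetical columns; note that `dropna()` removes rows by default, and would need `axis=1` to drop columns instead:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one incomplete row.
df = pd.DataFrame({
    "venue": ["A", "B", "C"],
    "score": [88.0, np.nan, 75.0],
})

# dropna() removes any ROW containing at least one null by default.
df = df.dropna()
```

Only the fully populated rows for venues A and C survive the drop.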
One of the best and most efficient ways to improve your datasets is through the use of regular expressions to format and clean your data. Perhaps you have a column with text strings like ("1 year", "5 years", "10 years"). You can't directly convert these to integers, but you know they would be far more useful in modelling relationships if you could extract just the numbers. Using regular expressions and the re Python library, you can use an expression like "\D" (which matches any non-digit character) to strip everything that isn't a number.
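In practice, Pandas' own string methods accept the same regular expressions, so the `re` module isn't strictly needed. A sketch with a hypothetical tenure column:

```python
import pandas as pd

# Hypothetical tenure column stored as free text.
df = pd.DataFrame({"tenure": ["1 year", "5 years", "10 years"]})

# Strip every non-digit character (\D), then convert to integers.
df["tenure_years"] = df["tenure"].str.replace(r"\D", "", regex=True).astype(int)
```

The resulting integer column can now feed directly into a model as a numeric feature.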
With Dropbase, you can skip the coding and work directly with regular expressions using the Find and Replace regular expression tool. This lets you quickly format your data in batches, so you can make better use of your information. To enter regular expression editing mode, create a find and replace step, then click the button circled in blue below. This will allow you to write regular expressions directly.
If you've aggregated your data from multiple sources, you also run the risk of having portions of your dataset overlap. Let's say you combined the last 3 months of sales data, but one week got recorded in two of the sets. If you ran the data as is, that week's sales would be double-counted, which is an unintended effect. To remedy this, use the drop_duplicates() Pandas function. This will drop any row that is identical to another across every column.
It's important to be aware of the data that you have in your dataset, however. If there is a valid reason why duplicate rows might be present, dropping duplicates won't improve your dataset; rather, it will degrade it by removing measurements that legitimately occur multiple times. To avoid this problem, include some sort of unique column, like a timestamp or user ID, so that repeated measurements still appear as unique rows.
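A short sketch of the deduplication step, with hypothetical sales data where one week landed in two source files:

```python
import pandas as pd

# Hypothetical sales data: week W2 was recorded in two source files.
df = pd.DataFrame({
    "week": ["W1", "W2", "W2", "W3"],
    "sales": [1000, 1500, 1500, 900],
})

# Drop rows that are identical across every column.
# A unique key column (e.g. a timestamp) would keep legitimate repeats,
# because the full rows would then differ.
df = df.drop_duplicates()
```

After deduplication, week W2 appears once, so its sales are no longer double-counted.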
If you don't want to learn how to use regular expressions, or if you only want to remove a couple of specific words, there are other ways to clean your data to make it better suited for artificial intelligence and machine learning. The replace() Pandas function will let you find the troublesome data and replace it with the intended data. This is useful if you have a column with values like ("Paid", "Not Paid") that you wish to replace with a binary 1 or 0 representation.
Using Dropbase, the built-in find and replace tool allows you to complete this action with ease, simply specifying which column to parse, what value to find, and what value to replace it with.
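The Pandas version of the ("Paid", "Not Paid") example can be sketched with a mapping dictionary passed to `replace()`; the column name is hypothetical:

```python
import pandas as pd

# Hypothetical payment status column to encode as binary.
df = pd.DataFrame({"status": ["Paid", "Not Paid", "Paid"]})

# Map each text value to its binary representation.
df["status"] = df["status"].replace({"Paid": 1, "Not Paid": 0})
```

Encoding categorical text as numbers this way is a common final step before handing the dataset to a model.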
And there you go! Seven essential ways to clean your datasets before applying machine learning and artificial intelligence. Whether you stick to using Pandas functions, or you try out Dropbase for free for your data cleaning and processing needs, I hope these techniques improve your ability to extract the maximum possible value from your datasets.