How to Deal with Missing Data?
A quick guide to help you deal with missing values in your next machine learning project.
When you work with real-world data in machine learning, you will often find that the dataset has missing values. Values can be missing for a variety of reasons, ranging from data entry failures to data collection issues. The point is that missing data needs to be handled!
Missing data can be a big hindrance during machine learning. Most machine learning algorithms do not accept missing values as input, so they need to be handled while preprocessing the dataset.
Suppose we have the following dataset.
The NaN values in the dataset are the missing values. If I have to implement a machine learning algorithm on this dataset, I will first need to handle these missing values.
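To experiment with the steps in this guide, one could build a small stand-in DataFrame. The sketch below is only an illustration: the column names Yearly Amount Spent, Premium Member and Buy come from this article, while Customer ID, Age, Time on App and every value are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data. Only the last three column names come from the article;
# the first three columns and all values are made up for illustration.
df = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4, 5],
    "Age": [25, 32, 47, 51, 38],
    "Time on App": [12.5, 9.0, 11.2, 8.4, 10.1],
    "Yearly Amount Spent": [5200.0, np.nan, 8100.0, 7600.0, np.nan],
    "Premium Member": ["Yes", "No", np.nan, "Yes", "No"],
    "Buy": ["No", "Yes", "Yes", "No", "Yes"],
})

# NaN marks the missing entries
print(df)
```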
There are two ways we can handle missing data –
- Removing the rows with missing values.
- Using Imputation.
How to Check If Your Dataset Has Missing Values?
Before we handle missing data, we need to check whether we have any in the first place and, if so, which columns contain it. You can perform a column-wise check for missing data by doing –
df.isna().sum()
You would see an output like this –
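The same check can be tried on a small made-up table. The sketch below uses hypothetical values and also computes the share of missing values per column with df.isna().mean(), which can help when deciding how to handle each column.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to demonstrate the check
df = pd.DataFrame({
    "Yearly Amount Spent": [5200.0, np.nan, 8100.0, 7600.0, np.nan],
    "Premium Member": ["Yes", "No", np.nan, "Yes", "No"],
    "Buy": ["No", "Yes", "Yes", "No", "Yes"],
})

missing_counts = df.isna().sum()      # number of missing values per column
missing_fraction = df.isna().mean()   # share of missing values per column

print(missing_counts)
print(missing_fraction)
```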
Handling Missing Data
Removing Rows With Missing Values
The first method you can employ while dealing with missing data is deleting rows with missing values from your dataframe. One way to do this is –
df.dropna(inplace=True)
Let us check for missing values in our dataset now.
df.isna().sum()
This method is not the best solution for handling missing data, though. It has one major drawback: loss of information. It works very poorly when a high percentage of the rows have missing values, for instance 30-40% of the whole dataset. So, I would recommend this method only if you have a very large dataset and only a small share of its rows contain missing values.
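Before dropping anything, it can help to count how many rows would be lost. The sketch below uses a hypothetical table; it also shows the subset parameter of dropna(), which restricts the check to chosen columns and may keep more rows.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to demonstrate the row loss
df = pd.DataFrame({
    "Yearly Amount Spent": [5200.0, np.nan, 8100.0, 7600.0, np.nan],
    "Premium Member": ["Yes", "No", np.nan, "Yes", "No"],
    "Buy": ["No", "Yes", "Yes", "No", "Yes"],
})

# Rows that contain at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())
print(f"{rows_with_missing} of {len(df)} rows would be dropped")

# Without inplace=True, dropna() returns a new DataFrame and leaves df untouched
df_complete = df.dropna()

# Drop a row only when a specific column is missing
df_partial = df.dropna(subset=["Yearly Amount Spent"])

print(len(df_complete), len(df_partial))
```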
Using Imputation
Statistical imputation is a popular approach to handling missing data. It involves using the values that are present in a column to calculate a statistic, then replacing all the missing values in that column with that statistic. The replacement value can be:
- The mean of the column
- The median of the column
- The mode of the column
- A simple constant value
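As a rough sketch, these four options correspond to the strategy values "mean", "median", "most_frequent" and "constant" of scikit-learn's SimpleImputer. The single-column array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing value
column = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# The value each statistical strategy would use as the replacement
learned = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(column)
    learned[strategy] = float(imputer.statistics_[0])

print(learned)

# The constant strategy uses fill_value instead of a computed statistic
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled = constant_imputer.fit_transform(column)
print(filled.ravel())
```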
Let us understand this with an example. Consider the above dataset (before we used the first method). Let us arrange the data into a feature matrix (X) and a target vector (y). The target vector y will have only the values of the Buy column.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
Your feature matrix X will look like –
Scikit-learn provides the SimpleImputer class in its impute module. Here is how we import it.
from sklearn.impute import SimpleImputer
Now, let us start with the missing values in the Yearly Amount Spent column. How about we fill all the missing values with the mean of the existing values in that column? To do this, we create an instance of the SimpleImputer class. By default, SimpleImputer treats NaN as the missing value. Since we want to replace the missing values with the average value of the Yearly Amount Spent column, we specify the imputation strategy (using the strategy parameter) as 'mean'. Here is how you do it.
yearly_amount_imputer = SimpleImputer(strategy='mean')
Then we fit yearly_amount_imputer (the object of the SimpleImputer class) to our data using the .fit() method. We pass X[:, 3:4] so that only the values of the Yearly Amount Spent column are taken into account. The .fit() method ignores the missing values in the argument we specified and computes the average of the values that are present. For the replacement, we call the .transform() method.
yearly_amount_imputer.fit(X[:, 3:4])
X[:, 3:4] = yearly_amount_imputer.transform(X[:, 3:4])
As you can see, the missing values in the Yearly Amount Spent column have been replaced with the average value of that column, i.e., 7230.0.
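To see what the two calls do, here is a minimal sketch on a made-up numeric column (the values are hypothetical and differ from the dataset above). After fitting, the learned mean is available in the statistics_ attribute, and transform() writes it into the missing positions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up "Yearly Amount Spent" values as a 2D column, with two missing entries
spent = np.array([[5000.0], [np.nan], [8000.0], [8000.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(spent)

# Mean of the observed values only: (5000 + 8000 + 8000) / 3
learned_mean = float(imputer.statistics_[0])
print(learned_mean)

# transform() returns a new array with the missing entries filled in
spent_filled = imputer.transform(spent)
print(spent_filled.ravel())
```

The imputer expects a 2D input, which is why the article slices with 3:4 rather than indexing a single column with 3.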
We can still see the missing values in the Premium Member column. Let us use an imputer to fill the missing values of that column with a constant value, 'missing'. We will need to specify our imputation strategy as 'constant' and specify 'missing' as the fill value of the imputer. Here is how you do this –
premium_member_imputer = SimpleImputer(strategy='constant', fill_value='missing')
premium_member_imputer.fit(X[:, -1:])
X[:, -1:] = premium_member_imputer.transform(X[:, -1:])
And as you can see, we have handled all our missing values.
Note:
- An alternative to the imputer is the .fillna() function of pandas. The above steps can be done as follows –
df['Yearly Amount Spent'] = df['Yearly Amount Spent'].fillna(df['Yearly Amount Spent'].mean())
df['Premium Member'] = df['Premium Member'].fillna('missing')
I would, however, encourage you not to use it if you are handling missing data for machine learning. One reason is that SimpleImputer can be faster than .fillna(). Using .fillna() is okay if your intention is data analysis only, but for machine learning, prefer the imputer. The second reason is mentioned in the next point.
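- Split your data into training and test data before you do imputation or any other preprocessing step, so that statistics such as the column mean are computed from the training data only and then reused on the test data. A very nice explanation of this is given at this link.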
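A minimal sketch of this order of operations, assuming a made-up single-feature dataset: the imputer is fitted on the training rows only, and the mean it learned there is reused to fill the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up feature column and target
X = np.array([[1.0], [np.nan], [3.0], [4.0], [np.nan], [6.0], [7.0], [8.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Split first, so the test rows do not influence the imputation statistic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # mean learned from training rows
X_test_filled = imputer.transform(X_test)        # same training mean applied to test rows

print(imputer.statistics_)
```

This is something .fillna() does not do on its own: the fitted imputer keeps the training statistic so it can be applied unchanged to new data.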
Conclusion
Thanks for reading! I hope you enjoyed this article and learnt from it. I would love to see your comments and feedback on this article. Let me know if you think I missed any other way we can handle missing values before using machine learning algorithms. You can connect with me on LinkedIn or Twitter.