How to Deal with Missing Data?

A quick guide to help you deal with missing values in your next machine learning project.

When working with real-world data in machine learning, you will often find that the dataset has missing values. They can arise for a variety of reasons, ranging from data entry failures to data collection issues. The point is that missing data needs to be handled!

Missing data can be a big hindrance in machine learning. Most machine learning algorithms do not support missing data, so it needs to be handled while preprocessing the dataset.

Suppose we have the following dataset.

dataset.png

The NaN values in the dataset are the missing values. If I have to implement a machine learning algorithm on this dataset, I will first need to handle these missing values.

There are two ways we can handle missing data –

  1. Removing the rows with missing values.
  2. Using Imputation.

How to Check If Your Dataset Has Missing Values?

Before we handle missing data, we need to check whether we have any in the first place and, if so, which columns contain it. You can perform a column-wise check for missing data with:

df.isna().sum()

You would see an output like this –

check_for_missing_values.png
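To make the check concrete, here is a minimal sketch on a small made-up DataFrame. The Age column and all the values are assumptions for illustration, not the dataset shown above; only the names Yearly Amount Spent, Premium Member and Buy come from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Yearly Amount Spent": [5000.0, np.nan, 8200.0, np.nan],
    "Premium Member": ["Yes", np.nan, "No", "Yes"],
    "Buy": ["No", "Yes", "Yes", "No"],
})

# Number of missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Share of missing values in each column, as a percentage
missing_percent = df.isna().mean() * 100
print(missing_percent)
```

Looking at the percentage as well as the raw count can help when deciding between removing rows and imputing.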

Handling Missing Data

Removing Rows With Missing Values

The first method you can employ while dealing with missing data is deleting rows with missing values from your dataframe. One way to do this is –

df.dropna(inplace=True)

Let us check for missing values in our dataset now.

df.isna().sum()

dropna_to_remove_missing_data.png

This method is not the best solution for handling missing data, though. It has one major drawback: loss of information. It works very poorly when a high percentage of rows have missing values, for instance 30-40% of the whole dataset. So, I would recommend this method only if you have a very large dataset and the rows with missing values make up a small share of it.
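One way to judge the cost before dropping anything is to count how many rows would be lost. The sketch below uses a made-up DataFrame (the values are assumptions) and calls dropna() without inplace=True so that the original data stays available.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Yearly Amount Spent": [5000.0, np.nan, 8200.0, 6100.0, 7400.0],
    "Premium Member": ["Yes", "No", np.nan, "Yes", "No"],
    "Buy": ["No", "Yes", "Yes", "No", "Yes"],
})

# Rows that contain at least one missing value
rows_with_missing = df.isna().any(axis=1).sum()
fraction_lost = rows_with_missing / len(df)
print(f"dropna() would remove {rows_with_missing} of {len(df)} rows ({fraction_lost:.0%})")

# Without inplace=True, dropna() returns a new DataFrame and leaves df untouched
df_complete = df.dropna()
```

If the reported share is large, imputation is probably the safer choice.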

Using Imputation

Statistical imputation is a popular approach to handling missing data. It involves using statistical methods to estimate the missing values in a column from the values that are present, then replacing all the missing values in that column with the calculated statistic. Common choices are:

  • The mean of the column
  • The median of the column
  • The mode of the column
  • A simple constant value
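As a quick worked example of these options, the sketch below computes each statistic on a made-up column with pandas. The numbers and the constant 0.0 are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
spent = pd.Series([4000.0, 7000.0, np.nan, 7000.0, 10000.0, 12000.0],
                  name="Yearly Amount Spent")

# pandas skips NaN by default when computing these statistics
mean_value = spent.mean()          # (4000 + 7000 + 7000 + 10000 + 12000) / 5
median_value = spent.median()      # middle of the five present values
mode_value = spent.mode().iloc[0]  # most frequent present value
constant_value = 0.0               # an arbitrary placeholder value

# Replacing the missing entry with the mean, purely to show the effect
filled = spent.fillna(mean_value)
print(mean_value, median_value, mode_value)
print(filled)
```

The mean and median only make sense for numeric columns, while the mode and a constant also work for text columns.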

Let us understand this with an example. Consider the above dataset (before we used the first method). Let us arrange the data into a feature matrix (X) and a target vector (y). The target vector y will have only the values of the Buy column.

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

Your feature matrix X will look like this:

feature_matrix_X.png

Scikit-learn provides the SimpleImputer class in its impute module. Here is how we import it.

from sklearn.impute import SimpleImputer

Now, let us start with the missing values in the Yearly Amount Spent column. How about we fill all the missing values with the mean of all the existing values in that column? To do this, we create an instance of the SimpleImputer class. By default, SimpleImputer treats NaN as the missing value and uses the mean as its strategy. Since we want to replace the missing values with the average value of the Yearly Amount Spent column, we state the imputation strategy explicitly (using the strategy parameter) as “mean”. Here is how you do it.

yearly_amount_imputer = SimpleImputer(strategy='mean')

Then we fit yearly_amount_imputer (the object of the SimpleImputer class) to our data using the .fit() method. We pass X[:, 3:4] so that only the Yearly Amount Spent column is considered; the slice 3:4 (rather than the index 3) keeps the data two-dimensional, which SimpleImputer expects. The .fit() method computes the average of the non-missing values in that column. For the replacement, we call the .transform() method.

yearly_amount_imputer.fit(X[:, 3:4])
X[:, 3:4] = yearly_amount_imputer.transform(X[:, 3:4])

SimpleImputer_example_1.png

And as you can see, our missing values in the Yearly Amount Spent column have been replaced with the average value of that column, i.e., 7230.0.
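If you want to see what the imputer actually learned, a self-contained sketch like the following can help. The single-column array is made up for illustration; statistics_ is the attribute where a fitted SimpleImputer keeps the value it computed for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data, shaped (n_samples, 1)
spent = np.array([[6000.0], [np.nan], [8000.0], [np.nan], [7000.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(spent)

# The mean of the non-missing values: (6000 + 8000 + 7000) / 3
print(imputer.statistics_)

# transform() returns a new array with every NaN replaced by that mean
filled = imputer.transform(spent)
print(filled)
```

When you fit and transform the same data, imputer.fit_transform(spent) does both steps in one call.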

We can still see the missing values in the Premium Member column. Let us use an imputer to fill the values of that column with a constant value - ‘missing’. We will need to specify our imputation strategy as ‘constant’ and specify ‘missing’ as the fill value of the imputer. Here is how you do this -

premium_member_imputer = SimpleImputer(strategy='constant', fill_value='missing')
premium_member_imputer.fit(X[:, -1:])
X[:, -1:] = premium_member_imputer.transform(X[:, -1:])

SimpleImputer_example_2.png

And as you can see, we have handled all our missing values.
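The mode from the list of statistics was not used above. As a rough sketch, here is how the constant strategy compares with strategy='most_frequent', which is the SimpleImputer name for the mode. The column values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text column with missing entries (object dtype)
premium = pd.DataFrame(
    {"Premium Member": ["Yes", np.nan, "No", "Yes", np.nan]}, dtype=object
)

# Replace missing entries with a fixed label
constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")
filled_constant = constant_imputer.fit_transform(premium)

# Replace missing entries with the most frequent value (the mode)
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(premium)

print(filled_constant[:, 0])
print(filled_mode[:, 0])
```

A separate 'missing' label keeps the fact that a value was absent visible to the model, whereas the mode blends those rows into the largest category.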

Note:

  • An alternative to SimpleImputer is the .fillna() method of pandas. The above steps can be done as follows:
    df['Yearly Amount Spent'] = df['Yearly Amount Spent'].fillna(df['Yearly Amount Spent'].mean())
    df['Premium Member'] = df['Premium Member'].fillna('missing')
    
    I would, however, encourage you not to use it if you are handling missing data for machine learning. One reason is that SimpleImputer stores the statistic it computes in .fit(), so exactly the same value can be applied to new data later. Using .fillna() is okay if your intention is data analysis only, but for machine learning, prefer the imputer. The second reason is mentioned in the next point.
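  • Split your data into training and test data before you do imputation or any other preprocessing step. Otherwise, statistics computed on the full dataset let information from the test rows leak into training. A very nice explanation of this is given at this link.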
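The two points in this note can be sketched together as follows: split first, fit the imputer on the training rows only, then reuse it on the test rows. The arrays, test_size and random_state are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature column and target
X = np.array([[6000.0], [np.nan], [8000.0], [7000.0],
              [np.nan], [9000.0], [5000.0], [7500.0]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Split before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")

# The mean is learned from the training rows only
X_train_filled = imputer.fit_transform(X_train)

# The same training mean is reused on the test rows
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
```

Calling .fillna() with a mean computed on the whole DataFrame would not give you this separation, which is why the imputer is the better fit here.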

Conclusion

Thanks for reading! I hope you enjoyed this article and learnt from it. I would love to see your comments and feedback on this article. Let me know if you think I missed any other way we can handle missing values before using machine learning algorithms. You can connect with me on LinkedIn or Twitter.