How to Find and Impute Missing Values in a Dataset

Datasets may have missing values, and this can cause problems for many machine learning algorithms. As such, it is good practice to identify and replace missing values for each column in your dataset prior to modeling your prediction task.

Find Missing Values in a Dataset

Finding missing values in a dataset is not very complicated. You just have to read your dataset as a pandas DataFrame, and all missing values appear as cells with the “value” “NaN”. For the data preprocessing it is essential to know how many values of a particular column are missing: if only a few samples are missing (for example 1%), you would simply delete these samples, but if a lot of values are missing (for example 50%), you might want to delete the whole column. Therefore, I created a handy Python function that shows how many values are missing in a dataset. The following Python code shows how I use the Boston House Prices dataset from the Kaggle website to find the missing values in the training data.

import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='find_input_missing_values/train_bostonhouseprices.csv',
    index_col="Id"
)

def find_missing_values(df):
    """
    find missing values in the dataframe
    return the features with missing values, the total number of missing values and the percentage of missing values
    """
    total = df.isnull().sum().sort_values(ascending=False) # compute the total number of missing values
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False) # compute the percentage of missing values
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) # add all information to one dataframe
    missing_data = missing_data[missing_data['Total']>0] # filter the dataframe to only the features with missing values
    return missing_data

df_missing_values = find_missing_values(df)
df_missing_values

"""
	Total	Percent
PoolQC	1453	0.995205
MiscFeature	1406	0.963014
Alley	1369	0.937671
Fence	1179	0.807534
FireplaceQu	690	0.472603
LotFrontage	259	0.177397
GarageYrBlt	81	0.055479
GarageCond	81	0.055479
GarageType	81	0.055479
GarageFinish	81	0.055479
GarageQual	81	0.055479
BsmtExposure	38	0.026027
BsmtFinType2	38	0.026027
BsmtCond	37	0.025342
BsmtQual	37	0.025342
BsmtFinType1	37	0.025342
MasVnrArea	8	0.005479
MasVnrType	8	0.005479
Electrical	1	0.000685
"""

From the table you can see that some features (PoolQC, MiscFeature, Alley, Fence, FireplaceQu, LotFrontage) have more than 15% missing data. I would suggest deleting these columns completely from the dataset. For all other columns you could delete the affected samples, or you could use a univariate or multivariate imputation strategy.

Remove Missing Values

The easiest way to get rid of missing values is to delete the columns or the samples that contain them. The following program code shows how to first delete the columns with more than 15% missing values and then delete the remaining samples with missing values from the dataset.

Do not drop rows of a Kaggle test dataset

You cannot delete rows from a Kaggle test dataset, because you need a prediction for every sample of the test dataset for the submission on the Kaggle website.

# drop features with missing values
df = df.drop(df_missing_values[df_missing_values.Percent > 0.15].index, axis=1)

# drop rows with missing values
df = df.dropna(axis=0)

Univariate Feature Imputation

In univariate feature imputation, only the non-missing values of a feature are used to compute the replacement for the missing values in that same feature. The replacement can be the mean or median for numeric features and the most frequent value for categorical features. In a machine learning project it is important to fit the imputer only on the values of the training set and to save the trained imputer. You will need the imputer later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live, to replace missing values in new data.
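As a minimal sketch of this workflow (df_train and df_test are hypothetical numeric DataFrames with the same columns), you can fit the imputer on the training data, save it with joblib, and later only transform the test data:

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer

# fit the imputer on the training data only
imputer = SimpleImputer(strategy="median")
imputer.fit(df_train)

# save the fitted imputer for the evaluation and the live system
joblib.dump(imputer, "median_imputer.joblib")

# later: load the saved imputer and only transform the test data
imputer = joblib.load("median_imputer.joblib")
X_test = imputer.transform(df_test)
df_test_tr = pd.DataFrame(X_test, columns=df_test.columns, index=df_test.index)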

The following example shows how to use the scikit-learn library to impute all numeric missing values with the median and all categorical missing values with the most frequent value.

def univariate_feature_imputation(df):

    from sklearn.impute import SimpleImputer

    # create one imputer per strategy: mean / median / most_frequent
    imputer_num = SimpleImputer(strategy="median")
    imputer_cat = SimpleImputer(strategy="most_frequent")

    # split the features into numeric and categorical features to apply different imputations
    df_num = df[list(df.describe())]
    df_cat = df[list(df.describe(include="O"))]

    # fit and transform the imputers on the selected features
    # (in a real project: fit on the training data only and just transform the test data)
    X_num = imputer_num.fit_transform(df_num)
    X_cat = imputer_cat.fit_transform(df_cat)

    # create the transformed DataFrames from the arrays, keeping the column names and index
    df_num_tr = pd.DataFrame(X_num, columns=df_num.columns, index=df_num.index)
    df_cat_tr = pd.DataFrame(X_cat, columns=df_cat.columns, index=df_cat.index)

    # join the numeric and categorical features back to one dataset
    df_tr = pd.concat([df_num_tr, df_cat_tr], axis=1)

    # test if the new dataset still contains missing values
    df_missing_values_tr = find_missing_values(df_tr)

    if df_missing_values_tr.empty:
        print("No Missing Values left")

    return df_tr

You can also add a little more complexity to the univariate feature imputation by first splitting the dataset into different subsets. For example, if some values of the sales price were missing, I could replace the missing values with the median of all sales prices. A more advanced solution would be to split the dataset into subsets by the year the house was built, because I assume that the building year has an influence on the sales price. This has the advantage that a missing sales price of an older house is imputed with the median of all older houses, and that the sales price of a new house is not part of that median computation.
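A minimal sketch of this subset-based imputation with pandas (assuming, hypothetically, that the column "SalePrice" contained missing values and "YearBuilt" is the grouping feature):

# fill each missing sales price with the median of the houses built in the same year
df["SalePrice"] = df.groupby("YearBuilt")["SalePrice"].transform(
    lambda s: s.fillna(s.median())
)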

Multivariate Feature Imputation

You can use multivariate imputation algorithms, which take the entire set of available features into account to estimate the missing values. Because they exploit the relations between features, they can often impute missing values, especially numeric ones, more accurately than a univariate strategy.
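scikit-learn provides two multivariate imputers that follow this idea, IterativeImputer (still marked as experimental) and KNNImputer. A minimal sketch for the numeric features of the dataset:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# select the numeric features only, both imputers work on numeric data
df_num = df[list(df.describe())]

# model every feature with missing values as a function of the other features
imputer_iter = IterativeImputer(max_iter=10, random_state=0)
X_iter = imputer_iter.fit_transform(df_num)

# alternative: fill a missing value from the k nearest neighbors of the sample
imputer_knn = KNNImputer(n_neighbors=5)
X_knn = imputer_knn.fit_transform(df_num)

df_num_tr = pd.DataFrame(X_iter, columns=df_num.columns, index=df_num.index)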

Find Similarities in the Dataset (few Missing Values)

If you have only a few missing values, the easiest option is to take a closer look at the affected samples and try to find similarities in their features. Which feature values do they share? Based on these shared features, you can derive a plausible value for the missing entry from the samples where the feature is present.

For example: Two passengers of the Titanic dataset have a missing port of embarkation. Both passengers have a “Pclass” of 1 and a “Fare” of 80. We create a boxplot with “Embarked” on the x-axis, “Fare” on the y-axis, separated by “Pclass”. From the boxplot we see that first-class passengers who embarked in Cherbourg had a median fare of 80, so “C” is a plausible value for the two missing entries.
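A sketch of this analysis with seaborn (df_titanic and the file name are assumptions; the column names follow the Kaggle Titanic dataset):

import pandas as pd
import seaborn as sns

df_titanic = pd.read_csv('titanic_train.csv')  # hypothetical file name

# compare the fare distribution per port of embarkation, split by passenger class
sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=df_titanic)

# the two passengers match the median fare of first-class passengers from Cherbourg ("C")
df_titanic["Embarked"] = df_titanic["Embarked"].fillna("C")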

Impute Missing Values via Machine Learning Algorithm

A more complex technique to impute missing values is to use a machine learning algorithm. The samples with the missing values are handled like the test dataset of a machine learning problem. If the missing values are numeric, you have to solve a regression problem; if the feature is a category with only two unique values, you have a binary classification problem; and if it has more than two unique classes, you solve a multi-class classification problem.

The handling of the data is independent of the feature that contains missing values and is described in the following steps:

  1. The first step is to drop all features that have too many missing values (I like to choose a threshold of around 15%).
  2. Drop all samples with missing values, except the ones that are only missing the feature we want to impute.
  3. If there are still features with missing values among the samples you want to impute, drop these features as well. If you are in a Kaggle competition, also drop the predicted variable, because this column will not be present when the algorithm is used to impute the missing values of the test set.
  4. Now we can define the input and output of the training and test dataset for the imputation:
    1. df_train contains all the samples without missing values:
      1. df_train_y (output of the training set) is the series from the training set with the feature that contains missing values.
      2. df_train_x (input of the training set) is the pandas DataFrame with all complete features, excluding the column of df_train_y.
    2. df_test contains all the samples with missing values:
      1. df_test_x (input of the test set) is the pandas DataFrame with all test samples and the same columns as df_train_x.
  5. After we have defined the input and output of the training and test dataset, we only have to apply a machine learning algorithm and predict the test output.
  6. This predicted output of the test dataset is then used to fill in the missing values of the feature.

The following Python script shows how to impute the missing values of “MasVnrType” with the KNN algorithm.

Find other examples in my GitHub repository

You will find examples for imputing categories with KNN, Logistic Regression in combination with OneVsRestClassifier, and SVC, as well as one example for imputing numeric values with a Random Forest Regressor in my GitHub repository.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

df = pd.read_csv(
    filepath_or_buffer='train_bostonhouseprices.csv',
    index_col="Id"
)

def find_missing_values(df):
    """
    find missing values in the dataframe
    return the features with missing values, the total number of missing values and the percentage of missing values
    """
    total = df.isnull().sum().sort_values(ascending=False) # compute the total number of missing values
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False) # compute the percentage of missing values
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) # add all information to one dataframe
    missing_data = missing_data[missing_data['Total']>0] # filter the dataframe to only the features with missing values
    return missing_data

df_missing_values = find_missing_values(df)

# drop features with missing values
df = df.drop(df_missing_values[df_missing_values.Percent > 0.15].index, axis=1)

# find the columns with missing values and store the column name as list
cols_nan = df.columns[df.isna().any()].tolist()

# remove the feature that we want to predict from the list
cols_nan.remove("MasVnrType")

# drop all other features with missing values so that only the missing values of MasVnrType remain in the dataset
df_tmp = df.drop(cols_nan, axis="columns")

# the test data are the samples with missing values
df_test = df_tmp.loc[(df['MasVnrType'].isnull())]  # null samples for feature
df_train = df_tmp.loc[(df['MasVnrType'].notnull())]  # known samples for feature

df_train_y = df_train['MasVnrType']
df_train_x = df_train.drop('MasVnrType', axis=1)
df_train_x = df_train_x.dropna(axis='columns') # drop the columns that still contain missing values
df_train_x = df_train_x.drop('SalePrice', axis=1) # drop the predicted variable of the Kaggle competition (see step 3)

df_test_x = df_test[df_train_x.columns] # the test dataset can only contain features from the training

print("Shape of df_train_x: " + str(df_train_x.shape))
print("Shape of df_test_x: " + str(df_test_x.shape))
print("Shape of df_train_y: " + str(df_train_y.shape))


# categorical labels have to be encoded with the label encoder
le = LabelEncoder()
df_train_y = le.fit_transform(df_train_y)

# use the KNN classification algorithm to fill the missing values
knn = KNeighborsClassifier(n_neighbors=5)

# define the categorical and numeric features for the column transformation
categorical_features = list(df_train_x.describe(include="O"))
numeric_features = list(df_train_x.describe())

col_transform = ColumnTransformer(
        transformers=[
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
            ('num', StandardScaler(), numeric_features)
        ]
    )

# define pipeline
pipeline = Pipeline(steps=[
    ('columnprep', col_transform),
    ('classification', knn)
])

# train the KNN Classifier
pipeline.fit(df_train_x, df_train_y)

# predict the missing values
df_test_y = pipeline.predict(df_test_x)

# transform the encoded predictions back into labels
df_test_y = le.inverse_transform(df_test_y)

# fill the missing values in the original dataset with the predicted values
df.loc[(df['MasVnrType'].isnull()), "MasVnrType"] = df_test_y

df_test_y
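
To check that the imputation worked, you can run find_missing_values on the dataset again; “MasVnrType” should no longer appear in the result:

find_missing_values(df)  # "MasVnrType" is no longer listed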
