MissForest Imputation: The Best Way to Handle Missing Data (Feature Engineering Techniques)

Prabhat Rawat
5 min read · Dec 29, 2020


Missing values are everywhere in real-world datasets, and the first thing one has to do after receiving data is feature engineering to deal with its imperfections. Imputing missing values is the very first step of feature engineering, as no further analysis can move forward without it.

Our motivation is to introduce a method of imputation that can handle any type of input data and makes as few assumptions as possible about the structural aspects of the data.

In this article we are going to learn about the best technique for handling missing values, “MissForest Imputation”, covering both the theory and a programming implementation.

The original paper, Stekhoven & Bühlmann (2011), evaluates performance with missing data introduced at rates of 10%, 20% and 30% into 10 publicly available datasets, comparing the accuracy of the imputations (in terms of normalized RMSE) against some of the most renowned and widely used imputation algorithms, such as KNNimpute and MICE.

At almost every rate of missingness, MissForest outperformed these other approaches, in some cases reducing the imputation error by more than 50%. The best part about this technique is that it doesn’t require tuning, because random forests are so effective at their default parameters.
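For reference, the normalized RMSE used for comparison can be sketched as follows (a minimal illustration, assuming `x_true` holds the original values and `x_imp` the imputed ones):

```python
import numpy as np

def nrmse(x_true, x_imp):
    """Normalized RMSE: root of the mean squared imputation error
    divided by the variance of the true values."""
    x_true = np.asarray(x_true, dtype=float)
    x_imp = np.asarray(x_imp, dtype=float)
    return float(np.sqrt(np.mean((x_true - x_imp) ** 2) / np.var(x_true)))
```

A value near 0 means near-perfect imputation; a value near 1 means the imputation does no better than always predicting the mean.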

How Does MissForest Imputation Work?

MissForest is a random forest imputation algorithm that works very well for both continuous and discrete attributes. It initially imputes all missing data using “mean/median/mode imputation” or “random sample imputation”; then, for each attribute with missing entries, MissForest fits a random forest on the observed part and predicts the missing part. The steps below describe the complete procedure:

Step 1: Identify the attributes with missing values in the dataset.

Step 2: Fill the null entries with mean/median/mode imputation or random sample imputation, and create an additional attribute that captures which entries in the data frame were originally null, as shown below.

Marking null values as “Predict” and the others as “Training”

Step 3: Create a random forest model, using the data points labelled “Training” for training and the data points labelled “Predict” as test data.

Step 4: After the model is trained, predict the output for the test data and replace the initial fill-in values at the respective data points (those labelled “Predict”) with these predictions.

Predicting output for missing values

Step 5: Repeat Steps 3 and 4 for a desired number of iterations, or until a defined stopping condition is met.
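The steps above can be sketched for a single numeric column with scikit-learn. This is a minimal illustration, not the article’s exact code; it assumes the other feature columns contain no missing values, and the function name `miss_forest_column` is ours:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def miss_forest_column(df, target, n_iter=10):
    """Sketch of Steps 2-5 for one numeric column `target`: fill nulls
    with the mean first, then repeatedly fit a random forest on the
    observed rows and re-predict the originally missing rows."""
    df = df.copy()
    missing = df[target].isna()                # rows labelled "Predict"
    features = df.columns.drop(target)
    df[target] = df[target].fillna(df[target].mean())      # Step 2
    for _ in range(n_iter):                                # Step 5
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        # Step 3: train on the observed ("Training") part
        model.fit(df.loc[~missing, features], df.loc[~missing, target])
        # Step 4: overwrite only the originally missing ("Predict") part
        df.loc[missing, target] = model.predict(df.loc[missing, features])
    return df
```

In the full algorithm this loop runs over every column with missing entries (with a classifier instead of a regressor for categorical columns), cycling through the columns each iteration.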

Programming Implementation

We have used Python to demonstrate this approach, with Google Colaboratory as the IDE. The demonstration was executed on this dataset.

The dataset is loaded with the famous Pandas library, which also reveals the null values present in it.

Dataset

In order to get a complete picture of the null entries in the dataset, we made a simple visualization. We also dropped the attributes with almost 80% null entries, as it’s nearly impossible to impute them meaningfully.
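A small helper in the same spirit (the function name and the exact threshold are illustrative, not the article’s code):

```python
import pandas as pd

def drop_mostly_null(df, threshold=0.8):
    """Drop attributes whose fraction of null entries is at or above
    `threshold`, since such columns are nearly impossible to impute."""
    null_frac = df.isna().mean()   # per-column fraction of null entries
    return df.drop(columns=null_frac[null_frac >= threshold].index)
```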

Null Values to handle

As discussed in the second step above, we first have to impute the null entries with mean/median/mode imputation and add labels to the data accordingly.

Here, “add_label” is a function used to label the data as discussed in the explanation above.

“mean_median_imputation” is used to initially impute mean or median values into the continuous attributes.

“mode_imputation” is used to initially impute mode values into the categorical attributes.
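The original helpers appear only as screenshots, so here is a minimal sketch of what functions with those names could look like (the signatures and in-place behaviour are our assumptions):

```python
import numpy as np
import pandas as pd

def add_label(df, col):
    """Mark each row "Predict" if `col` is null, else "Training" (Step 2)."""
    return np.where(df[col].isna(), "Predict", "Training")

def mean_median_imputation(df, col, strategy="median"):
    """Fill nulls in a continuous column with its mean or median (in place)."""
    fill = df[col].mean() if strategy == "mean" else df[col].median()
    df[col] = df[col].fillna(fill)
    return df

def mode_imputation(df, col):
    """Fill nulls in a categorical column with its most frequent value (in place)."""
    df[col] = df[col].fillna(df[col].mode()[0])
    return df
```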

Now we will create our main function implementing the MissForest algorithm with a random forest regressor or classifier.

Now we have to perform the imputation. We decided to run the algorithm iteratively 10 times (this can vary), storing the output of each round in a separate dictionary, as shown.
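Instead of a fixed 10 rounds, Step 5’s “defined condition” can be a convergence check: stop when successive imputations barely change. A minimal sketch, where the `rounds` list stands in for the per-round outputs stored in the dictionary (the helper name and tolerance are ours):

```python
import numpy as np

def change_between_rounds(prev, curr):
    """Mean squared change in the imputed values between two rounds."""
    prev, curr = np.asarray(prev, float), np.asarray(curr, float)
    return float(np.mean((prev - curr) ** 2))

# Illustrative: store each round's imputed values in a dictionary and
# stop once the round-to-round change falls below a tolerance.
results = {}
rounds = [[5.0, 9.0], [5.5, 9.4], [5.6, 9.41]]  # stand-in per-round outputs
for i, imputed in enumerate(rounds):
    results[i] = imputed
    if i > 0 and change_between_rounds(results[i - 1], results[i]) < 1e-2:
        break
```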

Performing Imputation

The imputation has completed successfully, and now it’s time to take a look at our output.

After Imputation

Let’s take a look at the data distributions:

Data After The Successful Imputation

Hurray, we did it!

Finally, we were able to implement MissForest imputation from scratch in Python, and it works very well compared to other imputation techniques.

Final Words

MissForest is a random forest imputation algorithm for missing data that can be applied to mixed data types (missing values in both numeric and categorical variables). It has excellent predictive power and handles high-dimensional data well.

But there are still some cons to this technique, such as imputation time, which grows with the number of observations, the number of predictors, and the number of predictors containing missing values. Also, it is an algorithm, not a model object you can store somewhere. This means it has to run each time missing data has to be imputed, which could be problematic in some production environments.

But overall this technique is amazing…

If you want the complete code shown above, it is available in my GitHub repository, and for any doubts please feel free to contact me by mail: ragvenderrawat@gmail.com

Thanks For Your Time, Have a nice Day :-)


Prabhat Rawat

Azure Data Engineer || ETL Developer || Deep Learning Enthusiast || Machine Learning Enthusiast || Python Programmer