
Data Preprocessing in Data Mining


What is data preprocessing:

Data is growing exponentially and arrives from many sources in many formats. Real-world (raw) data is usually too dirty to feed directly into a machine learning model: it may contain errors, missing values, noise and unstructured content. The raw data therefore has to be made understandable to the model before useful insights can be extracted.
Data preprocessing simply means converting raw data into a format that machines can easily work with.

Role of data mining in data preprocessing:

Data mining discovers hidden patterns in scattered data and extracts useful information, turning it into knowledge.
The extracted information reveals new patterns and relationships among entities.

Steps of data preprocessing:

Data preprocessing includes the following steps:

  1. Data cleaning
  2. Data integration
  3. Data reduction
  4. Data transformation

Data cleaning:

Data cleaning deals with missing values, noisy data and inconsistent data.

  1. Dealing with missing values:

Missing values can be dealt with in the following ways:

  • Removing the tuples whose label is missing (this may lead to data loss)
  • Filling in missing values manually (not recommended for larger datasets)
  • Using a standard value to replace the missing values (such as N/A or unknown)
  • Replacing missing values with a measure of central tendency of the attribute (mean, median or mode)

  2. Dealing with noisy data:

Noisy data is meaningless to machines: the feature values are incorrect, so they explain nothing about the feature itself.

It can be dealt with using the following approaches:

  • Binning: Sort the data and divide it into (equal-frequency) bins, then smooth each bin by replacing its values with the bin mean, the bin median, or the closest bin boundary
  • Regression: Fit a regression function to the data and use the fitted values to smooth it, since regression finds the best-fit line
  • Clustering: Group the data into clusters and remove outliers (values that lie too far from any cluster)
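
The binning approach above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the `prices` values and the choice of three bins are assumptions made for the example.

```python
import numpy as np

# Hypothetical noisy attribute values (assumed for illustration).
prices = np.array([15, 4, 21, 8, 28, 9, 25, 21, 34, 24, 26, 29])

# Sort the data and split it into 3 equal-frequency bins.
sorted_prices = np.sort(prices)
bins = np.array_split(sorted_prices, 3)

# Smoothing by bin means: every value in a bin becomes the bin mean.
smoothed_by_mean = [np.full(len(b), b.mean()) for b in bins]

def smooth_by_boundaries(b):
    """Replace each value with the closer of the two bin boundaries."""
    low, high = b[0], b[-1]
    return np.where(b - low <= high - b, low, high)

# Smoothing by bin boundaries.
smoothed_by_bounds = [smooth_by_boundaries(b) for b in bins]

for original, by_mean, by_bounds in zip(bins, smoothed_by_mean, smoothed_by_bounds):
    print(original, by_mean, by_bounds)
```

Smoothing by the bin median works the same way, with `np.median(b)` in place of `b.mean()`.
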
  3. Handling inconsistent data

Data integration:

Data integration combines data from multiple sources, such as data cubes and multiple databases, into a single data store.

  1. Data consolidation

The goal of data consolidation is to reduce the number of storage locations. Data from the sources is cleaned of errors and redundancies using extract, transform and load (ETL) technology and is stored in one location such as a data warehouse or data hub.
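
As a rough sketch of consolidation on a very small scale, two sources can be cleaned and combined into one table with pandas. The `crm` and `billing` tables and their columns are hypothetical, and a real ETL pipeline would involve far more than this.

```python
import pandas as pd

# Two hypothetical sources that describe the same customers.
crm = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Bob", "Cara"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "total_spent": [120.0, 80.5, 42.0],
})

# Transform: remove the redundant (duplicated) row from the first source.
crm_clean = crm.drop_duplicates()

# Load: combine both sources into one consolidated table.
# how="outer" keeps customers that appear in only one of the sources.
consolidated = crm_clean.merge(billing, on="customer_id", how="outer")
print(consolidated)
```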

  2. Data propagation

In data propagation, data is copied from one location (e.g. a data warehouse) to other locations (e.g. data marts). As data in the warehouse is continuously updated, the changes are propagated to the data marts synchronously or asynchronously.


  3. Data virtualization

Data virtualization combines data from multiple sources and presents it as a unified view to front-end applications and users (enterprise applications, web and mobile users, etc.), regardless of how the data is gathered and formatted or where it is located.

Data reduction:

Data reduction makes the data easier to analyze by reducing its volume while preserving its integrity. The reduced data should produce the same, or nearly the same, analytical results as the original data. Data can be reduced either by:

  • Reducing number of attributes
  • Reducing number of tuples

Data can be reduced in the following ways:

  1. Dimensionality reduction

Dimensionality reduction reduces the size of the data by filtering out irrelevant, redundant or weakly relevant attributes and keeping the most relevant ones, preserving as much of the original information as possible. PCA and wavelet transforms are two effective methods for dimensionality reduction.
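
A minimal PCA sketch with scikit-learn is shown below. The synthetic data is an assumption for illustration: three attributes, where the third is almost a copy of the first and therefore redundant.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 attributes.
# The third attribute is nearly identical to the first one.
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# Keep only the 2 directions that carry the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # 3 attributes reduced to 2
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```

Because one attribute is redundant, two components should retain almost all of the variance in this example.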

  2. Data cube aggregation:

Data cube aggregation presents the gathered information in summary form. Each dimension of the cube represents a characteristic of the database, such as the month or year of sales. A data cube stores information aggregated over multiple dimensions, and each cell holds one aggregate value. The resulting data is smaller in size, yet it keeps the information needed for the analysis.

Suppose the data consists of sales per quarter for the years 2017 and 2018, and we are interested in sales per year rather than per quarter. We can aggregate the quarterly sales to get yearly sales. The resulting data contains everything the yearly analysis needs.
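
The quarterly-to-yearly example can be sketched with a pandas `groupby`. The sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for 2017 and 2018.
sales = pd.DataFrame({
    "year": [2017] * 4 + [2018] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales": [200, 250, 300, 350, 220, 270, 310, 400],
})

# Aggregate away the quarter dimension: 8 rows become 2.
yearly = sales.groupby("year")["sales"].sum()
print(yearly)
```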

  3. Numerosity reduction:

The data is replaced by a smaller, estimated representation of itself, for example the parameters of a fitted model or a sample of the tuples.

  4. Data compression

  • The size of the data can be reduced by compressing it with an encoding mechanism. Reconstruction of the compressed data can be lossless or lossy.
  • If the original data can be recovered exactly after reconstruction, the reduction is lossless; otherwise it is lossy.

  5. Missing values ratio

A threshold is set, and all attributes whose ratio of missing values exceeds that threshold are removed.

  6. Filter low variance attributes:

All normalized attributes whose variance is below a threshold are removed.

  7. Filter high correlation attributes:

When two normalized attributes have a correlation coefficient higher than a set threshold, one of them is filtered out, since highly correlated attributes tend to carry similar information.
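
The three filters (missing values ratio, low variance, high correlation) can be sketched with pandas as follows. The data frame, the column names and the thresholds (0.5, 0.01 and 0.95) are all assumptions chosen for the example; suitable thresholds depend on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one mostly-missing, one constant
# and one redundant (perfectly correlated) attribute.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "age_months": [300, 384, 564, 612, 456, 348],   # 12 * age
    "constant": [1, 1, 1, 1, 1, 1],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 3.0, np.nan],
    "salary": [40, 52, 61, 58, 75, 43],
})

# 1) Missing values ratio: drop attributes with more than 50% missing.
missing_ratio = df.isna().mean()
df = df.loc[:, missing_ratio <= 0.5]

# 2) Low variance: min-max normalize, then drop attributes with variance < 0.01.
value_range = (df.max() - df.min()).replace(0, 1)   # avoid division by zero
normalized = (df - df.min()) / value_range
df = df.loc[:, normalized.var() >= 0.01]

# 3) High correlation: for each pair with |corr| > 0.95, drop the later column.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

print(list(reduced.columns))
```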

Data transformation:

  1. Smoothing: Noise is removed from the data
  2. Feature construction: New features are created from the given data
  3. Aggregation: Aggregation operations are applied to the data, e.g. weekly sales can be aggregated into monthly sales, and monthly sales can be summed into yearly sales (data cube aggregation)
  4. Normalization: Values are rescaled to a common scale, e.g. to the range -1.0 to 1.0 or 0.0 to 1.0 with min-max scaling, or to zero mean and unit variance with z-score normalization
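
The two normalization methods named above can be written out directly in NumPy. The `ages` values are hypothetical, and the z-score here uses the population standard deviation (NumPy's default), which is an assumption; some texts use the sample standard deviation instead.

```python
import numpy as np

# Hypothetical attribute values.
ages = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Min-max scaling to the range [0, 1]: (x - min) / (max - min)
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score normalization: (x - mean) / standard deviation
z_score = (ages - ages.mean()) / ages.std()

print(min_max)
print(z_score)
```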

Techniques in data mining:

  1. Association rule mining

Association rule mining aims to uncover associations and correlations between items and to discover patterns. It finds which items tend to co-occur in a transaction. It works on if-then statements: if a user buys this, then he is most likely to buy that as well.

Association rules are built by searching for frequent if-then patterns and are evaluated with two parameters, support and confidence. Support tells how often an itemset appears in the data, and confidence indicates how often the if-then statement is found to be true.

  • Market basket analysis is one of the best-known techniques of association rule mining. It helps retailers analyze which products customers tend to purchase together, e.g. a customer who buys diapers is likely to buy baby powder as well.
  • Items in stores can be placed on shelves using association rule mining.
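
Support and confidence can be computed by hand on a toy basket. The transactions below are invented for illustration, and the helper names `count`, `support` and `confidence` are hypothetical; this is only a sketch of the two measures, not a full rule-mining algorithm.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"diaper", "baby powder", "milk"},
    {"diaper", "baby powder"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "baby powder", "bread"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """How often 'if antecedent then consequent' holds when the antecedent occurs."""
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"diaper", "baby powder"})
rule_confidence = confidence({"diaper"}, {"baby powder"})
print(rule_support, rule_confidence)
```

Here the rule "if diaper then baby powder" appears in 3 of 5 baskets, and holds in 3 of the 4 baskets that contain a diaper.
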
  2. Classification

Classification is one of the most commonly used techniques in data mining. It is based on machine learning, assigns data to predefined categories, and predicts the category of unknown records using techniques such as decision trees and neural networks.

It has two parts:

  1. Building the model (training or learning phase)

In this phase, a classifier learns from training data with associated class labels. Classification algorithms such as decision trees and naive Bayes are applied so that the model learns from the given data. These algorithms derive classification rules that capture the relationships between the values of the predictors and the values of the target.

  2. Classification using the trained model (testing phase)

Once the model is trained, its accuracy is measured on test data by comparing the predicted labels with the actual labels. If the accuracy is satisfactory, the classifier is used to predict the category of new records whose class label is unknown.
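
The two phases can be sketched with scikit-learn. The choice of a decision tree, the bundled iris dataset and the 70/30 split are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example data: flower measurements (predictors) and species (class labels).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Training phase: the classifier learns from labelled training data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Testing phase: compare predicted labels with the actual labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

# If the accuracy is satisfactory, classify a new, unlabelled record.
new_record = [[5.1, 3.5, 1.4, 0.2]]
predicted = clf.predict(new_record)
print(predicted)
```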


Classification using data mining has the following applications:

  • Risk prediction

It can help bank officers predict whether loan applicants are risky or safe, given data such as their income and age.

  • Churn detection

It helps estimate the probability that a customer will quit a product or service.

  3. Clustering

Clustering partitions objects into groups and helps to find their similarities and differences. Objects within a cluster are very similar to each other but dissimilar to those in other clusters.

  • Customer segmentation

Clustering can help marketers reach the right audience by segmenting customers based on their purchase history, interests, browsing activity and money spent. The company can then target specific clusters of users in its campaigns.

  • Identifying crime localities

Insights about crime-prone areas can be gained using the city, the area of the crime and the type of crime.
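
A minimal customer-segmentation sketch using k-means from scikit-learn follows. The text does not prescribe a clustering algorithm, so k-means is an assumption here, as are the customer data and the choice of two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [money spent per year, number of visits].
customers = np.array([
    [200, 2], [220, 3], [250, 2],        # occasional shoppers
    [1500, 20], [1600, 22], [1550, 19],  # frequent shoppers
])

# Partition the customers into 2 segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # segment assigned to each customer
print(kmeans.cluster_centers_)  # the "average" customer of each segment
```

In practice the features would usually be scaled first, so that the attribute with the largest range does not dominate the distance.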

  4. Outlier detection or outlier mining:

An outlier is an observation that lies far away from the rest of the observations. It is also known as an anomaly (noise, deviation, exception). It is a sign that something unexpected has happened and requires special attention. Organizations can find the cause of anomalies in their data and resolve the issues to meet their business objectives.

Credit card transactions are a suitable example. A typical cardholder's transactions stay within a usual range, so if a transaction of a very large amount suddenly appears, the system will detect this spike.

  • Fraud detection
  • Health monitoring
  • Intrusion detection
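
The credit card example can be sketched with a simple interquartile range (IQR) rule. The IQR rule is one common choice and is an assumption here, since the text does not name a specific detection method; the transaction amounts are made up.

```python
import numpy as np

# Hypothetical transaction amounts for one cardholder.
amounts = np.array([25, 40, 32, 18, 55, 47, 30, 5000])

# Interquartile range of the amounts.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]

print(outliers)
```
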
  5. Regression analysis:

Regression aims to extract the relationship between variables, i.e. how the dependent variable changes as the independent variables change. It is generally used for prediction and forecasting.

  • House price prediction: Given a set of attributes such as house area, number of bedrooms, number of floors and locality (near a bay or ocean), the user can get the price of the house
  • Market forecasting
  • Salary forecasting
  • Health insurance cost prediction
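
A small house price sketch with linear regression from scikit-learn is shown below. The houses and the pricing formula used to generate the prices are invented so that the result is easy to check; real prices would not follow an exact linear rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [area in square metres, number of bedrooms].
X = np.array([[50, 1], [80, 2], [100, 3], [120, 3], [150, 4]], dtype=float)

# Made-up prices generated from an exact linear rule.
y = 1000 * X[:, 0] + 10000 * X[:, 1] + 20000

# Fit the relationship between the attributes and the price.
model = LinearRegression().fit(X, y)

# Predict the price of an unseen house: 110 square metres, 3 bedrooms.
predicted_price = model.predict(np.array([[110.0, 3.0]]))[0]
print(predicted_price)
```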

Other techniques:

  • Decision tree
  • Sequential patterns
  • Predictions

Preprocessing in Python:

Here are some data preprocessing steps in Python.

  1. Import libraries:

Import the necessary libraries.

  2. Import dataset and separate dependent and independent variables:

Import the dataset you want to work on, depending on the type of problem you want to solve (regression, classification, etc.), and separate the independent variables (features) from the dependent variable (target).
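
The first two steps might look like the sketch below. The dataset is a small hypothetical table built inline so that the example runs on its own; in practice it would typically be loaded from a file, for example with `pd.read_csv`. The column names and the assumption that the target is the last column are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset standing in for a file loaded from disk.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44, 27, 30, 38, np.nan, 35],
    "Salary": [72000, 48000, 54000, np.nan, 61000, 58000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Independent variables: every column except the last one.
X = df.iloc[:, :-1]

# Dependent variable: the last column.
y = df.iloc[:, -1]

print(X.shape, y.shape)
```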

  3. Explore the data:

Take a closer look at your data: what it looks like and what its attributes are.

  • Print the info of the data to check the number of missing values and the types of the attributes
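
A short exploration sketch with pandas, again on a hypothetical inline dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset (assumed for illustration).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44, 27, 30, 38, np.nan, 35],
    "Salary": [72000, 48000, 54000, np.nan, 61000, 58000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# First rows: a quick look at what the data looks like.
print(df.head())

# Column types and non-null counts.
df.info()

# Number of missing values per attribute.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```
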
  4. Dealing with missing values:

Missing values can be handled by dropping the rows that contain null values, or by dropping a column if more than 75% of its values are missing. Dropping rows is only advisable when the dataset has plenty of samples. Missing values can also be filled in with the mean, the median or the most frequent value of the column. SimpleImputer from sklearn can be used for this purpose.

  • With the mean strategy, missing values (NaN) are replaced by the column mean
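
A minimal sketch of mean imputation with `SimpleImputer`; the numeric values (age and salary columns) are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric attributes [age, salary] with missing entries.
X_num = np.array([
    [44, 72000],
    [27, 48000],
    [30, 54000],
    [38, np.nan],
    [np.nan, 61000],
    [35, 58000],
], dtype=float)

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X_num)

print(X_imputed)
```

Passing `strategy="median"` or `strategy="most_frequent"` selects the other fill options mentioned above.
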
  5. Dealing with categorical variables:

Some machine learning models cannot work directly on categorical variables, so these need to be converted into numerical data. If the output variable needs to be categorical, it can be converted back from numerical to categorical at the output.

A label encoder can be used to convert categorical values into numbers. Its drawback is that it assigns the labels 0, 1, 2 to the categories, and the model may assume that the categories are ordered, since 0 < 1 < 2. To resolve this issue, we use a one-hot encoder, which places 1 to mark the presence of a category and 0 to mark its absence.

  • With label encoding, categorical values (e.g. country names) are replaced by the labels 0, 1, 2
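
A label-encoding sketch with scikit-learn's `LabelEncoder`, using a hypothetical list of country names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical attribute.
countries = ["France", "Spain", "Germany", "Spain", "Germany", "France"]

encoder = LabelEncoder()
labels = encoder.fit_transform(countries)

# Classes are sorted alphabetically: France -> 0, Germany -> 1, Spain -> 2.
print(list(encoder.classes_))
print(labels)

# Convert numeric labels back to the original categories.
print(encoder.inverse_transform(labels))
```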

One-hot encoding:

  • For three country names, it creates three columns and places 1 in the column of the country that is present and 0 in the others
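
A one-hot encoding sketch with scikit-learn's `OneHotEncoder`, on the same hypothetical country names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical attribute.
countries = pd.DataFrame(
    {"Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"]}
)

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(countries).toarray()

# One column per category, in alphabetical order: France, Germany, Spain.
print(encoder.categories_)
print(one_hot)
```
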
  6. Split the data into train and test sets

The data is split into a train set and a test set, usually in a 70/30 or 80/20 ratio.
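
An 80/20 split with scikit-learn's `train_test_split` might look like this; the arrays are placeholder data and `random_state=0` is only there to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```
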
  7. Feature scaling

Feature scaling brings all features into a comparable range, such as 0 to 1, because features may have very different ranges of values, e.g. the age of employees versus their salary. MinMaxScaler or StandardScaler from sklearn can be used for this purpose.
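
A feature scaling sketch with both scalers follows. The age and salary values are hypothetical. Fitting the scaler on the train set only and reusing it on the test set is a common practice assumed here, not something the text specifies.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical [age, salary] features with very different ranges.
X_train = np.array(
    [[25, 30000], [35, 50000], [45, 70000], [55, 90000]], dtype=float
)
X_test = np.array([[30, 40000]], dtype=float)

# Min-max scaling: each training feature ends up in the range [0, 1].
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# Standardization: each training feature gets zero mean and unit variance.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_mm)
print(X_train_std)
```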


Summary:

  1. Preprocessing is crucial for any data analysis task
  2. Data mining plays a very important role, as it extracts hidden patterns from data and turns the extracted information into knowledge
  3. Data preprocessing has different steps, namely data cleaning, data integration, data reduction and data transformation, which convert the raw data into a machine-understandable format for further analysis
  4. Data mining techniques are used extensively for purposes such as classification, outlier detection, regression analysis and many more
  5. Preprocessing in Python makes it easier to clean the data using predefined libraries