What is Data preprocessing?
Data preprocessing: Data is growing exponentially, coming from multiple sources in multiple formats. Real-world raw data is too dirty to be fed directly into a machine learning model, as it may be erroneous, incomplete, noisy, or unstructured. It is therefore necessary to make the raw data understandable to the model in order to get useful insights.
Data preprocessing simply means converting raw data into a format that machines can easily understand.
Role of data mining in data pre-processing:
Data mining discovers hidden patterns in scattered data and extracts useful information, turning it into knowledge.
The extracted information reveals new patterns and relationships among entities. Data mining helps to explore and analyze large amounts of information, and it plays a big role in artificial intelligence.
Steps of data pre-processing:
It includes the following steps:
- Data cleaning
- Data integration
- Data reduction
- Data transformation
Data cleaning deals with missing values and with noisy and inconsistent data.
- Dealing with missing values:
Missing values can be dealt with in the following ways:
- Removing the tuples whose label is missing (this may lead to data loss)
- Filling in missing values manually (not recommended for larger datasets)
- Using a standard value to replace the missing values (such as N/A or unknown)
- Replacing missing values with a measure of central tendency (mean, median, or mode) of the attribute
- Normalizing noisy data
Noisy data is meaningless to machines: it carries incorrect feature values and so explains nothing about the feature itself.
It can be dealt with using the following approaches:
- Binning: Sort the data and divide it into equal-frequency bins, then smooth it by replacing each value with the mean or median of its bin, or with the closest bin boundary.
- Regression:
Fit the data to a regression function to smooth it, since regression aims to find the best-fit line.
- Clustering:
Group the data into clusters and remove the outliers (values that lie too far from the clusters).
- Handling inconsistent data
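The binning approach to noisy data described above can be sketched roughly as follows. The values and the choice of three bins are assumptions made only for illustration; the sketch shows smoothing by bin means and by bin boundaries on equal-frequency bins.

```python
import numpy as np

# Hypothetical noisy values, used only for illustration.
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Sort the data and divide it into three equal-frequency bins.
sorted_values = np.sort(values)
bins = np.array_split(sorted_values, 3)

# Smoothing by bin means: every value is replaced by the mean of its bin.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: every value is replaced by the closer
# of its bin's minimum and maximum.
smoothed_by_boundary = np.concatenate([
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
])

print(smoothed_by_mean)
print(smoothed_by_boundary)
```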
Data integration:
- Data consolidation
It combines data from multiple sources, such as data cubes or multiple databases, into a single data store. The goal of data consolidation is to reduce the number of storage locations. Data from these sources is cleaned of errors and redundancies using extract, transform, and load (ETL) technology and is stored in one location, such as a data warehouse or data hub.
In data propagation, data is transferred from one location (such as a data warehouse) to other locations (such as data marts). As the data in the warehouse is continuously updated, the changes are propagated to the data marts synchronously or asynchronously.
Data virtualization combines data from multiple sources and presents it as a unified view to front-end applications and users (enterprise applications, web and mobile users, etc.), regardless of how the data is gathered and formatted and where it is located.
Data reduction makes the data easier to analyze by reducing its volume while preserving its integrity. The reduced data should produce the same, or nearly the same, analytical results as the original data. Data can be reduced either by:
- Reducing number of attributes
- Reducing number of tuples
Data can be reduced in the following ways:
Dimensionality reduction reduces the size of the data by filtering out irrelevant, redundant, or weakly relevant attributes and keeping the most relevant ones, preserving as much of the information in the original data as possible. PCA and wavelet transforms are two effective methods for dimensionality reduction.
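As a minimal sketch of dimensionality reduction with PCA, assuming scikit-learn is available: the data below is synthetic, with a third attribute that is almost a copy of the first, so two components should retain nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples with 3 attributes, where the third
# attribute is almost a copy of the first (i.e. redundant).
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])

# Project the 3 attributes onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```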
Data cube aggregation:
It is the process of presenting gathered information in summary form. Each dimension of a cube represents a specific characteristic of the database, such as monthly or yearly sales. Data cubes store aggregated information across multiple dimensions, and each cell stores an aggregate data value. The resulting data is smaller in size without losing the information needed for the analysis.
Suppose we have data consisting of sales per quarter for the years 2017 and 2018, and we are interested in sales per year instead. We can aggregate the quarterly sales to get the yearly sales. The resulting data still contains all the information the analysis needs.
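The quarterly-to-yearly aggregation can be sketched with pandas as below. The sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales figures for 2017 and 2018.
sales = pd.DataFrame({
    "year":    [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [100, 120, 90, 150, 110, 130, 95, 160],
})

# Aggregate the eight quarterly rows into two yearly totals.
yearly_sales = sales.groupby("year")["sales"].sum()
print(yearly_sales)
```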
The data is replaced by an estimated, smaller representation.
- The size of the data can be reduced by compressing it with an encoding mechanism. Reconstruction of the compressed data can be lossy or lossless.
- If the original data can be recovered exactly after reconstruction, the reduction is lossless; otherwise it is lossy.
1-Missing values ratio
A threshold is set, and all attributes whose ratio of missing values exceeds that threshold are removed.
2-Filter low variance attributes:
All normalized attributes with a variance below a threshold are removed.
3-Filter high correlation attributes
All normalized attributes with a correlation coefficient above a set threshold are filtered out, as highly correlated attributes tend to carry similar information.
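The three attribute filters above could be chained roughly as follows. The column names and the thresholds (0.5, 0.01, and 0.95) are assumptions chosen for this synthetic example, not recommended values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical dataset with one mostly-missing, one constant,
# and one redundant attribute.
df = pd.DataFrame({
    "age": rng.uniform(20, 60, n),
    "income": rng.uniform(20_000, 90_000, n),
    "constant_flag": np.ones(n),
    "mostly_missing": np.full(n, np.nan),
})
df["income_copy"] = df["income"] * 1.01
df.loc[:9, "mostly_missing"] = 1.0  # only 10 of 200 values are present

# 1) Missing values ratio: drop attributes with more than 50% missing values.
missing_ratio = df.isna().mean()
kept = df.loc[:, missing_ratio <= 0.5]

# 2) Low variance filter: min-max normalize, then drop attributes
#    whose variance is below 0.01 (a zero range is replaced by 1).
value_range = (kept.max() - kept.min()).replace(0, 1)
normalized = (kept - kept.min()) / value_range
normalized = normalized.loc[:, normalized.var() >= 0.01]

# 3) High correlation filter: for each highly correlated pair,
#    drop the attribute that appears later.
corr = normalized.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = normalized.drop(columns=to_drop)

print(list(reduced.columns))
```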
Data transformation:
- Smoothing: Data is smoothed by removing noise
- Feature construction: New features are created from the given data
- Aggregation: Data is aggregated by applying aggregation operations, e.g. weekly sales can be aggregated into monthly sales, and monthly sales can be summed to get yearly sales (data cube aggregation)
- Normalization: Values are rescaled to a common scale, e.g. to the range -1.0 to 1.0 with min-max scaling, or to zero mean and unit variance with z-score normalization
Techniques in data mining:
- Association rule mining
- Association rule mining aims to uncover associations and correlations between items and to discover patterns. It finds which items occur together in a transaction. It works on if-then statements: if a user buys this, they are likely to buy that as well.
- Association rules are built by searching for frequent if-then patterns, using the parameters support and confidence. Support tells how often an itemset appears in the data, and confidence indicates how often the if-then statement is found to be true.
- Market basket analysis is one of the best-known techniques of association rule mining. It helps retailers analyze which products customers tend to purchase together: for example, a customer who buys diapers is likely to buy baby powder as well.
- Items in stores can be placed on shelves using association rule mining.
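Support and confidence for the diaper and baby powder rule can be computed by hand as in the sketch below. The transactions are hypothetical, the helper functions `support` and `confidence` are illustrative names, and support is expressed here as a fraction of transactions rather than a raw count.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"diapers", "baby powder", "milk"},
    {"diapers", "baby powder"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "baby powder", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule: if a customer buys diapers, then they also buy baby powder.
rule_support = support({"diapers", "baby powder"}, transactions)
rule_confidence = confidence({"diapers"}, {"baby powder"}, transactions)

print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```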
- Classification is one of the most commonly used data mining techniques based on machine learning. It classifies data into predefined categories and predicts unknown records using mathematical techniques such as decision trees and neural networks.
It has two parts:
Building the model (training or learning phase)
Classification algorithms such as decision trees or naive Bayes are applied so that the model learns from the given data. These algorithms apply classification rules to find the relationships between the values of the predictors and the values of the target.
Classification using the trained model (testing phase)
Once the model is trained, its accuracy is tested on test data by comparing the actual labels with the predicted labels. If the accuracy is satisfactory, the classifier is used to predict the category of new records whose class label is unknown.
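The two phases can be sketched with scikit-learn as below. The iris dataset, the decision tree, and the 70/30 split are assumptions picked for a compact example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset and hold out 30% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Training phase: the model learns from the labeled training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Testing phase: compare predicted labels with the actual labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```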
Classification using data mining has the following applications:
It can help bank officers predict whether loan applicants are risky or safe, given data such as their income and age.
It helps identify the probability that a customer will quit a product or service.
Clustering is used to partition objects into groups and helps find their similarities and differences. Objects within a cluster are very similar to each other but dissimilar to those in other clusters.
Clustering can help marketers reach the right audience by segmenting customers based on their purchase history, interests, browsing activity, and money spent. This helps a company target specific clusters of users in its campaigns.
Identifying crime localities
Specific insights about crime-prone areas can be gained using the city, the area of the crime, and the type of crime.
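The customer segmentation use case can be sketched with k-means as below. The two customer groups are synthetic, and k-means with two clusters is just one possible clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical customers described by [orders per year, money spent].
low_spenders = rng.normal(loc=[5, 200], scale=[1, 30], size=(50, 2))
high_spenders = rng.normal(loc=[40, 3000], scale=[5, 300], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# Partition the customers into two segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(segments)
```

In practice the attributes would usually be scaled first, so that money spent does not dominate the distance computation.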
Outlier detection or outlier mining:
- An outlier is an observation that lies far away from the rest of the observations. It is also known as an anomaly (noise, deviation, exception). It is a vital sign that something unexpected has happened and requires special attention. Organizations can find the causes of anomalies in their data and resolve the issues to meet their business objectives.
- Credit card transactions are a suitable example. A typical cardholder's transactions stay within a normal range; if a transaction of a very large amount suddenly deviates from that range, the system will detect the spike.
- Fraud detection
- Health monitoring
- Intrusion detection
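A simple way to flag the kind of spike described in the credit card example is the interquartile range (IQR) rule, sketched below. The amounts are invented, and the 1.5 multiplier is the conventional default rather than anything prescribed here.

```python
import numpy as np

# Hypothetical transaction amounts for one cardholder.
amounts = np.array([40, 55, 38, 60, 47, 52, 45, 5000, 58, 43])

# Values further than 1.5 * IQR outside the quartiles count as outliers.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)
```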
Regression aims to extract the relationship between variables, that is, how the dependent variable changes as the independent variables change. It is generally used for prediction and forecasting.
- House price prediction: Given a set of attributes such as house area, number of bedrooms, number of floors, and locality (near bay, ocean), the price of a house can be predicted
- Market forecasting
- Salary forecasting
- Health insurance cost prediction
- Decision tree
- Sequential patterns
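The house price prediction use case mentioned under regression can be sketched with a linear model as below. The houses and the pricing formula are synthetic, made exactly linear so the fit is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses described by [area in square feet, bedrooms].
X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]])

# Synthetic prices that follow an exact linear rule (illustration only).
y = 50_000 + 100 * X[:, 0] + 10_000 * X[:, 1]

# Fit the line (here a plane) that best explains price from the attributes.
model = LinearRegression().fit(X, y)

# Predict the price of an unseen house with 2000 sq ft and 3 bedrooms.
predicted_price = model.predict([[2000, 3]])[0]
print(round(predicted_price))
```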
Preprocessing in Python:
Here are some data preprocessing steps in Python.
Import dataset and separate dependent and independent variables:
Import the dataset you want to work on, and separate the dependent and independent variables depending on the type of problem you want to solve (regression, classification, etc.).
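A minimal sketch of this step with pandas follows. The CSV text, its column names, and the choice of `Purchased` as the dependent variable are all hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Independent variables (features) and dependent variable (target).
X = dataset.drop(columns="Purchased")
y = dataset["Purchased"]

print(X.shape, y.shape)
```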
Explore the data:
Take a closer look at your data: what it looks like and what its attributes are.
Print the info of the data to check the number of missing values and the types of the attributes.
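With pandas, this exploration might look like the sketch below. The small DataFrame is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with two missing values.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

print(dataset.head())   # first rows: what the data looks like
dataset.info()          # attribute types and non-null counts

# Number of missing values per attribute.
missing_counts = dataset.isna().sum()
print(missing_counts)
```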
Dealing with missing values:
Missing values can be dealt with by dropping the rows that contain null values, or by dropping a column if more than 75% of its values are missing. Dropping rows should only be used when the dataset has enough samples. Missing values can also be filled in with the mean, the median, or the most frequent value. For this purpose, use SimpleImputer from sklearn.
The following code can be used to deal with missing data:
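A minimal sketch of mean imputation with `SimpleImputer`, assuming a small numeric array of hypothetical age and salary values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical [age, salary] rows with two missing entries.
X_num = np.array([
    [44, 72000],
    [27, 48000],
    [30, np.nan],
    [np.nan, 61000],
], dtype=float)

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X_num)

print(X_imputed)
```

Passing `strategy="median"` or `strategy="most_frequent"` instead selects the other options mentioned above.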
Missing values (NaN) are replaced by the mean
Dealing with categorical variables:
Some machine learning models can't work directly on categorical variables, so these need to be converted into numerical data. If the output variable needs to be categorical, it can be converted back from numerical to categorical on output.
A label encoder can be used to convert categorical values into numbers. Its drawback is that it labels the categories 0, 1, 2, and the model may assume that the categories are ordered, since 0 < 1 < 2. To resolve this issue, we use a one-hot encoder, which places 1 to mark the presence of a category and 0 to mark its absence.
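Label encoding can be sketched with scikit-learn's `LabelEncoder` as below. The country list is a hypothetical example. Note that scikit-learn documents `LabelEncoder` as intended for target labels, so using it on a feature column is shown here only to illustrate the 0, 1, 2 assignment.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
countries = ["France", "Spain", "Germany", "Spain", "France"]

encoder = LabelEncoder()
country_codes = encoder.fit_transform(countries)

# Classes are sorted alphabetically: France=0, Germany=1, Spain=2.
print(list(encoder.classes_))
print(country_codes)

# The numeric codes can be converted back to the original categories.
decoded = encoder.inverse_transform(country_codes)
print(list(decoded))
```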
Categorical variables (country names) have been replaced by the categories 0, 1, 2
One hot encoding:
- It creates three columns for the three country names and places 1 in the column of the country that is present and 0 in the others
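A minimal sketch of one-hot encoding with scikit-learn's `OneHotEncoder`, using a hypothetical country column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one value per row.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
one_hot = encoder.fit_transform(countries).toarray()

# Column order follows the sorted categories: France, Germany, Spain.
print(encoder.categories_[0])
print(one_hot)
```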
Categorical features after one-hot encoding will look like this:
Split the data into train and test set
Split the data into a train set and a test set; a 70/30 or 80/20 ratio is commonly used.
The data can be split into train and test sets using the following code:
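A minimal sketch of an 80/20 split with scikit-learn's `train_test_split`. The arrays are placeholders, and `random_state` is fixed only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (10 rows, 2 columns) and labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```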
The dataset after being split into train and test sets will look like this:
Feature scaling brings all the data into a specific range, such as 0-1, because attributes may have different ranges of values; for example, the age and the salary of employees have very different ranges. We can use a min-max scaler or a standard scaler for this purpose.
It can be done in the following way:
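A minimal sketch with scikit-learn's `MinMaxScaler` and `StandardScaler`, using hypothetical age and salary values. The scalers are fitted on the training data only and then applied to the test data, which is a common convention rather than something stated above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical [age, salary] rows.
X_train = np.array([[25, 30000], [35, 50000], [45, 90000]], dtype=float)
X_test = np.array([[30, 60000]], dtype=float)

# Min-max scaling: each column of the training data is mapped to [0, 1].
min_max = MinMaxScaler()
X_train_minmax = min_max.fit_transform(X_train)
X_test_minmax = min_max.transform(X_test)

# Standard scaling: each column gets zero mean and unit variance.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_minmax)
print(X_train_std)
```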
Data after scaling will look like this:
- Preprocessing is crucial for data analysis tasks
- Data mining plays a very important role, as it extracts hidden patterns from data and turns the extracted information into knowledge
- Data preprocessing has different steps, such as data cleaning, data integration, data reduction, and data transformation, that convert the raw data into a machine-understandable format for further analysis
- Data mining techniques are used extensively for various purposes, such as classification, outlier detection, regression analysis, and many more
- Data preprocessing in Python makes it easier to clean the data using predefined libraries