Feature Selection and Extraction in Machine Learning
The "garbage in, garbage out" principle carries a lot of weight in machine learning: to get quality output, you need quality input. Quality input in machine learning refers to the selection of features, which depends heavily on the domain of the problem we are trying to solve. For small problems the feature selection process can be done manually, but when the number of features is greater than the number of instances, or the problem belongs to an unknown domain, feature selection becomes a tedious job.
Feature extraction techniques, such as the sparse autoencoder, help here. For textual data, feature extraction is not only the process of extracting a list of words but also of transforming those words into a form the classifier can accept, which is basically a vector of numerical values.
2 Machine Learning Techniques
Fifteen well-known techniques are discussed in this document: ten for feature selection and five for feature extraction.
2.1 Feature Selection
2.2 Balanced Accuracy Measure (ACC2)
In this technique the difference between the true positives (tp) and false positives (fp) of a feature is calculated. This works fine with a balanced dataset but does not perform well with an unbalanced one. The issue can be resolved by taking the absolute difference of the true positive rate (tpr) and false positive rate (fpr) instead of the raw difference.
Accuracy Measure = ACC = tp − fp
Balanced Accuracy Measure = ACC2 = |tpr − fpr|
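A minimal sketch of how ACC2 could be computed for every term of a binary document-term matrix; the function name, matrix layout, and toy numbers are illustrative assumptions, not part of the original text.

import numpy as np

def acc2_scores(X, y):
    # Balanced Accuracy Measure ACC2 = |tpr - fpr| for every term.
    # X: binary document-term matrix (rows = documents, columns = terms)
    # y: binary class labels (1 = positive class, 0 = negative class)
    X = np.asarray(X)
    y = np.asarray(y)
    pos = X[y == 1]
    neg = X[y == 0]
    tpr = pos.sum(axis=0) / max(len(pos), 1)   # fraction of positive documents containing each term
    fpr = neg.sum(axis=0) / max(len(neg), 1)   # fraction of negative documents containing each term
    return np.abs(tpr - fpr)

# toy data: 4 documents, 3 terms
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
print(acc2_scores(X, y))    # the higher the score, the more discriminative the term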

2.3 Normalized Difference Measure (NDM)
The Balanced Accuracy Measure (ACC2) treats diagonal and axis features equally, but NDM treats them differently, because according to NDM features near the diagonal are less important than features near an axis. For that purpose, in NDM the difference is normalized by the minimum of tpr and fpr, as shown in the equation below.
NDM = |tpr − fpr| / min(tpr, fpr)
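Continuing the ACC2 sketch above, NDM is a small variation; the epsilon guard is an assumption to keep the score defined when min(tpr, fpr) is zero.

import numpy as np

def ndm_scores(tpr, fpr, eps=1e-6):
    # |tpr - fpr| divided by the smaller of the two rates, so terms lying near an
    # axis (one rate close to zero) outrank terms lying near the diagonal
    return np.abs(tpr - fpr) / (np.minimum(tpr, fpr) + eps)

print(ndm_scores(np.array([0.8, 0.5]), np.array([0.05, 0.45])))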
2.4 Information Gain (IG)
The Information Gain technique works by measuring the information gained after the addition or removal of a term from the feature subset. If the information increases after the addition of a specific term, that term is kept in the feature set; otherwise it is discarded.
IG(t) = e(p, n) − [P_w · e(tp, fp) + P_w̄ · e(fn, tn)]
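A sketch of this computation from a single term's confusion counts, treating e as the binary entropy function, p and n as the positive and negative document counts, and P_w as the fraction of documents that contain the term; the numbers in the final call are made up for illustration.

import math

def entropy(a, b):
    # binary entropy of a split into a items of one kind and b of the other
    total = a + b
    if total == 0 or a == 0 or b == 0:
        return 0.0
    pa, pb = a / total, b / total
    return -pa * math.log2(pa) - pb * math.log2(pb)

def information_gain(tp, fp, fn, tn):
    p, n = tp + fn, fp + tn              # positive and negative documents
    total = p + n
    p_w = (tp + fp) / total              # probability that a document contains the term
    p_not_w = (fn + tn) / total          # probability that it does not
    return entropy(p, n) - (p_w * entropy(tp, fp) + p_not_w * entropy(fn, tn))

print(information_gain(tp=40, fp=5, fn=10, tn=45))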
2.5 Chi-Square
Chi-Square is another method used for feature selection with categorical features. We calculate the Chi-Square statistic of every feature with respect to the target variable and select the features with the best Chi-Square scores, since these scores indicate whether the feature is independent of the target variable or not.
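A sketch using scikit-learn's chi2 scorer on a term-count matrix; the toy corpus, labels, and k=2 are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the food was terrible", "the service was terrible", "the pasta was delicious"]
labels = [0, 0, 1]                            # 0 = negative review, 1 = positive review

X = CountVectorizer().fit_transform(docs)     # term counts per document
selector = SelectKBest(chi2, k=2).fit(X, labels)
print(selector.scores_)                       # one chi-square score per term in the vocabulary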
2.6 Odds Ratio
The odds ratio deals with the occurrence of a feature within a class. It gives high priority to features that occur in a specific class but not in the other classes; features that occur in multiple classes are discarded. It does not take irrelevant and redundant features into account.
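A hedged sketch of the odds-ratio score of one term from its confusion counts; the 0.5 correction added to each cell is an assumption to keep the score defined when a count is zero.

def odds_ratio(tp, fp, fn, tn):
    # OR = (tp * tn) / (fp * fn): the odds of seeing the term inside the class
    # divided by the odds of seeing it outside the class
    return ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))

print(odds_ratio(tp=30, fp=2, fn=20, tn=48))   # made-up counts; larger = stronger class association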

2.7 Distinguishing Feature Selector (DFS)
In DFS three probabilistic cases are considered for the selection of features, based on counting the occurrence of terms in the classes: terms present in many classes carry little distinguishing information and should be ranked low; terms that occur only rarely, even if confined to one class, are irrelevant and should also be ranked low; and terms that occur frequently in a single class and do not appear in the other classes should be ranked high. A sketch of one common scoring formula is given below.
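The section above does not spell out the formula, so the following is a sketch of one common DFS formulation, summing P(class | term) / (P(term absent | class) + P(term | other classes) + 1) over the classes; treat the exact form as an assumption.

import numpy as np

def dfs_scores(X, y):
    X = (np.asarray(X) > 0).astype(float)     # binary term presence per document
    y = np.asarray(y)
    doc_freq = X.sum(axis=0)                  # documents containing each term
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c, out_c = X[y == c], X[y != c]
        p_c_given_t = in_c.sum(axis=0) / np.maximum(doc_freq, 1)      # P(class | term)
        p_absent_given_c = 1 - in_c.sum(axis=0) / len(in_c)           # P(term absent | class)
        p_t_given_not_c = out_c.sum(axis=0) / max(len(out_c), 1)      # P(term | other classes)
        scores += p_c_given_t / (p_absent_given_c + p_t_given_not_c + 1)
    return scores

X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
y = [0, 0, 1, 1]
print(dfs_scores(X, y))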

2.8 Bi-Normal Separation (BNS)
Another way to rank features is Bi-Normal Separation. In this method the inverse cumulative distribution function of the standard normal, F⁻¹ (in other words, the z-score), is applied to tpr and fpr, and features are ranked by the absolute difference of the two resulting values.
BNS = |F⁻¹(tpr) − F⁻¹(fpr)|
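A sketch using SciPy's inverse normal CDF, norm.ppf; the clipping bounds are an assumption to keep the score finite when a rate is exactly 0 or 1.

import numpy as np
from scipy.stats import norm

def bns_scores(tpr, fpr, eps=1e-4):
    tpr = np.clip(tpr, eps, 1 - eps)          # avoid infinities at 0 and 1
    fpr = np.clip(fpr, eps, 1 - eps)
    return np.abs(norm.ppf(tpr) - norm.ppf(fpr))

print(bns_scores(np.array([0.8, 0.5]), np.array([0.1, 0.45])))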
2.9 Gini Index (GINI)
It is a mathematical technique in which the Gini index of a feature is calculated. The Gini index lies in the range 0 to 1. A value of 0 means that all elements belong to a single class, values approaching 1 mean that the elements are spread randomly across many classes, and a value of 0.5 indicates that the elements are equally distributed between two classes.
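A minimal sketch of the Gini index of a class-count distribution, computed as one minus the sum of squared class proportions; the counts passed in are illustrative.

def gini_index(class_counts):
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini_index([10, 0]))    # 0.0 -> all elements in one class
print(gini_index([5, 5]))     # 0.5 -> equally split between two classes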

2.10 Poisson Ratio (POIS)
The main idea is to fit a Poisson distribution for each feature. After that calculation we can decide whether the feature is effective or not: if the feature's observed distribution lies farther from the Poisson distribution, the feature is effective; otherwise it can be discarded.
2.11 Max-Min Ratio (MMR)
In the NDM technique, if the denominator min(tpr, fpr) is very low, features can be ranked wrongly. To resolve this issue, the NDM score is multiplied by the maximum of tpr and fpr, as shown in the equation below.
MMR = max(tpr, fpr) · |tpr − fpr| / min(tpr, fpr)
3 Feature Extraction
3.1 Bag of Words
Bag of words, also known as the unigram representation, is the simplest technique for feature extraction, where text is represented in vector form. The bag-of-words vocabulary contains all the words available in the text, with duplicate words written only once. You then iterate through the vocabulary and mark 1 if the word is present in the given text, otherwise mark 0, which represents absence.
example:
Texts:
T1: The food was terrible, I hated it. (7 words)
T2: The restaurant was very far away, I hated it. (9 words)
T3: The pasta was delicious, will come back again. (8 words)
Derived Corpus: The, food, was, terrible, I, hated, it, restaurant, very, far, away, pasta, delicious, will, come, back, again (17 words)
T1 Vector: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
T2 Vector: 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0
T3 Vector: 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1
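A minimal pure-Python sketch that reproduces the vectors above; the simple punctuation stripping and lowercase matching are assumptions about how the texts are tokenized.

texts = [
    "The food was terrible, I hated it.",
    "The restaurant was very far away, I hated it.",
    "The pasta was delicious, will come back again.",
]

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

# build the corpus vocabulary, keeping first-seen order and dropping duplicates
vocab = []
for t in texts:
    for w in tokenize(t):
        if w not in vocab:
            vocab.append(w)

# one binary vector per text: 1 if the vocabulary word occurs in it, otherwise 0
vectors = [[1 if w in tokenize(t) else 0 for w in vocab] for t in texts]
for v in vectors:
    print(v)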

3.2 TF-IDF
Words with higher frequency dominate in the Bag of Words technique, so a word that is not important in the domain but appears many times will be preferred over words that appear rarely yet matter for the domain. Bag of words does not handle that scenario, so TF-IDF captures it by counting words not only in the corresponding document but across all documents too. Term Frequency (TF) is the frequency of the word in the current document. Inverse Document Frequency (IDF) scores how rare the word is among all the documents.
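A sketch of the classic weighting tf · log(N / df); real implementations (for example scikit-learn's TfidfVectorizer) use slightly different smoothing, so treat this as illustrative.

import math

def tf_idf(term, document, documents):
    tf = document.count(term) / max(len(document), 1)        # term frequency in this document
    df = sum(1 for d in documents if term in d)              # number of documents containing the term
    idf = math.log(len(documents) / df) if df else 0.0       # rarer terms get a higher score
    return tf * idf

docs = [["the", "food", "was", "terrible", "i", "hated", "it"],
        ["the", "restaurant", "was", "very", "far", "away", "i", "hated", "it"],
        ["the", "pasta", "was", "delicious", "will", "come", "back", "again"]]
print(tf_idf("hated", docs[0], docs))    # moderately rare word -> positive score
print(tf_idf("the", docs[0], docs))      # appears in every document -> score 0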

3.3 Pearson
Pearson correlation, as the name suggests, is a technique for finding a linear relationship between variables. For textual data, for example, it can be beneficial for capturing context, because it can find the statistical relationship between words available in the corpus. The coefficient returns a value between -1 and 1, where -1 represents a full negative correlation, 1 a full positive correlation, and 0 no correlation. For example, you can use a Pearson correlation to examine whether increases in temperature at your production facility are related to decreasing thickness of your chocolate coating.
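A sketch using SciPy's pearsonr on the chocolate-coating example; the temperature and thickness numbers are made up for illustration.

from scipy.stats import pearsonr

temperature = [20, 22, 24, 26, 28, 30]            # degrees at the facility
thickness = [3.1, 3.0, 2.8, 2.7, 2.5, 2.3]        # coating thickness in mm

r, p_value = pearsonr(temperature, thickness)
print(r)    # close to -1: thickness decreases as temperature increases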

3.4 Spearman
The Spearman correlation coefficient is based on the rank values of each variable rather than the raw data, which means the variables are ranked first and the correlation is then calculated between the two rankings. For example, you can use a Spearman correlation to examine whether the order in which employees complete a test exercise is related to the number of months they have been employed.
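A sketch using SciPy's spearmanr on the employee example; the completion order and tenure numbers are made up.

from scipy.stats import spearmanr

completion_order = [1, 2, 3, 4, 5, 6]        # 1 = first to finish the test exercise
months_employed = [48, 36, 30, 24, 12, 6]

rho, p_value = spearmanr(completion_order, months_employed)
print(rho)    # -1.0: the earlier finishers are the longer-tenured employees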

3.5 Kendall Rank
The values calculated by the Spearman correlation tend to be larger, which can introduce errors into the calculation of standard deviation and covariance. The Kendall Rank technique is a non-parametric test that is used to measure the strength of association between two variables. The values calculated by Kendall Rank are considerably smaller than those of the Spearman correlation, which helps in complex calculations and lowers the probability of error.
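A sketch using SciPy's kendalltau on the same illustrative rankings as in the Spearman example.

from scipy.stats import kendalltau

completion_order = [1, 2, 3, 4, 5, 6]
months_employed = [48, 36, 30, 24, 12, 6]

tau, p_value = kendalltau(completion_order, months_employed)
print(tau)    # -1.0 for perfectly opposite rankings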

Conclusion
Data is not always available in the desired form, so data processing makes it possible to analyze and visualize it. In order to get good results, feature selection and extraction techniques are beneficial, as they help in selecting and extracting useful features.