Genomics in Machine Learning


Genomics is the study of all aspects of genomes, including DNA, RNA, and proteins. Genomic information is retrieved from the DNA sequence, which controls the functions of an organism. A DNA sequence is made up of four nucleotide bases (T, C, G, A). In humans, DNA is packaged into 23 pairs of chromosomes, and each chromosome contains segments called genes. Genes encode proteins. Machine learning models are applied to gene data to assess a patient's health, infer hereditary information, and estimate the risk of developing a particular disease. Genomic data combined with machine learning is also used in criminal identification, in security systems, and for the early prediction of disease. Let us look more closely at the role of genomics in machine learning.


Machine learning is a rapidly growing field of computer science. Many systems are now automated with machine learning and deep learning techniques. Machine learning is generally used to automate systems, handle large amounts of data, mine data, and extract knowledge from raw data. The volume of DNA sequence data is growing steadily, as the GenBank statistics [1] in Figure 1 show.

A DNA sequence can consist of thousands of nucleotides (T, C, G, A). As the amount of sequence information grows, it becomes impractical to extract gene knowledge manually. In addition, the proteins encoded by genes are very complex because they contain thousands of atoms and bonds, so a single protein can take millions of possible structures. Handling all of these aspects of genomics requires machine learning models. A machine learning model is trained on training data and then makes predictions about unseen proteins and genes. Models are usually evaluated by their accuracy, that is, how often their predictions are correct. Accuracy is an important measure, but it is equally important to check which type of model suits the data at hand.
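The train-then-evaluate loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the k-mer frequency encoding, the nearest-centroid classifier, and the toy sequences and labels (`gc_rich`, `at_rich`) are all made up for this example and are not taken from any real dataset or tool.

```python
from collections import Counter
from itertools import product

K = 2  # k-mer length; an arbitrary choice for this sketch
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # the 16 possible 2-mers


def kmer_features(seq, k=K):
    """Turn a DNA string into a vector of normalized k-mer frequencies."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in KMERS), 1)  # avoid division by zero
    return [counts[m] / total for m in KMERS]


def train_centroids(sequences, labels):
    """'Training' here means averaging the feature vectors of each class."""
    sums, sizes = {}, Counter(labels)
    for seq, label in zip(sequences, labels):
        acc = sums.setdefault(label, [0.0] * len(KMERS))
        for i, v in enumerate(kmer_features(seq)):
            acc[i] += v
    return {lab: [v / sizes[lab] for v in vec] for lab, vec in sums.items()}


def predict(centroids, seq):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    vec = kmer_features(seq)

    def dist(lab):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[lab]))

    return min(centroids, key=dist)


def accuracy(centroids, sequences, labels):
    """Fraction of held-out sequences whose predicted label matches the true one."""
    hits = sum(predict(centroids, s) == y for s, y in zip(sequences, labels))
    return hits / len(labels)


# Made-up toy data; the labels are hypothetical.
train_x = ["GCGCGGCC", "CCGGGCGC", "ATATTAAT", "TTAATATA"]
train_y = ["gc_rich", "gc_rich", "at_rich", "at_rich"]
test_x = ["GGCCGCGC", "AATTATAT"]  # unseen sequences, kept out of training
test_y = ["gc_rich", "at_rich"]

model = train_centroids(train_x, train_y)
print(accuracy(model, test_x, test_y))
```

Accuracy on two toy sequences says little by itself, which is the point made above: the score has to be read together with the question of whether the chosen model fits the data.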

Figure 1 Sequence growth rate

Modeling Techniques

Genomic data is usually analyzed with three modeling techniques: frequency-based, network-based, and machine-learning-based modeling. In frequency-based (FB) modeling, occurring events are counted, and the most frequent event is considered the most significant. The FB technique is generally used to find the key genes that play an important role in a specific disease.
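The counting idea behind FB modeling can be sketched as follows. The choice of mutations as the counted event, the patient IDs, and the gene names are placeholders invented for this example. Published tools apply statistical corrections that a raw count lacks, and this sketch does not reproduce any of them.

```python
from collections import Counter

# Hypothetical mutation calls: one (patient_id, gene) pair per observed mutation.
# The gene names are placeholders, not results from any real cohort.
mutations = [
    ("p1", "GENE_A"), ("p1", "GENE_B"),
    ("p2", "GENE_A"), ("p2", "GENE_C"),
    ("p3", "GENE_A"), ("p3", "GENE_B"),
    ("p4", "GENE_D"),
]


def rank_genes_by_frequency(calls):
    """Count how many distinct patients carry a mutation in each gene."""
    # De-duplicate so a patient with several mutations in one gene counts once.
    per_patient = {(patient, gene) for patient, gene in calls}
    counts = Counter(gene for _, gene in per_patient)
    # Sort by count (descending), then by name so ties have a fixed order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_genes_by_frequency(mutations)
for gene, n_patients in ranking:
    print(f"{gene}\t{n_patients}")
```

The gene at the top of the ranking is the candidate key gene in this simplified view.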

In FB modeling, genes in the sequence data that are highly mutated (changed) are considered key genes. GISTIC [2], OncodriveCLUST [3], and MutSigCV [4] are frequency-based tools for finding key genes.

The second technique, network-based (NB) modeling, builds a network of all genes. In a gene network, two connected genes share similar features. NB models are used to predict both diseases and key genes. All diseases and genes are mapped onto a network, and related diseases and genes are connected to each other. As a result, the network predicts which similar diseases and genes are likely to be involved after one disease has developed.
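A minimal sketch of the network idea, assuming a simple disease-gene graph in which two diseases count as related when they share at least one gene. The associations, the names, and the shared-gene scoring rule are all hypothetical choices for illustration, not the method of any specific NB tool.

```python
from collections import defaultdict

# Hypothetical gene-disease associations (placeholder names, not real findings).
associations = [
    ("disease_X", "GENE_A"), ("disease_X", "GENE_B"),
    ("disease_Y", "GENE_B"), ("disease_Y", "GENE_C"),
    ("disease_Z", "GENE_D"),
]


def build_network(pairs):
    """Build an undirected graph linking each disease to its associated genes."""
    graph = defaultdict(set)
    for disease, gene in pairs:
        graph[disease].add(gene)
        graph[gene].add(disease)
    return graph


def related_diseases(graph, disease):
    """Rank other diseases by the number of genes they share with `disease`."""
    shared = defaultdict(int)
    for gene in graph[disease]:
        for other in graph[gene]:
            if other != disease:
                shared[other] += 1
    # Most shared genes first; ties broken by name for a fixed order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


network = build_network(associations)
print(related_diseases(network, "disease_X"))
```

Here `disease_Y` is reported as related to `disease_X` because both are linked to `GENE_B`, while `disease_Z` shares no gene with the others and has no related diseases.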

Modeling Techniques of Genomics in Machine Learning

Machine learning (ML) modeling is divided into three categories: supervised, unsupervised, and semi-supervised ML. In supervised ML, every sequence is labeled with a specific disease or other biological information. The model is first trained on the labeled data and then predicts the diseases associated with unseen sequences. In unsupervised ML, the data carries no labels. It is used to group the data into classes on the basis of similar features and properties. One use of unsupervised ML is to group patients with the same disease into different disease stages: the model takes the sequence information of all patients and groups them into classes such as early-stage, middle-stage, and late-stage patients. Semi-supervised ML sits between the supervised and unsupervised approaches: labeled data is used to help predict labels for the unlabeled data.
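The unsupervised grouping of patients can be sketched with k-means clustering. The article does not name a specific algorithm, so k-means is an assumption made here for illustration, and the two numeric features per patient are invented stand-ins for whatever would be derived from real sequence data.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on lists of floats, with a simple deterministic start."""
    # Start from k points spread evenly through the input order. This is an
    # arbitrary, reproducible choice; real code usually uses random restarts.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical per-patient feature vectors (two made-up numeric features each).
patients = [
    [1.0, 1.2], [0.8, 1.0],
    [5.0, 5.1], [5.2, 4.9],
    [9.0, 9.2], [9.1, 8.8],
]

assignment, centroids = kmeans(patients, k=3)
print(assignment)
```

The cluster numbers are only group identifiers. Deciding which group corresponds to the early, middle, or late stage still requires domain knowledge, since no labels were used.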

Future of Genomics in Machine Learning

Genomic data combined with machine learning plays an important role in many developments. The most important application is the identification of key genes, that is, genes that play a significant role in a disease or that cause it to develop. Another application is the discovery of drugs that target key genes. In drug discovery, a machine learning model predicts the best drug for a given set of key genes.

Different types of genomic data (gene expression, RNA sequence, and somatic mutation data) are used for the early diagnosis of disease. Evaluating genomic data at this scale would not be practical without machine learning models. Genomic data with machine learning also helps researchers by predicting sequence information, protein structures, key genes, and drug formulas. Beyond medicine, genomic information is used in security systems, in which security is maintained through fingerprints, and to trace a person's family lineage.

In the future, genomic information combined with machine learning models may let us predict the likelihood that a specific person will commit a crime, estimate at birth the chance of developing a specific disease, and sketch the appearance of a newborn baby. Genome data can also be used to improve crop quality and to find which type of crop suits a given piece of land.

Conclusion of Genomics in Machine Learning

Genomics is a vast area of biology, but conducting genomics research without machine learning creates many hurdles. Machine learning makes genetic research and many other applications of genomics possible. Machine learning with genomics is used in multiple fields, including agriculture, medicine, farming, and security. Moreover, text mining, system applications, and microarray pattern recognition are applications of machine learning in genomics.


In the future, genetic data can be used in medicine for early diagnosis and for predicting patient health. Machine learning can also be applied to the genetic data of plants, animals, birds, and crops. Which crop or plant suits a given soil and land, in which environment animals and birds grow best, and what type of food a healthy human life requires are all future applications of genomics that machine learning can support.


[1] “GenBank Statistics (ca. 2008).” [Online]. Available: https://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/. [Accessed: 16-Dec-2019].

[2] C. H. Mermel, S. E. Schumacher, B. Hill, M. L. Meyerson, R. Beroukhim, and G. Getz, “GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers,” Genome Biol., vol. 12, no. 4, Apr. 2011.

[3] D. Tamborero, A. Gonzalez-Perez, and N. Lopez-Bigas, “OncodriveCLUST: Exploiting the positional clustering of somatic mutations to identify cancer genes,” Bioinformatics, vol. 29, no. 18, pp. 2238–2244, Sep. 2013.

[4] M. S. Lawrence et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.

