The genome of a living organism or a virus is the entirety of its genetic material: all gene sequences in the DNA together with all non-coding DNA sequences. It encodes the synthesis of the RNA and proteins that direct all activities of the organism or virus. Sequencing the DNA and determining its coding sequences therefore gives information about the genetic blueprint of an organism. The human genome, for example, consists of about 3 billion DNA base pairs, of which only about 2% are protein-coding, so genome analysis is a complex process.
DNA is composed of two bonded strands of nucleotides, which are characterized by the bases they contain: C (cytosine), G (guanine), A (adenine) and T (thymine). Only A can pair with T, and only G with C. DNA can thus be represented as a string over these four letters. Because the sequences are long and often repetitive, it is difficult to recognize separate gene sequences and to distinguish gene sequences from non-coding sequences, which can differ by only one base pair. A machine learning model therefore has to retain information about many of the preceding letters (bases) and capture the dependencies among them.
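The string view of DNA described above can be sketched in a few lines of Python. This is only an illustration: the function names, the ACGT column order of the encoding, and the example sequences are assumptions made here, not part of any particular genomics library.

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
BASES = "ACGT"  # column order of the one-hot encoding (an arbitrary choice)


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in reverse order as is conventional."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def one_hot(strand: str) -> list:
    """Encode each base as a 4-element indicator vector in ACGT order."""
    return [[int(base == b) for b in BASES] for base in strand.upper()]


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


partner = reverse_complement("ATGC")         # the strand that pairs with ATGC
encoded = one_hot("AC")                      # numeric input for a model
difference = hamming_distance("ATGC", "ATGA")  # sequences differing in one base
```

The one-hot encoding is one common way to turn the four letters into numeric model input. The Hamming distance shows how small the difference between two otherwise identical sequences can be.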
Potential solution approaches
Extracting information from text-like DNA sequence data requires a machine learning model that can capture long-term dependencies on previous letters. Such tasks can be addressed with recurrent neural networks (RNNs), for example a bidirectional RNN (BRNN) built from long short-term memory (LSTM) cells. To compute the current output, these networks use previous outputs and internal state variables in addition to the current input.
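To make the recurrence concrete, the following is a minimal sketch of a bidirectional LSTM forward pass in plain Python. It uses the standard LSTM gate equations, but everything else is an assumption chosen for illustration: the helper names, the hidden size of 3, the one-hot input in ACGT order, and the random, untrained weights. It is not a trained model and not a substitute for a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_cell(n_in, n_hidden, rng):
    """Random weights for the four LSTM gates: input, forget, candidate, output."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {gate: {"W": matrix(n_hidden, n_in + n_hidden), "b": [0.0] * n_hidden}
            for gate in ("i", "f", "g", "o")}


def affine(layer, v):
    """Compute W @ v + b for one gate."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(layer["W"], layer["b"])]


def lstm_step(params, x, h_prev, c_prev):
    """One time step: the new output depends on the previous output and cell state."""
    v = x + h_prev  # concatenate the current input with the previous output
    i = [sigmoid(z) for z in affine(params["i"], v)]    # input gate
    f = [sigmoid(z) for z in affine(params["f"], v)]    # forget gate
    g = [math.tanh(z) for z in affine(params["g"], v)]  # candidate cell values
    o = [sigmoid(z) for z in affine(params["o"], v)]    # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(params, inputs, n_hidden):
    """Run the cell over a sequence, carrying the state from step to step."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    outputs = []
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def run_bidirectional(fwd, bwd, inputs, n_hidden):
    """Concatenate a left-to-right pass with a right-to-left pass at each position."""
    forward = run_lstm(fwd, inputs, n_hidden)
    backward = run_lstm(bwd, inputs[::-1], n_hidden)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def encode(strand):
    """One-hot encode a DNA string in ACGT order."""
    return [[float(base == b) for b in "ACGT"] for base in strand]


rng = random.Random(0)
n_hidden = 3
fwd, bwd = init_cell(4, n_hidden, rng), init_cell(4, n_hidden, rng)
features = run_bidirectional(fwd, bwd, encode("ATGCGT"), n_hidden)
print(len(features), len(features[0]))  # one vector of 2 * n_hidden values per base
```

The forward pass alone sees only the bases to the left of each position. The backward pass adds context from the bases to the right, which is why the bidirectional output at a given position can change when a later base changes. In practice, the weights would be learned from labeled sequences and a classification layer would be placed on top of these per-position features.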
By determining the relations between the bases, the model can identify gene sequences and classify them by function. It can also detect irregularities in the DNA sequence, which supports disease diagnostics and enables more personalized medical treatment.