In this post, I am re-publishing a part of my PhD thesis that summarises, in extremely simplistic terms, the “genome” background that was sufficient for me, a CS type of person, to get started with the algorithms in bioinformatics.
DNA, RNA, and Protein Sequences:
DNA (deoxyribonucleic acid) is a biomolecule carrying the genetic information necessary for the reproduction, growth, and functioning of living organisms (and some viruses). It is a chain of building blocks called nucleotides; each nucleotide contains one of four bases – cytosine (C), guanine (G), adenine (A) or thymine (T). From an informatics perspective, DNA can be seen as a string over the alphabet Σ = {A, C, G, T}. DNA usually occurs in a double-stranded form (i.e. two strands or chains) intertwined in a double-helical structure. The pairing between bases – A with T and G with C – keeps the double helix stable. As a result, the strands are complementary to each other, i.e. one strand can be obtained from the other by simply replacing A with T (and vice versa) and G with C (and vice versa). Since the two strands run in opposite directions, the complemented string must also be reversed to read the other strand in its own direction; the result is called the reverse complement.
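For a CS reader, the string view of base pairing can be made concrete with a few lines of Python. This is a minimal sketch using the standard pairing (A with T, G with C); the function names `complement` and `reverse_complement` are my own choices for illustration, not from any particular library.

```python
# Standard base pairing: A <-> T, G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Replace every base by its pairing partner, position by position."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Complement the strand and reverse it, giving the other strand
    as it would be read in its own direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATGCGT"
    print(complement(strand))          # TACGCA
    print(reverse_complement(strand))  # ACGCAT
```

Applying `reverse_complement` twice returns the original string, which is a handy sanity check when working with both strands.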
Proteins are biomolecules responsible for a wide range of essential functions required in a life-form. A protein is a chain of smaller units called amino acids, folded into a complex three-dimensional structure. Proteins are built from 20 standard amino acids. Thus, a protein molecule can be primarily thought of as a string over an alphabet consisting of 20 letters. Protein sequences are encoded in subsequences of DNA; such subsequences are called genes. Typically, in complex life forms, a gene consists of short substrings called exons interspersed with large substrings called introns. An ordered subset of exons, called a transcript, typically corresponds to one protein. As a result, the same gene can have multiple transcripts and thus encode multiple proteins. Encoding from DNA to protein is usually a three-step process – transcription, splicing, and translation. In transcription, the two strands open up and an RNA molecule complementary to one of the strands is produced; as a string, RNA looks the same as DNA, the only difference being that T is replaced with another base, U (uracil). Transcription is followed by splicing, which cuts out the introns and joins subsets of exons so as to produce one or more transcripts. The result of the splicing process is called mRNA (mature messenger RNA). In the final step, i.e. translation, the mRNA is read sequentially from left to right, and each triplet of bases (called a codon) is translated into a specific amino acid; the amino acids are chained together to form the corresponding protein. The translation table associating such triplets with amino acids is shared by most organisms and is called the genetic code.
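The three steps (transcription, splicing, translation) can be sketched as plain string operations. The sketch below rests on several assumptions of mine: the input is the strand whose sequence matches the RNA (so transcription reduces to replacing T with U), exons are given as hypothetical half-open `(start, end)` index pairs, translation starts at the first base and stops at the first stop codon, and the toy gene is invented for illustration. The codon table is the standard genetic code written in the usual compact form.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Assumption: `dna` is the strand that reads like the RNA,
    so transcription is just T -> U."""
    return dna.replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (half-open, 0-based) and join them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> str:
    """Read codons left to right; stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Toy gene: two exons separated by one intron (made-up sequences).
    exon1, intron, exon2 = "ATGGCC", "GTAAGTCCCAG", "AAATGA"
    gene = exon1 + intron + exon2
    exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

    mrna = splice(transcribe(gene), exons)
    print(mrna)             # AUGGCCAAAUGA
    print(translate(mrna))  # MAK
```

Passing a different list of exon intervals to `splice` yields a different transcript from the same gene, which is how one gene can encode multiple proteins.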
The rate of transcription is controlled (inhibited or enhanced) by the binding of specific proteins, called transcription factors, to specific regions called regulatory regions. The DNA substrings to which transcription factors bind are called transcription factor binding sites (TFBS). These are located either in the promoter region (a 100-1000 base pair long region that initiates the transcription process, situated near the site at which the transcription of a gene starts) or far away from the gene along the sequence.
Genome:
Physically, DNA is usually present in a condensed form called chromosomes, and the complete set of all the DNA sequences of an organism is called a genome (in many viruses, the genome is composed of RNA rather than DNA). A genome can range from a few million base pairs (in bacteria) to more than a hundred billion base pairs (the human genome is about 3 billion base pairs long). The genomes of individuals belonging to the same species typically have the same number of chromosomes and, by and large, the same base sequences in each chromosome. Consequently, a consensus or reference genome can represent a typical genome associated with a species. However, mutations (permanent alterations of the sequence) and recombination (random crossing over of the chromosomes inherited from the mother and the father in sexual reproduction) can cause genetic variations as the genome is copied from cell to cell or from individual to individual across generations. Variations are usually small-scale – mostly changes in single bases (single nucleotide polymorphisms or SNPs) and, less frequently, insertions or deletions of bases (InDels). Each variant found at a specific position in a chromosome is called an allele (e.g. a different form of the same gene). Genomics, being the branch of molecular biology focussing on the structure, function, evolution, mapping etc. of genomes, entails the sequencing, assembly, and analysis of genomes.
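Single-base variations against a reference are easy to picture as position-wise mismatches between two strings. The sketch below assumes the sample has already been aligned to the reference and has the same length, so it only reports single-base differences; detecting InDels would need a proper sequence alignment, which is not attempted here. The function name `find_snps` and the toy sequences are my own.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every position
    where two equal-length, already aligned sequences differ.
    Positions are 0-based."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length (no InDels handled)")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    reference = "ACGTACGT"
    sample = "ACGTTCGA"
    print(find_snps(reference, sample))  # [(4, 'A', 'T'), (7, 'T', 'A')]
```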
Resources
- A quick summary can be gleaned from Modules 9 to 11 (DNA Structure and Replication; DNA Transcription and Translation; Gene Expression) of this course.
- A more detailed version may be learnt from the lectures (Week 5 to Week 9: DNA and replication; Transcription, Translation, and Variations; Recombinant DNA, Genomics) of this course – Introduction to Biology: The Secret of Life – on edX by Prof. Eric Lander.
- Animation videos
  - Cell division: Mitosis, Meiosis
  - Structure: DNA Structure, Packaging
  - DNA replication: 3-D animation, Explanation (Part 1 and 2)
  - Protein synthesis: Overview, Transcription & translation, Realistic looking (transcription, splicing, translation)
  - rDNA technology etc.: Overview, animated overview, 3D (PCR, rDNA)
Remember, it is just the process that we need to understand. With “bioinformatics work” in mind, there is no need to overwhelm yourself with the technical details (names of enzymes, functional groups, structure of bases, and so on).