Mikhail Gelfand

COMPUTATIONAL GENOMICS:
FROM THE WET LAB TO COMPUTER AND BACK

It is widely accepted that the development of sequencing techniques*, oligonucleotide arrays** and mass-spectrometry has transformed molecular biology from a hypothesis-driven science into a data-driven one. As with any general statement, this one is, if not exactly untrue, at least too strong. At the same time, the availability of huge amounts of data has led to the emergence of completely new approaches based on computer analysis. These approaches are commonly known as bioinformatics or computational biology***. The first name refers mainly to the technical aspects of computer use in molecular biology: planning and support of experiments, creation of databases, etc., whereas in the second case the emphasis is on obtaining new biological knowledge through computer analysis of genomic sequences, expression data, protein structures, protein-protein interactions, etc.
Indeed, the total size of sequenced DNA grows much faster than the number of experimental results (Fig. 1). It is clear that most genes, and even complete genomes, will never be studied experimentally. The same avalanche of data exists in many other areas of molecular biology and genetics. Thus the only way to deal with this situation is theoretical analysis of the accumulated data, after which the most interesting and crucial observations can be confirmed directly by experiment. On the other hand, global experiments necessarily involve computational analysis; some examples are given below. Finally, some areas of computational biology, e.g. the theory of molecular evolution, do not admit experiment at all, as it would require too much time - not historical, but astronomical. Nevertheless, non-trivial statements can be made here as well.



Fig. 1. The number of papers on molecular biology (based on PubMed data) and sequence fragments in the databank of nucleotide sequences GenBank published in 1982-2001.



Fig. 2. One of the canons from Bach's Musical Offering. Bach wrote it in this form deliberately: the players themselves have to determine the timing and order of the individual voices.

So, what biological knowledge can be extracted from the symbol sequences representing a genome or proteome, from a set of numbers showing the expression levels of genes under various conditions, or from a graph of protein-protein interactions? Let us start with a standard example: the complete genome of some organism (for simplicity's sake, a bacterium) has been sequenced, and it is the only data about this organism, as no experiments with it have ever been done. What can we say about the bacterium given its genome?

This on-line version of the book "Biomediale. Contemporary Society and Genomic Culture" is not full. The unabridged edition can be purchased in printed form as anthology. Requests should be sent to: bulatov@ncca.koenig.ru (full information) or in written form: 236000, Russia, Kaliningrad, 18, Marx str., The National Publishing House “Yantarny Skaz”. Phone requests: Kaliningrad +7(0112)216251, Saint-Petersburg +7(812)3885881, Moscow +7(095)2867666. On-line bookshop (in Russian): http://www.yantskaz.ru. Full reference to this book: "Biomediale. Contemporary Society and Genomic Culture". Edited and curated by Dmitry Bulatov. The National Centre for Contemporary art (Kaliningrad branch, Russia), The National Publishing House “Yantarny Skaz”: Kaliningrad, 2004. ISBN 5-7406-0853-7

The comparative method is the basic approach to prediction of protein function. This is the next, and probably the main, step of genome annotation. If a database contains a similar protein with known function, one can draw conclusions about the function of the studied protein. The level of detail and reliability of the prediction depend on the degree of similarity between the proteins, on the completeness of the genomes (functional assignments for genes from complete genomes are more reliable than those for short fragments), and on how much of the protein is covered by the similar segments. Analysis of a large number of proteins having the same function in different organisms allows one to define so-called functional motifs, that is, patterns common to these proteins that form their structural cores or reaction centers. This analysis is based on the natural assumption that evolutionarily conserved positions or structural elements in proteins and DNA are functionally important.
In a slightly more rigorous formulation, this assumption looks as follows. Spontaneously arising mutations are subject to selection pressure. Mutations that interfere with protein function are removed, whereas neutral mutations can be fixed and inherited by subsequent generations. Thus proteins from two species descended from a common ancestor accumulate random changes independently. Moreover, the neutrality of a mutation is not an absolute notion. A mutation can only slightly affect the protein function, or even improve the fitness of the protein (and hence the organism) to changed conditions, or, say, decrease the fitness to some conditions and increase it to others. Besides, one mutation can compensate for the effect of another. Finally, some mutations in a given position can be neutral (e.g. substitutions between amino acids with similar physico-chemical properties), while others are deleterious. Thus, aligning related proteins, one sees that the degree of conservation varies from column to column: some positions are absolutely invariant, and no mutations are tolerated there; some are semi-conserved; and some are selectively neutral, as many diverse amino acids can be observed there.
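The idea of the preceding paragraph can be sketched in a few lines of code. The following toy Python function (all sequences hypothetical) marks each column of an alignment as invariant, conserved, or variable, in the way alignment viewers annotate columns with "*" and ":"; the conservation cutoff is an illustrative choice.

```python
# Per-column conservation of a toy protein alignment: invariant columns
# ("*"), conserved columns (":"), and variable columns (" ").
from collections import Counter

def column_conservation(alignment, conserved_cutoff=0.8):
    """Mark each alignment column as invariant, conserved, or variable."""
    marks = []
    for column in zip(*alignment):
        counts = Counter(column)
        top_fraction = counts.most_common(1)[0][1] / len(column)
        if top_fraction == 1.0:
            marks.append("*")          # no mutations tolerated here
        elif top_fraction >= conserved_cutoff:
            marks.append(":")          # semi-conserved position
        else:
            marks.append(" ")          # selectively neutral position
    return "".join(marks)

# Hypothetical alignment of five related sequences.
seqs = ["MKVLA", "MKILA", "MKVLG", "MRVLA", "MKVLA"]
print(column_conservation(seqs))   # → *::*:
```

Real tools weight substitutions by physico-chemical similarity rather than counting identities, but the principle is the same.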
There exist numerous databases of proteins, protein alignments, and functional motifs. To find related proteins in the databases and identify motifs present in a given protein, one can use specialized programs based on fast but sufficiently reliable algorithms. As the size of the databases and the number of queries grow at a very fast rate, and new types of data arise (e.g. alignment libraries), the development of such algorithms remains a pressing problem of bioinformatics.
In addition to functional motifs, one can analyze other protein features: transmembrane segments (anchors used by surface proteins, e.g. transporters and outer membrane receptors, to attach to the membrane), signal sequences determining the cellular localization of proteins, etc. Indeed, each specific function influences the statistical properties of an amino acid sequence. In particular, transmembrane segments function not in water but in the lipid environment of a membrane; thus they avoid hydrophilic amino acids (which prefer contact with water) and favor hydrophobic ones (which contact lipids). Finally, one can try to predict the three-dimensional structure of a protein; this subject is considered in more detail below.
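As a toy illustration of how statistical properties betray function, here is a minimal sliding-window hydrophobicity scan in the spirit of the Kyte-Doolittle method; the window size and threshold are illustrative choices, not calibrated values.

```python
# Sliding-window hydrophobicity profile (Kyte-Doolittle scale): windows
# whose average hydrophobicity is high are candidate transmembrane segments.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydrophobic_windows(seq, window=9, threshold=1.6):
    """Return start positions of windows with high mean hydrophobicity."""
    hits = []
    for i in range(len(seq) - window + 1):
        if sum(KD[aa] for aa in seq[i:i + window]) / window > threshold:
            hits.append(i)
    return hits

# A hydrophobic run is flagged; the charged tail is not.
print(hydrophobic_windows("LLLLILVVILDEKRDEKRDE"))   # → [0, 1, 2, 3]
```

Real transmembrane predictors use longer windows (about 19 residues, the thickness of a membrane) and statistical models rather than a fixed cutoff.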
Thus comparative analysis can establish the cellular roles of up to two thirds of proteins, and partially characterize a further quarter of the proteome. For instance, analysis of motifs can lead to the assignment of a generic biochemical function: we approximately know what reaction is catalyzed by a given enzyme, but cannot predict its specificity. The fraction of characterized proteins in eukaryotes is usually lower, but still sufficient to describe the main cellular functions.



Fig. 3. Fragment of alignment of transcription factors from the LacI family. Invariant (*) and conserved positions are marked.

The core of such a description is metabolic reconstruction: a catalog, or rather a graph, of all chemical reactions occurring in the cell. Reactions catalyzed by proteins encoded in the genome, as predicted by similarity analysis, are mapped onto the universal metabolic map (a graph of all reactions ever observed in living systems, stored electronically in a database).
At the next step, the resulting metabolic map of the organism is analyzed in order to identify missing steps and contradictions. Indeed, there are natural criteria for the consistency of a metabolic map. The simplest criterion is the absence of dead ends, that is, reactions whose products are never used or whose substrates are neither transported nor synthesized. A special and most frequent example of such a situation is a missing link in a linear chain of reactions, where the product of each reaction is directly used by the subsequent one.
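A dead-end check of this kind is easy to state in code. The sketch below, with a made-up two-reaction pathway, flags compounds that are produced but never consumed (or vice versa) and are not declared external or transported.

```python
# Dead-end check on a toy metabolic map: each reaction is a
# (substrates, products) pair of sets. A compound is a dead end if it is
# produced but never consumed (or consumed but never produced) and is not
# declared external (transported or otherwise accounted for).
def dead_ends(reactions, external=frozenset()):
    consumed = {s for subs, prods in reactions for s in subs}
    produced = {p for subs, prods in reactions for p in prods}
    never_used = produced - consumed - set(external)
    never_made = consumed - produced - set(external)
    return never_used, never_made

# Hypothetical pathway A -> B ... C -> D with the middle reaction missing.
pathway = [({"A"}, {"B"}), ({"C"}, {"D"})]
unused, unmade = dead_ends(pathway, external={"A", "D"})
print(unused, unmade)   # B is produced but never consumed; C the reverse
```

The dead ends B and C point exactly at the missing reaction B -> C, i.e. at an enzyme whose gene we have failed to identify.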
If such a missing link is discovered, this means that we have not identified the gene encoding the corresponding enzyme. This might mean that our similarity criteria were too strict, or that this reaction is catalyzed by a novel protein having no known homologues. In the first case, it is sufficient to repeat the similarity search, weakening the criteria for accepting a putatively related protein. In the second case, more sophisticated techniques are applied.
These techniques are based on analysis of regulation or of gene positioning on the chromosome. In the first case, one identifies candidate regulatory sites in DNA that constitute a common signal for all genes in the considered metabolic pathway. Such signals are usually binding sites (operators) for transcription regulatory factors. The latter are specific proteins that react to changes in the environment or in the chemical content of the cell and, depending on these changes, bind to operators and thus switch gene expression on and off. The relevant physiological parameters could be a low concentration of some necessary compound, heat or cold shock, introduction into the host organism (for pathogens), appearance in the external medium of some substrate that could be used as a source of energy, overcrowding, etc. It is clear that if some compound is found in the external medium, the genes responsible for transport and utilization of this compound should be turned on. Similarly, a low concentration of some compound in the cytoplasm should lead to switching on the corresponding biosynthetic genes or, as in the previous case, the genes encoding the compound's transporters.
It is also clear that these genes should be switched on and off in concert; this is why one signal normally regulates all genes of the pathway. Thus, the gene encoding the missing enzyme should also have a similar regulatory site. After identification of the signal and construction of a recognition rule, one scans the genome and selects genes with upstream sites satisfying the constructed rule: the desired gene is among the selected ones. The problem here is that it is usually impossible to construct a sufficiently specific recognition rule, and the list of candidates is too large: tens or even low hundreds of potentially regulated genes, with only a few of them being relevant.
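One common way to build such a recognition rule is a positional weight matrix (PWM) trained on known binding sites. The sketch below, with invented sites and an invented sequence, scores every window of the genome against log-odds weights and reports windows above a threshold.

```python
# Scanning a sequence for candidate operators with a positional weight
# matrix (PWM) built from known binding sites; sites and sequence are toys.
import math

def build_pwm(sites, pseudocount=0.5):
    """Log-odds weights per position, against a uniform background."""
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        weights = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 2.0)
            weights[base] = math.log(freq / 0.25)
        pwm.append(weights)
    return pwm

def scan(sequence, pwm, threshold):
    """Return (position, score) of every window scoring above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        score = sum(pwm[j][sequence[i + j]] for j in range(width))
        if score > threshold:
            hits.append((i, round(score, 2)))
    return hits

sites = ["TGTGA", "TGTGA", "TTTGA", "TGTGC"]
pwm = build_pwm(sites)
print(scan("AATGTGACCGA", pwm, threshold=3.0))   # → [(2, 4.99)]
```

Lowering the threshold is exactly the "weakening of criteria" described in the text: it recovers weak true sites at the cost of a longer list of spurious candidates.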



Fig. 4. Following the commands of its program, the DNA double helix, the cell builds highly complex chains of protein molecules from amino acids.



Fig. 5. A good prediction based on a large number of independent observations can be more reliable than some experiments.

However, one can use the following consideration. Spurious sites (false positives) are scattered at random, whereas true sites occur upstream of orthologous genes (the same gene in many genomes). Indeed, according to the main assumption, the set of regulated genes corresponds to a metabolic pathway, which is normally conserved, at least in sufficiently close species. Thus we obtain a consistency condition: those genes are regulated that have candidate operators in several genomes.
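This consistency condition translates into a simple filter: keep only the genes whose orthologs carry candidate sites in at least a given number of genomes. A toy sketch (genome and gene names invented):

```python
# Cross-genome consistency filter: a gene is accepted as regulated only if
# candidate operators are found upstream of its orthologs in several genomes.
from collections import Counter

def consistent_genes(candidates_by_genome, min_genomes=3):
    """candidates_by_genome maps genome -> set of ortholog IDs with a site."""
    counts = Counter(gene for genes in candidates_by_genome.values()
                     for gene in genes)
    return {gene for gene, n in counts.items() if n >= min_genomes}

# Hypothetical scan results: spurious hits scatter, true targets recur.
hits = {
    "genome1": {"argA", "argB", "xyzQ"},
    "genome2": {"argA", "argB", "abcR"},
    "genome3": {"argA", "argB", "pqrS"},
}
print(consistent_genes(hits))   # only the recurring genes survive
```

The spurious hits (`xyzQ`, `abcR`, `pqrS`) each occur in one genome and are discarded, while the recurring targets pass the filter.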
One more technique, also involving the analysis of a large number of distantly related genomes, is based on positional analysis. It has been demonstrated that genes with linked functions, e.g. genes encoding enzymes that catalyze successive steps of a metabolic pathway, tend to be positioned in the same chromosomal loci. There are natural functional and evolutionary reasons for this (in particular, it simplifies co-regulation); however, these reasons are rather weak, and thus co-localization of functionally linked genes is a tendency rather than a universal rule. Besides, a chromosome is a one-dimensional object, and genes can therefore be adjacent simply by chance. Analyzing one genome, or several closely related ones (where the gene order does not differ much), one cannot distinguish between spurious and significant co-localization of genes. However, if two genes are adjacent in a considerable number of unrelated genomes, it is a good indication of a possible functional link between them.


After this study was published, the predicted regulation and specificity of the transporter were verified experimentally. Moreover, detailed analysis of the structure of the regulatory element allowed us to predict a unique regulatory mechanism; however, this is beyond the scope of this paper, see Notes for a detailed discussion.



Fig. 6. Phylogenetic tree of bacterial arginine, histidine, and glutamine transporters. Known specificities are indicated, as is a transporter of Bacillus subtilis whose specificity was predicted based on the analysis of regulation and then confirmed experimentally.

In fact, many results obtained with advanced methods of computational genomics concern the specificity of transporters. One reason is that transporters are difficult to work with experimentally, and because of that they are, in general, less well studied than proteins from other functional classes, e.g. enzymes and regulatory proteins. Besides, the specificity of transporters to their compounds is evolutionarily very unstable, and thus simple protein similarity analysis does not allow one to make reliable detailed predictions. One more example is the family of transport systems importing the amino acids arginine, histidine and glutamine into a bacterial cell. All proteins in this family are very similar, and no clustering by similarity can be made that would coincide with the natural division by specific function, where the latter is known (Fig. 3). On the other hand, it is possible to construct a recognition rule for the signals regulating the expression of the genes of the arginine biosynthesis pathway. Moreover, there is a variety of such signals in different genomes. It then turns out that the expression of only a fraction of the proteins in this family is regulated by these arginine-related signals. Thus, only regulatory analysis allows for reliable prediction of transporter specificities. Again, some of these predictions have been confirmed in independent experiments.
However, in some areas of computational genomics experimental verification is impossible in principle. Probably the main such area is the theory of molecular evolution. Of course, the first studies in this area appeared in the sixties, long before the beginning of mass sequencing of DNA. However, only the emergence of complete genomes allowed one to pose really fundamental problems. Still, it is clear that the time scale of evolutionary events is incomparable with experimental time scales.
The main method of the theory of molecular evolution is the construction of phylogenetic trees, that is, reconstruction of the history of protein families. As noted above, all proteins are subject to mutation. After divergence of species from a common ancestor, these mutations occur independently, and the differences between the proteins of two species are greater the earlier their last common ancestor lived (in a sense, this is similar to the divergence of human languages). However, the mutation rate is not uniform. In particular, it depends on the importance of particular proteins to the organism. Thus, proteins participating in the main information processes, such as replication, transcription and translation, are on average more conserved than enzymes, and the latter are more conserved than, e.g., proteins of the outer membrane. The second complication is that proteins diverge not only after speciation events, but also following intragenomic duplications. After a duplication, the genome encodes two copies of the protein, and if both persist for a sufficiently long time, their functions start to diverge (such, in particular, is the history of the arginine-histidine-glutamine transporters mentioned above). Thus the analysis of individual protein families does not allow one to draw conclusions about the evolution of species; only studies of complete genomes, taking into account data obtained for various families, make it possible to construct species' phylogenetic trees with some reliability, and thus to reconstruct the history of Life.
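The raw material for such trees is a measure of divergence between aligned sequences. The crudest such measure, the fraction of differing positions (p-distance), can be computed as follows (sequences hypothetical):

```python
# Pairwise divergence of aligned sequences as a crude evolutionary distance:
# the more mutations accumulated since the common ancestor, the larger it is.
def p_distance(a, b):
    """Fraction of aligned positions where two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

seqs = {"speciesA": "ACGTACGT", "speciesB": "ACGTACGA", "speciesC": "ATGAACTA"}
print(p_distance(seqs["speciesA"], seqs["speciesB"]))   # → 0.125
print(p_distance(seqs["speciesA"], seqs["speciesC"]))   # → 0.5
```

Here speciesA and speciesB look like close relatives and speciesC an outgroup; tree-building methods turn a full matrix of such distances (usually corrected for multiple substitutions at the same position) into a branching history.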
We said above that the characteristic evolutionary time scale is incomparable with the experimental one. This is not universally true, as exemplified by the evolution of viruses. The replication process in viruses is error-prone, and thus mutations occur frequently. Moreover, selection towards genome stability is weaker, since, say, variability of the envelope proteins is a feature allowing viruses to evade the host immune system. Thus viral evolution can be studied in the laboratory using bacterial viruses, phages. Another possibility is to consider the epidemiology of human viruses, e.g. various influenza strains. During influenza epidemics in distant isolated communities, e.g. in South America, it is possible to trace the development of an epidemic from the seaports to the mainland. Even more spectacular was the so-called "case of the dentist," who infected a number of his patients with HIV. In this case it was possible to trace the entire history of the infections, which coincided with the phylogenetic tree of the virus strains****.



Fig. 8. The successful conclusion of the first part of the HGP - completion of the human genome sequence - was announced in June 2000. In the next stage, scientists must discover, localize, and functionally describe human genes.



Fig. 9. Genes whose expression level depends on the circadian rhythm. Horizontal axis: time in hours. Each horizontal line corresponds to one gene. The relative expression level is shown by tone (light: low expression, dark: high expression).

In some cases, the history of human populations can also be considered in this light. Early studies relied on the analysis of frequencies of blood types and similar genetic markers in different ethnic groups. Currently, a large-scale international project has been launched that aims at the analysis of single nucleotide polymorphisms (SNPs): positions where differences between individual genomes are observed and where the minor variants (alleles) are sufficiently frequent (constituting not less than 1% of the population). In other words, a SNP is a position where at least 1% of all people have a nucleotide differing from that of the majority of the population (for comparison, note that on average the genomes of two unrelated people have 1 difference per 1000 positions, whereas the genomes of human and chimpanzee differ at 1 position per 100). The practical result of this project is that SNPs are used to map disease genes, as they serve as fixed landmarks in the genome. On the other hand, it is possible to find combinations of SNPs (haplotypes) that are specific to particular ethnic groups. Moreover, as there are male-specific (Y-chromosome) and female-specific (mitochondrial) fractions of the genome, it is possible to reconstruct the male and female human histories separately.
Until now we have considered the genome as something static, a text containing instructions for cell function. Current techniques in molecular biology allow one to study changes in gene expression in response to various stimuli. This is done using so-called oligonucleotide chips. One such chip measures the concentrations of mRNAs transcribed from thousands of genes, e.g. all genes of a bacterium or yeast, or a considerable fraction of human genes.


However, oligonucleotide chips measure the concentrations of mRNAs, not of proteins. This is not the same thing, as the translation rates of different mRNAs can differ, and, besides, the expression of some genes is regulated at the translation step. Thus the concentration of proteins is not proportional to the concentration of mRNAs. Recently, methods have appeared that allow one to measure protein concentrations directly. These methods are based on mass-spectrometry, that is, on measuring the molecular masses of protein fragments. Mass-spectrometry experiments are among the most computer-dependent techniques, as they require automated support at all steps, from the initial collection of data to specialized database similarity searches. The latter problem is to identify a protein given the masses of its fragments. Among other proteomics techniques, one should mention the extensively developed two-hybrid analysis, which allows one to create a map of protein-protein interactions, both stable ones, in structural and enzymatic protein complexes, and transient ones, in signal transduction pathways.
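The identification problem mentioned above, matching observed fragment masses against a protein database, can be sketched as peptide mass fingerprinting. Everything below is a toy: a tiny residue-mass table, a simplified tryptic digest (cut after every K or R, ignoring the proline rule), and invented sequences.

```python
# Peptide mass fingerprinting, the database problem behind mass
# spectrometry: digest each database protein in silico and count how many
# observed fragment masses it explains. All values here are toy examples.
RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08, "V": 99.13,
                "L": 113.16, "K": 128.17, "R": 156.19, "F": 147.18}
WATER = 18.02  # mass of the water added on peptide-bond hydrolysis

def tryptic_peptides(protein):
    """Cut after every K or R (the proline rule is ignored for simplicity)."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def peptide_mass(peptide):
    return round(sum(RESIDUE_MASS[aa] for aa in peptide) + WATER, 2)

def best_match(observed, database, tolerance=0.1):
    """Rank proteins by the number of observed masses they explain."""
    def explained(protein):
        masses = {peptide_mass(p) for p in tryptic_peptides(protein)}
        return sum(any(abs(m - t) <= tolerance for t in masses)
                   for m in observed)
    return max(database, key=lambda name: explained(database[name]))

database = {"protA": "GASKVLRFA", "protB": "AVKGGR"}
print(best_match([361.40, 386.50], database))   # → protA
```

The two observed masses match two of the three theoretical peptides of `protA` and none of `protB`, so `protA` is reported; real search engines score thousands of proteins and account for missed cleavages and chemical modifications.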
We complete this short discussion of proteomics with problems related to the prediction of protein three-dimensional structure. The most traditional of these problems is the prediction of standard structural elements from the protein sequence. Predicting the complete spatial structure from sequence alone seems to be impossible, and the problem is not even formulated this way. However, it has been noted that the majority of known 3D structures can be clustered into several dozen standard architectures. Thus the prediction problem reduces to assigning a protein to one of the structural classes (threading) or predicting that the protein should form a new class. In the latter case, it is highly likely that the structure of this protein will be solved by standard X-ray crystallographic methods; moreover, there are several research programs aiming to complete the list of structural classes. Finally, we should mention the problem of interaction between proteins and small ligands (docking). This problem is very important, as it is central to drug discovery: the search for inhibitors of pathogen proteins or modulators (both inhibitors and activators) of human ones.
Thus we have made a short tour of computational genomics and proteomics. Many areas have only been mentioned briefly, although in all cases we have tried to describe the types of data generated in different experiments and the computational problems that arise. More traditional areas, especially those somewhat analogous to linguistics, have been given more attention, although we have deliberately avoided direct parallels.
And a few words about the epigraph: it is one of the canons from Bach's Musical Offering. Bach wrote it in this form deliberately: the players themselves have to determine the timing and order of the individual voices. The same holds for researchers of the genome, the text of Nature whose sense we are trying to uncover.

* See Glossary. (Editor's note)
** For a brief overview, see: "Biochips and industrial biology" by Irina Grigorjan and Vsevolod Makeev, this issue.
*** Currently there exist several textbooks on computational molecular biology. A general review for developers is given in [Clote, P. and Backofen, R. Computational Molecular Biology, Wiley, 2000], the algorithmic aspects of bioinformatics are described in [Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997], and a user-oriented introduction can be found in [Baxevanis, A.D. and Ouellette, B.F.F. (eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition, Wiley, 2001] and [Mount, D.W. Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory, 2001]. A popular introduction to bioinformatics formed "the theme of the issue" of Computerra (2001) 36:413 (in Russian). A less popular introduction and discussion of perspectives is [Mironov, A.A., Gelfand, M.S. "Computational biology between decades," Molecular Biology 33: 854-868, 1999]. An instructive comparison of experimental and computational errors is [Iyer, L.M., Aravind, L., Bork, P., Hofmann, K., Mushegian, A.R., Zhulin, I.B., Koonin, E.V. "Quod erat demonstrandum? The mystery of experimental validation of apparently erroneous computational analyses of protein sequences," Genome Biology 2/12/research/0051, 2001], see also [Galperin, M.Y., Koonin, E.V., "Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption," In Silico Biology 1: 55-67, 1998]. (Author's note)
**** "The dentist's case," the first application of the theory of molecular evolution in the courtroom, was considered in [C.-Y. Ou et al., "Molecular epidemiology of HIV transmission in a dental practice," Science 256: 1165-1171, 1992]. The contemporary problem of identifying anthrax strains is treated in [Cummings, C.A. and Relman, D.A., "Molecular forensics - cross-examining pathogens," Science 296: 1976-1979, 2002]. (Author's note)




