Dartmouth
Computer Science Department
- Dartmouth Chemistry Department
- Dartmouth Biological Sciences
Department
- Dartmouth
Center for Structural Biology and Computational Chemistry
Algorithmic Challenges in Structural Molecular Biology and
Proteomics
This paper reviews our research in computational biology and chemistry. Some of the most challenging and influential opportunities for Physical Geometric Algorithms (PGA) arise in developing and applying information technology to understand the molecular machinery of the cell. Our recent work (e.g., [1-20]) shows that many PGA techniques may be fruitfully applied to the challenges of computational molecular biology. PGA research may lead to computer systems and algorithms that are useful in structural molecular biology, proteomics, and rational drug design.
Concomitantly, a wealth of interesting computational problems arise in proposed methods for discovering new pharmaceuticals. I'll briefly discuss some recent results from my lab, including new algorithms for interpreting X-ray crystallography [14,17,16] and NMR (nuclear magnetic resonance) data [3,9,6,19,10,5,7,18,4], disease classification using mass spectrometry of human serum [12], and protein redesign [13]. Our algorithms have recently been used, respectively, to reveal the enzymatic architecture of organisms high on the CDC bioterrorism watch-list [17,16], for probabilistic cancer classification from human peripheral blood [12], and to redesign an antibiotic-producing enzyme to bind a novel substrate [13]. I'll overview these projects, and highlight some of the algorithmic and computational challenges.
In the post-genomic era, key problems in molecular biology center on the determination and exploitation of three-dimensional protein structure and function. For example, modern drug design techniques use protein structure to understand how a drug can bind to an enzyme and inhibit its function. Structural proteomics will require high-throughput experimental techniques, coupled with sophisticated computer algorithms for data analysis and experiment planning.
My laboratory develops novel computational methods to enable high-throughput structural and functional studies of proteins. A key focus is structural genomics, whose goal is (in the broadest terms) to determine the three-dimensional structures of all proteins in nature, through a combination of direct experiments and theoretical analysis. Proteins are the worker molecules in every living thing. By determining the structures of proteins, we are better able to understand how each protein functions normally and how faulty protein structures can cause disease. Scientists can use the structures of disease-related proteins to help develop new medicines and diagnostic techniques.
PGA research in computational structural biology and proteomics can assist in our long-range goal of understanding biopolymer interactions in systems of significant biochemical as well as pharmacological interest. At the molecular level, many genes provide the blueprint for proteins, and it remains very expensive and time-consuming to determine what these proteins do, and how they do it. This paper reviews some novel algorithms to build three-dimensional models of proteins to better understand protein mechanism and function.
Our work spans a number of projects in computational structural biology, including algorithms for automated assignment and structure determination in NMR structural biology [3,9,6,19,10,5,7,18,4], protein redesign [13], computer-aided drug design [1,15,13], computational molecular replacement in X-ray crystallography [14,17,16], and other related projects [2,11,8]. To pursue research in the field of structural genomics is to study the geometric structures of proteins. Structural genomics is a field born of the the marriage between computer science and biology. The goal is to develop new technology - specifically, computer algorithms - to enable the determination of 10,000 new protein structures in 10 years. This would have an enormous impact on our understanding of disease mechanisms, and our ability to target drugs to specific protein targets. If successful, the impact would be comparable to that of the human genome project.
Modern automated techniques are revolutionizing many aspects of
biology, for example, supporting extremely fast gene sequencing and
massively-parallel gene expression testing. Protein structure
determination, however, remains a long, hard, and expensive
task. High-throughput, automated, algorithmic methods are required in
order to apply modern techniques such as computer-aided drug design on
a much larger scale. For example, to analyze non-crystallographic
symmetry in X-ray diffraction data of biopolymers, one must
``recognize" a finite subgroup of (the Lie group of 3D
rotations) out of a large set of molecular orientations. The problem
may be reduced to clustering in
modulo a finite group, and
solved efficiently by ``factoring" into a clustering on the unit
circle followed by clustering on the 2-sphere
, plus some
group-theoretic calculations [14]. This yields a
polynomial-time algorithm that is efficient in practice, and which
enabled us to collaborate with biological crystallographers to reveal
the architecture of a parasite's enzyme, which could help researchers
reduce the threat of certain diseases among those with weak immune
systems [17,16]. As another example, we
recently employed geometric techniques from statistical estimation and
machine learning to develop an algorithm for cancer proteomics, in
which we use data from a mass spectrometer to distinguish between
healthy and diseased blood in humans [12]. Geometry
pervades our work: [12] can be viewed as an investigation
of the geometry of the oncoproteome (the space of cancer proteins) as
projected onto the mass-to-charge ratios of their proteolytic digest.
Nuclear vector replacement for automated NMR resonance
assignments [6,10] is essentially a
matching problem on a quotient space of orientations, induced by a
quadratic form
on
.
is parameterized by
, and
3D structural homology detection from unassigned NMR data
(enabling rapid fold determination) can be performed by combinatorial
optimization, searching over
to minimize a functional that
compares distributions generated by
's image of the bond vectors
from putative database protein models [9,10,5,7].
In each of our research projects, computational techniques are central, and the applications present intriguing problems to computer scientists who design algorithms and implement systems. For example, the techniques we introduced for automated NMR resonance assignment [6,10,3,9,7] and protein structure determination [19,18,5,7] are an instance a general approach to combinatorial problem solving in which constraints on the solution are enforced in an order determined by the strength of the evidence for them. This approach, which has analogies to the Celera whole-genome shotgun sequencing algorithm, also presents a flock of fascinating questions from the point of view of theoretical computer science (cf. Richard Karp's Keynote address, Computational Systems Bioinformatics Conference, 2003).
In this section (2), I attempt to illustrate the general
themes introduced above, by way of specific examples and results. The
NMR biophysics requires some technical description; some readers may
prefer to skip to the higher-level discussion on Protein Fold
Determination (p. ) or even to
Section 3 (p.
).
In contrast to (e.g.) simulated annealing approaches, our algorithm
is combinatorially-precise [18], and is built upon
the exact solutions for computing backbone
angles from RDC data and systematic
search. [19] is the first NMR structure determination
algorithm that simultaneously uses exact solutions, systematic search
and only 2 RDCs per residue. (A systematic search is a search over all
possible conformations (solutions) that employs a provable pruning
strategy that guarantees pruned conformations need not be considered
further). Our first contribution is the derivation of low-degree
polynomial equations for computing, exactly and
in constant time, dihedral
angles from RDCs measured on
a single internuclear vector
in two different aligning
media. The easily computable exact solutions eliminate the need for
one-dimensional grid-search previously employed to compute the
directions of
or two-dimensional grid-search to compute
angles. Furthermore, these equations are very general and can
easily be extended to compute both the backbone and sidechain dihedral
angles from RDC data measured on any single vector in two aligning
media. And, our method can be applied mutatis mutandis to
derive similar equations for computing dihedral angles from RDCs in
nucleic acids. Compared with other algorithms for computing backbone
structures using RDCs, our algorithm achieves similar accuracies but
requires less data, relies less on statistics from the PDB and does
not depend on molecular dynamics. Since RDCs can be acquired and
assigned much more quickly than NOEs in general, our results show it
is possible to compute structures very rapidly and inexpensively using
mainly RDC restraints.
Most of the work described above concentrates on algorithms and system
for analyzing biological data and biological problems. However,
the techniques we develop can also be applied to synthetic
problems such as a protein engineering. For example, in
collaboration with Prof. Amy Anderson, we recently developed a
novel ensemble-based scoring and search algorithm for protein
redesign, and applied it to modify the substrate specificity of an
antibiotic-producing enzyme in the non-ribosomal peptide
synthetase (NRPS) pathway [13]. Realization of novel
molecular function requires the ability to alter molecular complex
formation. Enzymatic function can be altered by changing
enzyme-substrate interactions via modification of an enzyme's active
site. A redesigned enzyme may either perform a novel reaction on its
native substrates or its native reaction on novel substrates. A number
of computational approaches have been developed to address the
combinatorial nature of the protein redesign problem. These approaches
typically search for the global minimum energy conformation among an
exponential number of protein conformations. We developed a novel
algorithm for protein redesign, which combines a statistical
mechanics-derived ensemble-based approach to computing the binding
constant with the speed and completeness of a branch-and-bound pruning
algorithm. In addition, we developed an efficient deterministic
approximation algorithm, capable of approximating our scoring function
to arbitrary precision. Our algorithm is the first provable
-approximation algorithm for estimating partition functions
for protein flexibility.
In practice, our approximation algorithm decreases the execution time of the mutation search by a factor of ten. To test our method, we examined the Phe-specific adenylation domain of the NRPS gramicidin synthetase A (GrsA-PheA). We used ensemble scoring, via a rotameric approximation to the partition functions of the bound and unbound states for GrsA-PheA, to predict binding of the wildtype protein and a previously described mutant (selective for leucine), and second, to switch the enzyme specificity toward leucine, using two novel active site sequences computationally predicted by searching through the space of possible active site mutations. The top scoring in silico mutants were created in the wetlab and dissociation/binding constants were determined by fluorescence quenching. These tested mutations exhibit the desired change in specificity from Phe to Leu. Our ensemble-based algorithm which flexibly models both protein and ligand using rotamer-based partition functions, has application in enzyme redesign, the prediction of protein-ligand binding, and computer-aided drug design.
This result represents a computational approach to reprogramming enzyme specificity, with the ultimate goal of combinatorial biosynthesis for small-molecule diversity. We are studying a family of enzymes responsible for the biosynthesis of hundreds of pharmaceutically-active peptide-like products. Understanding these enzyme functions will elucidate how natural biological products (e.g., antibiotics) are synthesized in vivo. To modify enzyme function, we are developing computational techniques to plan structurally-based site-directed mutations. By reëngineering the active site(s) to operate on different substrates, we hope to modify these different enzymatic steps with an eye to potential reprogramming of those steps for combinatorial biosynthesis. This opens the door to the possibility of using our redesigned enzymes for in vivo combinatorial chemistry, to create candidate drug leads for new antibiotics and other drugs. We are applying our algorithms to NRPS modules, whose products include natural antibiotics, antifungals, antivirals, immunosuppressants, and siderophores. NRPS have multiple domains with individual functions acting in an assembly-line fashion. We are modifying the active sites to switch the specificity of the amino acid-accepting domains from their natural substrates, to different amino acids. The modifications are planned and analyzed in silico, by developing new algorithms based on techniques from geometric algorithms, robotics, machine vision, and scientific computation. Our ``enzyme reprogramming" could allow the modified NRPS to synthesize different modified peptides. Exploration of the combinatorial space of new NRPS ``programs" will generate a large number of new compounds, which could then be screened for pharmaceutical activity.
There is much to be done. A primary focus will be to explore novel computational methods in structural biology, specifically, new algorithms for NMR resonance assignments and protein structure determination, with applications to structural genomics. I am particularly interested in new algorithms for structural biology using only a minimal number of inexpensive, fast experiments. I'm also interested in collaborating with structural biologists to develop novel algorithms and apply them to biological systems of significant biochemical and pharmacological importance. A model for this kind of work is our collaboration on the structure of dihydrofolate reductase-thymidylate synthase (DHFR-TS) from Cryptosporidium hominis [14,17,16]. Cryptosporidium is an organism high on the bioterrorism list of the Center for Disease Control (CDC), a Category B bioterrorist threat. Agents/diseases that fall under this category are given the second-highest priority, because they are moderately easy to disseminate; result in moderate morbidity rates and low mortality rates; and require specific enhancements of CDC's diagnostic and treatment capacities. There is currently no drug therapy for cryptosporidosis. The enzyme DHFR-TS is in the sole de novo biosynthetic pathway for the pyrimidine deoxyribonucleotide dTMP, and therefore an attractive drug target. Solving the structure of DHFR-TS from C. hominis opens the door to species-specific drug design, exploiting both structural and biophysical differences between the human enzyme and Cryptosporidium DHFR-TS.
We collaborated with Dr. Amy Anderson's lab to determine the structure of DHFR-TS from C. hominis, revealing a unique linker domain containing an 11-residue alpha helix that has extensive interactions with the opposite DHFR-TS monomer of the homodimeric enzyme [17,16]. Analysis of the structure of DHFR-TS from C. hominis and of previously solved structures of DHFR-TS from Plasmodium falciparum (a.k.a. malaria) and Leishmania major reveals that the linker domain primarily controls the relative orientation of the DHFR and TS domains. Using the tertiary structure of the linker domains, we have been able to place a number of protozoa in two distinct and dissimilar structural families corresponding to two evolutionary families and provide the first structural evidence validating the use of DHFR-TS as a tool of phylogenetic classification.
I am also interested more broadly in proteomics and functional genomics, both in developing novel algorithms for proteomic problems and bringing to bear structural, geometric, and biophysical insights (and algorithms) for proteomics. For example, in collaboration with the Norris-Cotton Cancer Center at Dartmouth, we are exploring oncoproteomic target selection using mass spectrometry [12]. We developed an algorithm called Q5 for probabilistic classification of healthy vs. disease whole serum samples using mass spectrometry. Q5 is the first closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. It employs a discriminant back-projection algorithm to compute clues as to the molecular identities of differentially-expressed proteins and peptides. Q5 analyzes whole spectrum Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) Mass Spectrometry (MS) data, and was demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. We achieved sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques, and represents the first attempt to compute the molecular identities of the differentially-expressed proteins in two important MS data sets for ovarian and prostate cancer. Further investigation of our lead proteins and peptide fragments may enhance our understanding of the molecular basis of oncogenisis and could potentially lead to new therapeutic targets.
As discussed above (Sec. 3), we are designing and implementing planning algorithms to propose site-directed mutations for protein redesign. We are developing a general planner that can reprogram the specificity of many NRPS domains, from many biological systems. The results of our algorithms are being compared to known crystal structures, to biochemical activity assays, and to crystal structures of the modified domains bound to the proposed substrates. We are developing a predictive model for when and how well our planner will work, by characterizing the complexity, correctness, and completeness of our algorithms. We believe these algorithms may be generally useful to the structural biology community, for studies of protein-ligand binding and protein redesign. Our provable approximation algorithm represents a new technique for computer-assisted drug design, and a novel approach for docking flexible ligands to flexible active sites [13]. In the future, we will extend and apply our algorithms to the redesign of other enzymes, including polyketide synthase (PKS) systems, which synthesize polyketide products such as erythromycin, rapamycin and tetracycline. I believe there are broad potential applications of our techniques for modeling protein flexibility, redesigning enzymes, and evaluating the biophysical processes of binding and catalysis in protein biochemistry, and that these goals can (hopefully) be realized after a lot more hard work in developing geometric algorithms, provably-good approximation algorithms, statistical methods, and an array of algorithmic techniques for handling noise and uncertainty in combinatorial geometry and computational biophysics.
This work is supported by grants from the National Institutes of Health (R01 GM-65982 and R01 GM-67542), and the National Science Foundation (IIS-9906790, EIA-0102710, EIA-0102712, EIA-9818299, EIA-9802068, and EIA-0305444).
Bruce Randall Donald 2004-06-22