Publication Abstracts
- Chandonia JM. 2007. StrBioLib: a Java library for development of custom computational structural biology applications. Bioinformatics [Preprint PDF]
SUMMARY: StrBioLib is a library of Java classes useful for developing
software for computational structural biology research. StrBioLib contains
classes to represent and manipulate protein structures, biopolymer
sequences, sets of biopolymer sequences, and alignments between
biopolymers based on either sequence or structure. Interfaces are provided
to interact with commonly used bioinformatics applications, including
(PSI)-BLAST, MODELLER, MUSCLE, and Primer3, and tools are provided to read
and write many file formats used to represent bioinformatic data. The
library includes a general-purpose neural network object with multiple
training algorithms, the Hooke and Jeeves nonlinear optimization
algorithm, and tools for efficient C-style string parsing and formatting.
StrBioLib is the basis for the Pred2ary secondary structure prediction
program, is used to build the ASTRAL compendium for sequence and structure
analysis, and has been extensively tested through use in many smaller
projects. Examples and documentation are available at the site below.
AVAILABILITY: StrBioLib may be obtained under the terms of the GNU LGPL
license from http://strbio.sourceforge.net/
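The abstract names the Hooke and Jeeves algorithm without describing it. As a rough illustration of the general idea (a derivative-free pattern search that alternates exploratory moves along each coordinate with pattern moves along the direction of improvement), here is a minimal Python sketch. It is not StrBioLib's Java implementation; the function name `hooke_jeeves` and the parameters `step`, `shrink`, `tol`, and `max_iter` are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def hooke_jeeves(
    f: Callable[[Sequence[float]], float],
    x0: Sequence[float],
    step: float = 0.5,      # initial exploratory step (hypothetical default)
    shrink: float = 0.5,    # factor applied to the step after a failed exploration
    tol: float = 1e-6,      # stop once the step is smaller than this
    max_iter: int = 10000,  # cap on outer iterations
) -> Tuple[List[float], float]:
    """Minimize f by Hooke-Jeeves pattern search (illustrative sketch)."""

    def explore(start: Sequence[float], f_start: float, h: float):
        # Try +h and -h along each coordinate, keeping any improvement.
        x, fx = list(start), f_start
        for i in range(len(x)):
            for delta in (h, -h):
                trial = list(x)
                trial[i] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx = trial, f_trial
                    break
        return x, fx

    base = list(x0)
    f_base = f(base)
    h = step
    for _ in range(max_iter):
        if h < tol:
            break
        x, fx = explore(base, f_base, h)
        if fx >= f_base:
            h *= shrink  # no improvement nearby: refine the step
            continue
        while fx < f_base:
            # Pattern move: jump from the new point along the improving direction,
            # then explore around the landing point.
            pattern = [2.0 * xi - bi for xi, bi in zip(x, base)]
            base, f_base = x, fx
            x, fx = explore(pattern, f(pattern), h)
    return base, f_base


if __name__ == "__main__":
    best, value = hooke_jeeves(lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2, [0.0, 0.0])
    print(best, value)
```

The search needs only function values, which is why this family of methods suits objective functions whose gradients are unavailable or expensive.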
- Lowery TJ, Pelton JG, Chandonia JM, Kim R, Yokota H, Wemmer DE. 2007. NMR structure of the N-terminal domain of the replication initiator protein DnaA. J Struct Funct Genomics [PDF]
DnaA is an essential component in the initiation of bacterial chromosomal
replication. DnaA binds to a series of 9 base pair repeats leading to
oligomerization, recruitment of the DnaBC helicase, and the assembly of
the replication fork machinery. The structure of the N-terminal domain
(residues 1-100) of DnaA from Mycoplasma genitalium was determined by NMR
spectroscopy. The backbone r.m.s.d. for the first 86 residues was 0.6 ±
0.2 Å based on 742 NOE, 50 hydrogen bond, 46 backbone angle, and 88
residual dipolar coupling restraints. Ultracentrifugation studies revealed
that the domain is monomeric in solution. Features on the protein surface
include a hydrophobic cleft flanked by several negative residues on one
side, and positive residues on the other. A negatively charged ridge is
present on the opposite face of the protein. These surfaces may be
important sites of interaction with other proteins involved in the
replication process. Together, the structure and NMR assignments should
facilitate the design of new experiments to probe the protein-protein
interactions essential for the initiation of DNA replication.
- Shin DH, Hou J, Chandonia JM, Das D, Choi IG, Kim R, Kim SH. 2007. Structure-based inference of molecular functions of proteins of unknown function from Berkeley Structural Genomics Center. J Struct Funct Genomics [PDF]
Advances in sequence genomics have resulted in an accumulation of a huge
number of protein sequences derived from genome sequences. However, the
functions of a large portion of them cannot be inferred based on the
current methods of sequence homology detection to proteins of known
functions. Three-dimensional structure can have an important impact in
providing inference of molecular function (physical and chemical function)
of a protein of unknown function. Structural genomics centers worldwide
have been determining many 3-D structures of the proteins of unknown
functions, and possible molecular functions of them have been inferred
based on their structures. Combined with bioinformatics and enzymatic
assay tools, the successful acceleration of the process of protein
structure determination through high throughput pipelines enables the
rapid functional annotation of a large fraction of hypothetical proteins.
We present a brief summary of the process we used at the Berkeley
Structural Genomics Center to infer molecular functions of proteins of
unknown function.
- Yooseph S, ... (13 authors) ..., Mashiyama ST, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, ... (6 authors) ..., Brenner SE, ... (6 authors) ..., Venter JC. 2007. The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biol 5:e16. [PDF]
Metagenomics projects based on shotgun sequencing of populations of
micro-organisms yield insight into protein families. We used sequence
similarity clustering to explore proteins with a comprehensive dataset
consisting of sequences from available databases together with 6.12
million proteins predicted from an assembly of 7.7 million Global Ocean
Sampling (GOS) sequences. The GOS dataset covers nearly all known
prokaryotic protein families. A total of 3,995 medium- and large-sized
clusters consisting of only GOS sequences are identified, out of which
1,700 have no detectable homology to known families. The GOS-only clusters
contain a higher than expected proportion of sequences of viral origin,
thus reflecting a poor sampling of viral diversity until now. Protein
domain distributions in the GOS dataset and current protein databases show
distinct biases. Several protein domains that were previously categorized
as kingdom specific are shown to have GOS examples in other kingdoms.
About 6,000 sequences (ORFans) from the literature that heretofore lacked
similarity to known proteins have matches in the GOS data. The GOS dataset
is also used to improve remote homology detection. Overall, besides nearly
doubling the number of current proteins, the predicted GOS proteins also
add a great deal of diversity to known protein families and shed light on
their evolution. These observations are illustrated using several protein
families, including phosphatases, proteases, ultraviolet-irradiation DNA
damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity
added by GOS data has implications for choosing targets for experimental
structure characterization as part of structural genomics efforts. Our
analysis indicates that new families are being discovered at a rate that
is linear or almost linear with the addition of new sequences, implying
that we are still far from discovering all protein families in nature.
- Chandonia JM, Brenner SE. 2006. The impact of structural genomics: expectations and outcomes. Science 311:347-51. [PDF]|[Supplementary Info]|[Table of Contents Page]
Structural genomics (SG) projects aim to expand our structural knowledge
of biological macromolecules while lowering the average costs of structure
determination. We quantitatively analyzed the novelty, cost, and impact of
structures solved by SG centers, and we contrast these results with
traditional structural biology. The first structure identified in a
protein family enables inference of the fold and of ancient relationships
to other proteins; in the year ending 31 January 2005, about half of such
structures were solved at an SG center rather than in a traditional
laboratory. Furthermore, the cost of solving a structure at the most
efficient SG center in the United States has dropped to one-quarter of the
estimated cost of solving a structure by traditional methods. However, the
efficiency of the top structural biology laboratories, even though they
work on very challenging structures, is comparable to that of SG centers;
moreover, traditional structural biology papers are cited significantly
more often, suggesting greater current impact.
- Chandonia JM, Kim SH, Brenner SE. 2006. Target selection and deselection at the Berkeley Structural Genomics Center. Proteins 62:356-370. [PDF]|[Supplementary Info]
At the Berkeley Structural Genomics Center (BSGC), our goal is to obtain a
near-complete structural complement of proteins in the minimal organisms
Mycoplasma genitalium and M. pneumoniae, two closely related pathogens.
Current targets for structure determination have been selected in six
major stages, starting with those predicted to be most tractable to high
throughput study and likely to yield new structural information. We report
on the process used to select these proteins, as well as our target
deselection procedure. Target deselection reduces experimental effort by
eliminating targets similar to those recently solved by the structural
biology community or other centers. We measure the impact of the 69
structures solved at the BSGC as of July 2004 on structure prediction
coverage of the M. pneumoniae and M. genitalium proteomes. The number of
Mycoplasma proteins for which the fold could first be reliably assigned
based on structures solved at the BSGC (24 M. pneumoniae and 21 M.
genitalium) is approximately 25% of the total resulting from work at all
structural genomics centers and the worldwide structural biology community
(94 M. pneumoniae and 86 M. genitalium) during the same period. As the
number of structures contributed by the BSGC during that period is less
than 1% of the total worldwide output, the benefits of a focused target
selection strategy are apparent. If the structures of all current targets
were solved, the percentage of M. pneumoniae proteins for which folds
could be reliably assigned would increase from approximately 57% (391 of
687) at present to around 80% (550 of 687), and the percentage of the
proteome that could be accurately modeled would increase from around 37%
(254 of 687) to about 64% (438 of 687). In M. genitalium, the percentage
of the proteome that could be structurally annotated based on structures
of our remaining targets would rise from 72% (348 of 486) to around 76%
(371 of 486), and the percentage of accurately modeled proteins would
rise from 50% (243 of 486) to 58% (283 of 486). Sequences and data on
experimental progress on our targets are available in the public databases
TargetDB and PEPCdb. Proteins 2006. (c) 2005 Wiley-Liss, Inc.
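The coverage percentages quoted in this abstract follow directly from the protein counts it gives. As a quick arithmetic check, the short Python sketch below recomputes them. The helper name `percent` and the dictionary layout are introduced here for illustration only; every number is taken from the abstract.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the abstract: (proteins covered, proteome size).
coverage = {
    "M. pneumoniae fold assignment, present": (391, 687),
    "M. pneumoniae fold assignment, all targets solved": (550, 687),
    "M. pneumoniae accurate modeling, present": (254, 687),
    "M. pneumoniae accurate modeling, all targets solved": (438, 687),
    "M. genitalium structural annotation, present": (348, 486),
    "M. genitalium structural annotation, all targets solved": (371, 486),
    "M. genitalium accurate modeling, present": (243, 486),
    "M. genitalium accurate modeling, all targets solved": (283, 486),
}

for label, (part, whole) in coverage.items():
    print(f"{label}: {percent(part, whole):.1f}%")

# BSGC share of first reliable fold assignments: (24 + 21) of (94 + 86) proteins.
bsgc_share = percent(24 + 21, 94 + 86)
print(f"BSGC share of new fold assignments: {bsgc_share:.1f}%")
```

Rounded to whole numbers, the computed values agree with the figures stated in the abstract.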
- Smith A, Chandonia JM, Brenner SE. 2006. ANDY: a general, fault-tolerant tool for database searching on computer clusters. Bioinformatics 22:618-620. [PDF]|[Supplementary Info]
SUMMARY: ANDY (seArch coordination aND analYsis) is a set of Perl programs
and modules for distributing large biological database searches, and in
general any sequence of commands, across the nodes of a Linux computer
cluster. ANDY is compatible with several commonly used Distributed
Resource Management (DRM) systems, and it can be easily extended to new
DRMs. A distinctive feature of ANDY is the choice of either dedicated or
fair-use operation: ANDY is almost as efficient as single-purpose tools
that require a dedicated cluster, but it runs on a general-purpose cluster
along with any other jobs scheduled by a DRM. Other features include
communication through named pipes for performance, flexible customizable
routines for error-checking and summarizing results, and multiple
fault-tolerance mechanisms. AVAILABILITY: ANDY is freely available and may
be obtained from http://compbio.berkeley.edu/proj/andy; this site also
contains supplemental data and figures and a more detailed overview of the
software.
- Chandonia JM, Brenner SE. 2005. Update on the Pfam5000 Strategy for Selection of Structural Genomics Targets. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China [PDF]
Structural Genomics is an international effort to
determine the three-dimensional shapes of all important
biological macromolecules, with a primary focus on proteins.
Target proteins should be selected according to a strategy that is
medically and biologically relevant, of good financial value, and
tractable. In 2003, we presented the "Pfam5000" strategy, which
involves selecting the 5,000 most important families from the Pfam
database as sources for targets. In this update, we show that although
both the Pfam database and the number of sequenced genomes have
increased in size, the expected benefits of the Pfam5000 strategy have
not changed substantially. Solving the structures of proteins from
the 5,000 largest Pfam families would allow accurate fold assignment
for approximately 65% of all prokaryotic proteins (covering 54% of
residues) and 63% of eukaryotic proteins (42% of residues). Fewer
than 2,300 of the largest families on this list remain to be solved,
making the project feasible in the next five years given the expected
throughput to be achieved in the production phase of the Protein
Structure Initiative.
- Chandonia JM, Brenner SE. 2005. Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 58:166-79. [PDF]|[Supplementary Info]
Structural genomics is an international effort to determine the
three-dimensional shapes of all important biological macromolecules, with
a primary focus on proteins. Target proteins should be selected according
to a strategy that is medically and biologically relevant, of good value,
and tractable. As an option to consider, we present the "Pfam5000"
strategy, which involves selecting the 5000 most important families from
the Pfam database as sources for targets. We compare the Pfam5000 strategy
to several other proposed strategies that would require similar numbers of
targets. These strategies include complete solution of several small to
moderately sized bacterial proteomes, partial coverage of the human
proteome, and random selection of approximately 5000 targets from
sequenced genomes. We measure the impact that successful implementation of
these strategies would have upon structural interpretation of the proteins
in Swiss-Prot, TrEMBL, and 131 complete proteomes (including 10 of
eukaryotes) from the Proteome Analysis database at the European
Bioinformatics Institute (EBI). Solving the structures of proteins from
the 5000 largest Pfam families would allow accurate fold assignment for
approximately 68% of all prokaryotic proteins (covering 59% of residues)
and 61% of eukaryotic proteins (40% of residues). More fine-grained
coverage that would allow accurate modeling of these proteins would
require an order of magnitude more targets. The Pfam5000 strategy may be
modified in several ways, for example, to focus on larger families,
bacterial sequences, or eukaryotic sequences; as long as secondary
consideration is given to large families within Pfam, coverage results
vary only slightly. In contrast, focusing structural genomics on a single
tractable genome would have only a limited impact in structural knowledge
of other proteomes: A significant fraction (about 30-40% of the proteins
and 40-60% of the residues) of each proteome is classified in small
families, which may have little overlap with other species of interest.
Random selection of targets from one or more genomes is similar to the
Pfam5000 strategy in that proteins from larger families are more likely to
be chosen, but substantial effort would be spent on small families.
- Zhang Y, Chandonia JM, Ding C, Holbrook SR. 2005. Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinformatics 6:77. [PDF]
BACKGROUND: Protein domains have long been an ill-defined concept in biology. They are generally described as autonomous folding units with evolutionary and functional independence. Both structure-based and sequence-based domain definitions have been widely used. But whether these types of models alone can capture all essential features of domains is still an open question. METHODS: Here we provide insight on domain definitions through comparative mapping of two domain classification databases, one sequence-based (Pfam) and the other structure-based (SCOP). A mapping score is defined to indicate the significance of the mapping, and the properties of the mapping matrices are studied. RESULTS: The mapping results show a general agreement between the two databases, as well as many interesting areas of disagreement. In the cases of disagreement, the functional and evolutionary characteristics of the domains are examined to determine which domain definition is biologically more informative.
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. 2004. The ASTRAL Compendium in 2004. Nucleic Acids Res 32 Database issue:D189-92. [PDF]
The ASTRAL Compendium provides several databases and tools to aid in the
analysis of protein structures, particularly through the use of their
sequences. Partially derived from the SCOP database of protein structure
domains, it includes sequences for each domain and other resources useful
for studying these sequences and domain structures. The current release of
ASTRAL contains 54,745 domains, more than three times as many as the
initial release 4 years ago. ASTRAL has undergone major transformations in
the past 2 years. In addition to several complete updates each year,
ASTRAL is now updated on a weekly basis with preliminary classifications
of domains from newly released PDB structures. These classifications are
available as a stand-alone database, as well as integrated into other
ASTRAL databases such as representative subsets. To enhance the utility of
ASTRAL to structural biologists, all SCOP domains are now made available
as PDB-style coordinate files as well as sequences. In addition to
sequences and representative subsets based on SCOP domains, sequences and
subsets based on PDB chains are newly included in ASTRAL. Several search
tools have been added to ASTRAL to facilitate retrieval of data by
individual users and automated methods. ASTRAL may be accessed at
http://astral.stanford.edu/.
- Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: a sequence logo generator. Genome Res 14:1188-90. [PDF]
WebLogo generates sequence logos, graphical representations of the
patterns within a multiple sequence alignment. Sequence logos provide a
richer and more precise description of sequence similarity than consensus
sequences and can rapidly reveal significant features of the alignment
otherwise difficult to perceive. Each logo consists of stacks of letters,
one stack for each position in the sequence. The overall height of each
stack indicates the sequence conservation at that position (measured in
bits), whereas the height of symbols within the stack reflects the
relative frequency of the corresponding amino or nucleic acid at that
position. WebLogo has been enhanced recently with additional features and
options, to provide a convenient and highly configurable sequence logo
generator. A command line interface and the complete, open WebLogo source
code are available for local installation and customization.
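The stack and symbol heights described above can be sketched in a few lines of Python. This is an illustrative calculation under stated assumptions, not WebLogo's own code: it uses the standard information-content definition (maximum entropy minus observed entropy, in bits), omits any small-sample correction, and assumes a gap-free column. The function name `column_heights` and the `alphabet_size` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def column_heights(column: str, alphabet_size: int = 4) -> Tuple[float, Dict[str, float]]:
    """Return (stack height in bits, per-symbol heights) for one alignment column.

    alphabet_size is 4 for nucleic acids and 20 for amino acids.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {symbol: n / total for symbol, n in counts.items()}
    # Shannon entropy of the observed symbol frequencies, in bits.
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Conservation: how far the column falls below maximum entropy.
    stack_height = math.log2(alphabet_size) - entropy
    # Each symbol's height is its relative frequency times the stack height.
    symbol_heights = {symbol: p * stack_height for symbol, p in freqs.items()}
    return stack_height, symbol_heights


if __name__ == "__main__":
    for col in ("AAAA", "AACG", "ACGT"):
        print(col, column_heights(col))
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four bases at equal frequency has zero height.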
- Chandonia JM, Cohen FE. 2003. New local potential useful for genome annotation and 3D modeling. J Mol Biol 332:835-50. [PDF]
A new potential energy function representing the conformational
preferences of sequentially local regions of a protein backbone is
presented. This potential is derived from secondary structure
probabilities such as those produced by neural network-based prediction
methods. The potential is applied to the problem of remote homolog
identification, in combination with a distance-dependent inter-residue
potential and position-based scoring matrices. This fold recognition jury
is implemented in a Java application called JThread. These methods are
benchmarked on several test sets, including one released entirely after
development and parameterization of JThread. In benchmark tests to
identify known folds structurally similar to (but not identical with) the
native structure of a sequence, JThread performs significantly better than
PSI-BLAST, with 10% more structures identified correctly as the most
likely structural match in a fold library, and 20% more structures
correctly narrowed down to a set of five possible candidates. JThread also
improves the average sequence alignment accuracy significantly, from 53%
to 62% of residues aligned correctly. Reliable fold assignments and
alignments are identified, making the method useful for genome annotation.
JThread is applied to predicted open reading frames (ORFs) from the
genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying
20 new structural annotations in the former and 801 in the latter.
- Chandonia JM, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. 2002. ASTRAL compendium enhancements. Nucleic Acids Res 30:260-3. [PDF]
The ASTRAL compendium provides several databases and tools to aid in the
analysis of protein structures, particularly through the use of their
sequences. It is partially derived from the SCOP database of protein
domains, and it includes sequences for each domain as well as other
resources useful for studying these sequences and domain structures.
Several major improvements have been made to the ASTRAL compendium since
its initial release 2 years ago. The number of protein domain sequences
included has doubled from 15 190 to 30 867, and additional databases have
been added. The Rapid Access Format (RAF) database contains manually
curated mappings linking the biological amino acid sequences described in
the SEQRES records of PDB entries to the amino acid sequences structurally
observed (provided in the ATOM records) in a format designed for rapid
access by automated tools. This information is used to derive sequences
for protein domains in the SCOP database. In cases where a SCOP domain
spans several protein chains, all of which can be traced back to a single
genetic source, a 'genetic domain' sequence is created by concatenating
the sequences of each chain in the order found in the original gene
sequence. Both the original-style library of SCOP sequences and a new
library including genetic domain sequences are available. Selected
representative subsets of each of these libraries, based on multiple
criteria and degrees of similarity, are also included. ASTRAL may be
accessed at http://astral.stanford.edu/.
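The "genetic domain" construction described above amounts to ordering the chain sequences by their position in the original gene and concatenating them. The Python sketch below illustrates that step only. The tuple layout `(gene_order, chain_id, sequence)` and the function name `genetic_domain_sequence` are assumptions made for this example and do not reflect ASTRAL's actual data formats.

```python
from typing import Iterable, Tuple

# Hypothetical record: (position of the chain in the original gene, chain ID, sequence).
ChainFragment = Tuple[int, str, str]


def genetic_domain_sequence(fragments: Iterable[ChainFragment]) -> str:
    """Concatenate chain sequences in the order found in the original gene."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(sequence for _, _, sequence in ordered)


if __name__ == "__main__":
    # Two chains of one domain, listed out of gene order.
    fragments = [(2, "B", "GHIK"), (1, "A", "ACDE")]
    print(genetic_domain_sequence(fragments))
```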
- Chandonia JM, Karplus M. 1999. New methods for accurate prediction of protein secondary structure. Proteins 35:293-306. [PDF]
A primary and a secondary neural network are applied to secondary
structure and structural class prediction for a database of 681
non-homologous protein chains. A new method of decoding the outputs of the
secondary structure prediction network is used to produce an estimate of
the probability of finding each type of secondary structure at every
position in the sequence. In addition to providing a reliable estimate of
the accuracy of the predictions, this method gives a more accurate Q3
(74.6%) than the cutoff method which is commonly used. Use of these
predictions in jury methods improves the Q3 to 74.8%, the best available
at present. On a database of 126 proteins commonly used for comparison of
prediction methods, the jury predictions are 76.6% accurate. An estimate
of the overall Q3 for a given sequence is made by averaging the estimated
accuracy of the prediction over all residues in the sequence. As an
example, the analysis is applied to the target beta-cryptogein, which was
a difficult target for ab initio predictions in the CASP2 study; it shows
that the prediction made with the present method (62% of residues correct)
is close to the expected accuracy (66%) for this protein. The larger
database and use of a new network training protocol also improve
structural class prediction accuracy to 86%, relative to 80% obtained
previously. Secondary structure content is predicted with accuracy
comparable to that obtained with spectroscopic methods, such as
vibrational or electronic circular dichroism and Fourier transform
infrared spectroscopy.
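Q3 is the percentage of residues whose three-state secondary structure (helix, strand, or coil) is predicted correctly. The Python sketch below shows that calculation together with one plausible reading of the per-sequence accuracy estimate: each residue's estimated accuracy is taken to be the probability assigned to its predicted state, and the estimates are averaged over the sequence. That reading is an assumption for illustration; the abstract does not specify the decoding method, and the names `q3` and `decode` are hypothetical.

```python
from typing import List, Sequence, Tuple

STATES = "HEC"  # helix, strand, coil


def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state assignment is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def decode(probabilities: List[Sequence[float]]) -> Tuple[str, float]:
    """Pick the most probable state per residue and estimate the overall Q3.

    Assumption: the estimated accuracy of a residue is the probability of
    its predicted state; the estimate for the sequence is their mean.
    """
    prediction = []
    confidences = []
    for probs in probabilities:
        best = max(range(len(STATES)), key=lambda i: probs[i])
        prediction.append(STATES[best])
        confidences.append(probs[best])
    estimated_q3 = 100.0 * sum(confidences) / len(confidences)
    return "".join(prediction), estimated_q3


if __name__ == "__main__":
    probs = [(0.8, 0.1, 0.1), (0.2, 0.7, 0.1), (0.3, 0.3, 0.4), (0.6, 0.2, 0.2)]
    predicted, estimate = decode(probs)
    print(predicted, estimate, q3(predicted, "HECC"))
```

Comparing the estimate with the measured Q3 for a sequence of known structure mirrors the beta-cryptogein comparison in the abstract (62% observed against 66% expected).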
- Schwartz HL, Chandonia JM, Kash SF, Kanaani J, Tunnell E, ... Richter W, Baekkeskov S. 1999. High-resolution autoreactive epitope mapping and structural modeling of the 65 kDa form of human glutamic acid decarboxylase. J Mol Biol 287:983-99. [PDF]
The smaller isoform of the GABA-synthesizing enzyme, glutamic acid
decarboxylase 65 (GAD65), is unusually susceptible to becoming a target of
autoimmunity affecting its major sites of expression, GABA-ergic neurons
and pancreatic beta-cells. In contrast, a highly homologous isoform,
GAD67, is not an autoantigen. We used homolog-scanning mutagenesis to
identify GAD65-specific amino acid residues which form autoreactive B-cell
epitopes in this molecule. Detailed mapping of 13 conformational epitopes,
recognized by human monoclonal antibodies derived from patients, together
with two- and three-dimensional structure prediction led to a model of the
GAD65 dimer. GAD65 has structural similarities to ornithine decarboxylase
in the pyridoxal-5'-phosphate-binding middle domain (residues 201-460) and
to dialkylglycine decarboxylase in the C-terminal domain (residues
461-585). Six distinct conformational and one linear epitopes cluster on
the hydrophilic face of three amphipathic alpha-helices in exons 14-16 in
the C-terminal domain. Two of those epitopes also require amino acids in
exon 4 in the N-terminal domain. Two distinct epitopes reside entirely in
the N-terminal domain. In the middle domain, four distinct conformational
epitopes cluster on a charged patch formed by amino acids from three
alpha-helices away from the active site, and a fifth epitope resides at
the back of the pyridoxal 5'-phosphate binding site and involves amino
acid residues in exons 6 and 11-12. The epitopes localize to multiple
hydrophilic patches, several of which also harbor DR*0401-restricted
T-cell epitopes, and cover most of the surface of the protein. The results
reveal a remarkable spectrum of human autoreactivity to GAD65, targeting
almost the entire surface, and suggest that native folded GAD65 is the
immunogen for autoreactive B-cells.
- Chandonia JM, Karplus M. 1996. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Sci 5:768-74. [OCR PDF]|[PDF]
A neural network algorithm is applied to secondary structure and
structural class prediction for a database of 318 nonhomologous protein
chains. Significant improvement in accuracy is obtained as compared with
performance on smaller databases. A systematic study of the effects of
network topology shows that, for the larger database, better results are
obtained with more units in the hidden layer. In a 32-fold cross validated
test, secondary structure prediction accuracy is 67.0%, relative to 62.6%
obtained previously, without any evolutionary information on the sequence.
Introduction of sequence profiles increases this value to 72.9%,
suggesting that the two types of information are essentially independent.
Tertiary structural class is predicted with 80.2% accuracy, relative to
73.9% obtained previously. The use of a larger database is facilitated by
the introduction of a scaled conjugate gradient algorithm for optimizing
the neural network. This algorithm is about 10-20 times as fast as the
standard steepest descent algorithm.
- Chandonia JM, Karplus M. 1995. Neural networks for secondary structure and structural class predictions. Protein Sci 4:275-85. [OCR PDF]|[PDF]
A pair of neural network-based algorithms is presented for predicting the
tertiary structural class and the secondary structure of proteins. Each
algorithm realizes improvements in accuracy based on information provided
by the other. Structural class prediction of proteins nonhomologous to any
in the training set is improved significantly, from 62.3% to 73.9%, and
secondary structure prediction accuracy improves slightly, from 62.26% to
62.64%. A number of aspects of neural network optimization and testing are
examined. They include network overtraining and an output filter based on
a rolling average. Secondary structure prediction results vary greatly
depending on the particular proteins chosen for the training and test
sets; consequently, an appropriate measure of accuracy reflects the more
unbiased approach of "jackknife" cross-validation
(testing each protein in the database individually).
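Two of the techniques mentioned above, an output filter based on a rolling average and jackknife cross-validation, can be sketched briefly in Python. This is an illustration under assumptions, not the paper's code: the window size of 3, the truncated windows at the sequence ends, and the function names `rolling_average` and `jackknife_splits` are all choices made for this example.

```python
from typing import Iterator, List, Sequence, Tuple


def rolling_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Smooth one network output channel with a centered rolling average.

    Windows are truncated at the ends of the sequence (an assumption).
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def jackknife_splits(proteins: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training set, held-out protein) pairs, one per protein."""
    for i, held_out in enumerate(proteins):
        training = list(proteins[:i]) + list(proteins[i + 1:])
        yield training, held_out


if __name__ == "__main__":
    print(rolling_average([0.0, 3.0, 0.0, 3.0, 0.0]))
    for training, held_out in jackknife_splits(["p1", "p2", "p3"]):
        print(held_out, training)
```

Because every protein is tested exactly once against a network trained on all the others, the resulting accuracy does not depend on an arbitrary choice of training and test sets.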