# Stability Threshold as a Selection Principle for Protein Design

Natural proteins are made up of sequences of amino acids that fold rapidly into specific compact structures corresponding to minimum free energy states called native states [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The structure of the native state conformation controls the functionality of the protein. Because the number of possible random amino acid sequences and the number of possible conformations are gigantic, an important issue is an understanding of the selection principles that apply to protein sequences and/or native state structures.

The two key ingredients of evolution are diversity (afforded by the availability of 20 amino acids) and stability (a functionally useful sequence should not be mutated away). The stability of the occupancy of the native state on increasing the temperature (i.e. the thermodynamic stability) has been argued to be a characteristic of a good folder [4, 5, 11, 12]. Here, we consider a different kind of stability against mutations of the sequence or equivalently perturbations in the effective interaction potential between amino acids. We demonstrate that the two types of stabilities are related in the sense that each one implies the other. We model the mechanism of evolution through natural selection in proteins and discuss its implication in the protein design problem. We show that the native states of random heteropolymers are not stable against mutations, whereas sequences designed to be thermodynamically stable are characterized by a non-zero stability threshold. Conversely, an evolution-like design scheme that attempts to maximize the stability threshold is shown to lead to greater thermodynamic stability as well. Our work provides a characterization of the “twilight zone” and the observation that proteins form families according to the spatial conformation of their native states [14]. As suggested by Li et al. [15], structure selection is an appealing complementary view to sequence selection. We show, however, that selection processes involving sequences could be as important as structure selection. The effects of destabilizing factors such as variations in the denaturant concentration and site directed mutagenesis on the kinetics of folding have been recently investigated in Refs. [16]. A careful analysis of such perturbations has proven helpful in the clarification of the folding funnel in terms of collective coordinates [8, 9, 16].

We consider self-avoiding chains of monomers on a 2D square lattice. The Hamiltonian is

(1) |

where is the coupling between monomer and and is nonzero (and equal to 1) only if and are are adjacent sites on the lattice and and are not next to each other in sequence. This hamiltonian is well known in protein modelling [2, 6]. For a given sequence (and ) with , we enumerate the energies of all possible conformations and are able to determine the native state (ground state) conformation exactly.

We consider two versions of the model. In the first (which we will call the model), the monomers are assumed to be distinct. The matrix is symmetric and has elements. In order to obtain a random heteropolymer, these elements are drawn from a Gaussian distribution with mean value -2 and variance 1. Effectively, the matrix represents a certain sequence. The model is identical to that studied by Dinner et al. [6, 18, 19]. The random contact energies are in approximate correspondence to a more realistic parametrization of the contact energies by Miyazawa and Jernigan [20, 21] or by Kolinski, Godzik and Skolnik [22]. In order to model evolved sequences with a large stability gap [9], we follow the rank-ordered procedure outlined by Shrivastava et al. [23] of shuffling the entries to assign the most favorable attractive interactions to the native contacts of the maximally compact native fold chosen as fixed target.

In the second model (denoted as the MJ model), each monomer is chosen to represent one of the twenty amino acids with the interactions determined by Miyazawa and Jernigan [20, 21]. A random sequence would correspond to a random choice of the amino acids. In order to mimic the relevant feature of sequences selected by evolution it is no longer possible to follow the rank ordering procedure because the entries cannot be shuffled at will. Instead, after having fixed a target fold [13], we have used a recently proposed protein design procedure [12] entailing an optimization scheme in sequence space which allows one to obtain sequences with a desired native state conformation and a required measure of thermodynamic stability, enforced by fixing a “design temperature” and selecting those sequences with .

Our calculations begin with the selection of two distinct interaction matrices which we shall call and . We shall consider 4 choices for : the random and the evolved and MJ models. is chosen randomly. The ground states of the and sequences are generally distinct. We now consider mutations of the sequence along a trajectory parametrized by a mixing coefficient that changes the interaction matrix from to [17]:

(2) |

The coefficient is a measure of the distance in sequence space between and . This is a quite general perturbation of which the natural occurring ones are a subset. The structural similarity of the ground state conformations of these two sequences is given by the normalized distance , where the distance is defined by

(3) |

where and are the Euclidean distances between amino acids and in the the two native states of sequences and respectively. Note that has been normalized so that it is 1 when , as long as the ground states of and are distinct.

Our primary probe of the stability to mutations is via a study of the dependence of on . Qualitatively similar trends are found for both the models – the signature of the selection in sequence space is in the quite distinct behavior of random and evolved sequences. A summary of our results for the behavior of the average as a function of for is shown in Fig. 1. The curves have been obtained as an average over 1000 realizations of independently chosen and for each of them over 1000 realizations of for the model and over 10 realizations of and for each of them over 1000 realizations of for the MJ model. The average stability threshold is zero for random heteropolymers [24] and is distinctly non-zero for the evolved cases.

Furthermore, the stability threshold goes up with the overall thermodynamic stability – in Figure 1, for the MJ model, the region of stability against mutations increases with the design temperature . The threshold is somewhat reduced but is clearly non-zero when one considers rank ordered sequences that have native states in conformations that are not maximally compact. One increasing , the number of monomers, comprising the evolved sequences, the stable phase along the -axis increases in size along with a sharpening of the curve suggestive of a sharp phase transition at the onset of instability in the thermodynamic limit.

One may also define an individual stability threshold in the strength of the perturbation above which becomes non-zero for the first time – the native structure of sequence is destabilized. Normalized probability distributions of the individual stability thresholds for the random and evolved models are shown in Fig. 2. They underscore the different behaviors in the two cases. Our results, in the random case, are marginally related to a study of Bryngelson [25] and in the evolved case, to a recent study of Pande et al. [7], where the authors addressed the issue of stability of the ground state against inaccuracies in the potential. Bryngelson [25] used a mean field theory to estimate the probability of predicting the correct structure of a sequence of monomers if the interatomic potential is known only to an accuracy of . A non-zero could arise from variations in the solvent properties or due to the imperfect parametrization or determination of the potential between amino acids or, as in our case, from mutations in the sequences. Pande et al. [7] showed analytically that the ground state of designed sequences is significantly robust against the introduction of random noise in the interaction matrix.

Imperfectly folded proteins within the cell are demolished by proteolytic enzymes [26]. Unfolded proteins can be present either as a product of a destabilizing mutation or due to a variation of the solvent properties. Thus the folded structure must be robust against perturbations in order to survive in the cell. The role of sequence selection is to produce robust sequences which rapidly fold into a target conformation with a specific function. Can one develop a design scheme starting from existing functional sequences and produce artificial homologues with better functionality?

The design scheme works as follows. 1) Select an initial random sequence. 2) Compute its MJ matrix and its stability threshold by extracting a set of 100 realizations of the perturbation . 3) Subject the sequence to Monte Carlo optimization procedure: monomers are swapped and the new sequence is accepted if its stability threshold is increased. 4) After 1000 such Monte Carlo steps, stop the optimization and compute the folding transition temperature of the resulting sequence. As shown in Fig. 3, averaged over sequences correlates well with the threshold.

We turn now to a recent study of Li et al. [15] who suggested that certain highly designable structures that are the unique native states of a large number of sequences are special in that they are thermodynamically more stable than other structures and are stable against mutations in the sequence. Their study was of chains comprising 27 monomers made up of two kinds of amino acids, hydrophobic (H) and polar (P), on a simple cubic lattice. The bare values of their interaction parameters were expressed (after scaling and shifting) in convenient units so that E(P,P)=0, E(H,P)=-1 and E(H,H)=-2.3. Our own studies of the identical model (without an overall attractive shift in the bare interaction parameters that would promote maximally compact structures) necessitate the consideration of non-maximally compact conformations as well. We have studied a two dimensional version of the model with the bare unshifted interaction parameters considered by Li et al. and with 16 monomers. As suggested by them, we find that the conformations come with varying degrees of designability. There are many conformations that only a few sequences have as a unique native state while there are few that are the native states of a large number of sequences. Fig. 4 shows a plot of the thermodynamic stability as measured by a Z score [27] (see figure caption for a definition) versus the number of sequences that design the structure. The three curves correspond to the highest, the mean and the lowest Z score. There is no evidence of a jump in the thermodynamic stability beyond a certain value of as found by Li et al. for the case in which all native states are necessarily maximally compact. Indeed, for a given structure, there are variations in the stability on tuning the sequences (as in the MJ curves in Fig. 1). Strikingly, highly designable conformations in 2D are always maximally compact. In 3D, with the same choice of unshifted interaction parameters, while globular structures with no holes, resembling real protein structures, are generally the ground state of HP sequences, these structures are not necessarily maximally compact. These findings show that, at least for interactions that do not always lead to maximally compact native states, the selection process primarily involves the sequences and not the structure. The present study has concentrated on the thermodynamic effects of mutations. The two-way link between resistance against mutations and thermodynamic stability demonstrated here could also have ramifications on the dynamics of folding, for example, in the key role that specific amino acids have in nucleation mechanisms of folding [16, 28].

We are grateful to Hao Li and Chao Tang for useful discussions. This work was supported by NASA, NATO, NSF, The Petroleum Research Fund administered by the American Chemical Society and the Center for Academic Computing at Penn State.

## References

- [1] C. Anfinsen, Science 181, 223 (1973).
- [2] E. I. Shakhnovich and A. M. Gutin, Biophys. Chem. 34, 187 (1989).
- [3] H. S. Chan and K. A. Dill, Physics Today, 46, 24 (1993).
- [4] E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. U.S.A. 90, 7195 (1993).
- [5] V. S. Pande, A. Yu. Grosberg, and T. Tanaka, Proc. Natl. Acad. Sci. U.S.A. 91, 12972 (1994).
- [6] A. Sali, E. I. Shakhnovich, and M. Karplus, Nature 369, 248 (1994).
- [7] V. S. Pande, A. Yu. Grosberg, and T. Tanaka, J. Chem. Phys. 103, 9482 (1995).
- [8] P. G. Wolynes, J. N. Onuchic, and D. Thirumalai, Science 267, 1619 (1995)
- [9] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes, Proteins Struct. Funct. Genet. 21, 167 (1995).
- [10] K. D. Klimov and D. Thirumalai, Phys. Rev. Lett. 76, 4070 (1996).
- [11] J. Deutsch and T. Kurosky, Phys. Rev. Lett. 76, 323 (1996).
- [12] F. Seno, M. Vendruscolo, A. Maritan, and J. R. Banavar, Phys. Rev. Lett. 77, 1901 (1996).
- [13] For both the and MJ models, we have repeated the analysis for three different target folds, with no significant difference in the results.
- [14] C. A. Orengo, D. T. Jones, and J. M. Thornton, Nature 372, 631 (1994).
- [15] H. Li, R. Hellig, C. Tang, and N. Wingreen, Science 273, 666 (1996)
- [16] N. D. Socci, J. N. Onuchic, and P. G. Wolynes, J. Chem. Phys. 104, 5860 (1996). J. N. Onuchic, N. D. Socci, Z Luthey-Shulten, and P. G. Wolynes, Folding and Design 1, 441 (1996).
- [17] In the case of the model, a more appropriate definition would be . With such a definition, the variance of would be unaffected and therefore the scale of the folding transition temperature would be unchanged. See Refs. [2, 7].
- [18] A. Dinner, A. Sali, M. Karplus, and E. I. Shakhnovich J. Chem. Phys. 101, 1444 (1994).
- [19] M. Cieplak, S. Vishveshwara, and J. R. Banavar, Phys. Rev. Lett. 77, 3681 (1996).
- [20] S. Miyazawa and R. Jernigan, Macromolecules 18, 534 (1985).
- [21] S. Miyazawa and R. Jernigan, J. Mol. Biol. 256, 623 (1996).
- [22] A. Kolinski, A. Godzik, and J. Skolnik, J. Chem. Phys. 98, 7420 (1993).
- [23] I. Shrivastava, S. Vishveshwara, M. Cieplak, A. Maritan, and J. R. Banavar, Proc. Natl. Acad. Sci. U.S.A. 92, 9206 (1995).
- [24] A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich, cond-mat/9606136 preprint.
- [25] J. D. Bryngelson, J. Chem. Phys. 100, 6038 (1994).
- [26] T. E. Creighton, Proteins. Structures and Molecular Properties. W. H. Freeman & Company (New York), (1993).
- [27] J. U. Bowie, R. Lüthy, and D. Eisenberg, Science 253, 164 (1991).
- [28] E. I. Shakhnovich, V. I. Abkevich, and O. B. Ptitsyn, Nature 379, 96 (1996).