Summary of data analysis of protein structures

 Here is the summary of the responses I got this time to my query on the
 data analysis of protein crystal structures. I am grateful to all who
 responded, especially Erich Bauer, who would like to stay in regular
 communication with me. However, I have not been able to contact him
 personally because my mails to him are bouncing back. Through this medium,
 I apologise to him for my failure in this regard and assure him that I shall
 keep trying until my mails stop bouncing. My special thanks to K. Stewart
 and Dr. Philippe for the references.
                        Yours cordially,
                        Sandeep Kumar
                        san -x- at -x-
 P.S.  I welcome further comments and suggestions.
 **********START of SUMMARY********************************************
 From:!erich.bauer (Erich Bauer)
 Message-Id: <9502091923.AA00383 -x- at -x- renoir>
 Subject: Re: CCL:data analysis of protein crystal structure
 To: san -x- at -x-
 Date: Thu, 9 Feb 1995 20:23:02 +0100 (MEZ)
 Cc: san -x- at -x-
 In-Reply-To: <9502091622.AA16267 -x- at -x-> from "san
 -x- at -x-" at Feb 9, 95 09:52:49 pm
 X-Mailer: ELM [version 2.4 PL2]
 Mime-Version: 1.0
 Content-Type: text/plain; charset=US-ASCII
 Content-Length: 8528
 Status: R
 san -x- at -x-
 > Hi!
 >     About ten or fifteen days ago I posted a query regarding the
 > personal experiences of people involved in the data analysis of
 > the crystal structures of biomolecules, esp. proteins.  I am sorry to
 > state that my query evoked only two responses.
 >  One piece of advice that came my way concerned the selection of the
 > dataset.  I agree with the responder that this is a problem.
 Dear Sandeep,
 I got the message.
 Interestingly, I am at the moment concerned with the same questions.
 Being new to CCL and PDB, I am, however, not experienced in these fields.
 I will try to give you some hints from what I have learned on the
 subject during the last months.
 Not all of it is on my mind right now but, if you agree, we can stay
 in touch and post ideas whenever they come to our minds.
 It makes no sense to bother ALL CCLers with that.
 ( If you don't, just send an email saying "OH NO, NOT YOU AGAIN,
 STOP, PLEASE STOP !!", I won't feel insulted.)
 >   I have found out that in PDB there is a directory called user_group
 > which contains several useful subdirectories.  One of them, called
 > subset_list, contains lists of PDB entries for proteins which
 > have been selected using some criteria.  One of the lists is from
 > Jane Richardson's lab, for different structural motifs. Another, from
 > Chris Sander's lab, lists the proteins with less than 30% sequence
 > homology.  I hope this information is useful to the people involved
 > in this area.
 It is, indeed.
 If you need the original references on how to obtain such datasets:
 @Article{Hobohm:91a,
   author =       "Uwe Hobohm and Michael Scharf and Reinhard Schneider and
                   Chris Sander",
   title =        "Selection of representative protein data sets",
   journal =      "Protein Science",
   year =         "1992",
   volume =       "1",
   OPTnumber =    "",
   pages =        "409 - 417",
   OPTnote =      "eb00",
   OPTannote =    "P-DB"
 }
 @Article{Boberg:92a,
   author =       "Jorma Boberg and Tapio Salakoski and Mauno Vihinen",
   title =        "Selection of a Representative Set of Structures from
                   Brookhaven Protein Data Bank",
   journal =      "Proteins",
   year =         "1992",
   volume =       "14",
   OPTnumber =    "",
   pages =        "265 - 276",
   OPTnote =      "eb00",
   OPTannote =    "P-DB"
 }
 If you want, I can send you the ftp sites to obtain the newest datasets.
 ( I would have to gather the papers from home ... )
 The whole subject seems to be a little bit involved:
 If you want to choose data, for instance for secondary structure prediction,
 you should try to avoid ( rare ) data from membrane proteins.
 So you have to focus on, let's say, globular proteins. If you do so,
 you put in additional information that in turn has been taken from the
 knowledge of tertiary structure used to classify the protein.
 This can be repeated at any level of description ( certain classes of
 proteins, certain folds, domains etc. ) and I don't see any REAL
 answer to that problem.
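 The kind of redundancy filtering behind such representative lists (for
 example a 30% identity cutoff) can be illustrated with a minimal sketch.
 It assumes the sequences are already aligned to equal length and makes a
 single greedy pass in priority order. The helper names percent_identity
 and select_representatives are hypothetical, and this is not the published
 procedure of Hobohm et al. or Boberg et al.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identical positions between two pre-aligned sequences.

    Assumption: both strings have the same length and use '-' for gaps.
    Positions where either sequence has a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    same = sum(1 for x, y in compared if x == y)
    return 100.0 * same / len(compared)


def select_representatives(seqs: Dict[str, str],
                           max_identity: float = 30.0) -> List[str]:
    """Greedy filter: keep an entry only if it is below the identity
    cutoff against every entry already kept.

    The dictionary order is taken as the priority order (for example,
    best resolution first), which is an assumption of this sketch.
    """
    chosen: List[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[kept]) < max_identity
               for kept in chosen):
            chosen.append(name)
    return chosen
```

 The result depends on the priority order, which is one place where prior
 knowledge about the structures enters the selection.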
 >            I renew my request to the people involved in data analysis
 > to come forward and share their experiences with one another.  It may
 > also be useful to outsiders, as it can give them a glimpse
 > of the current situation in this important field.  I am basically interested
 > in discussing how statistics can be exploited to shed some light on the
 > hidden principles and properties of protein structures which are being
 If you need programs for data analysis, I might be able to point you to some.
 For the reasons mentioned above, and as only a very restricted set of
 structural proteins and some enzymes is being crystallized by pharma,
 medicine etc., the PDB itself is a strongly selected dataset.
 > deposited in the databanks.  What all can we learn from data analysis?
 > I am sure there are many more protein motifs still left undiscovered in
 Most pdb files contain classification for secondary structures.
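 As a rough illustration of reading that classification, the sketch below
 collects HELIX and SHEET records from the lines of a PDB file. It assumes
 the fixed-column layout of the current wwPDB file format (chain identifier
 and residue numbers at the columns noted in the comments); files from other
 format versions may differ. The function name
 secondary_structure_segments is hypothetical.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[str, str, int, str, int]


def secondary_structure_segments(pdb_lines: Iterable[str]) -> List[Segment]:
    """Return (kind, start_chain, start_resnum, end_chain, end_resnum)
    for every HELIX and SHEET record.

    Assumed columns (1-based, as in the wwPDB format description):
      HELIX: initChainID 20, initSeqNum 22-25, endChainID 32, endSeqNum 34-37
      SHEET: initChainID 22, initSeqNum 23-26, endChainID 33, endSeqNum 34-37
    """
    segments: List[Segment] = []
    for raw in pdb_lines:
        # Pad short lines so the fixed-column slices below are always valid.
        line = raw.rstrip("\n").ljust(80)
        record = line[:6]
        if record == "HELIX ":
            segments.append(("helix", line[19], int(line[21:25]),
                             line[31], int(line[33:37])))
        elif record == "SHEET ":
            segments.append(("strand", line[21], int(line[22:26]),
                             line[32], int(line[33:37])))
    return segments
```

 A typical use would be to pass an open file object, for instance
 secondary_structure_segments(open("entry.pdb")), where "entry.pdb" is a
 placeholder file name.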
 Several programs for optimisation and displaying structures are of different
 I am presently doing an exhaustive search in *.pdb and we are looking at all
 kinds of representations for measures, for instance not only the values of
 bond angles, plane twists etc. but also higher moments, norms etc.
 ( My boss is a mathematician; he probably knows which representations make
 sense for what. )
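 Geometric measures such as bond angles and torsion (twist) angles can be
 computed directly from atomic coordinates. The following is a minimal
 sketch using plain tuples for points; the function names bond_angle and
 dihedral are illustrative, and the sign of the torsion angle is simply the
 one implied by the atan2 formula used here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    # Clamp against rounding errors before calling acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

 Collecting such values over many structures gives the distributions whose
 means, higher moments and norms can then be compared.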
 > these PDB files. There may be answers hidden in these files which could
 > tell us why protein secondary structure prediction is still inaccurate.
 This question could serve as a basis for a newsgroup or a mail-reflector on its
 I can give you SOME impressions right away; for more, please don't hesitate
 to ask. Most of the subject is still obscure to me, so comments are welcome.
 1) The matter is related to the exploitation of experimental data and to the
    subject of global optimisation: it is probably non-local interactions,
    for instance the neighbourhood of other domains, that enable/disable the
    formation of secondary structures that, in turn, are only the local
    outcome of non-local
    ( Several attempts have been made with more global approaches: then we
    have a problem of tremendously high dimension even for small molecules.
    ( We are working on a global optimization package that might do well for
    medium-sized molecules; however, the accuracy of potential functions might
    be much too poor. ) )
 2) One could try to predict a molecule's structure with force field
    calculations: too slow even for small molecules.
    ( Parametrisation of MD programs takes a LOOOOOOOOOOOOOOOOOOOONG time.
    Only large companies, consortia etc. ( BIOSYM/San Diego ) can cope with that. )
 3) MD is, in principle, just an inadequate method.
    Why should, for instance, 2 H behave differently at the same distance from
    each other when they have a different distance along the backbone?
 4) If you work with ab initio methods to fit data, they take an even
    LOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOnger time to fit for small molecules.
 5) Measured data are not very accurate; sometimes the complete topology
    is wrong, as X-ray pictures have to be interpreted and sometimes people
    need much experience and intuition to guess the right folding pattern.
    ( Sometimes they have to remove one or the other .pdb file as results
    turn out to be TOTALLY wrong, as I was told.
    Not to mention the errors that are not being detected and that no one
    can prove ... )
 6) SecStr prediction is basically heuristics with neural nets etc., based on
    the datasets mentioned before.
    Helices, for instance, are often considered a hydrophobic nucleation
    site; sometimes, however, they are strongly influenced by long-range
    interactions.
 > The only thing we need to do is to be frank and informally discuss the
 > various problems one generally encounters when taking up such a project,
 > and how best one can cope with them.  I will even welcome sharing of the
 > latest literature on the subject amongst the people involved in such projects.
 On what topics?
 If you want, I can send you the list of literature I read during the last months.
 There is also a mail server to which you can send your sequence; they send it
 back within some hours with a sec. str. prediction.
 >                            yours cordially
 >                            sandeep kumar
 Yours sincerely,
 | namenet : Erich  Bornberg - Bauer                        |
 | snailnet: Inst. f. theor. Chem., | Inst. fuer Mathematik |
 |         : Waehringerstr. 17/308  | Strudlhofg. 4         |
 |         : Univ. Wien,   A - 1090,  Vienna / Austria      |
 | voicenet: *43-1-40 480 - 667, 677| (nonet yet, try drums)|
 | faxnet  :             402 85 25  | (  ---     "   ---  ) |
 | internet: erich -x- at -x- | erich -x- at -x-|
 project: making every problem NP-complete ...
 Message-Id: <9502112000.AA03278 -x- at -x->
 Content-Type: text
 Apparently-To: san -x- at -x-
 Status: R
 Hello Sandeep!
 Extremely sorry for not responding earlier to your discussion. Actually I was
 busy preparing the manuscript of a paper. How are you ? Have you received any
 responses so far? The point that you have raised, about the relevance of the
 three dimensional structures predicted from crystal structure data analysis to
 experimental biologists is quite an important one. So many methods have been
 developed so far for predicting the sec. structures and many more may be still
 coming up. I think what is more important from the experimental point of view
 is the tertiary structure of proteins. So the efforts have to be concentrated
 on making more efficient and effective use of the crystal structure data for
 predicting tertiary structures. I personally feel that given a sufficiently
 large data set of highly resolved structures, one can get a lot of information
 on the tert. structures (thus, the size of data set and quality of structures
 being the limiting factors). In today's world of networks and high performance
 computing, the tools using cellular automata theory and neural networks can
 be exploited to derive biologically significant structures. My thoughts may
 sound too optimistic but then that is what I sincerely feel. I would welcome
 further discussion on this topic.
 What else? How's your work going on? Do write. I would have relatively less
 time next week, hopefully. Also send me the responses from others. Bye for now.
 -Sangeeta. 11 Feb. '95.
 Received: from DECNET-MAIL (STEWARTK -x- at -x- CMDA)
  by RANDB.PPRD.Abbott.Com (PMDF V4.3-13 #5551)
  id <01HMX9NS5Y688YZ76E -x- at -x- RANDB.PPRD.Abbott.Com>; Sat,
  11 Feb 1995 12:52:49 -0600 (CST)
 Date: Sat, 11 Feb 1995 12:52:49 -0600 (CST)
 Subject: Re: CCL:data analysis of protein crystal structure
 To: san -x- at -x-
 Message-Id: <01HMX9NS7K1U8YZ76E -x- at -x- RANDB.PPRD.Abbott.Com>
 X-Vms-To: RANDB::IN%"san -x- at -x-"
 Mime-Version: 1.0
 Content-Transfer-Encoding: 7BIT
 Status: R
 Sandeep:  Here are some 1994 references in the area of Protein
 Structure Analysis:
 Protein Science 3, 1927-1937, 1994
 FASEB J. 8, 1237-1239 and 1240-1247, 1994
 FEBS Letters 355, 213-219, 1994
 CABIOS 10, 545-546, 1994
 J. Mol. Biol. 242, 321-329, 1994
 Curr. Opinion in Struct. Biol. 4, 422-428, 1994
 Proteins: Struct. Funct. Genet. 19, 222-229, 1994
 Proteins: Struct. Funct. Genet. 19, 85-97 and 165-173
 J. Mol. Biol. 239, 306-314, 1994
 Of course, all publications by Chris Sander and Janet Thornton
 are important in this area.  Tom Blundell is also
 a researcher whose name comes to mind for important work
 in this area.
 Kent Stewart
 Department of Structural Biology
 Abbott Laboratories
 Chicago, IL,   USA
 From:!youkha (Philippe Youkharibache)
 Message-Id: <9502091857.AA15243 -x- at -x->
 To: <san -x- at -x->
 Subject: Re:  CCL:data analysis of protein crystal structure
 Status: R
 I do not have the refs at hand right now.
 However, check in particular the work of
 Manfred Sippl
 Steve Bryant
 who derive residue-based force fields for threading and, in
 the longer term, protein folding prediction, from
 statistical analysis of known protein structures.
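 The general idea of turning statistics over known structures into an
 energy-like score is often expressed as an inverse Boltzmann relation,
 E(r) = -kT ln( p_obs(r) / p_ref(r) ). The sketch below is only an
 illustration of that relation on binned counts; it is not the actual
 parameterisation used by Sippl or Bryant, and the function name
 inverse_boltzmann, the choice of bins and the pseudocount are assumptions.

```python
import math
from typing import Dict, Hashable


def inverse_boltzmann(observed: Dict[Hashable, float],
                      reference: Dict[Hashable, float],
                      kT: float = 1.0,
                      pseudocount: float = 1.0) -> Dict[Hashable, float]:
    """Convert binned counts into pseudo-energies.

    observed:  counts per bin (e.g. distance bin) for one residue pair type
    reference: counts per bin for the reference state (e.g. all pair types)
    A pseudocount is added to every bin so that empty bins do not give
    log(0); with pseudocount=0 every bin must have non-zero counts.
    """
    bins = sorted(set(observed) | set(reference))
    n_obs = sum(observed.get(b, 0) + pseudocount for b in bins)
    n_ref = sum(reference.get(b, 0) + pseudocount for b in bins)
    energies = {}
    for b in bins:
        p_obs = (observed.get(b, 0) + pseudocount) / n_obs
        p_ref = (reference.get(b, 0) + pseudocount) / n_ref
        # Bins seen more often than in the reference get negative energy.
        energies[b] = -kT * math.log(p_obs / p_ref)
    return energies
```

 Bins that occur more often than in the reference state come out
 favourable (negative), and rarer ones unfavourable (positive).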
 Good luck
 Dr. Philippe Youkharibache 			e-mail: youkha -x- at -x-
 Biosym Technologies Inc.
 9685 Scranton Road				tel: (619) 546 5562
 San Diego, CA 92121				fax: (619) 458 0136
 From:!toni (Toni Kazic)
 Message-Id: <9502091953.AA14776 -x- at -x->
 To: san -x- at -x-
 In-Reply-To: <9502091622.AA16267 -x- at -x-> (san -x- at -x-
 Subject: Re: CCL:data analysis of protein crystal structure
 Status: R
 Dear Sandeep,
 I am not directly involved in that area, but I strongly encourage you in
 asking such straightforward and practical questions.  There is too much
 special-casing and not enough analysis!  Good luck,
 Toni Kazic
 Institute for Biomedical Computing
 Washington University
 From: "William T. Winter" <!wtwinter>
 Subject: Re: CCL:data analysis of protein crystal structure
 To: san -x- at -x-
 In-Reply-To: <9502091622.AA16267 -x- at -x->
 Message-Id: <Pine.3.89.9502091700.A1093-0100000 -x- at -x->
 Mime-Version: 1.0
 Content-Type: TEXT/PLAIN; charset=US-ASCII
 Status: R
 > state that my query evoked only two responses.
 This board is still viewed by many as quantum chemistry and hence is
 probably not widely followed by protein crystallographers. Subscribe to
 bionet.xtallography on your nearest internet newsstand and send them your
 Dr. William T. Winter                  Phone: (315)470-6876
 315 Baker Lab                          FAX:   (315)470-6856
 SUNY-ESF                               Internet: wtwinter -x- at -x-
 Syracuse, NY 13210-2786
 ******************end of summary************************************