Margaret Dayhoff, a founder of the field of bioinformatics

23 March, 2009 (22:01) | General Science, Genomics Research | By: Mary

If you are a biomedical researcher, have you ever used protein databases like UniProt to get information about proteins that you are interested in?  Do you know how that database got there?  I don’t mean today, I mean decades ago: how did a resource like this come to exist at all?  When researchers search a protein database or align amino acid sequences, they’ll frequently come across the name of someone who helped start it all years ago.  Margaret Dayhoff was one of the people who pioneered this crucial functionality, a true founder of the field of bioinformatics.  But in some histories and timelines of bioinformatics she barely gets a mention.  To celebrate Ada Lovelace Day, I’m going to introduce you to Dr. Dayhoff, and I hope to raise awareness of her fundamental contributions to the field.

Because we can access all the protein information we can stand with a few keystrokes today, it is easy to forget that this data 1) didn’t always exist, and 2) when it did exist, it wasn’t easy to find and work with.  In the 1960s, only a handful of protein sequences were known.  But it was clear that more of this data would be incredibly useful in a number of ways, and that it was certainly going to be generated at an ever faster pace.  Soon it would overwhelm any one person’s ability to analyze and retain it.  DNA sequences…don’t even go there yet….

But there were some prepared minds ready to begin thinking about these data and the associated opportunities around them.  They were also aware that computers might help with these problems.  Robert Ledley was one of them.  Ledley had trained as a dentist, but obtained a degree in physics and became increasingly interested in the possibilities of applying computational resources to biomedical problems.  A report authored by Ledley is one of the earliest studies of biomedical computation, and can be viewed on Google Books today.

Working with Ledley at the National Biomedical Research Foundation was a woman named Margaret Dayhoff.  With an undergraduate degree in mathematics and graduate studies in chemistry, Dayhoff had pioneered work with punch cards and data processing machines to evaluate molecular resonance energies of organic molecules.  She obtained a Watson Computing Laboratory Fellowship to pursue the work to complete her PhD, which is described by a biographer as:

The process was iterative and required manually carrying cards from one type of machine to another (4 types), as no single machine could do the whole iteration. Convergence was slow and several months could be required for a result.

I imagine she was using machines similar to the antiques we can see in an article contemporary with Dayhoff’s fellowship, wherein Miss Eleanor Krawitz, Tabulating Supervisor, offers a tour of the punch cards and the processes in the Columbia Engineering Quarterly in 1949.  (That article also notes that Miss Krawitz was “the first feminine author to contribute to the COLUMBIA ENGINEERING QUARTERLY.”)

So Dayhoff was someone who had understood and actually used “automatic computing methods and equipment” to generate data (Krawitz).  Paired with Ledley, she had the opportunity to move the work to protein analysis.  In 1962, Dayhoff and Ledley wrote:

In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.

The IBM 7090 can be viewed on the web in a number of places.  It looks like something out of a sci-fi movie.  A monstrous collection of metal bins with spinning tape disks.  But at least it had transistors instead of the vacuum tubes at this point.  And it worked.

The program that Dayhoff and Ledley described was called COMPROTEIN.  It was actually a “programming system” comprising six individual programs:  MAXLAP, MERGE, PEPT, SEARCH, QLIST, and LOGRED.  The paper offers the theoretical framework for assembling protein chain data from peptide digests, and provides typewritten flow diagrams to explain each of the individual programs.  It is almost excruciating to read at this point because it all seems so basic.  And to know that it would take so long to actually write and run them makes my head hurt….

The idea was conceived by us in 1958, but actual programming was not initiated until late 1960.

And the paper was published in 1962.  Egads, I could teach myself enough Perl to do this in a weekend now.
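To give a feel for the kind of problem involved, here is a minimal Python sketch of assembling a chain from overlapping peptide fragments.  It is not the COMPROTEIN method: it is a simple greedy overlap merge, the function names `overlap` and `assemble` are my own, and the fragments are made up for illustration.  Real digest data is far messier (ambiguous residues, repeated stretches, fragments that do not overlap at all), which this toy ignores.

```python
def overlap(a: str, b: str) -> int:
    """Return the length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments: list[str]) -> str:
    """Greedily merge the two fragments with the largest overlap until one remains."""
    # Drop exact duplicates (keeping order) and fragments contained in another.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared residues only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


# Hypothetical fragments, written in one-letter amino acid codes.
print(assemble(["IAKQR", "MKTAY", "TAYIAK"]))
```

Greedy merging is the easiest version to write down, but it can pick the wrong join when a stretch of residues repeats, so treat it as an illustration of the idea rather than a reliable assembler.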

But I know, it wasn’t easy, and I don’t mean to suggest that.  And it was HUGELY important work.  It formed the major foundation for everything I do every day now in bioinformatics.  The end of this COMPROTEIN paper says:

Just as the proteins are composed of chains of the same types of molecules, the genetic substances desoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are composed of chains of only 4 different types of molecules called the nucleotide bases.  It is possible that the order of the molecules in these substances can also be determined by the aid of this computer program and some computer experiments in this direction have been made.  However, application of these techniques to DNA and RNA still awaits further development in the chemical experimental methods.

I know Margaret would have loved next-gen sequencing, the high-throughput, high-volume, huge data generating capacity we have today.

But this was only the beginning of her work in bioinformatics.  You may be familiar with her one-letter code for amino acids, which required less punch card punching.  Dayhoff used computers to develop algorithms and analyze the protein sequences she had available, and made huge strides in understanding evolutionary relationships.  She created scoring methods and matrices that are still foundational in this field–and if you do sequence alignments you may see her name in the output!  She was enormously respected for this work, and was supported by the granting agencies for it.
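To see how much typing (or punching) the one-letter code saves, here is a small Python sketch that converts a sequence written in three-letter abbreviations into one-letter codes.  The table is the standard one-letter code for the 20 common amino acids; the function name `to_one_letter` and the hyphen-separated input format are my own assumptions for the example.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues: str) -> str:
    """Convert a hyphen-separated three-letter sequence to one-letter codes."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues.split("-"))


three = "Gly-Ile-Val-Glu-Gln"
one = to_one_letter(three)
print(one, len(three), len(one))  # the one-letter form is far shorter
```

Five residues go from 19 characters to 5, which adds up quickly when every character has to be punched into a card.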

There was a separate aspect of her work, though, that was less well supported by funding groups.  She began to collect and publish regularly the Atlas of Protein Sequence and Structure books.  The first edition contained 65 sequences.  It seems that funding agencies were not keen on funding work that some perceived as “stamp collecting” rather than experimentation.  The atlas morphed into a database that Dayhoff made available by subscription in order to support this work.  However, this subscription aspect created tension among biomedical researchers who thought that since the protein sequences were freely available, charging for a database was unwarranted.

Bruno Strasser’s study of this period is a fascinating look (pdf) at the history, attitudes, and framework in which this all occurred.  At a talk for the Anniversary of GenBank, Strasser explored the visionary work of Dayhoff, the database she established, other parallel database development projects in molecular biology, and the tension around the value of database curation.

(http://videocast.nih.gov/Summary.asp?File=14412 Strasser’s talk begins at approximately 1:09 and ends around 1:45. You can drag the progress bar to get to the right place and start to watch.)

From the paper and the talk, we hear Dayhoff speak to the importance of the work she was doing:

As she explained to a colleague: “There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”

(Dayhoff 1967, from Strasser pg. 111)

I would encourage you to watch the video where Strasser explains this, and read the companion paper; it is a fascinating look at a time that established the world of bioinformatics as we know it today.  It presages and informs much of the battle around open source software and data as we know it.  After I learned these details about this period, my understanding of the framework and discussions of the open source world in which we find ourselves today became much deeper.  I discovered in an obituary that the “stable, adequate, long-term funding” for PIR (the direct descendant of the Atlas) finally came through a few months after she died.

If you use PIR (the Protein Information Resource) today, or UniProt, or any of a number of other databases and analysis tools for sequence comparisons, or if you rely on biomedical research for your health and well-being, you should appreciate the life of Margaret Oakley Dayhoff as well.

I’ll let Margaret Dayhoff close with this, and I wish I could tell her how important her link in the chain was to me:

We sift over our fingers the first grains of this great outpouring of information and say to ourselves that the world be helped by it. The Atlas is one small link in the chain from biochemistry and mathematics to sociology and medicine.

(Dayhoff 1968, from Strasser pg. 112)

References:

•    Dayhoff, M. O. and G. E. Kimball. Punched Card Calculation of Resonance Energies. J. Chem. Phys. 17, 706-717, Ph.D. Thesis, Columbia University, Graduate School of Chemistry, 1949. DOI:10.1063/1.1747374

•    Dayhoff, M. O. and R. S. Ledley. Comprotein: A Computer Program to Aid Primary Protein Structure Determination. In Proceedings of the Fall Joint Computer Conference, 1962, 262-274. Santa Monica, CA: American Federation of Information Processing Societies, 1962. http://portal.acm.org/citation.cfm?id=1461546

•    Dayhoff, M. O. 1965. Computer aids to protein sequence determination. J. Theor. Biol. 8: 97-112. doi:10.1016/0022-5193(65)90096-2

•    Krawitz, E. The Watson Scientific Computing Laboratory: A Center for Scientific Research Using Calculating Machines.  Columbia Engineering Quarterly, November 1949. http://www.columbia.edu/acis/history/krawitz/index.html

•    Ledley, R.S. Report on the Use of Computers in Biology and Medicine.  National Research Council (U.S.). Advisory Committee on Electronic Computers in Biology and Medicine, National Research Council (U.S.). Division of Medical Sciences.  Published by National Academy of Sciences – National Research Council, 1960. http://books.google.com/books?id=J5grAAAAYAAJ&output=html

•    Strasser, B.J. “Collecting and Experimenting: The moral economies of biological research, 1960s-1980s.”, Preprints of the Max-Planck Institute for the History of Science, 310, 105-23. 2006. http://www.yale.edu/history/faculty/materials/strasser-mpi-2006.pdf

Other Sources:

http://www.dayhoff.cc/ Dr. Margaret Oakley Dayhoff — Pioneer in Bioinformatics; has more extensive bibliographies and biographical information. And family photos.
http://www.springerlink.com/content/9w1118639vl11603/ Margaret Oakley Dayhoff 1925-1983

http://books.google.com/books?id=J5grAAAAYAAJ&pg=PP1&output=html Use of computers in biology and medicine report.
http://en.wikipedia.org/wiki/Punch_cards Punch card image
http://www.columbia.edu/acis/history/krawitz/index.html punch card machines
http://www.uniprot.org/ UniProt
http://pir.georgetown.edu/pirwww/index.shtml PIR
http://www.yale.edu/history/faculty/strasser.html Bruno Strasser homepage
http://videocast.nih.gov/Summary.asp?File=14412 GenBank Anniversary talks
http://www.biology.arizona.edu/biochemistry/problem_sets/aa/Dayhoff.html One letter code by Dayhoff
http://www.molecularevolution.org/mbl/resources/models/aamodels.php More on matrices and substitutions
http://www.inf.ethz.ch/personal/gonnet/DarwinManual/node146.html More on matrices and substitutions

+++++++++++

This post was inspired by See Jane Compute, in support of Ada Lovelace Day 2009.  It will also be submitted to The Giant’s Shoulders blog carnival.

To see the Mash Up of Ada Lovelace Day posts by location, topic, or as a list go here: http://ada.pint.org.uk/

Comments

Comment from gioby
Time October 28, 2010 at 10:54 AM

Very nice article, thanks. I named my computer at work after Margaret Dayhoff :-)

Comment from Mary
Time October 28, 2010 at 11:01 AM

Thanks gioby! And you have excellent taste in names.

It was actually quite fun to look for the earliest programs. I had a good time with that. I love the old literature.
