Make the link between specimens and genetic products as transparent as possible. #collectingisessential


I think the biggest issue molecular systematists will face in the future is the pervasiveness of genetic sequences from misidentified organisms. Our inability to track down the actual animal, plant, or fungus that many of our published genetic sequences come from is already a huge problem. We blindly download sequences from GenBank and other sources, trusting that this COI sequence from Coryphaena hippurus is actually from a dolphinfish and not an actual dolphin. Although there are lots of checks and balances to make sure that sequences of COI are actually from COI, and that these sequences lack stop codons, there isn’t much to check the accurate identification of the source specimen (a.k.a. the voucher). Very few of the millions of sequences on GenBank can be tracked back to an organism deposited in a collection. If you download that sequence of dolphinfish and put it in your phylogeny and it falls out somewhere really weird, then you can do a BLAST search and it will tell you that it is actually probably from a dolphin (if that is the case). But most of us are working at a much finer phylogenetic scale. We are building phylogenies of genera, families, or orders. At this finer scale it is very easy for someone to misidentify a species. (Google images of species from your favorite genus and see how similar congeners are in appearance.) Sequences generated from a misidentified specimen can easily end up on GenBank and often in the phylogeny of the day, and they do.

A wise mentor of mine used to say that every museum label has a question mark after the identification. I would say the same about the identifications on GenBank sequences, except that the nice thing about museum specimens is that an expert on the group can one day come and positively identify the species based on the specimens. Not so with most GenBank data. The sequence is out there, and if you have the right genus but not the right species it is highly unlikely that someone will catch that error anytime soon. It is quite likely, in fact, that many others will continue using those sequences, inadvertently perpetuating the error.

My colleagues and I want to put the power of identification back in the hands of taxonomists. I think it is safe to say that most people uploading sequences to GenBank are not experts in the organisms they are studying. They are trusting someone else to do positive identifications for them, or they have collected some tissue samples from something they think they know very well but may not. I see studies all the time where the animals were collected, tissue samples were taken, and the carcasses of those animals were discarded (rather than kept in a permanent collection). That’s really not good, and a waste. First of all, names change all the time. What if the study species is later split into two species; which one was the one you collected? What is needed is an unbroken and transparent link between the voucher and the genetic data.

Based on our observations that few researchers provide a link between their genetic data and collections data, either on GenBank or in their publications, a few of us [Carole Baldwin (Curator at the Smithsonian), Larry Page (Curator at the Florida Museum of Natural History) and my student and I] got together and created a nomenclature that should help remind people of the importance of making the link between specimens and genetic products as transparent as possible. The nomenclature is called GenSeq and it works something like this: The quality (trustworthiness if you prefer) of any given genetic sequence is based on how likely the identification of the voucher specimen is to be correct. In our nomenclature the highest ranked, best sequences come from primary type specimens, like a holotype. A holotype cannot be misidentified because it is the main specimen chosen to represent the species when it was first described. Sequences from the primary types are ranked as genseq-1, the highest ranking in our system, because those sequences are from specimens with the highest likelihood of being correctly identified. Sequences from secondary types (other specimens used in the original description of the species and designated as paratypes or other secondary types) are ranked second, genseq-2. This is followed by specimens from the type locality (the same locality where those primary type specimens were collected), genseq-3. Specimens that are vouchered but not from the type locality or type series as above are in genseq-4. Most sequences from specimens positively identified and deposited in permanent collections will be in that fourth category. The last category, genseq-5, is for photo vouchers. Although not ideal, it is sometimes necessary to release an organism that you have taken a tissue sample from (think of something really big, or very rare).
Photo vouchers are also necessary in cases where specimens are so small that the body of the organism is destroyed in the process of sampling its DNA. In these rare cases a photo is the best you can do to keep a record for identifying the species in the future. (Read more about photo vouchers versus actual specimens on Twitter using #collectingisessential, or see our paper here; you almost always need more than a photo to positively identify most creatures.)
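
For anyone who keeps their sample metadata in a script or database, the ranking described above can be written down as a small decision rule. This is only a rough sketch in Python, not an official GenSeq tool: the `Voucher` record and its field names are made up for illustration, and it assumes that a genseq-3 sequence comes from a specimen that was actually deposited in a collection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voucher:
    """Hypothetical voucher record; the field names are illustrative only."""
    is_primary_type: bool = False        # e.g. the holotype
    is_secondary_type: bool = False      # e.g. a paratype
    from_type_locality: bool = False     # collected where the primary type was collected
    specimen_in_collection: bool = False # specimen deposited in a permanent collection
    photo_only: bool = False             # photo voucher, no specimen retained


def genseq_rank(v: Voucher) -> Optional[str]:
    """Return the GenSeq tag for a voucher, or None if there is no voucher link."""
    if v.is_primary_type:
        return "genseq-1"
    if v.is_secondary_type:
        return "genseq-2"
    # Assumption: type-locality material only counts when a specimen was kept.
    if v.specimen_in_collection and v.from_type_locality:
        return "genseq-3"
    if v.specimen_in_collection:
        return "genseq-4"
    if v.photo_only:
        return "genseq-5"
    # No specimen and no photo: the sequence gets no GenSeq tag at all.
    return None


if __name__ == "__main__":
    print(genseq_rank(Voucher(specimen_in_collection=True)))
    print(genseq_rank(Voucher()))
```

The checks run from the most trustworthy category down, so a specimen that qualifies for more than one category always receives its best rank.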

Anything without a clear link to a voucher (specimen or photo) doesn’t get one of these GenSeq tags. We recommend that systematists avoid using genetic data for which the source is unknown or unrecorded (and which therefore lacks a GenSeq tag). After all, wouldn’t you prefer to know all the sequences you used were from organisms that were correctly identified by an expert? In an age where there is so much data available from so many species it is time to be picky. If you can choose between a sequence uploaded by someone who ‘kind of sorta’ thinks it is from a coelacanth, and one where the voucher was identified by an expert and the specimen deposited in a permanent collection (where you can check it yourself if you need to) - wouldn’t you always choose the latter?
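
Being picky in this way is easy to automate once sequences carry a tag. Here is a minimal sketch, assuming each sequence is a plain dictionary with hypothetical `species`, `accession`, and `genseq` keys; the accession strings are placeholders, not real GenBank numbers.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def rank_number(tag: str) -> int:
    """Turn a tag such as 'genseq-4' into the integer 4 (lower is better)."""
    return int(tag.rsplit("-", 1)[1])


def pick_best_vouchered(records: List[Record]) -> Dict[str, Record]:
    """Keep, for each species, the tagged sequence with the best GenSeq rank.

    Sequences without a GenSeq tag are skipped entirely.
    """
    best: Dict[str, Record] = {}
    for rec in records:
        tag = rec.get("genseq")
        if not tag:
            continue  # no voucher link, so do not use it
        species = rec["species"]
        current = best.get(species)
        if current is None or rank_number(tag) < rank_number(current["genseq"]):
            best[species] = rec
    return best


if __name__ == "__main__":
    # Placeholder data for illustration only.
    candidates = [
        {"species": "Coryphaena hippurus", "accession": "ACC0001", "genseq": None},
        {"species": "Coryphaena hippurus", "accession": "ACC0002", "genseq": "genseq-4"},
        {"species": "Coryphaena hippurus", "accession": "ACC0003", "genseq": "genseq-2"},
    ]
    for species, rec in pick_best_vouchered(candidates).items():
        print(species, rec["accession"], rec["genseq"])
```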

Right now many systematists, whether they admit it or not, will simply leave a sequence out of a phylogeny if it initially ends up in a clearly “wrong” place on the phylogenetic tree. That isn’t very scientific. Knowing it is in the “wrong” place in the first place is subjective. Of course most systematists will do a BLAST search and try to find alternative sequences, but the point is that the most rigorous approach is to make sure the specimen it came from was positively identified in the first place. We hope this GenSeq nomenclature is a shortcut to doing just that.

We have published the GenSeq nomenclature here, and below is a poster we are presenting at some meetings that gives an overview of the idea. We hope people use it (by including a table in their paper linking GenBank numbers to voucher numbers and adding these GenSeq ranks). Too often people write in their publications something like: “Sequences were uploaded to GenBank and correspond to JK123332-JK678738.” That sentence tells us almost nothing. Not only do we not know which specimen goes with which sequence, we don’t even know which species goes with which sequence. If you then go to GenBank and download those sequences you will have the species, but little else (most people do not report the voucher numbers, if there are any). That’s a shame and something we hope to change. So the take-home message is: always link your genetic data to your voucher’s information!
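
The kind of table we have in mind can be generated straight from your own records. This is just one possible layout, sketched in Python with the standard `csv` module; the column names are my suggestion rather than a required format, and the accession and catalog numbers are placeholders.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["Species", "GenBank accession", "Voucher catalog number", "GenSeq rank"]


def voucher_table(rows: List[Dict[str, str]]) -> str:
    """Build a tab-separated table linking each sequence to its voucher."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for r in rows:
        writer.writerow([r["species"], r["accession"], r["voucher"], r["genseq"]])
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder values for illustration; substitute your real numbers.
    rows = [
        {"species": "Coryphaena hippurus", "accession": "ACC0001",
         "voucher": "MUSEUM 00001", "genseq": "genseq-4"},
        {"species": "Coryphaena hippurus", "accession": "ACC0002",
         "voucher": "MUSEUM 00002", "genseq": "genseq-3"},
    ]
    print(voucher_table(rows))
```

One row per sequence means a reader can go from any accession number to the exact specimen it came from, and back again.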


P.S. - the 'Journal of Fish Biology' and 'ZooKeys' have added the GenSeq nomenclature recommendations to their Instruction to Authors.