I think the biggest issue molecular systematists will face
in the future will be the pervasiveness of genetic sequences from
misidentified organisms. Our inability to track down the actual animal, plant,
or fungus that many of our published genetic sequences come from is already a
huge problem. We blindly download sequences from GenBank and other sources,
trusting that this COI sequence from Coryphaena
hippurus is actually from a dolphinfish and not an actual dolphin.
Although there are lots of checks and balances to make sure that sequences of
COI are actually from COI, and that these sequences lack stop codons, there isn’t
much to check the accurate identification of the source specimen (a.k.a. the
voucher). Very few of the millions of sequences on GenBank can be tracked back to
an organism deposited in a collection. If you download that sequence of
dolphinfish and put it in your phylogeny and it falls out somewhere really weird,
then you can do a BLAST search and it will tell you that it is
probably from a dolphin (if that is the case). But most of us are working at a
much finer phylogenetic scale. We are building phylogenies of genera, families,
or orders. At this finer scale it is very easy for someone to misidentify a
species. (Google images of species from your favorite genus and see how similar
congeners are in appearance.) Sequences generated from a misidentified specimen
can easily end up on GenBank and often in the phylogeny of the day, and they do.
A wise mentor of mine used to say that every museum label has a
question mark after the identification. I would say the same about the
identifications on GenBank sequences except the nice thing about the museum
specimens is that an expert on the group can one day come and positively identify
the species based on the specimens. Not so with most GenBank data. The sequence
is out there, and if you have the right genus but not the right species, it is
highly unlikely that someone will catch that error anytime soon. It is quite likely, in fact,
that many others will continue using those sequences, inadvertently
perpetuating the error.
My colleagues and I want to put the power of identification
back in the hands of taxonomists. I think it is safe to say that most people
uploading sequences to GenBank are not experts in the organisms they are
studying. They are trusting someone else to do positive identifications for
them, or they have collected some tissue
samples from something they think they know very well but may not. I see
studies all the time where the animals were collected, tissue samples were taken, and
the carcasses of those animals were discarded (rather than
kept in a permanent collection). That’s really not good, and a waste. First of all, names
change all the time. What if the study species is later split into
two species? Which one was the one you collected? What is needed is an unbroken and transparent
link between the voucher and the genetic data.
Based on our observations that few researchers provide a
link between their genetic data and collections data, either on GenBank or in
their publications, a few of us [Carole Baldwin (Curator at the Smithsonian), Larry Page (Curator at the Florida Museum of Natural History), and my student and me] got
together and created a nomenclature that should help remind people of the
importance of making the link between specimens and genetic products as
transparent as possible. The nomenclature is called GenSeq and it works
something like this: the quality
(trustworthiness, if you prefer) of any given genetic sequence is based on how
likely it is that the identification of the voucher specimen is correct. In our nomenclature the highest ranked, best
sequences come from primary type specimens, like a holotype. A holotype cannot
be misidentified because it is the main specimen chosen to represent the
species when it was first described. Sequences from the primary types are
ranked as genseq-1, the highest ranking in our system, because those sequences
are from specimens with the highest likelihood of being correctly identified.
Sequences from secondary types (other specimens used in the original
description of the species and designated as paratypes or other secondary
types) are ranked second, genseq-2. This is followed by specimens from the type
locality (same locality as where those primary type specimens were collected),
genseq-3. Specimens that are vouchered but not from the type locality or type
series as above are in genseq-4. Most sequences from specimens positively
identified and deposited in permanent collections will be in that fourth category.
The last category, genseq-5, is for photo vouchers. Although not ideal it is
sometimes necessary to release an organism that you have taken a tissue sample
from (think of something really big, or very rare). Also photo vouchers are
necessary in cases where specimens are so small that the body of the organism
is destroyed in the process of sampling its DNA. In these rare cases a photo is
the best you can do to keep a record for identifying the species in the future. (Read more about photo vouchers versus actual specimens on Twitter using #collectingisessential, or
see our paper here; you almost always need more than a photo to positively identify most creatures.)
Anything without
a clear link to a voucher (specimen or photo) doesn’t get one of these GenSeq
tags. We recommend that systematists avoid using genetic data for which the
source is unknown or unrecorded (and which therefore lack a GenSeq tag). After all, wouldn’t you prefer to know all the
sequences you used were from organisms that were correctly identified by an
expert? In an age where there is so much data available from so many species it
is time to be picky. If you can download either a sequence uploaded by someone who ‘kind
of sorta’ thinks it is from a coelacanth, or one where the voucher was
identified by an expert and the specimen deposited in a permanent collection (where
you can check it yourself if you need to), wouldn’t you always choose the
latter?
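As a rough sketch, the ranking rules above could be written as a small Python function. The `Voucher` class, its field names, and the `genseq_rank` function are made up here for illustration and are not part of the published nomenclature. The sketch also assumes that any photo-only voucher is genseq-5, whatever it is a photo of, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voucher:
    """Hypothetical description of the voucher behind a sequence."""
    kind: str                          # "specimen" or "photo"
    type_status: Optional[str] = None  # "primary", "secondary", or None
    from_type_locality: bool = False


def genseq_rank(voucher: Optional[Voucher]) -> Optional[str]:
    """Return the GenSeq tag for a voucher, or None if no tag applies."""
    if voucher is None:
        # No clear link to a voucher: the sequence gets no GenSeq tag.
        return None
    if voucher.kind == "photo":
        # Assumption for this sketch: every photo-only voucher is genseq-5.
        return "genseq-5"
    if voucher.kind != "specimen":
        return None
    if voucher.type_status == "primary":    # e.g. a holotype
        return "genseq-1"
    if voucher.type_status == "secondary":  # e.g. a paratype
        return "genseq-2"
    if voucher.from_type_locality:
        return "genseq-3"
    # Vouchered, but not from the type series or the type locality.
    return "genseq-4"
```

For example, `genseq_rank(Voucher(kind="specimen"))` gives `"genseq-4"`, the category most vouchered sequences would fall into, while `genseq_rank(None)` gives `None` for a sequence with no traceable voucher.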
Right now many systematists, whether they admit it or not,
will simply leave sequences out of a phylogeny if they initially end up in a clearly
“wrong” place on the phylogenetic tree. That isn’t very scientific. Deciding that a sequence
is in the “wrong” place to begin with is subjective. Of course most systematists
will do a BLAST search and try to find alternative sequences, but the point is
that the most rigorous approach is to make sure the specimen it came from was
positively identified in the first place. We hope this GenSeq nomenclature is a
shortcut to doing just that.
We have published the GenSeq nomenclature here. And below is a poster we are
presenting at some meetings that gives an overview of the idea. We hope people
use it (by including a table in their paper linking GenBank #s to voucher #s
and adding these GenSeq ranks). Too often people write in their publications
something like: “Sequences were uploaded to GenBank and correspond to
JK123332-JK678738.” That sentence tells us almost nothing. Not only do we not
know what specimen might go with which sequences, we don’t even know which
species goes with which sequences. If you then go to GenBank and download those
sequences you will have the species - but little else (most people do not report
the voucher numbers, if there are any). That’s a shame and something we hope to
change. So the take-home message is: Always
link your genetic data to your voucher’s information!
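To make the suggested table concrete, here is a minimal sketch of how one might build it in Python and spot sequences that lack a voucher. The column names, the helper functions, and the accession and catalog numbers are all placeholders invented for this example, not real GenBank or museum records and not a required format.

```python
import csv
import io

# Hypothetical column names for the GenBank-to-voucher table.
FIELDS = ["species", "genbank_accession", "voucher_catalog_number", "genseq_rank"]


def voucher_table(records):
    """Return CSV text with one row per sequence."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing values are written as empty cells so gaps stay visible.
        writer.writerow({field: record.get(field) or "" for field in FIELDS})
    return buffer.getvalue()


def unlinked(records):
    """List accessions that have no voucher catalog number."""
    return [
        record["genbank_accession"]
        for record in records
        if not record.get("voucher_catalog_number")
    ]


# Placeholder records: the accession and catalog numbers are not real.
records = [
    {
        "species": "Coryphaena hippurus",
        "genbank_accession": "XX000001",
        "voucher_catalog_number": "MUSEUM 12345",
        "genseq_rank": "genseq-4",
    },
    {
        "species": "Coryphaena hippurus",
        "genbank_accession": "XX000002",
        "voucher_catalog_number": None,
        "genseq_rank": None,
    },
]

print(voucher_table(records))
print("No voucher recorded for:", unlinked(records))
```

A table like this answers both questions the one-line accession range cannot: which species goes with which sequence, and which specimen it came from.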
P.S. - the 'Journal of Fish Biology' and 'ZooKeys' have added the GenSeq nomenclature recommendations to their Instructions to Authors.