Monday, September 10, 2007

Marjanovic & Laurin 2007

Marjanovic, D. and M. Laurin. 2007. Fossils, molecules, divergence times, and the origin of Lissamphibians. Syst. Biol. 56(3): 369-388.

"..a literal interpretation of the fossil record always underestimates the date of appearance of taxa because it can only give a latest possible date of appearance, not an earliest possible date of appearance..."

At first this quote didn't make sense to me, but now I think it means the following: when you find a fossil and date it, you know the species existed at that time. The speciation event therefore cannot have happened after that date, but it could have happened before it. The fossil record cannot give an earliest possible date of appearance, because an older fossil can always turn up and push the date back.

Some divergence dating methods that the authors used:
1) Multidivtime (Thorne and Kishino 2002)
2) QDate 1.11 (Rambaut and Bromham 1998)
3) r8s 1.71 (Sanderson 2003, 2006) using the penalized likelihood method
4) PATHd8 (Anderson 2006)

They couldn't get Multidivtime to work. I've heard that QDate is a bad method and that r8s is okay. I had never heard of PATHd8, and they said its results didn't make sense.

-neighbour-joining trees are phenograms, not cladograms

On page 383, they discuss something odd. They say that according to Kolaczkowski and Thornton 2004, parsimony does better than ML and Bayesian methods because parsimony does not need an assumption about how many rate categories there are. For instance, in many real cases each nucleotide position evolves at its own speed, which can cause problems for approaches that include evolution models. I don't think I've heard this argument for parsimony before, and I will have to read the Kolaczkowski paper to get a better handle on it. They also state that the branch lengths of the parsimony tree fit the morphological data better than those of the likelihood tree. These seem like odd statements to justify using parsimony instead of ML or Bayesian methods for molecular data. I'm surprised that the reviewers didn't catch this.

3 comments:

David Marjanović said...

Hi... I'm the first author. Thanks a lot to Darren Naish for alerting me to your post.

You have correctly understood the quote about what fossils say on divergence dates. A clade can exist before it appears in the fossil record, but the reverse is impossible, so fossils give latest possible dates for cladogeneses. That's ancient wisdom that we merely repeated to make sure everyone understood we had understood it. :-)

Parsimony is more prone to long-branch attraction than ML and Bayesian analysis. Kolaczkowski & Thornton (2004) confirmed this once again. Still, there are conditions under which parsimony nevertheless works better than the other methods. This happens when the nucleotides in the dataset have many different speeds of evolution (and when there's not too much long-branch attraction in the dataset, obviously). In ML at least (I'm not familiar with MrBayes), you can tell the program how many rate categories you want it to recognise in your data, but the default is just four, which is certainly ridiculous in many cases, and increasing the number of categories drastically increases the calculation time. (At least one of our calculations already took a week...) Parsimony doesn't have this problem because it doesn't assume that the evolution speeds of any two nucleotides are correlated. That's a short summary of Kolaczkowski & Thornton (2004); do read the paper and the supp. inf.
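To make the point about rate categories concrete, here is a minimal sketch of Fitch parsimony in Python. It is an illustration only, not the analysis from the paper: the four-taxon alignment, the taxon names and the function names are all made up. The thing to notice is that each alignment column is scored on its own and no rate parameter appears anywhere.

```python
def fitch_site(tree, states):
    """Score one alignment column on a rooted binary tree (Fitch's algorithm).

    `tree` is a nested tuple of taxon names, `states` maps taxon -> nucleotide.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):  # a leaf: just its observed nucleotide
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # the children agree on at least one state: no extra change
        return shared, left_changes + right_changes
    # the children disagree: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Sum the per-column scores; every column is treated independently."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy data, invented for this example.
alignment = {"A": "ACGT", "B": "ACGA", "C": "ATGA", "D": "ATCA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_length(tree_1, alignment))  # 3
print(parsimony_length(tree_2, alignment))  # 4
```

A likelihood analysis would instead have to choose a substitution model and a number of rate categories before it could score the same two trees.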

On the other hand, we -- as well as our reviewers, who were for the most part paleontologists -- were completely unaware of the fact that Kolaczkowski & Thornton (2004) was just the beginning of an enormous battle in the literature that continues to the present day. One day I'll have to start reading up on it. :-]

PATHd8 looks promising based on the papers that have used it so far, but on our dataset -- which was probably just too small for PATHd8 -- it produced zero-time branches (as explained in our lengthy appendices). This is clearly nonsense, so we gave it up.

QDate is a very primitive program. As its name says, all it does is quartet-dating: you give it a symmetric tree with four terminal taxa, you give it dates for the two upper nodes, and it calculates the age of the root node. This is usually unsuitable for real data. Unlike r8s, however, it gives a confidence interval, which is great. We made an entire appendix about the QDate results.
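The idea of quartet dating can be sketched with plain pairwise distances. This is a deliberately simplified, distance-based stand-in and not QDate's actual likelihood calculation: it assumes that each half of the tree has its own constant rate all the way from the root to the tips, it uses uncorrected distances, and it gives no confidence interval. The function name and all numbers are hypothetical.

```python
def quartet_root_age(d_ab, d_cd, cross_distances, age_ab, age_cd):
    """Estimate the root age of the quartet ((A,B),(C,D)).

    d_ab, d_cd      -- pairwise distances within each dated pair
    cross_distances -- distances between the pairs (A-C, A-D, B-C, B-D)
    age_ab, age_cd  -- known ages of the two upper nodes
    """
    # Two lineages diverge from each dated node, hence the factor 2.
    rate_ab = d_ab / (2 * age_ab)
    rate_cd = d_cd / (2 * age_cd)
    # A path between the two pairs runs up one side at rate_ab and down the
    # other at rate_cd, each for the full root age T:
    # distance = (rate_ab + rate_cd) * T
    mean_cross = sum(cross_distances) / len(cross_distances)
    return mean_cross / (rate_ab + rate_cd)


# Hypothetical numbers: ages in million years, distances in substitutions per site.
root_age = quartet_root_age(
    d_ab=0.10, d_cd=0.12,
    cross_distances=[0.30, 0.30, 0.30, 0.30],
    age_ab=50.0, age_cd=30.0,
)
print(round(root_age, 3))  # approximately 100
```

With only four taxa and two calibrations, everything hangs on those two rate estimates, which is one way to see why the approach is usually unsuitable for real data.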

In order to use MultiDivTime you have to be a certified UNIX geek. I'm not one, and Michel isn't one either, so we had to give up, despite its interesting method. On the other hand, some suspect that it generally gives dates that are too old, and the way it treats calibration points -- as a point estimate with a standard deviation -- is unrealistic (because, as you wrote, the fossil record gives lower bounds, not midpoints).
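The difference between the two ways of treating a calibration can be shown with a few lines of Python. This is a sketch under an assumption: the "point estimate with a standard deviation" is modelled here as a normal distribution centred on the fossil age, which may not match what any particular program does internally. The fossil age and standard deviation are made-up numbers.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_younger_than_fossil(fossil_age, sd):
    """Probability mass a normal calibration centred on the fossil age
    places on node ages YOUNGER than the fossil, which are impossible."""
    return normal_cdf(fossil_age, fossil_age, sd)


def respects_minimum_age(node_age, fossil_age):
    """A hard lower bound: the node must be at least as old as the fossil."""
    return node_age >= fossil_age


fossil_age = 250.0  # hypothetical, in million years
print(prob_younger_than_fossil(fossil_age, sd=10.0))  # 0.5
print(respects_minimum_age(240.0, fossil_age))        # False
print(respects_minimum_age(300.0, fossil_age))        # True
```

Whatever the standard deviation, a distribution centred on the fossil puts half of its weight on ages the fossil has already ruled out, whereas a minimum-age bound excludes exactly those ages and nothing else.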

You have a paragraph that consists only of "-neighbour-joining trees are phenograms, not cladograms". What do you mean by it? (It's true, BTW. Like all distance methods, NJ is phenetics, and we said so, because -- astonishingly -- many of the molecular folks don't know that.)
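For readers who want to see why NJ counts as a distance method, here is a minimal sketch of its first joining step in Python. The input is nothing but a matrix of pairwise distances; no character states or shared derived characters enter the calculation. The five-taxon matrix and the function name are illustrative, not data from the paper.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Return the pair of taxa that neighbour-joining joins first, with its Q value.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k);
    the pair with the smallest Q is joined.
    """
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (names[i], names[j]), q
    return best_pair, best_q


# Illustrative symmetric distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_first_join(names, dist))  # (('a', 'b'), -50)
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the criterion is built from overall distances only, which is what makes the result a phenogram.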

Oh, and... are you an inspiring systematist, or an aspiring systematist? Or even both? :o)

Tonya said...

Hi David,

Thanks for all the great comments!

I'll definitely have to read the K&T paper because I'm still not convinced that parsimony will work better than other methods for nucleotide data, especially with the use of BayesPhylogenies, where the user can specify the number of partitions in the dataset (i.e. all slow nucleotides in one partition, fast in another).

One dating method that is emerging as a good one is BEAST http://beast.bio.ed.ac.uk/

"-neighbour-joining trees are phenograms, not cladograms"

I just typed that to remind myself. I wasn't contradicting you. :)

David Marjanović said...

As mentioned, it's complicated. On the one hand, parsimony is more prone to long-branch attraction than ML and Bayesian analysis. On the other, parsimony basically gives each nucleotide its own rate category -- "all slow nucleotides in one partition, fast in another" is a vast oversimplification that certainly leads to trouble in some cases, and the usual 4 (rather than 2) rate categories are still a vast oversimplification.

There doesn't seem to be a win-win situation here. I recommend always trying both. :-)

Also, a point can be made that the most parsimonious tree is "what the data say at face value" and that it should therefore simply be published, even though it shouldn't necessarily be proclaimed as gospel truth.