Basic diversity dynamics of scleractinian corals with divDyn


Kocsis, Adam

Goals and introduction

The purpose of this vignette is to guide new users through data acquisition and analysis, as well as to demonstrate the basic capabilities of the divDyn R package (Kocsis et al. 2019). This tutorial uses scleractinian corals as an example group, selected both for data quality reasons and because we have readily available additional information about the coral taxa with which to customize the analyses.

Occurrence data download

To run analyses, first the occurrence data have to be downloaded and processed. You can download all PBDB occurrences using the chronosphere R package. The chronosphere is an initiative we started at the GeoZentrum Nordbayern in Erlangen to facilitate the use, traceability, portability and version control of data in scientific analyses.

The package is available from the CRAN repositories and can be installed with the following line of code (you will probably have to select your preferred download mirror).

install.packages("chronosphere")


After the package is installed, it is ready for use, you can attach it with:

library(chronosphere)

Chronosphere - Evolving Earth System Variables
Important: never fetch data as a superuser / with admin. privileges!

Note that the package was split for efficient maintenance and development:
 - Plate tectonic calculations -> package 'rgplates'
 - Arrays of raster and vector spatials -> package 'via'
dat <- fetch("pbdb", ver="20230616")

Accessing chronos registry tables.
Downloading registry tables.

Item no. 676, dat:pbdb, var: occs3, ver: 20230616, res: species.

Downloading import code.

Downloading data file.
MD5 checksum passed.

If you use the data in publications, please cite its
reference(s), as well as that of the 'chronosphere' project.
Please remember to acknowledge the Paleobiology Database in your publication.
Alternatively, you can use: The Paleobiology Database Community. (year). Denormalized occurrence table of the Paleobiology Database. Zenodo.

This is a pre-processed version of the PBDB that was downloaded with this API call:

[1] ",Holocene&show=class,classext,genus,subgenus,coll,coords,loc,paleoloc,strat,stratext,lith,env,ref,crmod,timebins,timecompare,refattr,entname,attr,ident,img,plant,abund,ecospace,taphonomy,etbasis,pres,prot,lithext,geo,methods,resgroup,refattr,ent"

You can copy and paste this URL into your browser, and the PBDB will compile the current corresponding dataset for you. It normally takes about 15-30 minutes to get the dataset, and another couple of minutes to download the .csv, which you then have to read in using read.table().
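If you take this manual route, reading the downloaded file in might look like the sketch below (the file name is hypothetical; substitute the path of your own download):

```r
# hypothetical file name - substitute the path where you saved the .csv
rawPBDB <- read.table("pbdb_occurrences.csv", sep=",", header=TRUE,
  quote="\"", stringsAsFactors=FALSE, fill=TRUE)
```

The quote and fill arguments help with free-text fields (e.g. comments) that can contain commas and uneven numbers of entries.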

The PBDB API allows much more flexible downloads, but this denormalized version of the PBDB is sufficient for the overwhelming majority of analyses. There are some cases when the data you are interested in are not in this subset and you have to use the API yourself.
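For such cases, a direct query can be written against the PBDB data service (version 1.2). The sketch below is only an illustration, not part of this tutorial's workflow; consult the data service documentation for the full set of parameters:

```r
# a minimal direct request to the PBDB data service (v1.2)
apiURL <- paste0(
  "?base_name=Scleractinia&show=class,coords")

# read.csv() downloads and parses the response in one step
sclerDirect <- read.csv(apiURL, stringsAsFactors=FALSE)
```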

Data processing

Installing and loading divDyn

To process the stratigraphic information in the downloaded occurrence file and do the analyses, you have to install and attach the divDyn package.

The package is available from the CRAN repositories and can be installed with the following line of code (you will probably have to select your preferred download mirror).

install.packages("divDyn")


After the package is installed, it is ready for use, you can attach it with:

library(divDyn)

v0.8.2. See latest changes with 'news(package="divDyn")'. Default: 
Time flows forward with 'bin', and backward with 'age' and 'slice()'.

Initial filtering

The downloaded dataset needs to be subsetted to represent the order of scleractinian corals.

dat <- dat[dat$order=="Scleractinia",]

The analyses will be conducted at the level of genera.

The entries that have no valid genus names in the PaleoDB can be omitted at this stage, as they do not provide any information about the evolution of the group.

# omit non-genus entries
dat <- dat[dat$genus!="", ]

The dataset should be cleaned as much as possible at this stage; for didactic purposes and to keep things simple, this step is skipped in this tutorial.

Stratigraphic resolution

Every collection in the PaleoDB has user-entered 'early_interval' and 'late_interval' fields. These fields are used to place the collection in stratigraphy/geological time. The 'early_interval' field is mandatory for every entry, while the 'late_interval' is only necessary if the collection does not fit in the interval specified as 'early_interval'. Otherwise the 'late_interval' field is blank (empty quotes: "").

The divDyn package uses two frequently used, discrete time scales that we use as the finest stratigraphic resolution of our analyses: the stage level time scale (stages, 95 bins) and the ten million year time scale (tens, 49 bins). These can be attached with the data() function:

data(stages)
data(tens)


These will be useful for plotting as they contain the age estimates for the intervals, but they are not too useful for finding which interval the 'early_interval' and 'late_interval' fields point to. This can be found in the keys object.

data(keys)


This list is designed to be used with the categorize() function of the package, which takes the large number of possible interval entries and replaces them with the name of the group they belong to. This has to be run on both the 'early_interval' and the 'late_interval' fields.

# the 'stg' entries (lookup)
stgMin <- categorize(dat[ ,"early_interval"], keys$stgInt)
stgMax <- categorize(dat[ ,"late_interval"], keys$stgInt)

The output of this function is either an interval number that corresponds to the stages timescale, -1 (indicating that the value was empty quotes), or NA if the interval is either not in the lookup table or is too broad.

Because this scheme is designed for general use, the output of this function is a vector of character values. These have to be converted to numeric values, otherwise their order would not be correct.

# convert to numeric
stgMin <- as.numeric(stgMin)
stgMax <- as.numeric(stgMax)
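The need for this conversion can be demonstrated with base R sorting alone (a standalone illustration, not part of the workflow):

```r
# character values sort lexicographically: "10" comes before "9"
sort(c("9", "10", "54"))
# [1] "10" "54" "9"

# numeric values sort as expected
sort(as.numeric(c("9", "10", "54")))
# [1]  9 10 54
```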

Now these two vectors have to be combined. There are several ways to handle stratigraphic uncertainty, but for the sake of simplicity, let’s just omit every occurrence that cannot be assigned to a single stage in the stages timescale. For that, we will create an empty container, check whether the interval entry for each occurrence satisfies the condition above, and if so, save the interval ID.

# empty container
dat$stg <- rep(NA, nrow(dat))

# select entries, where
stgCondition <- c(
# the early and late interval fields indicate the same stg
  which(stgMax==stgMin),
# or the late_interval field is empty
  which(stgMax==-1))

# in these entries, use the stg indicated by the early_interval
dat$stg[stgCondition] <- stgMin[stgCondition]

Now every occurrence entry either has a single integer number in the stg column, or NA as its interval identifier. Note that the code above cannot be used for Cambrian and Ordovician occurrences due to stratigraphic problems. If you are interested in assigning these to stage-level intervals, we recommend taking a look at the vignette we wrote that describes global marine, Phanerozoic-scale analyses.

The assignments can be repeated for the 10-million year bins with the following commands:

# a. categorize interval names to bin numbers
binMin <- categorize(dat[ ,"early_interval"], keys$binInt)
binMax <- categorize(dat[ ,"late_interval"], keys$binInt)

# convert to numeric
binMin <- as.numeric(binMin)
binMax <- as.numeric(binMax)

# b. resolve max-min interval uncertainty
# final variable (empty)
dat$bin <- rep(NA, nrow(dat))

# use entries, where
binCondition <- c(
  # the early and late interval fields indicate the same bin
  which(binMax==binMin),
  # or the late_interval field is empty
  which(binMax==-1))

# in these entries, use the bin indicated by the early_interval
dat$bin[binCondition] <- binMin[binCondition]

Sampling assessment

You can assess how well the stratigraphic resolution turned out by running the table() function on the interval-identifier vector:

table(dat$stg)


  49   54   55   56   57   58   59   60   61   62   63   64   65   66   67   68 
   5  114   72  446  910  602  101   86  248   80   87  493  383  231 1953 1149 
  69   70   71   72   73   74   75   76   77   78   79   80   81   82   83   84 
 988  157  288  207  483 1023  390  537  128   71  158  295  745  396  289  295 
  85   86   87   88   89   90   91   92   93   94   95 
 294  465  651  519 1193 1772 1185 1466 1903 7829 1201 

Summing it tells us how many of the overall occurrences we can assign to stratigraphic stages.

sum(table(dat$stg))
[1] 31888

which is a

sum(table(dat$stg))/nrow(dat)
[1] 0.8532591

proportion of the total data.

As we cannot use unresolved occurrences in any way, coding will be easier if we omit them at this stage.

# omit unbinned
dats <- dat[!$stg), ]

If you take a look at the table above, you can notice that, besides the few entries in stages 37, 49 and 50, scleractinians only become important fossils in the Anisian stage.

# omit Paleozoic
dats <- dats[dats$stg>52,]

You can get a more organized view of sampling parameters by running the binstat() function in divDyn, which calculates the occurrence, collection, and reference counts in a single line. This illustrates the general use of the high-level functions of the package: you provide the occurrence data.frame as the first argument, and then the column names as additional arguments. The argument tax stands for the taxon names, bin for the discrete time bins, coll for collection identifiers and ref for reference identifiers.

bsFull <- binstat(dats, tax="genus", bin="stg", 
  coll="collection_no", ref="reference_no")
The database contains duplicate occurrences (multiple species/genus).
bsFull$occs
 [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[16]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[31]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[46]   NA   NA   NA   NA   NA   NA   NA   NA  114   72  446  910  602  101   86
[61]  248   80   87  493  383  231 1953 1149  988  157  288  207  483 1023  390
[76]  537  128   71  158  295  745  396  289  295  294  465  651  519 1193 1772
[91] 1185 1466 1903 7829 1201

This output is organized so that the indices of values in the vector match up with the bin identifiers (e.g. the 60th value is for stg 60). The default setting of the function will output a message about duplicate occurrences. This warns us that there are collections with more than one entry per genus (i.e. more than one species of the same genus in a collection). If you are interested in genus-level analyses, it is probably better to count these as one, which you can do with the duplicates=FALSE option.

bs <- binstat(dats, tax="genus", bin="stg", 
  coll="collection_no", ref="reference_no", duplicates=FALSE)
bs$occs
 [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[16]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[31]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[46]   NA   NA   NA   NA   NA   NA   NA   NA   99   66  355  769  444   77   72
[61]  228   77   74  413  291  206 1663  906  701  126  243  190  387  821  349
[76]  454  119   63  146  225  632  360  252  255  225  410  507  410  941 1192
[91]  955 1120 1474 6157  972


Plotting these variables is probably better than just looking at the numbers. The package includes a powerful time-scale plotting function that lets you visualize how time is broken down into discrete intervals. This highly customizable function is built on the basic plot() function, and most of its arguments are inherited. The following function call will draw a plot spanning the past 250 million years, with series-level shading and system-level boxes at the bottom:

tsplot(stages, boxes="sys", boxes.col="systemCol", 
    shading="series", xlim=c(250, 0), ylim=c(0,2000))

Then you can draw the number of occurrences with lines(). As the same row indices in the stages object and in the result of binstat() indicate values that belong to the same interval, you do not need any subsetting to align the two data.frames.

tsplot(stages, boxes="sys", boxes.col="systemCol", 
  shading="series", xlim=c(250, 0), ylim=c(0,2000), ylab="Number occurrences")
lines(stages$mid, bs$occs)

We will mostly use the same basic function call for plotting, and typing all these arguments is time consuming. We can save some time if we define a wrapper function that executes the plotting call with the arguments we want to keep the same, while allowing us to change the rest. This is fairly straightforward with the ... argument, which allows you to pass arguments through function calls.

tp <- function(...) tsplot(stages, boxes="sys", boxes.col="systemCol", 
  shading="series", xlim=52:95, ...)

Now, every time you call the newly defined function tp(), it will have the same colour, shading and x-axis interval, while you can change the other arguments of the tsplot() function. For instance, if you wanted to see how the number of collections changes over time, you can draw a similar plot as above with:

tp(ylim=c(0,350), ylab="Number of collections") 
lines(stages$mid, bs$colls)