Data exploration is an important step in data science. It mainly uses traditional statistical analysis to interpret data patterns, find stories and gain insights.
This section is adapted from the module, the tutorial and the book.
A community dataset records the presence/absence or quantity (count, cover or biomass) of each species in each sample (plot, site, etc.). Three matrices are used to organize these data: the matrix of species composition (L matrix, sample × species), the matrix of sample attributes such as environmental variables (R matrix, sample × sample attributes), and the matrix of species attributes, like species traits or species indicator values (Q matrix, species × species attributes).
After the data have been imported into R, it is useful to explore them first: check for missing values and outliers, check the range and type of the species and environmental data, apply transformations or standardizations if necessary, check for possible correlations between environmental variables, etc.
Missing data are elements in the matrix with no value, in R usually represented by NA (not available). Note that there is an important difference between 0 and NA: a species that was not recorded gets zero cover or abundance, whereas a value that could not be measured is missing. For example, if a pH meter broke and did not measure pH in some samples, you should not replace those values by 0, since that would imply the pH of those samples was extremely low rather than unknown.
As for samples with missing values, if there are lots of missing values scattered across different variables, the analysis will be based on rather few complete samples. One way to reduce this effect is to remove the variables with the highest proportion of missing values from the analysis. Another option is to replace the missing values by estimates if these can be reasonably accurate (mostly by interpolation, e.g. from similar plots, neighbours, values measured at the same time somewhere close, or values predicted by a model).
is.na() works on individual values, vectors, lists, and data frames; it returns TRUE wherever a value is NA and FALSE otherwise.
[1] "example" "data" "set"
[1] 3
You can use sum() and which() to figure out how many NAs there are and where they are located, and finally remove them, like this:
df <- data.frame(x = c(1, 2, 3, NA)) # a hypothetical data frame with an NA in row 4
which(is.na(df$x)) # locate the NA
[1] 4
# na.omit(df) # remove the rows with NAs
As for samples consisting entirely of 0 values (e.g. sites where no species was recorded), you should delete them if they affect results. Take the doubs dataset as an example. This data set gives environmental variables, fish species abundances and spatial coordinates for 30 sites. doubs is a list with 4 components.
data(doubs, package = "ade4") # load data from ade4 package
# class(doubs) # find the object types
# names(doubs) # check names of the list
Then extract the species, environmental and spatial data from the list:
# Species and environment data from doubs
species <- doubs$fish
# head(species)
# which(is.na(species)) # check missing values in a data set
# rowSums(species) == 0 # check the site without any species
spe <- species[-8,] # the site 8 without species is removed
environment <- doubs$env
env <- environment[-8,] # remove corresponding abiotic data for the site 8
Outliers are values within a given variable that are conspicuously different from the other values. An outlier can become quite influential in the analysis, so it is worth treating it in advance. First, spend a reasonable amount of time ensuring that such a value is not a typing error. If a sample really describes conditions that are rather different from the rest of the data set, it may be reasonable to remove it, since there may not be enough replication to describe this difference or phenomenon.
There are a number of ways to detect outliers. A simple exploratory data analysis (EDA) can reveal them graphically, e.g. using a box plot or a histogram. In a box plot, an outlier is defined as a value more than 1.5 times the interquartile range above the upper quartile (Q3) or below the lower quartile (Q1); the interquartile range is the range between the upper and lower quartiles: IQR = Q3 - Q1.
Visual approaches such as histograms, scatter plots (such as Q-Q plots), and box plots are the easiest methods to detect outliers. Let’s take a univariate dataset as an example and identify its outliers using visual approaches.
# x <- c(5, 8, 8, 12, 14, 15, 16, 19, 20, 22, 24, 25, 25, 26, 30, 48)
# boxplot(x) # show the outliers
You can also use boxplot() to hide the outliers in the plot, like this:
# boxplot(x, outline=FALSE) # suppress the outliers when drawing the plot
The interquartile range (IQR) is the distance between the 25th percentile (first quartile, Q1) and the 75th percentile (third quartile, Q3) of the data: IQR = Q3 - Q1. It can be used to detect outliers in three steps:
First, calculate Q1 and Q3 with summary():
# get values of Q1, Q3, and IQR
# summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 5.00 13.50 19.50 19.81 25.00 48.00
Then get the IQR with IQR() and calculate the thresholds:
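A minimal sketch of this step, using the quartiles shown in the summary() output above (Tmin and Tmax are the lower and upper fences referenced in the code below):

# iqr <- IQR(x) # 25.00 - 13.50 = 11.5
# Tmin <- quantile(x, 0.25) - 1.5 * iqr # lower fence: -3.75
# Tmax <- quantile(x, 0.75) + 1.5 * iqr # upper fence: 42.25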
Finally, detect the outliers and remove them:
# find outlier
# x[which(x < Tmin | x > Tmax)]
# [1] 48
# remove outlier
# x[which(x > Tmin & x < Tmax)]
Homework assignment: detect the outliers of the doubs dataset following the above procedure.
Transforming data is needed because statistical analyses and tests require that the residuals are normally distributed and have homogeneous variance, or because linear relationships are easier to interpret. A good indicator of whether data need to be transformed is plotting the values in a histogram and checking whether the distribution is symmetrical, right-skewed or left-skewed. Ecological data are often right-skewed because they are bounded by zero on the left. Several common transformations are as follows:
# y <- log10(x) # log transformation, for positively skewed data
# y <- log10(max(x+1) - x) # log transformation, for negatively skewed data
# y <- sqrt(x) # square-root transformation, for positively skewed data
# y <- sqrt(max(x+1) - x) # square-root transformation, for negatively skewed data
Power transformation is suitable for left-skewed (negatively skewed) data.
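A minimal sketch (the exponents are illustrative; higher powers correct stronger left skew):

# y <- x^2 # square transformation, for negatively skewed data
# y <- x^3 # cube transformation, for strongly negatively skewed data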
Reciprocal transformation is suitable for ratios (e.g. height/weight body ratio).
# y <- 1/x # for positively skewed data
# y <- 1/(max(x+1) - x) # for negatively skewed data
Take the doubs dataset as an example to illustrate how to do exploratory data analysis. First, we analyse the species distributions.
# fish species names
# names(spe)
# all species distribution
ab <- table(unlist(spe)) # wrap the whole line in parentheses to print the result
barplot(ab, las=1, # flips labels on y-axis into horizontal position
xlab="Abundance class", ylab="Frequency", col=grey(5:0/5))
# individual species distribution
# library(ggplot2)
# ggplot(spe, aes(x = Cogo)) + geom_histogram()
# get the data
# cogo <- table(spe$Cogo)
# barplot(cogo, las=1, xlab="Abundance class", ylab="Frequency", col=grey(5:0/5))
# We can see that an intermediate number of sites contains the highest number of species.
# spe.pres <- colSums(spe > 0) # the number of sites where each species is present
# hist(spe.pres, main="Species occurrence", las=1, xlab="Frequency of occurrences",
#      breaks=seq(0,30, by=5), col="grey")
# Calculate the number of species that occur at each site
# site.pres <- rowSums(spe > 0) # number of species with a value greater than 0 in that site row
# hist(site.pres, main="Species richness", las=1, xlab="Number of species",
#      ylab="Frequency of sites", breaks=seq(0,30, by=5), col="grey")
Standardization changes the data using a statistic calculated from the data itself, e.g. the mean, range, or sum of values (it is data-dependent). The most common reason to apply standardization is to remove differences in the relative weights (importance) of individual variables or samples.
Centring: the standardised variable has a mean equal to zero.
z-scores: the standardised variable has a mean equal to zero and a standard deviation equal to one.
# create a standardization (z-score) function
# standardize <- function(x) {
#   z <- (x - mean(x)) / sd(x)
#   return(z)
# }
# apply the function to columns 2 and 3 of the dataset
# dataframe[2:3] <- apply(dataframe[2:3], 2, standardize)
# display the result
# dataframe
Sometimes species/community data may also need to be standardized or transformed. The decostand() function in vegan provides standardization and transformation options for community composition data.
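For example, a minimal sketch using the Hellinger transformation (this creates the spe.hel object used in the clustering and ordination sections below):

library(vegan)
spe.hel <- decostand(spe, method = "hellinger") # Hellinger transformation of the abundance data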
# names(env)
# dim(env)
# str(env)
# head(env)
# summary(env)
pairs(env, main="Bivariate Plots of the Environmental Data")
In this case, the environmental data (explanatory variables) are all in different units and need to be standardized prior to computing distance measures to perform ordination analyses. Standardize the environmental data (11 variables) using the function decostand() in vegan.
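A sketch of this step; method = "standardize" in decostand() centres each variable and scales it to unit variance, producing the env.z object used in the ordinations below:

env.z <- decostand(env, method = "standardize") # z-scores: mean = 0, sd = 1 for each variable
# round(apply(env.z, 2, mean), 1) # check: the means are now ~0
# apply(env.z, 2, sd) # check: the standard deviations are now 1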
Prior to starting multivariate analyses, you have matrices with ecological data (such as DoubsEnv or DoubsSpe), and you use them to create association matrices between objects or among descriptors. Exploring the possible association measures can help you understand which distance measure to use within ordination methods.
We can use the vegdist() function to compute dissimilarity indices in order to quantify differences in community composition. These can then be visualized as a matrix if desired.
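A minimal sketch, assuming Bray-Curtis dissimilarity computed on the transposed species matrix, so that species (rather than sites) are compared:

spe.db <- as.matrix(vegdist(t(spe), method = "bray")) # dissimilarity among species
class(spe.db)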
[1] "matrix" "array"
In the spe.db matrix, the numbers represent the distance (dissimilarity) between species; for the first 3 species in DoubsSpe it looks like this:
Cogo Satr Phph
Cogo 0.0000000 0.6000000 0.6842105
Satr 0.6000000 0.0000000 0.1428571
Phph 0.6842105 0.1428571 0.0000000
You can see that when comparing a species to itself (e.g. Cogo to Cogo), the distance is 0, because a species is identical to itself. You can create graphical depictions of these association matrices using the coldiss() function, a helper that relies on the gclus package.
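A sketch, assuming the coldiss.R helper script (distributed with Numerical Ecology with R) is available in the working directory:

# library(gclus)
# source("coldiss.R") # helper function built on gclus, not part of the package itself
# coldiss(vegdist(spe), byrank = FALSE, diag = TRUE) # heat map of the dissimilarity matrix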
Let’s look at associations among sites based on their environmental variables (also known as Q mode):
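A minimal sketch, assuming the standardized env.z from above; Euclidean distance among sites is a common choice here:

# env.de <- vegdist(env.z, method = "euclidean") # distances among sites (Q mode)
# as.matrix(env.de)[1:3, 1:3]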
We can then look at the dependence between environmental variables (also known as R mode):
# (env.pearson <- cor(env)) # Pearson r linear correlation
# round(env.pearson, 2) # round the coefficients to 2 decimal places
# (env.ken <- cor(env, method="kendall")) # Kendall tau rank correlation
# round(env.ken, 2)
The Pearson correlation measures the linear correlation between two variables. The Kendall tau is a rank correlation which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
In some cases, there may be mixed types of environmental variables (e.g. quantitative and categorical). Q mode can still be used to find associations between sites described by such variables. We’ll do this by first creating an example dataframe:
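A hypothetical example (the variable names and values are invented for illustration); the Gower coefficient, available via daisy() in the cluster package, handles mixed data types:

# library(cluster)
# var.g1 <- rnorm(30, 0, 1) # a quantitative variable
# var.g2 <- runif(30, 0, 5) # another quantitative variable
# var.g3 <- gl(3, 10) # a categorical factor with 3 levels
# mix <- data.frame(var.g1, var.g2, var.g3)
# mix.gower <- daisy(mix, metric = "gower") # Gower dissimilarity among the 30 rows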
One application of association matrices is clustering. It is not a statistical method per se, because it does not test a hypothesis, but it highlights structures in the data by partitioning the objects or the descriptors. As a result, similar objects are combined into groups. One goal of ecologists could be to divide a set of sites into groups with respect to their environmental conditions or their community composition.
There are several families of clustering methods. Let’s compare the single and complete linkage clustering methods using the Doubs fish species data.
spe.dhel <- vegdist(spe.hel, method="euclidean") # generates the distance matrix from the Hellinger-transformed data
head(spe.dhel) # Hellinger distances among sites
[1] 0.8420247 0.9391305 1.0616631 1.2308244 1.1153793 0.9391305
# Perform single linkage clustering
spe.dhel.single <- hclust(spe.dhel, method="single")
plot(spe.dhel.single)
# Perform complete linkage clustering
spe.dhel.complete <- hclust(spe.dhel, method="complete")
plot(spe.dhel.complete)
A third option is Ward’s minimum variance clustering. In order to compare its dendrogram to the single and complete linkage clustering results, one must calculate the square roots of the fusion levels (distances).
# Perform Ward minimum variance clustering
spe.dhel.ward <- hclust(spe.dhel, method="ward.D2")
plot(spe.dhel.ward)
# Re-plot the dendrogram using the square roots of the fusion levels
spe.dhel.ward$height <- sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward)
plot(spe.dhel.ward, hang=-1) # hang=-1 aligns all objects on the same line
One must be careful in the choice of an association measure and clustering method in order to correctly address a problem. What are you most interested in: gradients? Contrasts? In addition, the results should be interpreted with respect to the properties of the method used. If more than one method seems suitable to an ecological question, computing them all and comparing the results is the way to go. As a reminder, clustering is not a statistical method, but further steps can be taken to identify interpretable clusters, or to compute clustering statistics.
The aim of ordination is to reduce the multidimensional information stored in community data into a few imaginable, interpretable and printable dimensions. We use it either to describe community patterns (usually the purpose of unconstrained = indirect ordination) or to explain changes in species composition by some (e.g. environmental, spatial, temporal) variables (constrained = direct ordination).
Ordination methods can be divided according to two criteria: whether their algorithm includes environmental variables alongside the species composition data (unconstrained ordination methods do not, while constrained ones do), and what type of species composition data is used for the analysis (raw data in a sample × species matrix, data pre-transformed using e.g. the Hellinger transformation, or a distance matrix, i.e. a symmetric sample × sample matrix of distances between samples).
|               | Raw data-based (linear) | Raw data-based (unimodal) | Transformation-based | Distance-based |
|---------------|-------------------------|---------------------------|----------------------|----------------|
| Unconstrained | PCA                     | CA & DCA                  | tb-PCA               | PCoA, NMDS     |
| Constrained   | RDA                     | CCA                       | tb-RDA               | db-RDA         |
The schemas below show the three alternative approaches you can use for the ordination of community ecology data, for either unconstrained or constrained ordination. You can decide to analyze data by either a) PCA/CA (depending on whether community composition data are homogeneous or heterogeneous), b) transformation-based PCA (first pre-transforming species composition data via Hellinger standardization and then applying PCA; this works whether the community composition data are homogeneous or heterogeneous), or c) distance-based PCoA or NMDS. It often does not make much sense to combine these approaches, however.
Unconstrained ordination allows us to organize samples, sites or species along continuous gradients (e.g. ecological or environmental). The key difference between unconstrained and constrained ordination is that in the unconstrained techniques we are not attempting to define a relationship between independent and dependent sets of variables.
Unconstrained ordination can be used to assess relationships within a single set of variables (rather than between sets) and to find the key components of variation among samples, sites or species.
Principal component analysis (PCA) is used to generate a few key variables from a larger set of variables that still represent as much of the variation in the dataset as possible. PCA is powerful for analyzing quantitative descriptors (such as species abundances), but it cannot be applied to binary data (such as species absence/presence). PCA preserves Euclidean distances and detects linear relationships. As a consequence, raw species abundance data are subjected to a pre-transformation (i.e. a Hellinger transformation) before computing a PCA.
To do a PCA you need a set of quantitative variables measured on the same samples, ideally with more samples than variables.
The “spe” data includes 27 fish taxa. To simplify the 27 fish taxa into a smaller number of fish-related variables, or to identify where different sites or samples are associated with particular fish taxa, we can run a PCA. Run a PCA on the Hellinger-transformed species data:
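A minimal sketch, assuming spe.hel from the transformation step above; in vegan, calling rda() with a single data matrix performs an unconstrained PCA:

spe.h.pca <- rda(spe.hel) # PCA of the Hellinger-transformed species data
# summary(spe.h.pca) # prints eigenvalues, species scores and site scores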
An eigenvalue is the factor by which the corresponding eigenvector is stretched; in a PCA, it is the amount of variation explained by each axis. From the summary, you can see how much of the variance in the data is explained by the unconstrained axes. In this case, the total variance of the sites explained by the species is 0.5. The summary also tells you what proportion of the total explained variance is accounted for by each principal component: the first axis of the PCA explains 51.33% of the variation, and the second axis 12.78%.
Sometimes you may want to extract the scores (i.e. the coordinates within a PCA biplot) for either the “sites” (the rows in your dataset, whether they be actual sites or not) or the “species” (the variables in your data, whether they be actual species or some other variables). This is useful if you want to use a principal component as a variable in another analysis, or to make additional graphics. For example, with the “spe” dataset, you might want to obtain a single variable that is a composite of all the fish abundance data and then use that variable in a regression with another variable, or plot it across a spatial gradient. To extract scores from a PCA, use the scores() function:
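A sketch, assuming the spe.h.pca object from above; display selects site or species scores and choices selects the axes:

# site.scores <- scores(spe.h.pca, display = "sites", choices = c(1, 2)) # site scores on PC1 and PC2
# spec.scores <- scores(spe.h.pca, display = "species", choices = c(1, 2)) # species scores on PC1 and PC2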
The PCA on the “spe” fish data produces as many principal components as there are fish taxa (columns), which in this case means that 27 principal components are produced. In many cases though, you may have done a PCA to reduce the number of variables to deal with and to produce composite variables for the fish. In this case, you are likely interested in knowing how many of these principal components actually add new information to the PCA (i.e. how many principal components you need to retain before the additional components no longer explain meaningful variance). To determine this, you can use the Kaiser-Guttman criterion and produce a barplot showing the point at which the principal components no longer explain a significant amount of variance. The code below draws such a barplot and lists the components whose eigenvalues exceed the average of all the eigenvalues:
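A sketch, assuming spe.h.pca from above; an rda object stores its eigenvalues in $CA$eig, and the Kaiser-Guttman criterion retains the axes with above-average eigenvalues:

ev <- spe.h.pca$CA$eig # eigenvalues of the PCA
barplot(ev, main = "Eigenvalues", col = "grey", las = 2)
abline(h = mean(ev), col = "red") # the average eigenvalue
ev[ev > mean(ev)] # axes with above-average eigenvalues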
PC1 PC2 PC3 PC4 PC5
0.25796049 0.06424089 0.04632294 0.03850244 0.02196526
From this barplot, you can see that once you reach PC6, the proportion of variance explained falls below the average proportion explained by the other components. If you take another look at the PCA summary, you will notice that by the time you reach PC5, the cumulative proportion of variance explained by the principal components is 85%.
A PCA is not just for species data. It can also be run and interpreted in the same way using standardized environmental variables:
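A sketch, assuming env.z from the standardization step above; the summary output follows:

env.pca <- rda(env.z) # PCA of the standardized environmental data
summary(env.pca)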
Call:
rda(X = env.z)
Partitioning of variance:
Inertia Proportion
Total 11 1
Unconstrained 11 1
Eigenvalues, and their contribution to the variance
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Eigenvalue 6.446 2.2252 0.99825 0.39831 0.36224 0.25361
Proportion Explained 0.586 0.2023 0.09075 0.03621 0.03293 0.02306
Cumulative Proportion 0.586 0.7883 0.87903 0.91524 0.94817 0.97123
PC7 PC8 PC9 PC10 PC11
Eigenvalue 0.16106 0.11062 0.022949 0.017476 0.004378
Proportion Explained 0.01464 0.01006 0.002086 0.001589 0.000398
Cumulative Proportion 0.98587 0.99593 0.998013 0.999602 1.000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
* General scaling constant of scores: 4.189264
Species scores
PC1 PC2 PC3 PC4 PC5 PC6
dfs 1.10786 0.4901 -0.20230 0.045871 0.187500 -0.20053
alt -1.06604 -0.5607 0.15135 0.193929 -0.096582 -0.05541
slo -0.95746 -0.5295 -0.25933 -0.352365 -0.066943 -0.39476
flo 0.98732 0.6262 -0.23731 0.009137 0.105599 -0.31209
pH -0.02679 0.4991 1.14319 -0.014864 0.005398 -0.17594
har 0.91493 0.5514 -0.09538 -0.128923 -0.639664 0.07145
pho 1.02238 -0.6481 0.20016 -0.178923 -0.040772 -0.04615
nit 1.14057 -0.1628 0.05081 -0.283125 0.272940 0.20434
amm 0.96776 -0.7382 0.18740 -0.198493 0.021140 0.04733
oxy -0.99282 0.4993 0.04744 -0.543295 0.012373 0.06953
bdo 0.96051 -0.7274 0.09630 0.089149 -0.178489 -0.14521
Site scores (weighted sums of species scores)
PC1 PC2 PC3 PC4 PC5 PC6
1 -1.35309 -1.0614 -0.626184 -1.14841 1.0494 -1.821e+00
2 -1.05499 -0.7841 0.195705 0.90757 1.7294 2.655e-01
3 -0.97457 -0.4890 1.340010 0.61152 0.8592 -7.288e-01
4 -0.90281 -0.3118 0.000372 0.17712 -0.2052 5.288e-01
5 -0.45666 -0.6973 0.550276 1.16964 -1.2500 1.273e-01
6 -0.81070 -0.7590 -0.318284 0.77249 0.2537 1.263e-01
7 -0.85277 -0.1858 0.231151 -0.37159 -1.3564 -3.616e-01
9 -0.27926 -0.4610 0.061754 1.60909 -1.2127 8.870e-01
10 -0.59145 -0.5554 -1.595293 -0.35281 -0.6397 -2.559e-01
11 -0.34078 0.3167 0.005014 -1.23777 -0.7883 -2.345e-01
12 -0.44165 0.3209 -0.697214 -0.46903 -0.3928 5.963e-01
13 -0.39855 0.6314 0.003511 -0.90415 -1.0778 5.002e-05
14 -0.22649 0.7350 0.946755 -0.87823 -0.8730 3.814e-02
15 -0.21927 1.0432 2.269502 -0.19576 0.1024 -2.360e-01
16 -0.16778 0.2507 -0.340084 -0.54136 0.1054 6.195e-01
17 0.14914 0.3628 -0.171537 -0.14337 0.1372 1.435e+00
18 0.08633 0.3674 -0.238531 -0.44014 0.2901 1.015e+00
19 0.10967 0.4821 0.232435 -0.28363 0.7316 9.414e-01
20 0.18575 0.3732 -0.268777 -0.69333 0.9729 1.131e+00
21 0.16766 0.3112 -0.834583 0.21603 0.7147 5.100e-01
22 0.13041 0.4842 -0.108135 0.18812 0.2895 -5.716e-01
23 1.28519 -1.3164 0.715652 -0.57204 -0.6499 -1.021e+00
24 1.01542 -0.4735 0.019519 1.43289 -0.5981 1.440e-01
25 2.10059 -2.1406 0.361157 -1.21146 0.1786 4.424e-01
26 0.89379 -0.1213 -0.671652 0.86581 0.3046 -8.871e-02
27 0.61092 0.3178 -0.139402 0.31511 0.8178 -9.684e-01
28 0.82353 0.8569 0.802337 -0.04239 0.8747 5.103e-02
29 0.67793 1.0652 -1.729365 0.28387 -0.2944 -1.028e+00
30 0.83450 1.4380 0.003890 0.93621 -0.0730 -1.543e+00
Scaling refers to what portion of the PCA is scaled to the eigenvalues. Scaling = 2 means that the species scores are scaled by eigenvalues, whereas scaling = 1 means that site scores are scaled by eigenvalues. Scaling = 3 means that both species and site scores are scaled symmetrically by square root of eigenvalues. Using scaling = 1 means that the Euclidean distances among objects (e.g. the rows in your data) are preserved, whereas scaling = 2 means that the correlations among descriptors (e.g. the columns in this data) are preserved. This means that when you look at a biplot of a PCA that has been run with scaling=2, the angle between descriptors represents correlation.
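The Kaiser-Guttman check can be applied here as well (a sketch, assuming env.pca from above); the axes with above-average eigenvalues are:

ev.env <- env.pca$CA$eig # eigenvalues of the environmental PCA
ev.env[ev.env > mean(ev.env)]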
PC1 PC2
6.445919 2.225186
As you saw in the explanation of the summary output, a lot of information can be extracted from a PCA before even plotting it. A PCA figure is the best way to convey major patterns. A PCA biplot includes the x-axis as the first Principal Component and the y-axis as the second Principal Component. A basic biplot without any customization could be plotted like this, where the site positions are shown by black numbers and species’ positions are shown by red species codes. Remember, species positions come from plotting species along PCs and the site positions are derived from the weighted sums of species scores.
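A minimal sketch, assuming spe.h.pca from above; the default plot method in vegan draws this basic biplot:

plot(spe.h.pca) # sites as black numbers, species as red codes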
What conclusions can you draw from this plot? You can see that only a few sites lie farther away from the majority. The species are labelled in red, and you can see, for example, that the species “ABL” is not found, or not found at the same prevalence, in the majority of sites, unlike the species closer to the centre of the ordination.
Now let’s look at a plot of the environmental PCA:
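A sketch, assuming env.pca from above; biplot() additionally draws the descriptors as arrows:

plot(env.pca)
# biplot(env.pca, scaling = 2) # alternative view with arrows for the descriptors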
Unconstrained ordination organizes objects (e.g. sites) characterized by descriptors (e.g. species) in full-dimensional space. In other words, PCA, CA and PCoA compute many ordination axes representing the variation of species among sites, and they preserve the distances among objects (Euclidean distances in PCA, chi-square distances in CA and other distance measures in PCoA). Users can then select the axes of interest to represent the objects in an ordination plot. The resulting biplot represents the distances among objects well (e.g. the between-site similarity), but no small set of axes can represent all the dimensions of variation of the ordination space.
In some cases, the priority is not to preserve the exact distances among sites, but rather to represent the relationships among objects as accurately as possible in a small number of axes (generally two or three). In such cases, nonmetric multidimensional scaling (NMDS) is the solution. An NMDS biplot is better at representing similarity between objects: dissimilar objects are far apart in the ordination space and similar objects are close to one another. NMDS also allows users to choose the distance measure used to calculate the ordination.
To find the best representation of the objects, NMDS applies an iterative procedure that positions the objects in the requested number of dimensions so as to minimize a stress function (scaled from 0 to 1). Consequently, the lower the stress value, the better the representation of objects in the ordination space. An additional way to assess the appropriateness of an NMDS is to construct a Shepard diagram, which plots the distances among objects in the ordination against the original distances. The \(R^2\) obtained from the regression between these two sets of distances measures the goodness-of-fit of the NMDS ordination.
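A minimal sketch, assuming Bray-Curtis distance and two dimensions; metaMDS() in vegan repeats the search from random starting configurations and prints the trace shown below:

spe.nmds <- metaMDS(spe, distance = "bray", k = 2) # NMDS of the fish community data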
Run 0 stress 0.0747782
Run 1 stress 0.1152149
Run 2 stress 0.07478399
... Procrustes: rmse 0.003625267 max resid 0.0143458
Run 3 stress 0.08843919
Run 4 stress 0.07477801
... New best solution
... Procrustes: rmse 0.0002980715 max resid 0.001431492
... Similar to previous best
Run 5 stress 0.1196716
Run 6 stress 0.1124391
Run 7 stress 0.1141839
Run 8 stress 0.09157455
Run 9 stress 0.08801543
Run 10 stress 0.08841667
Run 11 stress 0.08695588
Run 12 stress 0.07376241
... New best solution
... Procrustes: rmse 0.01940148 max resid 0.09468274
Run 13 stress 0.122613
Run 14 stress 0.07477812
Run 15 stress 0.08917037
Run 16 stress 0.1223245
Run 17 stress 0.07383674
... Procrustes: rmse 0.003819841 max resid 0.01488639
Run 18 stress 0.1204797
Run 19 stress 0.08901468
Run 20 stress 0.08927164
*** Best solution was not repeated -- monoMDS stopping criteria:
1: no. of iterations >= maxit
19: stress ratio > sratmax
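The stress of the best solution can then be extracted from the result object (assuming spe.nmds from the sketch above):

spe.nmds$stress # final stress value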
[1] 0.07376241
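A Shepard diagram and its fit statistics can be drawn with stressplot() (again assuming spe.nmds from above):

stressplot(spe.nmds) # ordination distances vs observed dissimilarities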
The Shepard plot shows a strong correlation between the observed dissimilarities and the ordination distances (\(R^2 > 0.95\)), highlighting the high goodness-of-fit of the NMDS.