Data exploration is an important step in data science. It mainly uses traditional statistical analysis to interpret data patterns, find stories and gain insights.
This section is adapted from the module, the tutorial and the book.
A community dataset records the presence/absence or quantity (count, cover or biomass) of each species in each sample (plot, site, etc.). Three matrices are used to organize these data: the matrix of species composition (L matrix, sample × species), the matrix of sample attributes such as environmental variables (R matrix, sample × sample attributes), and the matrix of species attributes, like species traits or species indicator values (Q matrix, species × species attributes).
After the data have been imported into R, it is useful to explore them first: check for missing values and outliers, check the range and type of the species and environmental data, apply transformations or standardizations if necessary, check for possible correlations between environmental variables, etc.
Missing data are elements in the matrix with no value, in R usually represented by NA (not available). Note that there is an important difference between 0 and NA: a species that was not recorded gets zero cover or abundance, whereas a value that could not be measured is missing. For example, if a pH meter broke and did not measure pH in some samples, you should not replace those values by 0, since that would imply the pH of those samples was extremely low rather than unknown.
As for samples with missing values, if there are lots of missing values scattered across different variables, the analysis will be based on rather few complete samples. One way to reduce this effect is to remove the variables with the highest proportion of missing values from the analysis. Another option is to replace the missing values by estimates if these can be reasonably accurate (mostly by interpolation, e.g. from similar plots, neighbours, values measured at the same time somewhere close, or values predicted by a model).
is.na() works on individual values, vectors, lists, and data frames; it returns TRUE wherever a value is NA and FALSE otherwise.
[1] "example" "data" "set"
[1] 3
You can use sum() and which() to figure out how many NAs there are and where they are located, and finally remove them, like this:
df <- data.frame(x = c(1, 2, 3, NA)) # a hypothetical data frame with an NA in row 4
which(is.na(df$x)) # locate the NA
[1] 4
# na.omit(df) # remove the rows with NAs
As for samples consisting entirely of 0 values (e.g. sites where no species was recorded), you should delete them if they affect results. Take the doubs dataset as an example. This data set gives environmental variables, fish species abundances and spatial coordinates for 30 sites. doubs is a list with 4 components.
data(doubs, package = "ade4") # load data from ade4 package
# class(doubs) # find the object types
# names(doubs) # check names of the list
Then extract the species, environmental and spatial data from the list:
# Species and environment data from doubs
species <- doubs$fish
# head(species)
# which(is.na(species)) # check missing values in a data set
# rowSums(species) == 0 # check the site without any species
spe <- species[-8,] # the site 8 without species is removed
environment <- doubs$env
env <- environment[-8,] # remove corresponding abiotic data for the site 8
Outliers are values within a given variable that are conspicuously different from the other values. An outlier can become quite influential in the analysis, so it is worth treating it in advance. First, spend a reasonable amount of time ensuring that such a value is not a typing error. If a sample really describes conditions that are rather different from the rest of the data set, it may be reasonable to remove it, since there may not be enough replication to describe this difference or phenomenon.
There are a number of ways to detect outliers. A simple exploratory data analysis (EDA) can reveal them graphically, e.g. using a box plot or a histogram. In a box plot, an outlier is defined as a value more than 1.5 times the interquartile range above the upper quartile (Q3) or below the lower quartile (Q1); the interquartile range is the range between the upper and lower quartiles: IQR = Q3 - Q1.
Visual approaches such as histograms, scatter plots (such as Q-Q plots), and box plots are the easiest methods to detect outliers. Let’s take a univariate dataset as an example and identify its outliers using visual approaches.
# x <- c(5, 8, 8, 12, 14, 15, 16, 19, 20, 22, 24, 25, 25, 26, 30, 48)
# boxplot(x) # show the outliers
You can also use boxplot() to hide the outliers in the plot, like this:
# boxplot(x, outline=FALSE) # suppress the outliers when drawing the plot
The interquartile range (IQR) is the distance between the 25th percentile (first quartile, Q1) and the 75th percentile (third quartile, Q3) of the data: IQR = Q3 - Q1. It can be used to detect outliers in three steps:
First, calculate Q1 and Q3 with summary():
# get values of Q1, Q3, and IQR
# summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 5.00 13.50 19.50 19.81 25.00 48.00
Then get the IQR with IQR() and calculate the thresholds:
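A minimal sketch of this step, using the quartiles shown in the summary() output above (Tmin and Tmax are the lower and upper fences referenced in the code below):

# iqr <- IQR(x) # 25.00 - 13.50 = 11.5
# Tmin <- quantile(x, 0.25) - 1.5 * iqr # lower fence: -3.75
# Tmax <- quantile(x, 0.75) + 1.5 * iqr # upper fence: 42.25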
Finally, detect the outliers and remove them:
# find outlier
# x[which(x < Tmin | x > Tmax)]
# [1] 48
# remove outlier
# x[which(x > Tmin & x < Tmax)]
Homework assignment: detect the outliers of the doubs dataset following the above procedure.
Transforming data is needed because statistical analyses and tests require that the residuals are normally distributed and have homogeneous variance, or because linear relationships are easier to interpret. A good indicator of whether data need to be transformed is plotting the values in a histogram and checking whether the distribution is symmetrical, right-skewed or left-skewed. Ecological data are often right-skewed because they are bounded by zero on the left. Several common transformations are as follows:
# y <- log10(x) # log transformation, for positively skewed data
# y <- log10(max(x+1) - x) # log transformation, for negatively skewed data
# y <- sqrt(x) # square-root transformation, for positively skewed data
# y <- sqrt(max(x+1) - x) # square-root transformation, for negatively skewed data
Power transformation is suitable for left-skewed (negatively skewed) data.
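A minimal sketch (the exponents are illustrative; higher powers correct stronger left skew):

# y <- x^2 # square transformation, for negatively skewed data
# y <- x^3 # cube transformation, for strongly negatively skewed data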
Reciprocal transformation is suitable for ratios (e.g. height/weight body ratio).
# y <- 1/x # for positively skewed data
# y <- 1/(max(x+1) - x) # for negatively skewed data
Take the doubs dataset as an example to illustrate how to do exploratory data analysis. First, we analyse the species distributions.
# fish species names
# names(spe)
# all species distribution
ab <- table(unlist(spe)) # wrap the whole line in parentheses to print the result
barplot(ab, las=1, # flips labels on y-axis into horizontal position
xlab="Abundance class", ylab="Frequency", col=grey(5:0/5))
# individual species distribution
# library(ggplot2)
# ggplot(spe, aes(x = Cogo)) + geom_histogram()
# get the data
# cogo <- table(spe$Cogo)
# barplot(cogo, las=1, xlab="Abundance class", ylab="Frequency", col=grey(5:0/5))
# We can see that an intermediate number of sites contains the highest number of species.
# spe.pres <- colSums(spe > 0) # the number of sites where each species is present
# hist(spe.pres, main="Species occurrence", las=1, xlab="Frequency of occurrences",
#      breaks=seq(0,30, by=5), col="grey")
# Calculate the number of species that occur at each site
# site.pres <- rowSums(spe > 0) # number of species with a value greater than 0 in that site row
# hist(site.pres, main="Species richness", las=1, xlab="Number of species",
#      ylab="Frequency of sites", breaks=seq(0,30, by=5), col="grey")
Standardization changes the data using a statistic calculated from the data itself, e.g. the mean, range, or sum of values (it is data-dependent). The most common reason to apply standardization is to remove differences in the relative weights (importance) of individual variables or samples.
Centring: the standardised variable has a mean equal to zero.
z-scores: the standardised variable has a mean equal to zero and a standard deviation equal to one.
# create a standardization (z-score) function
# standardize <- function(x) {
#   z <- (x - mean(x)) / sd(x)
#   return(z)
# }
# apply the function to columns 2 and 3 of the dataset
# dataframe[2:3] <- apply(dataframe[2:3], 2, standardize)
# display the result
# dataframe
Sometimes species/community data may also need to be standardized or transformed. The decostand() function in vegan provides standardization and transformation options for community composition data.
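For example, a minimal sketch using the Hellinger transformation (this creates the spe.hel object used in the clustering and ordination sections below):

library(vegan)
spe.hel <- decostand(spe, method = "hellinger") # Hellinger transformation of the abundance data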
# names(env)
# dim(env)
# str(env)
# head(env)
# summary(env)
pairs(env, main="Bivariate Plots of the Environmental Data")
In this case, the environmental data (explanatory variables) are all in different units and need to be standardized prior to computing distance measures to perform ordination analyses. Standardize the environmental data (11 variables) using the function decostand() in vegan.
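A sketch of this step; method = "standardize" in decostand() centres each variable and scales it to unit variance, producing the env.z object used in the ordinations below:

env.z <- decostand(env, method = "standardize") # z-scores: mean = 0, sd = 1 for each variable
# round(apply(env.z, 2, mean), 1) # check: the means are now ~0
# apply(env.z, 2, sd) # check: the standard deviations are now 1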
Prior to starting multivariate analyses, you have matrices with ecological data (such as DoubsEnv or DoubsSpe), and you use them to create association matrices between objects or among descriptors. Exploring the possible association measures can help you understand which distance measure to use within ordination methods.
We can use the vegdist() function to compute dissimilarity indices in order to quantify differences in community composition. These can then be visualized as a matrix if desired.
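A minimal sketch, assuming Bray-Curtis dissimilarity computed on the transposed species matrix, so that species (rather than sites) are compared:

spe.db <- as.matrix(vegdist(t(spe), method = "bray")) # dissimilarity among species
class(spe.db)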
[1] "matrix" "array"
In the spe.db matrix, the numbers represent the distance (dissimilarity) between species; for the first 3 species in DoubsSpe it looks like this:
Cogo Satr Phph
Cogo 0.0000000 0.6000000 0.6842105
Satr 0.6000000 0.0000000 0.1428571
Phph 0.6842105 0.1428571 0.0000000
You can see that when comparing a species to itself (e.g. Cogo to Cogo), the distance is 0, because a species is identical to itself. You can create graphical depictions of these association matrices using the coldiss() function, a helper that relies on the gclus package.
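A sketch, assuming the coldiss.R helper script (distributed with Numerical Ecology with R) is available in the working directory:

# library(gclus)
# source("coldiss.R") # helper function built on gclus, not part of the package itself
# coldiss(vegdist(spe), byrank = FALSE, diag = TRUE) # heat map of the dissimilarity matrix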
Let’s look at associations among sites based on their environmental variables (also known as Q mode):
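A minimal sketch, assuming the standardized env.z from above; Euclidean distance among sites is a common choice here:

# env.de <- vegdist(env.z, method = "euclidean") # distances among sites (Q mode)
# as.matrix(env.de)[1:3, 1:3]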
We can then look at the dependence between environmental variables (also known as R mode):
# (env.pearson <- cor(env)) # Pearson r linear correlation
# round(env.pearson, 2) # round the coefficients to 2 decimal places
# (env.ken <- cor(env, method="kendall")) # Kendall tau rank correlation
# round(env.ken, 2)
The Pearson correlation measures the linear correlation between two variables. The Kendall tau is a rank correlation which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
In some cases, there may be mixed types of environmental variables (e.g. quantitative and categorical). Q mode can still be used to find associations between sites described by such variables. We’ll do this by first creating an example dataframe:
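A hypothetical example (the variable names and values are invented for illustration); the Gower coefficient, available via daisy() in the cluster package, handles mixed data types:

# library(cluster)
# var.g1 <- rnorm(30, 0, 1) # a quantitative variable
# var.g2 <- runif(30, 0, 5) # another quantitative variable
# var.g3 <- gl(3, 10) # a categorical factor with 3 levels
# mix <- data.frame(var.g1, var.g2, var.g3)
# mix.gower <- daisy(mix, metric = "gower") # Gower dissimilarity among the 30 rows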
One application of association matrices is clustering. It is not a statistical method per se, because it does not test a hypothesis, but it highlights structures in the data by partitioning the objects or the descriptors. As a result, similar objects are combined into groups. One goal of ecologists could be to divide a set of sites into groups with respect to their environmental conditions or their community composition.
There are several families of clustering methods. Let’s compare the single and complete linkage clustering methods using the Doubs fish species data.
spe.dhel <- vegdist(spe.hel, method="euclidean") # generates the distance matrix from the Hellinger-transformed data
head(spe.dhel) # Hellinger distances among sites
[1] 0.8420247 0.9391305 1.0616631 1.2308244 1.1153793 0.9391305
# Perform single linkage clustering
spe.dhel.single <- hclust(spe.dhel, method="single")
plot(spe.dhel.single)
# Perform complete linkage clustering
spe.dhel.complete <- hclust(spe.dhel, method="complete")
plot(spe.dhel.complete)
A third option is Ward’s minimum variance clustering. In order to compare its dendrogram to the single and complete linkage clustering results, one must calculate the square roots of the fusion levels (distances).
# Perform Ward minimum variance clustering
spe.dhel.ward <- hclust(spe.dhel, method="ward.D2")
plot(spe.dhel.ward)
# Re-plot the dendrogram using the square roots of the fusion levels
spe.dhel.ward$height <- sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward)
plot(spe.dhel.ward, hang=-1) # hang=-1 aligns all objects on the same line
One must be careful in the choice of an association measure and clustering method in order to correctly address a problem. What are you most interested in: gradients? Contrasts? In addition, the results should be interpreted with respect to the properties of the method used. If more than one method seems suitable to an ecological question, computing them all and comparing the results is the way to go. As a reminder, clustering is not a statistical method, but further steps can be taken to identify interpretable clusters, or to compute clustering statistics.
The aim of ordination is to reduce the multidimensional information stored in community data into a few imaginable, interpretable and printable dimensions. We use it either to describe community patterns (usually the purpose of unconstrained = indirect ordination) or to explain changes in species composition by some (e.g. environmental, spatial, temporal) variables (constrained = direct ordination).
Ordination methods can be divided according to two criteria: whether their algorithm includes environmental variables alongside the species composition data (unconstrained ordination methods do not, while constrained ones do), and what type of species composition data is used for the analysis (raw data in a sample × species matrix, data pre-transformed using e.g. the Hellinger transformation, or a distance matrix, i.e. a symmetric sample × sample matrix of distances between samples).
|               | Raw data-based (linear) | Raw data-based (unimodal) | Transformation-based | Distance-based |
|---------------|-------------------------|---------------------------|----------------------|----------------|
| Unconstrained | PCA                     | CA & DCA                  | tb-PCA               | PCoA, NMDS     |
| Constrained   | RDA                     | CCA                       | tb-RDA               | db-RDA         |
The schemas below show the three alternative approaches you can use for the ordination of community ecology data, for either unconstrained or constrained ordination. You can decide to analyze data by either a) PCA/CA (depending on whether community composition data are homogeneous or heterogeneous), b) transformation-based PCA (first pre-transforming species composition data via Hellinger standardization and then applying PCA; this works whether the community composition data are homogeneous or heterogeneous), or c) distance-based PCoA or NMDS. It often does not make much sense to combine these approaches, however.
Unconstrained ordination allows us to organize samples, sites or species along continuous gradients (e.g. ecological or environmental). The key difference between unconstrained and constrained ordination is that in the unconstrained techniques we are not attempting to define a relationship between independent and dependent sets of variables.
Unconstrained ordination can be used to assess relationships within a single set of variables (rather than between sets) and to find the key components of variation among samples, sites or species.
Principal component analysis (PCA) is used to generate a few key variables from a larger set of variables that still represent as much of the variation in the dataset as possible. PCA is powerful for analyzing quantitative descriptors (such as species abundances), but it cannot be applied to binary data (such as species absence/presence). PCA preserves Euclidean distances and detects linear relationships. As a consequence, raw species abundance data are subjected to a pre-transformation (i.e. a Hellinger transformation) before computing a PCA.
To do a PCA you need a set of quantitative variables measured on the same samples, ideally with more samples than variables.
The “spe” data includes 27 fish taxa. To simplify the 27 fish taxa into a smaller number of fish-related variables, or to identify where different sites or samples are associated with particular fish taxa, we can run a PCA. Run a PCA on the Hellinger-transformed species data:
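A minimal sketch, assuming spe.hel from the transformation step above; in vegan, calling rda() with a single data matrix performs an unconstrained PCA:

spe.h.pca <- rda(spe.hel) # PCA of the Hellinger-transformed species data
# summary(spe.h.pca) # prints eigenvalues, species scores and site scores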
An eigenvalue is the factor by which the corresponding eigenvector is stretched; in a PCA, it is the amount of variation explained by each axis. From the summary, you can see how much of the variance in the data is explained by the unconstrained axes. In this case, the total variance of the sites explained by the species is 0.5. The summary also tells you what proportion of the total explained variance is accounted for by each principal component: the first axis of the PCA explains 51.33% of the variation, and the second axis 12.78%.
Sometimes you may want to extract the scores (i.e. the coordinates within a PCA biplot) for either the “sites” (the rows in your dataset, whether they be actual sites or not) or the “species” (the variables in your data, whether they be actual species or some other variables). This is useful if you want to use a principal component as a variable in another analysis, or to make additional graphics. For example, with the “spe” dataset, you might want to obtain a single variable that is a composite of all the fish abundance data and then use that variable in a regression with another variable, or plot it across a spatial gradient. To extract scores from a PCA, use the scores() function:
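A sketch, assuming the spe.h.pca object from above; display selects site or species scores and choices selects the axes:

# site.scores <- scores(spe.h.pca, display = "sites", choices = c(1, 2)) # site scores on PC1 and PC2
# spec.scores <- scores(spe.h.pca, display = "species", choices = c(1, 2)) # species scores on PC1 and PC2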
The PCA on the “spe” fish data produces as many principal components as there are fish taxa (columns), which in this case means that 27 principal components are produced. In many cases though, you may have done a PCA to reduce the number of variables to deal with and to produce composite variables for the fish. In this case, you are likely interested in knowing how many of these principal components actually add new information to the PCA (i.e. how many principal components you need to retain before the additional components no longer explain meaningful variance). To determine this, you can use the Kaiser-Guttman criterion and produce a barplot showing the point at which the principal components no longer explain a significant amount of variance. The code below draws such a barplot and lists the components whose eigenvalues exceed the average of all the eigenvalues:
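A sketch, assuming spe.h.pca from above; an rda object stores its eigenvalues in $CA$eig, and the Kaiser-Guttman criterion retains the axes with above-average eigenvalues:

ev <- spe.h.pca$CA$eig # eigenvalues of the PCA
barplot(ev, main = "Eigenvalues", col = "grey", las = 2)
abline(h = mean(ev), col = "red") # the average eigenvalue
ev[ev > mean(ev)] # axes with above-average eigenvalues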
PC1 PC2 PC3 PC4 PC5
0.25796049 0.06424089 0.04632294 0.03850244 0.02196526
From this barplot, you can see that once you reach PC6, the proportion of variance explained falls below the average proportion explained by the other components. If you take another look at the PCA summary, you will notice that by the time you reach PC5, the cumulative proportion of variance explained by the principal components is 85%.
A PCA is not just for species data. It can also be run and interpreted in the same way using standardized environmental variables:
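A sketch, assuming env.z from the standardization step above; the summary output follows:

env.pca <- rda(env.z) # PCA of the standardized environmental data
summary(env.pca)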
Call:
rda(X = env.z)
Partitioning of variance:
Inertia Proportion
Total 11 1
Unconstrained 11 1
Eigenvalues, and their contribution to the variance
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Eigenvalue 6.446 2.2252 0.99825 0.39831 0.36224 0.25361
Proportion Explained 0.586 0.2023 0.09075 0.03621 0.03293 0.02306
Cumulative Proportion 0.586 0.7883 0.87903 0.91524 0.94817 0.97123
PC7 PC8 PC9 PC10 PC11
Eigenvalue 0.16106 0.11062 0.022949 0.017476 0.004378
Proportion Explained 0.01464 0.01006 0.002086 0.001589 0.000398
Cumulative Proportion 0.98587 0.99593 0.998013 0.999602 1.000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
* General scaling constant of scores: 4.189264
Species scores
PC1 PC2 PC3 PC4 PC5 PC6
dfs 1.10786 0.4901 -0.20230 0.045871 0.187500 -0.20053
alt -1.06604 -0.5607 0.15135 0.193929 -0.096582 -0.05541
slo -0.95746 -0.5295 -0.25933 -0.352365 -0.066943 -0.39476
flo 0.98732 0.6262 -0.23731 0.009137 0.105599 -0.31209
pH -0.02679 0.4991 1.14319 -0.014864 0.005398 -0.17594
har 0.91493 0.5514 -0.09538 -0.128923 -0.639664 0.07145
pho 1.02238 -0.6481 0.20016 -0.178923 -0.040772 -0.04615
nit 1.14057 -0.1628 0.05081 -0.283125 0.272940 0.20434
amm 0.96776 -0.7382 0.18740 -0.198493 0.021140 0.04733
oxy -0.99282 0.4993 0.04744 -0.543295 0.012373 0.06953
bdo 0.96051 -0.7274 0.09630 0.089149 -0.178489 -0.14521
Site scores (weighted sums of species scores)
PC1 PC2 PC3 PC4 PC5 PC6
1 -1.35309 -1.0614 -0.626184 -1.14841 1.0494 -1.821e+00
2 -1.05499 -0.7841 0.195705 0.90757 1.7294 2.655e-01
3 -0.97457 -0.4890 1.340010 0.61152 0.8592 -7.288e-01
4 -0.90281 -0.3118 0.000372 0.17712 -0.2052 5.288e-01
5 -0.45666 -0.6973 0.550276 1.16964 -1.2500 1.273e-01
6 -0.81070 -0.7590 -0.318284 0.77249 0.2537 1.263e-01
7 -0.85277 -0.1858 0.231151 -0.37159 -1.3564 -3.616e-01
9 -0.27926 -0.4610 0.061754 1.60909 -1.2127 8.870e-01
10 -0.59145 -0.5554 -1.595293 -0.35281 -0.6397 -2.559e-01
11 -0.34078 0.3167 0.005014 -1.23777 -0.7883 -2.345e-01
12 -0.44165 0.3209 -0.697214 -0.46903 -0.3928 5.963e-01
13 -0.39855 0.6314 0.003511 -0.90415 -1.0778 5.002e-05
14 -0.22649 0.7350 0.946755 -0.87823 -0.8730 3.814e-02
15 -0.21927 1.0432 2.269502 -0.19576 0.1024 -2.360e-01
16 -0.16778 0.2507 -0.340084 -0.54136 0.1054 6.195e-01
17 0.14914 0.3628 -0.171537 -0.14337 0.1372 1.435e+00
18 0.08633 0.3674 -0.238531 -0.44014 0.2901 1.015e+00
19 0.10967 0.4821 0.232435 -0.28363 0.7316 9.414e-01
20 0.18575 0.3732 -0.268777 -0.69333 0.9729 1.131e+00
21 0.16766 0.3112 -0.834583 0.21603 0.7147 5.100e-01
22 0.13041 0.4842 -0.108135 0.18812 0.2895 -5.716e-01
23 1.28519 -1.3164 0.715652 -0.57204 -0.6499 -1.021e+00
24 1.01542 -0.4735 0.019519 1.43289 -0.5981 1.440e-01
25 2.10059 -2.1406 0.361157 -1.21146 0.1786 4.424e-01
26 0.89379 -0.1213 -0.671652 0.86581 0.3046 -8.871e-02
27 0.61092 0.3178 -0.139402 0.31511 0.8178 -9.684e-01
28 0.82353 0.8569 0.802337 -0.04239 0.8747 5.103e-02
29 0.67793 1.0652 -1.729365 0.28387 -0.2944 -1.028e+00
30 0.83450 1.4380 0.003890 0.93621 -0.0730 -1.543e+00
Scaling refers to what portion of the PCA is scaled to the eigenvalues. Scaling = 2 means that the species scores are scaled by eigenvalues, whereas scaling = 1 means that site scores are scaled by eigenvalues. Scaling = 3 means that both species and site scores are scaled symmetrically by square root of eigenvalues. Using scaling = 1 means that the Euclidean distances among objects (e.g. the rows in your data) are preserved, whereas scaling = 2 means that the correlations among descriptors (e.g. the columns in this data) are preserved. This means that when you look at a biplot of a PCA that has been run with scaling=2, the angle between descriptors represents correlation.
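The Kaiser-Guttman check can be applied here as well (a sketch, assuming env.pca from above); the axes with above-average eigenvalues are:

ev.env <- env.pca$CA$eig # eigenvalues of the environmental PCA
ev.env[ev.env > mean(ev.env)]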
PC1 PC2
6.445919 2.225186
As you saw in the explanation of the summary output, a lot of information can be extracted from a PCA before even plotting it. A PCA figure is the best way to convey major patterns. A PCA biplot includes the x-axis as the first Principal Component and the y-axis as the second Principal Component. A basic biplot without any customization could be plotted like this, where the site positions are shown by black numbers and species’ positions are shown by red species codes. Remember, species positions come from plotting species along PCs and the site positions are derived from the weighted sums of species scores.
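A minimal sketch, assuming spe.h.pca from above; the default plot method in vegan draws this basic biplot:

plot(spe.h.pca) # sites as black numbers, species as red codes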
What conclusions can you draw from this plot? You can see that only a few sites lie farther away from the majority. The species are labelled in red, and you can see, for example, that the species “ABL” is not found, or not found at the same prevalence, in the majority of sites, unlike the species closer to the centre of the ordination.
Now let’s look at a plot of the environmental PCA:
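A sketch, assuming env.pca from above; biplot() additionally draws the descriptors as arrows:

plot(env.pca)
# biplot(env.pca, scaling = 2) # alternative view with arrows for the descriptors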
Unconstrained ordination organizes objects (e.g. sites) characterized by descriptors (e.g. species) in full-dimensional space. In other words, PCA, CA and PCoA compute many ordination axes representing the variation of species among sites, and they preserve the distances among objects (Euclidean distances in PCA, chi-square distances in CA and other distance measures in PCoA). Users can then select the axes of interest to represent the objects in an ordination plot. The resulting biplot represents the distances among objects well (e.g. the between-site similarity), but no small set of axes can represent all the dimensions of variation of the ordination space.
In some cases, the priority is not to preserve the exact distances among sites, but rather to represent the relationships among objects as accurately as possible in a small number of axes (generally two or three). In such cases, nonmetric multidimensional scaling (NMDS) is the solution. An NMDS biplot is better at representing similarity between objects: dissimilar objects are far apart in the ordination space and similar objects are close to one another. NMDS also allows users to choose the distance measure used to calculate the ordination.
To find the best representation of the objects, NMDS applies an iterative procedure that positions the objects in the requested number of dimensions so as to minimize a stress function (scaled from 0 to 1). Consequently, the lower the stress value, the better the representation of objects in the ordination space. An additional way to assess the appropriateness of an NMDS is to construct a Shepard diagram, which plots the distances among objects in the ordination against the original distances. The \(R^2\) obtained from the regression between these two sets of distances measures the goodness-of-fit of the NMDS ordination.
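A minimal sketch, assuming Bray-Curtis distance and two dimensions; metaMDS() in vegan repeats the search from random starting configurations and prints the trace shown below:

spe.nmds <- metaMDS(spe, distance = "bray", k = 2) # NMDS of the fish community data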
Run 0 stress 0.0747782
Run 1 stress 0.1152149
Run 2 stress 0.07478399
... Procrustes: rmse 0.003625267 max resid 0.0143458
Run 3 stress 0.08843919
Run 4 stress 0.07477801
... New best solution
... Procrustes: rmse 0.0002980715 max resid 0.001431492
... Similar to previous best
Run 5 stress 0.1196716
Run 6 stress 0.1124391
Run 7 stress 0.1141839
Run 8 stress 0.09157455
Run 9 stress 0.08801543
Run 10 stress 0.08841667
Run 11 stress 0.08695588
Run 12 stress 0.07376241
... New best solution
... Procrustes: rmse 0.01940148 max resid 0.09468274
Run 13 stress 0.122613
Run 14 stress 0.07477812
Run 15 stress 0.08917037
Run 16 stress 0.1223245
Run 17 stress 0.07383674
... Procrustes: rmse 0.003819841 max resid 0.01488639
Run 18 stress 0.1204797
Run 19 stress 0.08901468
Run 20 stress 0.08927164
*** Best solution was not repeated -- monoMDS stopping criteria:
1: no. of iterations >= maxit
19: stress ratio > sratmax
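The stress of the best solution can then be extracted from the result object (assuming spe.nmds from the sketch above):

spe.nmds$stress # final stress value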
[1] 0.07376241
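A Shepard diagram and its fit statistics can be drawn with stressplot() (again assuming spe.nmds from above):

stressplot(spe.nmds) # ordination distances vs observed dissimilarities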
The Shepard plot shows a strong correlation between the observed dissimilarities and the ordination distances (\(R^2 > 0.95\)), highlighting the high goodness-of-fit of the NMDS.