---
title: "Multivariate Analysis of Modern Pollen Data for Servera-Vives et al. 2022"
author: "Grant Snitker - USFS Southern Research Station"
date: "17/03/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Please reference this work using the following citation:
  
|       Servera-Vives, Gabriel, Maurici Mus Amezquita, Grant Snitker, Assunta Florenzano, Paola Torri, Joan Estrany, and Anna Maria Mercuri.
|       2022. Modern pollen analogues for the understanding of pollen-vegetation relationships in a Mediterranean mosaic landscape (Balearic
|       Islands, Western Mediterranean). Manuscript submitted to *The Holocene*.

<br>

### Introduction
This RMarkdown demonstrates the data cleaning, transformations, and multivariate analysis used to  understand the relationship between the composition of the modern pollen data and the vegetation communities throughout eh Balearic Islands. When running this Rmarkdown or the R script in the "Multivariate_Analysis_Servera_Vives_et_al_2022" RStudio project, all path names are relative to the containing folder. Also note that the term "species" is used in this script to denote variables and not necessarily taxa represented in the pollen data. Finally, this analysis excludes rare pollen taxa following methods in López-Sáez et al. 2020, which includes pollen taxa with percentages ≥ 3% in at least two samples.

<br>

### 0.0 Preamble
```{r preamble, message = F, warning= F}
# clear workspace
rm(list = ls())
# set seed
set.seed(123)
# load libraries
library(devtools)
library(vegan)
library(RColorBrewer)
library(raster)
library(dplyr)
library(stats)
library(dendextend)
#devtools::install_github("ropenscilabs/ochRe") # install via devtools::install_github if using for the first time
library(ochRe)
library(ggplot2)
library(ggrepel)
library(indicspecies)
#devtools::install_github("gavinsimpson/ggvegan") # install via devtools::install_github if using for the first time
library(ggvegan)
library(tibble)
```

<br>

### 1.0 Load csv data and subset
```{r load}
mod.pol = read.csv("./data/modern_pollen_data_percent_reduced_1per_2samp.csv") # with the latest recombination of the pollen data
# note: these data are represented as percentages and have been subset to remove pollen clumps and 
# individual genera that are accounted for in summed genera (ex. Quercus represents the summed observations of Quercus ilex-t,	Quercus robur/pubescens-t, and Quercus sp).

mod.pol$Veg_Types = as.factor(mod.pol$Veg_Types)

mod.pol.data  = mod.pol[,-c(1:5)] # remove site name, code, and veg type. These will be added again later to examine similarities between veg types. 

mod.pol.data[is.na(mod.pol.data)] = 0 # replace all NAs with 0

mod.pol.sites = mod.pol[,c(1:5)] # preserve the veg type, site name, and code

mod.pol.sites$Veg_Types  = droplevels(mod.pol.sites$Veg_Types) # drop levels for future labeling

# fix column names
columns = colnames(mod.pol.data)
columns.fixed = gsub(".t", '-t', columns, fixed = T)
columns.fixed.2 = gsub(".", ' ', columns.fixed, fixed = T)
columns.fixed.3 = gsub("major media", 'major/media', columns.fixed.2, fixed = T)
colnames(mod.pol.data) = columns.fixed.3 
```
#### *1.1 Perform sqrt transformation on the pollen dataset for all future analyses*
```{r sqrt}
mod.pol.sqrt = sqrt(mod.pol.data)
```

<br>

### 2.0 Conduct Detrended Correspondence analysis to evaluate appropriate ordination technique after Lepš & Šmilauer (2003)
```{r DCA}
dca.sqrt = decorana(mod.pol.sqrt ) # Lepš & Šmilauer (2003) suggest performing all transformations prior to DCA. 
dca.sqrt 
# axis length for DCA1 < 3 (DCA1 = 2.0147), so linear ordination is preferred  

dca= decorana(mod.pol.data) # Lepš & Šmilauer (2003) suggest performing all transformations prior to DCA. 
dca
# axis length for DCA1 is between 3-4 (DCA1 = 3.7889), so linear ordination is ok  
```

<br>

### 3.0 Hierarchical clustering of each site to identify vegetation type clusters
```{r cluster colors}
# create cluster colors
colors  = ochre_pal(palette =  "parliament")(8) # define colors for plots
colors = c("red", colors)
```
#### *3.1 Cluster with sqrt transformation*
```{r cluster Sqrt, fig.align = "center", fig.width=10, fig.height=10}
rownames(mod.pol.sqrt) = mod.pol.sites$Site_Code_Veg# add now names to help interpret clusters

mod.pol.sqrt.dist <- vegdist(mod.pol.sqrt, method = "euclidean") # euclidean distance following Lopes- saez

mod.pol.sqrt.clust <- hclust(mod.pol.sqrt.dist, method = "ward.D2")
plot(mod.pol.sqrt.clust$height[length(mod.pol.sqrt.clust$height):(length(mod.pol.sqrt.clust$height)-10)], type = "b", xlab = "Clusters", ylab=  "Height") # use elbow method to determine how many clusters
abline(v = 3, lty = "dashed")

mod.pol.dend <- as.dendrogram(mod.pol.sqrt.clust)
plot(mod.pol.dend, ylim  = c(0,26), main = "Hierarchical clustering of Veg Types using Ward's Method", ylab = "Height")
mtext("With Sqrt Transformation", side=3)
rect.dendrogram(mod.pol.dend, k = 3)

mod.pol$Cluster_SqrtTrans =cutree(mod.pol.sqrt.clust, 3)
```

### 4.0 Exploratory Ordination (PCA)
#### *4.1 create initial unconstrained redundancy analysis (pca)*
```{r pca}
# 4.1 create initial unconstrained ordination (pca), but using a hellinger / Sqrt  transformation to accommodate for sparseness and zeros in species data for each sample site.
pca.sqrt = rda(mod.pol.sqrt)

```
#### *4.2 Evaluate the performance of the pca*
```{r pca results}
pca.sqrt # shows eigenvalues and inertia (variance)
```
#### *4.3 A more in depth look at eigenvalues and proportion of variance explained*
```{r pca more results}
summary(eigenvals(pca.sqrt)) # the proportion of variance explained in the first 2 axes is ~ 34%
```
#### *4.4 Visualize scree plot to evaluate which axes should be considered*
```{r pca scree, fig.align = "center"}
screeplot(pca.sqrt, bstick = T, type = "l", main  = "Scree plot of variance explained: Sqrt") # start with axes 1 and 2, but still not a lot of variance explained here
```


#### *4.5 Identify only the most influential species scores*
```{r species scores, fig.align = "center"}
# Sqrt
species.scores.sqrt = scores(pca.sqrt, display = "species") # extract species scores
species.scores.sqrt = cbind(species.scores.sqrt, rep(NA, length(species.scores.sqrt[,1]))) # create empty column to populate species importance
colnames(species.scores.sqrt)[3] = "importance"
for (i in 1:length(species.scores.sqrt[,1])){ # find the length of each species "arrow" using euclidean distance from plot origin to arrow tip
  arrow.dist = pointDistance(c(0, 0), species.scores.sqrt[i,c(1:2)], lonlat=FALSE)
  species.scores.sqrt[i,3] = arrow.dist
}
species.scores.sqrt.df = as.data.frame(species.scores.sqrt) # convert to dataframe for sorting and later operations
species.scores.sqrt.df.ordered  = species.scores.sqrt.df[order(-species.scores.sqrt.df$importance),] # reorder by importance

species.scores.sqrt.df.ordered

# create a scree plot of species importance
plot((species.scores.sqrt.df.ordered$importance), main = "Species Importance", xlab ="Number of Species", ylab = "Importance")
mtext("Blue line denotes the top 25% most influential species: Sqrt", side=3, col = "blue")
lines(species.scores.sqrt.df.ordered$importance)
abline(h =  quantile(species.scores.sqrt.df$importance, prob = .75), col = "blue")

# evaluate the top 25% most influential species
for (i in 1:length(species.scores.sqrt[,1])){
  if(species.scores.sqrt.df$importance[i] >= quantile(species.scores.sqrt.df$importance, prob = .75) ) { # adjust quantile here 
    species.scores.sqrt.df$include[i] = 1
  } else {species.scores.sqrt.df$include[i] = 2}}
```
#### *4.6 create biplots of pca results*
##### 4.6.1 PCs 1-3
```{r PCs , warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
scores(pca.sqrt, choices = 1:3, display = "species", scaling = "species")
```
##### 4.6.2 Species and Sites with PC1 PC2 (clusters Sqrt trans)
```{r PC1 PC2 biplots 1a, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# 4.6.2 Calculate cluster centroids for plotting PCs 1 and 2
pca.scores.sqrt = as.data.frame(scores(pca.sqrt, choices = c(1,2), display = c("sp", "sites"),
                                       scaling = "species")[2])

pca.scores.sqrt$Cluster_SqrtTrans = mod.pol$Cluster_SqrtTrans 
pca.scores.centroids.sqrt  = pca.scores.sqrt %>%  dplyr::group_by(Cluster_SqrtTrans) %>% 
  dplyr::summarise(
    PC1.mean = mean(sites.PC1),
    PC2.mean = mean(sites.PC2))


# Plot PCs 1 and 2
colors.cluster = brewer.pal(4, "Dark2") # define colors for plots
colors.species  = c("red", rgb(0, 0, 0, alpha = 0))
shapes = seq(1:9)

biplot(pca.sqrt, scaling = "species", type = c("point", "n"), main = "PCA of Modern Pollen Data and Vegetation Type: Sqrt Trans")
mtext("Only the top 25% most influential species (by score) are labeled - PCs 1 and 2", side=3)
with(mod.pol, points(pca.sqrt, display = "sites", col = colors.cluster[Cluster_SqrtTrans],
                     scaling = "species", pch = shapes[Veg_Types], cex =1.1))
with(pca.scores.centroids.sqrt, points(PC1.mean, PC2.mean,col = colors.cluster[Cluster_SqrtTrans], pch = 16, cex =1.6))
with(species.scores.sqrt.df, text(pca.sqrt, display = "species", col = colors.species[include],
                                  scaling = "species", cex = .8, offset  = .2))
with(mod.pol, legend("bottomright", legend = levels(as.factor(Cluster_SqrtTrans)), bty = "n",
                     col = colors.cluster, pch = 16, cex  = .9, title="Clusters: Sqrt Trans", title.adj =1))

#Sites only
biplot(pca.sqrt, scaling = "species", type = c("n", "n"), main = "PCA of Modern Pollen Data and Vegetation Type: Sqrt Trans")
mtext("Sites only - PCs 1 and 2", side=3)
with(mod.pol, points(pca.sqrt, display = "sites", col = colors.cluster[Cluster_SqrtTrans],
                     scaling = "species", pch = shapes[Veg_Types], cex =1.1))
with(mod.pol, text(pca.sqrt, display = "sites", col = colors.cluster[Cluster_SqrtTrans],
                   scaling = "species", pch = shapes[Veg_Types], cex =.7, pos = 3, offset = .5))
with(pca.scores.centroids.sqrt, points(PC1.mean, PC2.mean,col = colors.cluster[Cluster_SqrtTrans], pch = 16, cex =1.6))
with(mod.pol, legend("bottomright", legend = levels(as.factor(Cluster_SqrtTrans)), bty = "n",
                     col = colors.cluster, pch = 16, cex  = .9, title="Clusters: Sqrt Trans", title.adj =1))
```

<br>

### 5.0 Vegetation Indices (from Davis 1984; McGlone and Meurk 2000; and Fall 2012)
#### *5.1 Define index variables*
```{r Indices1, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# A = B0 / (P0 + P1 + B0) 
# B0 =  number of stands in which the pollen type is present in the surface sample and the associated plant taxon is present in vegetation sample of the stand
# P0 = number of stands in which the pollen type is present in the surface sample but the associated plant taxon is absent in the local vegetation
# P1 = number of stands in which the plant taxon is present in the vegetation but the pollen type is absent in the surface sample
# For percentage data, the variance of the sample is a function of the mean. Consequently, percentage data do not meet the assumption of normality for correlation analysis. The arcsine transformation:
#  Xa = arcsine * sqrt(X)
# where X is the untransformed percent and Xa is the transformed value is a standard procedure (Sokal and Rohlf, 1969) used to treat percentage data so they meet the assumptions for statistical tests. For these data the trans- formed and untransformed percentages give similar results for correlation analysis: the same types show significant relationships between pollen percentages and vegetation coverage.

# to make this work, all column names in the pollen dataset have to be the same as the species names identified in the field.
# see López-Sáez et al.2020 for pollen indices.
```
#### *5.2 Load observed species data*
```{r Indices2, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
species.reveles = read.csv("./data/vegetation_relevés_davis_fid_disp.csv")
species.present =  as.data.frame(unique(species.reveles$Equiv_Pollen_Types))
head(species.present)

# summarized by species then convert to 1/0 presence absence
species.reveles.summ = species.reveles %>%
  dplyr::group_by(Equiv_Pollen_Types) %>%
  dplyr::summarise_all(list(mean))

species.reveles.summ$height_log_value = round(species.reveles.summ$height_log_code)
species.reveles.summ$pollination_value = round(species.reveles.summ$pollination_code)
# replace values
species.reveles.summ$height_log_value[species.reveles.summ$height_log_value==1]<-"7-20 cm"
species.reveles.summ$height_log_value[species.reveles.summ$height_log_value==2]<-"25-60 cm"
species.reveles.summ$height_log_value[species.reveles.summ$height_log_value==3]<-"70-200 cm"
species.reveles.summ$height_log_value[species.reveles.summ$height_log_value==4]<- "300-3,000 cm"
species.reveles.summ$pollination_value[species.reveles.summ$pollination_value==0]<-"Entomophilic, anthers not exposed"
species.reveles.summ$pollination_value[species.reveles.summ$pollination_value==1]<-"Entomophilic, anthers exposed"
species.reveles.summ$pollination_value[species.reveles.summ$pollination_value==2]<-"Anemophilic"

species.reveles.info = species.reveles.summ[,1:6]
species.reveles.info = species.reveles.info[,-2]
colnames(species.reveles.info)[1] = "species"
species.reveles.data = species.reveles.summ[,-c(2:6)]

species.reveles.p.a = species.reveles.data %>% mutate_if(is.numeric, ~1 * (. > 0))

#load (all) pollen data with taxa names that match the reveles
pollen.taxa.indices = read.csv("./data/modern_pollen_data_percent_davis_fid_disp_index_values.csv")
# convert to Presence/absence
pollen.taxa.p.a = pollen.taxa.indices %>% mutate_if(is.numeric, ~1 * (. > 0))
#replace NA with 0
pollen.taxa.p.a[is.na(pollen.taxa.p.a )] = 0

# limit analysis to only the reveles we have pollen data for
reveles.analyzed = colnames(pollen.taxa.p.a) %>% .[-c(1)]
reveles.analyzed = c("Equiv_Pollen_Types", reveles.analyzed)
species.reveles.p.a  = species.reveles.p.a[,c(reveles.analyzed)]
```
#### *4.3 Construct Davis index and Fid/Disp dataframe*
```{r Indices3, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
davis.indices.df = data.frame('species' = character(), 'A' =numeric(), 'U' = numeric(), 'O' = numeric(), 'Fid' = numeric(), 'Dis' = numeric())
```
#### *5.4 Populate dataframe*
```{r Indices4, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
for (i in 1:length(pollen.taxa.p.a$Pollen_taxa)){
  species = pollen.taxa.p.a[i,1]
  plants.reveles.present = species.reveles.p.a[species.reveles.p.a$Equiv_Pollen_Types == species,] %>% .[, colSums(. != 0) > 0]
  plants.reveles.absent = species.reveles.p.a[species.reveles.p.a$Equiv_Pollen_Types == species,] %>% .[, colSums(. != 0) == 0]

  plants.present = colnames(plants.reveles.present) %>% .[-1]
  plants.absent = colnames(plants.reveles.absent)

  pollen.reveles.present = pollen.taxa.p.a[pollen.taxa.p.a$Pollen_taxa == species,] %>% .[, colSums(. != 0) > 0]
  pollen.present = colnames(pollen.reveles.present) %>% .[-1]

  pollen.reveles.absent = pollen.taxa.p.a[pollen.taxa.p.a$Pollen_taxa == species,] %>% .[, colSums(. != 0) == 0]
  pollen.absent = colnames(pollen.reveles.absent)


  # where b0 is the number of samples in which both the pollen type and its parent plant are present,
  # p0 is the number of samples in which the pollen type is present and the parent plant absent, and
  # p1 is the number of samples in which the parent plant is present and the pollen type absent.
  `%notin%` <- Negate(`%in%`)


  B0 = pollen.present %in% plants.present %>% sum(.)
  P0 = pollen.present %in% plants.absent %>% sum(.)
  P1 = plants.present %in% pollen.absent %>% sum(.)

  B0
  P0
  P1

  A = B0 / (P0 + P1 + B0)
  U = P1 / (P1 + B0)
  O = P0 / (P0 + B0)
  Fid = (B0 / length(plants.present)) * 100
  Dis = (P0 / length(plants.absent)) * 100

  A
  U
  O
  Fid
  Dis

  index_input = c(A,U,O,Fid,Dis)
  davis.indices.df[i,2:6] = index_input
  davis.indices.df[i,1] = species
}
```
#### *5.5 Clean up results and subset to top 25% percent species (as calculated in PCA)*
```{r Indices5, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# 5.5 clean up results and subset to top 25% percent species (as calculated in PCA)
davis.indices.all = davis.indices.df
davis.indices.all[davis.indices.all=='NaN']<-0

# Join reveles info to species
species.reveles.info = as.data.frame(species.reveles.info)
davis.indices.all = merge(x = davis.indices.all, y = species.reveles.info, by = "species" )
```
#### *5.6 Clusters fo all indices*
```{r Indices6, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# Cluster Davis and Dis/Fid indices
rownames(davis.indices.all) = davis.indices.all$species# add now names to help interpret clusters
davis.indices.all.dist <- vegdist(scale(davis.indices.all[,2:6]), method = "euclidean") # euclidean distance following Lopes- saez

davis.indices.all.clust <- hclust(davis.indices.all.dist, method = "ward.D2")
plot(davis.indices.all.clust$height[length(davis.indices.all.clust$height):(length(davis.indices.all.clust$height)-10)], type = "b", xlab = "Clusters", ylab=  "Height") # use elbow method to determine how many clusters
abline(v = 5, lty = "dashed")

davis.indices.dend <- as.dendrogram(davis.indices.all.clust)
plot(davis.indices.dend, main = "Hierarchical clustering of Davis Indices and Dis/Fid using Ward's Method\n scaled", ylab = "Height")
rect.dendrogram(davis.indices.dend, k = 5)

davis.indices.all$Cluster =cutree(davis.indices.all.clust, 5)

# subset to top 25% species
top.25.species = rownames(species.scores.sqrt.df[species.scores.sqrt.df$include ==1,])

davis.indices.df.top.25 = davis.indices.all[(davis.indices.all$species  %in% top.25.species), ]
davis.indices.df.top.25$include = 1

```
#### *5.7 Plot all indices using ggplot*
```{r Indices plots, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# create PCA for plot
davis.indices.rda = davis.indices.all[,2:4]
rownames(davis.indices.rda) = davis.indices.all$species
pca.davis = rda(davis.indices.rda)

# create color palette
colors.cluster.davis = brewer.pal(5, "Dark2") # define colors for plots
colors.species  = c(colors.cluster.davis, rgb(0, 0, 0, alpha = 0))

## ggplot version
PCAscores <- scores(pca.davis, display = "sites") %>% as.data.frame() %>% rownames_to_column("species") %>% full_join(davis.indices.all, by = "species")
PCAvect <- scores(pca.davis, display = "species") %>% as.data.frame()
PCAvect.A = PCAvect[1,] 
PCAvect.U = PCAvect[2,] 
PCAvect.O = PCAvect[3,] 

# plot PCA of -color
ggplot() +
  geom_point(data = PCAscores, aes(x = PC1, y = PC2, color = as.factor(Cluster)), shape = 16, size = 2) +
  scale_color_manual(values =c("#1B9E77" ,"#D95F02" ,"#7570B3" ,"#E7298A", "#66A61E"), breaks = c(1,2,3,4,5)) +
  scale_shape(solid = FALSE) +
  geom_text_repel(data = PCAscores,aes(x = PC1, y = PC2,label = species, segment.alpha = 0.5, segment.size = 0.3), max.overlaps = 30,box.padding = unit(0.45, "lines"), size = 2.8) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_segment(data = PCAvect, aes(x = 0, y = 0, xend = PC1, yend = PC2), alpha = 0.5, arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(data = PCAvect.A, aes(x = PC1, y = PC2, label = rownames(PCAvect.A)), nudge_x = -.055, nudge_y = 0.04, size =5) +
  geom_text(data = PCAvect.O, aes(x = PC1, y = PC2, label = rownames(PCAvect.O)), nudge_x = .07, nudge_y = -0.003, size =5) +
  geom_text(data = PCAvect.U, aes(x = PC1, y = PC2, label = rownames(PCAvect.U)), nudge_x = -.02, nudge_y = -0.04, size =5) +
  theme_bw() + theme(axis.title = element_text(size=13), axis.text = element_text(size = 11),  legend.position = c(.9,.115),legend.background = element_rect(fill="white",size=0.25, linetype="solid", colour ="black")) +
  labs(x = "\nPC1 (62.1%)",
       y = "\nPC2 (35.0%)", color = "Cluster groups") 


davis.indices.all$cluster_include = match(davis.indices.all$species, davis.indices.df.top.25$species)
for (i in 1:length(davis.indices.all$species)){
if (!is.na(davis.indices.all$cluster_include[i])){davis.indices.all$cluster_include[i] = davis.indices.all$Cluster[i]}}
davis.indices.all[is.na(davis.indices.all)] <- 6


# plot Fidelity / Dispersibility scatterplots
# reorder factors
davis.indices.all$height_log_value<- factor(davis.indices.all$height_log_value, levels=c("7-20 cm", "25-60 cm", "70-200 cm", "300-3,000 cm"))

ggplot(data = davis.indices.all, aes(x =Dis , y = Fid)) +
  geom_point(aes(col = pollination_value,  size = height_log_value,shape = Cluster)) + scale_shape_identity(guide = "legend") +
  geom_text_repel(aes(label = species, segment.alpha = 0.5, segment.size = 0.45), max.overlaps = 50,box.padding = unit(0.6, "lines"),point.padding = unit(0.6, "lines"), size = 3.0) +
  scale_fill_brewer(palette="Dark2")+
  labs(title = "", subtitle = "", y = "Fidelity (%)\n", x = "\nDispersibility (%)",
       col = "Pollination syndrome", size = "Taxa height") +
  theme_bw() + theme(axis.title = element_text(size = 13), axis.text = element_text(size = 11),
                     legend.background = element_rect(fill="white",size=0.25, linetype="solid", colour ="black"))

```

<br>

### 6.0 RDA and Environmental Variables
#### *6.1 Load, clean, and select appropriate environmental variables*
```{r RDA env cut, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# load and select environmental data
env.data.raw = read.csv("./data/environmental_variables.csv")
row.names(env.data.raw) = env.data.raw$Code
env.data = env.data.raw[reveles.analyzed[-1],]
env.data = env.data[,4:(length(env.data))]
env.data = as.data.frame(scale(env.data))

#conduct initial, holding nothing back, RDA
rda.initial = rda(mod.pol.sqrt ~., data  = env.data)

# Analyze variance inflation factor to eliminate correlated variables
vif.cca(rda.initial)

#Conduct initial cut by removing highly correlated variables (those with a variance inflation factor of approximately >10)
env.data.cut = subset(env.data, select=-c(Slope_degree,thermality_index_2008_2018,Percent_Trees,
                                          Percent_Shrubs, Percent_herbs, Percent_bryophytes,
                                          Percent_rocks, Percent_stones_gravels,N_totals_nitrophilous_taxa, 
                                          n_total_taxa, n_total_crops, Veg_type, direct_human_pressure))

# examine the result of cutting variables
rda.cut = rda(mod.pol.sqrt ~., data  = env.data.cut)
vif.cca(rda.cut) # proceed with these variables
```
#### *6.2 Run initial RDA without permutation selection of environmental variables*
```{r RDA no selection, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# Run second RDA
rda.2 = rda(mod.pol.sqrt ~., data  = env.data.cut)
stp = rda(mod.pol.sqrt ~ 1, data  = env.data.cut)

# Plots No Selection
# clusters
rda.no.selection.scores.sqrt = as.data.frame(scores(rda.2, choices = c(1,2), display = c("sp", "sites"),
                                                scaling = "species")[2])
rda.no.selection.scores.sqrt$Cluster_SqrtTrans = mod.pol$Cluster_SqrtTrans 
rda.no.selection.centroids.sqrt  = rda.no.selection.scores.sqrt %>%  dplyr::group_by(Cluster_SqrtTrans) %>% 
  dplyr::summarise(
    RDA1.mean = mean(sites.RDA1),
    RDA2.mean = mean(sites.RDA2))

# Sites
plot(rda.2, type="n", main = "RDA", ylim =c(-3.5,2.5), xlim = c(-3,3))
mtext("No Selection", side=3)
text(rda.2, dis="bp")
with(mod.pol, points(rda.2, display = "sites", col = colors.cluster[Cluster_SqrtTrans],
                     scaling = "species", pch = shapes[Veg_Types], cex =1.1))
with(mod.pol, text(rda.2, display = "sites", col = colors.cluster[Cluster_SqrtTrans],
                   scaling = "species", pch = shapes[Veg_Types], cex =.7, pos = 3, offset = .5))
with(rda.no.selection.centroids.sqrt, points(RDA1.mean, RDA2.mean,col = colors.cluster[Cluster_SqrtTrans], pch = 16, cex =1.6))
with(mod.pol, legend("bottomright", legend = levels(as.factor(Cluster_SqrtTrans)), bty = "n",
                     col = colors.cluster, pch = 16, cex  = .7, title="Clusters: Sqrt Trans", title.adj =1))
with(mod.pol, legend("bottomleft", legend = levels(Veg_Types), bty = "n", pch = shapes, cex  = .7, title="Veg Types", title.adj = 0.01))

# Species 
plot(rda.2,  display = c("sp", "cn"), main = "RDA", ylim =c(-1.5,1.5), xlim = c(-1.5, 1.5))
mtext("No Selection", side=3)
text(rda.2, dis="bp")

plot(rda.2,  display = c("sp", "cn"), main = "RDA", ylim =c(-.5,.5), xlim = c(-.5, .5))
mtext("No Selection - zoomed", side=3)
text(rda.2, dis="bp")


# Anova
anova(rda.2, by = "margin")
summary(rda.2)
```
#### *6.3 Run RDA with selection of most important variables using a permutation tests on environmental variables*
```{r RDA perm selection, warning=FALSE, fig.align = "center",fig.width=10, fig.height=10}
# Stepwise forward/backward method for variable selection
rda.selected.fb = step(rda.2, scope = list(lower = formula(stp), upper = formula(rda.2)), trace = 0, test = "perm")
colnames(env.data.cut)

# create clusters and centroids for plotting
rda.fb.scores.sqrt = as.data.frame(scores(rda.selected.fb, choices = c(1,2), display = c("sp", "sites"),
                                                    scaling = "sites")[2])
rda.fb.scores.sqrt$Cluster_SqrtTrans = mod.pol$Cluster_SqrtTrans 
rda.fb.centroids.sqrt  = rda.fb.scores.sqrt %>%  dplyr::group_by(Cluster_SqrtTrans) %>% 
  dplyr::summarise(
    RDA1.mean = mean(sites.RDA1),
    RDA2.mean = mean(sites.RDA2))

# Transform data for use with ggplot (for publication)
# env variables and taxa
RDAspecies <- scores(rda.selected.fb, display = "species", scaling = "species") %>% as.data.frame() %>% rownames_to_column("species") 
clusters = rda.fb.scores.sqrt %>% rownames_to_column("Site_Code_Veg") %>% full_join(mod.pol.sites, by = "Site_Code_Veg")
clusters = as.data.frame(clusters[,c(1,4)])
RDAsites <- scores(rda.selected.fb, display = "sites", scaling = "sites") %>% as.data.frame() %>% rownames_to_column("Site_Code_Veg") %>% full_join(mod.pol.sites, by = "Site_Code_Veg")  %>%  full_join(clusters, by = "Site_Code_Veg")

RDAenv <- scores(rda.selected.fb, display = "bp", scaling = "species") %>% as.data.frame()
RDAenv.herb = RDAenv[1,]
RDAenv.agro = RDAenv[2,]
RDAenv.tramp = RDAenv[3,]
RDAenv.landopen = RDAenv[4,]
RDAenv.soiltype = RDAenv[5,]
RDAenv.rainfall = RDAenv[6,]
RDAenv.temp = RDAenv[7,]
RDAenv.slope = RDAenv[8,]
RDAenv.salty = RDAenv[9,]
RDAenv.flooded = RDAenv[10,]
RDAenv.fire = RDAenv[11,]
RDAenv.distag = RDAenv[12,]
RDAenv.gHM = RDAenv[13,]

# Sites
ggplot() +
  geom_point(data = RDAsites, aes(x = RDA1, y = RDA2, color = as.factor(Cluster_SqrtTrans), shape = Veg_Types_Code ), size = 2.5) +
  geom_text_repel(data = RDAsites,aes(x = RDA1, y = RDA2,label = Site_Code, segment.alpha = 0.5, segment.size = 0.3), max.overlaps = 10,box.padding = unit(0.45, "lines"), size = 2.8) + 
  geom_point(data = rda.fb.centroids.sqrt, aes(x = RDA1.mean, y = RDA2.mean, color = as.factor(Cluster_SqrtTrans)), shape = 19, size = 3) +
  scale_shape_manual(values = c(0,1,2,3,4,5,6,7,8)) +
  scale_color_manual(values = c("#D58B60", "#548F00", "#4A7381")) +
  geom_text(data = RDAenv.herb, aes(x = RDA1, y = RDA2), label = "Herbivory pressure", nudge_x = -0.2, nudge_y = 0.0, size =4) +
  geom_text(data = RDAenv.agro, aes(x = RDA1, y = RDA2), label = "Agropastoral use", nudge_x = -0.09, nudge_y = 0.029, size =4) +
  geom_text(data = RDAenv.tramp, aes(x = RDA1, y = RDA2), label = "Trampling", nudge_x = 0.1, nudge_y = 0.04, size =4) +
  geom_text(data = RDAenv.landopen, aes(x = RDA1, y = RDA2), label = "Landscape openness", nudge_x = 0.2,  nudge_y = -0.04, size =4) +
  geom_text(data = RDAenv.soiltype, aes(x = RDA1, y = RDA2), label = "Soil type", nudge_x =0.09, nudge_y = -0.031, size =4) +
  geom_text(data = RDAenv.rainfall, aes(x = RDA1, y = RDA2), label = "Mean decadal rainfall", nudge_x =-0.01, nudge_y = 0.08, size =4) +
  geom_text(data = RDAenv.temp, aes(x = RDA1, y = RDA2), label = "Mean decadal temperature", nudge_x =0.005, nudge_y = -0.032, size =4) +
  geom_text(data = RDAenv.slope, aes(x = RDA1, y = RDA2), label = "Slope", nudge_x =0.02, nudge_y = 0.03, size =4) +
  geom_text(data = RDAenv.salty, aes(x = RDA1, y = RDA2), label = "Saline soil", nudge_x =-0.05, nudge_y = 0.05, size =4) +
  geom_text(data = RDAenv.flooded, aes(x = RDA1, y = RDA2), label = "Flooded Soil", nudge_x =0.13, nudge_y = 0.028, size =4) +
  geom_text(data = RDAenv.fire, aes(x = RDA1, y = RDA2), label = "Fire Disturbance", nudge_x =-0.19, nudge_y = 0.021, size =4) +
  geom_text(data = RDAenv.distag, aes(x = RDA1, y = RDA2), label = "Distance to agriculture", nudge_x =-0.23, nudge_y = 0.07, size =4) +
  geom_text(data = RDAenv.gHM, aes(x = RDA1, y = RDA2), label = "gHM", nudge_x =0.023, nudge_y = -0.023, size =4) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_segment(data = RDAenv, aes(x = 0, y = 0, xend = RDA1, yend = RDA2), alpha = 0.5, arrow = arrow(length = unit(0.2, "cm"))) +
  theme_bw() + theme(axis.title = element_text(size=13), axis.text = element_text(size = 11),  legend.position = c(.895,.25),legend.background = element_rect(fill="white",size=0.25, linetype="solid", colour ="black")) +
  labs(x = "\nRDA1 (12.6%)",
       y = "RDA2 (10.0%)\n", color = "Cluster Centroid", shape = "Vegetation Type") + 
  lims(x = c(-0.7,1.2), y = c(-.9,.9))

 # Species
ggplot() +
  geom_point(data = RDAspecies, aes(x = RDA1, y = RDA2), color = 'grey50', shape = 1, size = 2.2) +
  geom_text_repel(data = RDAspecies,aes(x = RDA1, y = RDA2,label = species, segment.alpha = 0.5, segment.size = 0.3), max.overlaps = 10,box.padding = unit(0.45, "lines"), size = 2.8) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_segment(data = RDAenv, aes(x = 0, y = 0, xend = RDA1, yend = RDA2), alpha = 0.5, arrow = arrow(length = unit(0.2, "cm"))) +
  theme_bw() + theme(axis.title = element_text(size=13), axis.text = element_text(size = 11)) +
  labs(x = "\nRDA1 (12.6%)",
       y = "RDA2 (10.0%)\n") + 
  lims(x = c(-1.362,1.25), y = c(-.89,1.37))

# Inset of species plot (to show obscured data points)
ggplot() +
  geom_point(data = RDAspecies, aes(x = RDA1, y = RDA2), color = 'grey50', shape = 1, size = 3) +
  geom_text_repel(data = RDAspecies,aes(x = RDA1, y = RDA2,label = species, segment.alpha = 0.5, segment.size = 0.3), max.overlaps = 10,box.padding = unit(0.45, "lines"), size = 2.5) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  theme_bw()  + theme(axis.title = element_text(size=13), axis.text = element_text(size = 11), plot.margin=unit(c(.6,.6,.2,.2),"cm")) +
  labs(x = "",y = "") + lims(x = c(-.1,.25), y = c(-.2,.2))

```

