Investigating Blue Carbon Ecosystems with Beta Regression

Research Question

Large amounts of carbon is naturally stored in the world’s oceans and in the soil of coastal wetlands. This is referred to as blue carbon.

Blue carbon is unique due to its ability to store or sequester carbon at a much faster rate than terrestrial ecosystems like forests(National Oceanic and Atmospheric Administration (NOAA) (2024)). This carbon is stored in the soils of these ecosystems – marsh, sea grass, mangrove, and salt marshes. These habitats are of growing interest as greenhouse gases continue to accumulate in the atmosphere. Understanding the biogeochemical processes that make up carbon sequestration is important for advancing blue carbon management as a solution to climate change. (Survey (2024)).

Graphic of blue carbon ecosystems and carbon sequestration. Source: USGS

For this project, I will fit a Beta Regression model to answer the following question:

How does soil organic matter affect carbon in coastal ecosystems?

Soil organic matter consists of organic material deposited in the soil from organisms at various stages of decomposition. It is a key indicator of soil carbon in an ecosystem because soil carbon represents the portion of that organic matter made up of carbon. In other words, soil carbon is directly derived from, and proportional to, the amount of soil organic matter present.

Data Background

This data is from the Smithsonian Environmental Research Center’s Coastal Carbon Network (Center (n.d.)). The CCN collects and standardizes carbon stock and flux data from coastal ecosystems worldwide. They provide access to data sets for researchers and students like me. The dataset itself is originally over 15,000,000 observations and includes data on soil measurements such as depth of soil cores, soil bulk density, organic matter, carbon, and data on habitat and associated vegetation.

Data Wrangling

In this section I filtered the data for only variables of interest (fraction_carbon, fraction_organic_matter, max_depth, habitat) as well as ensured that fraction_carbon only contained fractional data between 0 and 1. For this dataset, I also filtered to the three most researched habitats for blue carbon: mangrove, seagrass, and marsh.

Code

# Import 'CCN_sites', 'CCN_species', 'CCN_cores', 'CCN_depth_series'

sites <- read_csv(here("posts/eds222-blog/CCN_data/CCN_sites.csv"))
species <- read_csv(here("posts/eds222-blog/CCN_data/CCN_species.csv"))
cores <- read_csv(here("posts/eds222-blog/CCN_data/CCN_cores.csv"))
depth_series <- read_csv(here("posts/eds222-blog/CCN_data/CCN_depthseries.csv"))

Code

set.seed(123)

# .....Further filter datasets.....

# Species data 
species <- species %>% 
  select('site_id', 'habitat') %>% 
  filter(habitat %in% c("mangrove", "seagrass", "marsh")) %>% 
  drop_na()

# Cores data 
cores <- cores %>% 
  select('site_id', 'max_depth') %>% 
  drop_na()

# Depth series data 
depth_series <- depth_series %>% 
  select('site_id','fraction_organic_matter', 'fraction_carbon') %>% 
  drop_na()

#.....Join datasets.....

blue_carbon <- species %>% 
  full_join(depth_series, by = c('site_id')) %>% 
  full_join(cores, by =  c('site_id')) 

#.....Random sample....
blue_carbon <- sample_n(blue_carbon, size = 100000)


#.....Remove Negative values from Fraction Carbon.....

# Ensure `fraction_carbon` ranges between 0 and 1 for analysis.
range(blue_carbon$fraction_carbon)

sum(blue_carbon$fraction_carbon < 0, na.rm = TRUE) # 17 negative values. 

# Replace negative numbers with NAs
blue_carbon$fraction_carbon[blue_carbon$fraction_carbon < 0] <- NA

# Check range to ensure negative values were 
range(blue_carbon$fraction_carbon)

# Remove Nas from entire dataset
blue_carbon <- na.omit(blue_carbon)

#..... Ensure R keeps data within bounds....
blue_carbon$fraction_carbon[blue_carbon$fraction_carbon == 1] <- .999 
blue_carbon$fraction_carbon[blue_carbon$fraction_carbon == 0] <- .0001

# Clear up environment
rm(list = c("cores", "depth_series", "sites", "species"))

Data Exploration

In order to understand what analysis needs to be done, some preliminary data exploration is useful. I am particularly interested in the spread of data points for fraction_carbon.

Code

frac_car_plot <- blue_carbon %>% 
  ggplot(aes(fraction_carbon)) +
  geom_histogram(fill = "#9fd0d6") +
  labs(x = "Fraction carbon (dimensionless)", 
       y = "Frequency", 
       title = "Distribution of Fraction Carbon") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) # Center title 

frac_car_plot

As you can see in Fig 1., fraction carbon is left skewed signifying that it does not follow a normal distribution. This is due to fraction carbon being bounded between 0 and 1 because it is a fraction.

Code

# Define palette
my_palette <- c("#9FD0D6", "#F2A176", "#985D73")

som_car_habitat_plt <- blue_carbon %>% 
  ggplot(aes(x = fraction_organic_matter, y = fraction_carbon, color = habitat)) +
  geom_point(alpha = 0.2) +
  geom_smooth(aes(group = habitat), method = "lm", se = FALSE) + # Add best fit lines 
  scale_color_manual(values = my_palette) +
  theme_classic() +
  labs(x = "Soil Organic Matter (dimensionless)", 
       y = "Fraction carbon (dimensionless)", 
       title = "Relationship between soil organic matter and fraction carbon")

som_car_habitat_plt

Fig 2. Relationship between soil organic matter and fraction carbon

As you can see in Fig 2., soil organic matter follows a generally linear pattern with fraction carbon. There is some notable variation between habitats, especially between seagrass compared to marsh and mangrove, but the overall shape of the data is similar across most habitats. Despite these differences, all three habitats show broadly comparable patterns in their relationship between soil organic matter and fraction carbon based on this graph.

Statistical Model: Beta Regression

Since my response variable, fraction_carbon is a fraction between 0 and 1, and is not normally distributed around the mean, I will use Beta regression (Cribari–Neto and Zeileis (2010)). Beta regression models the mean of a proportion or fraction as a function of predictors. My predictors in this study are the following:

fraction_organic_matter: Mass of organic matter relative to sample dry mass (dimensionless)
max_ depth: Maximum depth of a sampling increment (centimeters)

Why include max_depth in the model? Deeper soils might have more organic matter accumulation and fraction carbon can change with depth (e.g., shallower layers of soil may have higher carbon than deeper layers).

Why exclude habitat from the model? I am interested in the overall relationship between soil organic matter and fraction carbon, regardless of habitat. If I were to represent the causal relationship of habitat and fraction carbon I would represent it as habitat → fraction_organic_matter → fraction_carbon. Because habitat influences organic matter, it inhibits the total effect of organic matter on fraction carbon, therefore excluding it allows me to model the full effect of the predictor on the response.

I represented these causal relationships with a directed acylcic graph (DAG):

Hypothesis

Question: How does soil organic matter affect carbon?

H₀ (null hypothesis): Soil organic matter has no effect on fraction carbon in the soil.
H₁ (alternative hypothesis): Higher soil organic matter is associated with higher fraction carbon.

Model Statistical Notation

\[ \begin{align} \text{carbon} &\sim \text{Beta}(\mu, \phi) \\ logit(\mu) &= \beta_0 + \beta_1 \, \text{organicmatter}_i + \beta_2 \, \text{maxdepth}_i \end{align} \]

carbon is the fraction of carbon, constrained between 0 and 1.
\(\mu_i\) is the expected value of carbon.
\(\phi\) is the precision parameter of the beta distribution. This controls the variance.
coefficients: \(\beta_0\) is the intercept; \(\beta_1\) and \(\beta_2\) are coefficients for the predictors fraction organic matter and max depth

Fit Model to Simulated Data

In order to determine if a Beta regression model is fit for this data, I will simulate data and perform an analysis on the simulated data. This step is also helpful to see how well you understand the model you plan to use. This includes the following steps:

define sample size
choose parameters for \(\beta\)
define range for predictors
define linear equation and logit function
generate response
fit model to simulated data
see if parameters were recovered

Code

# set seed for reproducibility
set.seed(123)

# 1. Define sample size
n <- 10000 # Large enough sample size 

# 2. Choose parameters 
beta_0 <- 0.5 # Intercept (fraction carbon)
beta_1 <- 0.8  # Effect of fraction organic matter
beta_2 <- -0.01 # Effect of max_depth
phi <- 0.9 # Precision parameter 

# 3. Define predictor ranges (x's)
# Runif(n=sample size, min, max)
fraction_organic_matter <- runif(n, 0.05, 0.9)
max_depth <- runif(n, 0, 30)

# 3. Linear predictor and Logit response
  # Set logit_mu 
logit_mu <- beta_0 + beta_1 * fraction_organic_matter + beta_2 * max_depth 
  # Set transformation 
  # Inverse logit 
mu <- 1 / (1 + exp(-logit_mu)) # Mu = mean of beta distribution

# 4. Generate response with rbeta()
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
fraction_carbon <- rbeta(n, shape1, shape2)

# R computer error: ensure that fraction carbon stays between 0 and 1
fraction_carbon[fraction_carbon == 1] <- .999 
fraction_carbon[fraction_carbon == 0] <- .0001


# 5. Create dataframe for simulated data
sim_blue_carbon <- data.frame(fraction_carbon, fraction_organic_matter, max_depth)

# 6. Fit model 
betareg_sim_mdl <- glmmTMB(fraction_carbon ~ fraction_organic_matter + max_depth, data = sim_blue_carbon, family = beta_family())

  # Summarize model
summary(betareg_sim_mdl)

Table 1. Model Assessment
Coefficient	Simulation Estimate	Actual Parameter
Intercept	0.4789468	0.50
Fraction Organic Matter	0.8298071	0.80
Max Depth	-0.0097642	-0.01

The model seems to recover parameters well. This is good evidence that beta regression is the model for this data.

Fit Model To Real Data

Now, I will fit a beta regression model to real life data from the Smithsonian’s Coastal Carbon Network. I will use the glmmTMB package and specify the family as beta_family for a beta regression.

# .....Fit beta regression model..... 

betareg_mdl <- glmmTMB(fraction_carbon ~ fraction_organic_matter + max_depth, 
                      data = blue_carbon, family = beta_family())

Code

# .....Summarize model.....
betareg_mdl_sum <- summary(betareg_mdl)

# .....Extract coefficients..... 

# Beta_0 
beta_0 <- betareg_mdl_sum$coefficients$cond["(Intercept)", "Estimate"] 

# Beta_1 
beta_1 <- betareg_mdl_sum$coefficients$cond["fraction_organic_matter", "Estimate"]

# Beta_2 
beta_2 <- betareg_mdl_sum$coefficients$cond["max_depth", "Estimate"]

# Std. Errors 
error_beta_0 <- betareg_mdl_sum$coefficients$cond["(Intercept)", "Std. Error"] 
error_beta_1 <- betareg_mdl_sum$coefficients$cond["fraction_organic_matter", "Std. Error"] 
error_beta_2 <- betareg_mdl_sum$coefficients$cond["max_depth", "Std. Error"] 

# P-values 
p_beta_0 <- betareg_mdl_sum$coefficients$cond["(Intercept)", "Pr(>|z|)"] 
p_beta_1 <- betareg_mdl_sum$coefficients$cond["fraction_organic_matter", "Pr(>|z|)"] 
p_beta_2 <- betareg_mdl_sum$coefficients$cond["max_depth", "Pr(>|z|)"]

Table 2. Beta Regression Output
Coefficient	Estimate	Standard Error	P-Value
Intercept	-2.2069219	0.0023720	< 2e-16
Fraction Organic Matter	1.7919272	0.0084771	< 2e-16
Max Depth	-0.0000365	0.0000266	1.701443e-01

Model Interpretation

The beta regression analysis shows a low baseline of fraction carbon in coastal carbon ecosystems when there is no soil organic matter. Soil organic shows a strong, significant positive effect on fraction carbon. As for max depth, there is a very small negative effect on fraction carbon, however it is not significant enough in this model to definitively say deeper soils have less carbon. All of these effects are transformed in logit space.

Effect of fraction organic matter on fraction carbon → significant
Effect of max depth on fraction carbon → not significant

Reject null hypothesis that soil organic matter has no effect on fraction carbon in the soil.

Generate Predictions

In this section I will generate predictions to visualize the predicted effect of soil organic matter on fraction carbon. This step is important to visualize the effect of the predictor on the response and interpret their relationship with predicted data. This includes the following steps:

Create prediction grid/data frame where fraction_organic_matter is allowed to vary based on a specified sequence while max_depth is held constant (with its mean).
Use my Beta regression model fitted to real-life data (betareg_mdl) to generate predictions for fraction_carbon. This produces predicted beta-regression means. Ensure this dataframe is in the response space so that it returns values between 0 and 1.

Why am I holding max_depth constant? I want to show how fraction_carbon changes with one predictor (fraction_organic_matter). This isolates the effect of organic matter on carbon. Additionally, from the results of my beta regression model, max depth did not have a significant effect on fraction carbon, therefore averaging the value will not impact the predictive power of the model.

Code

# .....Create prediction grid.....
pred_grid <- expand_grid(fraction_organic_matter = seq(from = 0, to = 1, by = 0.001)) %>% 
  mutate(max_depth = mean(blue_carbon$max_depth, na.rm = TRUE)) # Hold max depth constant 

# ......Generate predictions.....
betareg_grid <- pred_grid %>% # Pull from pred_grid
  mutate(predictions = predict(object = betareg_mdl, # Predict based on beta reg model
                     newdata = pred_grid, # Use expand grid data
                     type = "response")) # Ensure predictions are in response space

Calculate Confidence Intervals

In this section I will calculate the degree of certainty my predictions have using confidence intervals. A confidence interval is a range we are confident contains the beta’s. I will do this through boot strapping. Bootstrapping is useful here because the glmmTMB package does not calculate confidence intervals for your model. This includes the following steps:

Resample blue_carbon dataset 1000 times with replacement
Refit beta regression model on each bootstrap sample
Make a predictions column from the pred_grid made previously
Compute the 2.5th percentile (lower CI) and the 97.5th percentile (upper CI)

Code

boot <- map_dfr(1:1000, \(i) {
  # Boot strap sample
  boot_sample <- slice_sample(blue_carbon, n = 1000, replace = TRUE) # Replace the bootstrap 
  # Fit model 
  boot_mdl <- glmmTMB(fraction_carbon ~ fraction_organic_matter + max_depth, #
                      family = beta_family(link = "logit"), 
                      data = boot_sample)
  # Add predictions 
  pred_grid %>%
    mutate(predictions = predict(boot_mdl, newdata = pred_grid, type = "response"),
           boot = i)
})

# Build confidence interval 
boot_ci <- boot %>% 
  group_by(fraction_organic_matter, max_depth) %>% # Group by predictors 
  summarise(
    ci_lwr = quantile(predictions, 0.025), # Lower CI
    ci_upr = quantile(predictions, 0.975) # Upper CI
  )

Visualize Predictions

Code

# .....Plot predictions.....

predict_plot <- ggplot() +
  # Predicted line
  geom_line(data = betareg_grid, 
            aes(x = fraction_organic_matter, y = predictions), 
            color = "#345084FF", size = 1) +
  # Confidence Interval 
  geom_ribbon(data = boot_ci, 
              aes(x = fraction_organic_matter, ymin = ci_lwr, ymax = ci_upr), 
              fill = "#9FD0D6", alpha = 0.2) +
  labs(x = "Fraction Organic Matter (dimensionless)", 
    y = "Fraction Carbon (dimensionless)", 
    title = "Predicted Fraction Carbon") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.4)) # Center title 


predict_plot_data <- ggplot() +
  # Predicted line
  geom_line(data = betareg_grid, 
            aes(x = fraction_organic_matter, y = predictions), 
            color = "#345084FF", size = 1) +
  # Confidence Interval 
  geom_ribbon(data = boot_ci, 
              aes(x = fraction_organic_matter, ymin = ci_lwr, ymax = ci_upr), 
              fill = "#9FD0D6", alpha = 0.2) +
  # Original data points
  geom_point(data = blue_carbon, 
             aes(x = fraction_organic_matter, y = fraction_carbon),
             alpha = 0.5, size = 2, color = "#345084FF") +
  labs(x = "Fraction Organic Matter (dimensionless)", 
    y = "Fraction Carbon (dimensionless)", 
    title = "Predicted Fraction Carbon\nRaw Data") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.4)) # Center title 


# Plot side-by-side
predict_plot + predict_plot_data

Fig 3. and Fig 4. Predicted fraction carbon.

The predicted plot above (fig 3.) shows an increase in fraction carbon as organic matter increases. This is consistent with my findings from the beta regression model (p < 2e-16). When looking at the confidence interval, there is more certainty for organic matter proportions between 0.00 and 0.30 fractions. There is less certainty in the relationship as organic matter increases and fraction carbon increase, which is typical for confidence intervals as the level of certainty decreases as the amount of available data decreases. When plotting the raw data onto the predicted model in fig 4., you can see that the data trajectory differs from the predicted line. There can be many reasons for this behavior; my main idea is that the data itself has natural variation, as is common in all environmental datasets. When referencing fig 2., you can see that the “check-mark” shape of the data is due to variation in habitat (i.e., seagrass fraction carbon measurements varied). Since my model did not account for habitat, the predicted model may underfit the data. This natural “noise” in the data does not signify that the model failed, but rather shows that there is real-world variation within this dataset.

Conclusion and Next Steps

Soil organic matter is a known predictor of fraction carbon, as organic matter is the material that directly contributes to fraction carbon. As hypothesized, the results of the beta regression model showed a significant effect of soil organic matter on fraction carbon (p < 2e-16). Max depth’s effect was not significant, which is important to note since it was included in this study based on the assumption that it might confound the relationship between fraction carbon and organic matter. From an ecological standpoint, depth may only have a strong effect at extremes (e.g., very shallow or very deep soils), and small variations may not noticeably influence fraction carbon.

In conclusion, understanding the biogeochemical processes of these unique ecosystems is crucial. In future studies, it would be interesting to isolate the effect of habitat on fraction carbon, or to include the interactive effects of habitat and soil organic matter. This could provide a more detailed understanding of the factors controlling carbon distribution in these systems.

Github

To view this analysis, visit my Github repository

References

Center, Smithsonian Environmental Research. n.d. “Coastal Carbon Network.” https://serc.si.edu/coastalcarbon.

Cribari–Neto, Francisco, and Achim Zeileis. 2010. “Beta Regression.” Journal of Statistical Software 34 (2). https://doi.org/10.18637/jss.v034.i02.

National Oceanic and Atmospheric Administration (NOAA). 2024. “What Is Blue Carbon?” NOAA Ocean Service. 2024. https://oceanservice.noaa.gov/facts/bluecarbon.html.

Survey, U. S. Geological. 2024. “Blue Carbon.” https://www.usgs.gov/science/science-explorer/climate/blue-carbon.

Citation

BibTeX citation:

@online{segarra2025,
  author = {Segarra, Isabella},
  title = {Investigating {Blue} {Carbon} {Ecosystems} with {Beta}
    {Regression}},
  date = {2025-12-11},
  url = {https://isabellasegarra.github.io/posts/my-first-blog-post},
  langid = {en}
}

For attribution, please cite this work as:

Segarra, Isabella. 2025. “Investigating Blue Carbon Ecosystems with Beta Regression.” December 11, 2025. https://isabellasegarra.github.io/posts/my-first-blog-post.