Diversity estimation

Diversity estimation within immune cell populations is a fundamental analysis in immunoinformatics. Diversity estimation refers to the process of analyzing and quantifying the diversity of immune cells, such as T cells or B cells.

Diversity indeces

The repDiversity function offers various approaches for estimating repertoire diversity. The method parameter allows users to specify the means for diversity estimation. Users can select one of the following methods to set the means for diversity estimation:

  • shannon Shannon index takes into account both the number of different species and the distribution of individuals among those species, providing a single numerical value that reflects the diversity within a community. Higher Shannon index values indicate greater diversity.
  • chao1 Chao1 is a nonparametric asymptotic estimator used to estimate species richness, which refers to the total number of species present in a population or community.
  • hill Hill numbers represent a unified family of diversity indices characterized by a mathematical framework, with the only variation being the exponent q.
  • div Div, known as true diversity or the effective number of types, represents the count of equally abundant types required for the average proportional abundance of the types to match that observed in the dataset of interest, where all types may not be equally abundant.
  • gini.simp The Gini-Simpson index quantifies the probability of interspecific encounter, indicating the likelihood that two randomly selected entities within a community represent different types or species.
  • inv.simp Inverse Simpson index represents the effective number of types achieved when the weighted arithmetic mean is utilized to quantify the average proportional abundance of types within the dataset of interest.
  • gini The Gini coefficient measures the inequality among values in a frequency distribution, such as income levels. A Gini coefficient of zero signifies complete equality, where all values are identical (e.g., everyone has the same income). Conversely, a Gini coefficient of one, or 100%, represents maximum inequality among values (e.g., where one person possesses all the income).
  • raref Rarefaction is a statistical technique used to estimate species richness based on sampled data by extrapolating the expected number of species in a population.

Table (Data format)

This is an example of a data format containing the necessary information for diversity estimation visualizations. The table includes mock data specifically generated for this purpose.

The “Sample” column contains unique identifiers for each sample, while the “Group” column indicates the different groups to which the samples belong. The “Patient” column provides information about the respective patients associated with each sample. The “Shannon” column corresponds to the Shannon index utilized for the diversity analysis.

Code
library(data.table)
library(reactable)

df = fread("../inst/data/diversity1.txt")

reactable(
  
  df, 
  theme = reactableTheme(
    backgroundColor  = "transparent"
  )
)

Boxplot

Boxplot is one of the most commonly used chart types for comparing the distribution of a numeric variable across multiple groups.

In a boxplot, the central line within the box represents the median of the data, dividing it into two equal parts. The edges of the box represent the upper and lower quartiles, while the extreme lines extend to the the highest and lowest values within the data range, excluding outliers.

Boxplots are created using the geom_boxplot() function from the ggplot2 package.

Boxplot 1

It’s important to note that while boxplots provide a summary of data distribution for each group, they may hide the underlying distribution details.

To address this concern, a common practice is to overlay individual data points using geom_point() behind the boxplot, offering a clearer visualization of the dataset’s distribution.

Code
# libraries -------

library(ggplot2)
library(ggforce)

library(paletteer)
# plot 1 ---------------------

df |>
    ggplot(aes(Group, Shannon)) +
    
    geom_point(
        aes(fill = Group),
        position = position_jitternormal(sd_y = 0, sd_x = .08),
        shape = 21, size = 2, stroke = .15, color = "white"
    ) +

    scale_fill_manual(values = paletteer_d("ggsci::hallmarks_light_cosmic")) +
    
    geom_boxplot(width = .15, outlier.shape = NA) +
    
    theme_minimal() +
    
    theme(
        legend.position = "none",
        
        axis.line = element_line(linewidth = .55),
        axis.ticks = element_line(linewidth = .55),
        
        panel.grid.major = element_line(linewidth = .55),
        panel.grid.minor = element_line(linewidth = .45, linetype = "dashed"),
        
        plot.background = element_rect(fill = "transparent", color = NA),
        plot.margin = margin(20, 20, 20, 20)
    )

Boxplot 2

The box plot illustrates the Shannon index on the y-axis with distinct patient groups depicted along the x-axis. Different colors are employed to distinguish between the two patient groups.

Code
# libraries ------------

library(ggnewscale)
library(colorspace)

# plot 2 ------------

df |>
  ggplot(aes(Group, Shannon)) +
    
  geom_point(
      aes(fill = Patient),
      position = position_jitterdodge(jitter.width = .15, dodge.width = .5),
      shape = 21, size = 2, stroke = .15, color = "white"
  ) +

  scale_fill_manual(values = paletteer_d("ggsci::hallmarks_light_cosmic") |> lighten(.25)) +

  geom_boxplot(
    aes(color = Patient),
    position = position_dodge(width = .5),
    width = .2, outlier.shape = NA
  ) +
  
  scale_color_manual(values = paletteer_d("ggsci::hallmarks_light_cosmic") |> darken(.25)) +
  
  theme_minimal() +
  
  theme(
      legend.position = "bottom",
      legend.justification = "left",
      legend.title = element_blank(),
      
      axis.line = element_line(linewidth = .55),
      axis.ticks = element_line(linewidth = .55),
      
      panel.grid.major = element_line(linewidth = .55),
      panel.grid.minor = element_line(linewidth = .45, linetype = "dashed"),
      
      plot.background = element_rect(fill = "transparent", color = NA),
      plot.margin = margin(20, 20, 20, 20)
  )