Make sankey, alluvial and sankey bump plots in ggplot



The goal of ggsankey is to make beautiful sankey, alluvial and sankey bump plots in ggplot2.


You can install the development version of ggsankey from GitHub with:

# install.packages("devtools")
devtools::install_github("davidsjoberg/ggsankey")

How does it work?

Google defines a sankey as:

A sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages.

To plot a sankey diagram with ggsankey, each observation needs a stage (called a discrete x-value in ggplot) and must be part of a node. Furthermore, each observation needs instructions for which node it will belong to in the next stage. See the image below for some clarification.

Hence, to use geom_sankey the aesthetics x, next_x, node and next_node are required. The last stage should point to NA. The aesthetics fill and color will affect both nodes and flows.
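To make the required layout concrete, here is a minimal sketch that builds such a data frame by hand in base R. The stage names ("before", "after"), the node names and the three observations are made up for illustration; only the four column names come from the aesthetics listed above.

```r
# Hand-built long-format data for two stages.
# Assumption: stage and node names are hypothetical; only the column names
# (x, node, next_x, next_node) mirror the aesthetics geom_sankey requires.
stages <- c("before", "after")

long_df <- data.frame(
  # Three observations, each listed once per stage
  x         = factor(c("before", "before", "before", "after", "after", "after"),
                     levels = stages),
  node      = c("A", "A", "B", "A", "B", "B"),
  # Rows in the first stage say where the observation goes next;
  # rows in the last stage point to NA
  next_x    = factor(c("after", "after", "after", NA, NA, NA), levels = stages),
  next_node = c("A", "B", "B", NA, NA, NA)
)

# Every row in the last stage must point to NA
stopifnot(all(is.na(long_df$next_node[long_df$x == "after"])))

print(long_df)
```

A data frame shaped like this can be mapped with aes(x = x, next_x = next_x, node = node, next_node = next_node) and drawn with geom_sankey.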

To control geometries (not changed by data) like fill, color, size, alpha etc. for nodes and flows, you can either set a global value that affects both or specify which one you want to alter. For example, node.color = 'black' will only draw a black line around the nodes, but not the flows (links).



A basic sankey plot that shows how dimensions are linked.


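library(ggsankey)
library(ggplot2)
library(dplyr)

# Reshape the chosen columns into the long format geom_sankey expects
df <- mtcars %>%
  make_long(cyl, vs, am, gear, carb)

ggplot(df, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node))) +
  geom_sankey()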

And by adding a little pimp.

  • Labels with geom_sankey_label, which places labels in the center of nodes if given the same aesthetics.

  • ggsankey also comes with custom minimalistic themes that can be used. Here I use theme_sankey.

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = node)) +
  geom_sankey(flow.alpha = .6,
              node.color = "gray30") +
  geom_sankey_label(size = 3, color = "white", fill = "gray40") +
  scale_fill_viridis_d() +
  theme_sankey(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = .5)) +
  ggtitle("Car features")


Alluvial plots are very similar to sankey plots but have no spaces between nodes and start at y = 0 instead of being centered around the x-axis.

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = node)) +
  geom_alluvial(flow.alpha = .6) +
  geom_alluvial_text(size = 3, color = "white") +
  scale_fill_viridis_d() +
  theme_alluvial(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = .5)) +
  ggtitle("Car features")


Sankey bump plots are a mix between bump plots and sankey plots and are mostly useful for time series. When a group becomes larger than another it bumps above it.

# install.packages("gapminder")
library(gapminder)

# Total GDP per continent and year, in billions
df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(gdp = (sum(pop * gdpPercap)/1e9) %>% round(0), .groups = "keep") %>%
  ungroup()

ggplot(df, aes(x = year,
               node = continent,
               fill = continent,
               value = gdp)) +
  geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 6) +
  scale_fill_viridis_d(option = "A", alpha = .8) +
  theme_sankey_bump(base_size = 16) +
  labs(x = NULL,
       y = "GDP ($ bn)",
       fill = NULL,
       color = NULL) +
  theme(legend.position = "bottom") +
  labs(title = "GDP development per continent")

  • size of the flow

    Is there a way in geom_sankey to specify an aesthetic that provides directly the size of the flow, i.e. the number of connections between the nodes?

    For example:

    df <- data.frame(expand.grid(LETTERS[1:3],LETTERS[1:3]))
    df$N <- sample(1:10,size = nrow(df),replace = T)

    I would like something like

    df %>%
    make_long(Var1, Var2)%>%
      ggplot( aes(x = x, 
                     next_x = next_x, 
                     node = node, 
                     next_node = next_node,
                     fill = factor(node))) +


    But with the flows given by N. A hack would be to repeat each row by N before the make_long, but I am sure there is a proper way.

    opened by dmongin 2
  • "sum_" function missing for geom_sankey_bump()

    Following the sum_ issue reported in #1, I have the same issue with the example provided in the readme.

    I tried:

    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #>     intersect, setdiff, setequal, union
    df <- gapminder %>%
        group_by(continent, year) %>%
        summarise(gdp = (sum_(pop * gdpPercap)/1e9) %>% round(0), .groups = "keep") %>%
    #> Error: Problem with `summarise()` input `gdp`.
    #> x could not find function "sum_"
    #> ℹ Input `gdp` is `(sum_(pop * gdpPercap)/1e+09) %>% round(0)`.
    #> ℹ The error occurred in group 1: continent = "Africa", year = 1952.
    ggplot(df, aes(x = year,
                   node = continent,
                   fill = continent,
                   value = gdp)) +
        geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 6) +
        scale_fill_viridis_d(option = "A", alpha = .8) +
        theme_sankey_bump(base_size = 16) +
        labs(x = NULL,
             y = "GDP ($ bn)",
             fill = NULL,
             color = NULL) +
        theme(legend.position = "bottom") +
        labs(title = "GDP development per continent")
    #> Error:   You're passing a function as global data.
    #>   Have you misspelled the `data` argument in `ggplot()`

    Created on 2021-04-03 by the reprex package (v1.0.0)

    I also tried removing sum_ and replacing with sum when writing to the variable df, but I also had no luck.

    See here:

    df <- gapminder %>%
      group_by(continent, year) %>%
      summarise(gdp = (sum(pop * gdpPercap)/1e9) %>% round(0), .groups = "keep") %>%
    ggplot(df, aes(x = year,
                   node = continent,
                   fill = continent,
                   value = gdp)) +
      geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 6) +
      scale_fill_viridis_d(option = "A", alpha = .8) +
      theme_sankey_bump(base_size = 16) +
      labs(x = NULL,
           y = "GDP ($ bn)",
           fill = NULL,
           color = NULL) +
      theme(legend.position = "bottom") +
      labs(title = "GDP development per continent")

    My console error is different than reprex's for some reason; this is my console error:

    Error: Problem with `summarise()` input `flow_freq`.
    x could not find function "sum_"
    ℹ Input `flow_freq` is `sum_(value)`.
    ℹ The error occurred in group 1: n_x = 1952, node = "Oceania 1952", n_next_x = 1957, next_node = "Oceania 1957".
    opened by engineerchange 2
  • Using ggsankey in Shiny

    It seems I cannot use ggsankey with my Shiny app.

    I did a test app to see if it worked, but I get an error from toJSON about a named vector. I made a simple selector to use a database extract as the base dataframe for the sankey plot. On initialisation, it works, but after switching to another df, I get the error.

    Sadly, as an R beginner, I cannot ascertain that I am not at the origin of the issue.

    opened by c3-rkieffer 1
  • ggsankey vs ggalluvial

    Hi- I just discovered the existence of Sankey plots (or rather, that such things had a name and could be done in R...).

    I found your package and ggalluvial, which seems to pre-date ggsankey. Can you comment on the pros and cons of ggsankey? Both packages seem pretty good at a first glance. Thanks!

    opened by dariober 1
  • Suggest license

    Thanks for this package! It's really cool. But I saw that under license, you do not have any license. Legally this means no one can use or modify it. Can you add a license?

    For more on this you can read this page.

    opened by GaborioSensata 1
  • could not find function "sum_" with geom_sankey_bump()

    I get this error when I run the example for geom_sankey_bump()

    Problem with `summarise()` input `gdp`.
    x could not find function "sum_"
    ℹ Input `gdp` is `(sum_(pop * gdpPercap)/1e+09) %>% round(0)`.
    ℹ The error occurred in group 1: continent = "Africa", year = 1952.
      1. base::source("~/.active-rstudio-document", echo = TRUE)
     13. base::.handleSimpleError(...)
     14. dplyr:::h(simpleError(msg, call))
    Run `rlang::last_trace()` to see the full context.

    geom_sankey() works like a charm, by the way! 😀

    opened by gkaramanis 1
  • Flow.fill isn't working

    I have a sankey chart with 21 nodes, and I'm trying to fill the flows one of three colors, but flow.fill isn't working. Is there more documentation on how it works?

    opened by bdidds2 2
  • Labels always get misaligned

    Labels always get misaligned. They don't seem to follow any logical rule. Using ggsankey on a W10 laptop with R version 4.2.1. Output was generated with pdf() since there is no proper antialiasing in the png() output (not an issue of ggsankey but of the OS), and some labels get cropped for being so far from the diagram. The package is excellent however!

    opened by gluijk 3
  • How to skip nodes with NA value in ggsankey?

    Suppose I have this dataset (the actual dataset has 30+ columns and thousands of ids)

    	df <- data.frame(id = 1:5,
    				admission = c("Severe", "Mild", "Mild", "Moderate", "Severe"),
    				d1 = c(NA, "Moderate", "Mild", "Moderate", "Severe"),
    				d2 = c(NA, "Moderate", "Mild", "Mild", "Moderate"),
    				d3 = c(NA, "Severe", "Mild", "Mild", "Severe"),
    				d4 = c(NA, NA, "Mild", "Mild", NA),
    				outcome = c("Dead", "Dead", "Alive", "Alive", "Dead"))

    I want to make a Sankey diagram that illustrates the daily severity of the patients over time. However, when the observation reaches NA (means that an outcome has been reached), I want the node to directly link to the outcome.

    This is how the diagram should look: [image]

    Image fetched from the question asked by @qdread here

    Is this possible with ggsankey?

    This is my current code:

    df.sankey <- df %>%
    	make_long(admission, d1, d2, d3, d4, outcome)
    ggplot(df.sankey, aes(x = x,
    					 next_x = next_x,
    					 node = node,
    					 next_node = next_node,
    					 fill = factor(node),
    					 label = node)) +
    	geom_sankey(flow.alpha = 0.5,
    				node.color = NA,
    				show.legend = TRUE) +
    	geom_sankey_text(size = 3, color = "black", fill = NA, hjust = 0, position = position_nudge(x = 0.1))

    Which results in this diagram: [image]

    Thanks in advance for the help.

    opened by gilbertlzrus 0
  • missing dplyr:: call

    in sankey.R, in the function StatSankeyFlow (line 228) is summarise(flow_freq = dplyr::n(), .groups = "keep") which is missing the explicit reference to dplyr.

    opened by ulysses-sr 0
  • how can I joint ggsankey and a dotplot?


    I put it together myself. The coordinates don't match:


    This is what I'm looking for:


    my code:

    pl <- ggplot(dat3, aes(x = x, 
                           next_x = next_x,
                           node = node, 
                           next_node = next_node,
                           fill = factor(node),
                           label = node2
                           )) +
      geom_sankey(flow.alpha = 0.5, node.color = "black") +
      geom_sankey_label(size = 6, color = "black", fill = "white", hjust = 1, family = "Times") +
      scale_fill_viridis_d(option = "magma") +
      theme_sankey(base_size = 16) +
      scale_x_discrete(expand = c(0.01,0.1)) +
      theme(legend.position = "none",
            axis.title = element_blank(),
            axis.text = element_blank())
    kk_dot <- dotplot(kk, showCategory=10) +
      theme(text = element_text(family = "Times"),
            axis.text.y = element_text(size = 12, face = "bold"),
            axis.text.x = element_text(size = 10, face = "bold"),
            axis.title.x = element_text(size = 14, face = "bold"),
            legend.title = element_text(face = "bold"))
    kk_dot2 <- kk_dot + theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank())
    design <- c("
    all_p <- pl + kk_dot2 + theme(text = element_text(size = 20), 
                                  axis.title.x = element_text(size = 25),
                                  axis.text.x = element_text(size = 20)) +
      plot_layout(design = design)

    Looking forward to your reply!

    opened by Sagityq 1
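For the flow-size question in the list above, the row-repetition workaround mentioned there can be sketched in base R. This is only an illustration of that hack, not a built-in ggsankey feature: it repeats each row N times so that a plain row count per flow equals N. The seed is an assumption added to make the sketch reproducible.

```r
# Example data from the question: every pair of letters with a random count N
set.seed(1)  # assumption: seed added only for reproducibility
df <- data.frame(expand.grid(LETTERS[1:3], LETTERS[1:3]))
df$N <- sample(1:10, size = nrow(df), replace = TRUE)

# Repeat each row N times and keep only the two stage columns
df_expanded <- df[rep(seq_len(nrow(df)), times = df$N), c("Var1", "Var2")]
rownames(df_expanded) <- NULL

# Sanity check: the number of rows per (Var1, Var2) pair equals the original N
counts <- as.data.frame(table(df_expanded$Var1, df_expanded$Var2))
stopifnot(all(counts$Freq == df$N))

# Next step, as in the question (requires ggsankey and dplyr):
# df_expanded %>% make_long(Var1, Var2)
```

For large counts the expanded data frame grows to sum(N) rows, so this is best suited to modest totals.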
David Sjoberg
Happy R user. Twitter: @davsjob
