Overlay Histogram with Fitted Density Curve in Base R & ggplot2 Package (2 Examples)

 

In this tutorial you’ll learn how to fit a density plot to a histogram in the R programming language.

Let’s just jump right in!

 

Introduction of Example Data

In the examples of this R programming tutorial, we’ll use the following example data:

set.seed(18462)                                       # Create example data
data <- data.frame(x = round(rnorm(1000, 10, 10)))
head(data)                                            # Print example data
#     x
# 1   6
# 2   7
# 3  14
# 4   4
# 5 -10
# 6  16

As the output of the RStudio console shows, our example data contains only one numeric column. Now, let’s draw these data.

 

Example 1: Histogram & Density with Base R

Example 1 explains how to fit a density curve to a histogram with the basic installation of the R programming language. First, we need to use the hist function to draw a histogram:

hist(data$x, prob = TRUE)                             # Create histogram with Base R

 

Figure 1: Histogram Created with Base R.
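
The prob = TRUE argument puts the bars on the density scale instead of the count scale, which is what allows a density curve to be drawn on the same axes. The following minimal sketch checks this numerically by computing the histogram without plotting it. The object names h, bin_widths, and total_area are only illustrative:

```r
set.seed(18462)                                       # Recreate example data
data <- data.frame(x = round(rnorm(1000, 10, 10)))

h <- hist(data$x, plot = FALSE)                       # Compute bins without drawing
bin_widths <- diff(h$breaks)                          # Width of each bar
total_area <- sum(h$density * bin_widths)             # Bar height times bar width
print(total_area)                                     # Should be 1 up to rounding error
```

Because the bar areas sum to 1, the bars and a kernel density estimate share the same scale.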

 

Figure 1 shows the output of the previous R code: A histogram without a density line. If we want to add a kernel density to this graph, we can use a combination of the lines and density functions:

lines(density(data$x), col = "red")                   # Overlay density curve

 

Figure 2: Histogram & Overlaid Density Plot Created with Base R.

 

Figure 2 illustrates the final result of Example 1: A histogram with a fitted density curve created in Base R.
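
One practical detail: hist sets the y-axis range from the bars alone, so a density curve that peaks above the tallest bar would be cut off. A possible workaround, sketched here under the assumption that both objects are computed before plotting, is to pass an explicit ylim. The names h, d, and y_max are illustrative:

```r
set.seed(18462)                                       # Recreate example data
data <- data.frame(x = round(rnorm(1000, 10, 10)))

h <- hist(data$x, plot = FALSE)                       # Histogram values only
d <- density(data$x)                                  # Kernel density estimate
y_max <- max(h$density, d$y)                          # Highest point of bars or curve

hist(data$x, prob = TRUE, ylim = c(0, y_max))         # Leave room for the curve
lines(d, col = "red")                                 # Overlay density curve
```

For the example data the difference is small, but the same pattern helps with strongly peaked data.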

 

Example 2: Histogram & Density with ggplot2 Package

Example 2 shows how to create a histogram with a fitted density plot based on the ggplot2 add-on package. First, we need to install and load ggplot2 to R:

install.packages("ggplot2")                           # Install & load ggplot2
library("ggplot2")

Now, we can use a combination of the ggplot, geom_histogram, and geom_density functions to create our graphic:

ggplot(data, aes(x)) +                                # ggplot2 histogram & density
  geom_histogram(aes(y = after_stat(density))) +      # after_stat() replaces deprecated stat()
  geom_density(col = "red")

 

Figure 3: Histogram & Overlaid Density Plot Created with ggplot2 Package.

 

Figure 3 visualizes our histogram and density line created with the ggplot2 package. Note that the histogram bars of Example 1 and Example 2 look slightly different, because the two functions choose their bins differently by default: geom_histogram uses 30 bins, whereas hist derives its breaks from Sturges' rule.
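
If the two plots should use identical bars, one option is to hand the Base R breaks to geom_histogram through its breaks argument. The sketch below assumes that ggplot2 (version 3.3.0 or later, for after_stat) is already installed. The names base_breaks and p are illustrative:

```r
library("ggplot2")

set.seed(18462)                                       # Recreate example data
data <- data.frame(x = round(rnorm(1000, 10, 10)))

base_breaks <- hist(data$x, plot = FALSE)$breaks      # Breaks chosen by Base R

p <- ggplot(data, aes(x)) +                           # Same bins as Example 1
  geom_histogram(aes(y = after_stat(density)), breaks = base_breaks) +
  geom_density(col = "red")
p                                                     # Draw the plot
```

Alternatively, the binwidth or bins arguments of geom_histogram can be set by hand. That also silences the message ggplot2 prints when it falls back to 30 bins.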

 

Video, Further Resources & Summary

Some time ago, I published a video on my YouTube channel that illustrates the content of this tutorial.

 

 

Furthermore, you might want to have a look at some of the related articles which I have published on my homepage.

 

In this tutorial, I illustrated how to combine histograms with probability on the y-axis and density plots in the R programming language. If you have additional questions or comments, let me know in the comments section below.

 


9 Comments

  • Hello Joachim

    Thanks for your nice videos. I have the following R script which is for only one .tsv file. I want to tweak it in a way that can plot (Histogram + line) two similar but separate .tsv files with different colours overlaid on each other. Could you please guide?

    # read in data
    df = read.csv("your_distribution.tsv", sep="\t")

    # filter Ks distribution (0.001 < Ks < 5)
    lower_bound = 0.001
    upper_bound = 5
    df = df[df$Ks > lower_bound & df$Ks < upper_bound,]

    # perform node-averaging (redo when applying other filters)
    dff = aggregate(df$Ks, list(df$Family, df$Node), mean)

    # reflect the data around the lower Ks bound to account for boundary effects
    ks = c(dff$x, -dff$x + lower_bound)

    # plot a histogram and KDE on top
    hist(ks, prob=TRUE, xlim=c(0, upper_bound), n=50)
    lines(density(ks), xlim=c(0, upper_bound))

  • Hello Cansu
    Yes that is right.
    Regards
    Ardy

    • Hello Ardy,

      Here is how to do it in two ways (via graphics and ggplot2 libraries):

      Expanding the data frame for this example:

      set.seed(18462)                                       # Create example data
      data <- data.frame(x = round(rnorm(1000, 10, 10)), y= round(rnorm(1000, 20, 10)))
      head(data)                                            # Print example data

      Using R graphics:

      hist(data$x, prob = TRUE)   
      lines(density(data$x), col = "red")   
      lines(density(data$y), col = "blue")

      Using ggplot2 library:

      ggplot(data, aes(x)) +                                # ggplot2 histogram & density
        geom_histogram(aes(y = after_stat(density))) +
        geom_density(aes(x = x, y = after_stat(density)), col = "red") +
        geom_density(aes(x = y, y = after_stat(density)), col = "blue")

      I hope this answers your question. Let me know if you have any further comments.

      Regards,
      Cansu

  • Dear Cansu
    Thanks a lot for your help. Sorry, I am very new to R. Could you please let me know how and where to fit this code into the context of the following, if possible?

    # read in data
    df = read.csv("your_distribution1.tsv", sep="\t")
    df = read.csv("your_distribution2.tsv", sep="\t")

    # filter Ks distribution (0.001 < Ks < 5)
    lower_bound = 0.001
    upper_bound = 5
    df = df[df$Ks > lower_bound & df$Ks < upper_bound,]

    # perform node-averaging (redo when applying other filters)
    dff = aggregate(df$Ks, list(df$Family, df$Node), mean)

    # reflect the data around the lower Ks bound to account for boundary effects
    ks = c(dff$x, -dff$x + lower_bound)

    # plot a histogram and KDE on top
    hist(ks, prob=TRUE, xlim=c(0, upper_bound), n=50)
    lines(density(ks), xlim=c(0, upper_bound))

  • Sorry, I am not sure about the term "overlay". It may be better to say that I want to overlap two graphs for comparison purposes.

  • Yes, that is correct.
