Compute z-score in R (2 Examples)

 

This article shows how to calculate z-scores (also called standard scores, z-values, normal scores, and standardized variables) in the R programming language.

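The content of the page is structured as follows:

  • Introducing Example Data
  • Example 1: Standardize Values Manually
  • Example 2: Standardize Values Using scale() Function
  • Video, Further Resources & Summary

If you want to learn more about these topics, keep reading.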

 

Introducing Example Data

As a first step, we’ll need to create some data that we can use in the examples later on.

x <- c(7, 6, 1, 4, 3, 5, 3, 7, 6, 5)                           # Create example data
x                                                              # Print example data
# 7 6 1 4 3 5 3 7 6 5

The previous output of the RStudio console shows that our example data is a vector consisting of ten numeric values. The values of our example data are not standardized / normalized yet.

 

Example 1: Standardize Values Manually

Example 1 explains how to standardize the values of a vector or data frame column manually by using the mean and sd functions in R.

Have a look at the following R code:

x_stand1 <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # Standardize manually
x_stand1                                                       # Print standardized values
# 1.1816039  0.6678631 -1.9008410 -0.3596186 -0.8733594  0.1541222 -0.8733594  1.1816039  0.6678631  0.1541222

The previous output of the RStudio console shows the standardized values that correspond to our input vector.
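As a quick sanity check that is not part of the original example, standardized values should have a mean of approximately 0 and a standard deviation of 1. A minimal sketch, re-creating the example vector so that the snippet runs on its own:

```r
x <- c(7, 6, 1, 4, 3, 5, 3, 7, 6, 5)                           # Same example data as above
x_stand1 <- (x - mean(x)) / sd(x)                              # Standardize manually

mean(x_stand1)                                                 # Approximately 0 (up to floating-point error)
sd(x_stand1)                                                   # Equals 1 (up to floating-point error)
```

If both checks hold, the vector has been standardized correctly.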

Note that we have set the na.rm argument to TRUE. If your data contains missing values, those values are ignored when the mean and the standard deviation are computed. The missing values themselves remain NA in the standardized output.
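To illustrate how na.rm behaves, here is a minimal sketch with a made-up vector x_na that contains one missing value. The mean and standard deviation are computed from the non-missing values only, while the missing element itself stays NA in the result:

```r
x_na <- c(7, 6, NA, 4, 3)                                      # Hypothetical data with one missing value
x_na_stand <- (x_na - mean(x_na, na.rm = TRUE)) /              # Standardize manually, ignoring the NA
  sd(x_na, na.rm = TRUE)                                       # when computing mean and sd
x_na_stand                                                     # The NA position stays NA
# 1.0954451  0.5477226         NA -0.5477226 -1.0954451
```

Without na.rm = TRUE, both mean() and sd() would return NA, and every standardized value would be NA as well.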

 

Example 2: Standardize Values Using scale() Function

The previous example shows how to calculate z-scores manually based on their formula. However, the R programming language provides a function called scale, which computes z-scores in a single function call.

We can use the scale function as shown below:

x_stand2a <- scale(x)                                          # Standardize using scale()
x_stand2a                                                      # Print standardized values
#             [,1]
#  [1,]  1.1816039
#  [2,]  0.6678631
#  [3,] -1.9008410
#  [4,] -0.3596186
#  [5,] -0.8733594
#  [6,]  0.1541222
#  [7,] -0.8733594
#  [8,]  1.1816039
#  [9,]  0.6678631
# [10,]  0.1541222
# attr(,"scaled:center")
# [1] 4.7
# attr(,"scaled:scale")
# [1] 1.946507

As you can see, the scale function returns a matrix instead of a vector. In case you prefer to have a standardized vector, you can modify the output of the scale function as shown below:

x_stand2b <- as.numeric(x_stand2a)                             # Convert matrix to vector
x_stand2b                                                      # Print standardized values
# 1.1816039  0.6678631 -1.9008410 -0.3596186 -0.8733594  0.1541222 -0.8733594  1.1816039  0.6678631  0.1541222

The previous output is exactly the same as in Example 1.
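The examples above use a vector, but the same two approaches can also be applied to the columns of a data frame. The following is only a sketch: the data frame data and its column names x1 and x2 are made up for illustration.

```r
data <- data.frame(x1 = c(7, 6, 1, 4, 3),                      # Hypothetical example data frame
                   x2 = c(10, 20, 30, 40, 50))

data$x1_stand <- (data$x1 - mean(data$x1)) / sd(data$x1)       # Standardize one column manually
data$x2_stand <- as.numeric(scale(data$x2))                    # Standardize one column using scale()

data_stand <- as.data.frame(scale(data[ , c("x1", "x2")]))     # Standardize several columns at once
data_stand                                                     # Print standardized data frame
```

Note that scale() standardizes each column separately, i.e. each column is centered and scaled with its own mean and standard deviation.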

 

Video, Further Resources & Summary

Do you want to learn more about standardization in R? Then I recommend having a look at the following video, which I have published on my YouTube channel. In the video, I explain the R programming syntax of this tutorial in RStudio.

 

 

In addition, you might want to have a look at the related tutorials on this website:

 

To summarize: You learned in this article how to standardize vectors and data frame columns in the R programming language. If you have additional questions, don’t hesitate to let me know in the comments below.

 



10 Comments

  • Henri B Tuthill
    December 3, 2020 2:23 pm

    Joachim,
    I wish you would put all this wonderful, useful information about R into a book. Very helpful information and much appreciated.
    Regards,
    Henri Tuthill

    • Hi Henri,

      Thank you very much for such awesome feedback!

      Indeed, I’m planning to release a book or maybe a video series in the future. Unfortunately, I never find the time to do it.

      However, this is definitely something that will come sooner or later!

      Regards,

      Joachim

  • EDUARDO QUENTAL
    December 22, 2020 12:26 pm

    Hi Joachim,
    Your tutorials on R were very useful for me to complete my course on Data Science in Brazil.
    Although I am not well versed in the language, your objective tutorials and Google Translate helped me a lot! :-))
    Thank you!
    Merry Christmas and Prosperous New Year!

    • Hey Eduardo,

      Thank you very much for these kind words. I’m very happy to get such positive feedback from you! 🙂

      Merry Christmas and a happy new year for you as well!

      Joachim

  • This is an amazing tutorial. I have a question, if it could possibly be answered. Let’s say you have a dataset with multiple columns and you want to calculate the z-scores for a subset of a certain variable. What would you do then?
    For example: calculate the z-scores for the female grades
    Gender   Grade
    Male     82
    Female   100
    Male     95
    Female   75
    Male     77
    Male     88
    Would you use the population mean and standard deviation (all grades) but only the female data points when subtracting, or would you use only the mean and standard deviation of the female sample? Thank you for everything.

    • Hi Moe,

      Thanks a lot for the nice comment!

      This depends a bit on what you want to evaluate. However, if you want to analyze the females separately, I would use only the mean and standard deviation of the female subset.
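A minimal sketch of this subset approach, using the grades from the question above. The object names grades, female, and z_female are made up for illustration, and the ave() call is just one possible way to compute z-scores within each group:

```r
grades <- data.frame(Gender = c("Male", "Female", "Male", "Female", "Male", "Male"),
                     Grade  = c(82, 100, 95, 75, 77, 88))      # Data from the question

female <- grades$Grade[grades$Gender == "Female"]              # Extract the female grades
z_female <- (female - mean(female)) / sd(female)               # z-scores based on the female subset only
z_female
# 0.7071068 -0.7071068

grades$z_by_gender <- ave(grades$Grade, grades$Gender,         # z-scores within each gender group
                          FUN = function(g) (g - mean(g)) / sd(g))
grades
```

With only two female grades, the two z-scores are necessarily symmetric around zero, so a larger subset would be needed for a meaningful analysis.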

      I hope that helps!

      Joachim

  • Thank you so much

  • Hello Matthias,
    thank you very much for the nice explanations. Nevertheless, I have been wondering for a while about the equivalence of the term z-score and the formula implemented in R for scale(x), and I hope you can help: the R formula makes use of the sample SD (after testing and studying the documentation there is no doubt here), while normalizing data is, to my knowledge, achieved using the population SD. The other variant would, to my knowledge, result in a t statistic. This has implications, e.g. when building a z-score based additive index, as the final index values will only have the intended properties (μ = 0 and (!!) SD = 1) if the population SD is used.

    Hence, I wonder about the misleading name of the R command, or whether there is something I have overlooked until now, which seems more likely. Any clue will be highly appreciated.

    Thank you very much.

    • Hello Marcel,

      Please see the Wikipedia article linked from the z-scores keyword at the beginning of the tutorial. There you will find the statement “When the population mean and the population standard deviation are unknown, the standard score may be estimated by using the sample mean and sample standard deviation as estimates of the population values.”

      However, it is important to underline the difference in concepts of z-score and z-test, which might be the cause of your confusion.

      A z-score is a statistical measure that describes a data value in terms of standard deviations from the mean. The formula for a z-score is: z = (X − μ) / σ. The scale() function in R standardizes data by subtracting the mean and dividing by the standard deviation. By default, scale() uses the sample standard deviation (denominator n − 1, as returned by sd()), which is appropriate for most data analysis situations where you are working with a sample rather than the entire population. In other words, the function calculates z-scores in the context of a sample, not the population.
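The difference between the two standard deviations can be sketched as follows, using the example vector from the tutorial. The object names sd_sample, sd_pop, z_sample, and z_pop are made up for illustration, and the population SD is computed by hand because sd() always uses the denominator n − 1:

```r
x <- c(7, 6, 1, 4, 3, 5, 3, 7, 6, 5)                           # Example data from the tutorial
n <- length(x)

sd_sample <- sd(x)                                             # Sample SD (denominator n - 1)
sd_pop    <- sqrt(sum((x - mean(x))^2) / n)                    # Population SD (denominator n)

z_sample <- (x - mean(x)) / sd_sample                          # Same values as scale(x)
z_pop    <- (x - mean(x)) / sd_pop                             # z-scores based on the population SD

all.equal(as.numeric(scale(x)), z_sample)                      # TRUE
all.equal(as.numeric(scale(x, scale = sd_pop)), z_pop)         # TRUE: scale() accepts a custom divisor

sd(z_sample)                                                   # 1 (sample SD of z_sample)
sqrt(mean(z_pop^2))                                            # 1 (population SD of z_pop)
```

Each variant has a standard deviation of 1 only with respect to the type of SD that was used to compute it.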

      A z-test is a type of statistical test used to determine whether there is a significant difference between sample and population parameters or between the means of two samples. A basic form of a one-sample z-test comparing the sample mean to a population mean is z = (X̄ − μ) / (σ / √n), where X̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size. Z-tests are typically used when the population standard deviation is known and the sample size is large (usually n > 30). For smaller samples or unknown population standard deviations, a t-test is usually more appropriate, which uses the sample standard deviation s: t = (X̄ − μ) / (s / √n).
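The two test statistics can be sketched in R as follows. The values mu0 (population mean under the null hypothesis) and sigma (known population standard deviation) are hypothetical and chosen only for illustration:

```r
x     <- c(7, 6, 1, 4, 3, 5, 3, 7, 6, 5)                       # Example data from the tutorial
mu0   <- 4                                                     # Hypothetical population mean
sigma <- 2                                                     # Hypothetical known population SD
n     <- length(x)

z_stat <- (mean(x) - mu0) / (sigma / sqrt(n))                  # One-sample z statistic
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))                  # One-sample t statistic

p_z <- 2 * pnorm(-abs(z_stat))                                 # Two-sided p-value of the z-test
p_t <- 2 * pt(-abs(t_stat), df = n - 1)                        # Two-sided p-value of the t-test

t.test(x, mu = mu0)                                            # Built-in t-test for comparison
```

Both statistics describe the sample mean, whereas a z-score describes a single data value, which is the conceptual difference between the tests and the output of scale().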

      I hope the concept of Z-score is clearer now.

      Best,
      Cansu

