Generating Variables with Predefined Correlation

Background

Suppose you are working on a project where the relationship between two variables is influenced by an unobserved confounder, and you want to simulate data that reflects this dependency. Standard random number generators often assume independence between variables, making them unsuitable for this task. Instead, you need a method to introduce specific correlations into your data generation process.

A powerful and efficient way to achieve this is through Cholesky decomposition. By decomposing a positive-definite correlation matrix into its triangular components, you can transform independent random variables into correlated ones. This approach is versatile, efficient, and mathematically grounded, making it ideal for simulating realistic datasets with predefined (linear) relationships.

Diving Deeper

The core idea revolves around the transformation of independent variables into correlated variables using matrix operations. Assume we want to generate n observations for p variables with a target correlation matrix \Sigma.

The algorithm is as follows:

  1. Generate Independent Variables: Create a matrix X of dimensions n \times p, where each column is drawn independently from the standard normal distribution N(0,1):

        \begin{equation*}X = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1p} \\x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix} \end{equation*}

  2. Cholesky Decomposition: Decompose the target correlation matrix \Sigma as:

        \[\Sigma = LL^T,\]


    where L is a lower triangular matrix.
  3. Transform Variables: Multiply the independent variable matrix X by L^T to obtain the correlated variables:

        \[Y = XL^T.\]


    Here, Y is an n\times p matrix whose columns have the desired correlation structure defined by \Sigma: since each row of X has identity covariance, each row of Y has covariance LL^T = \Sigma. A condensed R sketch of the three steps follows this list.
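
In R, the three steps condense into a few lines. The sketch below is only illustrative (the function name generate_correlated is ours, not part of base R); the next section walks through the same steps on a concrete example.

generate_correlated <- function(n, sigma) {
  p <- ncol(sigma)
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)  # Step 1: independent N(0, 1) draws
  L <- t(chol(sigma))                            # Step 2: sigma = L %*% t(L)
  x %*% t(L)                                     # Step 3: Y = X L^T
}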

For \Sigma to be a valid correlation matrix, it must be symmetric with ones on the diagonal and positive-definite. Positive-definiteness guarantees that the Cholesky decomposition exists and that the transformation above is well defined.
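
A quick way to verify this before decomposing is to inspect the eigenvalues: a symmetric matrix is positive-definite exactly when all of its eigenvalues are strictly positive. A minimal check (the helper name is ours):

# A symmetric matrix is positive-definite iff all eigenvalues are > 0
is_positive_definite <- function(m) {
  all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > 0)
}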

An Example

Let’s implement this in R. Our target correlation matrix defines the desired relationships between the variables. In this case, the pairwise correlations are 0.8 (between variables 1 and 2), 0.5 (between variables 1 and 3), and 0.3 (between variables 2 and 3).

rm(list=ls())
set.seed(1988)

# Generate X, independent standard normal variables
n <- 100  # Number of observations
p <- 3    # Number of variables
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Define Sigma, the target correlation matrix
sigma <- matrix(c(
  1.0, 0.8, 0.5,
  0.8, 1.0, 0.3,
  0.5, 0.3, 1.0
), nrow = p, byrow = TRUE)

# Cholesky decomposition
# chol() returns the upper triangular factor t(L); transpose it to get the lower triangular L
L <- t(chol(sigma))

Using our notation above we have:

    \begin{equation*}\Sigma = \begin{bmatrix}1.0 & 0.8 & 0.5 \\0.8 & 1.0 & 0.3 \\ 0.5 & 0.3 &1.0\end{bmatrix}. \end{equation*}

The chol function in R returns the upper triangular factor of the matrix, i.e. L^T; transposing it gives the lower triangular L. In our example:

    \begin{equation*}L^T = \begin{bmatrix}1 & 0.8 & 0.5 \\0 & 0.6 & -0.17 \\0 & 0.0 & 0.85 \end{bmatrix}. \end{equation*}
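
As a sanity check, the factor reproduces \Sigma up to floating-point error:

# Verify the factorization: L %*% t(L) should reproduce sigma
all.equal(L %*% t(L), sigma)  # TRUE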

Multiplying the independent variables X by the transpose of L ensures the output Y matches the specified correlation structure.

y <- x %*% t(L)
correlated_data <- as.data.frame(y)

The cor function computes the sample correlation of the generated data, which we can compare against the target matrix:

print(round(cor(correlated_data), 2))
print(sigma)
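
Because Y is a finite sample (here n = 100), its empirical correlation only approximates \Sigma; deviations of a few hundredths are expected and shrink as n grows. If the sample correlation must match \Sigma exactly, one option is to whiten X first so that its empirical correlation is exactly the identity. A minimal sketch, reusing x, sigma, and L from above (the whitening step is our addition, not part of the original code):

# Remove the sampling noise in x before applying the factor, so that
# the sample correlation of the output matches sigma exactly
x_centered <- scale(x, center = TRUE, scale = FALSE)
x_white <- x_centered %*% solve(chol(cov(x_centered)))  # empirical covariance = I
y_exact <- x_white %*% t(L)
print(round(cor(y_exact), 2))  # reproduces sigma up to rounding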

You can find the code in this GitHub repo.

Takeaways
  • Data practitioners often need to generate variables with a predefined correlation structure.
  • Cholesky decomposition offers a powerful and efficient way to achieve this.
