Introduction
When exploring dependencies between variables, data scientists often rely on correlation measures to reveal relationships and potential patterns. But what if we’re looking to capture relationships beyond linear correlations? Mutual Information (MI) quantifies the “amount of information” shared between variables in a more general sense. It measures how much we learn about one variable by observing another. This approach goes beyond linear patterns and can help us uncover more complex relationships within data.
MI can be especially helpful for applications such as feature selection and unsupervised learning. For readers familiar with various types of correlation metrics (as discussed in my earlier post), Mutual Information provides an additional lens to interpret relationships within data.
In this article, I’ll guide you through the concept of Mutual Information, its definition, properties, and usage, as well as comparisons with other dependency measures. We’ll explore MI’s mathematical foundations, its advantages and limitations, and its applications, concluding with an example demonstrating MI’s behavior alongside more traditional measures like Pearson’s correlation.
Notation
Before diving deeper, let’s establish our notation:
- Random variables will be denoted by capital letters (e.g., $X$, $Y$, $Z$).
- Lowercase letters (e.g., $x$, $y$, $z$) represent specific values of these variables.
- $p(x)$ denotes the probability mass/density function of $X$.
- $p(x, y)$ represents the joint probability mass/density function of $X$ and $Y$.
- $p(x \mid y)$ is the conditional probability of $X$ given $Y$.
- $H(X)$ represents the entropy of the random variable $X$.
These should not surprise anyone, as they are standard conventions in most statistics textbooks.
Diving Deeper
Refresher on Entropy
Mutual Information is related to the notion of entropy, a measure of uncertainty or randomness in a random variable $X$. In less formal terms, entropy quantifies how “surprising” or “unpredictable” the outcomes of $X$ are.
Formally, for a discrete random variable $X$ with possible outcomes $x_1, \dots, x_n$ and associated probabilities $p(x_1), \dots, p(x_n)$, entropy is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log p(x_i).$$

For continuous variables, we replace the summation with an integral.

Entropy equals zero when there is no uncertainty, such as when $X$ always takes a single value. High entropy means greater uncertainty (many possible outcomes, all equally likely), while low entropy indicates less uncertainty, as some outcomes are much more likely than others. In essence, entropy tells us how much “information” is gained on average when observing the variable’s realization.
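As a quick illustration, here is a small R sketch (the helper function and example probabilities are my own) that computes the entropy of a few simple distributions:
# Entropy (in bits) of a discrete distribution, given its probability vector
entropy_from_probs <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

entropy_from_probs(c(0.5, 0.5))    # fair coin: 1 bit, the maximum for two outcomes
entropy_from_probs(c(0.99, 0.01))  # heavily biased coin: about 0.08 bits
entropy_from_probs(c(1, 0))        # deterministic outcome: 0 bits, no uncertainty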
Mathematical Definitions of MI
MI can be defined in multiple ways. Perhaps the most intuitive definition of MI between two random variables $X$ and $Y$ is that it measures the difference between their joint distribution and the product of their marginals. If two random variables are independent, their joint distribution is the product of their marginals: $p(x, y) = p(x)\,p(y)$. In some sense, the discrepancy between these two objects (i.e., the two sides of the equality) measures the strength of association between $X$ and $Y$ (i.e., their departure from independence).
Formally, MI is the Kullback-Leibler divergence between the joint distribution and the product of the two marginals:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big).$$
Let’s now examine MI from a different angle. We can express mutual information as follows:

$$I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}.$$
Again, for continuous variables, the sums become integrals. This is a more standard definition, since it does not rely on the notion of Kullback-Leibler divergence.
Alternatively, MI can also be expressed in terms of entropy: it equals the sum of the entropies of $X$ and $Y$ minus their joint entropy:

$$I(X; Y) = H(X) + H(Y) - H(X, Y).$$
You can compute MI in R with the infotheo package, as shown in the short sketch below.
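Here is a minimal sketch (the simulated data below are my own) of how this looks in practice. Note that infotheo works on discrete data, so continuous variables must be discretized first:
library(infotheo)

set.seed(42)
x <- rnorm(1000)
y <- x^2 + rnorm(1000, sd = 0.1)   # purely nonlinear (quadratic) relationship

cor(x, y)                          # near zero: Pearson correlation misses it
d <- discretize(data.frame(x, y))  # infotheo expects discrete inputs
mutinformation(d$x, d$y)           # clearly positive: MI detects the dependency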
Properties
MI has several intriguing properties:
- Non-negativity: $I(X; Y) \ge 0$. Mutual Information is always non-negative, as it measures the amount of information one variable provides about the other. Higher values correspond to stronger association.
- Symmetry: $I(X; Y) = I(Y; X)$. This symmetry implies that the information $X$ provides about $Y$ is the same as what $Y$ provides about $X$.
- Independence: Similarly to Chatterjee’s correlation coefficient, $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.
- Scale-invariance: Mutual Information is scale-invariant. Applying a scaling transformation to the variables does not affect their MI (a quick empirical check follows this list).
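As a quick empirical check of the scale-invariance property (the simulated data are my own; infotheo’s default equal-frequency binning preserves bin memberships under rescaling):
library(infotheo)

set.seed(1)
x <- rnorm(500)
y <- x + rnorm(500)

mi_original <- mutinformation(discretize(data.frame(x, y)))[1, 2]
mi_rescaled <- mutinformation(discretize(data.frame(x = 10 * x, y = y / 3)))[1, 2]
all.equal(mi_original, mi_rescaled)  # TRUE: rescaling leaves MI unchanged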
Conditional MI
Conditional Mutual Information (CMI) extends the concept of MI to measure the dependency between $X$ and $Y$ given a third variable $Z$. CMI is useful for investigating how much information $X$ and $Y$ share independently of $Z$. It is defined as:

$$I(X; Y \mid Z) = \sum_{z} p(z) \sum_{x}\sum_{y} p(x, y \mid z)\,\log\frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}.$$
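For intuition, here is a small R sketch (with simulated data of my own) using infotheo’s condinformation() function. A common cause $Z$ drives both $X$ and $Y$, so their unconditional MI is substantial while their MI conditioned on $Z$ is much smaller:
library(infotheo)

set.seed(7)
z <- rnorm(2000)
x <- z + rnorm(2000, sd = 0.3)   # X and Y are both driven by the common cause Z
y <- z + rnorm(2000, sd = 0.3)

d <- discretize(data.frame(x, y, z))
mutinformation(d$x, d$y)         # substantial: X and Y look strongly dependent
condinformation(d$x, d$y, d$z)   # much smaller: conditioning on Z removes most of it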
This can be valuable in causal inference, where understanding dependencies conditioned on specific variables aids in interpreting relationships within complex models. CMI is also particularly useful for feature selection when accounting for redundancy among already included features. Let’s explore this idea in greater detail.
Feature Selection with MI
In machine learning, MI serves as a useful metric for feature selection (Brown et al. 2012, Vergara and Estévez 2014). Consider an outcome variable $Y$ and a set of features $X_1, \dots, X_p$ with i.i.d. observations. By evaluating the MI between each feature and the target variable, one can retain the features with the highest information content, i.e., those passing a certain threshold. This is somewhat primitive, since it assumes independence across features. More sophisticated approaches take feature dependency into account.
For instance, methods like Minimum Redundancy Maximum Relevance (mRMR, Peng et al. 2005) aim to maximize the relevance of features to the target while minimizing redundancy among features. Here is a concise version of the mRMR algorithm:
- Calculate MI between each feature and the target (relevance).
- Calculate MI between features (redundancy).
- Select, at each step, the candidate feature that maximizes relevance while minimizing redundancy:

$$\arg\max_{X_j \notin S}\left[ I(X_j; Y) - \frac{1}{|S|}\sum_{X_k \in S} I(X_j; X_k) \right],$$

where $S$ is the set of already selected features. A minimal implementation is sketched right after this list.
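Below is a minimal greedy sketch of this procedure in R; the mrmr_select() helper is my own illustration (not part of infotheo) and assumes the features have already been discretized:
library(infotheo)

# Greedy mRMR: select k features from the discretized data frame X for target y
mrmr_select <- function(X, y, k) {
  relevance <- sapply(X, function(f) mutinformation(f, y))
  selected  <- names(which.max(relevance))   # start with the most relevant feature
  while (length(selected) < k) {
    remaining <- setdiff(names(X), selected)
    scores <- sapply(remaining, function(f) {
      redundancy <- mean(sapply(selected, function(s) mutinformation(X[[f]], X[[s]])))
      relevance[[f]] - redundancy             # relevance minus average redundancy
    })
    selected <- c(selected, names(which.max(scores)))
  }
  selected
}

# Example: rank the iris measurements for predicting Species
d <- discretize(iris[, 1:4])
mrmr_select(d, iris$Species, k = 2)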
Pros and Cons
Like any other statistical tool, Mutual Information has several advantages and limitations. On the positive side, it captures both linear and nonlinear relationships, is scale-invariant, and works with both continuous and discrete variables, making it a theoretically well-founded measure. However, it requires density estimation for continuous variables, can be computationally intensive for large datasets, and its results can be sensitive to binning choices for continuous variables. Additionally, there is no standard normalization for MI.
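To see the binning sensitivity concretely (the simulated data and bin counts below are my own choices), note how the empirical MI estimate shifts as the number of bins changes:
library(infotheo)

set.seed(3)
x <- rnorm(1000)
y <- x + rnorm(1000)

# The same data, discretized with 3, 10, and 30 equal-frequency bins
sapply(c(3, 10, 30), function(b) {
  d <- discretize(data.frame(x, y), nbins = b)
  mutinformation(d$x, d$y)
})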
An Example
Let’s implement the MI calculation in R and compare it with traditional correlation measures using the iris dataset. I used ChatGPT to generate the code below.
library(infotheo)
library(corrplot)
library(dplyr)
data(iris)
# Calculate mutual information matrix
mi_matrix <- mutinformation(discretize(iris[,1:4]))
# Calculate correlation matrix
cor_matrix <- cor(iris[,1:4])
# Compare MI vs Correlation for Sepal.Length and Petal.Length
mi_value <- mi_matrix[1,3]
cor_value <- cor_matrix[1,3]
print(paste("Mutual Information:", round(mi_value, 3)))
print(paste("Pearson Correlation:", round(cor_value, 3)))
Running the code prints the following results:
[1] "Mutual Information: 0.585"
[1] "Pearson Correlation: 0.872"
The left matrix displays the MI results and the right one shows the standard (Pearson) correlation values. The default scales are different, so one should compare the values and not the colors. Indeed, the variable pairs with negative linear correlation also have the lowest MI values.
This example demonstrates how MI can capture nonlinear relationships that might be missed by traditional correlation measures.
You can download the code from this GitHub repo.
Bottom Line
- Mutual Information provides a comprehensive measure of statistical dependence, capturing both linear and nonlinear relationships.
- Unlike correlation coefficients, MI works naturally with both continuous and categorical variables.
- MI serves as the foundation for sophisticated feature selection algorithms like mRMR.
Where to Learn More
Wikipedia is a great place to start and learn the basics. Brown et al. (2012) and Vergara and Estévez (2014) are the go-to resources for conditional MI and for using MI for feature selection.
References
- Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR.
- Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Wiley-Interscience.
- Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6).
- Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE TPAMI.
- Ross, B. C. (2014). Mutual information between discrete and continuous data sets. PloS one, 9(2).
- Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1).