Background
Stratified sampling is a foundational technique in survey design, ensuring that observations capture key characteristics of a population. By dividing the data into distinct strata and sampling from each, stratified sampling often results in more efficient estimates than simple random sampling. Strata are typically defined by categorical variables such as classrooms, villages, or user types. This method is particularly advantageous when some strata are rare but carry critical information, as it ensures their representation in the sample. It is also often employed to tackle spillover effects or manage survey costs more effectively.
While straightforward for categorical variables (like geographic region), continuous variables—such as income or churn score—pose greater challenges for stratified sampling. The primary issue lies in the curse of dimensionality: attempting to create strata across multiple continuous variables results in an explosion of possible combinations, making effective sampling impractical. For example, stratifying a population based on income at every possible dollar amount is absurd.
In this article, I present two solutions to the problem of stratified sampling with continuous variables.
Diving Deeper
The Traditional Method: Equal-Sized Binning
This approach involves dividing the continuous variable(s) into intervals or bins. For example, churn score, a single continuous variable, can be divided into quantiles (e.g., quartiles or deciles), ensuring each bin contains approximately the same number of observations/users.
Let’s focus on the case of building ten equally-sized strata. Mathematically, for a continuous variable $X$, the decile-based binning can be defined as:

$$S_j = \{\, x : q_{j-1} \le x < q_j \,\}, \qquad j = 1, \dots, 10,$$

where $q_j$ represents the $j$-th decile of $X$ (with $q_0 = \min(X)$, and the last stratum closed so that $x \le q_{10} = \max(X)$). This approach groups the first ten percentiles of $X$ (i.e., minimum value to the 10th percentile) into a single stratum, the next ten percentiles (10th to 20th) into another stratum, and so on.
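As a quick toy sketch (on simulated data, separate from the iris example later), the decile binning above can be implemented directly with `quantile` and `cut`:

```r
# Decile binning: ten strata with (approximately) equal counts
set.seed(123)
x <- rnorm(1000)
strata <- cut(x,
              breaks = quantile(x, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = FALSE)
table(strata)  # ~100 observations per stratum
```

With 1,000 distinct values, each of the ten strata ends up with roughly 100 observations, mirroring the equal-sized construction above.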
When dealing with multiple variables, this method extends to either marginally stratify each variable or jointly stratify them. However, joint stratification across multiple variables can also fall prey to the curse of dimensionality.
The Modern Method: Unsupervised Clustering
An alternative approach uses unsupervised clustering algorithms, such as $k$-means or hierarchical clustering, to group observations into clusters, treating these clusters as strata. Unlike binning, clustering leverages the distribution of the data to form natural groupings.
Formally, let $X$ be an $n \times p$ matrix of $n$ observations across $p$ continuous variables. One class of clustering algorithms aims to assign each observation $x_i$ to one of $K$ clusters $C_1, \dots, C_K$ by minimizing the within-cluster distance to the cluster centers:

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{i \in C_k} d(x_i, \mu_k),$$

where $\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$ is the centroid of cluster $C_k$ of size $|C_k|$.

Commonly, $d$ is the squared Euclidean distance, $d(x_i, \mu_k) = \lVert x_i - \mu_k \rVert^2$, which leads to $k$-means clustering. Unlike in the binning approach, here we are not restricting each stratum to have the same number of observations.
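As a sanity check (a toy sketch on simulated data, not part of the later example), R's `kmeans` reports exactly this objective: its `tot.withinss` component equals the sum of squared distances from each point to its cluster centroid:

```r
# Verify the k-means objective numerically on random 2-D data
set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)
km <- kmeans(X, centers = 3, nstart = 10)

# Sum of squared distances of each observation to its centroid
wss <- sum((X - km$centers[km$cluster, ])^2)
all.equal(wss, km$tot.withinss)  # TRUE
```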
Pros and Cons
Unsurprisingly, each method comes with its trade-offs. Traditional binning is simple and interpretable but can struggle with multivariate dependencies. Clustering accounts for multivariate relationships between variables, avoids imposing arbitrary bin thresholds, and may result in more natural groupings. However, it can be computationally expensive and sensitive to the choice of algorithm and hyperparameters (e.g., the number of clusters $K$ in $k$-means).
One can also imagine a hybrid approach. Begin with a dimensionality reduction method like PCA and then perform binning on the first few principal components.
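A minimal sketch of this hybrid idea on the iris data (my own illustration, not a prescription): project the four continuous variables onto principal components, then quantile-bin the first one.

```r
# Hybrid approach: PCA for dimensionality reduction, then quantile binning
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
pc1 <- pca$x[, 1]  # scores on the first principal component

iris$PC1Bin <- cut(pc1,
                   breaks = quantile(pc1, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)
table(iris$PC1Bin)  # four strata of roughly equal size
```

This keeps the interpretable, equal-sized structure of binning while letting a single stratification variable summarize several correlated inputs.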
An Example
Here is R code illustrating both types of approaches on the popular iris
dataset. We are interested in creating strata based on the Sepal.Length
variable. We begin with the traditional binning approach.
rm(list=ls())
set.seed(42)
data(iris)
# Divide the continuous variable "Sepal.Length" into 4 quantile bins
iris$SepalLengthBin <- cut(iris$Sepal.Length,
                           breaks = quantile(iris$Sepal.Length, probs = seq(0, 1, 0.25)),
                           include.lowest = TRUE)
# Inspect the resulting strata
table(iris$SepalLengthBin)
We split the dataset into four roughly equal parts based on the Sepal.Length
values, each with about 38 observations.
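With the strata in hand, the actual stratified sample can be drawn. A minimal sketch with equal allocation, 5 observations per stratum (the binning step is repeated so the snippet is self-contained; the allocation size is my own choice for illustration):

```r
set.seed(42)
data(iris)
# Recreate the quartile-based strata on Sepal.Length
iris$SepalLengthBin <- cut(iris$Sepal.Length,
                           breaks = quantile(iris$Sepal.Length, probs = seq(0, 1, 0.25)),
                           include.lowest = TRUE)

# Equal allocation: draw 5 observations from each stratum
sampled <- do.call(rbind, lapply(split(iris, iris$SepalLengthBin),
                                 function(s) s[sample(nrow(s), 5), ]))
table(sampled$SepalLengthBin)  # 5 per stratum, 20 in total
```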
Let’s now turn to k-means clustering.
# Perform k-means clustering on two continuous variables
iris_cluster <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 4)
# Assign clusters as strata
iris$Cluster <- as.factor(iris_cluster$cluster)
# Inspect the resulting strata
table(iris$Cluster)
Here we also have four clusters, but their size ranges from 25 to 50 observations each.
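Since the clusters differ in size, a natural follow-up (again a sketch with an illustrative 20% rate, self-contained) is proportional allocation, so larger clusters contribute more observations to the sample:

```r
set.seed(42)
data(iris)
# Recreate the cluster-based strata
km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 4)
iris$Cluster <- as.factor(km$cluster)

# Proportional allocation: sample 20% within each cluster
sampled <- do.call(rbind, lapply(split(iris, iris$Cluster),
                                 function(s) s[sample(nrow(s), ceiling(0.2 * nrow(s))), ]))
table(sampled$Cluster)
```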
You can find the code in this GitHub repo.
Takeaways
- Stratified sampling with continuous variables requires balancing simplicity and sophistication.
- Traditional binning remains a practical choice for single variables or marginal stratification.
- Clustering provides a robust alternative for capturing multivariate structures.
- Understanding the strengths and limitations of each approach allows you to design more effective, tailored sampling strategies.