
Imagine a population made up of two subpopulations that, while distinct from each other, share a common characteristic that can be measured along some kind of scale. Furthermore, let’s assume that each subpopulation expresses this characteristic with its own frequency distribution. In other words, along the scale of measurement for the characteristic, the members of each subpopulation display varying levels of it. Now, we choose a specimen from the larger population in an unbiased manner and measure this characteristic for that individual. Are we justified in inferring the subpopulation membership of the specimen based on this measurement alone? Bayes’ rule (or theorem), something you may have heard about in this age of exploding data analytics, tells us that we can be so justified as long as we assign a probability (or degree of belief) to our inference.

The following discussion provides an interesting way of understanding the process for doing this. More importantly, I show how Bayes’ theorem helps us overcome a common thinking failure: making inferences from an incomplete treatment of all the information we should use. I’ll use a somewhat fanciful example to convey this understanding, and I’ll show the associated calculations in the R programming language.

Note: If you don't care about the R code at all and just want to follow the
basic reasoning, you can hide all the code at once by selecting Code/Hide All
Code from the drop down list in the top right corner of this page. If you want
to play around with the R code yourself, you can download the code by selecting
Code/Download Rmd.

Suppose we are aliens from another planet conducting scientific research on this strange group of bipedal organisms called humans. Humans are sexually dimorphic, presenting themselves as female and male genders. Each gender expresses its height across a distribution, and the means of the two gender populations are distinct. Years of collecting data about the humans reveal that the mean height for adult females is 5.33 bliks; for males, 5.83 bliks. (“Bliks” are our unit of distance, like feet or meters.) Both genders show similar bell-shaped (i.e., Normally distributed) variation around their mean heights, with a standard deviation of 0.2 bliks.

gender <- c("Female", "Male")
adult.sd <- 0.2 # Adult height standard deviation. Constant across gender.

Another piece of information that we have collected about the human population is the base rate frequency, or the proportion, of each subpopulation in the total population. These proportions provide some valuable information, particularly if the proportion of one subpopulation is larger than the others. For example, among the Zargons, chartreuse gills show up in 85% of the population compared to 15% for those with magenta gills. If we randomly select a member of the Zargon population for our zoo exhibit, we most likely will select one with chartreuse gills. In the language of Bayesian reasoning, we can call this base rate the prior probability (or degree of belief) that a randomly selected individual from the total population should be classified as a member of one category or another. In the generic notation, this would be Prob(Hypothesis), but in the context of our current situation, we might write Prob(Gender). Hold on to this concept, as we will return to it in just two shakes of a proton dislocator.

Unfortunately, a recent viral epidemic nearly wiped out the population of males, leaving females over-represented at 90% of the population, compared with the historically roughly even split between females and males.

# Set up a vector to contain the current Prob(Gender).
base.rate.freq.fem <- 0.9 # The female base rate frequency.
base.rate.freq <- c(base.rate.freq.fem, 1 - base.rate.freq.fem) # (females, males)

Imagine that on an exploratory mission to Earth, we discover an injured human to whom we intend to apply medical attention before we ship it off to our galactic zoo; however, due to its injuries and its extreme emotional state, we can’t determine its gender. The only data we can obtain is its height, which we observe to be 5.65 bliks. (Yes, we can travel across vast distances of space, but we have to use this crude means of measurement. Bear with me. This is just an example, not a sci-fi novel.) Based on this one measurement, we must infer the gender of the specimen before we proceed. How can we use the information we have on hand to make a rational inference?

# The measured height of our unidentified specimen.
# units = Bliks
specimen.height <- 5.65

First, we need to determine the probable height of any member of a gender based on our current characterization of the gender populations. Again, in the language of Bayesian reasoning, we call this the likelihood function, which tells us the probability of observing some kind of evidence, like a measurement, that is conditional on the selection of a given population or conditional on a given hypothesis being the case. We usually write this as Prob(Evidence|Hypothesis). In the context of our problem, we might write Prob(Height|Gender).

Using the dnorm(...) function, we can calculate the probability distribution across the height domain for each gender. Note that the dnorm(...) function produces the probability density at a given point.
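The distinction between a density and a probability is easy to miss, so here is a minimal sketch of it using only base R. The variable names (fem.mean, peak.density, and so on) are illustrative assumptions; the mean and standard deviation are the female values quoted above.

```r
# Female height parameters from the text (names here are illustrative).
fem.mean <- 5.33
adult.sd <- 0.2

# The density at the mean exceeds 1, so it cannot itself be a probability.
peak.density <- dnorm(fem.mean, mean = fem.mean, sd = adult.sd)

# A probability is an area under the density curve. Here: the probability
# that a female's height falls within 0.02 bliks of 5.65.
prob.near <- pnorm(5.67, mean = fem.mean, sd = adult.sd) -
  pnorm(5.63, mean = fem.mean, sd = adult.sd)

# For a narrow interval, density times interval width approximates the area.
approx.prob <- dnorm(5.65, mean = fem.mean, sd = adult.sd) * 0.04

# Over the height domain used in this discussion (4 to 7 bliks), the density
# integrates to essentially 1.
total.area <- integrate(dnorm, 4, 7, mean = fem.mean, sd = adult.sd)$value
```

Because the later steps only compare densities at the same height and then normalize them, working with densities rather than interval probabilities does not change the final inference.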

# Here we set up an index for the height domain, with a step size of
# 0.04 bliks.
step.size <- .04
height <- seq(4, 7, by = step.size)
# Calculate the height probability density for each gender across the
# height domain. In this case, the result will be an array of shape (2 x
# height). We apply the transpose function to this array to reorient it to
# (height x 2).
# Assumed reconstruction: dnorm(...) evaluated at h, with the mean heights
# from the text (female 5.33, male 5.83 bliks) written out as literals.
likelihood.height <- t(sapply(height, function(h)
  dnorm(h, mean = c(5.33, 5.83), sd = adult.sd)))

# Recast the array as a data frame in preparation for plotting in ggplot.
df.likelihood.height <- data.frame(height, likelihood.height)
colnames(df.likelihood.height) <- c("Height", gender)

If we plot this function, we observe the probability distribution across the height domain. Again, we also refer to this as the likelihood function.

# melt(...) comes from reshape2 and ggplot(...) from ggplot2.
library(reshape2)
library(ggplot2)
# Recast the rectangular data frame to a relational format such that the height
# index is the leftmost variable.
melt.df.likelihood.height <- melt(df.likelihood.height,
                                  id.vars = "Height",
                                  variable.name = "Gender",
                                  value.name = "Probability")
height.distr <- ggplot(melt.df.likelihood.height,
                       aes(
                         x = Height,
                         y = Probability,
                         group = Gender,
                         colour = Gender
                       )) +
  geom_line(aes(group = Gender, colour = Gender)) +
  geom_vline(xintercept = specimen.height) +
  xlab("Height [bliks]") +
  ylab("Probability Density") +
  ggtitle("Height Distribution of Human Adults", subtitle = "Prob(Height|Gender)")
print(height.distr)

It’s tempting to look at these distributions and conclude that the proper inference about the gender of our specimen would be the population with the greatest probability density at the specimen’s measured height. After all, the height of adult humans is one of the most easily observable and available aspects of this strange species that tends to cover up all the distinguishing dimorphic characteristics with something they call “clothes.” A quick reference to our intuition might tell us that height is a good measure for making an inference about gender. In this case, we measured the height of our specimen at 5.65 bliks (the vertical black line). Based on this height compared to the distributions, we might infer that the specimen is male, since the likelihood at 5.65 bliks is higher for males than for females.
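The comparison described above can be made concrete with a minimal sketch that evaluates the two unweighted likelihoods at the measured height. The names mean.height and raw.likelihood are assumptions introduced for illustration; the means and standard deviation are the values quoted earlier.

```r
specimen.height <- 5.65
adult.sd <- 0.2
# Mean adult heights in bliks (illustrative name, values from the text).
mean.height <- c(Female = 5.33, Male = 5.83)

# Prob(Height|Gender) evaluated at the specimen's height, one value per gender.
raw.likelihood <- dnorm(specimen.height, mean = mean.height, sd = adult.sd)
names(raw.likelihood) <- names(mean.height)

# How many times more likely this height is under "Male" than under "Female".
raw.ratio <- raw.likelihood[["Male"]] / raw.likelihood[["Female"]]

print(round(raw.likelihood, 2))
print(round(raw.ratio, 2))
```

The densities come out at roughly 0.55 for females and 1.33 for males, a ratio of about 2.4 in favor of male. That ratio is all the height measurement contributes on its own; it says nothing yet about how common each gender is.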

Unfortunately, this approach leaves out the information we have about the current prevalence (or the base rate frequency or prior degree of belief) of the genders in the population. Remember the example of the Zargons? We typically could infer the subpopulation based on the observation of a dominantly expressed characteristic. But in this case, we have two characteristics we could consider, and they seem to work against each other; that is, a randomly selected adult human after the viral post-apocalypse is now most likely female, but the height information we have seems to lead us to infer that we have a male. Can we combine both pieces of information to adjust the strength of our inclination in a reasonable manner? Fortunately, all that we have to do to include both pieces of information in our judgment is to weight the likelihood of height for a given population in proportion to the prevalence of that population. We can achieve this by multiplying the likelihood function by the base rate probability.

likelihood.height.gender <- t(base.rate.freq * t(likelihood.height))
# Recast the array as a data frame in preparation for plotting in ggplot.
df.likelihood.height.gender <- data.frame(height, likelihood.height.gender)
colnames(df.likelihood.height.gender) <- c("Height", gender)
# Recast the rectangular data frame to a relational format such that the height
# index is the leftmost variable.
melt.df.likelihood.height.gender <- melt(df.likelihood.height.gender,
                                         id.vars = "Height",
                                         variable.name = "Gender",
                                         value.name = "Probability")
height.distr.gender <- ggplot(melt.df.likelihood.height.gender,
                              aes(
                                x = Height,
                                y = Probability,
                                group = Gender,
                                colour = Gender
                              )) +
  geom_line(aes(group = Gender, colour = Gender)) +
  geom_vline(xintercept = specimen.height) +
  xlab("Height [bliks]") +
  ylab("Probability Density") +
  ggtitle("Probability-Weighted Height Distribution of Human Adults", subtitle = "Prob(Height|Gender) * Prob(Gender)")
print(height.distr.gender)

Now we observe that the probability-weighted height distributions most likely ought to change our inference about the gender of our specimen. All we need to do now is find the conditional likelihoods at the measured height and normalize them with the marginal probability at that height, where the marginal probability is just the sum of the conditional likelihoods across genders at the measured height.

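# Prior-weighted likelihood at the measured height. Assumed reconstruction:
# dnorm(...) with the mean heights from the text (female 5.33, male 5.83).
conditional.likelihood <-
  base.rate.freq * dnorm(specimen.height, mean = c(5.33, 5.83), sd = adult.sd)
# cast into an array.
conditional.likelihood <- array(conditional.likelihood, dim = c(1, 2))
colnames(conditional.likelihood) <- gender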

The conditional likelihoods for each gender at the measured height are:

Female   Male
0.50   0.13

# The marginal probability is the total probability of the product of the prior
# and the likelihood at the specified measured observation.
marginal.prob <- sum(conditional.likelihood)

The marginal probability, which is just the sum of the two values in our conditional likelihood, is:

 0.63

Now we divide the conditional likelihood values by the marginal probability value.

# The probability (rational inference) of gender associated with sampled height.
posterior.prob.gender <- conditional.likelihood / marginal.prob

We should now accept the following as our updated belief (i.e., posterior probability) about the gender of our specimen:

Female   Male
0.79   0.21 

Our final observation is that Female should be our rational inference for the gender of our specimen, based on the prior-weighted likelihood of height as a function of gender.

In short, what we calculated was

prob(Gender|Height) = prob(Gender) * prob(Height|Gender) / (marginal probability)

where the marginal probability is sum(prob(Gender) * prob(Height|Gender)), with the sum taken over both genders. Then we selected the gender with the highest probability.

The term sum(prob(Gender) * prob(Height|Gender)) is a normalizing factor. Once we divide by it, the resulting posterior probabilities sum to 1.
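The whole calculation fits in a few lines. The following is a minimal sketch, not the code used above: the helper name posterior is hypothetical, and the prior and likelihood inputs are the values from this example.

```r
# Hypothetical helper: posterior probabilities over a set of hypotheses,
# given a prior and the likelihood of the observed evidence under each.
posterior <- function(prior, likelihood) {
  joint <- prior * likelihood  # Prob(Gender) * Prob(Height|Gender)
  joint / sum(joint)           # divide by the marginal (normalizing factor)
}

prior <- c(Female = 0.9, Male = 0.1)
likelihood <- dnorm(5.65, mean = c(5.33, 5.83), sd = 0.2)

post <- posterior(prior, likelihood)
print(round(post, 2))

# The rational inference is the hypothesis with the largest posterior.
inferred.gender <- names(which.max(post))
```

Keeping the normalization inside one small function makes it easy to swap in a different prior or a different measured height without repeating the arithmetic.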

With regard to our example, this exercise has shown us that tall females among the population of all adult females are somewhat rare. However, tall females among the entire post-apocalyptic population of adult humans (in which females are the most prevalent subpopulation) are more prevalent than just-below-average-height males, even though such males would be more prevalent than tall females if the subpopulations were more equal in size. Therefore, finding an individual of this height should lead us to infer, given no other information about them, that the individual is most likely female. Combining both pieces of information allows us to temper an intuition that would have been based on only one of the available pieces of information.
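To see how strongly the base rate drives this conclusion, the sketch below repeats the calculation under two female base rates: the historical, roughly even split and the post-epidemic 90%. The names prior.fem and post.fem are illustrative assumptions.

```r
specimen.height <- 5.65
# Unweighted likelihoods Prob(Height|Gender) at the measured height,
# ordered (female, male), using the means and sd quoted in the text.
likelihood <- dnorm(specimen.height, mean = c(5.33, 5.83), sd = 0.2)

# Two female base rates to compare.
prior.fem <- c(even = 0.5, post.epidemic = 0.9)

# Posterior probability of "Female" under each base rate.
post.fem <- sapply(prior.fem, function(p) {
  joint <- c(p, 1 - p) * likelihood  # prior-weighted likelihoods
  joint[1] / sum(joint)              # normalize; keep the female entry
})
print(round(post.fem, 2))
```

With an even split the posterior probability of female is about 0.29, so the height evidence alone would point to male; at a 90% female base rate it rises to about 0.79 and the inference flips.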

An interesting question that might arise from these insights is: what base rate frequency of females and males would leave us completely ambiguous about the gender of the specimen, given the measured height and the height likelihood function of each gender? The answer is simple. All we have to do is recognize that

base.rate.freq.fem * Female_Likelihood|specimen.height = (1 - base.rate.freq.fem) * Male_Likelihood|specimen.height

and then solve for base.rate.freq.fem. Let’s call this special base rate frequency indif.base.rate.freq.fem. Rearranging we get

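likelihood.ratio = Male_Likelihood|specimen.height / Female_Likelihood|specimen.height
indif.base.rate.freq.fem = likelihood.ratio / (1 + likelihood.ratio)

# Recast the example equations as R statements. Assumed reconstruction: the
# likelihoods use dnorm(...) with the means from the text (male 5.83, female 5.33).
likelihood.ratio <- dnorm(specimen.height, 5.83, adult.sd) / dnorm(specimen.height, 5.33, adult.sd)
indif.base.rate.freq.fem <- likelihood.ratio / (1 + likelihood.ratio)

The resulting indifference base rate frequencies (the female value, and the male value as its complement) are:

Female   Male
0.71   0.29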
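As a quick check on the algebra, the sketch below plugs the indifference base rate back into Bayes' rule and confirms that the posterior comes out at 50/50. The names lik.fem, lik.male, and posterior.at.indif are assumptions introduced for illustration.

```r
specimen.height <- 5.65
adult.sd <- 0.2

# Unweighted likelihoods at the measured height (means from the text).
lik.fem  <- dnorm(specimen.height, mean = 5.33, sd = adult.sd)
lik.male <- dnorm(specimen.height, mean = 5.83, sd = adult.sd)

# Solve f * lik.fem = (1 - f) * lik.male for the female base rate f:
#   f = lik.male / (lik.fem + lik.male) = ratio / (1 + ratio)
likelihood.ratio <- lik.male / lik.fem
indif.base.rate.freq.fem <- likelihood.ratio / (1 + likelihood.ratio)

# Posterior under that base rate: both genders should be equally probable.
joint <- c(Female = indif.base.rate.freq.fem * lik.fem,
           Male   = (1 - indif.base.rate.freq.fem) * lik.male)
posterior.at.indif <- joint / sum(joint)
print(round(posterior.at.indif, 2))
```

Any female base rate above roughly 0.71 tips the inference toward female for a specimen of this height, and any base rate below it tips the inference toward male.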