Probabilistic programming primer

Part I — Probability, Likelihood and Beta functions

Seydou Dia
OLX Engineering

--

The goal of this series is to discover what probabilistic programming is and how it can be used to solve real-world and business challenges. We will see its applications in analytics, inference, and data-driven product experimentation.

Probabilistic programming refers to programs that fit probability models, where the primitives of the language can themselves be stochastic, e.g. probability distributions. Thanks to those primitives we are able to express programs that deal with uncertainty and information.

This post is the first of the series; it introduces probability distributions and likelihood.

Why Probabilistic Programming anyway?

If you want to know why probabilistic programming is exciting, let me point you to this post by Beau Cronin; my favourite quote from his post is,

[…] it’s clearly cool, but why does it matter?

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories — thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

The first time I read this, I was definitely excited about the subject and promised myself to get up to speed. So if, like me, you are excited and want to learn more, I hope this series will give you the intuition about the mechanism behind probabilistic programming, as well as some useful applications.

1. Before you see the data

In this first section we will talk about the concept of probability. For our purpose we define probability as a way to quantify our degree of belief in some event. For example, we might think that the probability it rains tomorrow is 60%, or 0.6. A probability is always a number between 0 and 1. Moreover, the sum of the probabilities over all possible outcomes is always 1, or 100%. If the probability it rains tomorrow is 60%, then we imply that the probability of it not raining is 40%.

The same applies when there are more than two outcomes. For example, there are three possible outcomes in a football game for a given team: draw / lose / win. If we were asked about our belief in a match we could say something like: 10% chance to win, 35% chance to lose and, by deduction, 55% chance of a draw.

As John said: "Without the data… just imagine!" (source: J. Beck)

A probability distribution combines the possible outcomes and the belief we put on them in one view. If we take the rain example (this time with a 35% chance of rain), the probability distribution looks like this:

There is a 35% chance that it rains tomorrow and therefore a 65% chance that it won't.

The probability distribution for the football match example looks like this:

Football example with a 10% chance to win, 35% to lose, and 55% for a draw.

Let's look at another example: say I have a fair die. If I am interested in the probability of getting a specific number, the probability distribution looks like this:

Fair die: the probability of each number is uniform: 1/6 ≈ 16.7%

What we can read from the previous distribution is that the numbers from 1 to 6 all have the same probability of occurrence, which is expected since we are dealing with a fair die after all.

Continuous distribution

So far we've seen examples of distributions for discrete events: YES / NO for tomorrow's rain, [1, 2, 3, 4, 5, 6] for the die, and three possible outcomes for the football game. However, distributions also apply to continuous values; for example, I might want to predict tomorrow's temperature, in which case a continuous distribution is convenient.

Positive thinking: the most probable temperature for tomorrow is 25ºC.

This is quite similar to what we have seen so far; the only difference is that the outcome we try to predict, the temperature, is continuous: it can take values like 25.57ºC.

Previously we saw that the probabilities add up to 1 (or 100%), so how do we do that in the continuous case? The answer is in calculus: the construct that allows us to “sum” a continuous function is called the integral.

So returning to my example, this is how I express the distribution summing up to 1:
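
Writing T for tomorrow's temperature, the requirement reads:

    \int_{-\infty}^{+\infty} p(T)\, dT = 1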

In calculus, functions are the constructs used to represent continuous values; in the formula above I've used a function named p to represent the probability over the continuous range of temperatures.

OK, it looks scary, but bear with me and don't close that tab just yet! Integrals are not really relevant for our purpose; I just mention them for the sake of completeness.

Before closing this section let me share the most famous probability distribution of all time: the Normal distribution.

The almighty Normal distribution in its standard form: centered at 0 with a standard deviation of 1.
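
For reference, its density in this standard form can be written as:

    p(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^{2}/2}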

Beta functions

In the previous section we saw that functions are used to represent a continuous distribution. In this section we will focus on a special family of functions called Beta functions (strictly speaking, the density functions of the Beta distribution).

A function from the Beta family can be used to model any continuous variable whose values go from 0 to 1. There are many things that take values between 0 and 1: proportions, percentages, conversion rates, probabilities themselves.

A Beta function has two parameters; let's call them a and b. The only thing you need to know is that changing a and b changes the shape of the Beta function. This is what makes the Beta function so powerful: by tweaking a and b you can design a function that closely matches your model of belief.
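
For the curious, the density of a Beta function with parameters a and b has the following form, where B(a, b) is just a normalising constant that makes the total probability equal to 1:

    p(x) = \frac{x^{\,a-1}\,(1-x)^{\,b-1}}{B(a, b)}, \qquad 0 \le x \le 1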

Let's say I want a probability distribution for a conversion rate of 0.5, or equivalently a fair coin (50% chance of getting heads); then the Beta function shown in the next figure is a very good candidate.

With a=b=2, the Beta function is centered around 50%.

In a similar way, if I want to model a conversion rate of 20% I can use the following:

With a=2 and b=5, the Beta function is centered around 20%.

Finally, if I have no idea what the most probable conversion rate is, then from my perspective all values are equally probable; I will go with the uniform Beta function, by setting a=b=1.

With a=b=1, the Beta function is uniform.

So, putting everything together, the following plots sum up the shape of the Beta function for various values of a and b:

Beta Art concept.
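
If you want to reproduce this kind of plot yourself, a minimal sketch with scipy and matplotlib could look like this (the (a, b) pairs are just examples):

    # Minimal sketch: plot the Beta density for a few (a, b) pairs.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import beta

    x = np.linspace(0, 1, 200)
    for a, b in [(1, 1), (2, 2), (2, 5), (5, 2), (10, 10)]:
        plt.plot(x, beta.pdf(x, a, b), label=f"a={a}, b={b}")
    plt.xlabel("conversion rate")
    plt.ylabel("density")
    plt.legend()
    plt.show()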

The key takeaway of this section is that Beta functions are very useful for building probability distributions for continuous variables in the 0 to 1 range. Another amazing property of Beta functions is that they help simplify computation. Remember those scary integrals shown earlier? Well, thanks to Beta functions we can get rid of them.

Conclusion

In this section we discussed:

  • probability and probability distributions as a way to model our beliefs about events,
  • the Beta function as a tool to model any continuous variable in the 0 to 1 range.

So far we have dealt with probability, and uncertainty in general; but at some point what you tried to predict actually happens (or not), and your predictions face reality; in other words, you see the data! Now you need to capture the information gained from the observed data; that is the goal of the next section.

2. Let there be data!

The second construct we'll see is called likelihood. Basically, if probability is how we express belief about future events, then likelihood captures information from new data once those events actually take place. We use probability before we see the data and likelihood after.

Let's say that before seeing the data we have a belief about a set of outcomes (a probability distribution); then, after the data has been seen, likelihood is the construct that allows us to tell which of our initial beliefs best explains the data we are seeing.

Likelihoods are inherently comparative; only by comparing likelihoods do they become interpretable. Unlike probabilities, which provide an absolute measure of belief, likelihood is a relative measure. Let's see likelihood in action.

Say we have a coin, and we think it is a fair coin, meaning the probability of heads coming up is 50%.

Now we flip the coin 10 times and say we get the following sequence: H-T-H-H-T-H-H-T-H-H, so 7 heads and 3 tails. The likelihood for that data is a classic of statistics; it comes from the Binomial distribution, and for our dataset it looks like this:
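
    L(p) = p^{h} \, (1 - p)^{t}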

where:

  • p is the probability of getting heads
  • h is the number of heads and t the number of tails (in our case h=7 and t=3)

It's important to see that this formula is very simple: it's just the multiplication of the probability of occurrence of each data point, p for the heads and 1-p for the tails.

If we replace h and t by what we get in our dataset, then the formula is:
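
    L(p) = p^{7} \, (1 - p)^{3}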

and the plot of the likelihood looks like:

Binomial likelihood for 7 successes out of 10 trials

The graph above can be interpreted as follows: according to the data, the most likely value for p is 0.7 (among all the possibilities), which makes sense as we have 7 heads out of 10 trials.
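
As a quick sanity check, here is a small illustrative sketch that evaluates this likelihood on a grid of candidate values for p and confirms that the maximum sits at 0.7:

    import numpy as np

    p = np.linspace(0, 1, 1001)      # candidate values for the probability of heads
    likelihood = p**7 * (1 - p)**3   # L(p) = p^h (1 - p)^t with h=7, t=3
    best = p[np.argmax(likelihood)]  # value of p with the highest likelihood
    print(f"best p = {best:.2f}")    # best p = 0.70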

Conclusion

In this section we have introduced the concept of likelihood. Likelihood can be seen as a construct to capture information from data.

An important point to remember is that likelihood is inherently comparative. Contrary to probability, the absolute value of a likelihood does not make sense per se; it only becomes meaningful when used to compare assumptions. Given two hypotheses, the one with the higher likelihood better explains the observed data. Maximum likelihood is a major concept in classical statistics.

3. When the facts change, I change my mind…

In the previous sections we've seen that, before seeing any data, we put beliefs on events using a probability distribution. Having seen the data, we use likelihood to capture the new information gained from it. A natural thing to do once you have new information is to update your initial belief; that is achieved simply by multiplying the probability distribution by the likelihood.

"Show me the data" (source: A. Chambers)

Going back to the coin example: before tossing the coin, our assumption was that the coin is fair. Let's use a Beta function centered at 0.5 to model that belief:

The Beta distribution centered at 0.5 can be used to model our belief in a fair coin.

Let's toss the coin 10 times; say we get 7 heads out of 10 tosses, which results in the following likelihood:

The multiplication of a probability distribution and a likelihood is not as easy as it sounds; it can actually be quite an intensive operation involving calculus and integrals.

But it turns out the Binomial likelihood and the Beta function play really nicely together from a computational point of view. In fact, the result of that multiplication gives back a Beta distribution, which frees us from a lot of hassle, like having to deal with calculus and, God forbid, integrals.

So, coming back to our main subject: how do we update our initial belief in the face of new information? Our initial belief is captured in the probability distribution: a Beta function centered around 0.5. And the new information is in the Binomial likelihood. The multiplication of both looks like this:
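
In the same notation as before (and dropping constant factors):

    \underbrace{p^{\,a-1}(1-p)^{\,b-1}}_{\text{Beta prior}} \;\times\; \underbrace{p^{\,h}(1-p)^{\,t}}_{\text{Binomial likelihood}} \;\propto\; p^{\,a+h-1}(1-p)^{\,b+t-1}

With our numbers (a = b = 2, h = 7, t = 3), the right-hand side is again a Beta function, this time with parameters 9 and 5, whose peak sits close to 0.7.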

The result of this multiplication is, up to a normalisation step, a probability distribution: we use the symbol for proportionality instead of equality because we are skipping that normalisation step. For our case you can just read the previous relation as an equality.

So plotting everything will give us the following:

Comparison of our initial belief (blue) to the updated one (orange)

In the previous graph, the blue line is our initial belief and the orange line, the result of the multiplication, is our new belief once we take the new data into account. Initially we assumed a fair coin and our belief was centered at 50%, but taking the new data into account has "shifted" that belief toward 70%.

Probabilistic programming gives us the ability to iterate and continuously update our belief as new data becomes available. We could, for example, write a program like the following:
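
Here is a minimal Python sketch of such a program (the batches of tosses and the variable names are illustrative; the belief is tracked through the two Beta parameters a and b):

    # Illustrative sketch: start from a Beta prior and update it batch by batch.
    from scipy.stats import beta

    a, b = 2, 2                                     # prior belief: roughly a fair coin
    batches = [[1, 0, 1], [1, 0, 1, 1], [0, 1, 1]]  # 1 = heads, 0 = tails (7 heads, 3 tails in total)

    for tosses in batches:
        heads = sum(tosses)
        tails = len(tosses) - heads
        # Multiplying the current Beta(a, b) belief by the Binomial likelihood
        # of this batch gives back a Beta with updated parameters.
        a, b = a + heads, b + tails
        print(f"belief after batch: Beta({a}, {b}), mean = {beta.mean(a, b):.2f}")
        # (plotting beta.pdf(x, a, b) at this point would give the intermediate curves below)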

In the previous snippet we see that we started with a prior belief and no data. As we get new data, we keep updating our belief by multiplying it with the resulting likelihood. Each iteration is an update step that returns a Beta function, which in turn is updated at the following iteration, and so on.

If we plot the intermediate graphs resulting from the previous snippet we'd get something like:

As we get more data we get closer to the true probability of heads: 70%

In the previous graph, iterations correspond to more coin tosses, i.e. more data; as we get more data, we converge toward the true probability of heads: 70%.

If you replace the coin and the probability of heads by, say, a conversion rate, you can start to see how powerful this approach can be when it comes to analytics and business in general. We will see some applications in upcoming posts.

If you have followed this so far, you already understand enough about probabilistic programming to start exploring on your own. This is just a start though; the idea here was first to give an intuition about probabilistic programming. We will see in the follow-up posts how we can apply what we learned here to answer business questions, run inference, and analyse A/B testing and experimentation in general.

Conclusion

This post is a primer on probabilistic programming. We started with an introduction to the concepts of probability and probability distributions. Probability is very useful for quantifying uncertainty; likelihood, on the other hand, captures information from data. Likelihood is not a probability, although it is very similar to one. We use likelihood for comparison purposes, to find the hypothesis that is most likely to explain the data we are seeing.

The process we described in this post is the basis of a well-defined way of doing statistics called Bayesian statistics. The idea is to start with a prior belief and then use new data to build a likelihood; the multiplication of the prior and the likelihood yields the posterior after normalisation. The posterior is a probability distribution, just like the prior. A full treatment of Bayesian statistics is beyond the scope of this post, but hopefully you now have a basic understanding of how it works.

In this post we also introduced the Beta function as a very useful probability distribution for the prior. We showed that the Beta function has a really nice relationship with the Binomial likelihood; in terms of Bayesian statistics, we say that the Beta distribution is the conjugate prior of the Binomial likelihood.
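
In symbols, the update rule we used boils down to the following (with h heads and t tails observed):

    \text{prior } \mathrm{Beta}(a, b) \;\times\; \text{likelihood } p^{h}(1-p)^{t} \;\propto\; \text{posterior } \mathrm{Beta}(a+h,\, b+t)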

I hope this post will develop your curiosity about probabilistic programming and make you want to explore more. If you want to see some applications of what we learned so far, stay tuned for future posts.

Thank you for reading!

I would like to thank Andreas Merentitis and Maryna Cherniavska for their revision and proofreading work.
