Building A Probabilistic Risk Estimate Using Monte Carlo Simulations With Python MCerp

Zhijing Eu
16 min read · Aug 11, 2023

This article explains how to create a stochastic risk estimate via a Monte Carlo Simulation using a Python library called MCerp.

Image Source: Adapted from PlugPng

Monte Carlo Simulations are a way to model real-world phenomena by defining a set of (pseudo) random variables as inputs, repeatedly running simulations, and analyzing the aggregated outcomes across all the runs to better understand the factors that drive the results.

If you are coming into this cold, I’d suggest reading the original 2020 article below as we will be recreating the same probabilistic holiday budget estimate but using Python this time instead of an Excel add-in.

If you want to follow along — the Jupyter Notebook is available within this Github repo.

Outline

1. A Quick Survey Of Monte Carlo Simulation Libraries In Python
2. Case Study Background: Fantasy Island (Redux)
3. Input & Output Variables Set Up
4. Running The Simulation & Aggregating The Results
5. Modelling Dependencies Between Uncertain Variables — Correlation
6. Modelling Dependencies — Multiplicative "Meta-Variables"
7. Analysing The Key Factors Driving The Outputs
8. Conclusion

1. A Quick Survey Of Monte Carlo Simulation Libraries In Python

While it is possible to create Monte Carlo Simulations using basic libraries such as Numpy and SciPy, this can be a bit tedious to set up.
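For a sense of what that looks like, here is a bare-bones NumPy-only sketch of a holiday-budget style model (illustrative only; the variable names are made up and triangular distributions stand in for the PERT ranges used later):

import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of simulation runs

# crude stand-ins for the cost distributions used later in the article
plane_fare = rng.triangular(5000, 7000, 12000, size=n)
accommodation = rng.triangular(2000, 2500, 4000, size=n)
activities = rng.uniform(1500, 4000, size=n)

total = plane_fare + accommodation + activities
print(np.percentile(total, [10, 50, 90]))  # P10 / P50 / P90 of the total cost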

Thankfully, there are a few specialized Python libraries that provide more choices for sampling methods and much-needed shortcut functions that abstract away the otherwise complicated process of managing sub-simulations and summarizing the aggregated simulation results.

PyMCSL by Filipe Chagas Ferraz

Monaco by Scott Shambaugh

MCerp by Abraham Lee

While all three are good frameworks to work with, I have specifically chosen MCerp for this coding walkthrough as it provides some useful functions not just for calculating the resultant correlation factors but also for inducing correlation between the input variables.

The code referenced below borrows heavily from this article by fellow Medium author Heiko Onnen who also has a range of other articles related to more advanced topics on stochastic modelling.

2. Case Study Background: Fantasy Island (Redux)

As a quick recap, the example we will be analyzing involves a budget planning exercise for a vacation to a fictional winter island country, to answer the question of "How Much Will My Holiday To Fantasy Island Cost?"

(Read the previous article for more detail as we’ll just be jumping headlong into the mechanics from here on)

Behold! Fantasy Island — The holiday destination of your dreams (although you might want to give the white water rafting a pass)

3. Input & Output Variables Set Up

Inputs to Probabilistic Estimates are typically made up of:

  • Uncertainty in existing components/sub-elements that are ‘continuous’ in nature (where numbers may go higher or lower)
  • Discrete Events that have a likelihood of occurrence and, once triggered (i.e. an on-off switch), will have an impact

To start, we first need to import a few libraries:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from mcerp import correlate, correlation_matrix, plotcorr
from mcerp import Beta, Uniform, PERT, Binomial
from mcerp import uv, stats

from scipy import stats as stats
from scipy.stats import rv_continuous, rv_histogram

np.set_printoptions(precision=3, suppress=True)
pd.options.display.float_format = '{:.3f}'.format

MCerp provides a number of different probability distributions that are built on scipy.stats but wrapped with a couple of useful extra features. As an example, here's how to set up a Beta-PERT distribution and what MCerp provides as output:

import mcerp
mcerp.npts = 10000 # number of simulation runs (mcerp's module-level sample count; 10,000 is also the default)

x = PERT(1,4,10)
x.describe()
rx = x._mcpts # an array of all the individual simulation results
rx

We will also create a helper function to plot out the probability distribution function

def plot_hist(data, title):
    hist = np.histogram(data, bins=100)
    histdist = rv_histogram(hist)

    X = np.linspace(data.min(), data.max(), 100)
    plt.title(title)
    plt.hist(data, density=True, bins=100)
    plt.plot(X, histdist.pdf(X))
    plt.show()

rx=x._mcpts
plot_hist(rx,"Example PERT Distribution")

With this all set up, let’s define all the Input Variables as per the example case study

#InputVariables

PertLambda = 4 # PERT shape parameter (assumption: mcerp's default value of 4)

PlaneFare = PERT(5000,7000,12000,PertLambda) #BetaPert
AccommodationCosts = PERT(2000,2500,4000,PertLambda) #BetaPert
Meals = PERT(500,600,900,PertLambda) #BetaPert
MiscShoppingExpenses = 1000 #Has No Range As It Is UnRisked
HolidayActivityTours = Uniform(1500,4000) #Uniform
ClothingTravelGear = PERT(300,400,500,PertLambda) #BetaPert

MedicalEmergencyRiskLikehood = Binomial(1, 0.1)
MedicalEmergencyRiskImpact = Uniform(1000,5000)

TheftRiskLikelihood = Binomial(1, 0.1)
TheftRiskImpact = PERT(500,750,2000,PertLambda) #BetaPert

#OutputVariables
RiskedRanges = PlaneFare+AccommodationCosts+Meals+MiscShoppingExpenses+HolidayActivityTours+ClothingTravelGear
DiscreteRisks = MedicalEmergencyRiskLikehood*MedicalEmergencyRiskImpact+TheftRiskLikelihood*TheftRiskImpact
TotalHolidayCosts = RiskedRanges+DiscreteRisks

4. Running The Simulation & Aggregating The Results

We still need to pull all the individual sets of results together into a coherent table format, which is what the next few lines of code do. I've used the "r" prefix to denote the array that stores the results for the relevant input or output variable.

# collect the results in array variables


rPlaneFare = PlaneFare._mcpts
rAccommodationCosts = AccommodationCosts._mcpts
rMeals = Meals._mcpts
rHolidayActivityTours = HolidayActivityTours._mcpts
rClothingTravelGear = ClothingTravelGear._mcpts
rMedicalEmergencyRiskLikehood = MedicalEmergencyRiskLikehood._mcpts
rMedicalEmergencyRiskImpact=MedicalEmergencyRiskImpact._mcpts
rTheftRiskLikelihood=TheftRiskLikelihood._mcpts
rTheftRiskImpact=TheftRiskImpact._mcpts
rRiskedRanges=RiskedRanges._mcpts
rDiscreteRisks=DiscreteRisks._mcpts
rTotalHolidayCosts=TotalHolidayCosts._mcpts

# combine the arrays in a 2-dimensional array
rand1 = np.vstack((rPlaneFare,rAccommodationCosts,rMeals,rHolidayActivityTours,rClothingTravelGear,
rMedicalEmergencyRiskLikehood,rMedicalEmergencyRiskImpact,
rTheftRiskLikelihood,rTheftRiskImpact,rRiskedRanges,rDiscreteRisks,rTotalHolidayCosts))

VarNames=["Input_PlaneFare", "Input_AccomCosts", "Input_Meals", "Input_HolidayActivityTours", "Input_ClothingTravelGear",
"Input_MedEmergencyChance","Input_MedEmergencyCost","Input_TheftRiskChance","Input_TheftRiskCost",
"Output_RiskedRanges","Output_DiscreteRisks","Output_TotalHolidayCosts"]

# copy the array to a dataframe for a more transparent layout
df1 = pd.DataFrame(data=rand1).T
df1.columns=VarNames
df1

Using the describe method in MCerp, we can pull out some key statistics.
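For example (a minimal sketch), calling describe() directly on the output variables prints the first few moments (mean, variance, skewness, kurtosis) of each simulated distribution:

# summary statistics for each output distribution
RiskedRanges.describe()
DiscreteRisks.describe()
TotalHolidayCosts.describe()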

We will also calculate the 10th, 25th, median (50th), 75th and 90th percentiles from the 10,000 simulation runs using the pandas quantile method:

df1.quantile([.1, .25, .5, .75, .9], axis = 0)

However, since these are a bit tricky to visualize, let's create some probability distribution function plots next.

VarSimResults=[rPlaneFare,rAccommodationCosts,rMeals,rHolidayActivityTours,rClothingTravelGear,
rMedicalEmergencyRiskLikehood,rMedicalEmergencyRiskImpact,rTheftRiskLikelihood,
rTheftRiskImpact,rRiskedRanges,rDiscreteRisks,rTotalHolidayCosts]



for n in range(len(VarSimResults)):
    plot_hist(VarSimResults[n], VarNames[n])

5. Modelling Dependencies Between Uncertain Variables — Correlation

What's missing from the above simulation is the correlation between the risk inputs themselves, since the random sampling of values from the probability distributions is assumed to be independent.

However, due to the Central Limit Theorem, when you have a lot of independent random inputs, the individual randomness of each input tends to cancel out, i.e. you end up with a nice clean and narrow normal distribution. Unfortunately this does not reflect most real-life situations, where budgets tend to over-run rather than under-run.
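As a quick illustration of that narrowing effect (a NumPy sketch, outside of MCerp), summing many independent uniform variables produces a total whose relative spread is much tighter than that of any single input:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

one = rng.uniform(0, 1, size=n)                     # a single uniform input
many = rng.uniform(0, 1, size=(n, 20)).sum(axis=1)  # sum of 20 independent uniforms

# coefficient of variation (std/mean) shrinks sharply as independent inputs are summed
print(one.std() / one.mean(), many.std() / many.mean())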

Coming back to our model, here is the correlation plot for the (uncorrelated) risked range inputs:

# get the correlation matrix BEFORE applying correlation
BeforeCorrApplied = correlation_matrix([AccommodationCosts, Meals, HolidayActivityTours])
BeforeCorrApplied

# plot correlations BEFORE applying correlation
corrplot0 = plotcorr([AccommodationCosts, Meals, HolidayActivityTours], labels=["Input_Accom",
"Input_Meals", "Input_HolidayActivityTours"])
The values near 0.00 indicate that there is little to no correlation between the selected input variables.

If we have already determined that there is a certain correlation behavior we'd expect between the variables, MCerp has a function that can apply this correlation using a mathematical method called Iman-Conover, which approximately induces a specified rank-order (i.e. Spearman's) correlation among the component variables without changing their univariate distributions.

(Here's an article from an even more advanced Python library called aggregate that we are not using in this example but which provides a fairly detailed explanation of how the method works.)

applied_correlation=np.array([[1.00, 0.85,  0.85],
[0.85, 1.00, 0.85],
[0.85, 0.85, 1.0]])

# impose the targeted correlation matrix on the 3 input variables
correlate([AccommodationCosts, Meals, HolidayActivityTours], applied_correlation)

# plot the new correlation matrix of the input variables
corrplot2 = plotcorr([AccommodationCosts, Meals, HolidayActivityTours],
labels=["Input_Accom","Input_Meals", "Input_HolidayActivityTours"])

The next step is to confirm that the correlations have been induced to a value that is close enough to the expected target correlations
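A quick way to check is to recompute the correlation matrix on the same variables now that correlate() has been applied:

# get the correlation matrix AFTER applying correlation
AfterCorrApplied = correlation_matrix([AccommodationCosts, Meals, HolidayActivityTours])
AfterCorrApplied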

Eh… 0.83 and 0.846 are close enough to 0.85, I suppose.

We then need to “reset” the Output Variables to see the effect

#OutputVariables - Rerun with Corr Variables
RiskedRanges = PlaneFare+AccommodationCosts+Meals+MiscShoppingExpenses+HolidayActivityTours+ClothingTravelGear
DiscreteRisks=MedicalEmergencyRiskLikehood*MedicalEmergencyRiskImpact+TheftRiskLikelihood*TheftRiskImpact
TotalHolidayCosts=RiskedRanges+DiscreteRisks

and re-run the simulation

# run the simulation model, 
# and collect the results in array variables


rPlaneFare = PlaneFare._mcpts
rAccommodationCosts = AccommodationCosts._mcpts
rMeals = Meals._mcpts
rHolidayActivityTours = HolidayActivityTours._mcpts
rClothingTravelGear = ClothingTravelGear._mcpts
rMedicalEmergencyRiskLikehood = MedicalEmergencyRiskLikehood._mcpts
rMedicalEmergencyRiskImpact=MedicalEmergencyRiskImpact._mcpts
rTheftRiskLikelihood=TheftRiskLikelihood._mcpts
rTheftRiskImpact=TheftRiskImpact._mcpts
rRiskedRanges=RiskedRanges._mcpts
rDiscreteRisks=DiscreteRisks._mcpts
rTotalHolidayCosts=TotalHolidayCosts._mcpts

# combine the arrays in a 2-dimensional array
rand1 = np.vstack((rPlaneFare,rAccommodationCosts,rMeals,rHolidayActivityTours,rClothingTravelGear,
rMedicalEmergencyRiskLikehood,rMedicalEmergencyRiskImpact,
rTheftRiskLikelihood,rTheftRiskImpact,rRiskedRanges,rDiscreteRisks,rTotalHolidayCosts))

VarNames=["Input_PlaneFare", "Input_AccomCosts", "Input_Meals", "Input_HolidayActivityTours", "Input_ClothingTravelGear",
"Input_MedEmergencyChance","Input_MedEmergencyCost","Input_TheftRiskChance","Input_TheftRiskCost",
"Output_RiskedRanges","Output_DiscreteRisks","Output_TotalHolidayCosts"]

# copy the array to a dataframe for a more transparent layout
df_postCorr = pd.DataFrame(data=rand1).T
df_postCorr.columns=VarNames
df_postCorr

Repeating the same steps as before, we can now compare the percentile results for the correlation-adjusted outputs vs the un-correlated outputs.
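One way to line the two sets of percentiles up side by side (a small sketch, assuming df1 still holds the un-correlated results from earlier):

# P10-P90 comparison of the un-correlated vs correlation-adjusted total costs
quantiles = [.1, .25, .5, .75, .9]
comparison = pd.concat(
    {"Uncorrelated": df1["Output_TotalHolidayCosts"].quantile(quantiles),
     "Correlated": df_postCorr["Output_TotalHolidayCosts"].quantile(quantiles)},
    axis=1)
comparison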

The results are very slightly different: the correlation-adjusted results have slightly wider variability and a longer tail.
The mean is the same but the variance has gone up.

For reasons that will become clear in the next section, let's also convert the Fantasy Bucks-denominated budget into USD using a fixed rate of 4 Fantasy Bucks to 1 USD.

The USD mean is just the Fantasy Bucks mean divided by 4, and the USD variance is the Fantasy Bucks variance divided by 4² (i.e. 16).
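In code, the fixed-rate conversion is just a scalar division of the output variable (assuming 4 Fantasy Bucks to 1 USD):

# fixed-rate conversion: 4 Fantasy Bucks = 1 USD
TotalHolidayCostsInUSDFixed = TotalHolidayCosts / 4
TotalHolidayCostsInUSDFixed.describe()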

6. Modelling Dependencies — Multiplicative "Meta-Variables"

All the steps above display the results in terms of Fantasy Bucks but in this fictional example, we need to plan our budget in USD.

To provide more variability in the result, we can also model FOREX uncertainty that 'amplifies' the uncertainty of the overall result (since the other line items, which are themselves uncertain, are scaled by this variable FOREX rate).

FOREXVariability=PERT(3.75,4.00,4.50) #BetaPert

#OutputVariables - Rerun with Corr Variables
RiskedRanges = PlaneFare+AccommodationCosts+Meals+MiscShoppingExpenses+HolidayActivityTours+ClothingTravelGear
DiscreteRisks=MedicalEmergencyRiskLikehood*MedicalEmergencyRiskImpact+TheftRiskLikelihood*TheftRiskImpact
TotalHolidayCosts=RiskedRanges+DiscreteRisks
TotalHolidayCostsInUSD=TotalHolidayCosts/FOREXVariability

Comparing against the previous results (which assumed a fixed rate of 4 Fantasy Bucks = 1 USD), the FOREX variability creates a wider P10 to P90 range.
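To put numbers on that, a quick check of the percentiles of the USD total (a sketch; np.percentile works directly on the underlying sample array):

# P10 / P50 / P90 of the USD total with FOREX uncertainty included
rTotalHolidayCostsInUSD = TotalHolidayCostsInUSD._mcpts
print(np.percentile(rTotalHolidayCostsInUSD, [10, 50, 90]))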

7. Analysing The Key Factors Driving The Outputs

Much like the previous XLRisk example, once you have the outputs, the key thing is to determine which risk inputs are the major drivers of the final output value we are interested in.

#Viewing the results by ranking the correlations between each random input variable 
#(i.e uncertain ranges or risk events likelihoods or impact ranges) and the outcome

# correlation matrix of the output vs all the input variables
OutputVsInputCorr = correlation_matrix([TotalHolidayCostsInUSD,PlaneFare,AccommodationCosts,
Meals,HolidayActivityTours,ClothingTravelGear,
MedicalEmergencyRiskLikehood, MedicalEmergencyRiskImpact,
TheftRiskLikelihood,TheftRiskImpact,
FOREXVariability])

df_Correlation=pd.DataFrame(data=OutputVsInputCorr)
df_Correlation.columns=["Output_TotalHolidayCosts","Input_PlaneFare", "Input_AccomCosts", "Input_Meals",
"Input_HolidayActivityTours", "Input_ClothingTravelGear",
"Input_MedEmergencyChance","Input_MedEmergencyCost","Input_TheftRiskChance","Input_TheftRiskCost",
"FOREX Variability"]

df_Correlation.T[0][1:].sort_values(ascending=False)

df_Correlation.T[0][1:].sort_values().plot.barh()

While the numerical results vary slightly from the XLRisk simulations, both agree on the top 5 input variables shaping the results based on a simple Pearson correlation of the Output Total Holiday Cost and the various risk inputs.

8. Conclusion

1. Monte Carlo Simulations work by repeatedly sampling from a set of pre-defined random variables within a model and aggregating the results to understand the overall range of outcomes

2. Inputs to Probabilistic Estimates are typically made up of:

  • Uncertainty in existing components/sub-elements that are 'continuous' in nature (where numbers may go higher or lower)
  • Discrete Events that have a likelihood of occurrence and once triggered tend to create a significant impact that may be 'discontinuous'
  • Dependencies between the said variables in the form of correlations, or second-order "meta-variables" that have 'multiplicative' effects (i.e. a random variable that is multiplied with several other random variables)

3. Monte Carlo simulations can be used to gain valuable insight into the interactions between the different model inputs and to understand the key drivers of variability in an estimate (and hence guide better decisions or development of response/mitigation plans)

4. However, the accuracy of Monte Carlo simulations is highly dependent on the robustness of the underlying assumptions/exclusions used in the model and the reliability of the information used to estimate the uncertainty of the model sub-elements

Hopefully, this article has given you a good introduction to using Monte Carlo Simulations to build probabilistic risk estimates.


Bonus Content!

09.0 Exploring Other Random Number Generators & Sampling Techniques
10.0 Beyond Correlations To Copulas
11.0 Best Practices When Conducting Probabilistic Analyses
11.1 Model Framing/Assumptions
11.2 Consistency In Risk Input Collection & Risk Definition
11.3 Translating The Findings Into ACTIONABLE Insights
12.0 In Conclusion… (This Time For Real)

9. Exploring Other Random Number Generators & Sampling Techniques

There are a few different techniques to optimize the 'coin flips' that happen for each iteration in a Monte Carlo Simulation. One area relates to how we perform the random sampling for the various distributions.

The ‘naive’ random sampling approach unfortunately needs a relatively substantial number of samples to achieve accurate results (i.e. using a uniformly “random” algorithm will require more tries to fill up the space of outcomes)

Source : (PDF) Knowledge Extraction and the Development of a Decision Support System for the Conceptual Design of Liquefied Natural Gas Terminals under Risk and Uncertainty (researchgate.net)

The most common method used in commercial simulation software (Palisade @Risk, Vose Software Model Risk, Oracle Crystal Ball) is called Latin Hypercube Sampling, where each variable's range is split into evenly spaced regions from which random samples are drawn to obtain a controlled random sample.

However, there are other methods such as the Sobol and Halton algorithms that can potentially provide lower discrepancy (i.e. fill the space of possibilities more evenly), resulting in faster convergence with fewer iterations and more stable estimates.

Unfortunately, MCerp defaults to Latin Hypercube Sampling and does not allow you to switch to other sampling techniques. However, other Python libraries like Monaco do have this option, and if you are so inclined you can even do this via the underlying SciPy Quasi-Monte Carlo engine, which allows you to switch between Latin Hypercube, Sobol or Halton sampling.
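If you want to experiment with this directly, scipy.stats.qmc exposes all three samplers. Here is a minimal sketch (outside of MCerp; qmc.scale maps the unit-hypercube samples onto your own variable ranges, and qmc.discrepancy gives a rough measure of how evenly the samples fill the space):

from scipy.stats import qmc

d, n = 2, 1024  # 2 input dimensions, 1024 samples (powers of two suit Sobol)

samplers = {
    "LatinHypercube": qmc.LatinHypercube(d=d, seed=1),
    "Sobol": qmc.Sobol(d=d, scramble=True, seed=1),
    "Halton": qmc.Halton(d=d, seed=1),
}

for name, sampler in samplers.items():
    u = sampler.random(n)                         # samples on the unit hypercube
    x = qmc.scale(u, [1500, 2000], [4000, 4000])  # e.g. tours and accommodation ranges
    print(name, qmc.discrepancy(u))               # lower discrepancy = more even coverage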

Other than the random sampling methodology, another area we could play with is the pseudo-random number generator (PRNG) under the hood. In most commercial simulation tools the PRNG tends to be something called the Mersenne Twister, which is favored for its computational efficiency.

Unfortunately, I've not been able to figure out how to change this in MCerp. It seems like MCerp and the other libraries referenced (Monaco and PyMCSL) are built on top of SciPy, which in turn relies on NumPy, whose default PRNG is something else called the Permuted Congruential Generator (64-bit, PCG64).
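For reference, NumPy itself does let you construct a random Generator around a different bit generator, although this is a NumPy-level sketch only; I have not found a way to make MCerp pick it up internally:

import numpy as np

# PCG64 is the bit generator behind np.random.default_rng()
rng_pcg = np.random.default_rng(123)

# the Mersenne Twister used by many commercial tools is still available
rng_mt = np.random.Generator(np.random.MT19937(123))

print(rng_pcg.uniform(size=3), rng_mt.uniform(size=3))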

Any readers who have managed to come this far in the article and may be more knowledgeable — please leave a comment if you know how to switch the default PRNG in Numpy (and if it’s worthwhile to do so…)

10. Beyond Correlations To Copulas

Earlier in the article, we demonstrated how easy it is to apply the Iman-Conover method. The danger, though, is that if you start to induce correlations amongst too many input variables, you can end up in a situation called multicollinearity, where everything becomes correlated with the target output variable.

The good news is that multicollinearity does not affect the percentile estimates. However, it does make it hard to determine how the independent variables influence the dependent variable, because the relative strengths of the correlation factors between all the inputs and the output will be near identical. There are statistical methods (e.g. Variance Inflation Factors) to address this problem, but it does add complexity to model interpretability.

In addition, there is also the downside that Iman-Conover can't handle situations where the strength of the correlation varies depending on "which part" of the random variable's range is being sampled.

For those of you who are keen to dive deeper, there is a more advanced approach via copulas: a method to describe/model the inter-dependence between random variables that lets analysts estimate the distribution of a random vector by estimating the marginals and the copula separately.

The math does get involved, but the intuition we can take away is that the Iman-Conover method is analogous to applying a Gaussian copula, where the "induced relationship" between the variables is the same across the entire range of possible values.

However, it is also possible to model other types of relationships. For example, in the image below, Gaussian copulas simulate joint symmetric dependence but cannot reflect tail dependence (i.e. no joint extreme events). Compare this to the Student's t copula, which can model joint symmetric tail dependence (i.e. an increased chance of joint extreme events).

Source : https://github.com/tinoproductions/DirtyQuant/blob/master/Introduction%20to%20Copulas.ipynb
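To get a feel for the mechanics, here is a minimal Gaussian-copula sketch in SciPy (illustrative only and outside the MCerp workflow; the correlation value and the uniform marginals are made-up stand-ins for the case study ranges):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 10_000
corr = np.array([[1.0, 0.85],
                 [0.85, 1.0]])

# 1. draw correlated standard normals
z = rng.multivariate_normal(mean=[0, 0], cov=corr, size=n)

# 2. push them through the normal CDF to get correlated uniforms (the "copula" part)
u = stats.norm.cdf(z)

# 3. map the uniforms through each marginal's inverse CDF
accommodation = stats.uniform(loc=2000, scale=2000).ppf(u[:, 0])  # 2000 to 4000
meals = stats.uniform(loc=500, scale=400).ppf(u[:, 1])            # 500 to 900

print(stats.spearmanr(accommodation, meals)[0])  # rank correlation carried through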

Lest you think this is an obscure and esoteric topic, note the following cautionary tale…

“When David X. Li’s Gaussian copula function was first published in 2000 investors exploited it as a quick — and fatally flawed — way to price hundreds of billions of dollars’ worth of CDOs filled with mortgages. Unfortunately in 2008 — when the markets started behaving in a way that the model could not cope with, it resulted in the loss of trillions of dollars and put the survival of the global banking system in serious peril”

11. Best Practices When Conducting Probabilistic Analyses

Even for this simplified Fantasy Island Holiday Budget example, there are several questions we could delve deeper into.

11.1 Model Framing/Assumptions

Image Source: Alamy
  • Why are choices being modelled as uncertainties?
  • For example, if the cost uncertainty ranges reflect the spectrum from a five-star hotel to a youth hostel experience, then maybe a better approach is to offer simpler, less granular scenarios and their cost impacts, e.g. a Luxury Holiday Itinerary vs a Shoe-string Backpacker Itinerary
  • In this manner, a more detailed probabilistic budget analysis should only be run AFTER the initial decision has been narrowed down

11.2 Consistency In Risk Input Collection & Risk Definition

Image Source : Blogspot
  • How confident are we that the uncertainty ranges and risk event impact values don't double-dip and are consistently applied?
  • In the example above, there is a range on Clothing/Travel Gear, which we denominated in Fantasy Bucks, but if we plan to buy this ahead of the trip from an e-commerce store in USD terms, it would be incorrect to apply the FOREX uncertainty to this line item (see the sketch after this list).
  • These are the sort of things to watch for, especially when the model includes "second order" uncertainties that have a multiplicative effect.
  • For example, there is a popular technique in Schedule Risk Analysis called the Risk Driver Method that allows analysts to avoid having to worry about co-dependencies/correlations but assumes that we can estimate percentage high-mid-low impacts in a mutually exclusive manner (which, in practice, is near impossible to do).
  • Furthermore, if the model includes "conditional risks" (i.e. child risk events that only have a chance of occurring IF the parent risk event happens) you will need to calibrate the likelihood estimates accordingly (as a trivial example, the odds of a randomly selected person having lung cancer would be different if we knew that person is a heavy smoker).
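As a concrete sketch of the Clothing/Travel Gear point above (a hypothetical restructuring, not code from the case study), the USD-denominated item would simply sit outside the FX conversion:

# hypothetical restructuring: keep the USD-denominated purchase outside the FX conversion
ClothingTravelGearUSD = PERT(75, 100, 125)  # assumed to be bought online in USD

CostsInFantasyBucks = (PlaneFare + AccommodationCosts + Meals +
                       MiscShoppingExpenses + HolidayActivityTours)
TotalInUSD = CostsInFantasyBucks / FOREXVariability + ClothingTravelGearUSD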

11.3 Translating The Findings Into ACTIONABLE Insights

  • A competent analyst needs to not only explain the basis of the analysis, model assumptions and conclusions but also make the insights actionable.
  • For example, it is not enough to highlight that the variability in Plane Fare costs has the highest impact on the overall budget uncertainty; we also need to test the underlying model assumptions and understand the holiday maker's value drivers:
  • >> If the holiday is budget-constrained, then perhaps the holiday makers need to accept 'odd hour' flights to lower the cost exposure, and we should stress that the longer they delay booking the flight, the more expensive it gets (i.e. ensuring the decisions are TIME BOUND)
  • >> An even bigger question is date flexibility, i.e. is it possible to move the holiday to a later, off-peak date when Fantasy Island isn't flooded with tourists, to enjoy lower costs (while accepting that certain attractions may be closed)?

12. In Conclusion… (This Time For Real)

This was mentioned earlier but it is so important that it bears repeating: the accuracy of Monte Carlo simulations is highly dependent on the robustness of the underlying assumptions/exclusions used in the model and the reliability of the input information used to estimate the uncertainty of the model sub-elements.

Therefore, while this bonus section covered the more technical aspects of running a probabilistic analysis, the critical steps are not necessarily the modelling itself (i.e. the cranking of the handle) but the upfront model definition and assumptions, ensuring that we ask the right questions when gathering the input data and, most importantly, communicating the findings to support better decision making or action taking.
