Feature engineering, training and testing machine learning models, and tuning hyperparameters can be fun and exciting. However, most of the time, especially if you are doing data science for business, people are interested not just in the raw model output but in understanding how the model arrived at its decision. This is where we plot feature importance, compute SHAP values, or use fancier methods of interpreting the model. These efforts can be a game changer when it comes to increasing trust in machine learning models and their adoption rate in your organization.
Going a step further, besides asking how, there is huge value in asking why and answering questions like:
- “What is the cause for people to churn?”
- “If we improve this feature, how much will the churn be reduced?”
These questions are hard to answer with commonly used machine learning models because they require us to find causality, while those models are very good at capturing correlation.
This tutorial will walk you through CausalNex, a library created by QuantumBlack for tackling the questions above. The library was published last year and it really caught my attention, since at the time I wasn’t aware of many similar tools. It is very easy to use, and many helpful tips are available on its website.
I will apply CausalNex to the diabetes dataset, show you potential pitfalls and how to avoid them, explain how to ask what-if questions, and compare performance with logistic regression.
So, let’s start. :)
First, I import the necessary libraries and the dataset, then check the column types and the number of NaN values. The dataset contains 9 columns and 768 records. The first 8 columns are features related to the diagnosis of diabetes, and the 9th column is the diagnosis.
Learning Structure from Data
How does CausalNex work?
First, you need to define a structural model or infer one from the data. A structural model is a graph whose edges indicate which nodes affect which other nodes. It can be defined by domain experts. For example, suppose we collaborate with a clinic and the doctors suggest that the number of pregnancies affects the diabetes outcome. The reason could be that women develop insulin resistance during pregnancy, and some women cannot produce enough insulin to overcome this resistance, so they develop diabetes.
However, most of the time we’re not lucky enough to have experts at hand, so inferring the structure from the data can be very helpful. To infer the structure, CausalNex uses the NOTEARS algorithm (Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning), published at the NeurIPS (formerly NIPS) conference in 2018. The algorithm learns from the data how nodes are connected to each other, in the form of a weighted adjacency matrix.
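To give a feel for what "trace exponential" refers to: NOTEARS replaces the combinatorial "is this graph acyclic?" check with a smooth function, h(W) = tr(exp(W ∘ W)) - d, which is zero exactly when the weighted adjacency matrix W describes a DAG. The sketch below only evaluates that function with a truncated power series in plain Python. It is not the optimizer and not CausalNex code, and the two 3-node matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def notears_h(w, terms=30):
    """Evaluate h(W) = tr(exp(W ∘ W)) - d with a truncated power series.

    w[i][j] is the weight of the edge i -> j; d is the number of nodes.
    """
    d = len(w)
    a = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]  # element-wise square
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # A^0
    trace_sum = float(d)  # trace of the identity matrix
    factorial = 1.0
    for k in range(1, terms + 1):
        power = matmul(power, a)  # A^k
        factorial *= k
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum - d


# Hypothetical weights: a chain 0 -> 1 -> 2 (acyclic) ...
dag = [[0.0, 0.9, 0.0],
       [0.0, 0.0, 0.7],
       [0.0, 0.0, 0.0]]
# ... and the same chain with an extra edge 2 -> 0 that closes a cycle.
cyclic = [[0.0, 0.9, 0.0],
          [0.0, 0.0, 0.7],
          [0.5, 0.0, 0.0]]

h_dag = notears_h(dag)
h_cyclic = notears_h(cyclic)
print(h_dag, h_cyclic)
```

The acyclic matrix gives zero and the cyclic one gives a positive value, which is what lets the algorithm treat acyclicity as a constraint in a continuous optimization problem.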
The structure can be inferred from data in NumPy or pandas format. Keep in mind that all data types should be numerical, so if you have categorical data, you should encode it first.
In the documentation I found that it is possible to learn the structure with or without lasso regularization. The paper suggests that learning with lasso is helpful for small datasets and that it forces the DAG (directed acyclic graph) to be sparse, which is a convenient feature. I applied both variants to this dataset, without any significant difference in the final structure. This, however, might be due to the dataset size (the paper worked with a dataset of 1000 samples).
Below is a structural model learned from the data without regularization and with edge pruning. Edge pruning means that all edges with a weight below a defined threshold are removed from the graph. I use it here to reduce the complexity of my graph and avoid false positive connections.
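Edge pruning itself is a simple operation, and a minimal sketch in plain Python looks like this. The edge weights and the 0.8 threshold are invented for illustration and are not the values learned from the diabetes data. I compare the absolute weight against the threshold, which is my understanding of how CausalNex's `remove_edges_below_threshold` behaves, so treat that detail as an assumption.

```python
def prune_edges(weighted_edges, threshold):
    """Keep only edges whose absolute weight reaches the threshold."""
    return {edge: weight
            for edge, weight in weighted_edges.items()
            if abs(weight) >= threshold}


# Hypothetical learned weights, keyed by (parent, child).
learned = {
    ("Glucose", "Insulin"): 1.20,
    ("BMI", "BloodPressure"): 0.90,
    ("Age", "SkinThickness"): 0.05,
    ("Outcome", "Age"): -0.85,
}

pruned = prune_edges(learned, threshold=0.8)
for (parent, child), weight in pruned.items():
    print(f"{parent} -> {child}: {weight}")
```

Note that the negative edge survives because only its magnitude matters, and that pruning says nothing about whether a surviving edge makes sense causally.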
However, we see some unintuitive and odd connections. For example, Outcome is the diagnosis, yet it looks as if Outcome affects everything. CausalNex allows us to remove or add each edge independently, but we can also forbid features from being parent or child nodes and learn the structure under this constraint.
Let’s put a constraint on the Outcome column and use the lasso version.
Now it looks better (Glucose causes the Insulin level to rise, BMI can affect Blood Pressure …), but some edges still don’t make sense: the number of pregnancies doesn’t cause a person’s age, nor does the history of diabetes in your family affect the number of pregnancies. This is the most tedious part of using this tool because you need to remove unrealistic edges and add missing ones. I will do that with very limited knowledge of how diabetes actually develops and is diagnosed. However, it is far easier to add or remove a couple of edges than to build the whole graph from scratch.
This is the final structural model.
A Bayesian Network is a probabilistic graphical model that represents dependencies between variables and their joint distribution. It is a directed acyclic graph (DAG) in which nodes are random variables and edges are causal connections between them, with each node carrying a conditional probability distribution given its parents. Once we have a structural model, we create our Bayesian Network and fit the conditional probabilities.
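Fitting a conditional probability distribution for a discrete node comes down to counting. The sketch below estimates P(BloodPressure | BMI) from a handful of made-up records by plain maximum likelihood, with no smoothing or prior. It illustrates the idea only and is not how CausalNex implements fitting; the helper name `fit_cpd` is hypothetical.

```python
from collections import Counter, defaultdict


def fit_cpd(records, child, parents):
    """Estimate P(child | parents) by counting co-occurrences."""
    counts = defaultdict(Counter)
    for row in records:
        parent_state = tuple(row[p] for p in parents)
        counts[parent_state][row[child]] += 1
    return {
        parent_state: {state: n / sum(child_counts.values())
                       for state, n in child_counts.items()}
        for parent_state, child_counts in counts.items()
    }


# Made-up, already discretised records (not rows from the diabetes dataset).
records = [
    {"BMI": "obese", "BloodPressure": "high"},
    {"BMI": "obese", "BloodPressure": "high"},
    {"BMI": "obese", "BloodPressure": "normal"},
    {"BMI": "normal", "BloodPressure": "normal"},
    {"BMI": "normal", "BloodPressure": "normal"},
    {"BMI": "normal", "BloodPressure": "normal"},
    {"BMI": "normal", "BloodPressure": "high"},
]

cpd = fit_cpd(records, child="BloodPressure", parents=["BMI"])
print(cpd[("obese",)])
print(cpd[("normal",)])
```

With very few records in a parent state, these estimates become unreliable, which is one reason estimators with a prior are often preferred in practice.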
Keep in mind that by adding and removing edges manually you can make your graph cyclic, which will raise an error when you build a Bayesian Network from your structural model. If that happens, you need to remove the edges that cause the cycle.
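If you want to check for cycles yourself after editing edges by hand, a topological sort is enough. The following is a small sketch using Kahn's algorithm on a plain edge list. The helper name `has_cycle` and the example edges are hypothetical, and this is independent of whatever check CausalNex performs internally.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (parent, child) pairs has a cycle."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start from nodes without parents and peel the graph layer by layer.
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Nodes that were never reached sit on (or behind) a cycle.
    return visited != len(nodes)


edges = [
    ("BMI", "BloodPressure"),
    ("Glucose", "Insulin"),
    ("Insulin", "Outcome"),
]
print(has_cycle(edges))                             # acyclic
print(has_cycle(edges + [("Outcome", "Glucose")]))  # manual edit closes a loop
```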
At the moment, all our features are numerical, and they will become the nodes of the Bayesian Network. Currently, CausalNex supports only discrete probability distributions, so we need to discretize our features.
CausalNex conveniently offers several ways of discretizing features:
- uniform (specify the number of buckets and the discretizer will create uniformly spaced buckets)
- quantile (specify the number of buckets and the discretizer will create buckets with equal percentiles)
- outlier (specify the outlier percentile and the discretizer will create 3 buckets)
- fixed (splitting points are specified manually)
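To make the difference between these strategies concrete, here is a rough sketch of the uniform, quantile, and fixed variants in plain Python. It does not use the CausalNex discretiser, the helper names are hypothetical, the quantile rule is a simple nearest-rank one, and the BMI values are invented. The fixed split points 18.5, 25, and 30 are the usual BMI category boundaries.

```python
from bisect import bisect_right


def uniform_split_points(values, n_buckets):
    """Equally spaced split points between the minimum and the maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + width * i for i in range(1, n_buckets)]


def quantile_split_points(values, n_buckets):
    """Split points chosen so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]


def discretise(values, split_points):
    """Map each value to the index of the bucket it falls into."""
    return [bisect_right(split_points, v) for v in values]


# Invented BMI values, not taken from the dataset.
bmi = [17.0, 22.4, 24.9, 27.5, 31.2, 35.8, 43.1, 29.9]

# Fixed split points based on domain knowledge.
bmi_split_points = [18.5, 25.0, 30.0]
labels = ["underweight", "normal", "overweight", "obese"]

fixed = discretise(bmi, bmi_split_points)
print([labels[i] for i in fixed])
print(discretise(bmi, uniform_split_points(bmi, 4)))
print(discretise(bmi, quantile_split_points(bmi, 4)))
```

Uniform buckets depend heavily on the extremes of the data, quantile buckets balance the sample counts, and fixed buckets let you encode what the categories actually mean.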
I will first discretize the features uniformly, since it’s easy and convenient, and fit the conditional probabilities with it. Later, I will try to discretize the data so that it is better aligned with current findings on normal levels of BMI, Glucose, Insulin, etc. For more information about discretisation, check my GitHub.
Once the probabilities are fitted, Outcome is predicted and the predictions are compared with the true values in the Outcome column. Since Outcome is slightly imbalanced, I will calculate recall, precision, F1, and accuracy.
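For reference, these four metrics can be computed directly from the confusion counts. The sketch below uses made-up labels, not the model's predictions, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy alone can look good even when the positive class is predicted poorly, which is why precision and recall are reported alongside it.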
Results with uniform discretisation:
Results with custom discretisation:
We see a significant improvement with the right discretization, which implies that domain knowledge is necessary not just for constructing the structural model, but also for fitting the Bayesian Network.
Let’s play a bit with the functionalities of CausalNex.
When querying the BloodPressure node to get its conditional probability distribution, we see that the Obese group (patients with BMI above 30) tends to develop high blood pressure more often than the Normal Weight group (patients with BMI between 18.5 and 25).
We can also spot some unintuitive conditional probabilities. The example below indicates that the probability of having high blood pressure is higher for the Underweight group than for the Normal Weight group. Although this might be true, it may also be an artifact of the low number of samples in the Underweight group (only 15 samples).
Let’s do some inference now! Inference is a way of asking what-if questions. Here are the questions I will try to answer by running inference on my Bayesian Network.
- What would the Outcome be if everyone had a healthy weight? If everyone had a healthy weight, there would be fewer positive diagnoses (from 0.42 to 0.30).
- What would the Outcome be if everyone had only a slight risk of diabetes in the family? If everyone had a slight risk of diabetes in the family (based on the DiabetesPedigreeFunction column), there would be slightly fewer positive diagnoses (from 0.42 to 0.41).
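What such a what-if query does under the hood can be sketched on a toy network. The example below assumes a hypothetical three-node graph (Age → BMI, Age → Outcome, BMI → Outcome) with invented probabilities; it is not the network fitted in this article, so the numbers differ from those above. An intervention replaces the distribution of the intervened node with a fixed state, which cuts the influence of its parents, and then the marginal of Outcome is recomputed. In CausalNex this kind of query goes through the inference engine, which as far as I know offers a `do_intervention` method; the code here is a plain Python illustration instead.

```python
# Hypothetical probabilities for a toy network: Age -> BMI, Age -> Outcome, BMI -> Outcome.
p_age = {"young": 0.6, "old": 0.4}
p_bmi_given_age = {
    "young": {"normal": 0.5, "obese": 0.5},
    "old": {"normal": 0.3, "obese": 0.7},
}
p_positive_given = {
    ("young", "normal"): 0.1,
    ("young", "obese"): 0.3,
    ("old", "normal"): 0.3,
    ("old", "obese"): 0.6,
}


def p_positive(bmi_cpd):
    """Marginal probability of a positive Outcome under the given BMI distribution."""
    return sum(
        p_age[age] * bmi_cpd[age][bmi] * p_positive_given[(age, bmi)]
        for age in p_age
        for bmi in bmi_cpd[age]
    )


def do_bmi(state):
    """Intervention: force BMI into one state regardless of Age."""
    return {
        age: {bmi: (1.0 if bmi == state else 0.0) for bmi in p_bmi_given_age[age]}
        for age in p_age
    }


baseline = p_positive(p_bmi_given_age)
intervened = p_positive(do_bmi("normal"))
print(f"P(positive) before: {baseline:.3f}, after do(BMI=normal): {intervened:.3f}")
```

The key point is that the Age distribution stays untouched: the intervention changes BMI for everyone, rather than selecting only the people who already had a normal BMI.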
Logistic Regression Comparison
In the end, let’s compare results with Logistic Regression.
Logistic Regression results:
Logistic Regression has metrics that are quite close to those of the Bayesian Network. However, the argument for using a Bayesian Network is not about performance. It is about the value added by being able to ask more questions and to better understand and quantify causality in your data.
I hope this was interesting and that you will find it useful. I will list the resources that I’ve used, and I look forward to testing more tools that can help put data into action.
Full code can be found on my GitHub account: https://github.com/ljubicavujovic/CausalNex
- Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database
- CausalNex documentation: https://causalnex.readthedocs.io/en/latest/
- DAG with NOTEARS paper: https://arxiv.org/abs/1803.01422
- Introduction to Bayesian Network: https://machinelearningmastery.com/introduction-to-bayesian-belief-networks/