“Every Natural Science contains as much truth as the mathematics it contains.”
Immanuel Kant
Hey there, my friend! I hope you are doing well and staying safe.
Today marks the unofficial beginning of the initiative I took up. I realize that this first post is already later than the commitment I aimed for; my sincerest apologies for that. I was adjusting to a lifestyle much different from the one I had before. However, I think I am settled now and hope to reduce the delay in posting from here on.
So coming back to what I wish to discuss today: I am going to talk about a book that I found extremely interesting! “Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations” by Mark J. Nigrini is a comprehensive book that describes the approaches a forensic investigator takes when analyzing a transactional dataset to identify fabricated or fraudulent observations. The book broadly focuses on testing target datasets against the concept of a “natural dataset” according to Benford’s Law.
What is Benford’s Law and which datasets are considered natural?
Benford’s Law, very briefly, states that in a natural dataset the distribution of the first digit d of each value follows a specific pattern:

P(d) = log10(1 + 1/d), for d = 1, 2, ..., 9

which works out to the following expected frequencies:

| First digit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Expected frequency | 30.1% | 17.6% | 12.5% | 9.7% | 7.9% | 6.7% | 5.8% | 5.1% | 4.6% |
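For a quick taste of the Python angle, here is a minimal sketch (my own illustration, not code from the book) that computes these expected first-digit frequencies directly from the formula:

```python
import math

# Expected first-digit probabilities under Benford's Law:
# P(d) = log10(1 + 1/d) for d = 1, ..., 9
benford_first_digit = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, prob in benford_first_digit.items():
    print(f"First digit {digit}: expected frequency {prob:.1%}")
```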
A similar pattern is observed for the first two digits as well; that test is described in detail in the Python notebook linked in this post.
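The notebook covers the details, but as a rough sketch, the first-two-digits probabilities follow the same formula with d running from 10 to 99. The first_two_digits helper below is my own illustrative implementation, not one taken from the book:

```python
import math

# Expected first-two-digits probabilities: P(d) = log10(1 + 1/d) for d = 10..99
benford_first_two = {d: math.log10(1 + 1 / d) for d in range(10, 100)}

def first_two_digits(value):
    """Return the first two significant digits of a number.

    Assumes the value prints in plain decimal notation (no scientific
    notation); returns None if there are fewer than two digits.
    """
    digits = str(abs(value)).replace(".", "").lstrip("0")
    return int(digits[:2]) if len(digits) >= 2 else None

print([first_two_digits(v) for v in [1204.5, 18.37, 0.0291]])  # [12, 18, 29]
```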
To give you an idea of which datasets can be tested against Benford’s Law, the following three conditions need to be satisfied:
- Records should represent the sizes of specific facts or events. Examples in finance could be market capitalizations of companies, daily sales volumes on a stock exchange, expense ledgers, etc. Essentially, other events or environments should have minimal impact on the facts presented, and the data must be sufficiently random as per your judgement.
- There must be no built-in minimum or maximum values in the data (a built-in minimum of zero is fine). If a dataset has a minimum, you would see a high frequency of the first digits of that minimum value, which would not conform to a “natural dataset” (see the sketch after this list).
- The variable you are analyzing must not be used as an identifier for a record, e.g., an application number or record number.
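To make the second condition concrete, here is a small simulation (entirely my own illustration) of amounts with a hard floor of 50, say a minimum fee. The clamped values pile up on first digit 5, far above Benford’s expected ~7.9%:

```python
import random
from collections import Counter

# Simulate 10,000 amounts with a built-in minimum of 50 (e.g., a minimum fee).
# Values that would fall below the floor are clamped to it, so the first
# digit 5 piles up far above Benford's expected ~7.9%.
random.seed(0)
amounts = [max(50.0, random.lognormvariate(3, 1.5)) for _ in range(10_000)]

first_digits = Counter(int(str(a)[0]) for a in amounts)
total = sum(first_digits.values())
for d in range(1, 10):
    print(f"First digit {d}: {first_digits[d] / total:.1%}")
```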
Why am I talking about this?
- In the various cases we encounter as fraud investigators, many datasets conform partially or completely to Benford’s Law. To give direction to an investigation and to assess preliminary findings from the datasets provided, Benford’s Law can serve as a strong indicator and lead generator in investigative methodologies.
- The book talks about analyzing Benford’s Law using Microsoft Excel and Access. However, as I have realized in my time as a professional, the datasets we receive to analyze and sample have become much more comprehensive and larger in size. While MS Excel is an intuitive tool, and also considered an industry standard for analysis, it does have limitations in handling large datasets.
- I aim to perform the same analysis using current standard tools and programming languages for data analysis, specifically Python. In the subsequent sections and posts, I will attempt to perform each test mentioned in the book using Python scripts and packages for data analysis and visualization.
Assessing a real-world dataset: conformity to Benford’s Law
- As I started reading the book and attempted to understand the science behind naturality, I immediately had the urge to apply the theoretical inferences made by Dr. Nigrini to more recent datasets. I was intrigued and curious to see whether a specific random dataset would hold up against the various properties that Benford’s Law (and the research attached to it) states.
- Secondly, I was motivated to learn whether the same steps could be performed using Python.
- Hence, I took up the task of working through the various analysis tests performed in the book using Python. Here is a link to the Python notebook describing and implementing the steps I have covered so far (GitHub).
- While the book analyzes a custom dataset, I analyzed data collected by the US Census Bureau. This dataset satisfied the three conditions mentioned above, and hence I wished to test it for conformity to Benford’s Law (a rough sketch of the pipeline follows below).
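As a sketch of what the pipeline in the notebook roughly looks like, here is how a column of values can be compared against the expected Benford proportions using pandas. The file name census_data.csv and the column name population are placeholders of my own; the real names are in the linked notebook:

```python
import math
import pandas as pd

# Placeholder file and column names; the real ones are in the linked notebook.
df = pd.read_csv("census_data.csv")
values = df["population"].dropna()
values = values[values > 0]

# Observed first-digit proportions.
observed = (
    values.astype(str).str.lstrip("0.").str[0].astype(int)
    .value_counts(normalize=True)
    .sort_index()
)

# Expected Benford proportions.
expected = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})

# Mean Absolute Deviation (MAD), one of the conformity measures
# discussed in the book: smaller means closer conformity.
mad = (observed.reindex(expected.index, fill_value=0) - expected).abs().mean()
print(f"MAD: {mad:.4f}")
```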
What next?
- In the subsequent posts, I will attempt to perform the same analysis on a synthetically created fraudulent dataset. Results from that analysis should give a taste of identifying multiple indicators of a potentially fraudulent dataset (a trivial illustration appears after this list).
- I have not yet had the chance to validate this approach on real fraudulent datasets in my time as a professional. While the work of a fraud investigations analyst is largely deterministic in nature, it is important to understand that statistical inferences (such as this one) are leads to be investigated further through documentary checks, rather than findings in themselves.
- Put simply, my understanding is that this analysis only helps guide the investigation in a particular direction rather than producing a definitive finding.
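As a trivial preview of why a fabricated dataset stands out (this is a toy illustration of my own, not the synthetic dataset I will actually build), consider amounts invented uniformly at random; their first digits come out roughly flat instead of following Benford’s curve:

```python
import math
import random
from collections import Counter

# A toy "fabricated" dataset: amounts invented uniformly at random.
# Uniform data does NOT follow Benford's Law, so the first digits
# come out roughly flat (~11% each) instead of Benford's curve.
random.seed(1)
fabricated = [random.uniform(100, 1000) for _ in range(10_000)]

observed = Counter(int(str(a)[0]) for a in fabricated)
total = sum(observed.values())
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed[d] / total:.1%} vs Benford {expected:.1%}")
```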
Do let me know your thoughts on this. Have a great week ahead. Until next time. 🙂