Context-Based Auto-Insurance Fraud Detection
Automotive InsurTech Company
In today’s world, fraudulent automotive insurance claims cost insurers a huge amount every year. A significant share of claims are fraudulent, taking forms such as staged accidents, claims that include pre-existing minor damage, fictitious passengers, and so on.
Insurance companies struggle to distinguish false claims from legitimate ones. They typically rely on existing metadata such as customer income, accident history, insurance policy type, policy premium, customer debt, and the driver’s challan (traffic-violation) history. On its own, this kind of data is of limited use: it can be easily manipulated, so it is not a reliable basis for detecting fraudulent claims.
An ideal way to detect fraudulent claims is to analyse the accident scenario (the accident statement or recording) provided by the customer.
The United Kingdom is one of the largest automotive innovation hubs and home to a few of the emerging AI-driven InsurTech startups. Our client is one of them, with a great vision to unlock the future of insurance with innovative technologies, solving auto-insurance fraud detection and claims automation while empowering insurers to deliver amazing customer experiences.
The primary challenge was collecting the right and useful data for this use case. Contextual information about accident scenarios is hard to obtain from open sources. To tackle this problem, we framed hundreds of our own accident scenarios, covering both legitimate and fraudulent accidents, and used various data augmentation techniques to generate more data around them.
The next challenge came in preprocessing. We had to decide which parts of the data could be removed without affecting the accident-scenario description. We found that removing stop words and punctuation led to a loss of information: after many trials of stripping such tokens and retraining the model, we concluded that preserving them is a necessity.
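The specific augmentation techniques are not detailed above; as one minimal sketch, assuming simple token-level perturbations (random swap and random deletion, in the spirit of "easy data augmentation" methods), variant statements can be generated like this:

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of the token list with n_swaps random position swaps."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def augment(statement, n_copies=3, seed=0):
    """Generate n_copies perturbed variants of a claim statement."""
    rng = random.Random(seed)
    tokens = statement.split()
    variants = []
    for _ in range(n_copies):
        perturbed = random_deletion(random_swap(tokens, 2, rng), 0.1, rng)
        variants.append(" ".join(perturbed))
    return variants
```

Label-preserving perturbations like these only multiply surface variety; they cannot add genuinely new scenarios, which is why hand-framing the base statements came first.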
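In code, the conclusion above amounts to a light-touch cleaner that normalises whitespace and case but deliberately keeps stop words and punctuation; this is an illustrative sketch, not the project's actual pipeline:

```python
import re

def clean_statement(text: str) -> str:
    """Light normalisation that keeps stop words and punctuation intact,
    since words like "not" and "no" carry signal in claim statements."""
    # Normalise curly quotes so tokenisation is consistent
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of whitespace/newlines into single spaces
    text = re.sub(r"\s+", " ", text.strip())
    return text.lower()
```

Contrast this with a typical bag-of-words pipeline, which would strip "not", "no", and question marks and so blur the difference between "it was not my fault" and "it was my fault".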
We trained and tested several text classification models to classify whether a statement is fraudulent or legitimate.
For example, let’s take the statement below:
“Yesterday, I parked my car in front of my office, in the parking area. Another car tried to park the car between my car and the other car, but there was not enough space and it hit my car from behind. The result is that the body from behind is smashed and the front of the car body is also damaged with broken headlights.”
This statement is flagged as fraudulent: the scenario lacks detail (no time or date is mentioned), and the damage described is severe even though the incident appears to be a minor one.
Now let’s consider this scenario:
“On 27th August at 12:15pm, I parked my car in front of my office, in the parking area. Another car tried to park the car between my car and the other car, but there was not enough space and it hit my car from behind. There is a minor dent on the right side of the back bumper. It will cost around Rs. 2000 to get it fixed. I also have the CCTV footage to support my statement.”
Here, most of the important details are given, along with proof of the incident (the CCTV footage). Hence, we can say this statement is legitimate.
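The reasoning above, that dates, times, costs, and evidence make a statement credible, can be illustrated with a toy rule-based check. The patterns below are illustrative only, not the model's actual features:

```python
import re

# Toy detail detector mirroring the reasoning above; patterns are illustrative.
DETAIL_PATTERNS = {
    "date":     r"\b\d{1,2}(st|nd|rd|th)?\s+(january|february|march|april|may|june"
                r"|july|august|september|october|november|december)\b",
    "time":     r"\b\d{1,2}:\d{2}\s*(am|pm)?\b",
    "cost":     r"(rs\.?|£|\$)\s*\d+",
    "evidence": r"\b(cctv|footage|photo|witness|dashcam)\b",
}

def detail_score(statement: str) -> int:
    """Count how many kinds of verifiable detail the statement mentions."""
    text = statement.lower()
    return sum(bool(re.search(pattern, text)) for pattern in DETAIL_PATTERNS.values())
```

On the two statements above, the vague one scores 0 and the detailed one scores 4, which matches the intuition; a learned model generalises this far beyond a handful of hand-written patterns.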
So, to classify this kind of statement, we created a custom text classification model built on a transformer-based architecture, which distinguishes fraudulent from legitimate statements and gave good results. The model is capable of modeling bidirectional contexts and uses a generalised autoregressive pre-training method.
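The production model is transformer-based as described, and its details are not shown here. As a much simpler, self-contained illustration of binary text classification on such statements (a stand-in, not the actual model), here is a bag-of-words multinomial Naive Bayes classifier:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Toy bag-of-words Naive Bayes; a stand-in, NOT the transformer model."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        best, best_lp = None, float("-inf")
        for c in self.classes:
            # Log prior plus Laplace-smoothed log likelihood of each token
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values())
            for t in tokens:
                lp += math.log((self.word_counts[c][t] + 1) / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

A model like this treats the statement as an unordered word bag; the transformer's advantage is precisely that it models word order and bidirectional context, which matters for statements whose meaning hinges on phrasing.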
Adding more data to our database and retraining the model could yield a significant improvement in accuracy, on the order of a 5%–6% increase over the current figure.
We can also take the metadata into consideration alongside the contextual data to determine whether a scenario is fraudulent or legitimate.
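One simple way to combine the two signal sources is to concatenate text-derived features with numeric metadata into a single feature vector for a downstream classifier. The feature choices and metadata keys below (annual_premium, prior_claims) are hypothetical, for illustration only:

```python
def build_feature_vector(statement: str, metadata: dict) -> list:
    """Concatenate simple text-derived features with numeric metadata.
    The metadata keys used here are illustrative, not the real schema."""
    tokens = statement.lower().split()
    text_features = [
        len(tokens),                                     # statement length
        float("cctv" in tokens or "footage" in tokens),  # evidence mentioned?
        float(any(":" in t for t in tokens)),            # time-like token present?
    ]
    meta_features = [
        metadata.get("annual_premium", 0.0),  # hypothetical policy field
        metadata.get("prior_claims", 0),      # hypothetical claim-history field
    ]
    return text_features + meta_features
```

This keeps the contextual model's text signal while letting manipulable metadata act only as supporting evidence rather than the sole basis for a decision.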
We tested the model on unseen fraudulent and legitimate statements provided by volunteers, and it achieved an accuracy of 91.5%, which is quite a good number considering the data it was trained on.