Banks and other financial organizations process billions of transactions every day. As part of their day-to-day activities, they also have to detect and prevent payment fraud, which is becoming more sophisticated over time. Historically, these organizations relied on manual or rule-based systems for fraud detection, but these are no longer sufficient.
Advancements in data science mean that today we are able to build fast and effective systems for fraud prediction that continuously learn and adapt to evolving fraud patterns.
In this article, we introduce payment fraud prediction as a data science problem.
Understanding Payment Fraud
Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2018 found that half (49 percent) of the 7,200 companies they surveyed had experienced fraud of some kind.
This is an increase from the PwC 2016 study in which slightly more than a third of organizations surveyed (36%) had experienced economic crime.
What is Fraud?
“The term fraud here refers to the abuse of an organization’s system without necessarily leading to direct legal consequences.”
In a competitive environment, fraud can become a business-critical problem if it is very prevalent and if the prevention procedures are not fail-safe.
Payment Fraud Detection
Payment fraud detection, as part of overall fraud control, automates the screening and checking process and helps reduce its manual parts. It is a challenging problem because it is impossible to be certain about the legitimacy of the intention behind an application or transaction.
Types of Credit Card Fraud
a). Identity Theft
Someone else obtains your card details, for example by observing them or through a fake phone call that convinces you to share them.
b). Stolen Cards
Your card is lost or stolen, and the person who has it knows how to use it.
c). Hacking
Although it is the least likely scenario, high-level hacking of bank account details does happen.
Challenges for Fraud Detection Systems
There are many challenges for fraud detection systems, including:
- Imbalanced Data
- Cost of Fraud Detection
- Enormous Quantities of Real Time Data
- Rapidly Changing Fraud Patterns
- Adaptive techniques used against the model by the scammers
How to Tackle these Challenges?
Speed
The model used must be simple and fast enough to detect the anomaly and classify it as a fraudulent transaction as quickly as possible.
Class Imbalance
Imbalance can be dealt with by techniques such as resampling the training data or weighting the classes, so that the rare fraudulent transactions are not drowned out by the legitimate ones.
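One common way to handle imbalance is class weighting. The following is a minimal sketch, not a method taken from this article: it computes "balanced" weights as n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for class_weight="balanced". The function name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels: 998 legitimate (0) and 2 fraudulent (1) transactions.
labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(labels)

# The rare class gets a much larger weight than the common class.
print(weights)
```

With these weights, each class contributes the same total weight to the training loss, so a model cannot minimise its error simply by ignoring the fraudulent class.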
Protecting Privacy
To protect user privacy, the dimensionality of the data can be reduced, and key fields can be encrypted or anonymized.
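As an illustration of anonymizing key fields, the sketch below replaces sensitive values with a keyed hash (HMAC-SHA256). This is pseudonymization rather than encryption: the original value cannot be recovered from the digest, but the same input always maps to the same digest, so the field can still be used for grouping and counting. The field names, the record, and the key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be stored outside the source code.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical choice of which fields count as sensitive.
SENSITIVE_FIELDS = {"card_number", "email"}


def pseudonymize(value, key=SECRET_KEY):
    """Replace a sensitive value with the hex digest of a keyed SHA-256 hash."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical transaction record.
record = {"card_number": "4111111111111111", "email": "user@example.com", "amount": 59.0}

# Hash the sensitive fields and leave the remaining fields untouched.
anonymized = {
    field: pseudonymize(str(value)) if field in SENSITIVE_FIELDS else value
    for field, value in record.items()
}
```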
Fraud Experts
A more trustworthy source, such as fraud experts, must double-check the data, at least the data used for training the model.
Analysis of Fraud Dataset
In this example we want to predict the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.
This is a binary classification problem, i.e. our target variable is a binary attribute (is the transaction fraudulent or not?) and our goal is to classify transactions as “fraudulent” or “not fraudulent” as well as possible.
Data Organization
The data is split into two files, identity and transaction, which are joined by TransactionID.
Not all transactions have corresponding identity information.
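Because identity information is missing for some transactions, the join has to keep every transaction and fill the identity columns with empty values where there is no match (a left join). The sketch below shows this with plain Python dictionaries so that it runs without extra dependencies. The rows are made up, and has_identity is a hypothetical helper column. In pandas, the equivalent would typically be a merge with how="left" on TransactionID.

```python
# Hypothetical rows; the column names follow the dataset description in this article.
transactions = [
    {"TransactionID": 1, "ProductCD": "W"},
    {"TransactionID": 2, "ProductCD": "C"},
    {"TransactionID": 3, "ProductCD": "W"},
]
identity = [
    {"TransactionID": 2, "DeviceType": "mobile"},
]

# Index the identity rows by TransactionID for constant-time lookup.
identity_by_id = {row["TransactionID"]: row for row in identity}

# Left join: keep every transaction, and use None where no identity row exists.
joined = []
for tx in transactions:
    ident = identity_by_id.get(tx["TransactionID"], {})
    joined.append({
        **tx,
        "DeviceType": ident.get("DeviceType"),
        "has_identity": tx["TransactionID"] in identity_by_id,
    })

for row in joined:
    print(row)
```

The has_identity flag can itself be a useful feature, since the mere presence or absence of identity data may differ between fraudulent and legitimate transactions.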
Feature Analysis
Categorical Features – Transaction
- ProductCD
- card1 – card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 – M9
Categorical Features – Identity
- DeviceType
- DeviceInfo
- id_12 – id_38
TransactionDT
timedelta from a given reference datetime (not an actual timestamp).
“TransactionDT first value is 86400, which corresponds to the number of seconds in a day (60 * 60 * 24 = 86400) so I think the unit is seconds.
Using this, we know the data spans 6 months, as the maximum value is 15811131, which would correspond to day 183.”
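The arithmetic in the quote above can be checked with a short sketch. It assumes, as the quote does, that TransactionDT is measured in seconds. The function name is hypothetical.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86400


def split_transaction_dt(dt_seconds):
    """Split a TransactionDT offset into (whole days elapsed, hour of day)."""
    days, remainder = divmod(dt_seconds, SECONDS_PER_DAY)
    return days, remainder // 3600


first_dt, last_dt = 86400, 15811131  # minimum and maximum values quoted above

first = split_transaction_dt(first_dt)  # exactly one day after the reference point
last = split_transaction_dt(last_dt)    # 182 whole days plus almost a full day

# Time covered between the first and last transaction, in days.
span_days = (last_dt - first_dt) / SECONDS_PER_DAY

print(first, last, round(span_days, 1))
```

The maximum value falls just short of 183 full days, which matches the quoted estimate of roughly six months of data. The same split also gives day-of-week and hour-of-day style features without needing the real reference date.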
TransactionAMT
transaction payment amount in USD
ProductCD
product code
The product for each transaction
card1 – card6
payment card information, such as card type, card category, issue bank, country, etc.
addr
address
dist
distance
P_emaildomain and R_emaildomain
Purchaser and Recipient email domain
C1-C14
counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
D1-D15
timedelta, such as days between previous transaction, etc.
M1-M9
match, such as names on card and address, etc.
Vxxx
Vesta engineered rich features, including ranking, counting, and other entity relations.
“For example, how many times the payment card associated with a IP and email or address appeared in a 24 hours time range, etc.”
Data Related Challenges
Main challenges involved in credit card fraud detection are:
- Enormous data is processed every day.
- Model must be fast enough to respond to the scam in time.
- Imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which makes the fraudulent ones very hard to detect.
- Data availability as the data is mostly private.
- Misclassified data can be another major issue, as not every fraudulent transaction is caught and reported.
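To see why the imbalance mentioned above is a problem, consider a model that labels every transaction as legitimate. The sketch below uses made-up labels with the 99.8% / 0.2% split from the list and compares accuracy with recall, i.e. the share of fraudulent transactions that were actually caught. The function name is hypothetical.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the fraud class), where 1 marks a fraudulent transaction."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    fraud_total = sum(t == 1 for t in y_true)
    fraud_caught = sum(t == 1 and p == 1 for t, p in pairs)
    accuracy = correct / len(y_true)
    recall = fraud_caught / fraud_total if fraud_total else 0.0
    return accuracy, recall


# Hypothetical data: 998 legitimate and 2 fraudulent transactions.
y_true = [0] * 998 + [1] * 2

# A trivial "model" that never flags anything as fraud.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)
```

The trivial model is right on 99.8% of the transactions while catching no fraud at all, so accuracy alone says very little about a fraud detection model.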
References
https://www.kaggle.com/hassanamin/fraud-complete-eda/data