1 Introduction
As e-commerce has grown explosively in recent years, many merchants have built centralized platforms that let consumers buy products with "One-Click". Although online (card-not-present) transactions offer great convenience to consumers, they also carry a higher risk of transaction fraud. As a result, merchants have to devote substantial resources to developing effective and efficient mechanisms for fraud detection and transaction risk control. These control systems usually consist of two core engines: a risk scoring engine and a risk control engine.
The risk scoring engine is designed to measure the risk level of each transaction. Instead of assigning each transaction an explicit 0-1 (legitimate/fraudulent) classification, the majority of merchants calculate a risk score for each transaction based on its attributes, such as purchase price, order quantity, payment information and product market. The higher the score of a transaction, the more likely it is to be fraudulent. With the help of big data and machine learning technologies, modern scoring models have been significantly improved using streaming historical data.
The risk control engine gets involved once the risk score is calculated. Some transactions that violate predetermined policies or rules are rejected instantly. These rules and policies stem from regulations made by governments and merchants, or are needed when obvious frauds require immediate blocking. However, the majority of frauds are not caught by these rules, so the risk control engine needs to step in and prevent further fraudulent transactions using the risk scores. Conventional risk controls apply static risk cutoff score thresholds: approve transactions with risk scores lower than the low score threshold; reject transactions with scores higher than the high score threshold; and use human intelligence (manual review) to further investigate transactions with risk scores in between. The cutoff score thresholds are set so that the inline fraud detection system can optimally prevent fraudsters' attacks. This threshold band method is widely applied by e-commerce merchants and financial institutions. Although risk score evaluation has improved significantly during the past few years, decisions based on risk scores are still not always reliable, for three main reasons: 1) rapid changes in fraudsters' behavior patterns; 2) loss of fraud signals from rejected transactions; and 3) long data maturity lead time. Because of these issues, the conventional fraud control engine lacks flexibility and the capability of real-time self-adjustment, and hence cannot always provide the most accurate risk decisions.
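The conventional threshold band described above can be sketched in a few lines of Python. This is only an illustration: the function name, the 0-100 score scale and the cutoffs 30 and 80 are assumptions made for the example, not values from this paper.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REVIEW = "manual_review"
    REJECT = "reject"


def threshold_band_decision(risk_score: float, low: float, high: float) -> Decision:
    """Static threshold band: approve below `low`, reject above `high`, review in between."""
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if risk_score < low:
        return Decision.APPROVE
    if risk_score > high:
        return Decision.REJECT
    return Decision.REVIEW


# Hypothetical cutoffs on an assumed 0-100 score scale.
decisions = [threshold_band_decision(s, low=30, high=80) for s in (12, 55, 93)]
```

Because the two cutoffs are fixed, every transaction with the same score receives the same decision regardless of how banks, reviewers or fraudsters are currently behaving, which is the rigidity the rest of the paper addresses.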
The research motivation for this paper stems not only from the drawbacks of the current fraud control system but also from a broader view of the various risk control parties who contribute to the final decisions in different transaction flows. A merchant's risk control decision making should not be isolated from the entire decision environment, in which payment issuing banks and the manual review team make follow-up decisions that determine the final decision on every transaction. Figure 1 depicts how a transaction is processed through different decision stations until it reaches its final decision.
When a transaction arrives, the risk scoring engine calculates its risk score based on all its associated features. The risk control engine then makes a decision (approval, rejection or manual review) using important attributes of the transaction, including its risk score. If the transaction is approved by the risk control engine, it is sent to the bank for the follow-up decision (a bank-authorized transaction is marked as Final Approval, and a bank-declined transaction is marked as Final Rejection). If the transaction is rejected by the risk control engine, it is directly marked as Final Rejection. If the transaction is neither approved nor rejected by the risk control engine, it is also sent to the bank first. Only if the bank authorizes the transaction does it reach the manual review (MR) agents for further investigation and its final decision (a transaction that is authorized by the bank and approved by MR is marked as Final Approval, and is marked as Final Rejection otherwise). The blue box indicates the target of this research, and the grey boxes indicate the other decision-making parties involved.
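The routing logic described above (and depicted in Figure 1) can be summarized as a small function. The sketch below is one reading of the text with hypothetical string labels; the bank and MR outcomes are passed in as booleans purely for illustration.

```python
FINAL_APPROVAL = "final_approval"
FINAL_REJECTION = "final_rejection"


def final_decision(engine_decision: str, bank_authorizes: bool, mr_approves: bool = False) -> str:
    """Follow a transaction through the risk control engine, the bank and manual review."""
    if engine_decision not in {"approve", "review", "reject"}:
        raise ValueError(f"unknown engine decision: {engine_decision!r}")
    # A rejection by the risk control engine is final; the bank never sees the transaction.
    if engine_decision == "reject":
        return FINAL_REJECTION
    # Approved and review-bound transactions both go to the bank first.
    if not bank_authorizes:
        return FINAL_REJECTION
    if engine_decision == "approve":
        return FINAL_APPROVAL
    # Review-bound transactions reach the MR agents only after bank authorization.
    return FINAL_APPROVAL if mr_approves else FINAL_REJECTION
```

For example, `final_decision("review", bank_authorizes=True, mr_approves=False)` returns `"final_rejection"`, while a transaction rejected by the engine is rejected no matter what the bank would have done.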
Banks are regarded as a single decision party for simplicity. From the data, we found that when the risk control engine approved and submitted transactions that included more frauds (false negative: wrongful approval) to the banks, the banks sensed this, became more conservative and rejected more good transactions (false positive: wrongful rejection). The data also showed that when the risk control engine submitted transactions that included fewer frauds (true negative: rightful approval) to the manual review (MR) teams, fraud patterns were less prevalent and less recognizable, so the manual review teams had a much harder time making accurate risk decisions. The interactions between the different decision parties, legitimate customers and fraudsters are shown in Figure 2.
Considering the high total dollar amount of e-commerce transactions taking place in such a rapidly changing risk decision environment, there is a strong need to design a fraud control engine that can overcome all the aforementioned challenges and optimize decision accuracy so that a higher profit can be reached. In this paper, the proposed control framework is designed to achieve the following:

Adaptive learning: the proposed risk control engine is trained on streaming transaction records, which may include incomplete information such as immature labels, and it can adaptively recognize a new decision environment;

Discriminative control: instead of using static uniform cutoff thresholds, the proposed control system assigns inline decisions (Approve, Reject or Manual Review) in real time based on the attributes of each incoming transaction;

Data-driven: the risk control is entirely data-driven, which helps avoid unreliable, ad hoc, hand-coded rules for risk decisions.
A field test on real Microsoft online transaction data suggested that the proposed control system could significantly improve the company's profit by reducing the loss caused by inaccurate decisions (including both wrongful approvals and wrongful rejections).
The rest of this paper is structured as follows. Section 2 outlines previous research related to fraud control and discusses the research gap. Section 3 introduces the Perfect State Dynamic Model with a rigorous mathematical formulation and discusses its intractability. Three approximate dynamic control models are proposed in Section 4, and test results on their performance are presented in Section 5. Section 6 concludes the paper.
2 Related Research
Research on online shopping fraud detection using machine learning started in the early 1990s, right after the emergence of e-commerce, and its major task was to evaluate the fraud risk level of transactions. Fraud risk level was measured using risk scores, and thus research on risk scoring gained widespread attention. These scoring engines were inspired by neural networks (GhoshReilly1994FraudDetectionNN; Aleskerov1997FraudDetectionNN; Dorronsoro1997FraudDetectionNN), decision trees (mena2002FraudDetectionBook; Kirkos2007FraudDetectionDT; Sahin2013FraudDetectionDT), random forests Bhattacharyya2011FraudDetectionRF, the network approach APATE2015FraudDetectionNW and deep neural networks Kang2016FraudDetectionCNN. Readers interested in this topic may also refer to Review2011DSS and the references therein for other papers that discuss different scoring methods. Although current research acknowledges that fraud patterns keep changing and that fraud risk scores are not always reliable, no existing paper discusses how to optimally use these scores in fraud control operations. Moreover, data mining papers provide weak guidance on detailed operations, as the risk score is only a blurred expression of fraud. There is currently no literature demonstrating how to deal with transactions in the "gray zone", where the risk score of a transaction is neither too low nor too high. Additionally, no literature has addressed the interactions of decisions made by multiple parties in transaction risk control. The main reason for this lack of literature is that e-commerce data are strictly confidential, so academic researchers have very limited access to them. This paper fills the gap between transaction fraud evaluation and systematic risk control operations.

Dynamic control research started in the 1940s. We suggest MDP1994 and DP1995
for a comprehensive introduction to dynamic optimal control methods, as well as their applications in communication, inventory control, production planning, quality control, etc. In this research, we investigate an important segment of dynamic control research, dynamic optimal control with incomplete information, as the main technical foundation of our paper, which targets the challenge that some fraud control systems can only obtain and use partially mature data for modeling. One related line of research is the Partially Observed Markovian Decision Process (POMDP). POMDP is a sequential decision-making model that deals with inaccurate and incomplete observations of the system state or decision environment. It models or infers the transition probability matrix and the underlying relationship between the partially observed and the true (fully observed) state information. However, POMDP brings significant computational challenges and often requires carefully designed heuristic algorithms to achieve suboptimal solutions. Structural properties of the reward function and computational algorithms for POMDP are available in POMDP1973Finite, POMDP1978InfiniteDiscount, POMDP1980Platzman, POMDP1989SolutionProcedureWhite, POMDP1991WhiteSurvey, POMDP1994AAAI94, POMDP1998HeuristicWhite, POMDP2004HeuriticWhite and POMDP2010Geometric. Another related line of research in dynamic control with incomplete information is Adaptive Dynamic Programming (ADP). ADP assumes that perfect information is not known a priori and needs to be gradually learned from historical data or feedback signals of the dynamic system. ADP concepts date back to the 1970s and have become one of the core methods in reinforcement learning. In this paper, we only highlight a number of papers addressing the Actor-Critic structure, the branch of ADP research that is closest to our research in depth and breadth. Readers who are interested in ADP should refer to
ADP2009Survey for a comprehensive review of ADP with respect to theoretical developments as well as application studies. The Actor-Critic structure was first proposed in ACD1983Sutton, which suggested an optimal control scheme that learns while improving. The Actor-Critic structure implies two steps: an actor applies an action to the environment and receives feedback from a critic; action improvement is then guided by the evaluation signal feedback. The decision environment feedback is recognized and reinforced after receiving feedback rewards, using neural networks (ADPNN1989Werbos, NNcontrol1991MSW and NeuroDP1996), probabilistic models (RL1998 for bandit, Monte Carlo and approximate MDP methods) and other stochastic models SLO2009Cao. There are two main challenges in solving ADP: (1) the curse of dimensionality: as the dimensions of the state space and action space get extremely high, a large amount of information must be stored and the computational cost grows explosively ADPCoD2009; (2) the implicit form of objective functions: the reward/cost function in dynamic control does not have an explicit form and needs to be carefully approximated OLADP2013. Powell introduced several parametric approximation methods to mitigate the curse of dimensionality in ADPCoD2009. Powell et al. in OLADP2013 proposed general dynamic control heuristics in ADP, including myopic control and lookahead control with different approximation schemes for the cost function and the decision environment transition probabilities, where the decision environment is learned using local search, regression or Bayesian methods in either an offline or an online fashion.

Our research is motivated by the current research gap in the risk management literature. The problem formulation in Section 3 is supported by the POMDP literature, and the heuristic solution algorithms are inspired by ideas in the ADP literature. We study some realistic issues in the fraud control domain, and adapt the general POMDP models and ADP heuristics to fit the structure of the fraud control problem. The model and algorithms proposed in this paper are not limited to transaction fraud control, and can be extended to other fraud control and defense applications in finance, healthcare, electrical systems, robotics and homeland security.
3 Problem Formulation
In this section, we rigorously formulate the dynamic control model under the assumption that state information and state transition information can be exactly characterized. However, the state information and state transition probabilities in the perfect state model, called (Perfect) in Section 3.1, are not explicit and need to be approximated from incomplete streaming data. Section 3.2 discusses the challenges in solving the dynamic model.
3.1 Perfect State Dynamic Model
We first focus and investigate the expected profit in transaction level, which are the building blocks of the control system. Let , and denote risk score, profit margin and costs (cost of goods, manual review costs, chargeback fine, etc.) respectively. has a finite integral support with upper bound , and , are real numbers. According to system logistics shown in Figure 1, profits of approve (), review () and reject () this transaction can be formulated as follow:
where is unit labor cost for each manual review, and is the indicator function, i.e. given event ,
Given the fact that risk score is a comprehensive evaluation of the risk level, which is estimated using thousands of transaction attributes, we assume that for any two transactions that have the same risk score
, i.e. and , the interactive effect of bank or MR are identical, which can be expressed in the mathematical form as,(1) 
With Eq.(1), the expected profit for each risk operation for transaction can be derived as
(2a)  
(2b)  
(2c) 
The g functions in Eq. (2a)-(2c) are probabilities of different events given risk score . Here "g function" is short for "gold function", whose values represent profit-related probabilities associated with different risk decisions.
We further delve into a realistic dynamic system, in which bank and MR decision behaviors are changing dynamically. We consider a discrete time dynamic control model with infinite time horizon . Let be a set of transactions occurred during period . Element of this transaction set is the th transaction in period , with risk score , margin and costs . Let be the total number of transactions occurred during period , so
is then a random variable. We can then formally define the dynamic control model as follow.

State space: , which is a set of 5 functions values at all risk scores. In period , the state can be expressed as .

Action space in period : , which has feasible decision sequences. Let be one feasible action sequence in period , and for the th transaction, risk control engine can choose action .

State transition probability matrix: , where is the probability that system move from state to state when taking action sequence a. We assume that is fixed but implicit through out this paper.
Let be the rewardtogo function at the beginning of period , then this stochastic dynamic model can be formulated with Bellman’s equation as
(Perfect) 
where is a discount factor of future rewards, and reward function can be formulated as
Throughout the paper, we assume that a finite number of transactions occur in each period and that the reward of each transaction is bounded. Theorem 3.1 gives the conditions under which Model (Perfect) has a unique optimal solution.
Theorem 3.1.
If (1) the number of transactions occurring in each period is finite and the margin/loss from each transaction is bounded, and (2) the arrival process of transactions is stationary, then there exists an optimal profit satisfying
and there is a unique solution to this equation.
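To give some intuition for why a unique solution can be expected, the following toy example runs value iteration on a small discounted Markov decision process with bounded rewards. It is not the (Perfect) model: the two states, two actions, rewards, transition probabilities and the discount factor 0.9 are all made up for illustration. For such problems the Bellman update is a contraction, so iterations started from very different initial guesses approach the same fixed point.

```python
from typing import List

# Toy MDP (all numbers are illustrative assumptions).
# REWARDS[s][a] is the bounded one-step reward; TRANSITIONS[s][a][s2] is a probability.
REWARDS = [[1.0, 0.0], [0.0, 2.0]]
TRANSITIONS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
GAMMA = 0.9  # discount factor, strictly below one


def bellman_update(values: List[float]) -> List[float]:
    """Apply one Bellman optimality update to a value vector."""
    updated = []
    for s in range(len(values)):
        best = max(
            REWARDS[s][a] + GAMMA * sum(p * values[s2] for s2, p in enumerate(TRANSITIONS[s][a]))
            for a in range(len(REWARDS[s]))
        )
        updated.append(best)
    return updated


def value_iteration(initial: List[float], tol: float = 1e-10, max_iter: int = 10_000) -> List[float]:
    """Iterate the Bellman update until successive value vectors differ by less than tol."""
    values = list(initial)
    for _ in range(max_iter):
        updated = bellman_update(values)
        if max(abs(u - v) for u, v in zip(updated, values)) < tol:
            return updated
        values = updated
    return values


v_from_zero = value_iteration([0.0, 0.0])
v_from_far = value_iteration([100.0, -50.0])
gap = max(abs(a - b) for a, b in zip(v_from_zero, v_from_far))
```

Here `gap` is negligible, meaning both runs end at the same value function up to the stopping tolerance. The difficulty discussed next is not existence or uniqueness but the fact that, in the fraud control setting, the state and transition probabilities are not available in explicit form.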
3.2 Incomplete Information and Intractableness of (Perfect) Model
Although Theorem 3.1 provides solid guidance for finding the optimal control strategy, several issues arise when implementing Model (Perfect) in practice.

Exact state information is unavailable: State information, i.e. functions, can only be inferred using partially mature data, since data maturity lead time is a latent random variable with range . We have no way to obtain the true time point of maturity for each transaction until the transaction is eventually marked as a chargeback. However, through analyzing the historical data we do have the knowledge that after periods of time the fraud status (having chargeback or not) should be all mature;

Reward functions are not entirely exact: Reward functions, , are based on estimations of functions, and is not known a priori. Therefore, reward functions could vary due the different estimated functions and the different ;

Transition probability matrix does not have an explicit form: State space has extremely high dimension (five functions estimated at risk scores); Action space has exponential dimension that explosively increase as number of transactions increases ( has possible decision sequences).
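The growth of the action space can be made concrete with a short calculation. Assuming each of the n transactions in a period receives one of the three actions (approve, review, reject), there are 3^n possible decision sequences; the helper names below are hypothetical.

```python
from itertools import product

ACTIONS = ("approve", "review", "reject")


def count_action_sequences(num_transactions: int) -> int:
    """Number of distinct decision sequences when each transaction gets one of three actions."""
    return len(ACTIONS) ** num_transactions


# Explicit enumeration is only feasible for a handful of transactions.
all_sequences_for_three = list(product(ACTIONS, repeat=3))

# With only 100 transactions in a period, enumeration is already hopeless.
sequences_for_hundred = count_action_sequences(100)
```

Three transactions give 27 sequences, whereas 100 transactions give more than 10^47, so any method that searches over whole action sequences is impractical at realistic transaction volumes.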
The lag in data maturity and the curse of dimensionality make Model (Perfect) intractable. We therefore propose three approximate dynamic heuristics to obtain suboptimal control decisions. Details of these control algorithms are presented in Section 4.
All dynamic control heuristics require a base module that uses incomplete information, such as mature old data and partially mature recent data, to infer the future g functions used in these heuristic algorithms. Data mining results suggest that correlations exist between the recent period's partially mature chargeback rate and bank/MR behavior patterns. This implies that we should track the partially mature chargeback rate of the transaction portfolio in period , so that the g functions can be properly calibrated. This agrees with business intuition in multi-party fraud control: if the bank and MR learn that recently received transactions have a high chargeback rate, they become more conservative in their decision making and reduce the number of authorization/approval decisions to prevent more undesirable chargebacks. Two decision environment modules, the Current Environment Inference (CEI) module and the Future Environment Inference (FEI) module, are adopted from GFunctionEstimation2018TKDE . A discussion of these two modules is beyond the scope of the current paper; we refer readers to GFunctionEstimation2018TKDE for details. CEI and FEI use historical data to produce g function estimates, which contributes to the data-driven property of our risk control framework.
4 Dynamic Risk Control Algorithms
In this section, we propose three dynamic risk control algorithms: Naive, Myopic and Prospective control. Naive control is the simplest heuristic and only uses fully mature data before period . Myopic control estimates the current decision environment using the CEI module with both mature and immature data in period . The most complex control model, Prospective control, further takes into account that the current decision influences not only the current profit but also the near-future profit. The three models are presented in Sections 4.1-4.3.
4.1 Naive control
Figure 3 depicts decision flow of naive control. At the beginning of period , decision engine uses mature data before period to estimate functions.
In period , let =, , , , be the estimated current state, and be the action space of period . Then feasible action sequence has a form of , where . Naive model disregards the future effects. For transactions take place in period , we need to solve the following model to get action sequence .
(Naivet)  
Naive control repeats this procedure for each period . Theorem 4.1 states that (Naivet) can be solved by greedily choosing, for each incoming transaction, the decision option that yields the highest expected reward. Details of the Naive control policy are summarized in Algorithm 1.
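The greedy rule can be illustrated with a short sketch. When the period objective is a sum of per-transaction expected rewards, picking the best action for each transaction separately also maximizes the total, which the example checks against brute-force enumeration on three transactions. The reward function `toy_expected_reward` and its numbers are hypothetical stand-ins, not the expected profit formulas of Eq. (2a)-(2c).

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

ACTIONS = ("approve", "review", "reject")
Txn = Dict[str, float]
RewardFn = Callable[[Txn, str], float]


def toy_expected_reward(txn: Txn, action: str) -> float:
    """Made-up expected reward; the fraud probability is assumed to be score / 100."""
    p = txn["score"] / 100.0
    if action == "approve":
        return (1 - p) * txn["margin"] - p * txn["cost"]
    if action == "review":
        # Assumption: review catches most fraud but costs 2.0 in labor.
        return (1 - p) * txn["margin"] - 0.1 * p * txn["cost"] - 2.0
    return 0.0  # reject: no margin earned, no loss


def greedy_actions(transactions: Sequence[Txn], reward: RewardFn) -> List[str]:
    """Choose, transaction by transaction, the action with the highest expected reward."""
    return [max(ACTIONS, key=lambda a: reward(txn, a)) for txn in transactions]


def total_reward(transactions: Sequence[Txn], actions: Sequence[str], reward: RewardFn) -> float:
    return sum(reward(txn, a) for txn, a in zip(transactions, actions))


def brute_force_best_total(transactions: Sequence[Txn], reward: RewardFn) -> float:
    """Best total over all 3^n action sequences (only feasible for tiny n)."""
    return max(
        total_reward(transactions, seq, reward)
        for seq in product(ACTIONS, repeat=len(transactions))
    )


period = [
    {"score": 2, "margin": 20.0, "cost": 100.0},
    {"score": 50, "margin": 20.0, "cost": 100.0},
    {"score": 95, "margin": 20.0, "cost": 100.0},
]
chosen = greedy_actions(period, toy_expected_reward)
greedy_total = total_reward(period, chosen, toy_expected_reward)
best_total = brute_force_best_total(period, toy_expected_reward)
```

With these toy numbers the greedy rule approves the low-score transaction, reviews the mid-score one and rejects the high-score one, and its total matches the brute-force optimum while evaluating only three options per transaction.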
Theorem 4.1.
Optimal action sequence of (Naivet) can be obtained by greedy algorithm, i.e. for , sequentially set
4.2 Myopic control
Figure 4 depicts the decision flow of myopic control.
This control model is designed to resolve the pattern recognition lag issue due to the delay of data maturity. We adopt CEI module from
GFunctionEstimation2018TKDE to infer current period decision environments. Mathematically, CEI maps matured function trajectories (, , , , and : ) and partially mature chargeback rate to estimate functions ( and ) at current period, .(CEI) 
where is calculated by
Then for transactions occurred in period , Myopic Dynamic Control model solves the following model to get action sequence .
(Myopict)  
The CEI module is updated at the beginning of each period, and (Myopict) is solved during each period to provide optimal control actions. Theorem 4.2 provides a theoretical guarantee that (Myopict) can be solved by the greedy method. Details of the Myopic control policy are summarized in Algorithm 2.
Theorem 4.2.
Optimal action sequence of (Myopict) can be obtained by greedy algorithm, i.e. for , sequentially set
4.3 Prospective control
Figure 5 depicts decision flow of prospective control.
Prospective control model has a similar CEI module to diminish pattern recognition lag. In addition, FEI module is adopted from GFunctionEstimation2018TKDE to estimate future decision environment change due to the action taken at current period. These environments are characterized by the functions of period and . Similar with Myopic control, in period , we use the output of the CEI module as the state estimation, i.e. . Action space of period is still . While different from previous two control models, prospective control considers future effects caused by the current decisions: the action sequences will play a role on the behavior patterns of bank and MR in period . For transactions occurred in period , we need to solve the following model to get our action sequence .
(Prospectivet)  
where is a discount factor, and is a reference future profit of period . A reference sample from mature control group is bootstrapped from mature data set in order to provide reference future profit . Let this reference transaction set sample be with elements. FEI module includes two subprocedures:

Calculate estimated chargeback rate of period , : at a given time point during period , suppose we have received transaction request, and our decision action sequence is , we can then estimate charge back rate of period ,
(3) where is the indicator function.

Predict future functions (, , , and ) with matured function trajectories (, , , , and : ) and estimate weekly full chargeback rate . FEI is trained with mature data and
(FEI)
(Prospectivet) is hard to solve due to high dimension of and nonanalytic form of . A similar realtime updated greedy heuristic is introduced to obtain a sub optimal solution for (Prospectivet). This Realtime Greedy Heuristic (RGH) allows us to update estimation of on the fly and to adjust our strategy within period . Figure 6 illustrates the logics of RGH within period .
Let time be a decision time point in period where transaction occurs and risk team needs to make decision either to approve, reject or manual review this transaction. Suppose from , starting point of period , to current decision point , we have observed transactions. Hence, we can estimate the chargeback rate of period , if we approve, review or reject using Eq. (4).
(4) 
We further estimate expected reward of approve, review or reject . Note that future effect is first averaged to reward per transaction and then discounted by a factor of .
(5a)  
(5b)  
(5c) 
and for ,
(6)  
where are calculated using Eq. (4), and is derived by (FEI). RGH sequentially assigns action that has the largest prospective reward to each incoming transaction. For , we sequentially set
(ProspectiveRGH) 
Prospective control algorithm is summarized in Algorithm 3.
Repeat for period :
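The overall shape of the Realtime Greedy Heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 3: the running chargeback-rate estimate (mean fraud probability of transactions that were approved or sent to review), the linear `toy_future_profit` stand-in for the FEI output, and all numbers are hypothetical. What it keeps from the text is the structure: for each incoming transaction, estimate the period's chargeback rate under each candidate action, add the discounted per-transaction future effect to the immediate expected reward, and pick the action with the largest sum.

```python
from typing import Callable, Dict, List, Sequence

ACTIONS = ("approve", "review", "reject")
Txn = Dict[str, float]


def toy_fraud_prob(txn: Txn) -> float:
    return txn["score"] / 100.0  # assumption: score maps linearly to fraud probability


def toy_immediate_reward(txn: Txn, action: str) -> float:
    p = toy_fraud_prob(txn)
    if action == "approve":
        return (1 - p) * txn["margin"] - p * txn["cost"]
    if action == "review":
        return (1 - p) * txn["margin"] - 0.1 * p * txn["cost"] - 2.0
    return 0.0


def toy_future_profit(chargeback_rate: float) -> float:
    # Assumption: a higher chargeback rate makes bank/MR more conservative next period.
    return -200.0 * chargeback_rate


def realtime_greedy(
    transactions: Sequence[Txn],
    immediate_reward: Callable[[Txn, str], float],
    fraud_prob: Callable[[Txn], float],
    future_profit_per_txn: Callable[[float], float],
    discount: float,
) -> List[str]:
    decisions: List[str] = []
    passed = 0            # transactions approved or sent to review so far
    expected_fraud = 0.0  # summed fraud probability of those transactions
    for txn in transactions:
        best_action, best_value = ACTIONS[0], float("-inf")
        for action in ACTIONS:
            goes_on = action != "reject"
            n = passed + (1 if goes_on else 0)
            f = expected_fraud + (fraud_prob(txn) if goes_on else 0.0)
            rate = f / n if n > 0 else 0.0  # chargeback-rate estimate if `action` is taken
            value = immediate_reward(txn, action) + discount * future_profit_per_txn(rate)
            if value > best_value:
                best_action, best_value = action, value
        decisions.append(best_action)
        if best_action != "reject":
            passed += 1
            expected_fraud += fraud_prob(txn)
    return decisions


period = [
    {"score": 2, "margin": 20.0, "cost": 100.0},
    {"score": 50, "margin": 20.0, "cost": 100.0},
    {"score": 95, "margin": 20.0, "cost": 100.0},
]
myopic_like = realtime_greedy(period, toy_immediate_reward, toy_fraud_prob, toy_future_profit, 0.0)
prospective = realtime_greedy(period, toy_immediate_reward, toy_fraud_prob, toy_future_profit, 0.9)
```

With a discount of zero the sketch reduces to the per-transaction greedy rule of the myopic case. With a positive discount, the mid-score transaction in this toy example is rejected rather than reviewed, because passing it on would raise the estimated chargeback rate and lower the assumed future profit.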