E-Commerce Payment Fraud Detection using RandomForest & XGBoost

Full Jupyter notebook is found at https://gist.github.com/DonovanK13/5b13cc07755649e275d493e6f8c8ecb1

This machine learning project, my first ever, is part of a Jovian Data Science Bootcamp assignment and is based on data from a Kaggle competition. I did not rank high in the competition, but I learned quite a bit. Here are the lessons learned:

Lesson#1: “Two heads are better than one!” I realized that I had bitten off more than I could chew when I took on this project as my first ML project all by myself. If I had had a collaborator to discuss things and bounce ideas off, not only would I have had more fun, but we could have gone much farther as a team.

Lesson#2: Machine learning, particularly feature engineering and hyperparameter tuning, often seems more like an art than a science. Feature engineering requires a lot of creative thinking, and the train-validation overfitting graphs from hyperparameter tuning looked nothing like the ones I saw in the textbook!

Lesson#3: Finally, I learned a lot from this project and realized that there is still much more for me to learn. I look forward to continuing my education as a data scientist in the years to come.


This data is from the IEEE Fraud Detection competition held in September 2019, hosted by Kaggle. Vesta, an e-commerce payment solution provider, sponsored the competition and provided real-world e-commerce payment processing data consisting of four datasets: transaction.csv and identity.csv, each split into training and test sets. The aim is to predict fraudulent online payment activity. Particular challenges in this machine learning project are:

  • The two datasets are highly mismatched. While identity.csv has about 140k rows with 41 features, transaction.csv has about 500k to 590k rows with up to 394 features. This structural imbalance poses an interesting challenge for imputing missing values and for making “complete” use of the data.
  • Competition link: https://www.kaggle.com/competitions/ieee-fraud-detection/overview

    Predicting fraudulent credit card transactions: Develop an ML model to predict the likelihood that an e-commerce credit card transaction is fraudulent, and identify the factors that flag fraudulent transactions.

    import os
    import opendatasets as od
    import pandas as pd  # needed for the pd.read_csv calls below


    test_id = pd.read_csv('./ieee-fraud-detection/test_identity.csv')
    test_trans = pd.read_csv('./ieee-fraud-detection/test_transaction.csv', low_memory=False)
    train_trans = pd.read_csv('./ieee-fraud-detection/train_transaction.csv', low_memory=False)
    train_id = pd.read_csv('./ieee-fraud-detection/train_identity.csv')
    submission = pd.read_csv('./ieee-fraud-detection/sample_submission.csv')

    print('test_id shape',test_id.shape)
    print('test_transaction shape', test_trans.shape)
    print('train_id shape', train_id.shape)
    print('train_transaction shape', train_trans.shape)

    > test_id shape (141907, 41)
    > test_transaction shape (506691, 393)
    > train_id shape (144233, 41)
    > train_transaction shape (590540, 394)

    For the purpose of analysis, the two datasets are merged:

    train_raw = train_id.merge(train_trans, how='right', on='TransactionID')
    test_raw = test_id.merge(test_trans, how='right', on='TransactionID')

    The train and test datasets are split along the timeline, which is given in seconds but is converted into days here for graphing purposes.

    Train & Test Data Split
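The seconds-to-days conversion can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: the column name TransactionDT and the toy values are assumptions made for the example.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

# Toy stand-ins for the train and test frames; TransactionDT is assumed
# to hold a time offset in seconds.
train_toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800, 260000]})
test_toy = pd.DataFrame({"TransactionDT": [345600, 350000, 432000]})

# Integer division turns the second offsets into whole-day buckets.
for frame in (train_toy, test_toy):
    frame["day"] = frame["TransactionDT"] // SECONDS_PER_DAY

# A time-based split means the last training day precedes the first test day.
no_overlap = train_toy["day"].max() < test_toy["day"].min()
print(train_toy["day"].tolist(), test_toy["day"].tolist(), no_overlap)
```

Plotting a histogram of the day column for each frame would then show the two periods side by side.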

    Besides the obviously high percentage of null values in the dataset, there is a definite pattern of column groups that share the same percentage of null values. This may indicate that the columns within the same group come from the same source.
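One way to surface such groups is to compute each column's null share and collect the columns that share the same value. The sketch below uses a small made-up frame; the column names are placeholders, not the competition's features.

```python
import numpy as np
import pandas as pd

# Toy frame: A1/A2 share one null rate, B1/B2 another, C1 has no nulls.
toy = pd.DataFrame({
    "A1": [1.0, np.nan, np.nan, 4.0],
    "A2": [np.nan, 2.0, np.nan, 4.0],
    "B1": [1.0, 2.0, 3.0, np.nan],
    "B2": [np.nan, 2.0, 3.0, 4.0],
    "C1": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of null values per column, rounded so equal rates compare equal.
null_share = toy.isnull().mean().round(4)

# Map each distinct null rate to the list of columns that have it.
groups = {}
for col, share in null_share.items():
    groups.setdefault(float(share), []).append(col)

print(groups)
```

Columns that land in the same bucket are candidates for having been recorded by the same upstream system.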

    Another interesting pattern is the high concentration of values within columns: the mode often dominates a column's distribution, frequently accounting for 70% or more of its values. This holds even when taking into account the fact that the merged data overstates the features from the identity data.

    Below is a function that reports how many columns would be dropped at different null-percentage cut-offs. For instance, dropping every column with 40% or more null values would mean dropping 192 columns out of 394.

    def drop_col_in_df(df):
        # high_null_cols(df, threshold) is a helper that is not shown here
        print("From the total number of {} columns, the number of columns to be dropped at the null percentage cut-off are:".format(df.shape[1]))
        for i in range(1, 10, 1):
            drop_cols = high_null_cols(df, i/10)
            num_cols = len(drop_cols)
            print("At the cut-off of {0}% null count, {1} column(s)".format((i*10), num_cols))

    > From the total number of 394 columns, the number of columns to be dropped at the null percentage cut-off are:
    > At the cut-off of 10% null count, 282 column(s)
    > At the cut-off of 20% null count, 212 column(s)
    > At the cut-off of 30% null count, 192 column(s)
    > At the cut-off of 40% null count, 192 column(s)
    > At the cut-off of 50% null count, 174 column(s)
    > At the cut-off of 60% null count, 168 column(s)
    > At the cut-off of 70% null count, 168 column(s)
    > At the cut-off of 80% null count, 55 column(s)
    > At the cut-off of 90% null count, 2 column(s)
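The helper high_null_cols is called above but not shown. A plausible minimal version, assuming it returns the columns whose null share is at or above the given threshold, could look like this (the body is my guess at its behavior, not the notebook's actual code):

```python
import numpy as np
import pandas as pd

def high_null_cols(df, threshold):
    """Return the columns whose share of null values is at least `threshold`."""
    null_share = df.isnull().mean()
    return list(null_share[null_share >= threshold].index)

# Toy frame with known null rates to check the behavior.
toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "half_null": [np.nan, np.nan, 1.0, 2.0],       # 50% null
    "complete": [1.0, 2.0, 3.0, 4.0],              # 0% null
})

cols_at_50 = high_null_cols(toy, 0.5)
cols_at_70 = high_null_cols(toy, 0.7)
print(cols_at_50, cols_at_70)
```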

    The ‘identity’ dataset (train_id and test_id) has roughly a quarter as many rows as the ‘transaction’ dataset (train_trans and test_trans). If the two dataframes are merged so that every transaction row is kept (e.g., train_trans left merge with train_id), the result contains a significant number of null values. During the null imputation process, it is highly likely that the information from the identity dataframe would be lost, since the percentage of null values in its columns would exceed 75%.

    To avoid significant data loss, the two dataframes will be processed separately and then combined at the end. First, the identity dataframe will be left merged with the transaction dataframe to maximize the usefulness of the identity data. Second, the transaction dataframe will be processed independently.

    train_id = train_id.merge(train_trans, how='left', on='TransactionID')
    test_id = test_id.merge(test_trans, how='left', on='TransactionID')

    print('test_id shape',test_id.shape)
    print('train_id shape', train_id.shape)

    > test_id shape (141907, 433)
    > train_id shape (144233, 434)

    P_emaildomain & R_emaildomain denote the ‘purchaser email domain’ and the order ‘recipient email domain’, respectively. We examined the fraud rates when the two domains match versus when they do not.

    def P_R_emaildomain_match(df):
        # True only when both domains are present and identical
        col = df.apply(lambda row:
                       (False if ((pd.notnull(row.P_emaildomain) == False) |
                                  (pd.notnull(row.R_emaildomain) == False) |
                                  (row.P_emaildomain != row.R_emaildomain))
                        else True),
                       axis=1)  # axis=1 applies the lambda row by row
        return col

    train_trans['P_R_emaildomain_match'] = P_R_emaildomain_match(train_trans)
    test_trans['P_R_emaildomain_match'] = P_R_emaildomain_match(test_trans)
    train_id['P_R_emaildomain_match'] = P_R_emaildomain_match(train_id)
    test_id['P_R_emaildomain_match'] = P_R_emaildomain_match(test_id)

    # the code below calculates the fraud rate within the matching & non-matching populations
    # (mean of the 0/1 isFraud flag = share of fraudulent transactions in each population)
    mismatch_domain = train_trans.loc[train_trans['P_R_emaildomain_match']==False, 'isFraud'].mean()
    match_domain = train_trans.loc[train_trans['P_R_emaildomain_match']==True, 'isFraud'].mean()
    print("Fraud rate for the mismatching e-mail domain is {0:.2f}%".format(mismatch_domain*100))
    print("Fraud rate for the matching e-mail domain is {0:.2f}%".format(match_domain*100))

    > Fraud rate for the mismatching e-mail domain is 2.21%
    > Fraud rate for the matching e-mail domain is 9.65%

    Surprisingly, the fraud rate is higher when the purchaser and order recipient email domains match. It seems that fraudsters are more likely to provide matching email domain name.
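For a frame with several hundred thousand rows, a row-wise apply is slow. A vectorized comparison is one alternative; the sketch below runs on made-up data and relies on the fact that comparing against a missing value yields False, so missing domains count as a mismatch.

```python
import numpy as np
import pandas as pd

# Toy data: only the first and last rows have two identical, non-null domains.
toy = pd.DataFrame({
    "P_emaildomain": ["gmail.com", "gmail.com", np.nan, "yahoo.com", "aol.com"],
    "R_emaildomain": ["gmail.com", "yahoo.com", "gmail.com", np.nan, "aol.com"],
    "isFraud": [1, 0, 0, 0, 0],
})

# Element-wise equality; any comparison involving NaN evaluates to False.
toy["P_R_emaildomain_match"] = toy["P_emaildomain"] == toy["R_emaildomain"]

# Fraud rate within each population: mean of the 0/1 target per group.
rates = toy.groupby("P_R_emaildomain_match")["isFraud"].mean().to_dict()
print(rates)
```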

    Foreign Transaction

    A TransactionAmt with 3 decimal places is assumed to come from a transaction that originated internationally, i.e. a foreign transaction: foreign amounts are converted into US dollars by applying foreign exchange rates and are rounded to three decimal places. These transactions are detected and categorized as “F” for foreign transactions and “D” for domestic transactions.

    def TransAmt_Foreign(df):
        # keep the digits after the decimal point of the amount written as text
        df['foreign_trans'] = df.TransactionAmt.astype(str).str.split(".", n=1, expand=True)[1]
        df.foreign_trans = ['F' if len(x) == 3 else 'D' for x in df.foreign_trans]
        return df.foreign_trans


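    # assumed step: the column must exist before the rates can be computed
    TransAmt_Foreign(train_trans)

    # fraud rate within each population = mean of the 0/1 isFraud flag
    dom_f_rate = train_trans.loc[train_trans['foreign_trans']=="D", 'isFraud'].mean()
    for_f_rate = train_trans.loc[train_trans['foreign_trans']=="F", 'isFraud'].mean()
    print("Fraud rate for the domestic transaction is {0:.2f}%".format(dom_f_rate*100))
    print("Fraud rate for the foreign transaction is {0:.2f}%".format(for_f_rate*100))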

    > Fraud rate for the domestic transaction is 2.54%
    > Fraud rate for the foreign transaction is 11.72%

    Once again we examine the fraud rates for domestic versus foreign transactions. Those originating from foreign countries have a significantly higher fraud rate than domestically originated transactions.
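To make the three-decimal heuristic concrete, here is a small sketch on made-up amounts. It is a vectorized variant of the same idea, not the notebook's function.

```python
import numpy as np
import pandas as pd

# Toy amounts: only 31.117 carries exactly three decimal digits.
toy = pd.DataFrame({"TransactionAmt": [10.5, 25.0, 31.117, 100.95]})

# Text after the decimal point of each amount.
decimals = toy["TransactionAmt"].astype(str).str.split(".", n=1, expand=True)[1]

# Three digits after the point are taken as a sign of currency conversion.
toy["foreign_trans"] = np.where(decimals.str.len() == 3, "F", "D")
print(toy)
```

Note that the heuristic depends on how the amount prints as text, so an amount such as 31.120 would be read as domestic because its trailing zero is not shown.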

    TransactionAmt t-statistics Applied to Categorical Features

    This feature engineering is inspired by one of the participants of this competition (Andrew Lukyanenko), who calculated the mean of TransactionAmt within each group of a categorical feature. I took this a step further and calculated a t-statistic for each transaction within its group. Although this method does not yield an exact t-value, it does give an approximation. The t-value in this case points to the position of each transaction within the t-distribution of that categorical grouping.

    To put this in layman’s terms, it is assumed that fraudsters are more likely to make large-value transactions before their fraudulent behavior is discovered. Therefore, one can detect potential fraud by looking for unusually large transaction amounts, i.e. a high t-value.

    # the code below calculates the t-statistic, which is used to detect unusual spending behavior
    def cat_transAmt_t_stat(df, col):
        df[col+'_transAmt_t_stat'] = df.groupby(col)['TransactionAmt'].transform(lambda row: (row-row.mean())/row.sem())
        return df[col+'_transAmt_t_stat']

    for col in identified_cat_cols1:
        cat_transAmt_t_stat(train_trans, col)

    for col in identified_cat_cols1:
        cat_transAmt_t_stat(test_trans, col)

    for col in identified_cat_cols:
        cat_transAmt_t_stat(train_id, col)

    for col in identified_cat_cols:
        cat_transAmt_t_stat(test_id, col)
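A small worked example helps to see what the transform produces. The frame below is made up, and the grouping column name card4 is only a stand-in for one of the categorical columns. Within each group, every amount is centered on the group mean and divided by the group's standard error of the mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "card4": ["visa", "visa", "visa", "mc", "mc", "mc"],
    "TransactionAmt": [10.0, 20.0, 30.0, 50.0, 50.0, 200.0],
})

# (amount - group mean) / group standard error of the mean
toy["card4_transAmt_t_stat"] = (
    toy.groupby("card4")["TransactionAmt"]
       .transform(lambda grp: (grp - grp.mean()) / grp.sem())
)

# visa: mean 20, sem = 10 / sqrt(3)  -> about -1.73, 0, 1.73
# mc:   mean 100, sem = 50           -> -1, -1, 2
print(toy)
```

The 200.0 amount stands out in its group with a value of 2, which is the kind of signal the feature is meant to capture. Groups with a single row have no standard error, so their value is NaN.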

    Feature ‘V258’ applied to ‘card_xx’ features

    Even without any significant feature engineering, running the data through the RandomForest and XGBoost models yielded an interesting observation: in both models, ‘V258’ was found to be the most powerful feature. One weakness of ‘V258’ is that it has a large number of null values. Therefore, ‘V258’ will be feature engineered with the ‘card_xx’ features, in particular ‘card1’, which does not have any null values. The logic is that the information revealed in ‘V258’ will be transferable to the card_xx features, particularly ‘card1’.

    def card_to_V258_stat(df, ccol):
        ''' this function assigns V258 values to card_xx series. V258 is proven to be
        very effective in identifying fraudulent transaction. Then any card info (thus card_xxx series) associated
        with V258 can be effectively evaluated for being fraud. '''
        df[ccol+'_V258_mean'] = df.groupby(ccol)['V258'].transform('mean')
        df[ccol+'_V258_std'] = df.groupby(ccol)['V258'].transform('std')
        return df[ccol+'_V258_mean'], df[ccol+'_V258_std']

    card_xx = ['card1','card2','card3','card5']

    for col in card_xx:
        card_to_V258_stat(train_trans, col)

    Drop Columns: We set a number of rules for removing columns:

  • ‘too_many_null_cols’ is a list of columns that have over 90% null values
  • Identifying numerical values:

    def num_col_def(df):
        numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
        numeric_df = df.select_dtypes(include=numerics)
        return list(numeric_df.columns)

    num_cols_trans = num_col_def(inputs_trans)
    num_cols_id = num_col_def(inputs_id)

    Min-Max Scaler

    The next step is to scale the numerical features. Before proceeding, we replaced the infinity values with np.nan.

    def clean_inf_nan(df):
        return df.replace([np.inf, -np.inf], np.nan)

    inputs_trans = clean_inf_nan(inputs_trans)
    inputs_id = clean_inf_nan(inputs_id)
    test_inputs_trans = clean_inf_nan(test_inputs_trans)
    test_inputs_id = clean_inf_nan(test_inputs_id)

    Once all infinity values were replaced, the numerical values were scaled.

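    from sklearn.preprocessing import MinMaxScaler

    # a scaler must be fitted before transform; one scaler per dataframe,
    # fitted on the training data only (assumed, the fit step is not shown)
    scaler_id = MinMaxScaler().fit(inputs_id[num_cols_id])
    scaler_trans = MinMaxScaler().fit(inputs_trans[num_cols_trans])

    inputs_id[num_cols_id] = scaler_id.transform(inputs_id[num_cols_id])
    test_inputs_id[num_cols_id] = scaler_id.transform(test_inputs_id[num_cols_id])

    inputs_trans[num_cols_trans] = scaler_trans.transform(inputs_trans[num_cols_trans])
    test_inputs_trans[num_cols_trans] = scaler_trans.transform(test_inputs_trans[num_cols_trans])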

    Imputing Null Values

    As discussed previously, most of the features in the datasets have a high percentage of null values, often exceeding 70 to 80% of the total. We have already removed the columns whose null values exceeded 90%. If we lowered the threshold to, say, 70%, 238 columns would have to be removed. However, since the null values do not necessarily fall on the same rows, the actual loss of data from removing incomplete rows would likely approach 95% or more. Therefore, the standard approach of removing rows or columns cannot be considered.

    Various imputation techniques were considered: SimpleImputer, IterativeImputer, KNNImputer, MissForest, and MICE. Given the large percentage of null values in the dataset as well as its structural imbalance, SimpleImputer and KNNImputer will not be considered because most features lack too much data.

    “Multivariate imputation by chained equations (MICE) is an iterative approach to impute missing values. It takes an assumption that the data are missing at random, and it makes an educated guess about its true value by looking into the other sample values.” (Satyam Kumar, towardsdatascience.com, Feb 8, 2022). Although this technique is expected to improve on the univariate SimpleImputer, it still depends on the availability of other features.
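To illustrate the chained-equations idea, the sketch below uses scikit-learn's IterativeImputer, which is still marked experimental and has to be enabled explicitly. It runs on synthetic data in which the second column is roughly twice the first, so the imputer can recover hidden values from the other feature. This is an illustration only; it is not the imputer used later in this project.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: column 1 is about 2 * column 0 plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Hide every tenth value of the second column.
X_missing = X.copy()
X_missing[::10, 1] = np.nan

# Each feature with missing values is modeled from the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

max_abs_error = np.abs(X_filled[::10, 1] - X[::10, 1]).max()
print(max_abs_error)
```

When a row is missing most of its features, as is common in this dataset, there is little left to predict from, which is the dependence on other features mentioned above.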

    MissForest is an iterative imputer built on tree-ensemble classifier and regressor models (random forests in the original algorithm), a family of models that has proven to be among the most powerful tools in recent years. Moreover, it is considered by many to be the most efficient imputation technique. Therefore, MissForest will be used for this project.



    !pip install MissForest
    from missforest.miss_forest import MissForest
    mf = MissForest()

    inputs_trans = mf.fit_transform(inputs_trans)
    inputs_id = mf.fit_transform(inputs_id)

    test_inputs_trans = mf.fit_transform(test_inputs_trans)
    test_inputs_id = mf.fit_transform(test_inputs_id)

    One-Hot Encoding

    We ran standard one-hot encoding on the categorical values.

    from sklearn.preprocessing import OneHotEncoder
    encoder_trans = OneHotEncoder(sparse=False, handle_unknown='ignore' , min_frequency = .05).fit(inputs_trans[cat_cols_trans])
    encoded_cols_trans = list(encoder_trans.get_feature_names_out(cat_cols_trans))

    inputs_trans[encoded_cols_trans] = encoder_trans.transform(inputs_trans[cat_cols_trans])
    test_inputs_trans[encoded_cols_trans] = encoder_trans.transform(test_inputs_trans[cat_cols_trans])

    encoder_id = OneHotEncoder(sparse=False, handle_unknown='ignore' , min_frequency = .05).fit(inputs_id[cat_cols_id])
    encoded_cols_id = list(encoder_id.get_feature_names_out(cat_cols_id))

    inputs_id[encoded_cols_id] = encoder_id.transform(inputs_id[cat_cols_id])
    test_inputs_id[encoded_cols_id] = encoder_id.transform(test_inputs_id[cat_cols_id])

    Putting all these together, we finally end up with the training and testing data for our models. There are two sets of testing and training data — one for identity data and another for transaction data:

    X_train_trans = inputs_trans[num_cols_trans + encoded_cols_trans]
    X_test_trans = test_inputs_trans[num_cols_trans + encoded_cols_trans]
    X_train_id = inputs_id[num_cols_id + encoded_cols_id]
    X_test_id = test_inputs_id[num_cols_id + encoded_cols_id]

    target_trans = train_trans['isFraud'].to_frame()
    target_id = train_id['isFraud'].to_frame()

    Train_Test Split

    For the ML model development, we will split the data into training and validation sets.

    X_train_trans : Size of X_train_trans is large with a shape of (590540, 373), taking up 904.5MB memory. Sampling rate will be 20%.

    X_train_id : Similarly, X_train_id will be sampled at 40% since the data size is smaller.

    combo_trans = X_train_trans.join(target_trans)
    combo_trans = combo_trans.sample(frac=.20, random_state=42)
    target_trans_sample = combo_trans['isFraud']
    train_trans_sample = combo_trans.drop('isFraud', axis=1)

    from sklearn.model_selection import train_test_split
    train_t_inputs, val_t_inputs, train_t_targets, val_t_targets = train_test_split(
    train_trans_sample, target_trans_sample, test_size=0.25, random_state=42)

    combo_id = X_train_id.join(target_id)
    combo_id = combo_id.sample(frac=.40, random_state=42)
    target_id_sample = combo_id['isFraud']
    train_id_sample = combo_id.drop('isFraud', axis=1)

    train_i_inputs, val_i_inputs, train_i_targets, val_i_targets = train_test_split(
    train_id_sample, target_id_sample, test_size=0.25, random_state=42)
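With a rare positive class, a purely random split can leave the validation set with noticeably more or fewer fraud cases than the training set. One option, not used in the code above, is the stratify argument of train_test_split. The sketch below uses synthetic labels with 3.5% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 35 positives (3.5%).
X = np.arange(2000).reshape(1000, 2)
y = np.array([1] * 35 + [0] * 965)

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(y_tr.mean(), y_val.mean())
```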

    A basemodel is built with RandomForest on the transaction data.

    from sklearn.ensemble import RandomForestClassifier

    #train_inputs, val_inputs, train_targets, val_targets
    rf_basemodel_tran = RandomForestClassifier(random_state=42, n_jobs=-1)
    rf_basemodel_tran.fit(train_t_inputs, train_t_targets)  # the model must be fitted before scoring

    base_train_t_acc = rf_basemodel_tran.score(train_t_inputs,train_t_targets)
    base_val_t_acc = rf_basemodel_tran.score(val_t_inputs, val_t_targets)
    base_t_accs = base_train_t_acc, base_val_t_acc
    print('BaseModel for Trans Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(base_train_t_acc *100, base_val_t_acc*100))
    > BaseModel for Trans Data: Train Accuracy Score: 99.997742%, Val Accuracy Score: 97.615742%

    Feature Importance of the RandomForest BaseModel for Transaction Data

    Initial examination of the model revealed that the basemodel picked up many of the engineered features. As expected, the accuracy scores for training and validation data were extremely high at 99.997% and 97.616%, respectively.
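A note on reading these accuracy figures: with roughly 3.5% of transactions being fraudulent, a model that never predicts fraud already scores about 96.5% accuracy. The Kaggle competition ranks submissions by ROC AUC instead, which is not inflated by class imbalance. The sketch below shows the difference on synthetic labels; it is an illustration and does not use the project's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 35 fraud cases out of 1000 transactions.
y_true = np.array([1] * 35 + [0] * 965)

# A "model" that labels every transaction as legitimate.
always_legit = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_legit)

# The same model expressed as scores: every transaction gets the same score,
# so it cannot rank fraud above non-fraud and its AUC is 0.5.
constant_scores = np.zeros(len(y_true))
auc = roc_auc_score(y_true, constant_scores)

print(acc, auc)
```

For a fitted classifier, the scores passed to roc_auc_score would typically come from predict_proba(...)[:, 1] rather than from hard class predictions.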

    Hyperparameter Tuning — RF Model for Transaction Data

    For hyperparameter tuning, the following functions are used:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error

    def test_params(**params):
        model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(train_inputs, train_targets)
        # RMSE of the 0/1 class predictions, computed as the square root of the MSE
        train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(train_inputs)))
        val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(val_inputs)))
        return train_rmse, val_rmse

    def test_param_and_plot(param_name, param_values):
        train_errors, val_errors = [], []
        for value in param_values:
            params = {param_name: value}
            train_rmse, val_rmse = test_params(**params)
            # collect the errors so that they can be plotted
            train_errors.append(train_rmse)
            val_errors.append(val_rmse)
        plt.title('Overfitting curve: ' + param_name)
        plt.plot(param_values, train_errors, 'b-o')
        plt.plot(param_values, val_errors, 'r-o')
        plt.legend(['Training', 'Validation'])
        for x,y in zip(param_values, val_errors):
            label = "{:.5f}".format(y)
            plt.annotate(label, # this is the text
                         (x,y), # these are the coordinates to position the label
                         textcoords="offset points", # how to position the text
                         xytext=(0,-10), # distance from text to points (x,y)
                         ha='center') # horizontal alignment can be left, right or center

    Putting it together — RandomForest Model for Transaction Data

    Based on the hyperparameter tuning learning, we put them all together for the tuned RF model.

    rf_trans_model = RandomForestClassifier(n_jobs=-1, 
    X_train_t_acc = rf_trans_model.score(X_train_t_inputs, X_train_t_targets)
    X_val_t_acc = rf_trans_model.score(X_val_t_inputs, X_val_t_targets)
    print('Hypertuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(X_train_t_acc*100, X_val_t_acc*100))

    > Hypertuned Model: Train Accuracy Score: 99.708515%, Val Accuracy Score: 98.180648%

    Feature Importance of RandomForest Tuned Model for Transaction Data

    The validation score improved by 0.565 percentage points. It is also noticeable that fewer engineered features are picked up.

    Below are the accuracy scores for the basemodel based on identity data:

    base_train_i_acc = rf_basemodel_id.score(train_i_inputs,train_i_targets)
    base_val_i_acc = rf_basemodel_id.score(val_i_inputs, val_i_targets)
    base_i_accs = base_train_i_acc, base_val_i_acc
    print('BaseModel for ID Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(base_train_i_acc *100, base_val_i_acc*100))

    > BaseModel for ID Data: Train Accuracy Score: 100.000000%, Val Accuracy Score: 96.519689%

    Feature Importance of RandomForest BaseModel based on Identity Data

    Hyperparameter Tuning — RF Model for Identity Data

    Putting it together — RandomForest Model for Identity Data

    Based on the hyperparameter tuning learning, we put them all together for the final model.

    rf_id_model = RandomForestClassifier(n_jobs=-1, 
    X_train_i_acc = rf_id_model.score(X_train_i_inputs, X_train_i_targets)
    X_val_i_acc = rf_id_model.score(X_val_i_inputs, X_val_i_targets)
    print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(X_train_i_acc*100, X_val_i_acc*100))
    Feature Importance of the RandomForest Tuned Model for Identity Data

    Compared to the basemodel, the hyper-tuned model yields a 0.831% improvement. Again, the notable observation is that few engineered features are selected.

    We begin by building the XGBoost basemodel for transaction data:

    from xgboost import XGBClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score, KFold

    xgb_base_trans = XGBClassifier(n_jobs=-1, random_state=42)
    xgb_base_trans.fit(train_t_inputs, train_t_targets)  # the model must be fitted before scoring

    xgb_base_train_t_acc = xgb_base_trans.score(train_t_inputs,train_t_targets)
    xgb_base_val_t_acc = xgb_base_trans.score(val_t_inputs, val_t_targets)
    print('BaseModel for Trans Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_base_train_t_acc *100, xgb_base_val_t_acc*100))
    > BaseModel for Trans Data: Train Accuracy Score: 98.629503%, Val Accuracy Score: 97.768144%

    Feature Importance of XGBoost BaseModel for Transaction Data

    Hyperparameter Tuning — XGBoost BaseModel for Transaction Data

    #1) max_depth specifies the maximum depth of each tree in XGBoost trees.

    Putting it together — XGBoost Model for Transaction Data

    Based on the hyperparameter tuning learning, we put them all together for the final model.

    xgb_trans_model = XGBClassifier(n_jobs=-1, 
    eta = .4,
    alpha=.3 )

    xgb_trans_model.fit(X_train_t_inputs, X_train_t_targets)

    xgb_train_t_acc = xgb_trans_model.score(X_train_t_inputs, X_train_t_targets)
    xgb_val_t_acc = xgb_trans_model.score(X_val_t_inputs, X_val_t_targets)
    print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_train_t_acc*100, xgb_val_t_acc*100))

    > Hyper-tuned Model: Train Accuracy Score: 99.734932%, Val Accuracy Score: 98.479358%

    Hyperparameter tuning improved the validation accuracy by 0.71 percentage points.

    Feature Importance of XGBoost Tuned Model for Transaction Data

    Feature ‘V258’ again stood out as the most powerful predictor.

    Next, we build the XGBoost basemodel for identity data:

    xgb_base_id = XGBClassifier(n_jobs=-1, random_state=42)
    xgb_base_id.fit(train_i_inputs,train_i_targets)

    xgb_base_train_i_acc = xgb_base_id.score(train_i_inputs,train_i_targets)
    xgb_base_val_i_acc = xgb_base_id.score(val_i_inputs, val_i_targets)
    print('BaseModel for Trans Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_base_train_i_acc *100, xgb_base_val_i_acc*100))

    > BaseModel for Trans Data: Train Accuracy Score: 99.144884%, Val Accuracy Score: 97.178314%

    Feature Importance of XGBoost BaseModel for Identity Data

    Hyperparameter Tuning — XGBoost BaseModel for Identity Data

    #1) max_depth specifies the maximum depth of each tree in XGBoost trees.

    Putting it together — XGBoost Model for Identity Data

    Based on the hyperparameter tuning learning, we put them all together for the final model.

    xgb_id_model = XGBClassifier(n_jobs=-1, 
    eta = .3,
    alpha=.5 )

    xgb_id_model.fit(X_train_i_inputs, X_train_i_targets)

    xgb_train_i_acc = xgb_id_model.score(X_train_i_inputs, X_train_i_targets)
    xgb_val_i_acc = xgb_id_model.score(X_val_i_inputs, X_val_i_targets)
    print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_train_i_acc*100, xgb_val_i_acc*100))

    > Hyper-tuned Model: Train Accuracy Score: 99.975965%, Val Accuracy Score: 97.878477%

    The validation score improved by 0.70%.

    Feature Importance for XGBoost Tuned Model based on Identity Data

    XGBoost Transaction & Identity Model Predictions & Combining

    Based on the hyperparameter tuning learning, we put them all together for the final model.

    The four submission files, held as dataframe objects, are:

  • rf_submission_trans_df >> RandomForest Transaction data based predictions
  • rf_submission_id_df >> RandomForest Identity data based predictions
  • submission_t_xgb >> XGBoost Transaction data based predictions
  • submission_i_xgb >> XGBoost Identity data based predictions

    Identity data based predictions are deemed to be superior since the identity data is merged with the transaction data. However, the identity data again has only about 1/4 as many rows as the transaction data. Therefore, approximately 1/4 of the predictions will be based on the identity + transaction data, while the remainder are based on the transaction data only.

    Below is the code to combine the two predictions:

    rf_submission = rf_submission_trans_df.merge(rf_submission_id_df, how='left', on='TransactionID')
    rf_submission['isFraud'] = np.where(rf_submission['isFraud_y'].isnull() == True, rf_submission['isFraud_x'], rf_submission['isFraud_y'])
    rf_submission.drop(['isFraud_x', 'isFraud_y'], axis=1, inplace=True)
    rf_submission.to_csv('rf_submission.csv', index=False)

    xgb_submission = submission_t_xgb.merge(submission_i_xgb, how='left', on='TransactionID')
    xgb_submission['isFraud'] = np.where(xgb_submission['isFraud_y'].isnull() == True, xgb_submission['isFraud_x'], xgb_submission['isFraud_y'])
    xgb_submission.drop(['isFraud_x', 'isFraud_y'], axis=1, inplace=True)
    xgb_submission.to_csv('xgb_submission.csv', index=False)
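The combining logic can be checked on a toy example: wherever an identity-based prediction exists it takes priority, and the transaction-based prediction fills in the rest. The frames and values below are made up for illustration.

```python
import pandas as pd

# Transaction-based predictions cover every TransactionID.
trans_pred = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                           "isFraud": [0.10, 0.20, 0.30, 0.40]})
# Identity-based predictions exist only for a subset.
id_pred = pd.DataFrame({"TransactionID": [2, 4],
                        "isFraud": [0.90, 0.05]})

combined = trans_pred.merge(id_pred, how="left", on="TransactionID",
                            suffixes=("_trans", "_id"))

# Prefer the identity-based value; fall back to the transaction-based one.
combined["isFraud"] = combined["isFraud_id"].fillna(combined["isFraud_trans"])
combined = combined[["TransactionID", "isFraud"]]
print(combined)
```

Using fillna here is equivalent to the np.where(... isnull() ...) pattern above and is slightly shorter.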

    The predicted results from RandomForest and XGBoost models were submitted to the Kaggle competition. Not surprisingly, the results are poor, likely placing me in the bottom quartile. More creative feature engineering and more detailed hyperparameter tuning would be required to improve the models.

    One interesting observation is that the RandomForest model had a slightly better score, even though XGBoost is commonly considered the superior ML model. The poor performance can only be attributed to the inexperienced model-builder: this is my first ML model project!

    One might expect that the prediction based on the merged identity data would be more accurate than the one based on the transaction data only. The finding is the contrary — the predictions based on the transaction data are more accurate.

    Examination of the Feature Importance tables reveals a few interesting observations:

  • All models picked up ‘V258’ * ‘C1’ as one of the most important features.
  • Below is a summary of ‘V258’. The data is extracted from the Excel sheet from the earlier EDA — IEEE_C_Card_Fraud_Worksheet.xlsx

    V258 has 78% null values, but the fraud rate within the null population is only 2.27% vs. 7.81% in the non-null population. Mode of V258 is “1” with a fraud rate of 3.97%, which is still higher than the 3.5% average. But the fraud rate increases as the feature values increase, reaching 78.21% when the value becomes 5. In addition, as the frequency of value occurrence increases so does the fraud rate.

  • Feature engineering idea: The data is a series of credit card transactions, and one could assume that customers made multiple transactions. If one could find a way to identify these transactions and group them by their respective customers, then we may be able to identify not just fraudulent transactions but the fraudsters themselves. This would be a powerful predictor of future fraud. After all, it is fraudsters that commit fraud.
  • Originally published at http://github.com.
