Full Jupyter notebook is found at https://gist.github.com/DonovanK13/5b13cc07755649e275d493e6f8c8ecb1
This machine learning project, my first ever, is part of a Jovian Data Science Bootcamp assignment and is based on data from a Kaggle competition. I did not rank high in the competition, but I learned quite a bit. Here are the lessons learned:
Lesson #1: “Two heads are better than one!” I realized I had bitten off more than I could chew when I took on this project as my first ML project all by myself. If I had had a collaborator to discuss and bounce ideas off of, not only would I have had more fun, but we could have gone much farther as a team.
Lesson #2: Machine learning, particularly in the areas of feature engineering and hyperparameter tuning, often seems more like an art than a science. Feature engineering requires a lot of creative thinking, and the train-validation overfitting graphs for hyperparameter tuning looked nothing like the ones I saw in the textbook!
Lesson #3: Finally, I learned a lot from this project and realized that there is still much more for me to learn. I look forward to continuing my education as a data scientist in the years to come.
Introduction
This data comes from the IEEE Fraud Detection competition hosted by Kaggle in September 2019. Vesta, an e-commerce payment solution provider, sponsored the competition and provided real-world e-commerce payment processing data consisting of four files: transaction.csv and identity.csv, each split into training and test sets. The aim is to predict fraudulent online payment activity.
Competition link: https://www.kaggle.com/competitions/ieee-fraud-detection/overview
The particular challenge defined in this machine learning project: develop an ML model to predict the likelihood that an e-commerce credit card transaction is fraudulent, and identify the factors that flag fraudulent transactions.
import os
import numpy as np   # used later for np.inf / np.nan handling
import pandas as pd  # used throughout for reading and manipulating the data
import opendatasets as od

od.download('https://www.kaggle.com/competitions/ieee-fraud-detection')
test_id = pd.read_csv('./ieee-fraud-detection/test_identity.csv')
test_trans = pd.read_csv('./ieee-fraud-detection/test_transaction.csv', low_memory=False)
train_trans = pd.read_csv('./ieee-fraud-detection/train_transaction.csv', low_memory=False)
train_id = pd.read_csv('./ieee-fraud-detection/train_identity.csv')
submission = pd.read_csv('./ieee-fraud-detection/sample_submission.csv')
print('test_id shape',test_id.shape)
print('test_transaction shape', test_trans.shape)
print('train_id shape', train_id.shape)
print('train_transaction shape', train_trans.shape)
> test_id shape (141907, 41)
> test_transaction shape (506691, 393)
> train_id shape (144233, 41)
> train_transaction shape (590540, 394)
For the purpose of analysis, the two datasets are merged:
train_raw = train_id.merge(train_trans, how='right', on='TransactionID')
test_raw = test_id.merge(test_trans, how='right', on='TransactionID')
The train and test datasets are provided pre-split along a timeline, which is given in seconds but converted into days for graphing purposes.
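As a sketch, the conversion might look like the following (assuming, per the competition's data description, that TransactionDT holds elapsed seconds from a fixed reference point):
# convert the TransactionDT timedelta from seconds to days for plotting
SECONDS_PER_DAY = 60 * 60 * 24
train_trans['TransactionDay'] = train_trans['TransactionDT'] / SECONDS_PER_DAY
test_trans['TransactionDay'] = test_trans['TransactionDT'] / SECONDS_PER_DAY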
In addition to the obviously high percentage of null values in the dataset, there is a definite pattern of columns grouping together with the same percentage of null values. This may indicate that the columns within a group come from the same source.
Another interesting pattern is the high concentration of values within columns: the mode often dominates a column's distribution, frequently accounting for 70% of the values or more. This holds even allowing for the fact that the merged data overstates the features from the identity data.
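A quick way to see the column grouping is to count how many columns share each null fraction. This is a minimal sketch, assuming train_trans is already loaded:
# columns sharing an identical null fraction likely come from the same source
null_frac = train_trans.isnull().mean().round(4)
print(null_frac.value_counts().head(10))  # number of columns in each null-fraction block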
Below is a function that calculates the number of columns that would be dropped at different null-count cut-off lines. For instance, dropping any column with 40% or more null values would mean dropping 192 columns out of 394.
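The helper high_null_cols is not shown in this excerpt; a minimal reconstruction consistent with how it is called might be:
# hypothetical helper: return the columns whose null fraction meets the cut-off
def high_null_cols(df, cutoff):
    null_frac = df.isnull().mean()
    return list(null_frac[null_frac >= cutoff].index)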
def drop_col_in_df(df):
    print("From the total number of {} columns, the number of columns to be dropped at the null percentage cut-off are:".format(df.shape[1]))
    for i in range(1, 10, 1):
        drop_cols = high_null_cols(df, i/10)
        num_cols = len(drop_cols)
        print("At the cut-off of {0:0}% null count, {1} column(s)".format((i*10), num_cols))
    return

drop_col_in_df(train_trans)
> From the total number of 394 columns, the number of columns to be dropped at the null percentage cut-off are:
> At the cut-off of 10% null count, 282 column(s)
> At the cut-off of 20% null count, 212 column(s)
> At the cut-off of 30% null count, 192 column(s)
> At the cut-off of 40% null count, 192 column(s)
> At the cut-off of 50% null count, 174 column(s)
> At the cut-off of 60% null count, 168 column(s)
> At the cut-off of 70% null count, 168 column(s)
> At the cut-off of 80% null count, 55 column(s)
> At the cut-off of 90% null count, 2 column(s)
The number of rows in the ‘identity’ dataset (train_id and test_id) is only about a quarter of that of the ‘transaction’ dataset (train_trans and test_trans). If the two dataframes were merged so that every transaction row is kept (e.g., train_trans left-merged with train_id), the result would contain a significant number of null values. During the null imputation process, the data from the identity dataframe would then likely be lost, since its null percentage would exceed 75%.
To avoid significant data loss, the two dataframes are processed separately and combined at the end. First, the identity dataframe is left-merged with the transaction dataframe to maximize the usefulness of the identity data. Second, the transaction dataframe is processed independently.
train_id = train_id.merge(train_trans, how='left', on='TransactionID')
test_id = test_id.merge(test_trans, how='left', on='TransactionID')

print('test_id shape', test_id.shape)
print('train_id shape', train_id.shape)
> test_id shape (141907, 433)
> train_id shape (144233, 434)
P_emaildomain and R_emaildomain denote the purchaser’s email domain and the order recipient’s email domain, respectively. We examined the fraud rates when the two domains match versus when they do not.
def P_R_emaildomain_match(df):
    # True only when both domains are present and identical
    col = df.apply(lambda row: bool(pd.notnull(row.P_emaildomain) and
                                    pd.notnull(row.R_emaildomain) and
                                    row.P_emaildomain == row.R_emaildomain),
                   axis=1)
    return col

train_trans['P_R_emaildomain_match'] = P_R_emaildomain_match(train_trans)
test_trans['P_R_emaildomain_match'] = P_R_emaildomain_match(test_trans)
train_id['P_R_emaildomain_match'] = P_R_emaildomain_match(train_id)
test_id['P_R_emaildomain_match'] = P_R_emaildomain_match(test_id)
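Since df.apply with axis=1 iterates row by row, it is slow on roughly 590k rows. A vectorized equivalent (a sketch, not taken from the notebook) would be:
# vectorized version: null checks plus an element-wise comparison
def P_R_emaildomain_match_fast(df):
    return (df.P_emaildomain.notnull()
            & df.R_emaildomain.notnull()
            & (df.P_emaildomain == df.R_emaildomain))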
# the code below calculates the fraud rate within the matching & non-matching populations
mismatch_domain = (((train_trans['P_R_emaildomain_match'] == False) & (train_trans.isFraud == 1)).sum()
                   / (train_trans['P_R_emaildomain_match'] == False).sum())
match_domain = (((train_trans['P_R_emaildomain_match'] == True) & (train_trans.isFraud == 1)).sum()
                / (train_trans['P_R_emaildomain_match'] == True).sum())
print("Fraud rate for the mismatching e-mail domain is {0:.2f}%".format(mismatch_domain*100))
print("Fraud rate for the matching e-mail domain is {0:.2f}%".format(match_domain*100))
> Fraud rate for the mismatching e-mail domain is 2.21%
> Fraud rate for the matching e-mail domain is 9.65%
Surprisingly, the fraud rate is higher when the purchaser and order recipient email domains match. It seems that fraudsters are more likely to provide matching email domains.
Foreign Transaction
When a TransactionAmt has three decimal places, the transaction is assumed to have originated internationally, i.e., a foreign transaction: foreign amounts are converted into US dollars by applying foreign exchange rates, which leaves values extending to the third decimal place. These transactions are detected and categorized as “F” for foreign and “D” for domestic.
def TransAmt_Foreign(df):
    # a three-digit decimal part implies an FX-converted (foreign) amount
    df['foreign_trans'] = df.TransactionAmt.astype(str).str.split(".", n=1, expand=True)[1]
    df.foreign_trans = ['F' if len(x) == 3 else 'D' for x in df.foreign_trans]
    return df.foreign_trans

TransAmt_Foreign(train_trans)
TransAmt_Foreign(test_trans)
TransAmt_Foreign(train_id)
TransAmt_Foreign(test_id)
dom_f_rate = (((train_trans['foreign_trans'] == "D") & (train_trans.isFraud == 1)).sum()
              / (train_trans['foreign_trans'] == "D").sum())
for_f_rate = (((train_trans['foreign_trans'] == "F") & (train_trans.isFraud == 1)).sum()
              / (train_trans['foreign_trans'] == "F").sum())
print("Fraud rate for the domestic transaction is {0:.2f}%".format(dom_f_rate*100))
print("Fraud rate for the foreign transaction is {0:.2f}%".format(for_f_rate*100))
> Fraud rate for the domestic transaction is 2.54%
> Fraud rate for the foreign transaction is 11.72%
Here we compare the fraud rates of domestic versus foreign transactions. Transactions originating from foreign countries have a significantly higher fraud rate than domestically originated ones.
TransactionAmt t-statistics Applied to Categorical Features
This feature engineering was inspired by one of the participants in this competition (Andrew Lukyanenko), who calculated the grouped mean of TransactionAmt for each group of categorical features. I took this a step further and calculated an approximate t-statistic for each value within each group of a column. Although this method does not yield an exact t-value, it gives a useful approximation: the t-value indicates the position of each transaction within the t-distribution of its categorical grouping.
In layman’s terms, the assumption is that fraudsters are more likely to make large-value transactions before their fraudulent behavior is discovered. One can therefore flag potential fraud by looking for unusually large transaction amounts within a group, i.e., a high t-value.
# the code below calculates the t-statistic used to detect unusual spending behavior
def cat_transAmt_t_stat(df, col):
    df[col+'_transAmt_t_stat'] = df.groupby(col)['TransactionAmt'].transform(lambda row: (row - row.mean()) / row.sem())
    return df[col+'_transAmt_t_stat']

# identified_cat_cols1 / identified_cat_cols are the lists of categorical columns defined earlier in the notebook
for col in identified_cat_cols1:
    cat_transAmt_t_stat(train_trans, col)
for col in identified_cat_cols1:
    cat_transAmt_t_stat(test_trans, col)
for col in identified_cat_cols:
    cat_transAmt_t_stat(train_id, col)
for col in identified_cat_cols:
    cat_transAmt_t_stat(test_id, col)
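As a quick illustration of what the transform produces, here is a toy example on hypothetical data (not from the competition):
import pandas as pd

toy = pd.DataFrame({'card1': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'TransactionAmt': [10.0, 12.0, 11.0, 50.0, 500.0, 55.0]})
toy['t'] = toy.groupby('card1')['TransactionAmt'].transform(lambda s: (s - s.mean()) / s.sem())
print(toy)  # the 500.0 transaction gets a t-value near +2 within group 'b', flagging it as unusual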
Feature ‘V258’ Applied to ‘card_xx’ Features
Without any significant feature engineering, running the data through the RandomForest and XGBoost models yielded an interesting observation: in both models, ‘V258’ was the most powerful feature. One weakness of ‘V258’ is its large number of null values. Therefore, ‘V258’ is feature-engineered against the ‘card_xx’ features, in particular ‘card1’, which has no null values. The logic is that the information revealed in ‘V258’ will transfer to the card_xx features, particularly ‘card1’.
def card_to_V258_stat(df, ccol):
    '''This function assigns V258 statistics to the card_xx series. V258 has proven to be
    very effective in identifying fraudulent transactions, so any card info (the card_xx series)
    associated with V258 can be evaluated for fraud.'''
    df[ccol+'_V258_mean'] = df.groupby(ccol)['V258'].transform('mean')
    df[ccol+'_V258_std'] = df.groupby(ccol)['V258'].transform('std')
    return df[ccol+'_V258_mean'], df[ccol+'_V258_std']

card_xx = ['card1', 'card2', 'card3', 'card5']
for col in card_xx:
    card_to_V258_stat(train_trans, col)
Drop Columns: We set a number of rules for removing columns:
Identifying numerical values:
def num_col_def(df):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_df = df.select_dtypes(include=numerics)
    return list(numeric_df.columns)

num_cols_trans = num_col_def(inputs_trans)
num_cols_id = num_col_def(inputs_id)
Min-Max Scaler
The next step is to scale the numerical features. Before proceeding, we replaced the infinite values with np.nan.
def clean_inf_nan(df):
    return df.replace([np.inf, -np.inf], np.nan)

inputs_trans = clean_inf_nan(inputs_trans)
inputs_id = clean_inf_nan(inputs_id)
test_inputs_trans = clean_inf_nan(test_inputs_trans)
test_inputs_id = clean_inf_nan(test_inputs_id)
Once all infinite values were replaced, the numerical features were scaled.
from sklearn.preprocessing import MinMaxScaler

# fit each scaler on the training data only, then apply the same scaling to the test data
scaler_id = MinMaxScaler().fit(inputs_id[num_cols_id])
inputs_id[num_cols_id] = scaler_id.transform(inputs_id[num_cols_id])
test_inputs_id[num_cols_id] = scaler_id.transform(test_inputs_id[num_cols_id])

scaler_trans = MinMaxScaler().fit(inputs_trans[num_cols_trans])
inputs_trans[num_cols_trans] = scaler_trans.transform(inputs_trans[num_cols_trans])
test_inputs_trans[num_cols_trans] = scaler_trans.transform(test_inputs_trans[num_cols_trans])
Imputing Null Values
As discussed previously, most features in the datasets have a high percentage of null values, often exceeding 70-80% of the total. We have already removed the columns whose null values exceeded 90%. Lowering that standard to, say, 70% would mean removing 238 columns. However, because the null values do not necessarily overlap along the same rows, dropping incomplete rows instead would remove something approaching 95% of the data or more. Therefore, the standard approach of simply removing rows or columns is not an option.
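To make this concrete, here is a minimal sketch (assuming train_trans is in memory) that estimates how many rows would survive a naive dropna() after keeping only the columns below a 70% null cut-off:
# fraction of rows with no nulls among the retained columns
keep_cols = [c for c in train_trans.columns if train_trans[c].isnull().mean() < 0.7]
complete_rows = train_trans[keep_cols].dropna().shape[0]
print("{:.1%} of rows are fully populated".format(complete_rows / len(train_trans)))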
Various imputation techniques were considered: SimpleImputer, IterativeImputer, KNNImputer, MissForest, and MICE. Given the large percentage of null values in the dataset as well as its structural imbalance, SimpleImputer and KNNImputer were ruled out, since most features simply lack too much data.
“Multivariate imputation by chained equations (MICE) is an iterative approach to impute missing values. It takes an assumption that the data are missing at random, and it makes an educated guess about its true value by looking into the other sample values.” (Satyam Kumar, www.towardsdatascience.com, Feb 8, 2022). Although this technique should improve on the univariate SimpleImputer, it still depends on the availability of other features.
The MissForest imputer is built on Random Forest classifier and regressor models, which have proven to be powerful tools in recent years, and it is considered by many to be among the most effective imputation techniques. Therefore, MissForest is used for this project.
https://ragvenderrawat.medium.com/miss-forest-imputaion-the-best-way-to-handle-missing-data-feature-engineering-techniques-2e6922e5cecb
https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3
!pip install MissForest
from missforest.miss_forest import MissForest

mf = MissForest()
# note: fit_transform is applied to each dataframe independently
inputs_trans = mf.fit_transform(inputs_trans)
inputs_id = mf.fit_transform(inputs_id)
test_inputs_trans = mf.fit_transform(test_inputs_trans)
test_inputs_id = mf.fit_transform(test_inputs_id)
One-Hot Encoding
We ran the standard OneHotEncoder to encode the categorical values.
from sklearn.preprocessing import OneHotEncoder

encoder_trans = OneHotEncoder(sparse=False, handle_unknown='ignore', min_frequency=.05).fit(inputs_trans[cat_cols_trans])
encoded_cols_trans = list(encoder_trans.get_feature_names_out(cat_cols_trans))
inputs_trans[encoded_cols_trans] = encoder_trans.transform(inputs_trans[cat_cols_trans])
test_inputs_trans[encoded_cols_trans] = encoder_trans.transform(test_inputs_trans[cat_cols_trans])

encoder_id = OneHotEncoder(sparse=False, handle_unknown='ignore', min_frequency=.05).fit(inputs_id[cat_cols_id])
encoded_cols_id = list(encoder_id.get_feature_names_out(cat_cols_id))
inputs_id[encoded_cols_id] = encoder_id.transform(inputs_id[cat_cols_id])
test_inputs_id[encoded_cols_id] = encoder_id.transform(test_inputs_id[cat_cols_id])
Putting all these together, we finally end up with the training and testing data for our models. There are two sets of testing and training data — one for identity data and another for transaction data:
X_train_trans = inputs_trans[num_cols_trans + encoded_cols_trans]
X_test_trans = test_inputs_trans[num_cols_trans + encoded_cols_trans]
X_train_id = inputs_id[num_cols_id + encoded_cols_id]
X_test_id = test_inputs_id[num_cols_id + encoded_cols_id]

target_trans = train_trans['isFraud'].to_frame()
target_id = train_id['isFraud'].to_frame()
Train-Test Split
For ML model development, we train-test split the data.
X_train_trans: the shape of X_train_trans is large at (590540, 373), taking up 904.5 MB of memory, so the sampling rate is 20%.
X_train_id: similarly, X_train_id is sampled at 40%, since its data size is smaller.
combo_trans = X_train_trans.join(target_trans)
combo_trans = combo_trans.sample(frac=.20, random_state=42)
target_trans_sample = combo_trans['isFraud']
train_trans_sample = combo_trans.drop('isFraud', axis=1)

from sklearn.model_selection import train_test_split

train_t_inputs, val_t_inputs, train_t_targets, val_t_targets = train_test_split(
    train_trans_sample, target_trans_sample, test_size=0.25, random_state=42)

combo_id = X_train_id.join(target_id)
combo_id = combo_id.sample(frac=.40, random_state=42)
target_id_sample = combo_id['isFraud']
train_id_sample = combo_id.drop('isFraud', axis=1)

train_i_inputs, val_i_inputs, train_i_targets, val_i_targets = train_test_split(
    train_id_sample, target_id_sample, test_size=0.25, random_state=42)
A base model is built with RandomForest on the transaction data.
from sklearn.ensemble import RandomForestClassifier

rf_basemodel_tran = RandomForestClassifier(random_state=42, n_jobs=-1)
rf_basemodel_tran.fit(train_t_inputs, train_t_targets)
base_train_t_acc = rf_basemodel_tran.score(train_t_inputs, train_t_targets)
base_val_t_acc = rf_basemodel_tran.score(val_t_inputs, val_t_targets)
base_t_accs = base_train_t_acc, base_val_t_acc
print('BaseModel for Trans Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(base_train_t_acc*100, base_val_t_acc*100))
> BaseModel for Trans Data: Train Accuracy Score: 99.997742%, Val Accuracy Score: 97.615742%
Initial examination revealed that the base model picked up many of the engineered features. As expected, the accuracy scores for the training and validation data were extremely high, at 99.997% and 97.616%, respectively.
Hyperparameter Tuning — RF Model for Transaction Data
For hyperparameter tuning, the following functions are used:
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# train_inputs/val_inputs refer to whichever split is being tuned
# (e.g., train_t_inputs/val_t_inputs for the transaction data)
def test_params(**params):
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(train_inputs, train_targets)
    train_rmse = mean_squared_error(model.predict(train_inputs), train_targets, squared=False)
    val_rmse = mean_squared_error(model.predict(val_inputs), val_targets, squared=False)
    return train_rmse, val_rmse

def test_param_and_plot(param_name, param_values):
    train_errors, val_errors = [], []
    for value in param_values:
        params = {param_name: value}
        train_rmse, val_rmse = test_params(**params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.figure(figsize=(10,6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.ylabel('RMSE')
    plt.legend(['Training', 'Validation'])
    for x, y in zip(param_values, val_errors):
        label = "{:.5f}".format(y)
        plt.annotate(label,                      # this is the text
                     (x, y),                     # these are the coordinates to position the label
                     textcoords="offset points", # how to position the text
                     xytext=(0,-10),             # distance from text to points (x,y)
                     fontsize=9,
                     ha='center')                # horizontal alignment can be left, right or center
    return
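As an example of how these helpers might be invoked (the exact parameter grids are not shown in this excerpt; the values below are illustrative):
# point the helpers at the transaction split, then sweep one parameter at a time
train_inputs, val_inputs = train_t_inputs, val_t_inputs
train_targets, val_targets = train_t_targets, val_t_targets
test_param_and_plot('max_depth', [15, 25, 35, 45, 55])
test_param_and_plot('n_estimators', [25, 50, 75, 100])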
Putting it together — RandomForest Model for Transaction Data
Based on what we learned from hyperparameter tuning, we put it all together for the tuned RF model.
rf_trans_model = RandomForestClassifier(n_jobs=-1,
                                        random_state=42,
                                        n_estimators=75,
                                        max_features=100,
                                        max_depth=45)
rf_trans_model.fit(X_train_t_inputs, X_train_t_targets)  # fit the tuned model before scoring
X_train_t_acc = rf_trans_model.score(X_train_t_inputs, X_train_t_targets)
X_val_t_acc = rf_trans_model.score(X_val_t_inputs, X_val_t_targets)
print('Hypertuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(X_train_t_acc*100, X_val_t_acc*100))
> Hypertuned Model: Train Accuracy Score: 99.708515%, Val Accuracy Score: 98.180648%
The validation score improved by 0.565 percentage points. Notably, fewer engineered features were picked up.
Below are the accuracy scores for the base model built on the identity data:
# the identity base model is assumed to be defined analogously to the transaction base model
rf_basemodel_id = RandomForestClassifier(random_state=42, n_jobs=-1)
rf_basemodel_id.fit(train_i_inputs, train_i_targets)
base_train_i_acc = rf_basemodel_id.score(train_i_inputs, train_i_targets)
base_val_i_acc = rf_basemodel_id.score(val_i_inputs, val_i_targets)
base_i_accs = base_train_i_acc, base_val_i_acc
print('BaseModel for ID Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(base_train_i_acc*100, base_val_i_acc*100))
> BaseModel for ID Data: Train Accuracy Score: 100.000000%, Val Accuracy Score: 96.519689%
Hyperparameter Tuning — RF Model for Identity Data
Putting it together — RandomForest Model for Identity Data
Based on what we learned from hyperparameter tuning, we put it all together for the final model.
rf_id_model = RandomForestClassifier(n_jobs=-1,
                                     random_state=42,
                                     n_estimators=163,
                                     max_features=100,
                                     max_depth=35)
rf_id_model.fit(X_train_i_inputs, X_train_i_targets)  # fit the tuned model before scoring
X_train_i_acc = rf_id_model.score(X_train_i_inputs, X_train_i_targets)
X_val_i_acc = rf_id_model.score(X_val_i_inputs, X_val_i_targets)
print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(X_train_i_acc*100, X_val_i_acc*100))
Compared to the base model, the hyper-tuned model yields a 0.831 percentage point improvement. Again, the notable observation is that not many engineered features are selected.
We begin by building the XGBoost base model for the transaction data:
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

xgb_base_trans = XGBClassifier(n_jobs=-1, random_state=42)
print(xgb_base_trans)
%%time
xgb_base_trans.fit(train_t_inputs,train_t_targets)
xgb_base_train_t_acc = xgb_base_trans.score(train_t_inputs,train_t_targets)
xgb_base_val_t_acc = xgb_base_trans.score(val_t_inputs, val_t_targets)
print('BaseModel for Trans Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_base_train_t_acc *100, xgb_base_val_t_acc*100))
> BaseModel for Trans Data: Train Accuracy Score: 98.629503%, Val Accuracy Score: 97.768144%
Hyperparameter Tuning — XGBoost BaseModel for Transaction Data
#1) max_depth specifies the maximum depth of each tree in the XGBoost ensemble.
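The full sweep code is not included in this excerpt; a minimal analogue of the earlier test_params helper, adapted for XGBoost under the same transaction split (an assumption on my part), might look like:
def test_xgb_params(**params):
    # fit an XGBoost model with the candidate parameters and report train/val accuracy
    model = XGBClassifier(n_jobs=-1, random_state=42, **params)
    model.fit(train_t_inputs, train_t_targets)
    return (model.score(train_t_inputs, train_t_targets),
            model.score(val_t_inputs, val_t_targets))

print(test_xgb_params(max_depth=12))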
Putting it together — XGBoost Model for Transaction Data
Based on what we learned from hyperparameter tuning, we put it all together for the final model.
xgb_trans_model = XGBClassifier(n_jobs=-1,
                                random_state=42,
                                max_depth=12,
                                eta=.4,
                                gamma=.5,
                                subsample=.8,
                                max_delta_step=3,
                                alpha=.3)
xgb_trans_model.fit(X_train_t_inputs, X_train_t_targets)
xgb_train_t_acc = xgb_trans_model.score(X_train_t_inputs, X_train_t_targets)
xgb_val_t_acc = xgb_trans_model.score(X_val_t_inputs, X_val_t_targets)
print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_train_t_acc*100, xgb_val_t_acc*100))
> Hyper-tuned Model: Train Accuracy Score: 99.734932%, Val Accuracy Score: 98.479358%
Hyperparameter tuning improved the model performance by 0.71 percentage points.
Feature ‘V258’ again stood out as the most powerful predictor.
We now build the XGBoost base model for the identity data:
xgb_base_id = XGBClassifier(n_jobs=-1, random_state=42)
xgb_base_id.fit(train_i_inputs, train_i_targets)
xgb_base_train_i_acc = xgb_base_id.score(train_i_inputs, train_i_targets)
xgb_base_val_i_acc = xgb_base_id.score(val_i_inputs, val_i_targets)
print('BaseModel for ID Data: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_base_train_i_acc*100, xgb_base_val_i_acc*100))
> BaseModel for ID Data: Train Accuracy Score: 99.144884%, Val Accuracy Score: 97.178314%
Hyperparameter Tuning — XGBoost BaseModel for Identity Data
#1) max_depth specifies the maximum depth of each tree in the XGBoost ensemble.
Putting it together — XGBoost Model for Identity Data
Based on what we learned from hyperparameter tuning, we put it all together for the final model.
xgb_id_model = XGBClassifier(n_jobs=-1,
                             random_state=42,
                             n_estimators=150,
                             max_depth=10,
                             eta=.3,
                             gamma=.5,
                             subsample=.8,
                             max_delta_step=3,
                             alpha=.5)
xgb_id_model.fit(X_train_i_inputs, X_train_i_targets)
xgb_train_i_acc = xgb_id_model.score(X_train_i_inputs, X_train_i_targets)
xgb_val_i_acc = xgb_id_model.score(X_val_i_inputs, X_val_i_targets)
print('Hyper-tuned Model: Train Accuracy Score: {0:3f}%, Val Accuracy Score: {1:3f}%'.format(xgb_train_i_acc*100, xgb_val_i_acc*100))
> Hyper-tuned Model: Train Accuracy Score: 99.975965%, Val Accuracy Score: 97.878477%
The validation score improved by 0.70 percentage points.
XGBoost Transaction & Identity Model Predictions & Combining
With both models tuned, we generate the predictions and combine them.
The four submission dataframes are rf_submission_trans_df and rf_submission_id_df (RandomForest) and submission_t_xgb and submission_i_xgb (XGBoost).
Predictions based on the identity data are deemed superior, since the identity data is merged with the transaction data. However, the identity data again covers only about a quarter of the transactions. Therefore, approximately one quarter of the predictions are based on the identity + transaction data, while the remainder are based on the transaction data alone.
Below is the code to combine the two predictions:
rf_submission = rf_submission_trans_df.merge(rf_submission_id_df, how='left', on='TransactionID')
rf_submission['isFraud'] = np.where(rf_submission['isFraud_y'].isnull(), rf_submission['isFraud_x'], rf_submission['isFraud_y'])
rf_submission.drop(['isFraud_x', 'isFraud_y'], axis=1, inplace=True)
rf_submission.to_csv('rf_submission.csv', index=False)

xgb_submission = submission_t_xgb.merge(submission_i_xgb, how='left', on='TransactionID')
xgb_submission['isFraud'] = np.where(xgb_submission['isFraud_y'].isnull(), xgb_submission['isFraud_x'], xgb_submission['isFraud_y'])
xgb_submission.drop(['isFraud_x', 'isFraud_y'], axis=1, inplace=True)
xgb_submission.to_csv('xgb_submission.csv', index=False)
The predicted results from the RandomForest and XGBoost models were submitted to the Kaggle competition. Not surprisingly, the results are poor, likely placing me in the bottom quartile. More creative feature engineering and more detailed hyperparameter tuning would be required to improve the models.
One interesting observation is that the RandomForest model had a slightly better score, even though XGBoost is commonly considered the superior ML model. The poor performance can only be attributed to the inexperienced model-builder: this is my first ML model project!
One might expect the predictions based on the merged identity data to be more accurate than those based on the transaction data alone. The finding is the contrary: the predictions based on the transaction data are more accurate.
Examination of the Feature Importance tables reveals a few interesting observations:
Below is a summary of ‘V258’. The data is extracted from the Excel sheet from the earlier EDA — IEEE_C_Card_Fraud_Worksheet.xlsx
V258 has 78% null values, but the fraud rate within the null population is only 2.27%, versus 7.81% in the non-null population. The mode of V258 is “1”, with a fraud rate of 3.97%, still higher than the 3.5% average. The fraud rate climbs as the feature value increases, reaching 78.21% when the value reaches 5. In addition, as the frequency of a value’s occurrence increases, so does the fraud rate.
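The same summary can be reproduced directly in pandas; a minimal sketch, assuming train_trans is still in memory:
# fraud rate and count for each V258 value, plus the null population
v258_summary = train_trans.groupby('V258')['isFraud'].agg(['mean', 'size'])
print(v258_summary.head(10))
null_mask = train_trans['V258'].isnull()
print("Null share: {:.0%}, fraud rate in null population: {:.2%}".format(null_mask.mean(), train_trans.loc[null_mask, 'isFraud'].mean()))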
Originally published at http://github.com.