SHAP (6): Credit Card Fraud Detection with XGBoost and HyperOpt
This notebook demonstrates an XGBoost classifier applied in the financial industry, specifically to credit card fraud detection. After the XGBoost classifier is built, its parameters are tuned with the HyperOpt library (an alternative to sklearn's GridSearchCV and RandomizedSearchCV algorithms), with the goal of maximizing the f1 score for classifying normal versus fraudulent transactions. As part of model evaluation, the f1 score is computed, a confusion matrix is built, a classification report is generated, and a precision-recall curve is plotted. Finally, feature importances are computed and plotted both with XGBoost's built-in algorithm and with the SHAP implementation of feature importance.
Source: https://github.com/albazahm/Credit_Card_Fraud_Detection_with_XGBoost_and_HyperOpt/tree/master
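Since the f1 score is the optimization target throughout, a quick illustration of the metric may help (a minimal sketch, not part of the original notebook): f1 is the harmonic mean of precision and recall, which makes it far more informative than accuracy when fraud is a tiny minority class.

#toy labels: 3 actual frauds, 3 predicted frauds, 2 of the predictions correct
from sklearn.metrics import f1_score, precision_score, recall_score
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
print(2 * p * r / (p + r))           # harmonic mean = 0.667, identical to f1_score(y_true, y_pred)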
1. Loading Libraries and Data
#loading libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, make_scorer, confusion_matrix, classification_report, precision_recall_curve, plot_precision_recall_curve, average_precision_score, auc
from sklearn.model_selection import train_test_split
import seaborn as sns
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
import xgboost as xgb
import shap
# Any results you write to the current directory are saved as output.

/kaggle/input/creditcardfraud/creditcard.csv

#loading the data into a dataframe
credit_df = pd.read_csv('./creditcard.csv')

2. Data Overview
#preview of the first 10 rows of data
credit_df.head(10)

[Output: the first 10 rows of the dataframe — 10 rows × 31 columns: Time, V1–V28, Amount, Class]
#displaying descriptive statistics
credit_df.describe()

[Output: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for all 31 columns — 8 rows × 31 columns. Notably, the mean of Class is about 0.0017, i.e. only roughly 0.17% of transactions are fraudulent, so the classes are heavily imbalanced.]
#exploring datatypes and count of non-NULL rows for each feature
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

3. Data Preparation
Here we find and remove duplicate observations in the data, define the independent (X) and dependent (Y) variables for classification, and split off a validation set and a test set.
#checking for duplicated observations
credit_df.duplicated().value_counts()

False    283726
True       1081
dtype: int64

#dropping duplicated observations
credit_df = credit_df.drop_duplicates()

#defining independent (X) and dependent (Y) variables from dataframe
X = credit_df.drop(columns = 'Class')
Y = credit_df['Class'].values

#splitting a testing set from the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, stratify = Y, random_state = 42)
#splitting a validation set from the training set to tune parameters
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.20, stratify = Y_train, random_state = 42)
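Because fraud makes up only about 0.17% of the rows, the stratify argument matters here; a quick sanity check (a small sketch that is not part of the original notebook) confirms that each split preserves the fraud rate:

#verifying that the stratified splits preserve the fraud rate (illustrative check, not in the original)
for name, y in [('train', Y_train), ('val', Y_val), ('test', Y_test)]:
    print('{} fraud rate: {:.4%}'.format(name, y.mean()))

4. Model Set-Up and Training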
In this section, we create a scorer based on the f1 metric and define the parameter search space for the XGBoost model. We also define a function that wraps the classifier, extracts its predictions, computes the loss, and feeds it to the optimizer. Finally, we initialize the optimizer with the desired settings, run it, and inspect the parameters and scores across trials.
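Before the full objective function below, here is a minimal sketch of the HyperOpt pattern on a toy problem, using the hyperopt imports from Section 1 (the names toy_objective, toy_trials, and toy_best are illustrative, not from the original notebook). fmin minimizes whatever is returned under the 'loss' key, which is why a score that should be maximized is returned negated:

#toy HyperOpt run: maximize -(x - 3)^2 by minimizing its negation
def toy_objective(params):
    score = -(params['x'] - 3) ** 2           #stand-in for a model metric to maximize
    return {'loss': -score, 'status': STATUS_OK}

toy_trials = Trials()
toy_best = fmin(fn = toy_objective,
                space = {'x': hp.uniform('x', 0, 10)},
                algo = tpe.suggest,
                max_evals = 50,
                trials = toy_trials)
print(toy_best)                               #expected to land near {'x': 3.0}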
#creating a scorer from the f1-score metric
f1_scorer = make_scorer(f1_score)

# defining the space for hyperparameter tuning
space = {
    'eta': hp.uniform('eta', 0.1, 1),
    'max_depth': hp.quniform('max_depth', 3, 18, 1),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_alpha': hp.quniform('reg_alpha', 50, 200, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': hp.quniform('n_estimators', 100, 200, 10)
}

#defining function to optimize
def hyperparameter_tuning(space):
    clf = xgb.XGBClassifier(
        n_estimators = int(space['n_estimators']),      #number of trees to use
        eta = space['eta'],                             #learning rate
        max_depth = int(space['max_depth']),            #depth of trees
        gamma = space['gamma'],                         #loss reduction required to further partition tree
        reg_alpha = int(space['reg_alpha']),            #L1 regularization for weights
        reg_lambda = space['reg_lambda'],               #L2 regularization for weights
        min_child_weight = space['min_child_weight'],   #minimum sum of instance weight needed in child
        colsample_bytree = space['colsample_bytree'],   #ratio of column sampling for each tree
        nthread = -1)                                   #number of parallel threads used

    evaluation = [(X_train, Y_train), (X_val, Y_val)]

    clf.fit(X_train, Y_train,
            eval_set = evaluation,
            early_stopping_rounds = 10,                 #stop if the validation metric does not improve for 10 rounds
            verbose = False)

    pred = clf.predict(X_val)
    pred = [1 if i > 0.5 else 0 for i in pred]
    f1 = f1_score(Y_val, pred)
    print('SCORE:', f1)
    return {'loss': -f1, 'status': STATUS_OK}

# run the hyperparameter tuning
trials = Trials()
best = fmin(fn = hyperparameter_tuning,
            space = space,
            algo = tpe.suggest,
            max_evals = 100,
            trials = trials)

print(best)

SCORE: 0.7552447552447553
SCORE: 0.0
SCORE: 0.0
... (one SCORE line per trial, 100 trials in total; abridged here) ...
SCORE: 0.8169014084507042
SCORE: 0.7910447761194029
100%|██████████| 100/100 [11:24<00:00, 6.84s/trial, best loss: -0.8201438848920864]
{'colsample_bytree': 0.9999995803500363, 'eta': 0.1316102455832729, 'gamma': 1.6313395777817137, 'max_depth': 5.0, 'min_child_weight': 3.0, 'n_estimators': 100.0, 'reg_alpha': 47.0, 'reg_lambda': 0.4901343161108276}

#plotting feature space and f1-scores for the different trials
parameters = space.keys()
cols = len(parameters)
f, axes = plt.subplots(nrows = 1, ncols = cols, figsize = (20, 5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    xs, ys = zip(*sorted(zip(xs, ys)))
    axes[i].scatter(xs, ys, s = 20, linewidth = 0.01, alpha = 0.25, c = cmap(float(i)/len(parameters)))
    axes[i].set_title(val)
    axes[i].grid()

#printing best model parameters
print(best)

{'colsample_bytree': 0.9999995803500363, 'eta': 0.1316102455832729, 'gamma': 1.6313395777817137, 'max_depth': 5.0, 'min_child_weight': 3.0, 'n_estimators': 100.0, 'reg_alpha': 47.0, 'reg_lambda': 0.4901343161108276}

5. Model Test and Evaluation
This section explores and visualizes how the model performs on the test data.
#initializing XGBoost Classifier with best model parameters
best_clf = xgb.XGBClassifier(n_estimators = int(best['n_estimators']),
                             eta = best['eta'],
                             max_depth = int(best['max_depth']),
                             gamma = best['gamma'],
                             reg_alpha = int(best['reg_alpha']),
                             min_child_weight = best['min_child_weight'],
                             colsample_bytree = best['colsample_bytree'],
                             nthread = -1)

#fitting XGBoost Classifier with best model parameters to training data
best_clf.fit(X_train, Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.9999995803500363,
              eta=0.1316102455832729, gamma=1.6313395777817137,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=3.0, missing=None, n_estimators=100, n_jobs=1,
              nthread=-1, objective='binary:logistic', random_state=0,
              reg_alpha=47, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

#using the model to predict on the test set
Y_pred = best_clf.predict(X_test)

#printing f1 score of test set predictions
print('The f1-score on the test data is: {0:.2f}'.format(f1_score(Y_test, Y_pred)))

The f1-score on the test data is: 0.74

#creating a confusion matrix and labels
cm = confusion_matrix(Y_test, Y_pred)
labels = ['Normal', 'Fraud']

#plotting the confusion matrix
sns.heatmap(cm, annot = True, xticklabels = labels, yticklabels = labels, fmt = 'd')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Credit Card Fraud Detection')

Text(0.5, 1.0, 'Confusion Matrix for Credit Card Fraud Detection')

#printing classification report
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56651
           1       0.87      0.64      0.74        95

    accuracy                           1.00     56746
   macro avg       0.94      0.82      0.87     56746
weighted avg       1.00      1.00      1.00     56746

Y_score = best_clf.predict_proba(X_test)[:, 1]
average_precision = average_precision_score(Y_test, Y_score)
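One compatibility note before the plot below: plot_precision_recall_curve was deprecated in scikit-learn 1.0 and removed in 1.2. On newer scikit-learn versions, an equivalent plot can be drawn with PrecisionRecallDisplay (a sketch assuming scikit-learn >= 1.0):

#equivalent on scikit-learn >= 1.0, where plot_precision_recall_curve no longer exists
from sklearn.metrics import PrecisionRecallDisplay
disp = PrecisionRecallDisplay.from_estimator(best_clf, X_test, Y_test)
disp.ax_.set_title('Precision-Recall Curve: AP={0:.2f}'.format(average_precision))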
fig = plot_precision_recall_curve(best_clf, X_test, Y_test)
fig.ax_.set_title('Precision-Recall Curve: AP={0:.2f}'.format(average_precision))

Text(0.5, 1.0, 'Precision-Recall Curve: AP=0.74')

6. Feature Importances
This section presents two algorithms, one built into XGBoost and one from SHAP, for visualizing feature importances. Unfortunately, because the features in this dataset were encoded with principal component analysis (PCA), we cannot draw intuitive conclusions about how the model distinguishes normal from fraudulent transactions in real-world terms.
#extracting the booster from model
booster = best_clf.get_booster()

# scoring features based on information gain
importance = booster.get_score(importance_type = 'gain')

#rounding importances to 2 decimal places
for key in importance.keys():
    importance[key] = round(importance[key], 2)

# plotting feature importances
ax = xgb.plot_importance(importance, importance_type = 'gain', show_values = True)
plt.title('Feature Importances (Gain)')
plt.show()
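Gain is only one of the importance definitions the booster can report; per the XGBoost documentation, get_score also accepts 'weight', 'cover', 'total_gain', and 'total_cover'. A small illustrative sketch, not part of the original notebook, for comparing the top features under a few of these definitions:

#comparing XGBoost's built-in importance types for the fitted model (illustrative sketch)
for imp_type in ['weight', 'gain', 'cover']:
    scores = booster.get_score(importance_type = imp_type)
    top5 = sorted(scores.items(), key = lambda kv: kv[1], reverse = True)[:5]
    print(imp_type, '->', top5)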
#obtaining SHAP values for XGBoost Model
explainer = shap.TreeExplainer(best_clf)
shap_values = explainer.shap_values(X_train)

#plotting SHAP Values of Feature Importances
shap.summary_plot(shap_values, X_train)
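For a more compact view, the same SHAP values can be summarized as a bar chart of the mean absolute SHAP value per feature; this one-line variant uses the same shap API (an optional extra, not in the original notebook):

#bar-chart variant: mean(|SHAP value|) per feature
shap.summary_plot(shap_values, X_train, plot_type = 'bar')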