Credit Risk Prediction#
This is the German credit risk dataset, which can be found on Kaggle (German Credit Risk). My goal is to create a predictive model, use it to generate a score for each client, and ultimately classify clients into risk profiles, differentiating the riskiest from the least risky.
Introduction#
Context
Each person is classified as having good or bad credit risk according to the set of attributes. The selected attributes are:
Age (numeric)
Sex (text: male, female)
Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
Housing (text: own, rent, or free)
Saving accounts (text - little, moderate, quite rich, rich)
Checking account (text - little, moderate, rich)
Credit amount (numeric, in DM)
Duration (numeric, in months)
Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
Risk (target value: good or bad risk)
The business team came to you because they want to understand the behavior and profile of the riskiest clients, and our goal here is to create a predictive model to help them.
My plan is to build a prediction model, use Optuna for hyperparameter optimization, rank clients into score groups, and then use SHAP to identify the most important variables.
Dataset#
import numpy as np
import pandas as pd
import sys
import timeit
import gc
import sklearn
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import lightgbm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from collections import Counter
# read the dataset
df = pd.read_csv('german_credit_data.csv')
df
Unnamed: 0 | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose | Risk | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 67 | male | 2 | own | NaN | little | 1169 | 6 | radio/TV | good |
1 | 1 | 22 | female | 2 | own | little | moderate | 5951 | 48 | radio/TV | bad |
2 | 2 | 49 | male | 1 | own | little | NaN | 2096 | 12 | education | good |
3 | 3 | 45 | male | 2 | free | little | little | 7882 | 42 | furniture/equipment | good |
4 | 4 | 53 | male | 2 | free | little | little | 4870 | 24 | car | bad |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 995 | 31 | female | 1 | own | little | NaN | 1736 | 12 | furniture/equipment | good |
996 | 996 | 40 | male | 3 | own | little | little | 3857 | 30 | car | good |
997 | 997 | 38 | male | 2 | own | little | NaN | 804 | 12 | radio/TV | good |
998 | 998 | 23 | male | 2 | free | little | little | 1845 | 45 | radio/TV | bad |
999 | 999 | 27 | male | 2 | own | moderate | moderate | 4576 | 45 | car | good |
1000 rows × 11 columns
Looking at the types of data
Null counts and unique values
# inspect the shape of the data and look for missing values
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1000 non-null int64
1 Age 1000 non-null int64
2 Sex 1000 non-null object
3 Job 1000 non-null int64
4 Housing 1000 non-null object
5 Saving accounts 817 non-null object
6 Checking account 606 non-null object
7 Credit amount 1000 non-null int64
8 Duration 1000 non-null int64
9 Purpose 1000 non-null object
10 Risk 1000 non-null object
dtypes: int64(5), object(6)
memory usage: 86.1+ KB
None
# count the unique values per column
print(df.nunique())
Unnamed: 0 1000
Age 53
Sex 2
Job 4
Housing 3
Saving accounts 4
Checking account 3
Credit amount 921
Duration 33
Purpose 8
Risk 2
dtype: int64
EDA#
Let’s start by looking at the target variable and its distribution. Here I’ll show only the variables whose distributions I found interesting; for the others, see the notebook in the GitHub repository.
df_age = df['Age'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Age'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Age'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_age,
histnorm='probability',
name="Overall Age"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Age",
xaxis = dict(title="Age")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
We can see that people with bad credit tend to be younger.
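A quick numeric check of this reading (a minimal sketch against the df loaded above, which should show a lower mean and median age for the bad-credit group):
# average and median age per risk label
df.groupby('Risk')['Age'].agg(['mean', 'median'])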
df_housing = df['Housing'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Housing'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Housing'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_housing,
histnorm='probability',
name="Overall Housing"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Housing",
xaxis = dict(title="Housing")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
People who own a house tend to have better credit.
df_saving = df['Saving accounts'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Saving accounts'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Saving accounts'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_saving,
histnorm='probability',
name="Overall saving"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Saving",
xaxis = dict(title="Saving")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
People with larger savings also tend to have better credit.
df_checking = df['Checking account'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Checking account'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Checking account'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_checking,
histnorm='probability',
name="Overall checking account"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Checking Account",
xaxis = dict(title="Checking Account")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
Likewise, people with more money in their checking account tend to have better credit.
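A cross-tabulation is a quick way to verify this; normalizing each row gives the share of good and bad credit within each checking-account level (a sketch using the raw df):
# share of good/bad credit per checking-account level
pd.crosstab(df['Checking account'], df['Risk'], normalize='index')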
df_credit = df['Credit amount'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Credit amount'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Credit amount'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_credit,
histnorm='probability',
name="Overall Credit amount"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Credit amount",
xaxis = dict(title="Credit amount")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
People with a credit amount above 4,000 DM have worse credit than those below it.
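A rough check of this cutoff; the 4,000 DM threshold is read off the chart above, so treat it as illustrative:
# bad-credit rate above vs. below a 4,000 DM credit amount
high_amount = df['Credit amount'] > 4000
df.groupby(high_amount)['Risk'].apply(lambda s: (s == 'bad').mean())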
df_purpose = df['Purpose'].values.tolist()
df_good = df.loc[df["Risk"] == 'good']['Purpose'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Purpose'].values.tolist()
hist_1 = go.Histogram(
x=df_good,
histnorm='probability',
name="Good Credit"
)
hist_2 = go.Histogram(
x=df_bad,
histnorm='probability',
name="Bad Credit"
)
hist_3 = go.Histogram(
x=df_purpose,
histnorm='probability',
name="Overall Purpose"
)
data = [hist_1, hist_2, hist_3]
layout = dict(
title="Type of Credit by Purpose",
xaxis = dict(title="Purpose")
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
People whose loan purpose is radio/TV tend to have better credit.
Now let’s see the distribution using two variables
df_good = df.loc[df["Risk"] == 'good']['Checking account'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Checking account'].values.tolist()
box_1 = go.Box(
x=df_good,
y=df['Credit amount'],
name="Good Credit"
)
box_2 = go.Box(
x=df_bad,
y=df['Credit amount'],
name="Bad Credit"
)
data = [box_1, box_2]
layout = go.Layout(
yaxis=dict(
title='Credit Amount by Checking Account'
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='box-age-cat')
The credit amount is also lower for clients with a ‘rich’ checking account, even among those with bad credit.
df_good = df.loc[df["Risk"] == 'good']['Job'].values.tolist()
df_bad = df.loc[df["Risk"] == 'bad']['Job'].values.tolist()
box_1 = go.Box(
x=df_good,
y=df['Credit amount'],
name="Good Credit"
)
box_2 = go.Box(
x=df_bad,
y=df['Credit amount'],
name="Bad Credit"
)
data = [box_1, box_2]
layout = go.Layout(
yaxis=dict(
title='Job'
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='box-age-cat')
Unskilled and non-resident clients with bad credit take out larger credit amounts than the others.
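To confirm this boxplot reading numerically, a quick sketch like the following shows the typical credit amount per job level and risk label:
# median credit amount by job level and risk label
df.groupby(['Job', 'Risk'])['Credit amount'].median()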
Preprocessing#
df.dtypes
Unnamed: 0 int64
Age int64
Sex object
Job int64
Housing object
Saving accounts object
Checking account object
Credit amount int64
Duration int64
Purpose object
Risk object
dtype: object
df.isna().sum()
Unnamed: 0 0
Age 0
Sex 0
Job 0
Housing 0
Saving accounts 183
Checking account 394
Credit amount 0
Duration 0
Purpose 0
Risk 0
dtype: int64
We will use one-hot encoding for the sex, housing, and purpose variables.
one_hot = {
"Sex": "sex",
"Housing": "hous",
"Purpose": "purp"
}
And ordinal encoding for the others. Note that missing values in the two account columns are mapped to 0, one level below ‘little’:
ordinal_encoding = {
"Saving accounts": {
None: 0,
"little": 1,
"moderate": 2,
"quite rich": 3,
"rich": 4,
},
"Checking account": {
None: 0,
"little": 1,
"moderate": 2,
"rich": 3,
},
"Risk": {
"bad": 1,
"good": 0,
}
}
def one_hot_encoding(df, col_prefix: dict):
    # one-hot encode each column in col_prefix, using the mapped prefix
    df = df.copy()
    for col, prefix in col_prefix.items():
        df = pd.get_dummies(data=df, prefix=prefix, columns=[col])
    return df
def encode_ordinal(df, custom_ordinals: dict):
    # map each column to its ordinal codes (the None key also catches NaN)
    df = df.copy()
    for col, map_dict in custom_ordinals.items():
        df[col] = df[col].replace(map_dict)
    return df
df_encode = df.copy()
df_encode = one_hot_encoding(df_encode, one_hot)
df_encode = encode_ordinal(df_encode, ordinal_encoding)
df_encode
Unnamed: 0 | Age | Job | Saving accounts | Checking account | Credit amount | Duration | Risk | sex_female | sex_male | ... | hous_own | hous_rent | purp_business | purp_car | purp_domestic appliances | purp_education | purp_furniture/equipment | purp_radio/TV | purp_repairs | purp_vacation/others | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 67 | 2 | 0 | 1 | 1169 | 6 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 1 | 22 | 2 | 1 | 2 | 5951 | 48 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 2 | 49 | 1 | 1 | 0 | 2096 | 12 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 3 | 45 | 2 | 1 | 1 | 7882 | 42 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 4 | 53 | 2 | 1 | 1 | 4870 | 24 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 995 | 31 | 1 | 1 | 0 | 1736 | 12 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
996 | 996 | 40 | 3 | 1 | 1 | 3857 | 30 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
997 | 997 | 38 | 2 | 1 | 0 | 804 | 12 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
998 | 998 | 23 | 2 | 1 | 1 | 1845 | 45 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
999 | 999 | 27 | 2 | 2 | 2 | 4576 | 45 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1000 rows × 21 columns
df_encode.dtypes
Unnamed: 0 int64
Age int64
Job int64
Saving accounts int64
Checking account int64
Credit amount int64
Duration int64
Risk int64
sex_female uint8
sex_male uint8
hous_free uint8
hous_own uint8
hous_rent uint8
purp_business uint8
purp_car uint8
purp_domestic appliances uint8
purp_education uint8
purp_furniture/equipment uint8
purp_radio/TV uint8
purp_repairs uint8
purp_vacation/others uint8
dtype: object
df_encode.isna().sum()
Unnamed: 0 0
Age 0
Job 0
Saving accounts 0
Checking account 0
Credit amount 0
Duration 0
Risk 0
sex_female 0
sex_male 0
hous_free 0
hous_own 0
hous_rent 0
purp_business 0
purp_car 0
purp_domestic appliances 0
purp_education 0
purp_furniture/equipment 0
purp_radio/TV 0
purp_repairs 0
purp_vacation/others 0
dtype: int64
# Check for duplicate rows
df.duplicated().sum()
0
df_encode.corr()['Risk'].sort_values()
hous_own -0.134589
purp_radio/TV -0.106922
Age -0.091127
sex_male -0.075493
Saving accounts -0.033871
purp_domestic appliances 0.008016
purp_repairs 0.020828
purp_furniture/equipment 0.020971
purp_car 0.022621
purp_vacation/others 0.028058
Job 0.032735
Unnamed: 0 0.034606
purp_business 0.036129
purp_education 0.049085
sex_female 0.075493
hous_free 0.081556
hous_rent 0.092785
Credit amount 0.154739
Checking account 0.197788
Duration 0.214927
Risk 1.000000
Name: Risk, dtype: float64
df_encode.columns
Index(['Unnamed: 0', 'Age', 'Job', 'Saving accounts', 'Checking account',
'Credit amount', 'Duration', 'Risk', 'sex_female', 'sex_male',
'hous_free', 'hous_own', 'hous_rent', 'purp_business', 'purp_car',
'purp_domestic appliances', 'purp_education',
'purp_furniture/equipment', 'purp_radio/TV', 'purp_repairs',
'purp_vacation/others'],
dtype='object')
Getting all the columns that we are going to use in our model:
model_cols = ['Age', 'Job', 'Saving accounts', 'Checking account',
'Credit amount', 'Duration', 'sex_female', 'sex_male',
'hous_free', 'hous_own', 'hous_rent', 'purp_business', 'purp_car',
'purp_domestic appliances', 'purp_education',
'purp_furniture/equipment', 'purp_radio/TV', 'purp_repairs',
'purp_vacation/others']
df_encode.loc[df_encode['Risk']==0].mean()
Unnamed: 0 492.960000
Age 36.224286
Job 1.890000
Saving accounts 1.211429
Checking account 0.877143
Credit amount 2985.457143
Duration 19.207143
Risk 0.000000
sex_female 0.287143
sex_male 0.712857
hous_free 0.091429
hous_own 0.752857
hous_rent 0.155714
purp_business 0.090000
purp_car 0.330000
purp_domestic appliances 0.011429
purp_education 0.051429
purp_furniture/equipment 0.175714
purp_radio/TV 0.311429
purp_repairs 0.020000
purp_vacation/others 0.010000
dtype: float64
df_encode.loc[df_encode['Risk']==1].mean()
Unnamed: 0 514.760000
Age 33.963333
Job 1.936667
Saving accounts 1.140000
Checking account 1.290000
Credit amount 3938.126667
Duration 24.860000
Risk 1.000000
sex_female 0.363333
sex_male 0.636667
hous_free 0.146667
hous_own 0.620000
hous_rent 0.233333
purp_business 0.113333
purp_car 0.353333
purp_domestic appliances 0.013333
purp_education 0.076667
purp_furniture/equipment 0.193333
purp_radio/TV 0.206667
purp_repairs 0.026667
purp_vacation/others 0.016667
dtype: float64
Some correlations with the target (absolute values, sorted):
df_encode.astype(float).corr().abs().sort_values(by='Risk',ascending=False)['Risk']
Risk 1.000000
Duration 0.214927
Checking account 0.197788
Credit amount 0.154739
hous_own 0.134589
purp_radio/TV 0.106922
hous_rent 0.092785
Age 0.091127
hous_free 0.081556
sex_female 0.075493
sex_male 0.075493
purp_education 0.049085
purp_business 0.036129
Unnamed: 0 0.034606
Saving accounts 0.033871
Job 0.032735
purp_vacation/others 0.028058
purp_car 0.022621
purp_furniture/equipment 0.020971
purp_repairs 0.020828
purp_domestic appliances 0.008016
Name: Risk, dtype: float64
Duration, checking account, credit amount, and home ownership have the strongest correlations with the target.
plt.figure(figsize=(15,15))
sns.heatmap(df_encode.astype(float).corr(),linewidths=0.1,vmax=1.0,
square=True, linecolor='white', annot=True)
plt.show()
Training some Models#
X = df_encode.loc[:,model_cols]
y = df_encode.loc[:,'Risk']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((700, 19), (300, 19), (700,), (300,))
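The split above is not stratified, so it’s worth a quick check that the bad-credit rate is similar in both parts; passing stratify=y to train_test_split would enforce this. A minimal sketch, since Risk is already encoded as 0/1:
# bad-credit rate in each split
print('train bad rate: %.3f | test bad rate: %.3f' % (y_train.mean(), y_test.mean()))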
Here we are going to train five models and compare them using 10-fold cross-validated ROC AUC:
# prepare models
lgbmparameters = {'verbose': -1}
models = []
models.append(('XGB', XGBClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('LGBM', LGBMClassifier(**lgbmparameters)))
models.append(('RF', RandomForestClassifier()))
models.append(('NB', GaussianNB()))
# evaluate each model in turn
results = []
names = []
scoring = 'roc_auc'
n_splits = 10
for name, model in models:
kfold = KFold(n_splits=n_splits)
cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
XGB: 0.720764 (0.055151)
CART: 0.598842 (0.066576)
LGBM: 0.733702 (0.062989)
RF: 0.722074 (0.052949)
NB: 0.685554 (0.082030)
box_1 = go.Box(
x=n_splits*['XGB'],
y=results[0],
name="XGB"
)
box_2 = go.Box(
x=n_splits*['CART'],
y=results[1],
name="CART"
)
box_3 = go.Box(
x=n_splits*['LGBM'],
y=results[2],
name="LGBM"
)
box_4 = go.Box(
x=n_splits*['RF'],
y=results[3],
name="RF"
)
box_5 = go.Box(
x=n_splits*['NB'],
y=results[4],
name="NB"
)
data = [box_1, box_2, box_3, box_4, box_5]
layout = go.Layout(
yaxis=dict(
title='Model Results'
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='box-age-cat')
The best models were RandomForest and LGBM; we are going to train both and use Optuna for hyperparameter optimization.
lgbm_model = LGBMClassifier(**lgbmparameters).fit(X_train, y_train)
y_prob_lgbm = lgbm_model.predict_proba(X_test)
print('For the LGBM Model, the test AUC is: '+str(roc_auc_score(y_test,y_prob_lgbm[:,1])))
print('For the LGBM Model, the test Accu is: '+ str(accuracy_score(y_test,y_prob_lgbm[:,1].round())))
For the LGBM Model, the test AUC is: 0.7434670592565329
For the LGBM Model, the test Accu is: 0.7533333333333333
rf_model = RandomForestClassifier().fit(X_train, y_train)
y_prob_rf = rf_model.predict_proba(X_test)
print('For the RandomForest Model, the test AUC is: '+str(roc_auc_score(y_test,y_prob_rf[:,1])))
print('For the RandomForest Model, the test Accu is: '+ str(accuracy_score(y_test,y_prob_rf[:,1].round())))
For the RandomForest Model, the test AUC is: 0.7218308007781692
For the RandomForest Model, the test Accu is: 0.7266666666666667
Hyperparameter Optimization using Optuna#
def auc_ks_metric(y_test, y_prob):
    '''
    Input:
        y_test: true binary target
        y_prob: model predicted probability of the positive class
    Output: validation metrics
        auc, ks (Kolmogorov-Smirnov statistic)
    '''
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)
auc = metrics.auc(fpr, tpr)
ks = max(tpr - fpr)
return auc, ks
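The KS statistic is the maximum vertical gap between the TPR and FPR curves, i.e. the largest separation between the cumulative score distributions of the two classes. A small sanity check of the function on synthetic labels (a sketch, not part of the pipeline): random scores should give an AUC around 0.5 and a small KS, while perfectly separating scores give AUC = 1.0 and KS = 1.0.
rng = np.random.default_rng(42)
y_toy = np.array([0] * 50 + [1] * 50)
print(auc_ks_metric(y_toy, rng.random(100)))      # ~ (0.5, small)
print(auc_ks_metric(y_toy, y_toy.astype(float)))  # (1.0, 1.0)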
def objective(trial, X_train, y_train, X_test, y_test, balanced, method):
    '''
    Input:
        trial: Optuna trial object
        X_train, y_train: training features and target
        X_test, y_test: validation features and target
        balanced: 'balanced' or None (kept for interface compatibility, unused here)
        method: 'LGBM', 'RF' or 'XGBoost'
    Output:
        validation AUC, i.e. auc_ks_metric(y_test, y_pred)[0]
    '''
gc.collect()
if method=='LGBM':
param_grid = {'learning_rate': trial.suggest_float('learning_rate', 0.0001, 0.1, log=True),
'num_leaves': trial.suggest_int('num_leaves', 2, 256),
'lambda_l1': trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
'lambda_l2': trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 5, 100),
'max_depth': trial.suggest_int('max_depth', 5, 64),
'feature_fraction': trial.suggest_float("feature_fraction", 0.4, 1.0),
'bagging_fraction': trial.suggest_float("bagging_fraction", 0.4, 1.0),
'bagging_freq': trial.suggest_int("bagging_freq", 1, 7),
'verbose': -1
}
        model = LGBMClassifier(**param_grid)  # note: tree_method/gpu_id are XGBoost options, not LightGBM's
print('LGBM - Optimization using optuna')
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:,1]
if method=='RF':
param_grid = {
'max_features': trial.suggest_int('max_features', 4, 20),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 25),
'max_depth': trial.suggest_int('max_depth', 5, 64),
'min_samples_split': trial.suggest_int("min_samples_split", 2, 30),
'n_estimators': trial.suggest_int("n_estimators", 100, 2000)
}
model = RandomForestClassifier(**param_grid)
print('RandomForest - Optimization using optuna')
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:,1]
if method=='XGBoost':
param_grid = {'learning_rate': trial.suggest_float('learning_rate', 0.0001, 0.1, log=True),
'max_depth': trial.suggest_int('max_depth', 3, 16),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
'gamma': trial.suggest_float('gamma', 1e-8, 1.0, log = True),
'alpha': trial.suggest_float('alpha', 1e-8, 1.0, log = True),
'lambda': trial.suggest_float('lambda', 0.0001, 10.0, log = True),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.8),
'booster': 'gbtree',
'random_state': 42,
}
model = XGBClassifier(**param_grid,tree_method='gpu_hist',gpu_id=0)
print('XGBoost - Optimization using optuna')
model.fit(X_train, y_train,verbose=False)
y_pred = model.predict_proba(X_test)[:,1]
    auc_res = auc_ks_metric(y_test, y_pred)[0]
    print('auc:' + str(auc_res))
    return auc_res
def tuning(X_train, y_train, X_test, y_test, balanced, method):
    '''
    Input:
        X_train, y_train: training features and target
        X_test, y_test: validation features and target
        balanced: 'balanced' or None (passed through to objective)
        method: 'LGBM', 'RF' or 'XGBoost'
    Output:
        the finished Optuna study
    '''
study = optuna.create_study(direction='maximize', study_name=method+' Classifier')
func = lambda trial: objective(trial, X_train, y_train, X_test, y_test, balanced, method)
print('Starting the optimization')
time_max_tuning = 60*30 # max time in seconds to stop
study.optimize(func, timeout=time_max_tuning)
return study
def train(X_train, y_train, X_test, y_test, balanced, method):
    '''
    Input:
        X_train, y_train: training features and target
        X_test, y_test: validation features and target
        balanced: 'balanced' or None
        method: 'LGBM', 'RF' or 'XGBoost'
    Output:
        the fitted model and the Optuna study
    '''
print('Tuning')
study = tuning(X_train, y_train, X_test, y_test, balanced, method)
if method=='LGBM':
model = LGBMClassifier(**study.best_params)
print('Last Fit')
model.fit(X_train, y_train, eval_set=[(X_test,y_test)],
callbacks = [lightgbm.early_stopping(stopping_rounds=100), lightgbm.log_evaluation(period=5000)])
if method=='XGBoost':
model = XGBClassifier(**study.best_params)
print('Last Fit')
model.fit(X_train, y_train, eval_set=[(X_test,y_test)],
early_stopping_rounds=100,verbose = False)
if method=='RF':
model = RandomForestClassifier(**study.best_params)
print('Last Fit')
model.fit(X_train, y_train)
return model, study
lgbm_model, study_lgbm = train(X_train, y_train, X_test, y_test, balanced='balanced', method='LGBM')
[I 2023-09-25 07:48:59,504] A new study created in memory with name: LGBM Classifier
[I 2023-09-25 07:48:59,607] Trial 0 finished with value: 0.7707818497292181 and parameters: {'learning_rate': 0.04564317750022488, 'num_leaves': 254, 'lambda_l1': 0.17602474289716696, 'lambda_l2': 2.936736356867574, 'min_data_in_leaf': 92, 'max_depth': 41, 'feature_fraction': 0.630771183128692, 'bagging_fraction': 0.8791863972428846, 'bagging_freq': 5}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 07:48:59,695] Trial 1 finished with value: 0.7673905042326096 and parameters: {'learning_rate': 0.03163545356039165, 'num_leaves': 93, 'lambda_l1': 5.331694642994698e-07, 'lambda_l2': 0.0016117988828970487, 'min_data_in_leaf': 62, 'max_depth': 17, 'feature_fraction': 0.5207793700543741, 'bagging_fraction': 0.6988688771949946, 'bagging_freq': 2}. Best is trial 0 with value: 0.7707818497292181.
Tuning
Starting the optimization
LGBM - Optimization using optuna
auc:0.7707818497292181
LGBM - Optimization using optuna
auc:0.7673905042326096
LGBM - Optimization using optuna
[I 2023-09-25 07:48:59,778] Trial 2 finished with value: 0.7639202902360797 and parameters: {'learning_rate': 0.00032903168736575527, 'num_leaves': 221, 'lambda_l1': 0.013182095631109458, 'lambda_l2': 0.004903360053701577, 'min_data_in_leaf': 50, 'max_depth': 35, 'feature_fraction': 0.81375010947157, 'bagging_fraction': 0.47255383236900694, 'bagging_freq': 6}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 07:48:59,870] Trial 3 finished with value: 0.7488301172511699 and parameters: {'learning_rate': 0.07583419812542502, 'num_leaves': 227, 'lambda_l1': 0.001263229821256988, 'lambda_l2': 0.6714031923624736, 'min_data_in_leaf': 23, 'max_depth': 48, 'feature_fraction': 0.47371647441012454, 'bagging_fraction': 0.5357410570154348, 'bagging_freq': 3}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 07:48:59,958] Trial 4 finished with value: 0.7586361007413639 and parameters: {'learning_rate': 0.0001610953746996855, 'num_leaves': 228, 'lambda_l1': 4.74483283120879, 'lambda_l2': 0.00011656154418021165, 'min_data_in_leaf': 88, 'max_depth': 60, 'feature_fraction': 0.5768682497889083, 'bagging_fraction': 0.9363888441877074, 'bagging_freq': 5}. Best is trial 0 with value: 0.7707818497292181.
auc:0.7639202902360797
LGBM - Optimization using optuna
auc:0.7488301172511699
LGBM - Optimization using optuna
auc:0.7586361007413639
LGBM - Optimization using optuna
[I 2023-09-25 07:49:00,049] Trial 5 finished with value: 0.7586098112413902 and parameters: {'learning_rate': 0.0001249838804070837, 'num_leaves': 62, 'lambda_l1': 1.0950722639611093e-08, 'lambda_l2': 1.6247452419757427, 'min_data_in_leaf': 88, 'max_depth': 20, 'feature_fraction': 0.7390198578425595, 'bagging_fraction': 0.5451921961124094, 'bagging_freq': 4}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 07:49:00,135] Trial 6 finished with value: 0.7462274567537727 and parameters: {'learning_rate': 0.021165567217590234, 'num_leaves': 219, 'lambda_l1': 2.634681758289909e-06, 'lambda_l2': 2.1170808617536877e-06, 'min_data_in_leaf': 70, 'max_depth': 52, 'feature_fraction': 0.8886838094462373, 'bagging_fraction': 0.7972849312237408, 'bagging_freq': 5}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 07:49:00,229] Trial 7 finished with value: 0.7455702192544299 and parameters: {'learning_rate': 0.028054358171711414, 'num_leaves': 106, 'lambda_l1': 0.2794244438193041, 'lambda_l2': 0.02038032703976737, 'min_data_in_leaf': 41, 'max_depth': 8, 'feature_fraction': 0.9320015422653435, 'bagging_fraction': 0.97584085801718, 'bagging_freq': 1}. Best is trial 0 with value: 0.7707818497292181.
[I 2023-09-25 08:18:59,736] Trial 5841 finished with value: 0.7210946947789053 and parameters: {'learning_rate': 0.0017454581670019865, 'num_leaves': 71, 'lambda_l1': 0.00030822208833371846, 'lambda_l2': 0.001178948478008598, 'min_data_in_leaf': 82, 'max_depth': 64, 'feature_fraction': 0.9975320930746641, 'bagging_fraction': 0.9612726835980929, 'bagging_freq': 1}. Best is trial 3444 with value: 0.7828487302171513.
LGBM - Optimization using optuna
auc:0.7210946947789053
Last Fit
y_prob_lgbm = lgbm_model.predict_proba(X_test)
[LightGBM] [Warning] min_data_in_leaf is set=81, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=81
[LightGBM] [Warning] feature_fraction is set=0.5174527298564775, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.5174527298564775
[LightGBM] [Warning] lambda_l2 is set=0.0003124668197733085, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.0003124668197733085
[LightGBM] [Warning] lambda_l1 is set=2.524882043205203e-06, reg_alpha=0.0 will be ignored. Current value: lambda_l1=2.524882043205203e-06
[LightGBM] [Warning] bagging_fraction is set=0.7962210156422196, subsample=1.0 will be ignored. Current value: bagging_fraction=0.7962210156422196
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
print('For the LGBM Model, the test AUC is: '+str(roc_auc_score(y_test,y_prob_lgbm[:,1])))
print('For the LGBM Model, the KS is: '+str(auc_ks_metric(y_test,y_prob_lgbm[:,1])[1]))
print('For the LGBM Model, the test Accu is: '+ str(accuracy_score(y_test,y_prob_lgbm[:,1].round())))
For the LGBM Model, the test AUC is: 0.7828487302171513
For the LGBM Model, the KS is: 0.4945580735054419
For the LGBM Model, the test Accu is: 0.7133333333333334
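The accuracy above rounds probabilities at 0.5. A possible refinement, not part of the pipeline here, is to classify at the threshold where TPR - FPR peaks, which is exactly the KS point:
# hypothetical alternative cut-off: the threshold that maximizes KS
fpr_, tpr_, thr_ = metrics.roc_curve(y_test, y_prob_lgbm[:, 1])
ks_threshold = thr_[np.argmax(tpr_ - fpr_)]
y_pred_ks = (y_prob_lgbm[:, 1] >= ks_threshold).astype(int)
print('KS-optimal threshold: %.3f' % ks_threshold)
print('Accuracy at that threshold: %.3f' % accuracy_score(y_test, y_pred_ks))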
confusion_hard = confusion_matrix(y_test, y_prob_lgbm[:,1].round())
plt.figure(figsize=(8, 6))
ax = sns.heatmap(confusion_hard, vmin=10, vmax=190,annot = True, fmt='d')
ax.set_title('Confusion Matrix')
Confusion Matrix in LGBM
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_prob_lgbm[:,1])
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
optuna.visualization.plot_param_importances(study_lgbm)
rf_model, study_rf = train(X_train, y_train, X_test, y_test, balanced='balanced', method='RF')
[I 2023-09-25 08:32:34,955] A new study created in memory with name: RF Classifier
Tuning
Starting the optimization
RandomForest - Optimization using optuna
[I 2023-09-25 08:32:35,237] Trial 0 finished with value: 0.7599768652400232 and parameters: {'max_features': 7, 'min_samples_leaf': 21, 'max_depth': 25, 'min_samples_split': 24, 'n_estimators': 231}. Best is trial 0 with value: 0.7599768652400232.
auc:0.7599768652400232
RandomForest - Optimization using optuna
[I 2023-09-25 08:32:36,115] Trial 1 finished with value: 0.7550344392449656 and parameters: {'max_features': 6, 'min_samples_leaf': 25, 'max_depth': 63, 'min_samples_split': 27, 'n_estimators': 1132}. Best is trial 0 with value: 0.7599768652400232.
Last Fit
y_prob_rf = rf_model.predict_proba(X_test)
print('For the RandomForest Model, the test AUC is: '+str(roc_auc_score(y_test,y_prob_rf[:,1])))
print('For the RandomForest, the KS is: '+str(auc_ks_metric(y_test,y_prob_rf[:,1])[1]))
print('For the RandomForest Model, the test Accu is: '+ str(accuracy_score(y_test,y_prob_rf[:,1].round())))
For the RandomForest Model, the test AUC is: 0.7489352752510646
For the RandomForest, the KS is: 0.40080971659919035
For the RandomForest Model, the test Accu is: 0.7233333333333334
confusion_hard = confusion_matrix(y_test, y_prob_rf[:,1].round())
plt.figure(figsize=(8, 6))
ax = sns.heatmap(confusion_hard, vmin=10, vmax=190,annot = True, fmt='d')
ax.set_title('Confusion Matrix')
Confusion Matrix in RandomForest
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_prob_rf[:,1])
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
The LGBM model performs better after the Optuna optimization, so we’ll use it as our final model.
Ranking the final model#
import shap
explainer = shap.TreeExplainer(lgbm_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values[1], X_train,show=False)
SHAP results in LGBM
df_test = pd.concat([X_test, y_test],axis=1)
shap_test = explainer.shap_values(X_test)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
Here we are going to create new variables from the SHAP values. The goal is to make our final model easier to act on; for example, we may want to select the clients who have high scores and also hold more cash in their checking account.
def shap_col(shap_):
    col = ['Age', 'Job', 'Saving accounts', 'Checking account', 'Credit amount',
           'Duration', 'sex_female', 'sex_male', 'hous_free', 'hous_own',
           'hous_rent', 'purp_business', 'purp_car', 'purp_domestic appliances',
           'purp_education', 'purp_furniture/equipment', 'purp_radio/TV',
           'purp_repairs', 'purp_vacation/others']
    # per-client SHAP values for the "bad" class
    df_shap = pd.DataFrame(shap_, columns=col)
    # flag clients whose SHAP value pushes the risk score up by more than 0.2...
    df_shap.loc[df_shap['Checking account'] > 0.2, 'shap_1'] = 'Little Check Account'
    df_shap.loc[df_shap['Duration'] > 0.2, 'shap_2'] = 'More Credit Duration'
    df_shap.loc[df_shap['Credit amount'] > 0.2, 'shap_3'] = 'More Credit Amount'
    df_shap.loc[df_shap['Age'] > 0.2, 'shap_4'] = 'More Junior Client'
    df_shap.loc[df_shap['hous_own'] > 0.2, 'shap_5'] = 'Have House'
    df_shap.loc[df_shap['purp_radio/TV'] > 0.2, 'shap_6'] = 'The purpose is to buy Radio/TV'
    # ...or pulls it down by more than 0.2
    df_shap.loc[df_shap['Checking account'] < -0.2, 'shap_7'] = 'Moderate/Rich Check Account'
    df_shap.loc[df_shap['Duration'] < -0.2, 'shap_8'] = 'Less Credit Duration'
    df_shap.loc[df_shap['Credit amount'] < -0.2, 'shap_9'] = 'Less Credit Amount'
    df_shap.loc[df_shap['Age'] < -0.2, 'shap_10'] = 'More Senior Client'
    df_shap.loc[df_shap['hous_own'] < -0.2, 'shap_11'] = 'Does not have House'
    return df_shap[['shap_1', 'shap_2', 'shap_3', 'shap_4', 'shap_5', 'shap_6',
                    'shap_7', 'shap_8', 'shap_9', 'shap_10', 'shap_11']]
df_shap_arg = shap_col(shap_test[1])
df_shap_arg
shap_1 | shap_2 | shap_3 | shap_4 | shap_5 | shap_6 | shap_7 | shap_8 | shap_9 | shap_10 | shap_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Little Check Account | NaN | NaN | More Junior Client | NaN | NaN | NaN | NaN | Less Credit Amount | NaN | NaN |
1 | Little Check Account | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Little Check Account | More Credit Duration | NaN | NaN | NaN | NaN | NaN | NaN | Less Credit Amount | NaN | NaN |
3 | Little Check Account | NaN | More Credit Amount | More Junior Client | Have House | NaN | NaN | Less Credit Duration | NaN | NaN | NaN |
4 | NaN | NaN | More Credit Amount | NaN | NaN | NaN | Moderate/Rich Check Account | NaN | NaN | More Senior Client | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | NaN | NaN | NaN | More Junior Client | NaN | NaN | Moderate/Rich Check Account | NaN | Less Credit Amount | NaN | NaN |
296 | Little Check Account | More Credit Duration | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
297 | NaN | NaN | More Credit Amount | More Junior Client | NaN | NaN | Moderate/Rich Check Account | Less Credit Duration | NaN | NaN | NaN |
298 | Little Check Account | More Credit Duration | More Credit Amount | NaN | Have House | NaN | NaN | NaN | NaN | NaN | NaN |
299 | Little Check Account | NaN | More Credit Amount | More Junior Client | Have House | NaN | NaN | Less Credit Duration | NaN | NaN | NaN |
300 rows × 11 columns
df_final = pd.concat([df_test.reset_index() ,df_shap_arg],axis=1)
df_final
index | Age | Job | Saving accounts | Checking account | Credit amount | Duration | sex_female | sex_male | hous_free | ... | shap_2 | shap_3 | shap_4 | shap_5 | shap_6 | shap_7 | shap_8 | shap_9 | shap_10 | shap_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 521 | 24 | 2 | 1 | 1 | 3190 | 18 | 1 | 0 | 0 | ... | NaN | NaN | More Junior Client | NaN | NaN | NaN | NaN | Less Credit Amount | NaN | NaN |
1 | 737 | 35 | 1 | 2 | 1 | 4380 | 18 | 0 | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 740 | 32 | 2 | 2 | 1 | 2325 | 24 | 0 | 1 | 0 | ... | More Credit Duration | NaN | NaN | NaN | NaN | NaN | NaN | Less Credit Amount | NaN | NaN |
3 | 660 | 23 | 2 | 1 | 3 | 1297 | 12 | 0 | 1 | 0 | ... | NaN | More Credit Amount | More Junior Client | Have House | NaN | NaN | Less Credit Duration | NaN | NaN | NaN |
4 | 411 | 35 | 3 | 1 | 0 | 7253 | 33 | 0 | 1 | 0 | ... | NaN | More Credit Amount | NaN | NaN | NaN | Moderate/Rich Check Account | NaN | NaN | More Senior Client | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | 468 | 26 | 2 | 1 | 0 | 2764 | 33 | 1 | 0 | 0 | ... | NaN | NaN | More Junior Client | NaN | NaN | Moderate/Rich Check Account | NaN | Less Credit Amount | NaN | NaN |
296 | 935 | 30 | 3 | 2 | 2 | 1919 | 30 | 0 | 1 | 0 | ... | More Credit Duration | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
297 | 428 | 20 | 2 | 1 | 0 | 1313 | 9 | 0 | 1 | 0 | ... | NaN | More Credit Amount | More Junior Client | NaN | NaN | Moderate/Rich Check Account | Less Credit Duration | NaN | NaN | NaN |
298 | 7 | 35 | 3 | 1 | 2 | 6948 | 36 | 0 | 1 | 0 | ... | More Credit Duration | More Credit Amount | NaN | Have House | NaN | NaN | NaN | NaN | NaN | NaN |
299 | 155 | 20 | 2 | 1 | 1 | 1282 | 12 | 1 | 0 | 0 | ... | NaN | More Credit Amount | More Junior Client | Have House | NaN | NaN | Less Credit Duration | NaN | NaN | NaN |
300 rows × 32 columns
df_final = df_final.fillna(0)
Creating a score column from our predict_proba output:
df_final['score'] = y_prob_lgbm[:,1]
We will divide our clients into 5 groups based on the score. Any number of groups would do; the point is to check whether the model orders the key variables across groups.
df_final['rank'] = pd.qcut(df_final['score'], 5,labels = False)
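A quick sanity check that the ranks cover contiguous, increasing score bands; each of the 5 quantile buckets should hold about 60 of the 300 test clients:
# score range and size of each rank bucket
df_final.groupby('rank')['score'].agg(['min', 'max', 'count'])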
Grouping some variables by rank:
df_final.groupby('rank')[['Checking account','Duration', 'Age','Credit amount', 'hous_own',
'Saving accounts','purp_radio/TV','purp_car','sex_male','sex_female']].agg('mean')
Checking account | Duration | Age | Credit amount | hous_own | Saving accounts | purp_radio/TV | purp_car | sex_male | sex_female | |
---|---|---|---|---|---|---|---|---|---|---|
rank | ||||||||||
0 | 0.083333 | 15.350000 | 41.300000 | 2261.750000 | 0.900000 | 1.183333 | 0.400000 | 0.300000 | 0.766667 | 0.233333 |
1 | 0.716667 | 16.266667 | 35.266667 | 2352.266667 | 0.783333 | 1.283333 | 0.216667 | 0.266667 | 0.766667 | 0.233333 |
2 | 1.216667 | 18.566667 | 36.950000 | 3140.466667 | 0.766667 | 1.033333 | 0.250000 | 0.300000 | 0.683333 | 0.316667 |
3 | 1.450000 | 18.466667 | 33.133333 | 2666.100000 | 0.683333 | 1.283333 | 0.266667 | 0.383333 | 0.716667 | 0.283333 |
4 | 1.383333 | 31.783333 | 31.700000 | 4354.200000 | 0.383333 | 1.033333 | 0.166667 | 0.333333 | 0.600000 | 0.400000 |
df_final.groupby('rank')[['Checking account','Duration', 'Age','Credit amount', 'hous_own',
'Saving accounts','purp_radio/TV','purp_car','sex_male','sex_female']].agg('mean').style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Checking account | Duration | Age | Credit amount | hous_own | Saving accounts | purp_radio/TV | purp_car | sex_male | sex_female | |
---|---|---|---|---|---|---|---|---|---|---|
rank | ||||||||||
0 | 0.083333 | 15.350000 | 41.300000 | 2261.750000 | 0.900000 | 1.183333 | 0.400000 | 0.300000 | 0.766667 | 0.233333 |
1 | 0.716667 | 16.266667 | 35.266667 | 2352.266667 | 0.783333 | 1.283333 | 0.216667 | 0.266667 | 0.766667 | 0.233333 |
2 | 1.216667 | 18.566667 | 36.950000 | 3140.466667 | 0.766667 | 1.033333 | 0.250000 | 0.300000 | 0.683333 | 0.316667 |
3 | 1.450000 | 18.466667 | 33.133333 | 2666.100000 | 0.683333 | 1.283333 | 0.266667 | 0.383333 | 0.716667 | 0.283333 |
4 | 1.383333 | 31.783333 | 31.700000 | 4354.200000 | 0.383333 | 1.033333 | 0.166667 | 0.333333 | 0.600000 | 0.400000 |
We achieved a good separation between the clients with higher scores and those with lower scores.
df_final.groupby('rank')[['Risk']].agg('sum').style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Risk | |
---|---|
rank | |
0 | 4 |
1 | 8 |
2 | 13 |
3 | 30 |
4 | 36 |
We were also able to order the number of bad-credit clients by rank.
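Since Risk is encoded as 1 for bad, the mean per rank gives the bad-credit rate directly, which should rise monotonically with rank:
# bad-credit rate per score bucket
df_final.groupby('rank')['Risk'].mean()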
We can also divide the clients using a different number of groups:
df_final['rank'] = pd.qcut(df_final['score'], 3,labels = False)
df_final.groupby('rank')[['Checking account','Duration', 'Age','Credit amount', 'hous_own',
'Saving accounts','purp_radio/TV','purp_car','sex_male','sex_female']].agg('mean')
Checking account | Duration | Age | Credit amount | hous_own | Saving accounts | purp_radio/TV | purp_car | sex_male | sex_female | |
---|---|---|---|---|---|---|---|---|---|---|
rank | ||||||||||
0 | 0.31 | 15.43 | 39.69 | 2282.97 | 0.86 | 1.17 | 0.36 | 0.28 | 0.77 | 0.23 |
1 | 1.17 | 18.19 | 35.30 | 2868.43 | 0.77 | 1.18 | 0.20 | 0.31 | 0.71 | 0.29 |
2 | 1.43 | 26.64 | 32.02 | 3713.47 | 0.48 | 1.14 | 0.22 | 0.36 | 0.64 | 0.36 |
df_final.groupby('rank')[['Checking account','Duration', 'Age','Credit amount', 'hous_own',
'Saving accounts','purp_radio/TV','purp_car','sex_male','sex_female']].agg('mean').style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Checking account | Duration | Age | Credit amount | hous_own | Saving accounts | purp_radio/TV | purp_car | sex_male | sex_female | |
---|---|---|---|---|---|---|---|---|---|---|
rank | ||||||||||
0 | 0.310000 | 15.430000 | 39.690000 | 2282.970000 | 0.860000 | 1.170000 | 0.360000 | 0.280000 | 0.770000 | 0.230000 |
1 | 1.170000 | 18.190000 | 35.300000 | 2868.430000 | 0.770000 | 1.180000 | 0.200000 | 0.310000 | 0.710000 | 0.290000 |
2 | 1.430000 | 26.640000 | 32.020000 | 3713.470000 | 0.480000 | 1.140000 | 0.220000 | 0.360000 | 0.640000 | 0.360000 |
df_final.groupby('rank')[['Risk']].agg('sum').style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Risk | |
---|---|
rank | |
0 | 8 |
1 | 26 |
2 | 57 |
Our final output table is:
df_final[['score','rank','shap_1','shap_2','shap_3','shap_4','shap_5','shap_6',
'shap_7','shap_8','shap_9','shap_10','shap_11']]
score | rank | shap_1 | shap_2 | shap_3 | shap_4 | shap_5 | shap_6 | shap_7 | shap_8 | shap_9 | shap_10 | shap_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.344688 | 2 | Little Check Account | 0 | 0 | More Junior Client | 0 | 0 | 0 | 0 | Less Credit Amount | 0 | 0 |
1 | 0.296853 | 1 | Little Check Account | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0.410630 | 2 | Little Check Account | More Credit Duration | 0 | 0 | 0 | 0 | 0 | 0 | Less Credit Amount | 0 | 0 |
3 | 0.427560 | 2 | Little Check Account | 0 | More Credit Amount | More Junior Client | Have House | 0 | 0 | Less Credit Duration | 0 | 0 | 0 |
4 | 0.184806 | 1 | 0 | 0 | More Credit Amount | 0 | 0 | 0 | Moderate/Rich Check Account | 0 | 0 | More Senior Client | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | 0.187536 | 1 | 0 | 0 | 0 | More Junior Client | 0 | 0 | Moderate/Rich Check Account | 0 | Less Credit Amount | 0 | 0 |
296 | 0.245917 | 1 | Little Check Account | More Credit Duration | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
297 | 0.167448 | 1 | 0 | 0 | More Credit Amount | More Junior Client | 0 | 0 | Moderate/Rich Check Account | Less Credit Duration | 0 | 0 | 0 |
298 | 0.658138 | 2 | Little Check Account | More Credit Duration | More Credit Amount | 0 | Have House | 0 | 0 | 0 | 0 | 0 | 0 |
299 | 0.651962 | 2 | Little Check Account | 0 | More Credit Amount | More Junior Client | Have House | 0 | 0 | Less Credit Duration | 0 | 0 | 0 |
300 rows × 13 columns