[Kaggle] AAAMLP Reading Notes: Cat-in-the-dat II

I want to start my own Kaggle competition journey from zero. My goals:

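Win my first Kaggle medal in December
Finish reading Abhishek Thakur's Approaching (Almost) Any Machine Learning Problem in the short term and blog about it: https://github.com/abhishekkrthakur/approachingalmost
Publish at least 21 blog posts in December
Keep up eight hours of study every day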

Approaching categorical variables (experiments)

Cat-in-the-dat

Data analysis:

There are 25 columns in total: one id, one target, and 23 feature variables.

bin_0	bin_1	bin_2	bin_3	bin_4	nom_0	nom_1	nom_2	nom_3	nom_4	nom_5	nom_6	nom_7	nom_8	nom_9	ord_0	ord_1	ord_2	ord_3	ord_4	ord_5	day	month
0	0	0	F	N	Red	Trapezoid	Hamster	Russia	Bassoon	de4c57ee2	a64bc7ddf	598080a91	0256c7a4b	02e7c8990	3	Contributor	Hot	c	U	Pw	6	3
1	1	0	F	Y	Red	Star	Axolotl		Theremin	2bb3c3e5c	3a3a936e8	1dddb8473	52ead350c	f37df64af	3	Grandmaster	Warm	e	X	pE	7	7
binary	binary	binary	binary	binary	color	shape	animal	country	instrument	code	code	code	code	code	3	rank	temperature	letter	letter	letter	5	9

bin_0
0.0     528377
1.0      53729
NONE     17894
Name: count, dtype: int64

bin_1
0.0     474018
1.0     107979
NONE     18003
Name: count, dtype: int64

bin_2
0.0     419845
1.0     162225
NONE     17930
Name: count, dtype: int64

bin_3
F       366212
T       215774
NONE     18014
Name: count, dtype: int64

bin_4
N       312344
Y       269609
NONE     18047
Name: count, dtype: int64

nom_0
Red      323286
Blue     205861
Green     52601
NONE      18252
Name: count, dtype: int64

nom_1
Triangle     164190
Polygon      152563
Trapezoid    119438
Circle       104995
Square        26503
NONE          18156
Star          14155
Name: count, dtype: int64

nom_2
Hamster    164897
Axolotl    152319
Lion       119504
Dog        104825
Cat         26276
NONE        18035
Snake       14144
Name: count, dtype: int64

nom_3
India         164869
Costa Rica    151827
Russia        119840
Finland       104601
Canada         26425
NONE           18121
China          14317
Name: count, dtype: int64

nom_4
Theremin    308621
Bassoon     196639
Oboe         49996
Piano        26709
NONE         18035
Name: count, dtype: int64

nom_5
NONE         17778
fc8fc7e56      977
360a16627      972
423976253      961
7917d446c      961
             ...  
7335087fd        5
30019ce8a        3
0385d0739        1
b3ad70fcb        1
d6bb2181a        1
Name: count, Length: 1221, dtype: int64

nom_6
NONE         18131
ea8c5e181      805
9fa481341      798
2b94ada45      792
32e9bd1ff      788
             ...  
f0732a795        4
322548bed        3
d6ea07c05        2
b4b8de4b9        2
3a121fefb        1
Name: count, Length: 1520, dtype: int64

nom_7
NONE         18003
4ae48e857     5035
c79d2197d     5031
86ec768cd     4961
a7059911d     4945
             ...  
b39008216      195
1828818ab      182
75d0e3ef8      157
deec583dd       93
e9c57c4aa       79
Name: count, Length: 223, dtype: int64

nom_8
NONE         17755
7d7c02c57     5052
15f03b1f4     4994
5859a8a06     4989
d7e75499d     4987
             ...  
8d31d1ab3      207
4584d6fcd      174
607c26084      149
115d9fd8b      105
6492aecc3       61
Name: count, Length: 223, dtype: int64

nom_9
NONE         18073
8f3276a6e      565
65b262989      564
c5361037c      560
9bc905a9d      558
             ...  
978258393        2
432e3fc6a        2
1538d82e9        2
5f565a682        1
d1e6704ed        1
Name: count, Length: 2219, dtype: int64

ord_0
1.0     227917
3.0     197798
2.0     155997
NONE     18288
Name: count, dtype: int64

ord_1
Novice         160597
Expert         139677
Contributor    109821
Grandmaster     95866
Master          75998
NONE            18041
Name: count, dtype: int64

ord_2
Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NONE            18075
Name: count, dtype: int64

ord_3
n       70982
a       65321
m       57980
c       56675
h       55744
o       45464
b       44795
e       38904
k       38718
i       34763
d       30634
f       29450
NONE    17916
g        6180
j        3639
l        2835
Name: count, dtype: int64

ord_4
N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
G        3404
V        3107
J        1950
L        1657
Name: count, dtype: int64

ord_5
NONE    17713
Fl      10562
DN       9527
Sz       8654
RV       5648
        ...  
vw        189
gV        124
vQ        120
eA         91
Zv         87
Name: count, Length: 191, dtype: int64

day
3.0     113835
5.0     110464
6.0      97432
7.0      86435
1.0      84724
2.0      65495
4.0      23663
NONE     17952
Name: count, dtype: int64

month
8.0     79245
3.0     70160
5.0     68906
12.0    68340
6.0     60478
7.0     53480
1.0     52154
11.0    51165
2.0     40700
9.0     20620
NONE    17988
4.0     14614
10.0     2150
Name: count, dtype: int64

target
0    487677
1    112323
Name: count, dtype: int64
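
Counts like the ones above can be produced by filling missing values with a placeholder and calling value_counts on each column. The following is a minimal sketch on a made-up two-column frame; the toy data and the helper name count_values are assumptions for illustration and are not taken from the competition files.

```python
import numpy as np
import pandas as pd


def count_values(df: pd.DataFrame, col: str) -> pd.Series:
    """Count each value of `col`, treating missing entries as "NONE"."""
    # Cast to object first so the string placeholder fits numeric columns too.
    return df[col].astype(object).fillna("NONE").value_counts()


# Toy data standing in for the real training file (hypothetical values).
toy = pd.DataFrame(
    {
        "bin_0": [0.0, 0.0, 1.0, np.nan],
        "nom_0": ["Red", "Blue", "Red", None],
    }
)

for col in toy.columns:
    print(count_values(toy, col))
    print()
```

Filling before counting keeps the missing entries visible as their own category, which is how the NONE rows show up in the listings.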

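Class imbalance of the binary columns (count of the more frequent value divided by the count of the less frequent one, NONE excluded, rounded from the counts above):

bin_0 ≈ 9.83

bin_1 ≈ 4.39

bin_2 ≈ 2.59

bin_3 ≈ 1.70

bin_4 ≈ 1.16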

Number of distinct values in the high-cardinality nominal columns (NONE counted as a value):

nom_5 1221

nom_6 1520

nom_7 223

nom_8 223

nom_9 2219

target ≈ 4.34 (487677 zeros to 112323 ones)
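
The ratios listed for the binary columns and the target appear to be the count of the more frequent value divided by the count of the less frequent one. That reading is an assumption, but it reproduces the target figure of 4.34. A small sketch that recomputes the ratios from the counts printed earlier (the helper name imbalance_ratio is hypothetical):

```python
def imbalance_ratio(counts: dict) -> float:
    """Return the largest count divided by the smallest count."""
    values = list(counts.values())
    return max(values) / min(values)


# Counts copied from the value_counts output in this post (NONE excluded).
column_counts = {
    "bin_0": {"0.0": 528377, "1.0": 53729},
    "bin_1": {"0.0": 474018, "1.0": 107979},
    "bin_2": {"0.0": 419845, "1.0": 162225},
    "bin_3": {"F": 366212, "T": 215774},
    "bin_4": {"N": 312344, "Y": 269609},
    "target": {"0": 487677, "1": 112323},
}

for name, counts in column_counts.items():
    print(f"{name}: {imbalance_ratio(counts):.2f}")
```

A ratio far from 1, as for bin_0 and target, is one reason to score with AUC instead of plain accuracy.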

ord_1 and ord_2 have a natural order among their values.
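
Because these two columns are ordered, one option is to map each label to its rank instead of treating it as an unordered category. The sketch below assumes the rank orders shown in the two lists (Kaggle tiers from lowest to highest, temperatures from coldest to hottest); the orders, the helper name encode_ordinal, and the toy frame are assumptions and are not used by the models later in this post.

```python
import pandas as pd

# Assumed orderings, lowest to highest.
ORD_1_ORDER = ["Novice", "Contributor", "Expert", "Master", "Grandmaster"]
ORD_2_ORDER = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]


def encode_ordinal(series: pd.Series, order: list) -> pd.Series:
    """Map each label to its position in `order`; missing or unknown -> -1."""
    rank = {label: position for position, label in enumerate(order)}
    return series.map(rank).fillna(-1).astype(int)


# Toy rows (hypothetical), including one missing value.
toy = pd.DataFrame(
    {
        "ord_1": ["Contributor", "Grandmaster", None],
        "ord_2": ["Hot", "Warm", "Lava Hot"],
    }
)
toy["ord_1_rank"] = encode_ordinal(toy["ord_1"], ORD_1_ORDER)
toy["ord_2_rank"] = encode_ordinal(toy["ord_2"], ORD_2_ORDER)
print(toy)
```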

LogisticRegression (logistic regression)

The following code trains a logistic regression model on one-hot encoded features:

import pandas as pd
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing


def run(fold):
    # Read the training data with its stratified k-fold assignments
    df = pd.read_csv("input/cat_train.csv")
    # Use every column except "id", "target" and "kfold" as a feature
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    for col in features:
        # Fill missing values with "NONE" before casting to string;
        # casting first would turn NaN into the string "nan"
        df[col] = df[col].astype(object).fillna("NONE").astype(str)
    # Training set: rows whose kfold value differs from fold
    df_train = df[df.kfold != fold].reset_index(drop=True)
    # Validation set: rows whose kfold value equals fold
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # One-hot encoder
    ohe = preprocessing.OneHotEncoder()
    # Fit on training and validation features together so that
    # every category is known to the encoder
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    ohe.fit(full_data[features])
    # Transform the training set
    x_train = ohe.transform(df_train[features])
    # Transform the validation set
    x_valid = ohe.transform(df_valid[features])
    # Logistic regression with default parameters
    model = linear_model.LogisticRegression()
    # Fit the model on the training set
    model.fit(x_train, df_train.target.values)
    # Predicted probability of class 1 on the validation set
    valid_preds = model.predict_proba(x_valid)[:, 1]
    # Compute the AUC score
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")


if __name__ == "__main__":
    # Run all five folds
    for fold in range(5):
        run(fold)

Results

Fold = 0, AUC = 0.7847865042255127

Fold = 1, AUC = 0.7853553605899214

Fold = 2, AUC = 0.7879321942914885

Fold = 3, AUC = 0.7870315929550808

Fold = 4, AUC = 0.7864668243125608
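
The run function above expects a kfold column in the CSV, and this post does not show how that column was created. One possible way to produce it is scikit-learn's StratifiedKFold, sketched below on a toy frame. The helper name add_kfold_column, the seed, and the toy data are assumptions.

```python
import pandas as pd
from sklearn import model_selection


def add_kfold_column(df, target_col="target", n_splits=5, seed=42):
    """Return a shuffled copy of df with a stratified "kfold" column."""
    # Shuffle the rows and reset the index so positions match labels.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    df["kfold"] = -1
    skf = model_selection.StratifiedKFold(n_splits=n_splits)
    # Each split yields the row positions of one validation fold.
    for fold, (_, valid_idx) in enumerate(
        skf.split(X=df, y=df[target_col].values)
    ):
        df.loc[valid_idx, "kfold"] = fold
    return df


# Toy data: 15 negatives and 5 positives (hypothetical).
toy = pd.DataFrame({"id": range(20), "target": [0] * 15 + [1] * 5})
toy = add_kfold_column(toy)
print(toy.groupby("kfold")["target"].agg(["count", "sum"]))
```

Stratifying keeps the share of positive rows roughly equal in every fold, which matters here because the target is imbalanced.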

RandomForestClassifier (random forest model)

With default parameters and label-encoded features:

import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import preprocessing


def run(fold):
    # Read the training data with its stratified k-fold assignments
    df = pd.read_csv("input/cat_train.csv")
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    for col in features:
        # Fill missing values with "NONE" before casting to string
        df[col] = df[col].astype(object).fillna("NONE").astype(str)
        # Label-encode the column: each category becomes an integer
        lbl = preprocessing.LabelEncoder()
        df[col] = lbl.fit_transform(df[col])
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # Feature matrices for training and validation
    x_train = df_train[features].values
    x_valid = df_valid[features].values
    # Random forest with default parameters, using all CPU cores
    model = ensemble.RandomForestClassifier(n_jobs=-1)
    model.fit(x_train, df_train.target.values)
    # Predicted probability of class 1 on the validation set
    valid_preds = model.predict_proba(x_valid)[:, 1]
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")


if __name__ == "__main__":
    # Run all five folds
    for fold in range(5):
        run(fold)

Results

Fold = 0, AUC = 0.7143420371128966

Fold = 1, AUC = 0.7182654891323974

Fold = 2, AUC = 0.7162629185564836

Fold = 3, AUC = 0.7138862032799431

Fold = 4, AUC = 0.7169939048511448
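
On every fold, logistic regression on one-hot features scores higher than the default random forest on label-encoded features. To compare the two runs with a single number each, the per-fold scores can be averaged. The sketch below only copies the AUC values printed in this post; the variable names are my own.

```python
from statistics import mean

# Per-fold AUC values copied from the two result listings in this post.
lr_auc = [
    0.7847865042255127,
    0.7853553605899214,
    0.7879321942914885,
    0.7870315929550808,
    0.7864668243125608,
]
rf_auc = [
    0.7143420371128966,
    0.7182654891323974,
    0.7162629185564836,
    0.7138862032799431,
    0.7169939048511448,
]

print(f"LogisticRegression mean AUC:     {mean(lr_auc):.4f}")
print(f"RandomForestClassifier mean AUC: {mean(rf_auc):.4f}")
print(f"Difference:                      {mean(lr_auc) - mean(rf_auc):.4f}")
```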