【Kaggle】AAAMLP Reading Notes: Cat-in-the-dat II
  uKHDYRvlooeP · December 23, 2023

I want to start my Kaggle competition journey from scratch. My goals:

  1. Earn my first Kaggle medal in December
  2. Read through Abhishek Thakur's Approaching (Almost) Any Machine Learning Problem soon and record it in blog posts: https://github.com/abhishekkrthakur/approachingalmost
  3. Publish at least 21 blog posts in December
  4. Keep up eight hours of study every day

Approaching categorical variables (hands-on part)

Cat-in-the-dat

Data analysis:

The training set has 25 columns in total: one id, one target, and 23 feature columns.

| Column | Sample 1    | Sample 2    | Note       |
|--------|-------------|-------------|------------|
| bin_0  | 0           | 1           | binary     |
| bin_1  | 0           | 1           | binary     |
| bin_2  | 0           | 0           | binary     |
| bin_3  | F           | F           | binary     |
| bin_4  | N           | Y           | binary     |
| nom_0  | Red         | Red         | color      |
| nom_1  | Trapezoid   | Star        | shape      |
| nom_2  | Hamster     | Axolotl     | animal     |
| nom_3  | Russia      | NaN         | country    |
| nom_4  | Bassoon     | Theremin    | instrument |
| nom_5  | de4c57ee2   | 2bb3c3e5c   | hashed     |
| nom_6  | a64bc7ddf   | 3a3a936e8   | hashed     |
| nom_7  | 598080a91   | 1dddb8473   | hashed     |
| nom_8  | 0256c7a4b   | 52ead350c   | hashed     |
| nom_9  | 02e7c8990   | f37df64af   | hashed     |
| ord_0  | 3           | 3           | 3          |
| ord_1  | Contributor | Grandmaster | rank       |
| ord_2  | Hot         | Warm        | weather    |
| ord_3  | c           | e           | letters    |
| ord_4  | U           | X           | letters    |
| ord_5  | Pw          | pE          | letters    |
| day    | 6           | 7           | 5          |
| month  | 3           | 7           | 9          |
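The per-column listings below can be reproduced with a loop over the feature columns, replacing missing values with the string "NONE" before counting. A minimal sketch on toy data standing in for cat_train.csv:

```python
import pandas as pd

# Toy frame standing in for cat_train.csv (hypothetical values)
df = pd.DataFrame({
    "bin_0": [0.0, 1.0, 0.0, None],
    "nom_0": ["Red", "Blue", None, "Red"],
})

# Fill NaN with "NONE", cast to string, then count each category
counts = {}
for col in df.columns:
    counts[col] = df[col].fillna("NONE").astype(str).value_counts()
    print(counts[col], "\n")
```

Note that `fillna` must run before `astype(str)`; otherwise NaN is already the string "nan" and is never replaced.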


bin_0
0.0     528377
1.0      53729
NONE     17894
Name: count, dtype: int64

bin_1
0.0     474018
1.0     107979
NONE     18003
Name: count, dtype: int64

bin_2
0.0     419845
1.0     162225
NONE     17930
Name: count, dtype: int64

bin_3
F       366212
T       215774
NONE     18014
Name: count, dtype: int64

bin_4
N       312344
Y       269609
NONE     18047
Name: count, dtype: int64

nom_0
Red      323286
Blue     205861
Green     52601
NONE      18252
Name: count, dtype: int64

nom_1
Triangle     164190
Polygon      152563
Trapezoid    119438
Circle       104995
Square        26503
NONE          18156
Star          14155
Name: count, dtype: int64

nom_2
Hamster    164897
Axolotl    152319
Lion       119504
Dog        104825
Cat         26276
NONE        18035
Snake       14144
Name: count, dtype: int64

nom_3
India         164869
Costa Rica    151827
Russia        119840
Finland       104601
Canada         26425
NONE           18121
China          14317
Name: count, dtype: int64

nom_4
Theremin    308621
Bassoon     196639
Oboe         49996
Piano        26709
NONE         18035
Name: count, dtype: int64

nom_5
NONE         17778
fc8fc7e56      977
360a16627      972
423976253      961
7917d446c      961
             ...  
7335087fd        5
30019ce8a        3
0385d0739        1
b3ad70fcb        1
d6bb2181a        1
Name: count, Length: 1221, dtype: int64

nom_6
NONE         18131
ea8c5e181      805
9fa481341      798
2b94ada45      792
32e9bd1ff      788
             ...  
f0732a795        4
322548bed        3
d6ea07c05        2
b4b8de4b9        2
3a121fefb        1
Name: count, Length: 1520, dtype: int64

nom_7
NONE         18003
4ae48e857     5035
c79d2197d     5031
86ec768cd     4961
a7059911d     4945
             ...  
b39008216      195
1828818ab      182
75d0e3ef8      157
deec583dd       93
e9c57c4aa       79
Name: count, Length: 223, dtype: int64

nom_8
NONE         17755
7d7c02c57     5052
15f03b1f4     4994
5859a8a06     4989
d7e75499d     4987
             ...  
8d31d1ab3      207
4584d6fcd      174
607c26084      149
115d9fd8b      105
6492aecc3       61
Name: count, Length: 223, dtype: int64

nom_9
NONE         18073
8f3276a6e      565
65b262989      564
c5361037c      560
9bc905a9d      558
             ...  
978258393        2
432e3fc6a        2
1538d82e9        2
5f565a682        1
d1e6704ed        1
Name: count, Length: 2219, dtype: int64

ord_0
1.0     227917
3.0     197798
2.0     155997
NONE     18288
Name: count, dtype: int64

ord_1
Novice         160597
Expert         139677
Contributor    109821
Grandmaster     95866
Master          75998
NONE            18041
Name: count, dtype: int64

ord_2
Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NONE            18075
Name: count, dtype: int64

ord_3
n       70982
a       65321
m       57980
c       56675
h       55744
o       45464
b       44795
e       38904
k       38718
i       34763
d       30634
f       29450
NONE    17916
g        6180
j        3639
l        2835
Name: count, dtype: int64

ord_4
N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
G        3404
V        3107
J        1950
L        1657
Name: count, dtype: int64

ord_5
NONE    17713
Fl      10562
DN       9527
Sz       8654
RV       5648
        ...  
vw        189
gV        124
vQ        120
eA         91
Zv         87
Name: count, Length: 191, dtype: int64

day
3.0     113835
5.0     110464
6.0      97432
7.0      86435
1.0      84724
2.0      65495
4.0      23663
NONE     17952
Name: count, dtype: int64

month
8.0     79245
3.0     70160
5.0     68906
12.0    68340
6.0     60478
7.0     53480
1.0     52154
11.0    51165
2.0     40700
9.0     20620
NONE    17988
4.0     14614
10.0     2150
Name: count, dtype: int64

target
0    487677
1    112323
Name: count, dtype: int64

Class-imbalance ratios (majority count / minority count) for the binary columns and the target, and cardinalities of the hashed nominal columns:

bin_0 ≈ 10

bin_1 ≈ 5

bin_2 ≈ 2.5

bin_3 ≈ 1.7

bin_4 ≈ 1.19

nom_5: 1221 unique values

nom_6: 1520 unique values

nom_7: 223 unique values

nom_8: 223 unique values

nom_9: 2219 unique values

target ≈ 4.34
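The target ratio above follows directly from the value counts; a quick check using the counts printed earlier:

```python
import pandas as pd

# Counts taken from the target value_counts above
target_counts = pd.Series({0: 487677, 1: 112323})

# Majority / minority ratio
ratio = target_counts[0] / target_counts[1]
print(round(ratio, 2))  # 4.34
```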

ord_1 and ord_2 have a natural relative ordering (rank and temperature).
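Because of that ordering, these two columns could also be mapped to integers instead of one-hot columns. A sketch, assuming one plausible ordering (the mapping dicts are my own choice, not from the source):

```python
import pandas as pd

# Assumed orderings (hypothetical; rank low-to-high, temperature cold-to-hot)
ord_1_map = {"Novice": 0, "Contributor": 1, "Expert": 2,
             "Master": 3, "Grandmaster": 4}
ord_2_map = {"Freezing": 0, "Cold": 1, "Warm": 2,
             "Hot": 3, "Boiling Hot": 4, "Lava Hot": 5}

df = pd.DataFrame({
    "ord_1": ["Novice", "Grandmaster", None],
    "ord_2": ["Hot", "Freezing", "Lava Hot"],
})

# Map to integers; missing/unmapped values become -1
df["ord_1_enc"] = df["ord_1"].map(ord_1_map).fillna(-1).astype(int)
df["ord_2_enc"] = df["ord_2"].map(ord_2_map).fillna(-1).astype(int)
print(df[["ord_1_enc", "ord_2_enc"]].values.tolist())
```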

LogisticRegression

Train a logistic regression model with the following code:

import pandas as pd
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing

def run(fold):
    # read the data with the stratified k-fold column
    df = pd.read_csv("input/cat_train.csv")
    # all feature columns except "id", "target" and "kfold"
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    # fill missing values with "NONE" (fillna must come before astype,
    # otherwise NaN is already the string "nan" and is never replaced)
    for col in features:
        df.loc[:, col] = df[col].fillna("NONE").astype(str)
    # training set: rows whose kfold column is not this fold (reset index)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    # validation set: rows whose kfold column equals this fold (reset index)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # one-hot encoder, fitted on train and validation concatenated row-wise
    ohe = preprocessing.OneHotEncoder()
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    ohe.fit(full_data[features])
    # transform the training set
    x_train = ohe.transform(df_train[features])
    # transform the validation set
    x_valid = ohe.transform(df_valid[features])
    # logistic regression
    model = linear_model.LogisticRegression()
    # fit the model on the training set
    model.fit(x_train, df_train.target.values)
    # predicted probability of the positive class on the validation set
    valid_preds = model.predict_proba(x_valid)[:, 1]
    # compute AUC
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
    for fold in range(5):
        run(fold)
Results:

Fold = 0, AUC = 0.7847865042255127

Fold = 1, AUC = 0.7853553605899214

Fold = 2, AUC = 0.7879321942914885

Fold = 3, AUC = 0.7870315929550808

Fold = 4, AUC = 0.7864668243125608
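One-hot encoding the 23 columns, several of which have over a thousand categories, produces a very wide but extremely sparse matrix, which is exactly the representation logistic regression handles efficiently in scikit-learn. A minimal sketch on toy data from the dataset's vocabulary:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy rows: (color, animal) pairs
X = np.array([["Red", "Hamster"],
              ["Blue", "Axolotl"],
              ["Red", "Axolotl"]])

ohe = OneHotEncoder()
Xt = ohe.fit_transform(X)  # SciPy sparse matrix

print(Xt.shape)  # (3, 4): 2 colors + 2 animals
print(Xt.nnz)    # 6 stored non-zeros, one per original cell
```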


RandomForestClassifier

With default parameters:
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import preprocessing

def run(fold):
    df = pd.read_csv("input/cat_train.csv")
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    for col in features:
        # fill missing values with "NONE" (fillna before astype, otherwise
        # NaN is already the string "nan" and is never replaced)
        df.loc[:, col] = df[col].fillna("NONE").astype(str)
        # label-encode the column to integers
        lbl = preprocessing.LabelEncoder()
        lbl.fit(df[col])
        df.loc[:, col] = lbl.transform(df[col])
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # feature matrices for the training and validation sets
    x_train = df_train[features].values
    x_valid = df_valid[features].values
    model = ensemble.RandomForestClassifier(n_jobs=-1)
    model.fit(x_train, df_train.target.values)
    valid_preds = model.predict_proba(x_valid)[:, 1]
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
    # run all five folds
    for fold in range(5):
        run(fold)
Results:

Fold = 0, AUC = 0.7143420371128966

Fold = 1, AUC = 0.7182654891323974

Fold = 2, AUC = 0.7162629185564836

Fold = 3, AUC = 0.7138862032799431

Fold = 4, AUC = 0.7169939048511448
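With LabelEncoder each category becomes an arbitrary integer determined by alphabetical order, so the encoded values carry no real ordering; trees can still split on them, but here the random forest (~0.716 AUC) clearly trails one-hot encoding plus logistic regression (~0.786 AUC). A minimal sketch of what the encoder does:

```python
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
enc = lbl.fit_transform(["Red", "Blue", "NONE", "Red"])

print(list(lbl.classes_))  # categories sorted alphabetically
print(list(enc))           # each value replaced by its class index
```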

 
