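I want to start my Kaggle competition journey from scratch. My goals:
- Win my first Kaggle medal in December
- Finish Abhishek Thakur's Approaching (Almost) Any Machine Learning Problem in the short term and blog my notes along the way: https://github.com/abhishekkrthakur/approachingalmost
- Publish at least 21 blog posts in December
- Study eight hours every day
Approaching categorical variables (experiments)
Cat-in-the-dat
Data analysis:
There are 25 columns in total: one id, one target, and 23 feature variables.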
| Column | Sample row 1 | Sample row 2 | Notes |
| --- | --- | --- | --- |
| bin_0 | 0 | 1 | binary variable |
| bin_1 | 0 | 1 | binary variable |
| bin_2 | 0 | 0 | binary variable |
| bin_3 | F | F | binary variable |
| bin_4 | N | Y | binary variable |
| nom_0 | Red | Red | color |
| nom_1 | Trapezoid | Star | shape |
| nom_2 | Hamster | Axolotl | animal |
| nom_3 | Russia | | country |
| nom_4 | Bassoon | Theremin | musical instrument |
| nom_5 | de4c57ee2 | 2bb3c3e5c | code |
| nom_6 | a64bc7ddf | 3a3a936e8 | code |
| nom_7 | 598080a91 | 1dddb8473 | code |
| nom_8 | 0256c7a4b | 52ead350c | code |
| nom_9 | 02e7c8990 | f37df64af | code |
| ord_0 | 3 | 3 | 3 |
| ord_1 | Contributor | Grandmaster | rank |
| ord_2 | Hot | Warm | weather (temperature) |
| ord_3 | c | e | letter |
| ord_4 | U | X | letter |
| ord_5 | Pw | pE | letters |
| day | 6 | 7 | 5 |
| month | 3 | 7 | 9 |
bin_0
0.0 528377
1.0 53729
NONE 17894
Name: count, dtype: int64
bin_1
0.0 474018
1.0 107979
NONE 18003
Name: count, dtype: int64
bin_2
0.0 419845
1.0 162225
NONE 17930
Name: count, dtype: int64
bin_3
F 366212
T 215774
NONE 18014
Name: count, dtype: int64
bin_4
N 312344
Y 269609
NONE 18047
Name: count, dtype: int64
nom_0
Red 323286
Blue 205861
Green 52601
NONE 18252
Name: count, dtype: int64
nom_1
Triangle 164190
Polygon 152563
Trapezoid 119438
Circle 104995
Square 26503
NONE 18156
Star 14155
Name: count, dtype: int64
nom_2
Hamster 164897
Axolotl 152319
Lion 119504
Dog 104825
Cat 26276
NONE 18035
Snake 14144
Name: count, dtype: int64
nom_3
India 164869
Costa Rica 151827
Russia 119840
Finland 104601
Canada 26425
NONE 18121
China 14317
Name: count, dtype: int64
nom_4
Theremin 308621
Bassoon 196639
Oboe 49996
Piano 26709
NONE 18035
Name: count, dtype: int64
nom_5
NONE 17778
fc8fc7e56 977
360a16627 972
423976253 961
7917d446c 961
...
7335087fd 5
30019ce8a 3
0385d0739 1
b3ad70fcb 1
d6bb2181a 1
Name: count, Length: 1221, dtype: int64
nom_6
NONE 18131
ea8c5e181 805
9fa481341 798
2b94ada45 792
32e9bd1ff 788
...
f0732a795 4
322548bed 3
d6ea07c05 2
b4b8de4b9 2
3a121fefb 1
Name: count, Length: 1520, dtype: int64
nom_7
NONE 18003
4ae48e857 5035
c79d2197d 5031
86ec768cd 4961
a7059911d 4945
...
b39008216 195
1828818ab 182
75d0e3ef8 157
deec583dd 93
e9c57c4aa 79
Name: count, Length: 223, dtype: int64
nom_8
NONE 17755
7d7c02c57 5052
15f03b1f4 4994
5859a8a06 4989
d7e75499d 4987
...
8d31d1ab3 207
4584d6fcd 174
607c26084 149
115d9fd8b 105
6492aecc3 61
Name: count, Length: 223, dtype: int64
nom_9
NONE 18073
8f3276a6e 565
65b262989 564
c5361037c 560
9bc905a9d 558
...
978258393 2
432e3fc6a 2
1538d82e9 2
5f565a682 1
d1e6704ed 1
Name: count, Length: 2219, dtype: int64
ord_0
1.0 227917
3.0 197798
2.0 155997
NONE 18288
Name: count, dtype: int64
ord_1
Novice 160597
Expert 139677
Contributor 109821
Grandmaster 95866
Master 75998
NONE 18041
Name: count, dtype: int64
ord_2
Freezing 142726
Warm 124239
Cold 97822
Boiling Hot 84790
Hot 67508
Lava Hot 64840
NONE 18075
Name: count, dtype: int64
ord_3
n 70982
a 65321
m 57980
c 56675
h 55744
o 45464
b 44795
e 38904
k 38718
i 34763
d 30634
f 29450
NONE 17916
g 6180
j 3639
l 2835
Name: count, dtype: int64
ord_4
N 39978
P 37890
Y 36657
A 36633
R 33045
U 32897
M 32504
X 32347
C 32112
H 31189
Q 30145
T 29723
O 25610
B 25212
E 21871
K 21676
I 19805
NONE 17930
D 17284
F 16721
W 8268
Z 5790
S 4595
G 3404
V 3107
J 1950
L 1657
Name: count, dtype: int64
ord_5
NONE 17713
Fl 10562
DN 9527
Sz 8654
RV 5648
...
vw 189
gV 124
vQ 120
eA 91
Zv 87
Name: count, Length: 191, dtype: int64
day
3.0 113835
5.0 110464
6.0 97432
7.0 86435
1.0 84724
2.0 65495
4.0 23663
NONE 17952
Name: count, dtype: int64
month
8.0 79245
3.0 70160
5.0 68906
12.0 68340
6.0 60478
7.0 53480
1.0 52154
11.0 51165
2.0 40700
9.0 20620
NONE 17988
4.0 14614
10.0 2150
Name: count, dtype: int64
target
0 487677
1 112323
Name: count, dtype: int64
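The listings above look like per-column value counts taken after replacing missing entries with the label "NONE". As a rough illustration of that counting step, here is a minimal sketch that uses only the standard library so it runs without the dataset. The function name `value_counts_with_missing` and the toy column are hypothetical and not from the original experiment.

```python
from collections import Counter


def value_counts_with_missing(values, missing_label="NONE"):
    """Count each value, treating None as its own category."""
    filled = [missing_label if v is None else v for v in values]
    return Counter(filled)


# Hypothetical toy column standing in for something like nom_0.
toy_column = ["Red", "Blue", None, "Red", "Green", None, "Red"]
counts = value_counts_with_missing(toy_column)

# Print values from most to least frequent, like the listings above.
for value, n in counts.most_common():
    print(value, n)
```

Keeping the missing label as its own row makes it easy to see that every column is missing roughly 18,000 of the 600,000 values.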
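Ratio of the more frequent value to the less frequent value (missing values excluded), computed from the counts above:
bin_0: 9.83
bin_1: 4.39
bin_2: 2.59
bin_3: 1.70
bin_4: 1.16
target: 4.34
Number of distinct values in the high-cardinality columns (NONE counted as a value):
nom_5: 1221
nom_6: 1520
nom_7: 223
nom_8: 223
nom_9: 2219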
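The figures for the binary columns and the target appear to be majority-to-minority ratios. The sketch below recomputes a few of them from the counts printed earlier, assuming that interpretation is right; the helper name `imbalance_ratio` is hypothetical.

```python
# Counts copied from the value-count listings above (missing values excluded).
counts = {
    "bin_0": {"0.0": 528377, "1.0": 53729},
    "bin_3": {"F": 366212, "T": 215774},
    "target": {"0": 487677, "1": 112323},
}


def imbalance_ratio(value_counts):
    """Return the most frequent count divided by the least frequent count."""
    return max(value_counts.values()) / min(value_counts.values())


ratios = {name: round(imbalance_ratio(vc), 2) for name, vc in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

A target ratio above 4 means the classes are clearly skewed, which is one reason to evaluate with AUC and to keep the folds stratified.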
ord_1 and ord_2 have a natural relative order among their values.
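Because ord_1 and ord_2 are ordered, one option is to map each level to an integer rank instead of treating it as an unordered category. The sketch below shows the idea. The level orderings are my assumption (Kaggle tiers and temperature from coldest to hottest) and are not stated in the data, and assigning -1 to the missing label is just one possible choice.

```python
# Assumed orderings, lowest to highest (not given by the dataset itself).
ORD_1_ORDER = ["Novice", "Contributor", "Expert", "Master", "Grandmaster"]
ORD_2_ORDER = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]


def make_ordinal_mapping(ordered_levels, missing_label="NONE"):
    """Map each level to its rank; the missing label is mapped to -1."""
    mapping = {level: rank for rank, level in enumerate(ordered_levels)}
    mapping[missing_label] = -1
    return mapping


ord_1_map = make_ordinal_mapping(ORD_1_ORDER)
ord_2_map = make_ordinal_mapping(ORD_2_ORDER)

# Encode a few example values, including a missing one.
sample = ["Contributor", "Grandmaster", "NONE"]
encoded = [ord_1_map[v] for v in sample]
print(encoded)
```

With pandas, a dictionary like this could be applied to a column through `Series.map`. Neither experiment below uses such a mapping; both treat every column as a plain category.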
LogisticRegression (logistic regression)
The following code trains a logistic regression model on one-hot encoded features:
import pandas as pd
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing


def run(fold):
    # Load the training data, which already contains a stratified "kfold" column
    df = pd.read_csv("input/cat_train.csv")
    # Use every column except "id", "target" and "kfold" as a feature
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    for col in features:
        # Fill missing values with "NONE" first, then cast to string;
        # casting first would turn NaN into the string "nan"
        df[col] = df[col].fillna("NONE").astype(str)
    # Training set: rows whose kfold differs from the current fold
    df_train = df[df.kfold != fold].reset_index(drop=True)
    # Validation set: rows whose kfold equals the current fold
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # One-hot encoder
    ohe = preprocessing.OneHotEncoder()
    # Fit on training and validation rows together so every category is known
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    ohe.fit(full_data[features])
    # Transform the training set
    x_train = ohe.transform(df_train[features])
    # Transform the validation set
    x_valid = ohe.transform(df_valid[features])
    # Logistic regression with default parameters
    model = linear_model.LogisticRegression()
    # Fit the model on the training set
    model.fit(x_train, df_train.target.values)
    # Predicted probability of class 1 for the validation set
    valid_preds = model.predict_proba(x_valid)[:, 1]
    # Compute the AUC score
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")


if __name__ == "__main__":
    # Run all 5 folds
    for fold in range(5):
        run(fold)
Results
Fold = 0, AUC = 0.7847865042255127
Fold = 1, AUC = 0.7853553605899214
Fold = 2, AUC = 0.7879321942914885
Fold = 3, AUC = 0.7870315929550808
Fold = 4, AUC = 0.7864668243125608
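The training script reads a kfold column from input/cat_train.csv, but the post does not show how that column was created. As a rough idea of what stratified fold assignment does, here is a standard-library sketch. The function name `assign_stratified_folds`, the seed, and the toy target are all hypothetical; in practice scikit-learn's `model_selection.StratifiedKFold` is the usual tool.

```python
import random
from collections import defaultdict


def assign_stratified_folds(targets, n_splits=5, seed=42):
    """Return one fold index per sample, keeping class proportions similar."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(targets):
        by_class[y].append(idx)
    folds = [0] * len(targets)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[idx] = position % n_splits
    return folds


# Toy target with a 4:1 class ratio, roughly like the real target.
targets = [0] * 80 + [1] * 20
folds = assign_stratified_folds(targets)

# Report size and number of positives for each fold.
for fold in range(5):
    members = [y for y, f in zip(targets, folds) if f == fold]
    print(fold, len(members), sum(members))
```

Each fold ends up with the same share of positives, so the per-fold AUC values are computed on comparable class mixes.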
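RandomForestClassifier (random forest)
With default parameters, on label-encoded features:
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import preprocessing


def run(fold):
    # Load the training data, which already contains a stratified "kfold" column
    df = pd.read_csv("input/cat_train.csv")
    # Use every column except "id", "target" and "kfold" as a feature
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    for col in features:
        # Fill missing values with "NONE" first, then cast to string
        df[col] = df[col].fillna("NONE").astype(str)
        # Label-encode each column: one integer per category
        lbl = preprocessing.LabelEncoder()
        lbl.fit(df[col])
        df[col] = lbl.transform(df[col])
    # Split into training and validation sets by fold
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # Extract the feature matrices
    x_train = df_train[features].values
    x_valid = df_valid[features].values
    # Random forest with default parameters, using all CPU cores
    model = ensemble.RandomForestClassifier(n_jobs=-1)
    model.fit(x_train, df_train.target.values)
    # Predicted probability of class 1 for the validation set
    valid_preds = model.predict_proba(x_valid)[:, 1]
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")


if __name__ == "__main__":
    # Run all 5 folds
    for fold in range(5):
        run(fold)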
Results
Fold = 0, AUC = 0.7143420371128966
Fold = 1, AUC = 0.7182654891323974
Fold = 2, AUC = 0.7162629185564836
Fold = 3, AUC = 0.7138862032799431
Fold = 4, AUC = 0.7169939048511448
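To compare the two models at a glance, the per-fold scores can be averaged. The snippet below is a small sketch that copies the AUC values printed in the two result listings and reports their means; the variable names are my own.

```python
from statistics import mean

# Fold AUCs copied from the logistic regression results above.
logreg_auc = [
    0.7847865042255127,
    0.7853553605899214,
    0.7879321942914885,
    0.7870315929550808,
    0.7864668243125608,
]
# Fold AUCs copied from the random forest results above.
rf_auc = [
    0.7143420371128966,
    0.7182654891323974,
    0.7162629185564836,
    0.7138862032799431,
    0.7169939048511448,
]

logreg_mean = mean(logreg_auc)
rf_mean = mean(rf_auc)
print(f"LogisticRegression mean AUC = {logreg_mean:.4f}")
print(f"RandomForestClassifier mean AUC = {rf_mean:.4f}")
print(f"Difference = {logreg_mean - rf_mean:.4f}")
```

On these runs, logistic regression on one-hot features averages about 0.786 AUC, while the default random forest on label-encoded features averages about 0.716.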