[PaddlePaddle Learning Competition: MarTech Challenge Click Anti-Fraud Prediction] 10th-Place Solution


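This write-up walks through the model's iterations from the baseline V1 to V5, which raised the score from 86.746 to 89.0787. Along the way it covers data exploration, handling of missing values and object-typed columns, feature engineering (device area, time difference, and similar features), experiments with LightGBM and XGBoost, and cross-validation. The best version, built on CatBoost, scores 89.1093.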

# Best version, score 89.1093
import warnings
from datetime import datetime

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

warnings.filterwarnings('ignore')

# Load the data and drop the leading Unnamed index column
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test1.csv')
train = train.iloc[:, 1:]
test = test.iloc[:, 1:]
res = pd.DataFrame(test['sid'])

# Drop the columns flagged during data exploration
col = train.columns.tolist()
remove_list = ['lan', 'os', 'label', 'sid']
for i in remove_list:
    col.remove(i)
# .copy() makes the later column assignments operate on independent frames
features = train[col].copy()
test_features = test[col].copy()

# Clean osv (OS version) into a float
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if x.find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        # Keep the first dot and squeeze out the later ones, e.g. 4.4.2 becomes 4.42
        result = x[0:temp_index1] + '.' + x[temp_index1+1:temp_index2].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print('Parse error: ' + x)
            return 0
    try:
        return float(x)
    except ValueError:
        print('Parse error: ' + x)
        return 0

features['osv'] = features['osv'].fillna('8.1.0').apply(osv_trans)
test_features['osv'] = test_features['osv'].fillna('8.1.0').apply(osv_trans)

# Clean timestamp (milliseconds) and derive time features
for df in (features, test_features):
    df['timestamp'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
start_time = features['timestamp'].min()
for df in (features, test_features):
    temp = pd.DatetimeIndex(df['timestamp'])
    df['year'] = temp.year
    df['month'] = temp.month
    df['day'] = temp.day
    df['hour'] = temp.hour
    df['minute'] = temp.minute
    df['week_day'] = temp.weekday  # day of the week
    # Days elapsed since the earliest training timestamp
    diff = df['timestamp'] - start_time
    df['time_diff'] = diff.dt.days + diff.dt.seconds / 3600 / 24
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Clean version into an integer
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

features['version'] = features['version'].apply(version_trans).astype('int')
test_features['version'] = test_features['version'].apply(version_trans).astype('int')

# Map lan to integer codes; missing or unmapped values become 22.
# lan is in remove_list above, so this mapping does not reach the model.
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7,
           'cn': 8, 'CN': 9, 'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14,
           'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18, 'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map).fillna(22)
test['lan'] = test['lan'].map(lan_map).fillna(22)

# Screen area and height/width ratio features
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']
features['dev_rato'] = features['dev_height'] / features['dev_width']
test_features['dev_rato'] = test_features['dev_height'] / test_features['dev_width']

# Difference between OS version and app version
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']

# fea_hash and fea1_hash: length feature, and collapse very long values to 0
for df in (features, test_features):
    for c in ('fea_hash', 'fea1_hash'):
        df[c + '_len'] = df[c].map(lambda x: len(str(x)))
        df[c] = df[c].map(lambda x: 0 if len(str(x)) > 16 else int(x))

# Find key values by class-frequency ratio and build indicator features.
# New column name = original column name + '1'.
def find_key_feature(train, selected):
    temp = pd.DataFrame(columns=[0, 1])
    temp0 = train[train['label'] == 0]
    temp1 = train[train['label'] == 1]
    temp[0] = temp0[selected].value_counts() / len(temp0) * 100
    temp[1] = temp1[selected].value_counts() / len(temp1) * 100
    temp[2] = temp[1] / temp[0]
    # Keep values whose share in class 1 is more than 10 times their share in class 0
    result = temp[temp[2] > 10].sort_values(2, ascending=False).index
    return result

selected_cols = ['osv', 'apptype', 'carrier', 'dev_height', 'dev_ppi', 'dev_width', 'media_id',
                 'package', 'version', 'fea_hash', 'location', 'fea1_hash', 'cus_type']
# Key values come from the raw train columns; osv, version and fea_hash were
# transformed in features above, so their raw keys may no longer match.
key_feature = {}
for selected in selected_cols:
    key_feature[selected] = find_key_feature(train, selected)

def f(x, selected):
    if x in key_feature[selected]:
        return 1
    else:
        return 0

for selected in selected_cols:
    if len(key_feature[selected]) > 0:
        features[selected + '1'] = features[selected].apply(f, args=(selected,))
        test_features[selected + '1'] = test_features[selected].apply(f, args=(selected,))
        print(selected + '1 created')

# CatBoost model
model = CatBoostClassifier(
            loss_function="Logloss",
            eval_metric="AUC",
            task_type="GPU",
            learning_rate=0.1,
            iterations=1000,
            random_seed=2025,
            od_type="Iter",
            depth=7)

n_folds = 10  # 10-fold cross-validation
answers = []
mean_score = 0
data_x = features
data_y = train['label']
sk = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=2025)
all_test = test_features.copy()
# Index names differ from train/test so the DataFrames are not shadowed
for train_idx, val_idx in sk.split(data_x, data_y):
    x_train = data_x.iloc[train_idx]
    y_train = data_y.iloc[train_idx]
    x_val = data_x.iloc[val_idx]
    y_val = data_y.iloc[val_idx]
    clf = model.fit(x_train, y_train, eval_set=(x_val, y_val), verbose=500)  # log every 500 iterations

    val_pred = clf.predict(x_val, prediction_type='Probability')[:, -1]
    fold_auc = roc_auc_score(y_val, val_pred)
    print('CatBoost validation AUC: {}'.format(fold_auc))
    mean_score += fold_auc / n_folds

    test_pred = clf.predict(all_test, prediction_type='Probability')[:, -1]
    answers.append(test_pred)
print('mean valAuc:{}'.format(mean_score))
# Average the fold probabilities and threshold at 0.5
cat_pre = sum(answers) / n_folds
res['label'] = [1 if x >= 0.5 else 0 for x in cat_pre]
res.to_csv('./baselinev6.csv', index=False)

Parse error: f073b_changxiang_v01_b1b8_20180915
Parse error: %E6%B1%9F%E7%81%B5OS+5.0
Parse error: GIONEE_YNGA
       

Thought process and baseline iterations

Baseline V1_lgb, score: 86.746

Switch drives (start Jupyter on drive D):

jupyter notebook D:\
       

I. Data exploration

1. Drop the Unnamed column

train = train.iloc[:, 1:]
test = test.iloc[:,1:]
       

2. Inspect column types

Option 1:

train.info()
       

Option 2:

Alternatively, list only the columns whose type is object:

train.select_dtypes(include='object').columns
       

The following columns are of type object and need to be converted to numeric values:

 7   lan         316720 non-null  object 
 10  os          500000 non-null  object 
 11  osv         493439 non-null  object 
 15  version     500000 non-null  object 
 16  fea_hash    500000 non-null  object
       

Taking lan as an example, inspect its values:

train['lan'].value_counts()
       

3. Count missing values

Option 1:

train.isnull().sum()
       

Option 2 (show only the columns that have missing values):

t = train.isnull().sum()
t[t>0]
       

The following columns have many missing values:

lan           183280
osv             6561
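
Counts are easier to judge as a share of the rows. Given the 500,000 rows reported by train.info(), lan is missing in about 36.7% of them and osv in about 1.3%. The sketch below computes the same summary on a small made-up frame; the toy values and the names toy and summary are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
toy = pd.DataFrame({
    'lan': ['zh-CN', None, None, 'en'],
    'osv': ['8.1.0', '9', None, '7.1.1'],
    'os': ['android', 'Android', 'android', 'android'],
})

# Missing count per column, plus the share of rows it represents
missing = toy.isnull().sum()
summary = pd.DataFrame({'missing': missing, 'ratio': missing / len(toy)})

# Keep only columns with at least one missing value, worst first
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```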
       

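4. Number of unique values

Check how many unique values each column has:

features = train.columns.tolist()
for feature in features:
    # Flag columns with very few distinct values (the binary label column also meets this threshold)
    if train[feature].nunique() <= 2:
        print(feature, train[feature].nunique())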
       

The os column has very few unique values:

os 2
       

Inspect os:

train['os'].value_counts()
       

Every os value is Android; the two values differ only in capitalization:

android    303175
Android    196825
Name: os, dtype: int64
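
A quick way to confirm that the two values are the same platform is to normalize the case before counting. This is a minimal sketch on a made-up Series; os_col is an illustrative name.

```python
import pandas as pd

# Toy stand-in for train['os']
os_col = pd.Series(['android', 'Android', 'android', 'Android', 'android'])

n_raw = os_col.nunique()                    # distinct values as stored
n_normalized = os_col.str.lower().nunique() # distinct values ignoring case
print(n_raw, n_normalized)
```

A column with a single effective value carries no information for the model, which is why os ends up in the removal list.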
       

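5. Conclusions from data exploration

Object-typed columns: lan, os, osv, version, fea_hash

Columns with many missing values: lan, osv

Column with few unique values and little meaning: os

Column with no predictive meaning: sid

Baseline V1 also drops timestamp to begin with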

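6. Feature correlation analysis (supplementary)

# Correlation analysis of the feature columns
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.figure(figsize=(10, 10))
# Restrict to numeric columns; object columns such as lan cannot be correlated directly
sns.heatmap(train.select_dtypes(include='number').corr(), cbar=True, annot=True, cmap='Blues')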
       

II. Data preprocessing

Columns finally removed: lan, os, osv, version, label, sid, timestamp

remove_list = ['lan', 'os', 'osv', 'version', 'label', 'sid', 'timestamp']
col = features  # the list of column names built in the unique-value check
for i in remove_list:
    col.remove(i)
features = train[col]
       

III. Feature engineering

1. Transforming fea_hash

# Inspect the values
train['fea_hash'].value_counts()
# Inspect summary statistics
train['fea_hash'].describe()
# Inspect the distribution of value lengths
train['fea_hash'].map(lambda x: len(str(x))).value_counts()

Transform fea_hash:

# Use the length of fea_hash as a new feature
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Collapse very long values (more than 16 characters) to 0; otherwise keep the value itself as an integer
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
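
To make the transformation concrete, here is a small worked example on made-up values (the three strings are illustrative assumptions, not taken from the dataset). Short numeric strings keep their integer value, while the 32-character value is collapsed to 0 and survives only through its length.

```python
import pandas as pd

# Made-up values: two short numeric strings and one long hash-like string
toy = pd.DataFrame({'fea_hash': ['2328510010', '1234', 'a3f5c9e07b2d4418bb6f0e1c2d3a4b5c']})

# Length of the raw value becomes its own feature
toy['fea_hash_len'] = toy['fea_hash'].map(lambda x: len(str(x)))

# Values longer than 16 characters collapse to 0; the rest are parsed as integers
toy['fea_hash'] = toy['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
print(toy)
```

Note that int(x) raises a ValueError for a short value that is not purely numeric, so the rule relies on all non-numeric values being longer than 16 characters.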
       

IV. Model building

Apply the same processing to test as to train, train and predict with LightGBM, and save the result. The steps above combine into the following code:

# Baseline V1
import warnings

import lightgbm as lgb
import pandas as pd

warnings.filterwarnings('ignore')

# Load the data
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test1.csv')

# Drop the Unnamed column
train = train.iloc[:, 1:]
test = test.iloc[:, 1:]

# Drop the columns flagged during data exploration
col = train.columns.tolist()
remove_list = ['lan', 'os', 'osv', 'version', 'label', 'sid', 'timestamp']
for i in remove_list:
    col.remove(i)
features = train[col].copy()
test_features = test[col].copy()

# fea_hash and fea1_hash: length feature, and collapse very long values to 0
for df in (features, test_features):
    for c in ('fea_hash', 'fea1_hash'):
        df[c + '_len'] = df[c].map(lambda x: len(str(x)))
        df[c] = df[c].map(lambda x: 0 if len(str(x)) > 16 else int(x))

# Train and predict with LightGBM
model = lgb.LGBMClassifier()
model.fit(features, train['label'])
result = model.predict(test_features)

# res holds the sid and label columns
res = pd.DataFrame(test['sid'])
res['label'] = result

# Save to CSV
res.to_csv('./baselineV1.csv', index=False)
       

Baseline V2_lgb, score: 88.2007

I. Feature engineering improvements

1. Using the osv feature

# Clean osv (OS version) into a float
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if x.find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        # Keep the first dot and squeeze out the later ones, e.g. 4.4.2 becomes 4.42
        result = x[0:temp_index1] + '.' + x[temp_index1+1:temp_index2].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print('Parse error: ' + x)
            return 0
    try:
        return float(x)
    except ValueError:
        print('Parse error: ' + x)
        return 0

# Fill missing osv with 8.1.0 before parsing
features['osv'] = features['osv'].fillna('8.1.0').apply(osv_trans)
test_features['osv'] = test_features['osv'].fillna('8.1.0').apply(osv_trans)
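
The parsing rules are easier to follow with a few inputs traced through them. The sketch below repeats the function so that it runs on its own; the sample strings other than GIONEE_YNGA are made-up illustrations, and the message text is an assumption.

```python
def osv_trans(x):
    # Strip known prefixes and the letter W
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if x.find('.') > 0:
        first_dot = x.find('.')
        # Cut at the first space if there is one, and at the first hyphen if there is one
        end = x.find(' ') if x.find(' ') > 0 else len(x)
        if x.find('-') > 0:
            end = x.find('-')
        # Keep the first dot, drop the rest: '4.4.2' -> '4.42'
        result = x[:first_dot] + '.' + x[first_dot + 1:end].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print('Parse error: ' + x)
            return 0
    try:
        return float(x)
    except ValueError:
        print('Parse error: ' + x)
        return 0


samples = ['8.1.0', 'Android_9', 'Android 4.4.2', '7.1.1-beta', 'GIONEE_YNGA']
parsed = [osv_trans(s) for s in samples]
print(dict(zip(samples, parsed)))
```

One consequence of squeezing the dots is that the result is an ordering hint rather than a true version number: 4.4.2 becomes 4.42, which sorts above 4.5.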
       

2. Using the timestamp feature

Extract time components at several scales and compute the time difference (time_diff) from the earliest training timestamp.

# Clean timestamp (milliseconds) and derive time features
from datetime import datetime

for df in (features, test_features):
    df['timestamp'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
start_time = features['timestamp'].min()
for df in (features, test_features):
    temp = pd.DatetimeIndex(df['timestamp'])
    df['year'] = temp.year
    df['month'] = temp.month
    df['day'] = temp.day
    df['hour'] = temp.hour
    df['minute'] = temp.minute
    df['week_day'] = temp.weekday  # day of the week
    # Days elapsed since the earliest training timestamp
    diff = df['timestamp'] - start_time
    df['time_diff'] = diff.dt.days + diff.dt.seconds / 3600 / 24

# Drop the raw timestamp column
col = features.columns.tolist()
col.remove('timestamp')
features = features[col]
test_features = test_features[col]
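
The following sketch shows what these features look like on three made-up millisecond timestamps. It swaps in pd.to_datetime(..., unit='ms') and total_seconds(), which is an assumption of this example rather than the code used above: to_datetime interprets the values as UTC, whereas datetime.fromtimestamp uses the machine's local time zone, so hour and day can differ between the two.

```python
import pandas as pd

# Made-up timestamps in milliseconds: 2019-06-03 00:00, 2019-06-03 12:00, 2019-06-05 00:00 (UTC)
toy = pd.DataFrame({'timestamp': [1559520000000, 1559563200000, 1559692800000]})

ts = pd.to_datetime(toy['timestamp'], unit='ms')
toy['hour'] = ts.dt.hour
toy['week_day'] = ts.dt.weekday  # Monday is 0

# Fractional days since the earliest timestamp
toy['time_diff'] = (ts - ts.min()).dt.total_seconds() / 86400
print(toy)
```

Because time_diff is measured from the earliest training timestamp, test rows must reuse the training start_time, as the code above does.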
       

3. Using the version feature

# Clean version into an integer
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

features['version'] = features['version'].apply(version_trans)
test_features['version'] = test_features['version'].apply(version_trans)
features['version'] = features['version'].astype('int')
test_features['version'] = test_features['version'].astype('int')
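
The chain of comparisons can also be written as a lookup table. The sketch below is one possible variant, not the code used in the submissions: the names VERSION_MAP and version_to_int and the fallback value -1 for labels that are neither in the table nor numeric are assumptions made for illustration. The original function raises a ValueError on such labels instead.

```python
import pandas as pd

# Same special cases as version_trans, expressed as a table
VERSION_MAP = {'V3': 3, 'v1': 1, 'P_Final_6': 6, 'V6': 6,
               'GA3': 3, 'GA2': 2, 'V2': 2, '50': 5}


def version_to_int(x, default=-1):
    """Map a version label to an integer; unknown labels get `default` (an assumed sentinel)."""
    if x in VERSION_MAP:
        return VERSION_MAP[x]
    try:
        return int(x)
    except (TypeError, ValueError):
        return default


toy = pd.Series(['V3', '7', 'P_Final_6', '50', 'beta'])
converted = toy.apply(version_to_int).tolist()
print(converted)
```

A sentinel keeps prediction from failing if the test set contains a label that never appeared in training, at the cost of silently grouping all unknown labels together.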
       

II. Model building

The steps above combine into the following code:

import warnings
from datetime import datetime

import lightgbm as lgb
import pandas as pd

warnings.filterwarnings('ignore')

# Load the data and drop the Unnamed column
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test1.csv')
train = train.iloc[:, 1:]
test = test.iloc[:, 1:]

# Drop the columns flagged during data exploration
col = train.columns.tolist()
remove_list = ['lan', 'os', 'label', 'sid']
for i in remove_list:
    col.remove(i)
features = train[col].copy()
test_features = test[col].copy()

# Clean osv (OS version) into a float
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if x.find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        result = x[0:temp_index1] + '.' + x[temp_index1+1:temp_index2].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print('Parse error: ' + x)
            return 0
    try:
        return float(x)
    except ValueError:
        print('Parse error: ' + x)
        return 0

features['osv'] = features['osv'].fillna('8.1.0').apply(osv_trans)
test_features['osv'] = test_features['osv'].fillna('8.1.0').apply(osv_trans)

# Clean timestamp (milliseconds) and derive time features
for df in (features, test_features):
    df['timestamp'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
start_time = features['timestamp'].min()
for df in (features, test_features):
    temp = pd.DatetimeIndex(df['timestamp'])
    df['year'] = temp.year
    df['month'] = temp.month
    df['day'] = temp.day
    df['hour'] = temp.hour
    df['minute'] = temp.minute
    df['week_day'] = temp.weekday  # day of the week
    diff = df['timestamp'] - start_time
    df['time_diff'] = diff.dt.days + diff.dt.seconds / 3600 / 24
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Clean version into an integer
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

features['version'] = features['version'].apply(version_trans).astype('int')
test_features['version'] = test_features['version'].apply(version_trans).astype('int')

# fea_hash and fea1_hash: length feature, and collapse very long values to 0
for df in (features, test_features):
    for c in ('fea_hash', 'fea1_hash'):
        df[c + '_len'] = df[c].map(lambda x: len(str(x)))
        df[c] = df[c].map(lambda x: 0 if len(str(x)) > 16 else int(x))

# Train and predict with LightGBM
model = lgb.LGBMClassifier()
model.fit(features, train['label'])
result = model.predict(test_features)

# res holds the sid and label columns
res = pd.DataFrame(test['sid'])
res['label'] = result

# Save to CSV
res.to_csv('./baselineV2.csv', index=False)
print('Done')
       

Baseline V3_xgb, score: 88.5073

I. Feature engineering improvements

1. Screen area and height/width ratio features

features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']
features['dev_rato'] = features['dev_height'] / features['dev_width']
test_features['dev_rato'] = test_features['dev_height'] / test_features['dev_width']
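
One thing to watch with the ratio is a width of zero. Whether the competition data contains such rows is not verified here, so the following is only a sketch on made-up values: pandas does not raise on division by zero but produces inf (or NaN for 0/0), and replacing inf with NaN is one assumed way to keep the column well behaved.

```python
import numpy as np
import pandas as pd

# Made-up device sizes, including rows with a zero width
toy = pd.DataFrame({'dev_height': [1920, 0, 1280], 'dev_width': [1080, 0, 0]})

toy['dev_area'] = toy['dev_height'] * toy['dev_width']

# 1280 / 0 gives inf and 0 / 0 gives NaN; map inf to NaN so both read as missing
ratio = toy['dev_height'] / toy['dev_width']
toy['dev_rato'] = ratio.replace([np.inf, -np.inf], np.nan)
print(toy)
```

Gradient-boosted tree libraries such as LightGBM, XGBoost, and CatBoost accept NaN as a missing value, so no further imputation is strictly needed for them.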
       

2. Difference between the OS version and the app version

features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']
       

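II. XGBoost model

1. LightGBM hand-me-down (tried-and-true) parameters

# objective='multiclass' comes from the generic template; for this binary task objective='binary' applies
clf = lgb.LGBMClassifier(
            num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25, objective='multiclass',
            max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2025,
            n_estimators=2000, subsample=1, colsample_bytree=1)
# LightGBM settings for GPU training:
# device = gpu
# gpu_platform_id = 0
# gpu_device_id = 0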

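2. XGBoost hand-me-down parameters

# min_child_samples is a LightGBM parameter name; XGBoost's counterpart is min_child_weight.
# tree_method='gpu_hist' is deprecated since XGBoost 2.0 in favor of tree_method='hist' with device='cuda'.
model_xgb = xgb.XGBClassifier(
            max_depth=9, learning_rate=0.005, n_estimators=2000,
            objective='multi:softprob', tree_method='gpu_hist',
            subsample=0.8, colsample_bytree=0.8,
            min_child_samples=3, eval_metric='logloss', reg_lambda=0.5)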
       

3. Train XGBoost with the hand-me-down parameters

%%time
# Train and predict with XGBoost
import xgboost as xgb

model_xgb = xgb.XGBClassifier(
            max_depth=15, learning_rate=0.05, n_estimators=5000,
            objective='binary:logistic', tree_method='gpu_hist',
            subsample=0.8, colsample_bytree=0.8,
            min_child_samples=3, eval_metric='auc', reg_lambda=0.5
        )
model_xgb.fit(features, train['label'])
result_xgb = model_xgb.predict(test_features)
res = pd.DataFrame(test['sid'])
res['label'] = result_xgb
res.to_csv('./baselineV3.csv', index=False)
print('Done')
       

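The XGBoost hand-me-down parameters used above mean the following.

Parameter: meaning

max_depth
Meaning: maximum depth of a tree, used to control overfitting. The larger max_depth is, the more specific and local the patterns the model learns; tune it with cross-validation.
Default: 6; typical values: 3-10.
Tuning: larger values overfit more easily; smaller values underfit more easily.

learning_rate
Meaning: learning rate, the step size applied when weights are updated at each boosting iteration.
Default: 0.3; typical values: 0.01-0.2.
Tuning: the smaller the value, the slower the training.

n_estimators
Total number of boosting iterations, i.e. the number of decision trees, comparable to the number of training rounds.

objective
Regression: reg:linear (default; named reg:squarederror in current XGBoost releases), reg:logistic.
Binary classification: binary:logistic (returns probabilities), binary:logitraw (returns the raw score before the logistic transformation).
Multiclass classification: multi:softmax with num_class=n (returns classes), multi:softprob with num_class=n (returns probabilities).

tree_method
The tree construction algorithm. hist is a histogram-based approximate greedy algorithm; gpu_hist runs it on the GPU.

subsample
Meaning: row sampling rate, the fraction of the training set randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps against overfitting, but a value that is too small can cause underfitting.
Default: 1; typical values: 0.5-1.
Tuning: lower it to reduce overfitting.

colsample_bytree
Meaning: column sampling rate, the fraction of features randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps against overfitting, but a value that is too small can cause underfitting.
Default: 1; typical values: 0.5-1.
Tuning: lower it to reduce overfitting.

min_child_samples
A LightGBM parameter name (minimum number of samples in a leaf); XGBoost's closest counterpart is min_child_weight.

eval_metric
Several evaluation metrics can be given; in Python, pass them as a list. Available choices:
Regression (default rmse): rmse, root mean squared error; mae, mean absolute error.
Classification (default error; logloss since XGBoost 1.3): auc, area under the ROC curve; error, error rate (binary); merror, error rate (multiclass); logloss, negative log-likelihood (binary); mlogloss, negative log-likelihood (multiclass).

reg_lambda
L2 regularization coefficient.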

4. Visualize feature importance

from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(model_xgb)
       

Baseline V4_xgb, score: 88.946

I. Improving with 10-fold cross-validation

%%time
# Define the 10-fold sub-models
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def xgb_model(clf, train_x, train_y, test):
    sk = StratifiedKFold(n_splits=10, random_state=2025, shuffle=True)
    prob = []
    mean_acc = 0
    for k, (train_index, val_index) in enumerate(sk.split(train_x, train_y)):
        train_x_real = train_x.iloc[train_index]
        train_y_real = train_y.iloc[train_index]
        val_x = train_x.iloc[val_index]
        val_y = train_y.iloc[val_index]
        # Train, then evaluate on the validation fold
        clf = clf.fit(train_x_real, train_y_real)
        val_y_pred = clf.predict(val_x)
        acc_val = accuracy_score(val_y, val_y_pred)
        print('Sub-model {} accuracy: {}'.format(k + 1, acc_val))
        # Accumulate the mean validation accuracy over the 10 folds
        mean_acc += acc_val / 10
        # Predict class probabilities for the test set
        test_y_pred = clf.predict_proba(test)
        prob.append(test_y_pred)
    print(mean_acc)
    # Average the probabilities of the 10 sub-models
    mean_prob = sum(prob) / 10
    return mean_prob

import xgboost as xgb
model_xgb2 = xgb.XGBClassifier(
            max_depth=15, learning_rate=0.005, n_estimators=5300,
            objective='binary:logistic', tree_method='gpu_hist',
            subsample=0.7, colsample_bytree=0.7,
            min_child_samples=3, eval_metric='auc', reg_lambda=0.5
        )
result_xgb = xgb_model(model_xgb2, features, train['label'], test_features)
# Take the probability of class 1 and threshold at 0.5
result_xgb2 = [x[1] for x in result_xgb]
result_xgb2 = [1 if x >= 0.5 else 0 for x in result_xgb2]

res = pd.DataFrame(test['sid'])
res['label'] = result_xgb2
res.to_csv('./baselineV4.csv', index=False)
print('Done')
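
The fold-averaging pattern can be tried without a GPU or the competition data. The sketch below is a reduced stand-in, not the submitted model: it assumes scikit-learn's LogisticRegression and a synthetic dataset from make_classification, uses 5 folds instead of 10, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary data: 500 labelled rows plus 100 rows treated as the test set
X, y = make_classification(n_samples=600, n_features=8, random_state=2025)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=2025)
fold_probs, fold_accs = [], []
for train_idx, val_idx in sk.split(X_train, y_train):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[train_idx], y_train[train_idx])
    # Score this sub-model on its held-out fold
    fold_accs.append(accuracy_score(y_train[val_idx], clf.predict(X_train[val_idx])))
    # Keep this sub-model's class-1 probability for every test row
    fold_probs.append(clf.predict_proba(X_test)[:, 1])

mean_acc = float(np.mean(fold_accs))
mean_prob = np.mean(fold_probs, axis=0)   # average over the sub-models
labels = (mean_prob >= 0.5).astype(int)   # threshold at 0.5
print(mean_acc, labels[:10])
```

Averaging probabilities before thresholding lets confident sub-models outweigh uncertain ones, which a majority vote over hard labels would not do.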
       

Baseline V5_xgb, score: 89.0787

I. Feature engineering improvements

Use the class-frequency ratio of each value to find key values, then build a new indicator feature for them. The new column name is the original column name followed by 1.

# Find key values by class-frequency ratio and build indicator features
def find_key_feature(train, selected):
    temp = pd.DataFrame(columns=[0, 1])
    temp0 = train[train['label'] == 0]
    temp1 = train[train['label'] == 1]
    # Share (in percent) of each value within class 0 and within class 1
    temp[0] = temp0[selected].value_counts() / len(temp0) * 100
    temp[1] = temp1[selected].value_counts() / len(temp1) * 100
    temp[2] = temp[1] / temp[0]
    # Keep values whose share in class 1 is more than 10 times their share in class 0
    result = temp[temp[2] > 10].sort_values(2, ascending=False).index
    return result

selected_cols = ['osv', 'apptype', 'carrier', 'dev_height', 'dev_ppi', 'dev_width', 'media_id',
                 'package', 'version', 'fea_hash', 'location', 'fea1_hash', 'cus_type']
key_feature = {}
for selected in selected_cols:
    key_feature[selected] = find_key_feature(train, selected)
key_feature

def f(x, selected):
    if x in key_feature[selected]:
        return 1
    else:
        return 0

for selected in selected_cols:
    if len(key_feature[selected]) > 0:
        features[selected + '1'] = features[selected].apply(f, args=(selected,))
        test_features[selected + '1'] = test_features[selected].apply(f, args=(selected,))
        print(selected + '1 created')
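
A toy example makes the ratio rule concrete. In the made-up frame below, package b accounts for 5% of the class-0 rows and 60% of the class-1 rows, a ratio of 12, so it passes the 10x threshold and becomes a key value. The helper name find_key_values and the use of value_counts(normalize=True) with isin are assumptions of this sketch, not the code above.

```python
import pandas as pd

# Made-up data: 20 rows per class
toy = pd.DataFrame({
    'label': [0] * 20 + [1] * 20,
    'package': ['a'] * 19 + ['b'] + ['a'] * 8 + ['b'] * 12,
})


def find_key_values(df, col, threshold=10):
    """Values whose share in class 1 exceeds `threshold` times their share in class 0."""
    share0 = df.loc[df['label'] == 0, col].value_counts(normalize=True)
    share1 = df.loc[df['label'] == 1, col].value_counts(normalize=True)
    # Values missing from either class give NaN and are dropped
    ratio = (share1 / share0).dropna()
    return set(ratio[ratio > threshold].index)


key_values = find_key_values(toy, 'package')
toy['package1'] = toy['package'].isin(key_values).astype(int)
print(key_values, toy['package1'].sum())
```

As in find_key_feature, a value that never occurs in class 0 has no defined ratio and is therefore not selected, even though it may be a strong fraud signal.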
   
