How to Build a Complex Machine Learning Project?

Translator | 光城

Editor | 郭芮

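scikit-learn provides state-of-the-art machine learning algorithms. These algorithms, however, cannot be applied directly to raw data, which must be preprocessed first. For that reason, scikit-learn ships a set of preprocessing methods alongside its learning algorithms. It also provides connectors for pipelining these estimators (that is, transformers, regressors, classifiers, clusterers, and so on).

This article walks through the scikit-learn features that let you pipeline estimators, evaluate those pipelines, tune them with hyperparameter optimization, and build complex preprocessing steps.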

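Basic use case: training and testing a classifier

For the first example, we will train and test a classifier on a dataset. We will use this example to recall the scikit-learn API.

We will use the digits dataset, a dataset of handwritten digits.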

# Load the dataset
from sklearn.datasets import load_digits

# return_X_y defaults to False, which returns a Bunch object; with True we get (data, target) directly
X, y = load_digits(return_X_y=True)

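Each row of X holds the intensities of the 64 pixels of one image. For each sample in X, the matching entry of y is the digit that was written.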

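# Plot the first sample as a grayscale image
import matplotlib.pyplot as plt

# Show the image in grayscale
plt.imshow(X[0].reshape(8, 8), cmap='gray')
# Turn off the axes
plt.axis('off')
# Formatted print of the label
print('The digit in the image is {}'.format(y[0]))

Output: The digit in the image is 0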

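In machine learning, we should evaluate a model by training and testing it on different datasets. train_test_split is a utility function that splits the data into two independent datasets. Its stratify parameter forces the class distribution of the training and test sets to match that of the whole dataset.

# Split the data into training and test sets; stratify keeps the class distribution of both sets the same as in the whole dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)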
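The effect of stratify can be checked directly. The following is a minimal sketch, not part of the original tutorial: it compares the per-class proportions of the training and test sets with those of the full digits dataset. The names full_dist, train_dist, test_dist, and max_gap are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Fraction of samples belonging to each digit class (illustrative names)
full_dist = np.bincount(y) / len(y)
train_dist = np.bincount(y_train) / len(y_train)
test_dist = np.bincount(y_test) / len(y_test)

# Largest difference between a subset's class proportions and the full dataset's
max_gap = max(np.abs(train_dist - full_dist).max(),
              np.abs(test_dist - full_dist).max())
print('Largest class-proportion gap: {:.4f}'.format(max_gap))
```

With stratification the gap stays small; removing stratify=y generally lets the proportions drift further apart.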

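Once we have independent training and test sets, we can fit a machine learning model with the fit method. We then evaluate it with the score method, which relies on the default accuracy metric.

# Compute the accuracy score of logistic regression
from sklearn.linear_model import LogisticRegression

# multi_class='auto' follows the original article; recent scikit-learn releases deprecate this argument, so drop it if it triggers a warning
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.95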

The scikit-learn API is consistent across classifiers, so we can easily swap the LogisticRegression classifier for a RandomForestClassifier. The changes are minimal and only concern the creation of the classifier instance.

# Swap LogisticRegression for RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the RandomForestClassifier is 0.96

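Exercise

Complete the following exercises:

Load the breast cancer dataset. Import the function load_breast_cancer from sklearn.datasets:

# %load solutions/01_1_solutions.py

Split the dataset with sklearn.model_selection.train_test_split and keep 30% of it for testing. Make sure to stratify the data (that is, use the stratify parameter) and set random_state to 0:

# %load solutions/01_2_solutions.py

Train a supervised classifier on the training data:

# %load solutions/01_3_solutions.py

Use the fitted classifier to predict the class labels of the test set:

# %load solutions/01_4_solutions.py

Compute the balanced accuracy on the test set. You need to import balanced_accuracy_score from sklearn.metrics:

# %load solutions/01_5_solutions.py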
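One possible way to work through the five steps is sketched below. It is not the content of the solution files referenced above, which are not reproduced in this article. The choice of RandomForestClassifier and the variable names ending in _bc are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: load the breast cancer dataset
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Step 2: keep 30% for testing, stratified, with random_state=0
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.3, stratify=y_bc, random_state=0)

# Step 3: train a supervised classifier (any classifier would do here)
clf_bc = RandomForestClassifier(n_estimators=100, random_state=0)
clf_bc.fit(X_bc_train, y_bc_train)

# Step 4: predict the labels of the test set
y_bc_pred = clf_bc.predict(X_bc_test)

# Step 5: compute the balanced accuracy on the test set
score_bc = balanced_accuracy_score(y_bc_test, y_bc_pred)
print('Balanced accuracy: {:.2f}'.format(score_bc))
```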

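A more advanced use case: preprocessing the data before training and testing a classifier

2.1 Standardizing the data

Preprocessing may be required before a model is learned. For example, a user may want to create hand-crafted features, or an algorithm may make a priori assumptions about the data. In our case, the solver used by LogisticRegression expects the data to be normalized, so we need to standardize the data before training the model. To observe this requirement, we will check the number of iterations needed to train the model.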

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Output:

LogisticRegression required 1841 iterations to be fitted

The MinMaxScaler transformer is used to normalize the data. This scaler should be applied as follows: learn the statistics on the training set (the fit method), then standardize both the training and test sets (the transform method). Finally, we train and test the model on the normalized data.

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

scaler = MinMaxScaler()
# Learn the min/max on the training set only, then apply them to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Output:

Accuracy score of the LogisticRegression is 0.96

LogisticRegression required 190 iterations to be fitted

With normalized data, the model converges much faster than with the raw data: 190 iterations instead of 1841.

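2.2 Wrong preprocessing patterns

We have highlighted how to preprocess the data and train a machine learning model properly. It is also instructive to look at wrong ways of preprocessing data. There are two potential mistakes that are easy to make but also easy to spot.

The first pattern is to standardize the data before splitting the whole dataset into training and test sets.

scaler = MinMaxScaler()
# Wrong: the scaler sees the test samples before the split
X_scaled = scaler.fit_transform(X)
X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(
    X_scaled, y, stratify=y, random_state=42)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train_prescaled)
accuracy = clf.score(X_test_prescaled, y_test_prescaled)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.96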

The second pattern is to standardize the training and test sets independently. It amounts to calling the fit method on both the training and the test set, so the two sets end up standardized differently.

scaler = MinMaxScaler()
X_train_prescaled = scaler.fit_transform(X_train)
# The change is here (transform replaced by fit_transform)
X_test_prescaled = scaler.fit_transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train)
accuracy = clf.score(X_test_prescaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
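To see why refitting the scaler on the test set is a problem, here is a small sketch on made-up numbers (the arrays train and test are hypothetical toy data, not taken from the digits dataset). The same test values land on different scaled values depending on which set the scaler learned its minimum and maximum from.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature toy data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [7.0]])

# Correct: statistics come from the training set (min=0, max=10)
scaler = MinMaxScaler().fit(train)
correct = scaler.transform(test)

# Wrong: statistics are re-learned on the test set (min=5, max=7)
wrong = MinMaxScaler().fit_transform(test)

print('scaled with training statistics:', correct.ravel())
print('scaled with test statistics:    ', wrong.ravel())
```

A model trained on features scaled with the training statistics would receive the value 5.0 as 0.5 in the first case and as 0.0 in the second, so its inputs no longer mean the same thing at test time.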

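2.3 Keep it simple, stupid: use scikit-learn's pipeline connector

The two patterns above are data leakage problems. Such mistakes are hard to prevent when the preprocessing has to be done by hand. For this reason, scikit-learn introduced the Pipeline object, which chains several transformers and a classifier (or regressor) one after another. We can create a pipeline as follows:

from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('scaler', MinMaxScaler()),
                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])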

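We see that this pipeline holds the parameters of both the scaler (normalization) and the classifier. Naming every estimator in a pipeline can be tedious, so make_pipeline names each estimator automatically, using the lowercase class name.

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))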

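The pipeline exposes the same API. We use fit to train the classifier and score to check its accuracy. Calling fit, however, calls fit_transform on every transformer in the pipeline, and calling score (or predict and predict_proba) calls transform on every transformer in the pipeline. This corresponds to the normalization procedure of section 2.1.

pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Output:

Accuracy score of the Pipeline is 0.96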
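The equivalence between the pipeline and the manual procedure of section 2.1 can be verified with a short sketch. This is an illustration rather than part of the original tutorial; the names clf_manual, pipe_demo, pred_manual, pred_pipe, and same are made up for it, and the classifier uses the default solver.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Manual route: fit the scaler on the training set, transform both sets
scaler = MinMaxScaler().fit(X_train)
clf_manual = LogisticRegression(max_iter=1000, random_state=42)
clf_manual.fit(scaler.transform(X_train), y_train)
pred_manual = clf_manual.predict(scaler.transform(X_test))

# Pipeline route: fit/predict handle the scaling steps internally
pipe_demo = make_pipeline(MinMaxScaler(),
                          LogisticRegression(max_iter=1000, random_state=42))
pipe_demo.fit(X_train, y_train)
pred_pipe = pipe_demo.predict(X_test)

# Both routes perform the same computations, so predictions should match
same = np.array_equal(pred_manual, pred_pipe)
print('Identical predictions:', same)
```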

We can inspect all the parameters of the pipeline with get_params.

pipe.get_params()

Output:

{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='auto',
           n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
           tol=0.0001, verbose=0, warm_start=False),
 'logisticregression__C': 1.0,
 ...
 ...}
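The keys returned by get_params follow the pattern step name, two underscores, parameter name. As a small sketch (pipe_demo, step_names, and c_value are illustrative names), the same keys can be passed to set_params to change a nested parameter:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_demo = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Step names generated by make_pipeline: the lowercase class names
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)

# Change a nested parameter using the <step>__<parameter> key
pipe_demo.set_params(logisticregression__C=10)
c_value = pipe_demo.get_params()['logisticregression__C']
print(c_value)
```

The same key syntax is what a parameter grid for a pipeline uses.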

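Exercise

Reuse the breast cancer dataset from the first exercise for training. You can import SGDClassifier from linear_model. Build a pipeline from this classifier and the StandardScaler transformer imported from sklearn.preprocessing, then train and test the pipeline.

# %load solutions/02_solutions.py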
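A possible approach to this exercise is sketched below. It is not the referenced solution file, and the SGDClassifier settings (max_iter, tol, random_state) and the _bc variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.3, stratify=y_bc, random_state=0)

# Scaling and classification chained in a single estimator
pipe_bc = make_pipeline(StandardScaler(),
                        SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))
pipe_bc.fit(X_bc_train, y_bc_train)
acc_bc = pipe_bc.score(X_bc_test, y_bc_test)
print('Accuracy score of the pipeline is {:.2f}'.format(acc_bc))
```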

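When more is better than less: cross-validation instead of a single split

Splitting the data is necessary to evaluate the performance of a statistical model, but it reduces the number of samples available for learning the model. Cross-validation should therefore be used whenever possible. Having several splits also gives information about the stability of the model.

scikit-learn provides three functions: cross_val_score, cross_val_predict, and cross_validate. The last one reports more information: the fit times and the training and test scores. It can also return several scores at once.

from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto',
                                        max_iter=1000, random_state=42))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)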
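As an illustration of returning several scores at once, the sketch below passes a list of metric names to the scoring parameter of cross_validate. The choice of metrics and the names pipe_demo and multi_scores are assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
pipe_demo = make_pipeline(MinMaxScaler(),
                          LogisticRegression(max_iter=1000, random_state=42))

# One entry per metric is added to the result, prefixed with "test_"
multi_scores = cross_validate(pipe_demo, X, y, cv=3,
                              scoring=['accuracy', 'balanced_accuracy'])
for key in sorted(multi_scores):
    print(key, multi_scores[key])
```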

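With the cross-validation function, we can quickly check the training and test scores and plot them with pandas.

import pandas as pd

df_scores = pd.DataFrame(scores)
df_scores

Output: (table with one row per fold and the columns fit_time, score_time, test_score, and train_score; not reproduced here)

# Draw a box plot with pandas
df_scores[['train_score', 'test_score']].boxplot()

Output: (box plot of train_score and test_score; figure not reproduced here)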

Exercise

Take the pipeline from the previous exercise and evaluate it with cross-validation instead of a single split.

# %load solutions/03_solutions.py

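Hyperparameter optimization: fine-tuning the inside of a pipeline

Sometimes we want to find the parameters of the pipeline components that give the best accuracy. We have already seen that the parameters of a pipeline can be inspected with get_params.

Hyperparameters can be optimized with an exhaustive search. GridSearchCV provides such a utility: it runs a cross-validated grid search over a parameter grid.

In the following example, we want to optimize the C and penalty parameters of the LogisticRegression classifier.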

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
# Keys follow the <step name>__<parameter> convention
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

Output:

GridSearchCV(cv=3, error_score='raise-deprecating',
       ...
       scoring=None, verbose=0)

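When the grid-search object is fitted, it finds the best combination of parameters on the training set (using cross-validation). The results of the search are available through the cv_results_ attribute, which lets us inspect the effect of the parameters on the performance of the model.

df_grid = pd.DataFrame(grid.cv_results_)
df_grid

Output: (table of cross-validation results, one row per parameter combination; not reproduced here)

By default, the grid-search object also behaves like an estimator. Once it has been fitted, calling score uses the hyperparameters fixed to the best ones found.

grid.best_params_

Output:

{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}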
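The sketch below illustrates what "behaves like an estimator" means in practice. It is an illustration with a smaller grid than the one above (only C, with the default solver) so that it runs quickly; grid_demo, best_c, refit_c, and same_pred are made-up names. After fitting, best_estimator_ holds a pipeline refitted on the whole training set with the best parameters, and predict on the grid object delegates to it.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe_demo = make_pipeline(MinMaxScaler(),
                          LogisticRegression(max_iter=1000, random_state=42))
grid_demo = GridSearchCV(pipe_demo,
                         param_grid={'logisticregression__C': [0.1, 1.0, 10]},
                         cv=3)
grid_demo.fit(X_train, y_train)

# Best parameter value and the value stored in the refitted pipeline
best_c = grid_demo.best_params_['logisticregression__C']
refit_c = grid_demo.best_estimator_.get_params()['logisticregression__C']

# predict on the grid object uses the refitted best pipeline
same_pred = (grid_demo.predict(X_test)
             == grid_demo.best_estimator_.predict(X_test)).all()
print(best_c, refit_c, same_pred)
print('Mean cross-validated score of the best setting: {:.2f}'.format(
    grid_demo.best_score_))
```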

In addition, the grid search can be called like any other classifier to make predictions.

accuracy = grid.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(grid.__class__.__name__, accuracy))

Output:

Accuracy score of the GridSearchCV is 0.96

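Note, however, that so far we have only run the grid search on a single split. As stated earlier, we may want an outer cross-validation to estimate the performance of the model on different samples of the data and to check how much that performance varies. Since the grid search is an estimator, we can use it directly inside the cross_validate function.

scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores

Output: (table of per-fold fit_time, score_time, test_score, and train_score; not reproduced here)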

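Exercise

Reuse the previous pipeline for the breast cancer dataset and run a grid search to evaluate the difference between the hinge and log losses. In addition, fine-tune the penalty.

# %load solutions/04_solutions.py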

Summary: my scikit-learn pipeline in fewer than 10 lines of code (skipping the import statements)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()

Output: (box plot of train_score and test_score; figure not reproduced here)

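Heterogeneous data: when you work with data other than numbers

So far, we have used scikit-learn to train models on numerical data. Displaying X shows this:

Output:

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

X is a NumPy array containing only floating-point values. A dataset, however, can contain mixed types.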

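import os

# '?' marks missing values in this CSV file
data = pd.read_csv(os.path.join('data', 'titanic_openml.csv'), na_values='?')
data.head()

Output: (first rows of the Titanic table; not reproduced here)

The Titanic dataset contains categorical, text, and numerical features. We will use this dataset to predict whether a passenger survived the Titanic.

Let us split the data into training and test sets and use the survived column as the target.

y = data['survived']
X = data.drop(columns='survived')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)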

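First, we can try a LogisticRegression classifier and see how well it performs.

clf = LogisticRegression()
# This call fails: the data still contains non-numerical columns
clf.fit(X_train, y_train)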

Oops: most classifiers are designed to work with numerical data, so we need to convert the categorical data into numerical features. The simplest way is to one-hot encode each categorical feature with OneHotEncoder. Let us take the sex and embarked columns as an example. Note that some values are missing as well. We will use SimpleImputer to replace the missing values with a constant.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

ohe = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_encoded = ohe.fit_transform(X_train[['sex', 'embarked']])
X_encoded.toarray()

Output:

array([[0., 1., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 0., 1., 0.]])
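To make the columns of such an encoded matrix easier to read, here is a sketch on a tiny made-up table (the array toy is hypothetical, not the Titanic data). It shows which category each output column corresponds to, including the placeholder category that SimpleImputer creates for a missing value.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: columns are (sex, embarked), one value missing
toy = np.array([['male', 'S'],
                ['female', np.nan],
                ['female', 'C']], dtype=object)

ohe_demo = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
encoded = ohe_demo.fit_transform(toy).toarray()

# Categories learned for each input column, in output-column order
cats = ohe_demo.named_steps['onehotencoder'].categories_
for column, categories in zip(['sex', 'embarked'], cats):
    print(column, list(categories))
print(encoded)
```

Each row contains exactly one 1 per original column, so the number of output columns equals the total number of categories across the input columns.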

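In this way, the categorical features can be encoded. We also want to standardize the numerical features, though. We therefore need to split the original data into two subgroups and apply a different preprocessing to each: (i) one-hot encoding for the categorical data; (ii) standard scaling (normalization) for the numerical data. We also need to handle missing values in both cases. For the categorical columns, missing values are replaced by the string 'missing_value' (the default fill value of SimpleImputer for string data), which is then treated as a category of its own. For the numerical data, missing values are replaced by the mean of the feature of interest.

Splitting the columns into categorical and numerical subsets:

col_cat = ['sex', 'embarked']
col_num = ['age', 'sibsp', 'parch', 'fare']

X_train_cat = X_train[col_cat]
X_train_num = X_train[col_num]
X_test_cat = X_test[col_cat]
X_test_num = X_test[col_num]

We should apply the transformations to the training and test sets as we did in section 2.1. One-hot encoding of the categorical data and standard scaling (normalization) of the numerical data:

from sklearn.preprocessing import StandardScaler

scaler_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_train_cat_enc = scaler_cat.fit_transform(X_train_cat)
X_test_cat_enc = scaler_cat.transform(X_test_cat)

scaler_num = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
X_train_num_scaled = scaler_num.fit_transform(X_train_num)
X_test_num_scaled = scaler_num.transform(X_test_num)

Once the transformations are done, all the information is numerical and we can combine it.

import numpy as np
from scipy import sparse

# Stack the sparse one-hot columns and the scaled numerical columns side by side
X_train_scaled = sparse.hstack((X_train_cat_enc,
                                sparse.csr_matrix(X_train_num_scaled)))
X_test_scaled = sparse.hstack((X_test_cat_enc,
                               sparse.csr_matrix(X_test_num_scaled)))

Finally, we use a LogisticRegression classifier as the model.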

clf = LogisticRegression(solver='lbfgs')
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.79

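The pattern above, transforming the data first and then fitting and scoring the classifier, is exactly the pattern of section 2.1. We would therefore like to use a pipeline for this purpose.

However, we also want to process different columns of the matrix differently. The ColumnTransformer transformer or the make_column_transformer function should be used for that. It automatically applies different pipelines to different columns.

from sklearn.compose import make_column_transformer

pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(SimpleImputer(), StandardScaler())
# Each tuple is (transformer, columns); the earliest release with this function used the reverse order
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))

pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Output:

Accuracy score of the Pipeline is 0.79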
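For readers without the Titanic CSV at hand, the column-wise preprocessing can be tried on a small made-up table. This is only a sketch: the DataFrame toy and its values are hypothetical, and sparse_threshold=0 is set so that the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table with one categorical and two numerical columns
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, np.nan, 35.0, 54.0],
    'fare': [7.25, 71.28, 8.05, 51.86],
})

pre = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['sex']),
    (make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()),
     ['age', 'fare']),
    sparse_threshold=0,  # always return a dense array
)
out = pre.fit_transform(toy)

# Columns 0-1: one-hot encoded sex; columns 2-3: imputed and scaled age, fare
print(out.shape)
print(out)
```

The output columns follow the order in which the transformers are listed, which is why the one-hot columns come first.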

Moreover, the column transformer can itself be used inside another pipeline. We can therefore use all the scikit-learn utilities, such as cross_validate or GridSearchCV, with it.

pipe.get_params()

Output:

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
          transformer_weights=None,
          transformers=[('pipeline-1', Pipeline(memory=None,
 ...]}

Putting it together and visualizing the result:

pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(StandardScaler(), SimpleImputer())
# Each tuple is (transformer, columns)
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))
pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))

# 'pipeline-2' is the auto-generated name of the numerical pipeline
param_grid = {'columntransformer__pipeline-2__simpleimputer__strategy': ['mean', 'median'],
              'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, scoring='balanced_accuracy', cv=5, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()

Output: (box plot of train_score and test_score; figure not reproduced here)

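Exercise

Complete the following exercises:

Load the adult dataset located in ./data/adult_openml.csv, build your own ColumnTransformer preprocessor, pipeline it with a classifier, fine-tune it, and check the prediction accuracy within a cross-validation.

Read the adult dataset located in ./data/adult_openml.csv with pd.read_csv:

# %load solutions/05_1_solutions.py

Split the dataset into data and target. The target corresponds to the class column. For the data, drop the columns fnlwgt, capitalgain, and capitalloss:

# %load solutions/05_2_solutions.py

The target is not encoded. Use sklearn.preprocessing.LabelEncoder to encode the classes:

# %load solutions/05_3_solutions.py

Create a list containing the names of the categorical columns, and do the same for the numerical columns:

# %load solutions/05_4_solutions.py

Create a pipeline that one-hot encodes the categorical data. Use KBinsDiscretizer for the numerical data; import it from sklearn.preprocessing:

# %load solutions/05_5_solutions.py

Create the preprocessor with make_column_transformer. Make sure the right pipeline is applied to the right columns:

# %load solutions/05_6_solutions.py

Pipeline the preprocessor with a LogisticRegression classifier. Then define a grid search to find the best parameter C. Train and test this workflow in a cross-validation scheme using cross_validate:

# %load solutions/05_7_solutions.py

This article was translated from: https://github.com/glemaitre/pyparis-2018-sklearn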

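About the translator: 光城, first-year graduate student; blog: http://light-city.me/. Research focus: knowledge graphs; currently working on applying machine learning to KGs.

Notice: This article is a contributed submission; the copyright belongs to the author.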
