scikit-learn: 4.1. Pipeline and FeatureUnion: combining estimators (chaining features with predictors; combining features with features)

The latter, FeatureUnion, amounts to processing features in parallel: the outputs of all transformers are concatenated into one larger feature vector.



3. Pipeline: chaining estimators

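Pipeline can be used to chain multiple estimators into one. This is useful because data processing usually follows a fixed sequence of steps, for example feature selection, normalization, and classification. Pipeline serves two main purposes:

Convenience: you only have to call fit and predict once on your data to fit a whole sequence of estimators.

Joint parameter selection: you can grid search over the parameters of all estimators in the pipeline at once.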




Usage: a Pipeline is built from a list of (key, value) pairs, where the key is an arbitrary name you choose for the step and the value is an estimator object. For example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
The estimators of a pipeline are stored as a list in the steps attribute, so each one can be retrieved by index:
>>> clf.steps[0]
Each estimator can also be retrieved by name, because they are stored as a dict in named_steps:
>>> clf.named_steps['reduce_dim']
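A minimal sketch of both access styles, assuming a recent scikit-learn release. The step names reduce_dim and svm are the ones used in the example above:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# steps is a list of (name, estimator) tuples, so index 0 is the first stage.
first_name, first_estimator = clf.steps[0]

# named_steps maps each step name to the same estimator object.
svm_step = clf.named_steps["svm"]

print(first_name, type(first_estimator).__name__)
print(type(svm_step).__name__)
```

Index access is handy when you iterate over stages in order. Name access is easier to read when you want one specific stage.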
To change a parameter of one of the estimators, use the <estimator>__<parameter> syntax, for example:

>>> clf.set_params(svm__C=10)
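As a small illustration of the double-underscore syntax, the sketch below sets two nested parameters in one call and reads them back. The value n_components=2 is an arbitrary choice for the example, not something taken from the text:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Pattern: step name, double underscore, parameter name.
clf.set_params(svm__C=10, reduce_dim__n_components=2)

# The nested values are visible through get_params() and on the steps themselves.
print(clf.get_params()["svm__C"])
print(clf.named_steps["reduce_dim"].n_components)
```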

The ultimate goal, grid searches:

>>> # GridSearchCV lives in sklearn.model_selection (sklearn.grid_search in releases before 0.18)
>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
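The snippet above only builds the search object. A minimal sketch of actually running it follows, assuming the iris dataset as stand-in data. Because iris has only four features, the n_components grid is changed to [1, 2, 3] here. The dataset, that grid, and cv=3 are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Keys follow the <estimator>__<parameter> pattern, one list of candidates each.
params = {
    "reduce_dim__n_components": [1, 2, 3],
    "svm__C": [0.1, 10, 100],
}

# One fit call searches over the parameters of both steps jointly.
grid_search = GridSearchCV(clf, param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

Since PCA is refit inside every cross-validation fold, the dimensionality reduction never sees the held-out samples, which is one practical reason to search over the pipeline rather than over the classifier alone.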


from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
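The script above stops before any data is loaded or fit is called. A minimal runnable sketch of the same three-step text pipeline follows. The eight-document toy corpus, its labels, the reduced parameter grid, cv=2, and random_state=0 are all hypothetical choices made so the example runs quickly. They are not part of the original script:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = cooking, label 1 = programming.
docs = [
    "boil the pasta and add salt",
    "bake the bread in a hot oven",
    "fry the onions with butter",
    "simmer the soup and stir the pot",
    "compile the code and run the tests",
    "fix the bug in the parser module",
    "write a function and call the function",
    "deploy the server and read the logs",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Two parameters with two candidates each: four combinations in total.
parameters = {
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__alpha": (1e-5, 1e-6),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(docs, labels)

print(grid_search.best_params_)
```

With a corpus this small the selected parameters carry no meaning. The point is only that raw strings go in at one end and the search tunes the vectorizer and the classifier together.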


Calling fit on the pipeline is the same as calling fit on each estimator in turn: each step transforms the input and passes the result on to the next step.
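This equivalence can be checked directly. The sketch below fits a two-step pipeline and then repeats the same steps by hand, assuming the iris dataset and n_components=2 purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pipeline version: one call fits every step.
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("svm", SVC())])
pipe.fit(X, y)

# Manual version: fit and transform with the first step, then fit the last one
# on the transformed data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
svm = SVC()
svm.fit(X_reduced, y)

# At prediction time the pipeline transforms with PCA, then predicts with the SVM.
same = np.array_equal(pipe.predict(X), svm.predict(pca.transform(X)))
print(same)
```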

The pipeline has all the methods that the last estimator in the pipeline has. That is, if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, then so is the pipeline.
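A quick way to see this is to check which methods each pipeline exposes. The sketch below assumes a recent scikit-learn release, where these methods are only available when the final step provides them. The step names and k=2 are arbitrary:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Ends in a classifier: the pipeline behaves like a classifier.
classifier_pipe = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Ends in a transformer: the pipeline behaves like a transformer.
transformer_pipe = Pipeline([("select", SelectKBest(k=2)), ("pca", PCA())])

print(hasattr(classifier_pipe, "predict"), hasattr(classifier_pipe, "transform"))
print(hasattr(transformer_pipe, "predict"), hasattr(transformer_pipe, "transform"))
```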

4. FeatureUnion: composite feature spaces


FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
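The concatenation can be verified by comparing a FeatureUnion with a manual column stack of the same transformers. The sketch assumes the iris dataset, with PCA(n_components=2) and SelectKBest(k=1) as example transformers:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])
X_union = union.fit_transform(X, y)

# Fitting each transformer separately and stacking the columns side by side
# gives the same matrix: 2 PCA columns followed by 1 selected feature.
X_manual = np.hstack([
    PCA(n_components=2).fit_transform(X),
    SelectKBest(k=1).fit_transform(X, y),
])

print(X_union.shape)
print(np.allclose(X_union, X_manual))
```

The column order follows the order of the transformer list, so the PCA components come first here.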


FeatureUnion serves the same purposes as Pipeline, namely convenience and joint parameter estimation, and the two can be combined to build more complex models.





Usage: a FeatureUnion is built from a list of (key, value) pairs, where the key is an arbitrary name you choose for the transformer and the value is an estimator object. For example:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)




# Author: Andreas Mueller
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
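The script above ends once the search object is built. As a sketch of the remaining step, the version below runs the search and reads back the result. Nested names chain through both containers, so features__pca__n_components reaches the PCA inside the FeatureUnion inside the Pipeline. The candidate values and cv=3 are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])
pipeline = Pipeline([
    ("features", combined_features),
    ("svm", SVC(kernel="linear")),
])

# <pipeline step>__<union member>__<parameter> for the nested transformers.
param_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2],
    "svm__C": [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_estimator_)
```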



