scikit-learn: 4.1. Pipeline and FeatureUnion: combining estimators (combining features with predictors; combining features with features)

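Writing this while sick, in an internet cafe... a little encouragement would be nice.

http://scikit-learn.org/stable/modules/pipeline.html

1. What Pipeline and FeatureUnion are for:

Pipeline was introduced earlier: it chains transformers together with a final estimator.

FeatureUnion does what its name suggests: it concatenates the output vectors of several transformers into one larger vector.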


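2. The difference between the two:

Pipeline processes features serially: each transformer works on the features produced by the transformer before it.

FeatureUnion processes features in parallel: every transformer works on the same input, and all of their outputs are concatenated into one larger feature vector.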
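The serial versus parallel distinction can be sketched in a few lines of plain Python. This is only a toy illustration, not scikit-learn code: the functions `scale`, `square`, `serial` and `parallel` are hypothetical names made up here, and plain functions stand in for fitted transformers.

```python
# Toy illustration only: plain functions stand in for fitted transformers.

def scale(row):
    """Hypothetical transformer: divide every feature by the row maximum."""
    top = max(row)
    return [value / top for value in row]


def square(row):
    """Hypothetical transformer: square every feature."""
    return [value ** 2 for value in row]


def serial(row, transformers):
    """Pipeline-style: each transformer consumes the previous one's output."""
    for transform in transformers:
        row = transform(row)
    return row


def parallel(row, transformers):
    """FeatureUnion-style: every transformer sees the original row, and the
    outputs are concatenated end-to-end."""
    combined = []
    for transform in transformers:
        combined.extend(transform(row))
    return combined


row = [1.0, 2.0, 4.0]
serial_result = serial(row, [scale, square])      # square(scale(row))
parallel_result = parallel(row, [scale, square])  # scale(row) + square(row)
print(len(serial_result), serial_result)
print(len(parallel_result), parallel_result)
```

The serial result keeps the original number of features, while the parallel result has one block of features per transformer.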


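3. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful because data processing usually follows a fixed sequence of steps, for example feature selection, normalization, and classification. Pipeline therefore serves two main purposes:

Convenience: a single call to fit and a single call to predict run the whole sequence of estimators.

Joint parameter selection: a single grid search can cover the parameters of all estimators in the pipeline.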


Every estimator in a pipeline except the last one must be a transformer (that is, it must have a transform method). The last estimator can be of any type (transformer, classifier, and so on).
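Why only the last step may lack a transform method becomes clearer with a toy pipeline. The sketch below is an assumption-laden simplification, not scikit-learn's implementation: `Center`, `SignClassifier` and `MiniPipeline` are hypothetical classes written for this illustration.

```python
class Center:
    """Hypothetical transformer: subtracts the per-column mean learned in fit."""

    def fit(self, X, y=None):
        n_samples = len(X)
        self.means_ = [sum(column) / n_samples for column in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class SignClassifier:
    """Hypothetical final estimator: predicts 1 when the first feature is
    positive, else 0. It has predict but no transform method."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if row[0] > 0 else 0 for row in X]


class MiniPipeline:
    """Toy pipeline over a list of (name, estimator) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Every step but the last is fitted and must then transform the
        # data, because the next step is trained on its output.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The last step is only fitted, so it needs no transform method.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


X = [[1.0], [2.0], [6.0]]
y = [0, 0, 1]
model = MiniPipeline([("center", Center()), ("clf", SignClassifier())]).fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Swapping the two steps would fail inside fit, since the classifier cannot hand transformed data to the step after it.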

Usage: a Pipeline is built from a list of (key, value) pairs, where the key is a name of your own choosing for the step and the value is an estimator object. For example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])

The estimators of each stage are stored in the steps attribute, and each one can be retrieved by index:

>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

Each estimator can also be retrieved by name, since named_steps exposes the steps as a dict:

>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)

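Want to change a parameter of one of the estimators? Use the <estimator>__<parameter> syntax, for example:

>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])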
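How a name such as svm__C reaches the right step can be sketched as follows. This is a simplified guess at the idea, not the library's code: `Step` and `set_nested_params` are hypothetical helpers invented here, and unlike the real thing this toy version does not recurse into nested names.

```python
class Step:
    """Hypothetical estimator that stores its parameters as attributes."""

    def __init__(self, **params):
        self.__dict__.update(params)


def set_nested_params(steps, **params):
    """Route each '<step>__<parameter>' key to the step with that name."""
    by_name = dict(steps)
    for key, value in params.items():
        # Split on the first double underscore only.
        step_name, _, param_name = key.partition("__")
        if not param_name:
            raise ValueError("expected '<step>__<parameter>', got %r" % key)
        if step_name not in by_name:
            raise ValueError("unknown step %r" % step_name)
        setattr(by_name[step_name], param_name, value)


steps = [("reduce_dim", Step(n_components=None)), ("svm", Step(C=1.0))]
set_nested_params(steps, svm__C=10, reduce_dim__n_components=2)
print(steps[1][1].C, steps[0][1].n_components)
```

The step names chosen when building the pipeline are therefore part of its parameter names, which is why they should be short and free of double underscores.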

The ultimate goal, grid search:

>>> # Later scikit-learn releases moved GridSearchCV to sklearn.model_selection.
>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)

And now the classic case, text classification:

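from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])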

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
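The remark in the comments about processing time growing "in a combinatorial way" can be checked by enumerating the grid by hand. The sketch below uses only the standard library. `expand_grid` is a hypothetical helper written for this post, not a scikit-learn function, and the grid is the uncommented part of the parameters dict above.

```python
from itertools import product

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(parameters)
print(len(candidates))  # 3 * 2 * 2 * 2 combinations

# Uncommenting one more parameter multiplies the total by its number of values.
bigger = dict(parameters)
bigger['vect__max_features'] = (None, 5000, 10000, 50000)
print(len(expand_grid(bigger)))
```

Every one of these candidates is then fitted once per cross-validation fold, so each extra parameter multiplies the total work.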

Notes (the important parts are left in the original English rather than paraphrased):

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step.

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.

4. FeatureUnion: composite feature spaces

A description of FeatureUnion, again left in the original English:

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

Like Pipeline, FeatureUnion exists for convenience and for joint parameter selection, and the two can be combined to build more complex models.

(FeatureUnion does not check whether two transformers produce identical features. It simply concatenates all the features, so removing duplicates is still up to you...)

Usage: a FeatureUnion is built from a list of (key, value) pairs, where the key is a name of your own choosing for the transformer and the value is an estimator object. For example:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
    n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0,
    coef0=1, degree=3, eigen_solver='auto', fit_inverse_transform=False,
    gamma=None, kernel='linear', kernel_params=None, max_iter=None,
    n_components=None, remove_zero_eig=False, tol=0))],
    transformer_weights=None)

Finally, an example:

http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

With thanks to the author:

# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)

The end. Looks like feature extraction is going to take a lot less effort from now on...
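As a footnote to the remark in section 4 that FeatureUnion only concatenates and never removes duplicate features, here is a toy sketch of what that means and of one possible clean-up step. Nothing here is scikit-learn code: `first_two`, `first_only`, `union` and `drop_duplicate_columns` are hypothetical helpers, and dropping exactly identical columns is just one assumption about what "removing duplicates" could look like.

```python
def first_two(row):
    """Hypothetical transformer: keep the first two features."""
    return row[:2]


def first_only(row):
    """Hypothetical transformer: keep only the first feature."""
    return row[:1]


def union(X, transformers):
    """Concatenate every transformer's output for each sample, as-is."""
    return [[value for t in transformers for value in t(row)] for row in X]


def drop_duplicate_columns(X):
    """Keep the first occurrence of each distinct column."""
    seen = set()
    keep = []
    for index, column in enumerate(zip(*X)):
        if column not in seen:
            seen.add(column)
            keep.append(index)
    return [[row[i] for i in keep] for row in X]


X = [[1.0, 5.0, 9.0], [2.0, 6.0, 9.5]]
stacked = union(X, [first_two, first_only])     # first feature appears twice
deduplicated = drop_duplicate_columns(stacked)  # repeated column removed
print(stacked)
print(deduplicated)
```

Both transformers select the first feature, so the stacked output carries it twice until the clean-up step removes the repeat.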
