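2. The difference between the two:
The former (Pipeline) processes features serially: each transformer operates on the features produced by the previous transformer.
The latter (FeatureUnion) processes features in parallel: the outputs of all transformers are concatenated into one large feature vector.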
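The serial/parallel distinction can be seen in the output shapes. The following is a minimal sketch, assuming the iris dataset and an arbitrary choice of transformers (StandardScaler, PCA, SelectKBest); none of these choices come from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Serial: PCA receives the output of StandardScaler.
serial = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
X_serial = serial.fit_transform(X, y)

# Parallel: PCA and SelectKBest each see the original X,
# and their outputs are concatenated column-wise.
parallel = FeatureUnion([("pca", PCA(n_components=2)), ("best", SelectKBest(k=1))])
X_parallel = parallel.fit_transform(X, y)

print(X_serial.shape)    # (150, 2): only the last step's output survives
print(X_parallel.shape)  # (150, 3): 2 PCA columns + 1 selected column
```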
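3. Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. This is useful because data processing usually follows a fixed sequence of steps, for example feature selection, normalization, and classification. Pipeline serves two main purposes: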
Convenience: you only need to call fit and predict once to run the whole sequence of estimators.
Joint parameter selection: you can grid search over the parameters of all estimators in the pipeline at once.
All estimators in a pipeline except the last one must be transformers (that is, they must have a transform method). The last estimator may be of any type (transformer, classifier, and so on).
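As a rough illustration of this rule, the sketch below checks which estimators expose a transform method and then fits a two-step pipeline with a single call. The iris dataset and n_components=2 are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# PCA has a transform method, so it can be an intermediate step.
# SVC has no transform method, so it can only be the final step.
print(hasattr(PCA(), "transform"))  # True
print(hasattr(SVC(), "transform"))  # False

clf = Pipeline([("reduce_dim", PCA(n_components=2)), ("svm", SVC())])
clf.fit(X, y)          # fits PCA, transforms X, then fits SVC on the result
pred = clf.predict(X)  # transforms X with the fitted PCA, then predicts with SVC
print(pred.shape)      # (150,)
```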
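Usage: build the pipeline from a list of (key, value) pairs, where key is any name you choose for the step and value is an estimator object, for example:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
The estimators of each stage are stored in the steps attribute and can be retrieved by index: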
>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
An estimator can also be retrieved by name (as a dict in named_steps):
>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)
To change the value of an estimator's parameter, use the <estimator>__<parameter> syntax, for example:
>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm', SVC(C=10, ..., verbose=False))])
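If you are unsure which `<estimator>__<parameter>` names are valid, get_params() lists them. A small sketch, assuming the same step names ('reduce_dim', 'svm') used above:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Every nested parameter appears as <step name>__<parameter name>.
nested = sorted(name for name in clf.get_params() if "__" in name)
print(nested)

# Several nested parameters can be set in one call.
clf.set_params(svm__C=10, reduce_dim__n_components=2)
print(clf.named_steps["svm"].C)                    # 10
print(clf.named_steps["reduce_dim"].n_components)  # 2
```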
The ultimate goal, grid searches:
>>> # GridSearchCV lived in sklearn.grid_search before scikit-learn 0.18
>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
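The grid search above is only constructed, not fitted. The sketch below runs it end to end. The digits dataset and cv=3 are assumptions for the example; digits is chosen only because it has 64 features, so n_components=10 is valid.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 64 features per sample

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])
params = dict(reduce_dim__n_components=[2, 5, 10], svm__C=[0.1, 10, 100])

# One search covers the parameters of both steps.
grid_search = GridSearchCV(clf, param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)  # one value per searched parameter
print(grid_search.best_score_)   # mean cross-validated accuracy of the best setting
```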
And now the classic text classification example:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
Notes (quoted from the scikit-learn documentation):
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step.
The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
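The first note can be checked by hand. The following sketch fits a pipeline and then repeats the same steps manually. The dataset (iris) and the three steps are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pipeline version: one fit call.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svm", SVC()),
])
pipe.fit(X, y)

# Manual version: fit each estimator in turn, passing the
# transformed data on to the next step.
scaler = StandardScaler()
pca = PCA(n_components=2)
svm = SVC()
X_scaled = scaler.fit_transform(X)
X_reduced = pca.fit_transform(X_scaled)
svm.fit(X_reduced, y)

# Prediction applies transform at every intermediate step, then predict.
manual_pred = svm.predict(pca.transform(scaler.transform(X)))
same = np.array_equal(pipe.predict(X), manual_pred)
print(same)
```

Because the last step is a classifier, the pipeline itself exposes predict, which is what the second note describes.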
4. FeatureUnion: composite feature spaces
Description of FeatureUnion (quoted from the scikit-learn documentation):
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
FeatureUnion and Pipeline serve the same purposes, convenience and joint parameter selection, and the two can be combined to build more complex models.
(FeatureUnion does not check whether two transformers produce identical features. It simply concatenates all of them, so removing duplicates is up to you.)
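The lack of duplicate checking is easy to demonstrate. This sketch, assuming the iris dataset and two deliberately identical PCA transformers (the names pca_a and pca_b are made up for the example), shows the same columns appearing twice in the output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

# Two transformers that produce exactly the same features.
union = FeatureUnion([
    ("pca_a", PCA(n_components=2)),
    ("pca_b", PCA(n_components=2)),
])
X_union = union.fit_transform(X)

print(X_union.shape)  # (150, 4): both outputs are kept

# The first two columns are repeated as the last two.
duplicated = np.allclose(X_union[:, :2], X_union[:, 2:])
print(duplicated)     # True
```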
Usage: build the FeatureUnion from a list of (key, value) pairs, where key is any name you choose for the step and value is an estimator object, for example:
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, n_components=None, whiten=False)),
                         ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, ..., eigen_solver='auto',
                          fit_inverse_transform=False, gamma=None, kernel='linear',
                          kernel_params=None, max_iter=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)
Finally, an example:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
Thanks to the author of the example, Andreas Mueller <amueller@ais.uni-bonn.de>:
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
# GridSearchCV lived in sklearn.grid_search before scikit-learn 0.18
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
That's all. It looks like feature extraction will take a lot less effort from now on.
Original article: https://www.f2er.com/javaschema/284617.html