我在熊猫里有一张桌子:
import pandas as pd df = pd.DataFrame({ 'LeafID':[1,1,2,3,6,5,1],'pidx':[10,10,300,30,40,20,45,20],'pidy':[20,400,15,12,43,54,112,23],'count':[10,80,50,70],'score':[10,22,4,9,1] }) LeafID count pidx pidy score 0 1 10 10 20 10 1 1 20 10 20 10 2 2 30 300 400 10 3 1 40 10 20 22 4 3 80 30 15 22 5 3 10 40 20 3 6 1 20 20 12 4 7 6 50 10 43 5 8 3 30 20 54 9 9 5 10 45 112 0 10 1 70 20 23 1
我想做一个groupby,然后过滤pidx大于2的行.
也就是说,过滤pidx为10和20的行.
我尝试使用df.groupby(‘pidx’).count()但它没有帮助我.同样对于那些行,我必须做0.4 *计数0.6 *得分.
期望的输出是:
LeafID count pidx pidy final_score 1 10 10 20 1 20 10 20 1 40 10 20 6 50 10 43 1 20 20 12 3 30 20 54 1 70 20 23
解决方法
您可以在
boolean indexing
和
isin
中使用
value_counts
:
df = pd.DataFrame({ 'LeafID':[1,1] }) print (df) LeafID count pidx pidy score 0 1 10 10 20 10 1 1 20 10 20 10 2 2 30 300 400 10 3 1 40 10 20 22 4 3 80 30 15 22 5 3 10 40 20 3 6 1 20 20 12 4 7 6 50 10 43 5 8 3 30 30 54 9 9 5 10 45 112 0 10 1 70 20 23 1 s = df.pidx.value_counts() idx = s[s>2].index print (df[df.pidx.isin(idx)]) LeafID count pidx pidy score 0 1 10 10 20 10 1 1 20 10 20 10 3 1 40 10 20 22 7 6 50 10 43 5
时序:
np.random.seed(123) N = 1000000 L1 = list('abcdefghijklmnopqrstu') L2 = list('efghijklmnopqrstuvwxyz') df = pd.DataFrame({'LeafId':np.random.randint(1000,size=N),'pidx': np.random.randint(10000,'pidy': np.random.choice(L2,N),'count':np.random.randint(1000,size=N)}) print (df) print (df.groupby('pidx').filter(lambda x: len(x) > 120)) def jez(df): s = df.pidx.value_counts() return df[df.pidx.isin(s[s>120].index)] print (jez(df)) In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120)) 1 loop,best of 3: 1.17 s per loop In [56]: %timeit (jez(df)) 10 loops,best of 3: 141 ms per loop In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120]) 10 loops,best of 3: 102 ms per loop In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120]) 1 loop,best of 3: 685 ms per loop In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120]) 10 loops,best of 3: 104 ms per loop
对于final_score,您可以使用:
df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))