I have the following pandas DataFrame:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})

    >>> df
        first_column
    0              0
    1              0
    2              0
    3              1
    4              1
    5              1
    6              0
    7              0
    8              1
    9              1
    10             0
    11             0
    12             0
    13             0
    14             1
    15             1
    16             1
    17             1
    18             1
    19             0
    20             0
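first_column is a binary column of 0s and 1s. It contains "clusters" of consecutive 1s, and each cluster is always at least two rows long.
My goal is to create a column that "counts" the number of rows in each cluster of 1s: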
    >>> df
        first_column  counts
    0              0       0
    1              0       0
    2              0       0
    3              1       3
    4              1       3
    5              1       3
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
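This sounds like a job for df.loc, e.g. df.loc[df.first_column == 1] ... something.
I'm just not sure how to treat each individual "cluster" of 1s, or how to label each distinct cluster with its number of rows.
How would one do this?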
Solution
Here's one approach using NumPy's cumsum and bincount:
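    def cumsum_bincount(a):
        # Prepend a 0 and look for a [0, 1] pattern, which marks the start of
        # each run of 1s. The cumulative sum of those starts gives every run
        # its own ID; multiplying by a resets the 0s to ID 0.
        ids = a * (np.diff(np.r_[0, a]) == 1).cumsum()

        # Get the bincount (size of each ID), index into it with ids to map
        # each row to its run size, and finally mask out the 0s
        return a * np.bincount(ids)[ids]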
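To see why this works, it can help to trace the intermediate arrays on a short input. The array below is made up for illustration and does not come from the question; the variable names starts and sizes are likewise only for this sketch.

```python
import numpy as np

# Hypothetical small input: one run of two 1s, then one run of three 1s
a = np.array([0, 1, 1, 0, 1, 1, 1, 0])

# True wherever a run of 1s begins (a 0 -> 1 step); the prepended 0
# makes a run that starts at index 0 count as a start too
starts = np.diff(np.r_[0, a]) == 1

# Running total of run starts numbers the runs 1, 2, ...;
# multiplying by a sends every 0 position back to ID 0
ids = a * starts.cumsum()
print(ids)     # [0 1 1 0 2 2 2 0]

# sizes[k] is the number of positions with ID k (slot 0 collects the 0s)
sizes = np.bincount(ids)
print(sizes)   # [3 2 3]

# Map every position to the size of its run, then mask out the 0s
counts = a * sizes[ids]
print(counts)  # [0 2 2 0 3 3 3 0]
```

The final multiplication by a matters because ID 0 also has a size (the total number of 0s), which would otherwise leak into the result.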
Sample run:
    In [88]: df['counts'] = cumsum_bincount(df.first_column.values)

    In [89]: df
    Out[89]:
        first_column  counts
    0              0       0
    1              0       0
    2              0       0
    3              1       3
    4              1       3
    5              1       3
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
Make the first six elements 1s (indices 0 through 4 are set to 1; index 5 already is 1) and test again:
    In [101]: df.first_column.values[:5] = 1

    In [102]: df['counts'] = cumsum_bincount(df.first_column.values)

    In [103]: df
    Out[103]:
        first_column  counts
    0              1       6
    1              1       6
    2              1       6
    3              1       6
    4              1       6
    5              1       6
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
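As an aside that is not part of the answer above, the same labels can likely be produced in pandas alone by grouping on a run ID. This is only a sketch under the assumption that the column holds nothing but 0s and 1s; the names col and run_id are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
                                    0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})

col = df["first_column"]

# A new run starts wherever the value differs from the previous row,
# so the cumulative sum of those change points labels each run
run_id = (col != col.shift()).cumsum()

# Summing within a run gives its length for runs of 1s and 0 for runs
# of 0s (this relies on the values being only 0 and 1)
df["counts"] = col.groupby(run_id).transform("sum")
```

This tends to be slower than the NumPy version on large frames, but it keeps everything as index-aligned Series.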