I'm trying to understand the pandas.DataFrame.groupby() function, and I came across this example in the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...:

In [2]: df
Out[2]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
Without exploring any further, I did this:
print(df.groupby('B').head())
It outputs the same DataFrame, but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
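What does this mean? On a normal DataFrame, printing .head() outputs only the first 5 rows, so what is happening here?
Also, why does printing .head() give the same output as the DataFrame itself? Shouldn't it be grouped by the elements of column 'B'?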
Solution
When you use
df.groupby('A')
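you get a groupby object. You haven't applied any function to it yet. Under the hood, although this definition may not be perfect, you can think of a groupby object as:
> an iterator of (group, DataFrame) pairs, for DataFrames, or
> an iterator of (group, Series) pairs, for Series.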
To illustrate:
from pandas import DataFrame

df = DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')

# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
    print(i)

(1,    A  B
0  1  1
1  1  2)
(2,    A  B
2  2  3
3  2  4)

# this version uses multiple counters
# in a single loop. each `group` is a group, each
# `frame` is its corresponding DataFrame
# (named `frame` so that `df` is not overwritten)
for group, frame in grouped:
    print('group of A:', group, '\n')
    print(frame, '\n')

group of A: 1

   A  B
0  1  1
1  1  2

group of A: 2

   A  B
2  2  3
3  2  4

# and if you just wanted to visualize the groups,
# your second counter is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')

group of A: 1

group of A: 2
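Iterating is not the only way to look inside a groupby object. As a minimal sketch, assuming the same small frame as above (the variable names `n_groups`, `group_one`, and `totals` are only illustrative), you can count the groups, pull out a single group, or apply an aggregation to every group:

```python
import pandas as pd

# Same small frame as in the example above
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')

# Nothing has been computed per group yet; the object only knows
# how the rows are split up
n_groups = grouped.ngroups
print(n_groups)

# Pull out the sub-DataFrame for the group where A == 1
group_one = grouped.get_group(1)
print(group_one)

# Applying a function is what turns the groupby object into a result:
# here, the sum of column B within each group
totals = grouped['B'].sum()
print(totals)
```

Only the last step actually reduces each group to a value; until a function like this is applied, printing the groupby object just shows its repr.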
As for .head(): on a groupby object, the pandas documentation describes it as essentially equivalent to
.apply(lambda x: x.head(n))
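So here you are actually applying a function to each group of the groupby object. Keep in mind that .head(5) is applied to each group (each DataFrame), and because every group has 5 rows or fewer, you get your original DataFrame back.
Consider the example above. If you use .head(1), you get only the first row of each group: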
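To check this against the question's data, here is a small sketch. Columns A and B are copied from the question's example; column C is a made-up deterministic stand-in for the random columns, used only so the result is reproducible.

```python
import pandas as pd

# A and B come from the question; C is a hypothetical stand-in
# for the random columns so the output is reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': range(8),
})

grouped = df.groupby('B')

# Number of rows in each group of B; no group has more than 5 rows
sizes = grouped.size()
print(sizes)

# head() defaults to n=5 and is applied per group, so every row
# survives, with the original index and row order
first_five = grouped.head()
print(first_five.equals(df))  # True
```

Because the largest group of B has only 3 rows, taking up to 5 rows from each group keeps everything, which is why the printed result looks like the ungrouped DataFrame.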
print(df.groupby('A').head(1))

   A  B
0  1  1
2  2  3
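The effect of n can be seen by varying it on the same small frame. This is a minimal sketch; the names `first_rows` and `both_rows` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')

# n=1: one row per group, still labelled with the original index
first_rows = grouped.head(1)
print(list(first_rows.index))  # [0, 2]

# n=2: each group has exactly 2 rows, so nothing is dropped
both_rows = grouped.head(2)
print(both_rows.equals(df))  # True
```

Once n is at least as large as the biggest group, the result is the whole DataFrame, which is the situation in the question.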