The DataFrame
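    import pandas as pd

    # Values match the 20-row table shown under "What it looks like".
    df = pd.DataFrame({
        'A': ([['gener']] * 2 + [['system']] * 2 + [['gutter']] * 6
              + [['aluminum']] * 9 + [['aluminum', 'toledo']]),
        'B': [
            ['gutter'], ['gutter'],
            ['gutter', 'system'], ['gutter', 'guard', 'system'],
            ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'],
            ['gutter'], ['gutter'], ['gutter'],
            ['how', 'to', 'instal', 'aluminum', 'gutter'],
            ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'],
            ['aluminum', 'gutter'], ['aluminum', 'adrian', 'ohio'],
            ['aluminum', 'bowl', 'green', 'ohio'], ['aluminum', 'maume', 'ohio'],
            ['aluminum', 'perrysburg', 'ohio'], ['aluminum', 'tecumseh', 'ohio'],
            ['aluminum', 'toledo', 'ohio'],
        ],
    }, columns=['A', 'B'])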
What it looks like
I have a DataFrame with two columns of lists.
        A                  B
    0   [gener]            [gutter]
    1   [gener]            [gutter]
    2   [system]           [gutter,system]
    3   [system]           [gutter,guard,system]
    4   [gutter]           [ohio,gutter]
    5   [gutter]           [gutter,toledo]
    6   [gutter]           [toledo,gutter]
    7   [gutter]           [gutter]
    8   [gutter]           [gutter]
    9   [gutter]           [gutter]
    10  [aluminum]         [how,to,instal,aluminum,gutter]
    11  [aluminum]         [aluminum,gutter]
    12  [aluminum]         [aluminum,gutter,color]
    13  [aluminum]         [aluminum,gutter]
    14  [aluminum]         [aluminum,adrian,ohio]
    15  [aluminum]         [aluminum,bowl,green,ohio]
    16  [aluminum]         [aluminum,maume,ohio]
    17  [aluminum]         [aluminum,perrysburg,ohio]
    18  [aluminum]         [aluminum,tecumseh,ohio]
    19  [aluminum,toledo]  [aluminum,toledo,ohio]
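Question
If I have columns of lists, is there a pandas function that lets me operate on the whole array of lists to check for intersections and return either a boolean or the intersecting values as a new Series?
For example, I would like pandas to offer something equivalent to: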
    def intersection(df, col1, col2, return_type='boolean'):
        df = df[[col1, col2]]
        s = []
        if return_type == 'boolean':
            # True when any element of col2 also appears in col1
            for _, row in df.iterrows():
                s.append(any(phrase in row[col1] for phrase in row[col2]))
        elif return_type == 'word':
            # Comma-joined words shared by both lists
            for _, row in df.iterrows():
                s.append(','.join(set(row[col1]).intersection(row[col2])))
        else:
            raise ValueError("return_type must be 'boolean' or 'word'")
        # Reuse the frame's index so the result aligns on assignment
        return pd.Series(s, index=df.index)

    # Create column C in df
    df['C'] = intersection(df, 'A', 'B', 'word')
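...without writing my own function or resorting to loops. I feel there must be a simpler way to compare the lists in two columns of the same row and see whether they intersect.
I can do it with a for loop, but that looks ugly to me.
A for loop that returns the boolean values: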
    >>> for _, row in df.iterrows():
    ...     any(phrase in row['A'] for phrase in row['B'])
    ...
Produces:
    False
    False
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
    True
Or, using sets to find the intersecting words:
    >>> for _, row in df.iterrows():
    ...     ','.join(set(row['A']).intersection(row['B']))
    ...
    ''
    ''
    'system'
    'system'
    'gutter'
    'gutter'
    'gutter'
    'gutter'
    'gutter'
    'gutter'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'aluminum'
    'toledo,aluminum'
Solution
To check whether every item in df.A is contained in df.B:
    >>> # OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
    >>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
    0     False
    1     False
    2      True
    3      True
    4      True
    5      True
    6      True
    7      True
    8      True
    9      True
    10     True
    11     True
    12     True
    13     True
    14     True
    15     True
    16     True
    17     True
    18     True
    19     True
    dtype: bool
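Note that the check above tests whether A is a subset of B, which is stricter than asking whether the two lists share at least one element. A minimal sketch of the overlap test follows, assuming a small made-up frame; the three rows and the names `overlaps` and `subset` are illustrative and not part of the question.

```python
import pandas as pd

# Small illustrative frame; these rows are made up for the sketch.
df = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'ohio']],
})

# True when the two lists in a row share at least one element.
overlaps = pd.Series(
    [not set(a).isdisjoint(b) for a, b in zip(df['A'], df['B'])],
    index=df.index,
)

# Subset check from the answer, shown for comparison.
subset = df.apply(lambda row: all(i in row['B'] for i in row['A']), axis=1)

print(overlaps.tolist())  # [False, True, True]
print(subset.tolist())    # [False, True, False]
```

The last row is where the two differ: `aluminum` is shared, but `toledo` is missing from B, so the lists overlap without A being a subset of B.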
To get the intersection:
    >>> df['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df.A, df.B)]
    >>> df
        A                  B                                intersection
    0   [gener]            [gutter]                         []
    1   [gener]            [gutter]                         []
    2   [system]           [gutter,system]                  [system]
    3   [system]           [gutter,guard,system]            [system]
    4   [gutter]           [ohio,gutter]                    [gutter]
    5   [gutter]           [gutter,toledo]                  [gutter]
    6   [gutter]           [toledo,gutter]                  [gutter]
    7   [gutter]           [gutter]                         [gutter]
    8   [gutter]           [gutter]                         [gutter]
    9   [gutter]           [gutter]                         [gutter]
    10  [aluminum]         [how,to,instal,aluminum,gutter]  [aluminum]
    11  [aluminum]         [aluminum,gutter]                [aluminum]
    12  [aluminum]         [aluminum,gutter,color]          [aluminum]
    13  [aluminum]         [aluminum,gutter]                [aluminum]
    14  [aluminum]         [aluminum,adrian,ohio]           [aluminum]
    15  [aluminum]         [aluminum,bowl,green,ohio]       [aluminum]
    16  [aluminum]         [aluminum,maume,ohio]            [aluminum]
    17  [aluminum]         [aluminum,perrysburg,ohio]       [aluminum]
    18  [aluminum]         [aluminum,tecumseh,ohio]         [aluminum]
    19  [aluminum,toledo]  [aluminum,toledo,ohio]           [aluminum,toledo]
The order of elements inside each intersection list is not guaranteed, because sets are unordered.
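The question also asked for the shared words as a comma-joined string rather than a list. A minimal sketch of that variant, again on a small made-up frame (the rows and the column name `C` are illustrative), sorts each intersection so the joined string does not depend on set ordering:

```python
import pandas as pd

# Small illustrative frame; these rows are made up for the sketch.
df = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'toledo', 'ohio']],
})

# Intersect each pair of lists, then sort before joining so the
# resulting string is deterministic.
df['C'] = [','.join(sorted(set(a) & set(b))) for a, b in zip(df['A'], df['B'])]

print(df['C'].tolist())  # ['', 'system', 'aluminum,toledo']
```

This is still a Python-level loop over the rows; it avoids `iterrows` and a custom function, but pandas is not vectorizing the set operations here.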