How to use square brackets as the quote character in pandas.read_csv


Suppose I have a text file that looks like this:

    Item,Date,Time,Location
    1,01/01/2016,13:41,[45.2344:-78.25453]
    2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
    3,01/10/2016,01:27,[51.2344:-86.24432]

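What I'd like to do is read it in with pandas.read_csv, but the second data row throws an error because its bracketed value contains a comma. Here is the code I'm currently using: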

    import pandas as pd
    df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

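I tried setting quotechar to "[", but that obviously just consumes lines until the next opening bracket, and adding the closing bracket as well results in a "string of length 2 found" error. Any insight would be appreciated. Thanks!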

Update

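Three main solutions were offered: 1) give the DataFrame a large number of column names so that all the data can be read in, then post-process the data; 2) find the values in square brackets and put quotes around them; or 3) replace the first n commas with semicolons.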

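Overall, I don't think option 3 is a viable solution in general (although it works fine for my data), because a) what if I have quoted values containing commas in one of the columns, and b) what if my bracketed column is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 is more efficient: it ran in 1.38 seconds, versus 3.02 seconds for solution 2. The tests were run on a text file with 18 columns and over 208,000 rows.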
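Solution 2 is described in the update but no code for it appears in this thread. The following is a minimal sketch of one way it might look, assuming the bracketed values never contain a double quote or a nested bracket. The helper name `quote_bracketed` is hypothetical, and the sample data is copied from the question.

```python
import io
import re

import pandas as pd

# Sample data copied from the question.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""


def quote_bracketed(text):
    """Wrap every [...] group in double quotes (hypothetical helper)."""
    return re.sub(r'\[[^\]]*\]', lambda m: '"' + m.group(0) + '"', text)


# With the default quotechar ('"'), commas inside the quoted groups
# are no longer treated as field separators.
df = pd.read_csv(io.StringIO(quote_bracketed(raw)), dtype=str)
print(df['Location'].tolist())
```

The brackets themselves stay in the parsed values; only the added quotes are consumed by the parser. For a real file, the text would be read from disk before being passed to the helper.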

Best answer

I think you can replace the first 3 occurrences of "," on each line of the file with ";" and then pass sep=";" to read_csv:

    import io
    import pandas as pd

    # Replace only the first 3 commas on each line with semicolons,
    # so the commas inside the brackets are left untouched.
    with open('file2.csv', 'r') as f:
        lines = f.readlines()

    fo = io.StringIO()
    fo.writelines(line.replace(',', ';', 3) for line in lines)
    fo.seek(0)

    df = pd.read_csv(fo, sep=';')
    print(df)

       Item        Date   Time                            Location
    0     1  01/01/2016  13:41                 [45.2344:-78.25453]
    1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
    2     3  01/10/2016  01:27                 [51.2344:-86.24432]
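The count of 3 in this approach is tied to the four-column sample. As a hedged variation, the number of splits could be derived from the header row and each line split at most that many times, which skips the semicolon rewrite entirely. This sketch assumes, like the answer itself, that the bracketed column is the last one and that no earlier column contains a comma. The variable names are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

lines = raw.splitlines()

# The header has no brackets, so its comma count is (number of columns - 1).
n_splits = lines[0].count(',')

# Split each line at most n_splits times; everything after the last split
# (including any further commas) stays together in the final field.
rows = [line.split(',', n_splits) for line in lines]

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

All values come back as strings here, which matches the dtype=str used in the question.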

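Alternatively, you can try this more complicated approach. The main problem is that the separator between the values inside the list is the same as the separator between the other columns' values.

So you need post-processing: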

    import io
    import pandas as pd

    # Sample data; row 2 carries three coordinate pairs, matching the output below.
    temp = u"""Item,Date,Time,Location
    1,01/01/2016,13:41,[45.2344:-78.25453]
    2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
    3,01/10/2016,01:27,[51.2344:-86.24432]"""
    # After testing, replace io.StringIO(temp) with the filename.
    # names=range(10) is the estimated maximum number of columns.
    df = pd.read_csv(io.StringIO(temp), names=list(range(10)))
    print(df)

          0           1      2                    3               4  \
    0  Item        Date   Time             Location             NaN
    1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN
    2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242
    3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN

                     5    6    7    8    9
    0              NaN  NaN  NaN  NaN  NaN
    1              NaN  NaN  NaN  NaN  NaN
    2  41.2342:-81242]  NaN  NaN  NaN  NaN
    3              NaN  NaN  NaN  NaN  NaN

    # Remove the columns that contain only NaN.
    df = df.dropna(how='all', axis=1)
    # Use the first row as the column names.
    df.columns = df.iloc[0, :]
    # Remove the first row.
    df = df[1:]
    # Remove the name of the columns index.
    df.columns.name = None
    # Get the position of the Location column.
    print(df.columns.get_loc('Location'))

    3

    # df1 holds the Location values and everything to the right of them.
    df1 = df.iloc[:, df.columns.get_loc('Location'):]
    print(df1)

                  Location             NaN              NaN
    1  [45.2344:-78.25453]             NaN              NaN
    2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
    3  [51.2344:-86.24432]             NaN              NaN

    # Combine the string values into one column, skipping the NaN entries
    # (str replaces the Python 2 basestring).
    df['Location'] = df1.apply(lambda x: ','.join([e for e in x if isinstance(e, str)]), axis=1)
    # Select the desired columns.
    print(df[['Item', 'Date', 'Time', 'Location']])

      Item        Date   Time                           Location
    1    1  01/01/2016  13:41                [45.2344:-78.25453]
    2    2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-8...
    3    3  01/10/2016  01:27                [51.2344:-86.24432]
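Since the underlying issue is that commas inside the brackets look like column separators, another option is to split each line only on commas that sit outside a bracket pair. This is a sketch of that idea and not something proposed in the thread. It assumes brackets are balanced and never nested, and the helper name `split_outside_brackets` is hypothetical. Unlike replacing the first n commas, it does not depend on the bracketed column being last.

```python
import re

import pandas as pd

# Sample data copied from the question.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

# A comma is a separator unless a ']' follows it before any other bracket,
# which would mean the comma lies inside an open [...] group.
SEPARATOR = re.compile(r',(?![^\[\]]*\])')


def split_outside_brackets(line):
    """Split a line on commas that are not inside [...] (hypothetical helper)."""
    return SEPARATOR.split(line)


rows = [split_outside_brackets(line) for line in raw.splitlines()]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Every value is parsed as a string, so numeric or date columns would still need converting afterwards, for example with pd.to_numeric or pd.to_datetime.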
