我们一直在命令行使用grep,cut,sort,uniq和join来进行数据分析.虽然有缺点,但它们效果很好.例如,您必须为每个工具提供列号.我们经常有宽文件(很多列)和列标题,它们给出了列名.实际上,我们的文件看起来很像sql表.我确定有一个驱动程序(ODBC?)将操作分隔的文本文件,以及一些将使用该驱动程序的查询引擎,因此我们可以在我们的文本文件上使用SQL查询.由于进行分析通常是临时的,因此查询新文件(仅使用我在此目录中指定的文件)而不是在某些配置中声明特定表必须是最小化设置.
解决方法
重复其他人的建议,这里是sqlite3的
Python脚本.有点冗长,但它的确有效.
我不喜欢完全复制文件以删除标题行,但我不知道如何说服sqlite3的.import跳过它.我可以创建INSERT语句,但是如果不是更糟的话,这看起来也差一点.
示例调用:
$sql.py --file foo --sql "select count(*) from data"
代码:
#!/usr/bin/env python """Run a sql statement on a text file""" import os import sys import getopt import tempfile import re class Usage(Exception): def __init__(self,msg): self.msg = msg def runCmd(cmd): if os.system(cmd): print "Error running " + cmd sys.exit(1) # TODO(dan): Return actual exit code def usage(): print >>sys.stderr,"Usage: sql.py --file file --sql sql" def main(argv=None): if argv is None: argv = sys.argv try: try: opts,args = getopt.getopt(argv[1:],"h",["help","file=","sql="]) except getopt.error,msg: raise Usage(msg) except Usage,err: print >>sys.stderr,err.msg print >>sys.stderr,"for help use --help" return 2 filename = None sql = None for o,a in opts: if o in ("-h","--help"): usage() return 0 elif o in ("--file"): filename = a elif o in ("--sql"): sql = a else: print "Found unexpected option " + o if not filename: print >>sys.stderr,"Must give --file" sys.exit(1) if not sql: print >>sys.stderr,"Must give --sql" sys.exit(1) # Get the first line of the file to make a CREATE statement # # Copy the rest of the lines into a new file (datafile) so that # sqlite3 can import data without header. If sqlite3 could skip # the first line with .import,this copy would be unnecessary. foo = open(filename) datafile = tempfile.NamedTemporaryFile() first = True for line in foo.readlines(): if first: headers = line.rstrip().split() first = False else: print >>datafile,line,datafile.flush() #print datafile.name #runCmd("cat %s" % datafile.name) # Create columns with NUMERIC affinity so that if they are numbers,# sql queries will treat them as such. create_statement = "CREATE TABLE data (" + ",".join( map(lambda x: "`%s` NUMERIC" % x,headers)) + ");" cmdfile = tempfile.NamedTemporaryFile() #print cmdfile.name print >>cmdfile,create_statement print >>cmdfile,".separator ' '" print >>cmdfile,".import '" + datafile.name + "' data" print >>cmdfile,sql + ";" cmdfile.flush() #runCmd("cat %s" % cmdfile.name) runCmd("cat %s | sqlite3" % cmdfile.name) if __name__ == "__main__": sys.exit(main())