Sorting a huge JSON file using bash or Python

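Requirement: I have a gzipped JSON file (.gz). Its compressed size is about 500 MB, and when I extract it the JSON file grows to roughly 10 GB. The extracted file contains individual JSON objects, one after another. I want to sort the file by the field ps using a bash script or a Python program.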

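Because the file is so large, loading it into memory is not advisable. So I used the gzcat and cat commands to stream the JSON data and piped it to jq for sorting. But either the system becomes unresponsive during the process, or I end up with an empty output.json.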

>cat  sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"
>gzcat  sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"

Hardware:
16 GB RAM,
Core i5 processor

Sample JSON data:

{
    "ps":"abc"
    ....
}
{   
    "ps":"def"
    ......
}
{
    "ps":"abc"
    ....
}

Expected output:

{
    "ps":"abc"
    ....
}
{   
    "ps":"abc"
    ....
}
{
    "ps":"def"
    ....
}

I don't understand what I'm doing wrong. Can anyone suggest how to sort such a huge JSON file?
The link I followed:
https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism

Also, is there any way to do this with some form of MapReduce, without Hadoop?

Approach 1: Stream the data into a local SQLite DB.

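import fileinput
import sqlite3

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
# Assumes the table Feeds, with a text column "data", already exists.
INSERT_QUERY = "INSERT INTO Feeds (data) VALUES (?)"


def db_connect(db_path=PATH):
    """Open a connection to the SQLite database."""
    return sqlite3.connect(db_path)


con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor object

# Read lines from stdin (or from files named on the command line).
for line in fileinput.input():
    cur.execute(INSERT_QUERY, (line,))

con.commit()  # without a commit, the inserted rows are rolled back on exit
con.close()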

Command line:

>gzcat sth.json.gz | python insert.py
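
For the SQLite route to produce sorted output, the sort key has to be stored where SQLite can order by it. The following is a minimal sketch of that idea, not the asker's script: it assumes one complete JSON object per input line (for example after passing the data through `jq -c .`), and the table `feeds(ps, data)`, the function name `sort_via_sqlite` and the `batch_size` parameter are all hypothetical names chosen for illustration.

```python
import json
import sqlite3


def sort_via_sqlite(lines, db_path=":memory:", batch_size=10000):
    """Yield JSON lines ordered by their "ps" field, letting SQLite do the sort.

    Assumption: every non-empty input line is one complete JSON object.
    """
    con = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: the sort key next to the raw JSON text.
        con.execute("CREATE TABLE IF NOT EXISTS feeds (ps TEXT, data TEXT)")
        batch = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            batch.append((json.loads(line)["ps"], line))
            if len(batch) >= batch_size:
                con.executemany("INSERT INTO feeds (ps, data) VALUES (?, ?)", batch)
                batch.clear()
        if batch:
            con.executemany("INSERT INTO feeds (ps, data) VALUES (?, ?)", batch)
        con.commit()
        # Index the key after loading, then read rows back in key order;
        # rowid keeps rows with equal keys in insertion order.
        con.execute("CREATE INDEX IF NOT EXISTS feeds_ps ON feeds (ps)")
        for (data,) in con.execute("SELECT data FROM feeds ORDER BY ps, rowid"):
            yield data
    finally:
        con.close()
```

A script could call this as `for row in sort_via_sqlite(sys.stdin, "feeds.db"): print(row)`. With an on-disk database path, only one batch of lines should need to sit in Python memory at a time.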

Best answer

Here is a solution based on a suggestion from one of the comments:

If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy

You can use jq to add that prefix as follows:

jq -cr '"\(.ps)\t\(.)"'

This produces lines of tab-separated values, like so:

abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}

Using the -c option ensures that each pair (i.e. the sort key and the object) is written on a single line.

Now you can easily sort the lines, e.g. using sort, and then use e.g. cut to strip off the leading .ps field.
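
The same decorate, sort, undecorate idea can also be sketched in Python for readers who would rather not shell out to sort and cut. This is a minimal external merge sort written for illustration, not the answer's method: it assumes one JSON object per line and .ps values without tabs or newlines, and the names `external_sort_by_ps`, `_write_run` and `chunk_size` are hypothetical.

```python
import heapq
import json
import tempfile


def _write_run(decorated):
    """Sort one chunk of "key<TAB>json" lines in memory and spill it to a temp file."""
    decorated.sort()
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    run.writelines(decorated)
    run.seek(0)
    return run


def external_sort_by_ps(lines, chunk_size=100000):
    """Yield JSON lines ordered by "ps", holding at most chunk_size lines in memory.

    Lines are compared as whole "key<TAB>json" strings, so objects with
    equal keys come out ordered by their JSON text, not by input order.
    """
    runs, chunk = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = json.loads(line)["ps"]
        chunk.append(f"{key}\t{line}\n")  # decorate: prefix the sort key
        if len(chunk) >= chunk_size:
            runs.append(_write_run(chunk))
            chunk = []
    if chunk:
        runs.append(_write_run(chunk))
    try:
        # Lazily merge the sorted runs, then strip the key prefix again.
        for decorated in heapq.merge(*runs):
            yield decorated.rstrip("\n").split("\t", 1)[1]
    finally:
        for run in runs:
            run.close()
```

Each run is one open temporary file, so a very small `chunk_size` on a very large input could hit the operating system's open-file limit; the chunk size is the knob that trades memory against the number of runs.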

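Finally, if you really want the output to be pretty-printed, you can run it through jq again (e.g. jq .); this works because jq is stream-oriented by default.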

Caveat

The above assumes that the .ps values contain no tabs. If that is not the case, you can use a different field separator, or alternatively:

jq -cr '([.ps] | @tsv) + "\t" + tostring'
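
The @tsv variant helps because it escapes tabs, newlines, carriage returns and backslashes inside the key, so the first literal tab on each line is always the separator. The Python sketch below mirrors that escaping to show the idea; it is an illustration rather than jq's implementation, and `tsv_escape` is a hypothetical helper name.

```python
def tsv_escape(value):
    """Escape backslash, tab, newline and carriage return in a sort key.

    Backslashes are escaped first so that the backslashes introduced by
    the later replacements are not escaped a second time.
    """
    return (
        value.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def decorate(key, json_line):
    """Build one "key<TAB>json" line whose first tab is always the separator."""
    return f"{tsv_escape(key)}\t{json_line}"
```

Note that lines are then sorted on the escaped form of the key, which may order keys containing these characters slightly differently from their raw values.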

