SELECT source,id FROM tablename GROUP BY source,id
但结果可能需要一段时间才能完成.有人建议,如果字段被索引,并且索引值的比例很小(这是我的情况),那么松散的索引扫描可能是更好的选择:http://wiki.postgresql.org/wiki/Loose_indexscan
WITH RECURSIVE t AS (SELECT min(col) AS col FROM tablename UNION ALL SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL) SELECT col FROM t WHERE col IS NOT NULL UNION ALL SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);
该示例在单个字段上运行.尝试返回多个字段会产生错误:子查询必须只返回一列.一种可能性是尝试检索整个行 – 例如SELECT ROW(min(source),min(id)…,但是我不确定WHERE语句的语法需要看起来像处理单个行元素.
问题是:可以修改基于递归的代码以使用多个列,如果是,如何?我致力于使用Postgres,但看起来MysqL已经为多个列实现了松散的索引扫描:http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
按照建议,我附上了EXPLAIN ANALYZE结果.
对于我的情况 – 我使用GROUP BY为2列选择不同的值,它是以下内容:
HashAggregate (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1) -> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1) Total runtime: 36127.789 ms (3 rows)
我不知道如何进行2列索引扫描(这是问题),但为了进行比较,在一列上使用GROUP BY,我得到:
HashAggregate (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1) -> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1) Total runtime: 32350.899 ms (3 rows)
但是对于一列上的松散索引扫描,我得到:
Result (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1) CTE t -> Recursive Union (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1) -> Result (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1) InitPlan 1 (returns $1) -> Limit (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1) -> Index Only Scan using tablename_id on tablename (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1) Index Cond: (id IS NOT NULL) Heap Fetches: 0 -> WorkTable Scan on t (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101) Filter: (id IS NOT NULL) Rows Removed by Filter: 0 SubPlan 3 -> Result (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100) InitPlan 2 (returns $3) -> Limit (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100) -> Index Only Scan using tablename_id on tablename (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100) Index Cond: ((id IS NOT NULL) AND (id > t.id)) Heap Fetches: 0 -> Append (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1) -> CTE Scan on t (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1) Filter: (id IS NOT NULL) Rows Removed by Filter: 1 -> Result (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) One-Time Filter: $5 InitPlan 5 (returns $5) -> Index Only Scan using tablename_id on tablename (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) Index Cond: (id IS NULL) Heap Fetches: 0 Total runtime: 2.040 ms
完整的表定义如下所示:
CREATE TABLE tablename ( source character(25),id bigint NOT NULL,time_ timestamp without time zone,height numeric,lon numeric,lat numeric,distance numeric,status character(3),geom geometry(PointZ,4326),relid bigint ) WITH ( OIDS=FALSE ); CREATE INDEX tablename_height ON public.tablename USING btree (height); CREATE INDEX tablename_geom ON public.tablename USING gist (geom); CREATE INDEX tablename_id ON public.tablename USING btree (id); CREATE INDEX tablename_lat ON public.tablename USING btree (lat); CREATE INDEX tablename_lon ON public.tablename USING btree (lon); CREATE INDEX tablename_relid ON public.tablename USING btree (relid); CREATE INDEX tablename_sid ON public.tablename USING btree (source COLLATE pg_catalog."default",id); CREATE INDEX tablename_source ON public.tablename USING btree (source COLLATE pg_catalog."default"); CREATE INDEX tablename_time ON public.tablename USING btree (time_);
答案选择:
我花了一些时间来比较所提供的方法.有时这样我希望不止一个答案可以被接受,但在这种情况下,我正在给@jjanes打勾.这样做的原因是他的解决方案与最初提出的问题更紧密地匹配,并且我能够获得关于所需WHERE语句的形式的一些见解.最后,HashAggregate实际上是最快的方法(对我来说),但这是由于我的数据的性质,而不是算法的任何问题.我已经为下面的不同方法添加了EXPLAIN ANALYZE,并且将为jjanes和joop提供1.
HashAggregate:
HashAggregate (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1) -> Seq Scan on tablename (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1) Total runtime: 24787.292 ms
松散索引扫描修改
CTE Scan on t (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1) Filter: (source IS NOT NULL) Rows Removed by Filter: 1 CTE t -> Recursive Union (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1) -> Limit (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1) -> Index Only Scan using tablename_sid on tablename (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1) Heap Fetches: 0 -> WorkTable Scan on t (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081) Filter: (source IS NOT NULL) Rows Removed by Filter: 0 SubPlan 1 -> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080) -> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080) Index Cond: (ROW(source,id) > ROW(t.source,t.id)) Heap Fetches: 0 SubPlan 2 -> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080) -> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080) Index Cond: (ROW(source,t.id)) Heap Fetches: 0 Total runtime: 250491.559 ms
合并反加入
Merge Anti Join (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1) Merge Cond: ((src.source = nx.source) AND (src.id = nx.id)) Join Filter: (nx.time_ > src.time_) Rows Removed by Join Filter: 363464177 -> Index Only Scan using tablename_pkey on tablename src (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1) Heap Fetches: 0 -> Materialize (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1) -> Index Only Scan using tablename_pkey on tablename nx (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1) Heap Fetches: 0 Total runtime: 541750.026 ms
WITH RECURSIVE t AS ( select a,b from (select a,b from foo order by a,b limit 1) asdf union all select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),(select b from foo where (a,b limit 1) from t where t.a is not null) select * from t where t.a is not null;
我真的不明白为什么需要“不是空”,因为空值首先出现在哪里?