我在一个基本上收集日志记录信息的表上经常进行以下两个查询.两者都从大量行中选择不同的值,但这些值中的值不到10个.
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_; QUERY PLAN ---------------------------------------------------------------------------------------------- HashAggregate (cost=1070734.05..1070734.11 rows=6 width=21) -> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023050.24 rows=19073524 width=21) (2 rows) marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_; QUERY PLAN ---------------------------------------------------------------------------------------------- HashAggregate (cost=1070735.34..1070735.39 rows=5 width=13) -> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023051.47 rows=19073547 width=13) (2 rows)
两者都对列进行序列扫描.但是,如果我关闭enable_seqscan(显示名称,这只会禁用对带索引的列执行序列扫描),查询将使用索引,但速度更慢:
marchena=> set enable_seqscan = off; SET marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Unique (cost=0.00..19613740.62 rows=6 width=21) -> Index Scan using audit_bundle_idx on audit_records auditrecor0_ (cost=0.00..19566056.69 rows=19073570 width=21) (2 rows) marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Unique (cost=0.00..45851449.96 rows=5 width=13) -> Index Scan using audit_server_idx on audit_records auditrecor0_ (cost=0.00..45803766.04 rows=19073570 width=13) (2 rows)
bundle_id和server_name列都有btree索引,我应该使用不同类型的索引来快速选择不同的值吗?
BEGIN; CREATE TABLE dist ( x INTEGER NOT NULL ); INSERT INTO dist SELECT random()*50 FROM generate_series( 1,5000000 ); COMMIT; CREATE INDEX dist_x ON dist(x); VACUUM ANALYZE dist; EXPLAIN ANALYZE SELECT DISTINCT x FROM dist; HashAggregate (cost=84624.00..84624.51 rows=51 width=4) (actual time=1840.141..1840.153 rows=51 loops=1) -> Seq Scan on dist (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.003..573.819 rows=5000000 loops=1) Total runtime: 1848.060 ms
PG不能(还)使用不同的索引(跳过相同的值)但你可以这样做:
CREATE OR REPLACE FUNCTION distinct_skip_foo() RETURNS SETOF INTEGER LANGUAGE plpgsql STABLE AS $$ DECLARE _x INTEGER; BEGIN _x := min(x) FROM dist; WHILE _x IS NOT NULL LOOP RETURN NEXT _x; _x := min(x) FROM dist WHERE x > _x; END LOOP; END; $$; EXPLAIN ANALYZE SELECT * FROM distinct_skip_foo(); Function Scan on distinct_skip_foo (cost=0.00..260.00 rows=1000 width=4) (actual time=1.629..1.635 rows=51 loops=1) Total runtime: 1.652 ms