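When using a PostgreSQL database configured with hot standby, running a large volume of transactions, and in particular a single insert transaction that inserts tens of millions of rows (typically by repeatedly running insert into t select * from t;), produces the following error in the server log:
Log in CSV format: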
2013-07-01 13:25:29.430 CST,27738,51d112c8.6c5a,1,2013-07-01 13:25:28 CST,LOG,00000,"streaming replication successfully connected to primary","libpqrcv_connect,libpqwalreceiver.c:171",""
2013-07-01 13:25:29.430 CST,2,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 0000000800002A0000000000 has already been removed","libpqrcv_receive,libpqwalreceiver.c:389",""
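To see which segment the standby is asking for, the file name in the error can be decoded. A WAL segment file name is 24 hex digits: 8 for the timeline, 8 for the log file number, and 8 for the segment number within that log file. The following is a minimal Python sketch, assuming the default 16 MB segment size; the function name parse_wal_segment_name is hypothetical and not part of PostgreSQL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16 MB)


def parse_wal_segment_name(name, segment_size=WAL_SEGMENT_SIZE):
    """Split a 24-hex-digit WAL file name into timeline, log id and segment number.

    Also returns the start LSN of the segment in "high/low" hex notation.
    """
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_no = int(name[16:24], 16)
    # The segment starts seg_no * segment_size bytes into log file log_id.
    start_lsn = "%X/%X" % (log_id, seg_no * segment_size)
    return timeline, log_id, seg_no, start_lsn


print(parse_wal_segment_name("0000000800002A0000000000"))
```

For the segment in the log above this gives timeline 8, log file 0x2A00, segment 0, so the standby needs WAL starting at LSN 2A00/0.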
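Note: the error message makes the cause clear. The primary generated a large amount of xlog (WAL). Streaming replication ships WAL to the standby continuously as it is written, not only at commit, but during such a long-running bulk transaction the primary can produce WAL faster than the standby receives it. The transaction also runs longer than the default checkpoint interval, and after each checkpoint the primary recycles or removes segments that it no longer needs itself and that fall outside wal_keep_segments. As a result, some xlog segments are removed before they have been sent to the standby. The usual ways to address this are: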
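1. Increase the value of wal_keep_segments
Set the GUC parameter wal_keep_segments to a larger value, for example 2000. Each segment is 16 MB by default, so this keeps 32000 MB, roughly 31 GB, of WAL on the primary as a buffer for the standby.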
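The sizing arithmetic can be checked with a short calculation. This is a minimal Python sketch, assuming the default 16 MB segment size; the helper names retained_wal_gib and segments_for are made up for illustration.

```python
import math

SEGMENT_SIZE_MB = 16  # assumed default WAL segment size


def retained_wal_gib(wal_keep_segments, segment_size_mb=SEGMENT_SIZE_MB):
    """Disk space (GiB) held by wal_keep_segments retained segments."""
    return wal_keep_segments * segment_size_mb / 1024


def segments_for(wal_volume_gib, segment_size_mb=SEGMENT_SIZE_MB):
    """Smallest wal_keep_segments value that covers wal_volume_gib of WAL."""
    return math.ceil(wal_volume_gib * 1024 / segment_size_mb)


print(retained_wal_gib(2000))  # space kept for wal_keep_segments = 2000
print(segments_for(100))       # segments needed to buffer 100 GiB of WAL
```

With wal_keep_segments = 2000 the result is 31.25 GiB. Going the other way, buffering 100 GiB of WAL would need a setting of 6400, which shows how quickly the disk cost grows for very large transactions.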
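However, this does not solve the problem at its root. In production, or when loading data for a benchmark such as TPC-C, a single transaction that inserts billions of rows can still generate more WAL than the retained segments cover, and the error can appear again.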
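2. Enable archiving
Archiving copies each completed xlog segment to a separate directory. When the standby cannot get a segment over streaming replication, for example after a restart, it restores the segment from that directory with restore_command.
Example GUC settings: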
In postgresql.conf on the primary:
wal_level = hot_standby
archive_mode = on
archive_command = 'rsync -zaq %p postgres@pg-slave:/var/lib/pgsql/wal_restore/%f && test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp %p /var/lib/pgsql/backup/wal_archive/'
archive_timeout = 300
max_wal_senders = 5
wal_keep_segments = 0 # not sure why I've set it to this?

In postgresql.conf on the standby:
archive_command = 'test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp -i %p /var/lib/pgsql/backup/wal_archive/%f < /dev/null'
hot_standby = on
wal_keep_segments = 1

In recovery.conf on the standby:
standby_mode = 'on'
primary_conninfo = 'host=pg-master port=5432 user=replicator'
restore_command = 'cp /var/lib/pgsql/wal_restore/%f %p'
archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_restore/ %r'
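The archive_command settings above use test ! -f before cp so that a segment already in the archive is never overwritten, and a non-zero exit status tells PostgreSQL that archiving failed and must be retried. The same logic can be sketched in Python. This is an illustration only, not a replacement for the commands above: the function name archive_segment is hypothetical, and the copy to a temporary name followed by a rename is an extra precaution that the original cp command does not take.

```python
import os
import shutil


def archive_segment(src_path, archive_dir, file_name):
    """Copy one WAL segment into archive_dir and return a shell-style exit code.

    0 means the segment was archived; 1 means it was already present and
    was left untouched (same intent as `test ! -f` in the shell command).
    """
    dest = os.path.join(archive_dir, file_name)
    if os.path.exists(dest):
        return 1
    # Copy under a temporary name first so a partially written file is
    # never visible under the final segment name (assumption, see lead-in).
    tmp = dest + ".tmp"
    shutil.copyfile(src_path, tmp)
    os.rename(tmp, dest)
    return 0
```

In this sketch src_path corresponds to %p and file_name to %f in archive_command. A second call for the same segment returns 1 instead of replacing the archived copy.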
References:
http://blog.2ndquadrant.com/postgresql-9-4-slots/
http://grokbase.com/t/postgresql/pgsql-general/13654jchy3/trouble-with-replication