表膨胀
表膨胀的原因
udw的存储实现(MVCC-多版本并发控制)来自于Postgres。根据MVCC的原理,没有办法直接更新数据(更新操作(update)是通过先删除(delete)再插入(insert)实现的),被更新之前的行数据仍然在数据文件中。
如何避免表膨胀
方法一:vacuum full table
vacuum full不能回收索引的膨胀空间。vacuum full 加载的锁与 DDL 锁类似,是排它锁。建议在没有业务的时候执行,不要堵塞业务。
使用 vacuum full 回收垃圾的建议操作流程:
- 记录下表的索引
- 删除索引
- vacuum full 表
- 重建索引
示例:
创建测试表:
dev=# create table test(id int, name text); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE dev=# insert into test select generate_series(1,100000000), 'test'; INSERT 0 100000000 dev=# create index idx_test on test(id); CREATE INDEX dev=# update test set name='nice'; UPDATE 100000000
查看表格数据:
dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 8401 MB (1 row)
查看索引数据:
dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 6377 MB (1 row)
先回收表数据(此方法不能回收索引数据):
dev=# vacuum full test; VACUUM dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) Time: 4.278 ms dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 6377 MB (1 row)
回收索引和表的数据:
dev=# drop index idx_test; DROP INDEX dev=# vacuum full test; VACUUM dev=# create index idx_test on test(id); CREATE INDEX dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 2126 MB (1 row)
方法二:通过修改分布键释放空间
修改分布键可以回收索引的膨胀空间。修改分布键加载的锁与 DDL 锁类似,是排它锁。建议在没有业务的时候执行,不要影响业务。
alter table test set with (reorganize=true) distributed randomly; alter table test set with (reorganize=true) distributed by (id);
实例:
dev=# update test set name='back'; UPDATE 100000000
查看数据:
dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 8401 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 4251 MB (1 row)
查看表格的分布键,如下所示Distributed by: (id),分布键为id
dev=# \d+ test Table "public.test" Column | Type | Modifiers | Storage | Description --------+---------+-----------+----------+------------- id | integer | | plain | name | text | | extended | Indexes: "idx_test" btree (id) Has OIDs: no Distributed by: (id)
按照原有的分布键重新分布
dev=# alter table test set with (reorganize=true) distributed by (id); ALTER TABLE
查看数据
dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 2126 MB
方法三:创建新表,导入数据
CREATE TABLE...AS SELECT
命令把该表拷贝为一个新表,新建的表将不会出现膨胀现象。然后删除原始表并且重命名拷贝的表。
参考: https://gp-docs-cn.github.io/docs/best_practices/bloat.html /analysis/udw/developer(选择数据分布策略)