V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
HOMO114514
V2EX  ›  数据库

(类 PG 信创)亿级 table 简单查询的一个优化问题

  •  
  •   HOMO114514 · 111 天前 · 1274 次点击
    这是一个创建于 111 天前的主题,其中的信息可能已经有所发展或是发生改变。

    部分信息借由 GPT 进行脱敏处理

    现有一个简单表结构:

    Table "schema.t_table_m"
         Column       |            Type             |                Modifiers                 | Storage  | Stats target | Description 
    -----------------+-----------------------------+------------------------------------------+----------+--------------+-------------
     partition_id    | smallint                    | not null                                 | plain    |              | 分区 ID
     transaction_time| timestamp without time zone | not null default pg_systimestamp()       | plain    |              | 交易时间
     transaction_count | integer                   |                                          | plain    |              | 交易数量
     last_update_time | timestamp without time zone | not null default pg_systimestamp()      | plain    |              | 更新时间
     source_type     | character varying(10)       | not null default NULL::character varying | extended |              | 事件源
     transaction_code | character varying(30)      | not null default NULL::character varying | extended |              | 交易码
     key_value       | character varying(100)      | not null default NULL::character varying | extended |              | RV 变量键
     value_data      | character varying(500)      | not null default NULL::character varying | extended |              | RV 变量值
    Indexes:
        "t_table_m_pkey" PRIMARY KEY, ubtree (transaction_time, value_data, key_value, transaction_code, source_type, partition_id) LOCAL WITH (storage_type=USTORE, active_pages=1314323) TABLESPACE pg_default
    Partition By LIST(partition_id)
    Number of partitions: 31 (View pg_partition to check each partition range.)
    Has OIDs: no
    Options: orientation=row, compression=no, storage_type=USTORE, segment=off
    

    表由 Mysql 迁移而来,这个分区表中存储的记录条数为 126000000+。在开发侧,他们反馈下面的一个简单查询执行性能非常差,每条 SQL 要跑 15s 左右:

    SELECT
      value_data AS dataValue,
      SUM(transaction_count) AS txnCount
    FROM
      schema.t_table_m
    WHERE
      1 = 1
      AND source_type = '1'
      AND key_value IN (
        'key1',
        'key2',
        'key3',
        'key4',
        'key5',
        'key6',
        'key7',
        'key8',
        'key9',
        'key10'
      )
      AND partition_id IN (1, 31)
      AND transaction_time >= '2024-07-31 15:58:11'
      AND transaction_time <= '2024-08-01 15:58:11'
      AND value_data NOT IN ('null','space','#','*')
      AND transaction_code NOT IN ('code1', 'code2')
    GROUP BY
      value_data;
    

    执行计划如下:

    postgres=> explain performance
    SELECT
    .......;
    
     id |                                           operation                                            |  A-time   | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width |    E-costs     
    ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+----------------
      1 | ->  HashAggregate                                                                              | 12715.021 |      0 |      1 |            | 22KB        |         |       8 | 28.485..28.495
      2 |    ->  Partition Iterator                                                                      | 12715.011 |      0 |      1 |            | 69KB        |         |       8 | 0.010..28.480
      3 |       ->  Partitioned Index Scan using t_table_m_pkey on schema.t_table_m                      | 12714.903 |      0 |      1 |            | 29KB        |         |       8 | 0.010..28.480
    (3 rows)
    
                Predicate Information (identified by plan id)  
    ------------------------------------------------------------------------------------------
       2 --Partition Iterator
             Iterations: 2
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text) AND (schema.t_table_m.partition_id = ANY ('{1,31}'::integer[])))
             Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[])))
             Selected Partitions:  1,31
    (6 rows)
    
                             Memory Information (identified by plan id)                         
    ------------------------------------------------------------------------------------------
       1 --HashAggregate
             Peak Memory: 22KB, Estimate Memory: 32MB
       2 --Partition Iterator
             Peak Memory: 69KB, Estimate Memory: 32MB
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Peak Memory: 29KB, Estimate Memory: 32MB
    (6 rows)
    
                           Targetlist Information (identified by plan id)                       
    --------------------------------------------------------------------------------------------
       1 --HashAggregate
             Output: value_data, sum(transaction_count)
             Group By Key: schema.t_table_m.value_data
       2 --Partition Iterator
             Output: value_data, transaction_count
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Output: value_data, transaction_count
    (7 rows)
    
                            Datanode Information (identified by plan id)                        
    --------------------------------------------------------------------------------------------
       1 --HashAggregate
             (actual time=12715.021..12715.021 rows=0 loops=1)
             (Buffers: shared hit=760703)
             (CPU: ex c/r=0, ex row=0, ex cyc=437, inc cyc=847446039235951)
       2 --Partition Iterator
             (actual time=12715.011..12715.011 rows=0 loops=1)
             (CPU: ex c/r=0, ex row=0, ex cyc=-847446039224001, inc cyc=847446039235514)
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             (actual time=12714.903..12714.903 rows=0 loops=2)
             (Buffers: shared hit=760703)
             (CPU: ex c/r=0, ex row=0, ex cyc=1694892078459515, inc cyc=1694892078459515)
    (11 rows)
    
           ====== Query Summary =====        
    ------------------------------------------
     Datanode executor start time: 0.177 ms
     Datanode executor run time: 12715.042 ms
     Datanode executor end time: 0.021 ms
     Planner runtime: 1.124 ms
     Query Id: 10810046480632471817
     Total runtime: 12715.260 ms
    (6 rows)
    

    其中所有涉及列都已经命中索引扫描,测试时尝试更改某些 where 条件使得查询进行全表扫描的话,执行时间会暴增到 2 分钟左右。

    经由我们多轮调试后,发现仅仅是单纯地将 where 条件中partition_id IN (1, 31)更改为(partition_id=31 OR partition_id=1),就能使得查询平均耗时减少到 6s 。其它优化更改的效果低于系统性能波动,无法观测。

    原本猜测是因为 IN 语句改写后会触发分区剪枝类的优化,但是经对比发现改写前后的执行计划完全一致:

    postgres=> explain performance
    SELECT
      value_data AS dataValue,
      SUM(transaction_count) AS txnCount
    FROM
      schema.t_table_m
    WHERE
      1 = 1
      AND source_type = '1'
      AND key_value = any (array[
        'key1',
        'key2',
        'key3',
        'key4',
        'key5',
        'key6',
        'key7',
        'key8',
        'key9',
        'key10'
      ])
      AND (partition_id = 1 OR partition_id = 31)
      AND transaction_time >= '2024-07-31 15:58:11'
      AND transaction_time <= '2024-08-01 15:58:11'
      AND value_data not in ('null','space','#','*')
      AND transaction_code <> 'code1' 
      AND transaction_code <> 'code2'
    GROUP BY
      value_data;
    
    
     id |                                           operation                                            |  A-time   | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width |    E-costs     
    ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+----------------
      1 | ->  HashAggregate                                                                              | 6421.454 |      0 |      1 |            | 22KB        |         |       8 | 14.790..14.800
      2 |    ->  Partition Iterator                                                                      | 6421.441 |      0 |      1 |            | 69KB        |         |       8 | 0.010..14.785
      3 |       ->  Partitioned Index Scan using t_table_m_pkey on schema.t_table_m                      | 6421.276 |      0 |      1 |            | 31KB        |         |       8 | 0.010..14.785
    (3 rows)
    
                Predicate Information (identified by plan id)
    -----------------------------------------------------------------------------------------
       2 --Partition Iterator
             Iterations: 2
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text))
             Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.partition_id = 31) OR (schema.t_table_m.partition_id = 1)) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[])))
             Selected Partitions:  1,31
    (6 rows)
    
                
                Memory Information (identified by plan id)                         
    ------------------------------------------------------------------------------------------
       1 --HashAggregate
             Peak Memory: 22KB, Estimate Memory: 32MB
       2 --Partition Iterator
             Peak Memory: 69KB, Estimate Memory: 32MB
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Peak Memory: 31KB, Estimate Memory: 32MB
    (6 rows)
    
                           Targetlist Information (identified by plan id)                       
    --------------------------------------------------------------------------------------------
       1 --HashAggregate
             Output: value_data, sum(transaction_count)
             Group By Key: schema.t_table_m.value_data
       2 --Partition Iterator
             Output: value_data, transaction_count
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             Output: value_data, transaction_count
    (7 rows)
    
                            Datanode Information (identified by plan id)                        
    --------------------------------------------------------------------------------------------
       1 --HashAggregate
             (actual time=6421.454..6421.454 rows=0 loops=1)
             (Buffers: shared hit=380353)
             (CPU: ex c/r=0, ex row=0, ex cyc=488, inc cyc=847433445353297)
       2 --Partition Iterator
             (actual time=6421.441..6421.441 rows=0 loops=1)
             (CPU: ex c/r=0, ex row=0, ex cyc=-847433445336080, inc cyc=847433445352809)
       3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m
             (actual time=6421.276..6421.276 rows=0 loops=2)
             (Buffers: shared hit=380353)
             (CPU: ex c/r=0, ex row=0, ex cyc=1694866890688889, inc cyc=1694866890688889)
    (11 rows)
    
           ====== Query Summary =====        
    -----------------------------------------
     Datanode executor start time: 0.198 ms
     Datanode executor run time: 6421.479 ms
     Datanode executor end time: 0.022 ms
     Planner runtime: 1.417 ms
     Query Id: 10810046480632413864
     Total runtime: 6421.729 ms
    (6 rows)
    

    两者唯一的区别是,在 Partitioned Index Scan 时,使用 OR 的语句 Buffers: shared hit 是使用 IN 的一半。

    其中造成这样的差距的可能原因是什么?

    4 条回复    2024-08-14 12:04:29 +08:00
    967182
        1
    967182  
       111 天前
    GPT 这样说: 从描述和执行计划中可以看出,使用 IN 与使用 OR 的主要区别在于访问缓冲区( Buffers )的行为不同。使用 OR 时,缓冲区的共享命中率只有使用 IN 的一半。这意味着在使用 OR 时,查询过程中的缓存利用效率更高,从而导致查询执行时间较短。

    以下是可能导致这种差异的原因:

    缓存使用效率:IN 操作符通常会转换成多个条件的 OR 表达式,这可能会增加某些情况下的缓存失效或增加磁盘 IO 。

    执行计划差异:尽管在高层次上执行计划看似相同,但在某些底层细节上,优化器可能会对 IN 和 OR 进行不同的处理,导致资源使用的不同。

    执行路径:OR 和 IN 在某些数据库系统中的处理路径可能不同,导致数据扫描和缓存的差异。

    总的来说,具体的原因可能是数据库在处理 IN 操作符时,在分区剪枝和索引扫描的优化方面没有完全优化,或者在使用 OR 时,查询路径更为直接,减少了缓存未命中的情况,从而使查询执行得更快。
    mark2025
        2
    mark2025  
       111 天前
    考虑在 transaction_time 上面创建 BRIN 索引(不是 btree )
    ShuA1
        3
    ShuA1  
       111 天前
    建议将 transaction_time 保存为时间戳,bigint 的对比速度会快很多。 你这里是日期,会存在隐式转换为时间戳的问题。
    另外我看 source_type 字段也是 text 类型, 也可以更换为 int4 。
    建议你用 explain analyze 进行分析,找出慢的地方,performance 更多是对资源消耗的分析。
    MoYi123
        4
    MoYi123  
       111 天前
    优化器没写好呗, 能报 bug 的话就报一下, 这种应该很好修.
    关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   1519 人在线   最高记录 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 27ms · UTC 17:18 · PVG 01:18 · LAX 09:18 · JFK 12:18
    Developed with CodeLauncher
    ♥ Do have faith in what you're doing.