Open vSwitch port disappears under heavy traffic: looking for a troubleshooting method and ideas

    littlewey · 2018-01-14 13:04:12 +08:00 · 2534 views

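    I have never had any systems programming experience and lack a methodology for this kind of debugging. I have now run into an OVS port going missing, and everything Google turns up on troubleshooting is at the data-flow level. I am out of ideas, so I am asking here and hoping for some help and mentoring.

    Because this is a semi-production environment, I cannot stay connected to it and reproduce the problem at will, so I need to do more research on my own before trying anything.

    I have also just posted the question on Stack Overflow: https://stackoverflow.com/questions/48246801/openvswitch-port-missing-in-large-load-long-poll-interval-observed

    If you have ideas, please answer there directly, or reply in both places if you don't mind. Sorry for the inconvenience.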


    Issue description

    I have an OpenStack system whose HA management network (VIP) runs over an OVS (Open vSwitch) port. Under high load (concurrent creation of volumes from Glance images), the VIP port, which is an OVS port, goes missing.

    Analysis

    So far, at the default log level, the only thing observed in the log file is the following "Unreasonably long 62741ms poll interval" warning:

    2017-12-29T16:40:38.611Z|00001|timeval(revalidator70)|WARN|Unreasonably long 62741ms poll interval (0ms user, 0ms system)
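
When scanning a long ovs-vswitchd log for these warnings, it can help to pull out the thread name and the three durations. The following is a minimal sketch that assumes every such warning has exactly the shape of the line above; the function name `parse_long_poll` is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern modelled on the single log line quoted above:
# TIMESTAMP|SEQ|timeval(THREAD)|WARN|Unreasonably long Nms poll interval (Ums user, Sms system)
LONG_POLL = re.compile(
    r"^(?P<ts>\S+?)\|(?P<seq>\d+)\|timeval(?:\((?P<thread>[^)]*)\))?\|WARN\|"
    r"Unreasonably long (?P<ms>\d+)ms poll interval "
    r"\((?P<user>\d+)ms user, (?P<system>\d+)ms system\)"
)


def parse_long_poll(line):
    """Return the fields of a long-poll warning, or None for any other line."""
    m = LONG_POLL.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "time": ts.replace(tzinfo=timezone.utc),  # the trailing Z marks UTC
        "thread": m["thread"],                    # e.g. "revalidator70"
        "wall_ms": int(m["ms"]),
        "user_ms": int(m["user"]),
        "system_ms": int(m["system"]),
    }
```

For the line above this gives a wall time of 62741 ms with 0 ms of user and system CPU time, which suggests the thread spent the interval waiting rather than computing.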
    

    Idea for now

    I will turn on debug logging to the log file and try to reproduce the issue:

    sudo ovs-appctl vlog/set file:dbg
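
To correlate the debug log with the moment the port vanishes, one option is a small watcher that records a UTC timestamp whenever an expected port is absent. This is only a sketch: it assumes "missing" means the port is no longer listed by `ovs-vsctl list-ports`, and the bridge and port names in the usage note are hypothetical. If the port stays in the database but disappears elsewhere, a different check would be needed.

```python
import subprocess
import time
from datetime import datetime, timezone


def missing_ports(expected, list_ports_output):
    """Return the expected port names absent from `ovs-vsctl list-ports` output."""
    present = {ln.strip() for ln in list_ports_output.splitlines() if ln.strip()}
    return sorted(set(expected) - present)


def watch(bridge, expected, interval_s=1.0):
    """Poll the bridge forever and print a UTC timestamp when a port is missing."""
    while True:
        result = subprocess.run(
            ["ovs-vsctl", "list-ports", bridge],
            capture_output=True, text=True, check=False,
        )
        gone = missing_ports(expected, result.stdout)
        if gone:
            stamp = datetime.now(timezone.utc).isoformat()
            print(stamp, "missing:", gone, flush=True)
        time.sleep(interval_s)
```

A call such as `watch("br-mgmt", ["vip0"])` (both names are placeholders) would run until interrupted; the printed timestamps are in UTC, like those in the ovs-vswitchd log.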
    

    Question

    • What else should I do during and after reproducing the issue?
    • Is this a typical issue? If so, what causes it?

    I searched for Open vSwitch troubleshooting and related keywords, but everything I found was at the data-flow/flow-table level rather than at the ovs-vswitchd level (am I right?).

    Many thanks! BR//Wey

    4 replies · 2021-02-24 16:35:17 +08:00
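    #1 · alcarl · 2018-01-14 13:48:37 +08:00 via Android · ❤️ 1
    A guess: the open-files limit in the operating system's limit settings is too low. Under high concurrency the number of connections exceeds that limit, the listening thread gets an error from accept and exits, and so the listener is gone.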
    #2 · littlewey (OP) · 2018-01-14 13:58:15 +08:00
    @alcarl #1 Thanks a lot. However, my "high load" is really heavy bandwidth usage; it does not seem to be a high connection count.
    #3 · littlewey (OP) · 2018-01-17 14:19:33 +08:00
    @alcarl Update: I checked, and no nofile limit is configured on either the host or the guest system. Thanks again.
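
Even with no nofile limit configured, a Linux process still runs with some effective soft and hard limit, so it may be worth reading the values the running daemon actually has. The sketch below assumes a Linux `/proc` filesystem; the helper names are made up for illustration, and the PID has to be looked up separately.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' strings from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', SOFT, HARD, 'files']
            return fields[3], fields[4]
    return None


def open_fd_count(pid):
    """Count the file descriptors a process currently has open (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def read_limits(pid):
    """Read the effective open-files limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())
```

Comparing `open_fd_count(pid)` against the soft limit from `read_limits(pid)` during the high-load test would show whether the process ever gets close to the limit. Reading another user's process this way usually requires root.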
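    #4 · littlewey (OP) · 2021-02-24 16:35:17 +08:00
    Update:

    The earlier issue was not in a production environment. It never reproduced afterwards, so it was left unresolved.

    This year another friend ran into the same problem and came to me, so I took another look. My guess is that the traffic pattern frequently triggered traffic shifts in the OVS DPDK bond, that the jitter from those shifts caused restarts inside the VM (which has a jitter-sensitive restart mechanism), and that the long poll interval was merely a consequence of the VM restart.

    The conditions that trigger a bond traffic shift are as follows: the more-loaded member must differ from the less-loaded one along several dimensions (bandwidth delta, bandwidth ratio, the more-loaded member carrying more than one hash, and the shift lowering the load ratio by at least 0.1).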

    ref: https://docs.openvswitch.org/en/latest/topics/bonding/

    > Bond Packet Output
    > When a packet is sent out a bond port, the bond member actually used is selected based on the packet’s source MAC and VLAN tag (see bond_choose_output_member()). In particular, the source MAC and VLAN tag are hashed into one of 256 values, and that value is looked up in a hash table (the “bond hash”) kept in the bond_hash member of struct port. The hash table entry identifies a bond member. If no bond member has yet been chosen for that hash table entry, vswitchd chooses one arbitrarily.

    > Every 10 seconds, vswitchd rebalances the bond members (see bond_rebalance()). To rebalance, vswitchd examines the statistics for the number of bytes transmitted by each member over approximately the past minute, with data sent more recently weighted more heavily than data sent less recently. It considers each of the members in order from most-loaded to least-loaded. If highly loaded member H is significantly more heavily loaded than the least-loaded member L, and member H carries at least two hashes, then vswitchd shifts one of H’s hashes to L. However, vswitchd will only shift a hash from H to L if it will decrease the ratio of the load between H and L by at least 0.1.

    > Currently, “significantly more loaded” means that H must carry at least 1 Mbps more traffic, and that traffic must be at least 3% greater than L’s.
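
The conditions quoted above can be written as a small predicate. This is one reading of the quoted documentation, not the actual Open vSwitch implementation: the function and constant names are made up, loads are taken in bits per second, and "decrease the ratio by at least 0.1" is interpreted as the H/L load ratio before the move minus the ratio after it.

```python
MIN_DELTA_BPS = 1_000_000    # "H must carry at least 1 Mbps more traffic"
MIN_RELATIVE_EXCESS = 0.03   # "at least 3% greater than L's"
MIN_RATIO_DROP = 0.1         # "decrease the ratio ... by at least 0.1"


def significantly_more_loaded(h_bps, l_bps):
    """Both thresholds from the quoted text must hold."""
    return (h_bps - l_bps >= MIN_DELTA_BPS
            and h_bps >= l_bps * (1 + MIN_RELATIVE_EXCESS))


def should_shift(h_bps, l_bps, hash_bps, h_hash_count):
    """Decide whether moving one hash carrying hash_bps from H to L qualifies."""
    if h_hash_count < 2:       # H must carry at least two hashes
        return False
    if not significantly_more_loaded(h_bps, l_bps):
        return False
    if hash_bps <= 0:          # moving an idle hash changes nothing
        return False
    if l_bps == 0:             # assumption: an unbounded ratio always improves
        return True
    before = h_bps / l_bps
    after = (h_bps - hash_bps) / (l_bps + hash_bps)
    return before - after >= MIN_RATIO_DROP
```

Under this reading, a member at 10 Mbps next to one at 1 Mbps qualifies for a shift of a 2 Mbps hash, while 100 Mbps next to 98 Mbps does not, because the difference is below 3%. Traffic that hovers around these thresholds could in principle be moved again at each 10-second rebalance, which would fit the jitter hypothesis above.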