
mongodb crashes frequently with errno:24 Too many open files, help needed

    comwrg · 2019-08-02 10:35:43 +08:00 · 14914 views

    Partial log

    2019-08-01T23:59:02.301+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:02.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:03.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:03.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:04.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:04.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:05.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:05.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:06.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:06.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:07.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:07.302+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:08.302+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:08.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:09.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:09.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:10.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:10.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:11.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:11.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:12.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:12.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:13.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:13.303+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:14.303+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:14.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:15.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:15.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:16.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:16.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:17.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:17.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:18.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:18.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:19.295+0800 W NETWORK  [HostnameCanonicalizationWorker] Failed to obtain address information for hostname iZuf61zao4uxbprumx45dlZ: System error
    2019-08-01T23:59:19.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:19.304+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:20.304+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:20.305+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:21.305+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:21.305+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:22.305+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:22.305+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:23.305+0800 I NETWORK  [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
    2019-08-01T23:59:23.305+0800 E NETWORK  [initandlisten] Out of file descriptors. Waiting one second before trying to accept more connections.
    2019-08-01T23:59:23.631+0800 E STORAGE  [thread2] WiredTiger (24) [1564675163:631372][9783:0x7f4e30730700], file:WiredTiger.wt, WT_SESSION.checkpoint: /var/lib/mongodb/WiredTiger.turtle: handle-open: open: Too many open files
    2019-08-01T23:59:23.632+0800 E STORAGE  [thread2] WiredTiger (24) [1564675163:632761][9783:0x7f4e30730700], checkpoint-server: checkpoint server error: Too many open files
    2019-08-01T23:59:23.632+0800 E STORAGE  [thread2] WiredTiger (-31804) [1564675163:632802][9783:0x7f4e30730700], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
    2019-08-01T23:59:23.632+0800 I -        [thread2] Fatal Assertion 28558
    2019-08-01T23:59:23.632+0800 I -        [thread2] 
    
    ***aborting after fassert() failure
    
    
    2019-08-01T23:59:23.638+0800 F -        [thread2] Got signal: 6 (Aborted).
    
    ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 31862
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 65535
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 31862
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
    

    I have set fs.file-max = 2097152 in sysctl.conf.

    It crashes every day, and I really can't figure out the root cause.
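A small sketch of why raising `fs.file-max` may not help here, assuming a Linux host and Python 3. The log reports errno 24, which is `EMFILE` (the per-process descriptor limit), whereas the system-wide table that `fs.file-max` controls is reported as `ENFILE` (errno 23). The helper names `describe_fd_errno` and `parse_file_nr` are hypothetical, introduced only for illustration.

```python
import errno
import os


def describe_fd_errno(code: int) -> str:
    """Map an errno value to the limit it points at."""
    if code == errno.EMFILE:
        return "per-process limit (RLIMIT_NOFILE) reached"
    if code == errno.ENFILE:
        return "system-wide limit (fs.file-max) reached"
    return os.strerror(code)


def parse_file_nr(text: str) -> dict:
    """Parse /proc/sys/fs/file-nr: allocated handles, unused handles, maximum."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


if __name__ == "__main__":
    print(describe_fd_errno(24))  # the errno seen in the mongod log
    with open("/proc/sys/fs/file-nr") as fh:
        print(parse_file_nr(fh.read()))
```

If `allocated` is far below `max`, the system-wide table is not the bottleneck and the per-process limit is the one to look at.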

    Addendum 1  ·  2019-08-02 11:43:12 +08:00
    mongod --version
    db version v3.2.11
    git version: 009580ad490190ba33d1c6253ebd8d91808923e4
    OpenSSL version: OpenSSL 1.0.2s  28 May 2019
    allocator: tcmalloc
    modules: none
    build environment:
        distarch: x86_64
        target_arch: x86_64
    
    22 replies    2020-06-05 13:56:02 +08:00
        1
    KYLINZZ  
       2019-08-02 11:13:25 +08:00   ❤️ 1
        2
    auser  
       2019-08-02 11:16:37 +08:00   ❤️ 1
    I suggest checking /proc/PID/limits to see how many FDs the process can actually open.
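A minimal sketch of that check, assuming Linux and Python 3. It reads `/proc/PID/limits` and pulls out the "Max open files" row. The function names are hypothetical, and the parser assumes the kernel's usual layout in which columns are separated by at least two spaces.

```python
import re
from pathlib import Path


def parse_limits(text: str) -> dict:
    """Return {limit name: (soft, hard)} from the text of /proc/PID/limits."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        limits[fields[0]] = (fields[1], fields[2])
    return limits


def open_file_limit(pid) -> tuple:
    """Soft and hard 'Max open files' values for a running process."""
    text = Path(f"/proc/{pid}/limits").read_text()
    return parse_limits(text)["Max open files"]


if __name__ == "__main__":
    print(open_file_limit("self"))  # replace "self" with the mongod PID
```

The value reported here is the one that applies to the running daemon, which can differ from what `ulimit -a` prints in an interactive shell.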
        3
    comwrg  
    OP
       2019-08-02 11:35:34 +08:00
    @auser


    ```
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             64000                64000                processes
    Max open files            64000                64000                files
    Max locked memory         unlimited            unlimited            bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       31862                31862                signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us
    ```

    From what I can see, this looks fine.
        4
    auser  
       2019-08-02 11:54:04 +08:00   ❤️ 1
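    @comwrg Check the number of TCP connections with ss or netstat, then see whether the mongodb process has too many of them. If it does, use the TCP states of those connections to narrow down where the problem is and what is actually exhausting the file descriptors. A denial-of-service attack, for example, would show up as a large number of empty TCP connections.

    Each network connection takes one file descriptor (fd), and so does each file opened for reading or writing. Judging from the error log, the first error is fd exhaustion: new network connections cannot get an fd, so accept (the system call that accepts new network connections) fails. That by itself is tolerable. But for a database, once files cannot be written to disk and data cannot be persisted, crashing deliberately is the right behavior.

    As for your problem, I think it is very likely that some frequently called code path opens files and never closes them, so fds are never released and the limit is eventually reached. You should now investigate from two angles: the network (as described in the first paragraph) and the /proc/PID/fd/ directory.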
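The network half of this check can be sketched as follows, assuming Linux and Python 3. Instead of shelling out to `ss` or `netstat`, the sketch parses `/proc/net/tcp` and tallies connection states for one local port. Port 27017 is mongod's default and is only an assumption about this deployment; `count_tcp_states` is a hypothetical helper name. IPv6 sockets live in `/proc/net/tcp6`, which uses the same column layout.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str, local_port: int) -> Counter:
    """Count TCP sockets per state whose local port equals local_port."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:6989" (hex IP : hex port)
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    with open("/proc/net/tcp") as fh:
        print(count_tcp_states(fh.read(), 27017))  # assumed mongod port
```

A large ESTABLISHED count points at clients holding connections open, while a pile of CLOSE_WAIT suggests sockets that the local side never closed.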
        5
    est  
       2019-08-02 11:55:10 +08:00
    You've run out of inodes.
        6
    comwrg  
    OP
       2019-08-02 11:58:10 +08:00
        7
    comwrg  
    OP
       2019-08-02 11:59:24 +08:00
    @KYLINZZ
    The version listed there is 2.6.7, which doesn't match mine. That bug is also rather old.
        8
    aaa5838769  
       2019-08-02 12:00:13 +08:00
    This usually means the disk is out of space, or else the inodes are exhausted.
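Both guesses can be ruled in or out quickly. The sketch below, assuming a Unix host and Python 3, uses `os.statvfs` to report free space and free inodes for the filesystem holding a given path. The helper name is hypothetical; `/var/lib/mongodb` is the data path that appears in the log above.

```python
import os


def disk_and_inode_usage(path: str) -> dict:
    """Free and total bytes and inodes for the filesystem containing path."""
    st = os.statvfs(path)
    return {
        "bytes_free": st.f_bavail * st.f_frsize,   # available to unprivileged users
        "bytes_total": st.f_blocks * st.f_frsize,
        "inodes_free": st.f_favail,
        "inodes_total": st.f_files,
    }


if __name__ == "__main__":
    # Point this at the mongod data directory, e.g. /var/lib/mongodb.
    print(disk_and_inode_usage("/"))
```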
        9
    julyclyde  
       2019-08-02 12:03:53 +08:00   ❤️ 1
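    Using ulimit or /etc/security/limits.conf to inspect and change the limits is a classic mistake.

    A background service's rlimit has to be set at the place where the service is started.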
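A minimal sketch of that point, assuming Linux and Python 3. Resource limits are per-process and inherited from whatever launches the process, so the launcher is where they have to be set. Here a parent starts a child with a different soft `RLIMIT_NOFILE` while its own limit stays unchanged. `run_with_nofile` is a hypothetical helper; for a service started by systemd, the equivalent knob would be `LimitNOFILE=` in the unit file, though whether this host uses systemd is an assumption.

```python
import resource
import subprocess
import sys

CHILD_CODE = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"


def run_with_nofile(soft: int) -> int:
    """Start a child with the given soft fd limit; return the limit the child sees."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def set_limit():
        # Runs in the child between fork and exec, so only the child is affected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    out = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        preexec_fn=set_limit, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)


if __name__ == "__main__":
    print("parent:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
    print("child: ", run_with_nofile(64))
```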
        10
    bigpigB  
       2019-08-02 12:49:23 +08:00 via Android
    Raise the ulimit a bit.
        11
    neverfall  
       2019-08-02 13:03:15 +08:00
    Opening without ever closing?
    Remember to close.
        12
    comwrg  
    OP
       2019-08-02 13:59:03 +08:00
    @est @aaa5838769 Neither of those helped.
        13
    comwrg  
    OP
       2019-08-02 13:59:27 +08:00
    @est @aaa5838769 Neither of those is the case.
        14
    comwrg  
    OP
       2019-08-02 14:24:09 +08:00
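    @auser Thank you very much 🙏, I've investigated along the lines you suggested.

    I found that mongodb is holding a lot of fds (24135/38839), more than half of the total.

    ![image](https://user-images.githubusercontent.com/19854253/62348661-efa26b00-b52f-11e9-80be-b1eef07c061b.png)

    Could it really be that the project isn't closing its connections? But the project has been running for several months, and only in the last few days has mongo started crashing frequently from running out of fds.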
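To see what those descriptors actually are, the entries under `/proc/PID/fd` can be grouped by kind. This is a rough sketch, assuming Linux and Python 3, with hypothetical helper names. The `collection-` and `index-` prefixes follow WiredTiger's usual data file naming and are an assumption about this particular deployment.

```python
import os
from collections import Counter


def classify_target(target: str) -> str:
    """Classify the symlink target of one /proc/PID/fd entry."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    name = os.path.basename(target)
    if name.startswith("collection-"):
        return "wt-collection"
    if name.startswith("index-"):
        return "wt-index"
    return "other-file"


def count_fd_kinds(pid) -> Counter:
    """Count open descriptors of a process by kind (needs permission to read them)."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        kinds[classify_target(target)] += 1
    return kinds


if __name__ == "__main__":
    print(count_fd_kinds("self"))  # replace "self" with the mongod PID
```

If sockets dominate, the clients' connection handling is the place to look; if data files dominate, the number of collections and indexes is the more likely driver.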
        15
    comwrg  
    OP
       2019-08-02 14:41:29 +08:00
        16
    auser  
       2019-08-02 14:49:08 +08:00 via iPhone
    docs.mongodb.com/v3.2/core/index-text/

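    I have a vague feeling the problem lies here; my guess is a design issue (misuse of the database). I don't know this database, so this is as far as I can help.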
        17
    comwrg  
    OP
       2019-08-02 14:56:50 +08:00
    @auser OK, thank you very much for the suggestions. I'll keep digging on my own :)
        18
    ilucio  
       2019-08-02 17:48:07 +08:00 via Android
    Set the ulimit to 64000; the official documentation covers this.
        19
    auser  
       2019-08-02 23:23:13 +08:00 via iPhone
    @comwrg

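    If the system load and disk I/O aren't high,
    just raise the file descriptor limit for now.
    Please share the final outcome once you have it.
    The main question is why so many index files are open.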
        20
    comwrg  
    OP
       2019-08-03 13:09:09 +08:00 via Android
    @auser Yes, I've already set it to 200000.
        21
    qq1340691923  
       2020-06-05 13:06:23 +08:00
    @comwrg Come on, share the final result..
        22
    comwrg  
    OP
       2020-06-05 13:56:02 +08:00
    @qq1340691923
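    Mitigation: set the ulimit very high. After I set it to 200000, the problem stopped occurring.
    I still don't know the actual cause, but my guess is that it's due to too many collections (on the order of more than 100,000).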
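A back-of-the-envelope check of that guess, written as a small Python sketch. It assumes WiredTiger's default layout of one data file per collection plus one file per index, with every collection carrying at least the mandatory `_id` index. The result is the number of files on disk, not the number open at any moment, since idle file handles can be closed. The function name and the collection count of 100,000 are illustrative only.

```python
def min_wiredtiger_files(collections: int, secondary_indexes_per_collection: int = 0) -> int:
    """Lower bound on data files: one per collection, one per _id index, one per extra index."""
    return collections * (2 + secondary_indexes_per_collection)


if __name__ == "__main__":
    files = min_wiredtiger_files(100_000)
    for limit in (64_000, 200_000):
        verdict = "could exceed" if files > limit else "fits within"
        print(f"{files} files {verdict} a limit of {limit}")
```

Under these assumptions, a collection count in the hundred-thousand range gives a file count well above the 64000 limit the process was running with, which is at least consistent with the crashes stopping once the limit was raised.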