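Here is the dependency chain as I traced it:
paperless-ngx
Requirement: convert Office files to PDF.
paperless depends on tika (it calls http://tika:9998).
tika depends on gotenberg (it calls http://gotenberg:3000/forms/libreoffice/convert).
gotenberg depends on unoconv (it shells out to /usr/bin/unoconv).
At this point LibreOffice inside the gotenberg container starts in headless mode and listens on TCP 44573. (Why can't it stay resident? Because it leaks memory! I don't know whether to laugh or cry.)
(The first start fails with exit code 81, and it only comes up on gotenberg's second attempt. This is normal behavior! This is somehow normal behavior!? The log prints: got exit code 81, e.g., LibreOffice listener first start.)
unoconv depends on libreoffice (called inside the container via http://localhost:44573).
This is the step where it breaks: libreoffice exits with error code 6, and then the listener process is gone.
Naturally, gotenberg returns 500 Internal Server Error to tika, and then the paperless-ngx web page blows up too.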
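To take paperless-ngx and tika out of the picture, the failing hop can be hit directly. Below is a minimal sketch that posts one Office file to the convert route named above. It assumes the multipart field is called `files` (my understanding of the Gotenberg 7 API, not something stated in this thread) and that the script runs somewhere that can resolve the `gotenberg` hostname. `build_multipart` and `convert` are names made up for illustration.

```python
import uuid
import urllib.error
import urllib.request
from pathlib import Path


def build_multipart(field, filename, payload, boundary):
    """Encode a single file as a multipart/form-data request body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def convert(path, base_url="http://gotenberg:3000", timeout=600):
    """POST one file to the LibreOffice convert route.

    Returns (status_code, response_body). The field name "files" is an
    assumption about the Gotenberg 7 API; adjust it if your version differs.
    """
    src = Path(path)
    boundary = uuid.uuid4().hex
    body = build_multipart("files", src.name, src.read_bytes(), boundary)
    request = urllib.request.Request(
        f"{base_url}/forms/libreoffice/convert",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as exc:
        # A 4xx/5xx answer still carries a body worth reading.
        return exc.code, exc.read()
```

Calling `convert("test.docx")` should hand back a status code and either PDF bytes or the error body. If it reports 500 here as well, the problem sits in gotenberg or below it rather than in paperless-ngx.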
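paperless-ngx currently depends on gotenberg 7.8.
The issues mention that rolling gotenberg back to 7.4 fixes the problem, but the current latest paperless-ngx can no longer use gotenberg 7.4: it returns 400 because the endpoint name changed. So the only approach that works right now is to roll both back together (so dumb).
Also, unoconv is deprecated in this project (the replacement library is Unoserver).
Only when I traced the problem down to LibreOffice did I find the real treasure: there is actually a versioning scheme that looks like 7.6.0 (7.3.5.2 30(Build:2)).
This issue lists roughly which versions currently work and which don't: https://github.com/gotenberg/gotenberg/issues/576
The libraries that depend on it are really something: https://github.com/gotenberg/gotenberg/commit/e4a553c022ff076bbd3a1c5e59cc45172b461771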
Next lines are not currently relevant, as latest versions of LibreOffice (i.e., > 7.0.4) are causing troubles.
Sharing my current docker-compose file in the hope that some expert can fix it.
version: '3'
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - /data/paperless-ngx1/redis:/data
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - broker
      - gotenberg
      - tika
    ports:
      - 38000:8000
    healthcheck:
      test: ["CMD", "curl", "-fs", "-S", "--max-time", "2", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - ./apt.list:/etc/apt/sources.list
      - /data/paperless-ngx1/data:/usr/src/paperless/data
      - /data/paperless-ngx1/media:/usr/src/paperless/media
      - /data/paperless-ngx1/export:/usr/src/paperless/export
      - /data/paperless-ngx1/consume:/usr/src/paperless/consume
    env_file: paperless1.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
  gotenberg:
    image: docker.io/gotenberg/gotenberg:7.8
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      # - "--log-level=debug"
      - "--api-timeout=600s"
      # - "--uno-listener-start-timeout=180s"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
  tika:
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped
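Before blaming any single piece, it can help to confirm that each hop answers at all. This is a minimal sketch, assuming it runs from a container attached to the same compose network (the compose file publishes only port 8000 of the webserver). It also assumes Tika's `GET /version` and Gotenberg's `GET /health` routes, which are my understanding of those servers rather than something from this thread. `ENDPOINTS` and `probe` are names made up for illustration.

```python
import urllib.error
import urllib.request

# Hostnames and ports come from the compose file; the paths are assumptions.
ENDPOINTS = {
    "tika": "http://tika:9998/version",
    "gotenberg": "http://gotenberg:3000/health",
}


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (status_code, detail); status_code is None if nothing answered."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, ""
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx status.
        return exc.code, str(exc.reason)
    except OSError as exc:
        # Covers URLError: DNS failure, refused connection, timeout.
        return None, str(exc)
```

Looping over `ENDPOINTS.items()` and printing `probe(url)` for each gives a quick picture of which service is reachable. A healthy answer only shows the HTTP layer is up. It does not prove that a LibreOffice conversion will succeed.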
# paperless1.env (the file referenced by env_file in the compose file)
# The UID and GID of the user used to run paperless in the container. Set this
# to your UID and GID on the host so that you have write access to the
# consumption directory.
USERMAP_UID=1011
USERMAP_GID=1011
# Additional languages to install for text recognition, separated by a
# whitespace. Note that this is
# different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines the
# language used for OCR.
# The container installs English, German, Italian, Spanish and French by
# default.
# See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
# for available languages.
PAPERLESS_OCR_LANGUAGES=chi-sim
###############################################################################
# Paperless-specific settings #
###############################################################################
# All settings defined in the paperless.conf.example can be used here. The
# Docker setup does not use the configuration file.
# A few commonly adjusted settings are provided below.
# This is required if you will be exposing Paperless-ngx on a public domain
# (if doing so please consider security measures such as reverse proxy)
PAPERLESS_URL=https://d******e
# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
PAPERLESS_SECRET_KEY=helloworld
# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
PAPERLESS_TIME_ZONE=Asia/Shanghai
# The default language to use for OCR. Set this to the language most of your
# documents are written in.
PAPERLESS_OCR_LANGUAGE=chi_sim
# Set if accessing paperless via a domain subpath e.g. https://domain.com/PATHPREFIX and using a reverse-proxy like traefik or nginx
#PAPERLESS_FORCE_SCRIPT_NAME=/PATHPREFIX
#PAPERLESS_STATIC_URL=/PATHPREFIX/static/ # trailing slash required
Seriously, could some C++ expert please rescue this library? Does anyone actually use LibreOffice?
1
evilStart 2023-02-17 01:16:12 +08:00 via Android 2
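If you just want to convert documents, use pandoc directly. It is lightweight and controllable, and it does not have these problems.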
2
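Licsber OP @evilStart #1 Is it possible that I just want to use paperless itself? haha
If it were only about converting documents myself, anything would do. The main issue is that this software does not support creating an empty document record, and it does not support replacing a file or replacing the original file either. I roughly worked out how the software operates: after the original file is uploaded, it is compressed and normalized into a PDF, then a WebP thumbnail is generated, and at the same time everything in the original file is run through OCR (tesseract ocr) and saved as metadata. I really ought to contribute some code to add a manual file-replacement feature, but unfortunately I haven't had the energy lately.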