Troubleshooting an Abnormal Process Exit in a Python Service TCE Instance

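Conclusion

A worker (child) process failed to boot because the connection to the configuration center failed, and that in turn caused the master process to exit.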

Symptom

An alert fired for an abnormal process exit in a container instance.

Logs

The logs for the corresponding time window contain the following exceptions:


Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 172, in _get_from_bconfig
    return bconfig.get(key, enable_auth=self.enable_auth)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 178, in get
    return get_client().get(key, enable_auth=enable_auth)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 101, in get
    return self._decode_response(resp, key)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 120, in _decode_response
    raise ResponseError(self.addr, "response: %s : %s" % (error_code, message))
bytedtcc.v2.bconfig.exception.ResponseError: server: http://[fdbd:dc02:2:173::156]:10935, ex: response: 50 : ByteKV: GRPC DeadlineExceeded error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 586, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 135, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 96, in load
    app = import_app(self.app_uri)
  File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 131, in import_app
    __import__(module)
  File "/opt/tiger/app/dora/bootstrap.py", line 7, in <module>
    application, settings = bootstrap(__name__, os.path.dirname(__file__))
  File "/opt/tiger/toutiao/lib/sysutil/djangoutil/django_bootstrap.py", line 16, in bootstrap
    import settings
  File "/opt/tiger/app/dora/settings.py", line 7, in <module>
    from django_site.django_settings_use_ciam import *
  File "/opt/tiger/app/dora/django_site/django_settings_use_ciam.py", line 233, in <module>
    from .conf.kms import *
  File "/opt/tiger/app/dora/django_site/conf/kms.py", line 23, in <module>
    kms_config = json.loads(tcc_client.get("kms_config"))
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 93, in get
    value, _ = self._get(key)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 108, in _get
    data = self._get_from_server()
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 162, in _get_from_server
    data_str = self._get_from_bconfig(KEY_DATA_FMT % (self.service_name, self.confspace))
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 179, in _get_from_bconfig
    raise GetBConfigError(key, ex)
bytedtcc.exception.GetBConfigError: Get bconfig error, key: /tcc/v2/data/data.system.dora/default, ex: server: http://[fdbd:dc02:2:173::156]:10935, ex: response: 50 : ByteKV: GRPC DeadlineExceeded error


Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 723, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1016, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    % (self.host, self.timeout),
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 172, in _get_from_bconfig
    return bconfig.get(key, enable_auth=self.enable_auth)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 178, in get
    return get_client().get(key, enable_auth=enable_auth)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 98, in get
    resp = conn.request("GET", url, headers=headers)
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 78, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 99, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    **response_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    **response_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    **response_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 803, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 594, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='fdbd:dc02:19:44:a::203', port=9972): Max retries exceeded with url: /v1/keys/tcc/v2/meta/data.system.dora (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 586, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 135, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 96, in load
    app = import_app(self.app_uri)
  File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 131, in import_app
    __import__(module)
  File "/opt/tiger/app/dora/bootstrap.py", line 7, in <module>
    application, settings = bootstrap(__name__, os.path.dirname(__file__))
  File "/opt/tiger/toutiao/lib/sysutil/djangoutil/django_bootstrap.py", line 16, in bootstrap
    import settings
  File "/opt/tiger/app/dora/settings.py", line 7, in <module>
    from django_site.django_settings_use_ciam import *
  File "/opt/tiger/app/dora/django_site/django_settings_use_ciam.py", line 233, in <module>
    from .conf.kms import *
  File "/opt/tiger/app/dora/django_site/conf/kms.py", line 23, in <module>
    kms_config = json.loads(tcc_client.get("kms_config"))
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 93, in get
    value, _ = self._get(key)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 108, in _get
    data = self._get_from_server()
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 153, in _get_from_server
    meta_str = self._get_from_bconfig(KEY_META_FMT % self.service_name)
  File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 179, in _get_from_bconfig
    raise GetBConfigError(key, ex)
bytedtcc.exception.GetBConfigError: Get bconfig error, key: /tcc/v2/meta/data.system.dora, ex: HTTPConnectionPool(host='fdbd:dc02:19:44:a::203', port=9972): Max retries exceeded with url: /v1/keys/tcc/v2/meta/data.system.dora (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)'))

The call stacks point at Gunicorn, so it is worth taking a closer look at how Gunicorn works.

Gunicorn

Overview

Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for UNIX. It's a pre-fork worker model ported from Ruby's Unicorn project.

Architecture

Server Model


            +-----------------------+
            |     Master Process    |
            | (Signal-Driven Loop)  |
            +-----------------------+
                       |
                       | manages
                       ▼
+-------------------------------------------------+
| Workers Pool                                    |
| +-----------+ +-----------+ +-----------+       |
| | Worker 1  | | Worker 2  | | Worker N  | ...   |
| | (Process) | | (Process) | | (Process) |       |
| +-----------+ +-----------+ +-----------+       |
+-------------------------------------------------+

Server Model

Gunicorn is based on the pre-fork worker model. This means that there is a central master process that manages a set of worker processes. The master never knows anything about individual clients. All requests and responses are handled completely by worker processes.

Master

The master process is a simple loop that listens for various process signals and reacts accordingly. It manages the list of running workers by listening for signals like TTIN, TTOU, and CHLD. TTIN and TTOU tell the master to increase or decrease the number of running workers. CHLD indicates that a child process has terminated. In this case, the master process automatically restarts the failed worker.

Reference: docs.gunicorn.org/en/stable/d...

Signal Handling

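Signal	Master process	Worker process
QUIT	Quick shutdown of the master and all workers	Quick shutdown of this worker
INT	Quick shutdown of the master and all workers (same as QUIT)	Quick shutdown of this worker
TERM	Graceful shutdown: waits for workers to finish their current requests, up to graceful_timeout	Graceful shutdown of this worker
HUP	Reloads the configuration, starts new workers, and gracefully shuts down the old ones (if preload_app is not enabled, the new version of the application is loaded as well)	--
TTIN	Increases the number of workers by one	--
TTOU	Decreases the number of workers by one	--
USR1	Reopens all log files	Reopens this worker's log files
USR2	Upgrades on the fly: starts a new master and keeps the old one, which must be stopped manually with TERM	--
WINCH	Gracefully shuts down all workers if the master is daemonized	--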

Master process

QUIT, INT: Quick shutdown

TERM: Graceful shutdown. Waits for workers to finish their current requests up to the graceful_timeout.

HUP: Reload the configuration, start the new worker processes with a new configuration and gracefully shutdown older workers. If the application is not preloaded (using the preload_app option), Gunicorn will also load the new version of it.

TTIN: Increment the number of processes by one

TTOU: Decrement the number of processes by one

USR1: Reopen the log files

USR2: Upgrade Gunicorn on the fly. A separate TERM signal should be used to kill the old master process. This signal can also be used to use the new versions of pre-loaded applications. See Upgrading to a new binary on the fly for more information.

WINCH: Gracefully shutdown the worker processes when Gunicorn is daemonized.

Worker process

Sending signals directly to the worker processes should not normally be needed. If the master process is running, any exited worker will be automatically respawned.

QUIT, INT: Quick shutdown

TERM: Graceful shutdown

USR1: Reopen the log files

Reference: docs.gunicorn.org/en/stable/s...
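
As a small illustration of the signal table above, the following sketch maps a few operator actions to the signal the master process reacts to and sends it with os.kill. The names MASTER_SIGNALS and signal_master are hypothetical helpers invented for this example and are not part of Gunicorn; only the signal semantics come from the documentation quoted above.

```python
import os
import signal

# Actions from the signal table above, mapped to the signal sent to the master.
# The action names are made up for this sketch.
MASTER_SIGNALS = {
    "add_worker": signal.SIGTTIN,
    "remove_worker": signal.SIGTTOU,
    "reload": signal.SIGHUP,
    "graceful_stop": signal.SIGTERM,
    "quick_stop": signal.SIGQUIT,
    "reopen_logs": signal.SIGUSR1,
}


def signal_master(master_pid, action):
    """Send the signal for `action` to the Gunicorn master process."""
    try:
        sig = MASTER_SIGNALS[action]
    except KeyError:
        raise ValueError("unknown action: %r" % action) from None
    os.kill(master_pid, sig)
    return sig
```

A call such as signal_master(master_pid, "add_worker") would ask the master to start one more worker, where master_pid is whatever PID the master actually has (for example, read from its pidfile).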

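Flow

Reading a flame graph is a particularly quick way to learn the execution flow. The source code below is taken from the latest Gunicorn master branch. The flow is as follows: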

BaseApplication.run

Arbiter.run

Arbiter.manage_workers

Arbiter.spawn_workers

Arbiter.spawn_worker

Worker.init_process

Master Process

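Arbiter.run -- main loop of the master process

In this method, start sets up signal handling and event listeners, manage_workers keeps the configured number of worker processes alive (spawning missing workers and terminating surplus ones), and murder_workers terminates workers that have timed out (for how a timeout is detected, see "Configuration" -> "timeout" below).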


class Arbiter: 

    def run(self):
        "Main master loop."
        # Set up signal handling / event listeners
        self.start()
        util._setproctitle("master [%s]" % self.proc_name)
    
        try:
            self.manage_workers()
    
            while True:
                self.maybe_promote_master()
    
                sig = self.SIG_QUEUE.pop(0) if self.SIG_QUEUE else None
                if sig is None:
                    self.sleep()
                    # Kill workers that have timed out
                    self.murder_workers()
                    # Spawn missing workers and
                    # terminate surplus workers
                    self.manage_workers()
                    continue
    
                if sig not in self.SIG_NAMES:
                    self.log.info("Ignoring unknown signal: %s", sig)
                    continue
    
                signame = self.SIG_NAMES.get(sig)
                handler = getattr(self, "handle_%s" % signame, None)
                if not handler:
                    self.log.error("Unhandled signal: %s", signame)
                    continue
                self.log.info("Handling signal: %s", signame)
                handler()
                self.wakeup()
        except (StopIteration, KeyboardInterrupt):
            self.halt()
        except HaltServer as inst:
            self.halt(reason=inst.reason, exit_status=inst.exit_status)
        except SystemExit:
            raise
        except Exception:
            self.log.error("Unhandled exception in main loop", exc_info=True)
            self.stop(False)
            if self.pidfile is not None:
                self.pidfile.unlink()
            sys.exit(-1)
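
The dispatch part of this loop (pop a signal from the queue, look up handle_<name>, call it) can be modelled without any real processes. The sketch below is a toy model under that assumption: MiniArbiter, sig_queue, handled, and process_pending are names made up for this example, and the handlers only adjust a counter instead of managing workers.

```python
import signal


class MiniArbiter:
    """Toy model of the signal dispatch in Arbiter.run (no real processes)."""

    # Map signal numbers to the lowercase suffix used in handler names,
    # e.g. signal.SIGTTIN -> "ttin".
    SIG_NAMES = {
        getattr(signal, "SIG" + name): name.lower()
        for name in ("HUP", "QUIT", "INT", "TERM", "TTIN", "TTOU",
                     "USR1", "USR2", "WINCH")
    }

    def __init__(self, num_workers=1):
        self.num_workers = num_workers
        self.sig_queue = []  # signals recorded by the (omitted) signal handler
        self.handled = []    # names of the signals that were dispatched

    def handle_ttin(self):
        self.num_workers += 1

    def handle_ttou(self):
        # Never drop below one worker.
        if self.num_workers > 1:
            self.num_workers -= 1

    def process_pending(self):
        """Drain the queue, dispatching each signal to handle_<name> if defined."""
        while self.sig_queue:
            sig = self.sig_queue.pop(0)
            name = self.SIG_NAMES.get(sig)
            handler = getattr(self, "handle_%s" % name, None) if name else None
            if handler is None:
                continue  # unknown or unhandled signal: skip it
            handler()
            self.handled.append(name)


arbiter = MiniArbiter(num_workers=1)
arbiter.sig_queue += [signal.SIGTTIN, signal.SIGTTIN, signal.SIGTTOU, signal.SIGUSR1]
arbiter.process_pending()
print(arbiter.num_workers, arbiter.handled)
```

Here two TTIN signals and one TTOU leave the toy arbiter with two workers, and USR1 is skipped because this sketch defines no handle_usr1.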

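Arbiter.start -- set up signal handling and event listeners

The core of this method is setting up signal handling and event listeners. SIGCHLD (which indicates that a child process has terminated) is used to handle worker exits and so keep the WORKERS mapping up to date; worker processes are spawned or terminated based on that mapping.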


class Arbiter: 

    def start(self):
        """\
        Initialize the arbiter. Start listening and set pidfile if needed.
        """
        self.log.info("Starting gunicorn %s", __version__)
    
        if 'GUNICORN_PID' in os.environ:
            self.master_pid = int(os.environ.get('GUNICORN_PID'))
            self.proc_name = self.proc_name + ".2"
            self.master_name = "Master.2"
    
        self.pid = os.getpid()
        if self.cfg.pidfile is not None:
            pidname = self.cfg.pidfile
            if self.master_pid != 0:
                pidname += ".2"
            self.pidfile = Pidfile(pidname)
            self.pidfile.create(self.pid)
        self.cfg.on_starting(self)
        # Set up signal handling
        self.init_signals()
    
        if not self.LISTENERS:
            fds = None
            listen_fds = systemd.listen_fds()
            if listen_fds:
                self.systemd = True
                fds = range(systemd.SD_LISTEN_FDS_START, systemd.SD_LISTEN_FDS_START + listen_fds)
    
            elif self.master_pid:
                fds = []
                for fd in os.environ.pop('GUNICORN_FD').split(','):
                    fds.append(int(fd))
    
            if not (self.cfg.reuse_port and hasattr(socket, 'SO_REUSEPORT')):
                self.LISTENERS = sock.create_sockets(self.cfg, self.log, fds)
    
        listeners_str = ",".join([str(lnr) for lnr in self.LISTENERS])
        self.log.debug("Arbiter booted")
        self.log.info("Listening at: %s (%s)", listeners_str, self.pid)
        self.log.info("Using worker: %s", self.cfg.worker_class_str)
        systemd.sd_notify("READY=1\nSTATUS=Gunicorn arbiter booted", self.log)
    
        # check worker class requirements
        if hasattr(self.worker_class, "check_config"):
            self.worker_class.check_config(self.cfg, self.log)
    
        self.cfg.when_ready(self)
        

    def init_signals(self):
        """\
        Initialize master signal handling. Most of the signals
        are queued. Child signals only wake up the master.
        """
        # close old PIPE
        for p in self.PIPE:
            os.close(p)
    
        # initialize the pipe
        self.PIPE = pair = os.pipe()
        for p in pair:
            util.set_non_blocking(p)
            util.close_on_exec(p)
    
        self.log.close_on_exec()
    
        # initialize all signals
        for s in self.SIGNALS:
            signal.signal(s, self.signal)
        
        # Watch for child process exits
        signal.signal(signal.SIGCHLD, self.handle_chld)
    
    
    def handle_chld(self, sig, frame):
        "SIGCHLD handling"
        self.reap_workers()
        self.wakeup()
        
    
    def reap_workers(self):
        """\
        Reap workers to avoid zombie processes
        """
        try:
            while True:
                # status is the child's raw wait status
                wpid, status = os.waitpid(-1, os.WNOHANG)
                if not wpid:
                    break
                if self.reexec_pid == wpid:
                    self.reexec_pid = 0
                else:
                    # A worker was terminated. If the termination reason was
                    # that it could not boot, we'll shut it down to avoid
                    # infinite start/stop cycles.
                    exitcode = status >> 8
                    if exitcode != 0:
                        self.log.error('Worker (pid:%s) exited with code %s', wpid, exitcode)
                    # Worker boot failure:
                    # WORKER_BOOT_ERROR = 3
                    # A flag indicating if a worker failed to
                    # boot. If a worker process exits with
                    # this error code, the arbiter will terminate.
                    if exitcode == self.WORKER_BOOT_ERROR:
                        reason = "Worker failed to boot."
                        raise HaltServer(reason, self.WORKER_BOOT_ERROR)
                    if exitcode == self.APP_LOAD_ERROR:
                        reason = "App failed to load."
                        raise HaltServer(reason, self.APP_LOAD_ERROR)
    
                    if exitcode > 0:
                        # If the exit code of the worker is greater than 0,
                        # let the user know.
                        self.log.error("Worker (pid:%s) exited with code %s.", wpid, exitcode)
                    elif status > 0:
                        # If the exit code of the worker is 0 and the status
                        # is greater than 0, then it was most likely killed
                        # via a signal.
                        try:
                            sig_name = signal.Signals(status).name
                        except ValueError:
                            sig_name = "code {}".format(status)
                        msg = "Worker (pid:{}) was sent {}!".format(wpid, sig_name)
    
                        # Additional hint for SIGKILL
                        if status == signal.SIGKILL:
                            msg += " Perhaps out of memory?"
                        self.log.error(msg)
    
                    worker = self.WORKERS.pop(wpid, None)
                    if not worker:
                        continue
                    worker.tmp.close()
                    self.cfg.child_exit(self, worker)
        except OSError as e:
            if e.errno != errno.ECHILD:
                raise
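
The expression status >> 8 in reap_workers relies on how POSIX packs a wait status: the exit code sits in the high byte and a terminating signal in the low bits. As a minimal sketch, the standard library helpers in os decode the same information; describe_wait_status is a hypothetical name used only for this example.

```python
import os
import signal


def describe_wait_status(status):
    """Turn the raw status from os.waitpid into a readable description."""
    if os.WIFEXITED(status):
        # Normal exit: the exit code lives in the high byte (status >> 8).
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        # Killed by a signal: the signal number lives in the low bits.
        return "killed by %s" % signal.Signals(os.WTERMSIG(status)).name
    return "unknown status %d" % status


print(describe_wait_status(3 << 8))               # a worker that called sys.exit(3)
print(describe_wait_status(int(signal.SIGKILL)))  # a worker killed with SIGKILL
```

On Linux the first call reports exit code 3, the value of WORKER_BOOT_ERROR, and the second reports SIGKILL.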

Arbiter.manage_workers / Arbiter.murder_workers -- manage worker processes

A worker that fails during boot causes the master process to exit (the log file then contains the line "Exception in worker process").


class Arbiter: 

    def manage_workers(self):
        """\
        Maintain the number of workers by spawning or killing
        as required.
        """
        if len(self.WORKERS) < self.num_workers:
            # Spawn the missing workers
            self.spawn_workers()
    
        workers = self.WORKERS.items()
        workers = sorted(workers, key=lambda w: w[1].age)
        while len(workers) > self.num_workers:
            (pid, _) = workers.pop(0)
            # Terminate surplus workers
            self.kill_worker(pid, signal.SIGTERM)
    
        active_worker_count = len(workers)
        if self._last_logged_active_worker_count != active_worker_count:
            self._last_logged_active_worker_count = active_worker_count
            self.log.debug("{0} workers".format(active_worker_count), extra={"metric": "gunicorn.workers", "value": active_worker_count, "mtype": "gauge"})
                                  
    def spawn_workers(self):
        """\
        Spawn new workers as needed.

        This is where a worker process leaves the main loop
        of the master process.
        """

        for _ in range(self.num_workers - len(self.WORKERS)):
            self.spawn_worker()
            time.sleep(0.1 * random.random())
            
    def spawn_worker(self):
        self.worker_age += 1
        worker = self.worker_class(self.worker_age, self.pid, self.LISTENERS, self.app, self.timeout / 2.0, self.cfg, self.log)
        self.cfg.pre_fork(self, worker)
        
        # Fork the child process
        pid = os.fork()

        # Master process path
        if pid != 0:
            worker.pid = pid
            self.WORKERS[pid] = worker
            return pid
    
        # Do not inherit the temporary files of other workers
        for sibling in self.WORKERS.values():
            sibling.tmp.close()
    
        # Child (worker) process path
        worker.pid = os.getpid()
        try:
            util._setproctitle("worker [%s]" % self.proc_name)
            self.log.info("Booting worker with pid: %s", worker.pid)
            if self.cfg.reuse_port:
                worker.sockets = sock.create_sockets(self.cfg, self.log)
            self.cfg.post_fork(self, worker)
            worker.init_process()
            sys.exit(0)
        except SystemExit:
            raise
        except AppImportError as e:
            self.log.debug("Exception while loading the application", exc_info=True)
            print("%s" % e, file=sys.stderr)
            sys.stderr.flush()
            sys.exit(self.APP_LOAD_ERROR)
        except Exception:
            self.log.exception("Exception in worker process")
            if not worker.booted:
                # Worker boot failure:
                # WORKER_BOOT_ERROR = 3
                # A flag indicating if a worker failed to
                # boot. If a worker process exits with
                # this error code, the arbiter will terminate.
                sys.exit(self.WORKER_BOOT_ERROR)
            sys.exit(-1)
        finally:
            self.log.info("Worker exiting (pid: %s)", worker.pid)
            try:
                worker.tmp.close()
                self.cfg.worker_exit(self, worker)
            except Exception:
                self.log.warning("Exception during worker exit:\n%s", traceback.format_exc())
                
    def murder_workers(self):
        """\
        Kill unused/idle workers
        """
        if not self.timeout:
            return
        workers = list(self.WORKERS.items())
        for (pid, worker) in workers:
            try:
                if time.monotonic() - worker.tmp.last_update() <= self.timeout:
                    continue
            except (OSError, ValueError):
                continue
    
            if not worker.aborted:
                self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
                worker.aborted = True
                self.kill_worker(pid, signal.SIGABRT)
            else:
                self.kill_worker(pid, signal.SIGKILL)
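
The interaction between spawn_worker and reap_workers can be reproduced in a few lines: a forked child that fails while loading the application exits with code 3, and the parent sees that code when it reaps the child. This is a minimal sketch, assuming a POSIX system; load_app and spawn_and_reap are hypothetical names, and the simulated failure only stands in for the configuration fetch that failed in the logs.

```python
import os

WORKER_BOOT_ERROR = 3  # same value the Gunicorn source uses for a failed boot


def load_app():
    """Hypothetical stand-in for load_wsgi(): fails like the config fetch did."""
    raise RuntimeError("config center unreachable")


def spawn_and_reap(load):
    """Fork one child that runs `load`; return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: never return into the caller's code, always leave via os._exit.
        code = 0
        try:
            load()
        except Exception:
            code = WORKER_BOOT_ERROR
        finally:
            os._exit(code)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


exit_code = spawn_and_reap(load_app)
if exit_code == WORKER_BOOT_ERROR:
    print("Worker failed to boot; a real master would halt here.")
```

In Gunicorn itself the parent raises HaltServer at this point, which is why one failed boot takes the whole instance down instead of being retried forever.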

Worker Process

Worker.init_process -- initialize the worker process

The core of this method is twofold: init_signals sets up signal handling, and load_wsgi loads the application.


class Worker:

    def init_process(self):
        """\
        If you override this method in a subclass, the last statement
        in the function should be to call this method with
        super().init_process() so that the ``run()`` loop is initiated.
        """

        # set environment' variables
        if self.cfg.env:
            for k, v in self.cfg.env.items():
                os.environ[k] = v
    
        util.set_owner_process(self.cfg.uid, self.cfg.gid, initgroups=self.cfg.initgroups)
    
        # Reseed the random number generator
        util.seed()
    
        # For waking ourselves up
        self.PIPE = os.pipe()
        for p in self.PIPE:
            util.set_non_blocking(p)
            util.close_on_exec(p)
    
        # Prevent fd inheritance
        for s in self.sockets:
            util.close_on_exec(s)
        util.close_on_exec(self.tmp.fileno())
    
        self.wait_fds = self.sockets + [self.PIPE[0]]
    
        self.log.close_on_exec()
    
        self.init_signals()
    
        # start the reloader
        if self.cfg.reload:
            def changed(fname):
                self.log.info("Worker reloading: %s modified", fname)
                self.alive = False
                os.write(self.PIPE[1], b"1")
                self.cfg.worker_int(self)
                time.sleep(0.1)
                sys.exit(0)
    
            reloader_cls = reloader_engines[self.cfg.reload_engine]
            self.reloader = reloader_cls(extra_files=self.cfg.reload_extra_files, callback=changed)
    
        self.load_wsgi()
        if self.reloader:
            self.reloader.start()
    
        self.cfg.post_worker_init(self)
    
        # Enter main run loop
        self.booted = True
        self.run()

Worker.init_signals -- set up signal handling


class Worker:

    def init_signals(self):
        # reset signaling
        for s in self.SIGNALS:
            signal.signal(s, signal.SIG_DFL)
        # init new signaling
        signal.signal(signal.SIGQUIT, self.handle_quit)
        signal.signal(signal.SIGTERM, self.handle_exit)
        signal.signal(signal.SIGINT, self.handle_quit)
        signal.signal(signal.SIGWINCH, self.handle_winch)
        signal.signal(signal.SIGUSR1, self.handle_usr1)
        # When the master decides a worker has timed out, it sends SIGABRT to that worker
        signal.signal(signal.SIGABRT, self.handle_abort)
    
        # Don't let SIGTERM and SIGUSR1 disturb active requests
        # by interrupting system calls
        signal.siginterrupt(signal.SIGTERM, False)
        signal.siginterrupt(signal.SIGUSR1, False)
    
        if hasattr(signal, 'set_wakeup_fd'):
            signal.set_wakeup_fd(self.PIPE[1])

AsyncWorker.handle_request -- handle a request


class AsyncWorker:

    def handle_request(self, listener_name, req, sock, addr):
        request_start = datetime.now()
        environ = {}
        resp = None
        try:
            self.cfg.pre_request(self, req)
            resp, environ = wsgi.create(req, sock, addr,
                                        listener_name, self.cfg)
            environ["wsgi.multithread"] = True
            self.nr += 1
            # Apply the max_requests setting: after serving the configured
            # number of requests, the worker is shut down and respawned
            if self.nr >= self.max_requests:
                if self.alive:
                    self.log.info("Autorestarting worker after current request.")
                    self.alive = False
    
            if not self.alive or not self.cfg.keepalive:
                resp.force_close()
    
            respiter = self.wsgi(environ, resp.start_response)
            if self.is_already_handled(respiter):
                return False
            try:
                if isinstance(respiter, environ['wsgi.file_wrapper']):
                    resp.write_file(respiter)
                else:
                    for item in respiter:
                        resp.write(item)
                resp.close()
            finally:
                request_time = datetime.now() - request_start
                self.log.access(resp, req, environ, request_time)
                if hasattr(respiter, "close"):
                    respiter.close()
            if resp.should_close():
                raise StopIteration()
        except StopIteration:
            raise
        except OSError:
            # If the original exception was a socket.error we delegate
            # handling it to the caller (where handle() might ignore it)
            util.reraise(*sys.exc_info())
        except Exception:
            if resp and resp.headers_sent:
                # If the requests have already been sent, we should close the
                # connection to indicate the error.
                self.log.exception("Error handling request")
                try:
                    sock.shutdown(socket.SHUT_RDWR)
                    sock.close()
                except OSError:
                    pass
                raise StopIteration()
            raise
        finally:
            try:
                self.cfg.post_request(self, req, environ, resp)
            except Exception:
                self.log.exception("Exception in post_request hook")
        return True
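
The request counting in handle_request, together with the jitter computed in the worker constructor, amounts to a small budget per worker. The sketch below isolates that logic; RequestBudget and on_request are hypothetical names for this example, and no real requests are served.

```python
import random
import sys


class RequestBudget:
    """Tracks how many requests a worker may serve before it is recycled."""

    def __init__(self, max_requests, max_requests_jitter=0):
        if max_requests > 0:
            # Jitter spreads restarts out so workers do not all recycle at once.
            self.limit = max_requests + random.randint(0, max_requests_jitter)
        else:
            self.limit = sys.maxsize  # 0 disables recycling
        self.served = 0
        self.alive = True

    def on_request(self):
        """Count one request; return False once the worker should stop."""
        self.served += 1
        if self.served >= self.limit:
            self.alive = False
        return self.alive


budget = RequestBudget(max_requests=3)
print([budget.on_request() for _ in range(4)])
```

With max_requests=3 and no jitter, the third request flips alive to False, so the worker finishes that request and then exits; the master respawns it, which is another point at which a boot can fail.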

Configuration

timeout

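Worker.__init__ -- create the synchronization file

When a worker process is initialized it creates a tmp file that it is required to update periodically. The master process compares the file's last update time against timeout and terminates any worker whose update is overdue.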


class Worker:

    def __init__(self, age, ppid, sockets, app, timeout, cfg, log):
        """\
        This is called pre-fork so it shouldn't do anything to the
        current process. If there's a need to make process wide
        changes you'll want to do that in ``self.init_process()``.
        """
        self.age = age
        self.pid = "[booting]"
        self.ppid = ppid
        self.sockets = sockets
        self.app = app
        self.timeout = timeout
        self.cfg = cfg
        self.booted = False
        self.aborted = False
        self.reloader = None

        self.nr = 0

        if cfg.max_requests > 0:
            jitter = randint(0, cfg.max_requests_jitter)
            self.max_requests = cfg.max_requests + jitter
        else:
            self.max_requests = sys.maxsize

        self.alive = True
        self.log = log
        # Tmp file used as the heartbeat
        self.tmp = WorkerTmp(cfg)

  
class WorkerTmp:

    def __init__(self, cfg):
        old_umask = os.umask(cfg.umask)
        fdir = cfg.worker_tmp_dir
        if fdir and not os.path.isdir(fdir):
            raise RuntimeError("%s doesn't exist. Can't create workertmp." % fdir)
        fd, name = tempfile.mkstemp(prefix="wgunicorn-", dir=fdir)
        os.umask(old_umask)

        # change the owner and group of the file if the worker will run as
        # a different user or group, so that the worker can modify the file
        if cfg.uid != os.geteuid() or cfg.gid != os.getegid():
            util.chown(name, cfg.uid, cfg.gid)

        # unlink the file so we don't leak temporary files
        try:
            if not IS_CYGWIN:
                util.unlink(name)
            # In Python 3.8, open() emits RuntimeWarning if buffering=1 for binary mode.
            # Because we never write to this file, pass 0 to switch buffering off.
            self._tmp = os.fdopen(fd, 'w+b', 0)
        except Exception:
            os.close(fd)
            raise

Arbiter.murder_workers -- kill worker processes whose heartbeat has timed out


class Arbiter: 

    def murder_workers(self):
        """\
        Kill unused/idle workers
        """
        if not self.timeout:
            return
        workers = list(self.WORKERS.items())
        for (pid, worker) in workers:
            try:
                # Use the file's last update time to decide whether the worker has timed out
                if time.monotonic() - worker.tmp.last_update() <= self.timeout:
                    continue
            except (OSError, ValueError):
                continue
    
            if not worker.aborted:
                self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
                worker.aborted = True
                # Terminate the timed-out worker
                self.kill_worker(pid, signal.SIGABRT)
            else:
                self.kill_worker(pid, signal.SIGKILL)
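
The settings that drive this behavior live in the Gunicorn configuration file. The fragment below is only an illustration: the setting names are real Gunicorn settings, but every value is an assumption chosen for the example and not the configuration of the service discussed here.

```python
# gunicorn.conf.py -- illustrative values only, not the service's real settings
workers = 4                # number of worker processes kept alive by the master
timeout = 30               # seconds of heartbeat silence before a worker is killed
graceful_timeout = 30      # seconds workers get to finish requests on TERM
max_requests = 1000        # recycle a worker after this many requests (0 = never)
max_requests_jitter = 100  # random extra per worker so restarts are staggered
```

Every worker killed by timeout or recycled by max_requests is replaced by a freshly booted one, so each of these settings also determines how often the boot path runs.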

Worker.notify / WorkerTmp.notify -- refresh the file's update time


class Worker:

    def notify(self):
        """\
        Your worker subclass must arrange to have this method called
        once every ``self.timeout`` seconds. If you fail in accomplishing
        this task, the master process will murder your workers.
        """
        self.tmp.notify()
        

class WorkerTmp:

    def notify(self):
        new_time = time.monotonic()
        # Refresh the last update time of this worker's file
        os.utime(self._tmp.fileno(), (new_time, new_time))
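
Put together, the tmp file acts as a heartbeat: the worker stamps it with a monotonic timestamp and the master reads the timestamp back. The following is a simplified sketch of that idea, assuming a platform where os.utime accepts a file descriptor (such as Linux); Heartbeat and is_timed_out are hypothetical names, and reading st_mtime is this sketch's way of recovering the stamp.

```python
import os
import tempfile
import time


class Heartbeat:
    """Simplified model of the worker tmp file used as a heartbeat."""

    def __init__(self):
        # Anonymous temp file; only its timestamps matter, nothing is written.
        self._tmp = tempfile.TemporaryFile()

    def notify(self):
        """Worker side: stamp the file with the current monotonic time."""
        now = time.monotonic()
        os.utime(self._tmp.fileno(), (now, now))

    def last_update(self):
        """Master side: read back the time of the last heartbeat."""
        return os.fstat(self._tmp.fileno()).st_mtime

    def close(self):
        self._tmp.close()


def is_timed_out(heartbeat, timeout, now=None):
    """Return True if the last heartbeat is older than `timeout` seconds."""
    if now is None:
        now = time.monotonic()
    return now - heartbeat.last_update() > timeout


hb = Heartbeat()
hb.notify()
print(is_timed_out(hb, timeout=30))  # checked right after a fresh heartbeat
```

Because the check only compares timestamps, a worker that is blocked and never calls notify looks exactly like a dead one, which is what makes murder_workers kill it.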

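Analysis

The logs at the top give two key facts: the master process exited with status code 3, and a worker process failed to boot because the connection to TCC failed. Combined with the logic of Worker.init_process and Arbiter.spawn_worker in the source, the conclusion is that a worker failing to boot causes the master process to exit. A worker boot is triggered in three situations: application startup; a worker being terminated for timing out and then respawned (controlled by the timeout setting); and a worker being terminated after serving the configured number of requests and then respawned (controlled by the max_requests setting).

Recommendations: increase the retry count and timeout, add fallback configuration, and improve the alert message (already done).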
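
The first two recommendations can be sketched as a small wrapper around the configuration fetch. This is only an illustration under stated assumptions: get_config is a hypothetical helper, fetch stands in for a client call such as tcc_client.get, and whether a fallback value is acceptable for a given key (for example a KMS configuration) is a decision for the service owner.

```python
import json
import time


def get_config(fetch, key, default, retries=3, delay=0.2):
    """Fetch and parse a JSON config value, retrying before falling back.

    `fetch` is a hypothetical callable standing in for a config-center
    client call; `default` is the fallback value used when every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return json.loads(fetch(key))
        except Exception:
            if attempt < retries:
                time.sleep(delay * attempt)  # simple linear backoff
    return default


calls = []


def flaky_fetch(key):
    """Demo fetcher: fails twice, then returns a JSON string."""
    calls.append(key)
    if len(calls) < 3:
        raise TimeoutError("config center timed out")
    return '{"region": "demo"}'


print(get_config(flaky_fetch, "kms_config", default={}, delay=0.01))
```

With a wrapper like this, a short outage of the configuration center during import of the settings module would delay the worker boot or fall back to a default, instead of raising and ending with exit code 3.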

Further Reading

Worker Type

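Worker type	Processing	Concurrency model	Use cases	Connection handling	Dependencies / requirements	Notes
Sync	Synchronous	Single process, single thread	Simple applications / debugging	No persistent connections	None	An error affects at most one request; application code must not block
Async	Asynchronous coroutines	Greenlets (Eventlet / Gevent)	Asynchronous applications compatible with greenlets	Persistent connections supported	Requires Eventlet or Gevent; adapter libraries such as psycogreen may be needed	Verify application compatibility; some libraries need monkey-patching
Gthread	Thread pool	Multi-threaded	I/O-bound workloads with long-lived connections	Keep-Alive supported	None	The main loop accepts connections and the thread pool handles requests; suits high-concurrency short tasks
Tornado	Asynchronous	Tornado event loop	Tornado framework applications	Managed natively by the framework	Requires the Tornado framework	Can serve WSGI, but this is not recommended; designed for Tornado asynchronous applications
AsyncIO	Asynchronous	asyncio event loop	asyncio framework applications (e.g. FastAPI)	Depends on the framework	Requires a third-party worker (e.g. uvicorn)	Selected via worker_class; suits native asynchronous frameworks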

Sync Workers

The most basic and the default worker type is a synchronous worker class that handles a single request at a time. This model is the simplest to reason about as any errors will affect at most a single request. Though as we describe below only processing a single request at a time requires some assumptions about how applications are programmed.

sync worker does not support persistent connections - each connection is closed after response has been sent (even if you manually add Keep-Alive or Connection: keep-alive header in your application).

Async Workers

The asynchronous workers available are based on Greenlets (via Eventlet and Gevent). Greenlets are an implementation of cooperative multi-threading for Python. In general, an application should be able to make use of these worker classes with no changes.

For full greenlet support applications might need to be adapted. When using, e.g., Gevent and Psycopg it makes sense to ensure psycogreen is installed and setup.

Other applications might not be compatible at all as they, e.g., rely on the original unpatched behavior.

Gthread Workers

The worker gthread is a threaded worker. It accepts connections in the main loop. Accepted connections are added to the thread pool as a connection job. On keepalive connections are put back in the loop waiting for an event. If no event happens after the keepalive timeout, the connection is closed.

Tornado Workers

There's also a Tornado worker class. It can be used to write applications using the Tornado framework. Although the Tornado workers are capable of serving a WSGI application, this is not a recommended configuration.

AsyncIO Workers

Third-party workers can be used to use Gunicorn with asyncio frameworks.

Reference: docs.gunicorn.org/en/stable/d...