Conclusion
A worker process failed to boot because its connection to the configuration center failed, and the failed boot in turn made the master process exit.
Symptom
Alert: abnormal process exit in a container instance.
Logs
The logs for the corresponding time window contain the following exceptions:
```text
Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 172, in _get_from_bconfig
return bconfig.get(key, enable_auth=self.enable_auth)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 178, in get
return get_client().get(key, enable_auth=enable_auth)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 101, in get
return self._decode_response(resp, key)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 120, in _decode_response
raise ResponseError(self.addr, "response: %s : %s" % (error_code, message))
bytedtcc.v2.bconfig.exception.ResponseError: server: http://[fdbd:dc02:2:173::156]:10935, ex: response: 50 : ByteKV: GRPC DeadlineExceeded error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 586, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 135, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 96, in load
app = import_app(self.app_uri)
File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 131, in import_app
__import__(module)
File "/opt/tiger/app/dora/bootstrap.py", line 7, in <module>
application, settings = bootstrap(__name__, os.path.dirname(__file__))
File "/opt/tiger/toutiao/lib/sysutil/djangoutil/django_bootstrap.py", line 16, in bootstrap
import settings
File "/opt/tiger/app/dora/settings.py", line 7, in <module>
from django_site.django_settings_use_ciam import *
File "/opt/tiger/app/dora/django_site/django_settings_use_ciam.py", line 233, in <module>
from .conf.kms import *
File "/opt/tiger/app/dora/django_site/conf/kms.py", line 23, in <module>
kms_config = json.loads(tcc_client.get("kms_config"))
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 93, in get
value, _ = self._get(key)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 108, in _get
data = self._get_from_server()
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 162, in _get_from_server
data_str = self._get_from_bconfig(KEY_DATA_FMT % (self.service_name, self.confspace))
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 179, in _get_from_bconfig
raise GetBConfigError(key, ex)
bytedtcc.exception.GetBConfigError: Get bconfig error, key: /tcc/v2/data/data.system.dora/default, ex: server: http://[fdbd:dc02:2:173::156]:10935, ex: response: 50 : ByteKV: GRPC DeadlineExceeded error
```
```text
Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 723, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 244, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/local/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1016, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 956, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
% (self.host, self.timeout),
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 172, in _get_from_bconfig
return bconfig.get(key, enable_auth=self.enable_auth)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 178, in get
return get_client().get(key, enable_auth=enable_auth)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/bconfig/client.py", line 98, in get
resp = conn.request("GET", url, headers=headers)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 78, in request
method, url, fields=fields, headers=headers, **urlopen_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 99, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
**response_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
**response_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in urlopen
**response_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 803, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 594, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='fdbd:dc02:19:44:a::203', port=9972): Max retries exceeded with url: /v1/keys/tcc/v2/meta/data.system.dora (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 586, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 135, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 96, in load
app = import_app(self.app_uri)
File "/usr/local/lib/python3.7/site-packages/bytedunicorn/app.py", line 131, in import_app
__import__(module)
File "/opt/tiger/app/dora/bootstrap.py", line 7, in <module>
application, settings = bootstrap(__name__, os.path.dirname(__file__))
File "/opt/tiger/toutiao/lib/sysutil/djangoutil/django_bootstrap.py", line 16, in bootstrap
import settings
File "/opt/tiger/app/dora/settings.py", line 7, in <module>
from django_site.django_settings_use_ciam import *
File "/opt/tiger/app/dora/django_site/django_settings_use_ciam.py", line 233, in <module>
from .conf.kms import *
File "/opt/tiger/app/dora/django_site/conf/kms.py", line 23, in <module>
kms_config = json.loads(tcc_client.get("kms_config"))
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 93, in get
value, _ = self._get(key)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 108, in _get
data = self._get_from_server()
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 153, in _get_from_server
meta_str = self._get_from_bconfig(KEY_META_FMT % self.service_name)
File "/usr/local/lib/python3.7/site-packages/bytedtcc/v2/client.py", line 179, in _get_from_bconfig
raise GetBConfigError(key, ex)
bytedtcc.exception.GetBConfigError: Get bconfig error, key: /tcc/v2/meta/data.system.dora, ex: HTTPConnectionPool(host='fdbd:dc02:19:44:a::203', port=9972): Max retries exceeded with url: /v1/keys/tcc/v2/meta/data.system.dora (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb38649dc18>, 'Connection to fdbd:dc02:19:44:a::203 timed out. (connect timeout=0.2)'))
```
The exception call stacks point at Gunicorn, so it is worth digging into how Gunicorn works.
Gunicorn
Overview
Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for UNIX. It's a pre-fork worker model ported from Ruby's Unicorn project.
Architecture
Server Model
```text
+-----------------------+
| Master Process |
| (Signal-Driven Loop) |
+-----------------------+
|
| manages
▼
+-------------------------------------------------+
| Workers Pool |
| +-----------+ +-----------+ +-----------+ |
| | Worker 1 | | Worker 2 | | Worker N | ... |
| | (Process) | | (Process) | | (Process) | |
| +-----------+ +-----------+ +-----------+ |
+-------------------------------------------------+
```
Gunicorn is based on the pre-fork worker model. This means that there is a central master process that manages a set of worker processes. The master never knows anything about individual clients. All requests and responses are handled completely by worker processes.
Master
The master process is a simple loop that listens for various process signals and reacts accordingly. It manages the list of running workers by listening for signals like TTIN, TTOU, and CHLD. TTIN and TTOU tell the master to increase or decrease the number of running workers. CHLD indicates that a child process has terminated. In this case, the master process automatically restarts the failed worker.
Signal Handling
Signal | Master process | Worker process |
---|---|---|
QUIT | Fast shutdown of the master and all workers | Fast shutdown of the current worker |
INT | Fast shutdown of the master and all workers (same as QUIT) | Fast shutdown of the current worker |
TERM | Graceful shutdown: wait up to graceful_timeout for workers to finish their current requests | Graceful shutdown of the current worker |
HUP | Reload the configuration, start new workers, and gracefully shut down the old ones (if preload_app is not enabled, the new version of the application is loaded as well) | -- |
TTIN | Increase the number of workers by one | -- |
TTOU | Decrease the number of workers by one | -- |
USR1 | Reopen all log files | Reopen the current worker's log files |
USR2 | In-place upgrade: start a new master while keeping the old one; a separate TERM must be sent to kill the old master | -- |
WINCH | Gracefully shut down all workers if the master is daemonized | -- |
Master process

- QUIT, INT: Quick shutdown
- TERM: Graceful shutdown. Waits for workers to finish their current requests up to the graceful_timeout.
- HUP: Reload the configuration, start the new worker processes with a new configuration and gracefully shutdown older workers. If the application is not preloaded (using the preload_app option), Gunicorn will also load the new version of it.
- TTIN: Increment the number of processes by one
- TTOU: Decrement the number of processes by one
- USR1: Reopen the log files
- USR2: Upgrade Gunicorn on the fly. A separate TERM signal should be used to kill the old master process. This signal can also be used to use the new versions of pre-loaded applications. See "Upgrading to a new binary on the fly" for more information.
- WINCH: Gracefully shutdown the worker processes when Gunicorn is daemonized.

Worker process

Sending signals directly to the worker processes should not normally be needed. If the master process is running, any exited worker will be automatically respawned.

- QUIT, INT: Quick shutdown
- TERM: Graceful shutdown
- USR1: Reopen the log files
Flow
[flame graph: Gunicorn startup call path]
Learning the flow from a flame graph is a particularly quick approach; the source below is from the latest Gunicorn master branch. The execution flow is:
BaseApplication.run -> Arbiter.run -> Arbiter.manage_workers -> Arbiter.spawn_workers -> Arbiter.spawn_worker -> Worker.init_process
Master Process
Arbiter.run -- the master main loop
This method does two things: start installs the signal handlers / event listeners, and manage_workers keeps the configured number of worker processes running (spawning missing workers, killing surplus ones), while murder_workers kills workers that have timed out (for how the timeout is determined, see "Configuration" -> "timeout" below).
```python
class Arbiter:
def run(self):
"Main master loop."
# install signal handlers / event listeners
self.start()
util._setproctitle("master [%s]" % self.proc_name)
try:
self.manage_workers()
while True:
self.maybe_promote_master()
sig = self.SIG_QUEUE.pop(0) if self.SIG_QUEUE else None
if sig is None:
self.sleep()
# kill workers that have timed out
self.murder_workers()
# spawn missing workers
# kill surplus workers
self.manage_workers()
continue
if sig not in self.SIG_NAMES:
self.log.info("Ignoring unknown signal: %s", sig)
continue
signame = self.SIG_NAMES.get(sig)
handler = getattr(self, "handle_%s" % signame, None)
if not handler:
self.log.error("Unhandled signal: %s", signame)
continue
self.log.info("Handling signal: %s", signame)
handler()
self.wakeup()
except (StopIteration, KeyboardInterrupt):
self.halt()
except HaltServer as inst:
self.halt(reason=inst.reason, exit_status=inst.exit_status)
except SystemExit:
raise
except Exception:
self.log.error("Unhandled exception in main loop", exc_info=True)
self.stop(False)
if self.pidfile is not None:
self.pidfile.unlink()
sys.exit(-1)
```
Arbiter.start -- install signal handlers and event listeners
The core of this method is installing signal handlers and event listeners. SIGCHLD (indicating that a child process has terminated) is handled so that exited children are reaped and the WORKERS table stays current; manage_workers spawns or kills workers based on that table.
```python
class Arbiter:
def start(self):
"""\
Initialize the arbiter. Start listening and set pidfile if needed.
"""
self.log.info("Starting gunicorn %s", __version__)
if 'GUNICORN_PID' in os.environ:
self.master_pid = int(os.environ.get('GUNICORN_PID'))
self.proc_name = self.proc_name + ".2"
self.master_name = "Master.2"
self.pid = os.getpid()
if self.cfg.pidfile is not None:
pidname = self.cfg.pidfile
if self.master_pid != 0:
pidname += ".2"
self.pidfile = Pidfile(pidname)
self.pidfile.create(self.pid)
self.cfg.on_starting(self)
# install signal handlers
self.init_signals()
if not self.LISTENERS:
fds = None
listen_fds = systemd.listen_fds()
if listen_fds:
self.systemd = True
fds = range(systemd.SD_LISTEN_FDS_START, systemd.SD_LISTEN_FDS_START + listen_fds)
elif self.master_pid:
fds = []
for fd in os.environ.pop('GUNICORN_FD').split(','):
fds.append(int(fd))
if not (self.cfg.reuse_port and hasattr(socket, 'SO_REUSEPORT')):
self.LISTENERS = sock.create_sockets(self.cfg, self.log, fds)
listeners_str = ",".join([str(lnr) for lnr in self.LISTENERS])
self.log.debug("Arbiter booted")
self.log.info("Listening at: %s (%s)", listeners_str, self.pid)
self.log.info("Using worker: %s", self.cfg.worker_class_str)
systemd.sd_notify("READY=1\nSTATUS=Gunicorn arbiter booted", self.log)
# check worker class requirements
if hasattr(self.worker_class, "check_config"):
self.worker_class.check_config(self.cfg, self.log)
self.cfg.when_ready(self)
def init_signals(self):
"""\
Initialize master signal handling. Most of the signals
are queued. Child signals only wake up the master.
"""
# close old PIPE
for p in self.PIPE:
os.close(p)
# initialize the pipe
self.PIPE = pair = os.pipe()
for p in pair:
util.set_non_blocking(p)
util.close_on_exec(p)
self.log.close_on_exec()
# initialize all signals
for s in self.SIGNALS:
signal.signal(s, self.signal)
# watch for child process exits
signal.signal(signal.SIGCHLD, self.handle_chld)
def handle_chld(self, sig, frame):
"SIGCHLD handling"
self.reap_workers()
self.wakeup()
def reap_workers(self):
"""\
Reap workers to avoid zombie processes
"""
try:
while True:
# status is the child's raw exit status
wpid, status = os.waitpid(-1, os.WNOHANG)
if not wpid:
break
if self.reexec_pid == wpid:
self.reexec_pid = 0
else:
# A worker was terminated. If the termination reason was
# that it could not boot, we'll shut it down to avoid
# infinite start/stop cycles.
exitcode = status >> 8
if exitcode != 0:
self.log.error('Worker (pid:%s) exited with code %s', wpid, exitcode)
# exit code for a worker that failed to boot:
# WORKER_BOOT_ERROR = 3
# A flag indicating if a worker failed to
# to boot. If a worker process exist with
# this error code, the arbiter will terminate.
if exitcode == self.WORKER_BOOT_ERROR:
reason = "Worker failed to boot."
raise HaltServer(reason, self.WORKER_BOOT_ERROR)
if exitcode == self.APP_LOAD_ERROR:
reason = "App failed to load."
raise HaltServer(reason, self.APP_LOAD_ERROR)
if exitcode > 0:
# If the exit code of the worker is greater than 0,
# let the user know.
self.log.error("Worker (pid:%s) exited with code %s.", wpid, exitcode)
elif status > 0:
# If the exit code of the worker is 0 and the status
# is greater than 0, then it was most likely killed
# via a signal.
try:
sig_name = signal.Signals(status).name
except ValueError:
sig_name = "code {}".format(status)
msg = "Worker (pid:{}) was sent {}!".format(wpid, sig_name)
# Additional hint for SIGKILL
if status == signal.SIGKILL:
msg += " Perhaps out of memory?"
self.log.error(msg)
worker = self.WORKERS.pop(wpid, None)
if not worker:
continue
worker.tmp.close()
self.cfg.child_exit(self, worker)
except OSError as e:
if e.errno != errno.ECHILD:
raise
```
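Note how reap_workers decodes the raw status word from os.waitpid by hand: the exit code sits in the high byte (status >> 8), while a signal death leaves the exit code at zero and puts the signal number in the low byte. A standalone sketch of that encoding (plain Python, not Gunicorn code):

```python
import os
import signal

# Case 1: the child exits with a nonzero code, like a worker that failed to boot.
pid = os.fork()
if pid == 0:
    os._exit(3)  # WORKER_BOOT_ERROR in the arbiter code above
_, status = os.waitpid(pid, 0)
print(status >> 8)             # 3 -- what reap_workers computes as `exitcode`
print(os.WEXITSTATUS(status))  # 3 -- the stdlib helper for the same field

# Case 2: the child is killed by a signal; the exit code field stays 0 and
# the low byte carries the signal number instead.
pid = os.fork()
if pid == 0:
    signal.pause()             # block forever, standing in for a hung worker
os.kill(pid, signal.SIGKILL)
_, status = os.waitpid(pid, 0)
print(status >> 8)             # 0
print(os.WTERMSIG(status))     # 9 == signal.SIGKILL
```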
Arbiter.manage_workers / Arbiter.murder_workers -- manage the worker processes
A worker that throws during boot brings the master process down; this is when Exception in worker process appears in the log.
```python
class Arbiter:
def manage_workers(self):
"""\
Maintain the number of workers by spawning or killing
as required.
"""
if len(self.WORKERS) < self.num_workers:
# spawn missing workers
self.spawn_workers()
workers = self.WORKERS.items()
workers = sorted(workers, key=lambda w: w[1].age)
while len(workers) > self.num_workers:
(pid, _) = workers.pop(0)
# kill surplus workers
self.kill_worker(pid, signal.SIGTERM)
active_worker_count = len(workers)
if self._last_logged_active_worker_count != active_worker_count:
self._last_logged_active_worker_count = active_worker_count
self.log.debug("{0} workers".format(active_worker_count), extra={"metric": "gunicorn.workers", "value": active_worker_count, "mtype": "gauge"})
def spawn_workers(self):
"""\
Spawn new workers as needed.
This is where a worker process leaves the main loop
of the master process.
"""
for _ in range(self.num_workers - len(self.WORKERS)):
self.spawn_worker()
time.sleep(0.1 * random.random())
def spawn_worker(self):
self.worker_age += 1
worker = self.worker_class(self.worker_age, self.pid, self.LISTENERS, self.app, self.timeout / 2.0, self.cfg, self.log)
self.cfg.pre_fork(self, worker)
# fork the child process
pid = os.fork()
# master process path
if pid != 0:
worker.pid = pid
self.WORKERS[pid] = worker
return pid
# Do not inherit the temporary files of other workers
for sibling in self.WORKERS.values():
sibling.tmp.close()
# worker (child) process path
worker.pid = os.getpid()
try:
util._setproctitle("worker [%s]" % self.proc_name)
self.log.info("Booting worker with pid: %s", worker.pid)
if self.cfg.reuse_port:
worker.sockets = sock.create_sockets(self.cfg, self.log)
self.cfg.post_fork(self, worker)
worker.init_process()
sys.exit(0)
except SystemExit:
raise
except AppImportError as e:
self.log.debug("Exception while loading the application", exc_info=True)
print("%s" % e, file=sys.stderr)
sys.stderr.flush()
sys.exit(self.APP_LOAD_ERROR)
except Exception:
self.log.exception("Exception in worker process")
if not worker.booted:
# exit code for a worker that failed to boot:
# WORKER_BOOT_ERROR = 3
# A flag indicating if a worker failed to
# to boot. If a worker process exist with
# this error code, the arbiter will terminate.
sys.exit(self.WORKER_BOOT_ERROR)
sys.exit(-1)
finally:
self.log.info("Worker exiting (pid: %s)", worker.pid)
try:
worker.tmp.close()
self.cfg.worker_exit(self, worker)
except Exception:
self.log.warning("Exception during worker exit:\n%s", traceback.format_exc())
def murder_workers(self):
"""\
Kill unused/idle workers
"""
if not self.timeout:
return
workers = list(self.WORKERS.items())
for (pid, worker) in workers:
try:
if time.monotonic() - worker.tmp.last_update() <= self.timeout:
continue
except (OSError, ValueError):
continue
if not worker.aborted:
self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
worker.aborted = True
self.kill_worker(pid, signal.SIGABRT)
else:
self.kill_worker(pid, signal.SIGKILL)
```
Worker Process
Worker.init_process -- initialize the worker process
This method has two core duties: installing signal handlers via init_signals, and loading the application via load_wsgi.
```python
class Worker:
def init_process(self):
"""\
If you override this method in a subclass, the last statement
in the function should be to call this method with
super().init_process() so that the ``run()`` loop is initiated.
"""
# set environment variables
if self.cfg.env:
for k, v in self.cfg.env.items():
os.environ[k] = v
util.set_owner_process(self.cfg.uid, self.cfg.gid, initgroups=self.cfg.initgroups)
# Reseed the random number generator
util.seed()
# For waking ourselves up
self.PIPE = os.pipe()
for p in self.PIPE:
util.set_non_blocking(p)
util.close_on_exec(p)
# Prevent fd inheritance
for s in self.sockets:
util.close_on_exec(s)
util.close_on_exec(self.tmp.fileno())
self.wait_fds = self.sockets + [self.PIPE[0]]
self.log.close_on_exec()
self.init_signals()
# start the reloader
if self.cfg.reload:
def changed(fname):
self.log.info("Worker reloading: %s modified", fname)
self.alive = False
os.write(self.PIPE[1], b"1")
self.cfg.worker_int(self)
time.sleep(0.1)
sys.exit(0)
reloader_cls = reloader_engines[self.cfg.reload_engine]
self.reloader = reloader_cls(extra_files=self.cfg.reload_extra_files, callback=changed)
self.load_wsgi()
if self.reloader:
self.reloader.start()
self.cfg.post_worker_init(self)
# Enter main run loop
self.booted = True
self.run()
```
Worker.init_signals -- install signal handlers
```python
class Worker:
def init_signals(self):
# reset signaling
for s in self.SIGNALS:
signal.signal(s, signal.SIG_DFL)
# init new signaling
signal.signal(signal.SIGQUIT, self.handle_quit)
signal.signal(signal.SIGTERM, self.handle_exit)
signal.signal(signal.SIGINT, self.handle_quit)
signal.signal(signal.SIGWINCH, self.handle_winch)
signal.signal(signal.SIGUSR1, self.handle_usr1)
# when the master decides a worker has timed out, it sends the worker SIGABRT
signal.signal(signal.SIGABRT, self.handle_abort)
# Don't let SIGTERM and SIGUSR1 disturb active requests
# by interrupting system calls
signal.siginterrupt(signal.SIGTERM, False)
signal.siginterrupt(signal.SIGUSR1, False)
if hasattr(signal, 'set_wakeup_fd'):
signal.set_wakeup_fd(self.PIPE[1])
```
AsyncWorker.handle_request -- handle a request
```python
class AsyncWorker:
def handle_request(self, listener_name, req, sock, addr):
request_start = datetime.now()
environ = {}
resp = None
try:
self.cfg.pre_request(self, req)
resp, environ = wsgi.create(req, sock, addr,
listener_name, self.cfg)
environ["wsgi.multithread"] = True
self.nr += 1
# implements max_requests: after serving the configured number of requests, the worker shuts down and gets respawned
if self.nr >= self.max_requests:
if self.alive:
self.log.info("Autorestarting worker after current request.")
self.alive = False
if not self.alive or not self.cfg.keepalive:
resp.force_close()
respiter = self.wsgi(environ, resp.start_response)
if self.is_already_handled(respiter):
return False
try:
if isinstance(respiter, environ['wsgi.file_wrapper']):
resp.write_file(respiter)
else:
for item in respiter:
resp.write(item)
resp.close()
finally:
request_time = datetime.now() - request_start
self.log.access(resp, req, environ, request_time)
if hasattr(respiter, "close"):
respiter.close()
if resp.should_close():
raise StopIteration()
except StopIteration:
raise
except OSError:
# If the original exception was a socket.error we delegate
# handling it to the caller (where handle() might ignore it)
util.reraise(*sys.exc_info())
except Exception:
if resp and resp.headers_sent:
# If the requests have already been sent, we should close the
# connection to indicate the error.
self.log.exception("Error handling request")
try:
sock.shutdown(socket.SHUT_RDWR)
sock.close()
except OSError:
pass
raise StopIteration()
raise
finally:
try:
self.cfg.post_request(self, req, environ, resp)
except Exception:
self.log.exception("Exception in post_request hook")
return True
```
Configuration
timeout
Worker.__init__ -- create the heartbeat file
At initialization every worker creates a tmp file and is required to touch it periodically; the master checks that file's last update time against timeout and kills any worker whose heartbeat has gone stale.
```python
class Worker:
def __init__(self, age, ppid, sockets, app, timeout, cfg, log):
"""\
This is called pre-fork so it shouldn't do anything to the
current process. If there's a need to make process wide
changes you'll want to do that in ``self.init_process()``.
"""
self.age = age
self.pid = "[booting]"
self.ppid = ppid
self.sockets = sockets
self.app = app
self.timeout = timeout
self.cfg = cfg
self.booted = False
self.aborted = False
self.reloader = None
self.nr = 0
if cfg.max_requests > 0:
jitter = randint(0, cfg.max_requests_jitter)
self.max_requests = cfg.max_requests + jitter
else:
self.max_requests = sys.maxsize
self.alive = True
self.log = log
# the heartbeat tmp file
self.tmp = WorkerTmp(cfg)
class WorkerTmp:
def __init__(self, cfg):
old_umask = os.umask(cfg.umask)
fdir = cfg.worker_tmp_dir
if fdir and not os.path.isdir(fdir):
raise RuntimeError("%s doesn't exist. Can't create workertmp." % fdir)
fd, name = tempfile.mkstemp(prefix="wgunicorn-", dir=fdir)
os.umask(old_umask)
# change the owner and group of the file if the worker will run as
# a different user or group, so that the worker can modify the file
if cfg.uid != os.geteuid() or cfg.gid != os.getegid():
util.chown(name, cfg.uid, cfg.gid)
# unlink the file so we don't leak temporary files
try:
if not IS_CYGWIN:
util.unlink(name)
# In Python 3.8, open() emits RuntimeWarning if buffering=1 for binary mode.
# Because we never write to this file, pass 0 to switch buffering off.
self._tmp = os.fdopen(fd, 'w+b', 0)
except Exception:
os.close(fd)
raise
```
Arbiter.murder_workers -- kill workers whose heartbeat has timed out
```python
class Arbiter:
def murder_workers(self):
"""\
Kill unused/idle workers
"""
if not self.timeout:
return
workers = list(self.WORKERS.items())
for (pid, worker) in workers:
try:
# use the tmp file's last update time to decide whether the worker has timed out
if time.monotonic() - worker.tmp.last_update() <= self.timeout:
continue
except (OSError, ValueError):
continue
if not worker.aborted:
self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
worker.aborted = True
# abort the timed-out worker
self.kill_worker(pid, signal.SIGABRT)
else:
self.kill_worker(pid, signal.SIGKILL)
```
Worker.notify / WorkerTmp.notify -- refresh the heartbeat timestamp
```python
class Worker:
def notify(self):
"""\
Your worker subclass must arrange to have this method called
once every ``self.timeout`` seconds. If you fail in accomplishing
this task, the master process will murder your workers.
"""
self.tmp.notify()
class WorkerTmp:
def notify(self):
new_time = time.monotonic()
# bump the last-update time of this worker's tmp file
os.utime(self._tmp.fileno(), (new_time, new_time))
```
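The whole heartbeat mechanism reduces to a small standalone sketch (plain Python; the 3-second timeout and sleep intervals are arbitrary). The child touches a temp file the way WorkerTmp.notify does, and the parent ages the file's last update the way murder_workers does:

```python
import os
import signal
import tempfile
import time

TIMEOUT = 3.0  # stand-in for cfg.timeout

fd, path = tempfile.mkstemp(prefix="heartbeat-")

pid = os.fork()
if pid == 0:
    # "worker": heartbeat twice, then hang, simulating a stuck worker
    for _ in range(2):
        now = time.monotonic()
        os.utime(fd, (now, now))  # what WorkerTmp.notify does
        time.sleep(1)
    time.sleep(60)
    os._exit(0)

# "master": poll the heartbeat the way murder_workers does
while True:
    time.sleep(1)
    if time.monotonic() - os.stat(path).st_mtime > TIMEOUT:
        print("WORKER TIMEOUT (pid:%s)" % pid)
        os.kill(pid, signal.SIGKILL)  # Gunicorn sends SIGABRT first, SIGKILL second
        break

os.waitpid(pid, 0)
os.unlink(path)
```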
Related log line: WORKER TIMEOUT
max_requests
See AsyncWorker.handle_request above: once self.nr reaches self.max_requests, the worker logs "Autorestarting worker after current request.", sets self.alive = False, finishes the current request, exits, and is respawned by the master.
Related log line: Autorestarting worker after current request
Analysis
Two key facts stand out from the logs at the top: the master process exited with status code 3, and the worker failed to boot because the TCC connection failed. Combined with the Worker.init_process and Arbiter.spawn_worker logic above, the chain is: a worker that fails to boot takes the master process down with it. A worker boot is triggered in three situations: at application startup; when a worker is killed for timing out and a replacement is spawned (controlled by the timeout setting); and when a worker retires after serving its request quota and a replacement is spawned (controlled by the max_requests setting).
Suggested fixes: increase the retry count / timeout for the config fetch, add a local fallback configuration, and improve the alert message (already done).
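A sketch of the first two suggestions (a hypothetical helper: only tcc_client.get(key) comes from the traceback above; the retry counts, backoff, and fallback path are assumptions):

```python
import json
import time

FALLBACK_PATH = "/opt/tiger/app/dora/conf/kms_config.fallback.json"  # assumed local snapshot

def get_config_with_retry(tcc_client, key, retries=3, delay=0.5):
    """Fetch a config value with retries, then fall back to a local snapshot."""
    for attempt in range(retries):
        try:
            return json.loads(tcc_client.get(key))
        except Exception:
            if attempt == retries - 1:
                break
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    # Last resort: a locally cached copy, so a config-center blip does not
    # become a worker boot failure and a master exit.
    with open(FALLBACK_PATH) as f:
        return json.load(f)

# usage in conf/kms.py would then look like:
# kms_config = get_config_with_retry(tcc_client, "kms_config")
```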
Further reading
Worker Type
Worker type | Style | Concurrency model | Suited for | Connection handling | Dependencies / requirements | Notes |
---|---|---|---|---|---|---|
Sync | synchronous | single process, single thread | simple apps / debugging | no persistent connections | none | an error affects at most one request; the application must avoid blocking code |
Async | asynchronous coroutines | greenlets (Eventlet / Gevent) | coroutine-compatible async apps | persistent connections supported | Eventlet / Gevent required; adapters such as psycogreen may be needed | verify application compatibility; some libraries need monkey-patching |
Gthread | thread pool | multi-threaded | I/O-bound, long-lived connections | Keep-Alive supported | none | the main loop accepts connections and hands them to the thread pool; good for high-concurrency short tasks |
Tornado | asynchronous | Tornado event loop | Tornado applications | framework-native connection handling | Tornado framework required | can serve WSGI but not recommended; built for Tornado async apps |
AsyncIO | asynchronous | asyncio event loop | asyncio frameworks (e.g. FastAPI) | depends on the framework | third-party worker required (e.g. uvicorn) | selected via worker_class; for native async framework integration |
Sync Workers
The most basic and the default worker type is a synchronous worker class that handles a single request at a time. This model is the simplest to reason about as any errors will affect at most a single request. Though as we describe below only processing a single request at a time requires some assumptions about how applications are programmed.
sync worker does not support persistent connections - each connection is closed after response has been sent (even if you manually add Keep-Alive or Connection: keep-alive header in your application).
Async Workers
The asynchronous workers available are based on Greenlets (via Eventlet and Gevent). Greenlets are an implementation of cooperative multi-threading for Python. In general, an application should be able to make use of these worker classes with no changes.
For full greenlet support applications might need to be adapted. When using, e.g., Gevent and Psycopg it makes sense to ensure psycogreen is installed and setup.
Other applications might not be compatible at all as they, e.g., rely on the original unpatched behavior.
Gthread Workers
The worker gthread is a threaded worker. It accepts connections in the main loop. Accepted connections are added to the thread pool as a connection job. On keepalive connections are put back in the loop waiting for an event. If no event happens after the keepalive timeout, the connection is closed.
Tornado Workers
There's also a Tornado worker class. It can be used to write applications using the Tornado framework. Although the Tornado workers are capable of serving a WSGI application, this is not a recommended configuration.
AsyncIO Workers
Third-party workers can be used to use Gunicorn with asyncio frameworks.
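Pulling the knobs from this writeup together, a minimal gunicorn.conf.py sketch (illustrative values, not the incident service's settings):

```python
# gunicorn.conf.py -- illustrative values only
workers = 4                   # size of the pool the arbiter maintains
worker_class = "gevent"       # one of the worker types from the table above
timeout = 30                  # heartbeat timeout enforced by murder_workers
graceful_timeout = 30         # grace period for TERM / HUP shutdowns
max_requests = 1000           # recycle a worker after this many requests
max_requests_jitter = 100     # stagger recycling so workers don't restart together
preload_app = False           # with HUP, the application code is reloaded too
```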