Spark RPC Communication, Part 2 (Draft)
The Spark RPC Transport Layer
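The transport layer is implemented mainly on top of the Netty framework. `TransportContext` holds the context needed to create a `TransportServer` and a `TransportClientFactory`, and to set up the Netty Channel pipeline with a `TransportChannelHandler`. `TransportClient` provides two communication protocols: control-plane RPCs and data-plane "chunk fetching". RPC handling happens outside the scope of `TransportContext` (that is, it is performed by a user-supplied handler). That handler is responsible for setting up streams, which can then be streamed through the data plane in chunks using zero-copy IO. Messages are processed by `RpcHandler`. Both `TransportServer` and `TransportClientFactory` create a `TransportChannelHandler` for each channel. Because each `TransportChannelHandler` contains a `TransportClient`, the server process can send messages back to the client over the existing channel.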
[Class diagram: `TransportContext` (fields `conf`, `rpcHandler`, `closeIdleConnections`, `isClientOnly`, `ENCODER`, `DECODER`, `chunkFetchWorkers`; methods `createClientFactory()`, `createServer()`, `createChannelHandler()`, `initializePipeline()`), `TransportClientFactory` with its inner `ClientPool`, the interfaces `TransportClientBootstrap` and `TransportServerBootstrap`, the abstract classes `MessageHandler` and `RpcHandler`, the `Message` interface, `TransportRequestHandler` (holding an `RpcHandler` and a `StreamManager`), `TransportResponseHandler`, `TransportClient`, `TransportServer`, and `TransportChannelHandler`.]
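The Transport Context: TransportContext
Core members and methods of `TransportContext`:
- `TransportConf conf`: the transport configuration.
- `RpcHandler rpcHandler`: handles received RPC messages.
- `EventLoopGroup chunkFetchWorkers`: a dedicated thread pool for handling `ChunkFetchRequest` messages. It caps the number of `TransportServer` worker threads that can be blocked while writing `ChunkFetchRequest` responses back to the client over the underlying channel.
- `createClientFactory()`: initializes a client factory that runs the given `TransportClientBootstrap`s before returning a new client. Bootstraps are executed synchronously and must succeed for the client to be created.
- `createServer()`: creates a `TransportServer` instance.
- `initializePipeline()`: initializes the `TransportClient`, `TransportRequestHandler`, and `TransportResponseHandler`, uses them to construct a `TransportChannelHandler`, and configures the pipeline through the Netty API.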
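The `createClientFactory` method of `TransportContext` creates an instance of the transport client factory, `TransportClientFactory`. The list of client bootstraps (`TransportClientBootstrap`) is passed in when the factory is constructed. Internally, `TransportClientFactory` maintains a connection pool for each socket address. The transport server, `TransportServer`, is created by calling the `createServer` method of `TransportContext`.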
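Core Class: TransportClientFactory
A factory for creating `TransportClient`s through its `createClient` method. The factory maintains a pool of connections to other hosts and returns a cached `TransportClient` for the same remote host (one of `numConnectionsPerPeer` pooled connections). It also shares a single worker thread pool among all `TransportClient`s. `TransportClient`s are reused whenever possible. Before the creation of a new `TransportClient` completes, all given `TransportClientBootstrap`s are run.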
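The bootstrap contract described above can be sketched without Netty. The following is a minimal illustration, not Spark's implementation: `SketchClient`, `SketchBootstrap`, and `SketchFactory` are hypothetical stand-ins for `TransportClient`, `TransportClientBootstrap`, and `TransportClientFactory`, and the "connection" is just an object. It shows that every bootstrap runs synchronously and in order, and that a failing bootstrap closes the client and prevents it from being returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for TransportClient; there is no real network connection. */
final class SketchClient {
    final String address;
    final List<String> appliedBootstraps = new ArrayList<>();
    boolean closed = false;

    SketchClient(String address) { this.address = address; }

    void close() { closed = true; }
}

/** Hypothetical stand-in for TransportClientBootstrap. */
interface SketchBootstrap {
    void doBootstrap(SketchClient client) throws Exception;
}

/** Hypothetical factory: runs every bootstrap before handing out a client. */
final class SketchFactory {
    private final List<SketchBootstrap> bootstraps;
    SketchClient lastAttempt; // exposed only so a failed attempt can be inspected

    SketchFactory(List<SketchBootstrap> bootstraps) { this.bootstraps = bootstraps; }

    SketchClient createClient(String address) {
        SketchClient client = new SketchClient(address);
        lastAttempt = client;
        try {
            for (SketchBootstrap bootstrap : bootstraps) {
                bootstrap.doBootstrap(client); // synchronous, in list order
            }
        } catch (Exception e) {
            client.close(); // a failed bootstrap means no client is returned
            throw new IllegalStateException("bootstrap failed for " + address, e);
        }
        return client;
    }
}

final class BootstrapDemo {
    public static void main(String[] args) {
        SketchBootstrap auth = c -> c.appliedBootstraps.add("auth");
        SketchBootstrap crypto = c -> c.appliedBootstraps.add("crypto");
        SketchFactory ok = new SketchFactory(Arrays.asList(auth, crypto));
        System.out.println(ok.createClient("host:7077").appliedBootstraps);

        SketchBootstrap failing = c -> { throw new Exception("handshake rejected"); };
        SketchFactory bad = new SketchFactory(Arrays.asList(auth, failing));
        try {
            bad.createClient("host:7077");
        } catch (IllegalStateException e) {
            System.out.println("closed=" + bad.lastAttempt.closed);
        }
    }
}
```

Running the bootstraps inside the factory keeps callers from ever seeing a half-initialized client: they either get a client on which every bootstrap succeeded, or an exception.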
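Core members and methods of `TransportClientFactory`:
- Static inner class `ClientPool`: a simple data structure that tracks the pool of client connections between two peer nodes so that they can be reused. The structure itself is not thread-safe, so it keeps one lock object per client slot.
```java
private static class ClientPool {
  TransportClient[] clients;
  Object[] locks;

  ClientPool(int size) {
    clients = new TransportClient[size];
    locks = new Object[size];
    for (int i = 0; i < size; i++) {
      locks[i] = new Object();
    }
  }
}
```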
- `TransportContext context`: the `TransportContext` instance.
- `TransportConf conf`: the connection configuration.
- `List<TransportClientBootstrap> clientBootstraps`: the client bootstraps, which perform initialization work when a client establishes a connection.
- `ConcurrentHashMap<SocketAddress, ClientPool> connectionPool`: maps each remote address to its client connection pool.
- `createClient(String remoteHost, int remotePort)`:
  - First, look up the remote address in `connectionPool`. If no `ClientPool` exists for that address, create one and put it in the map.
  - Then pick a slot at random and check the cached client: refresh the handler's last-request time so the channel does not time out, and confirm that the client is still active. If the client is missing or inactive, a new one is created by the `createClient(InetSocketAddress address)` method.
```java
public TransportClient createClient(String remoteHost, int remotePort)
    throws IOException, InterruptedException {
  // Use an unresolved address here to avoid a DNS lookup every time a client is created.
  final InetSocketAddress unresolvedAddress =
    InetSocketAddress.createUnresolved(remoteHost, remotePort);

  // Create the ClientPool if it does not exist yet.
  ClientPool clientPool = connectionPool.get(unresolvedAddress);
  if (clientPool == null) {
    connectionPool.putIfAbsent(unresolvedAddress, new ClientPool(numConnectionsPerPeer));
    clientPool = connectionPool.get(unresolvedAddress);
  }

  int clientIndex = rand.nextInt(numConnectionsPerPeer);
  TransportClient cachedClient = clientPool.clients[clientIndex];

  if (cachedClient != null && cachedClient.isActive()) {
    // Update the handler's last-use time so the channel does not time out.
    TransportChannelHandler handler = cachedClient.getChannel().pipeline()
      .get(TransportChannelHandler.class);
    synchronized (handler) {
      handler.getResponseHandler().updateTimeOfLastRequest();
    }

    // Then check again that the client is alive, in case it timed out before the update.
    if (cachedClient.isActive()) {
      logger.trace("Returning cached connection to {}: {}",
        cachedClient.getSocketAddress(), cachedClient);
      return cachedClient;
    }
  }

  // If we reach here, there is no open existing connection; try to create a new one.
  final long preResolveHost = System.nanoTime();
  final InetSocketAddress resolvedAddress = new InetSocketAddress(remoteHost, remotePort);
  final long hostResolveTimeMs = (System.nanoTime() - preResolveHost) / 1000000;
  if (hostResolveTimeMs > 2000) {
    logger.warn("DNS resolution for {} took {} ms", resolvedAddress, hostResolveTimeMs);
  } else {
    logger.trace("DNS resolution for {} took {} ms", resolvedAddress, hostResolveTimeMs);
  }

  // Multiple threads may race to create a new connection here.
  // Synchronize on the slot's lock so that only one of them stays active.
  synchronized (clientPool.locks[clientIndex]) {
    cachedClient = clientPool.clients[clientIndex];

    if (cachedClient != null) {
      if (cachedClient.isActive()) {
        logger.trace("Returning cached connection to {}: {}", resolvedAddress, cachedClient);
        return cachedClient;
      } else {
        logger.info("Found inactive connection to {}, creating a new one.", resolvedAddress);
      }
    }
    clientPool.clients[clientIndex] = createClient(resolvedAddress);
    return clientPool.clients[clientIndex];
  }
}
```
- `createClient(InetSocketAddress address)`:
  - Configure the connection through Netty's root `Bootstrap`.
  - Initialize the bootstrap's pipeline in a callback, recording references to the client and the channel.
  - Iterate over `clientBootstraps` and run each bootstrap's initialization.
```java
private TransportClient createClient(InetSocketAddress address)
    throws IOException, InterruptedException {
  logger.debug("Creating new connection to {}", address);

  // Netty's root bootstrap for creating the connection
  Bootstrap bootstrap = new Bootstrap();
  bootstrap.group(workerGroup)
    .channel(socketChannelClass)
    // Disable Nagle's algorithm, since we do not want packets to wait
    .option(ChannelOption.TCP_NODELAY, true)
    .option(ChannelOption.SO_KEEPALIVE, true)
    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, conf.connectionTimeoutMs())
    .option(ChannelOption.ALLOCATOR, pooledAllocator);

  if (conf.receiveBuf() > 0) {
    bootstrap.option(ChannelOption.SO_RCVBUF, conf.receiveBuf());
  }

  if (conf.sendBuf() > 0) {
    bootstrap.option(ChannelOption.SO_SNDBUF, conf.sendBuf());
  }

  final AtomicReference<TransportClient> clientRef = new AtomicReference<>();
  final AtomicReference<Channel> channelRef = new AtomicReference<>();

  // Initialize the bootstrap's pipeline in a callback
  bootstrap.handler(new ChannelInitializer<SocketChannel>() {
    @Override
    public void initChannel(SocketChannel ch) {
      TransportChannelHandler clientHandler = context.initializePipeline(ch);
      clientRef.set(clientHandler.getClient());
      channelRef.set(ch);
    }
  });

  // Connect to the remote server
  long preConnect = System.nanoTime();
  ChannelFuture cf = bootstrap.connect(address);
  if (!cf.await(conf.connectionTimeoutMs())) {
    throw new IOException(
      String.format("Connecting to %s timed out (%s ms)", address, conf.connectionTimeoutMs()));
  } else if (cf.cause() != null) {
    throw new IOException(String.format("Failed to connect to %s", address), cf.cause());
  }

  TransportClient client = clientRef.get();
  Channel channel = channelRef.get();
  assert client != null : "Channel future completed successfully with null client";

  // Run any client bootstraps synchronously before marking the client as successful.
  long preBootstrap = System.nanoTime();
  logger.debug("Connection to {} successful, running bootstraps...", address);
  try {
    // Iterate over clientBootstraps and run each one's initialization
    for (TransportClientBootstrap clientBootstrap : clientBootstraps) {
      clientBootstrap.doBootstrap(client, channel);
    }
  } catch (Exception e) { // catch non-RuntimeExceptions too as bootstrap may be written in Scala
    long bootstrapTimeMs = (System.nanoTime() - preBootstrap) / 1000000;
    logger.error("Exception while bootstrapping client after " + bootstrapTimeMs + " ms", e);
    client.close();
    throw Throwables.propagate(e);
  }
  long postBootstrap = System.nanoTime();

  logger.info("Successfully created connection to {} after {} ms ({} ms spent in bootstraps)",
    address, (postBootstrap - preConnect) / 1000000, (postBootstrap - preBootstrap) / 1000000);

  return client;
}
```
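The pooling logic in the two `createClient` methods can be reduced to a small, Netty-free sketch. Everything here is hypothetical (`PooledConn`, `SlotPool`, `PoolingFactory`), and it uses `computeIfAbsent` where the code above uses `putIfAbsent` followed by `get`. The point is only to show the pattern: one pool per address, a fixed number of slots per pool, and a per-slot lock so that only one thread replaces an inactive connection.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical connection; "active" stands in for an open channel. */
final class PooledConn {
    final String address;
    volatile boolean active = true;

    PooledConn(String address) { this.address = address; }
}

/** Same shape as ClientPool: one slot and one lock object per connection. */
final class SlotPool {
    final PooledConn[] conns;
    final Object[] locks;

    SlotPool(int size) {
        conns = new PooledConn[size];
        locks = new Object[size];
        for (int i = 0; i < size; i++) {
            locks[i] = new Object();
        }
    }
}

final class PoolingFactory {
    private final ConcurrentHashMap<String, SlotPool> pools = new ConcurrentHashMap<>();
    private final int connsPerPeer;
    final AtomicInteger created = new AtomicInteger(); // counts real creations

    PoolingFactory(int connsPerPeer) { this.connsPerPeer = connsPerPeer; }

    /** The slot is a parameter to keep the demo deterministic; the code above picks it at random. */
    PooledConn get(String address, int slot) {
        SlotPool pool = pools.computeIfAbsent(address, a -> new SlotPool(connsPerPeer));
        PooledConn cached = pool.conns[slot];
        if (cached != null && cached.active) {
            return cached; // fast path, no lock taken
        }
        synchronized (pool.locks[slot]) {
            cached = pool.conns[slot]; // re-check: another thread may have won the race
            if (cached != null && cached.active) {
                return cached;
            }
            created.incrementAndGet();
            pool.conns[slot] = new PooledConn(address);
            return pool.conns[slot];
        }
    }
}

final class PoolingDemo {
    public static void main(String[] args) {
        PoolingFactory factory = new PoolingFactory(2);
        PooledConn first = factory.get("host-a:7077", 0);
        PooledConn again = factory.get("host-a:7077", 0);
        System.out.println("reused: " + (first == again));
        first.active = false; // simulate a closed channel
        PooledConn replacement = factory.get("host-a:7077", 0);
        System.out.println("replaced: " + (replacement != first));
        System.out.println("created: " + factory.created.get());
    }
}
```

Locking per slot rather than per pool means that replacing one dead connection does not block threads that are creating or replacing a connection in another slot of the same pool.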
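TransportClient
Used to send RPC requests to the server and to fetch chunks of a stream from the server. It is designed for transferring large amounts of data efficiently, with the data split into chunks ranging from a few hundred KB to a few MB.
Typical workflow: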
```java
// Open the remote file
client.sendRPC(new OpenFile("/foo"))  // --> returns StreamId = 100
// Fetch chunks of the remote file
client.fetchChunk(streamId = 100, chunkIndex = 0, callback)
client.fetchChunk(streamId = 100, chunkIndex = 1, callback)
// Close the remote file
client.sendRPC(new CloseStream(100))
```
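`TransportClient` is a client for fetching consecutive chunks of a pre-negotiated stream. It handles fetching chunks from a stream (the data plane), but the actual setup of the stream is done outside the scope of the transport layer. The convenience method `sendRPC` is provided so that the client and server can communicate on the control plane to perform this setup. A `TransportClient` instance is constructed using `TransportClientFactory`. A single `TransportClient` may be used for multiple streams, but any given stream must be restricted to a single client to avoid out-of-order responses. Note: this class is used to make requests to the server, while `TransportResponseHandler` is responsible for handling the responses from the server. Concurrency: thread-safe, and may be called from multiple threads.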
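To make the split between the two planes concrete, here is a minimal in-memory simulation of the workflow above. It is not Spark's API: `ToyStreamServer` and its methods `openStream`, `numChunks`, `fetchChunk`, and `closeStream` are hypothetical, the chunk size is tiny, and there is no network. The control-plane calls set up and tear down a stream, while the data-plane call only reads chunks from a stream that already exists.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory server: control plane opens and closes streams, data plane serves chunks. */
final class ToyStreamServer {
    private final Map<Long, List<byte[]>> streams = new HashMap<>();
    private long nextStreamId = 100;

    /** Control plane: register the data as a stream of fixed-size chunks and return its id. */
    long openStream(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, off, Math.min(data.length, off + chunkSize)));
        }
        long id = nextStreamId++;
        streams.put(id, chunks);
        return id;
    }

    /** Control plane: how many chunks the stream holds. */
    int numChunks(long streamId) {
        return lookup(streamId).size();
    }

    /** Data plane: fetch one chunk of an already negotiated stream. */
    byte[] fetchChunk(long streamId, int chunkIndex) {
        return lookup(streamId).get(chunkIndex);
    }

    /** Control plane: release the stream. */
    void closeStream(long streamId) {
        streams.remove(streamId);
    }

    private List<byte[]> lookup(long streamId) {
        List<byte[]> chunks = streams.get(streamId);
        if (chunks == null) {
            throw new IllegalStateException("unknown stream " + streamId);
        }
        return chunks;
    }
}

final class ToyStreamDemo {
    /** Opens a stream, fetches every chunk in order, closes the stream, and returns the text. */
    static String readAll(ToyStreamServer server, byte[] payload, int chunkSize) {
        long streamId = server.openStream(payload, chunkSize);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < server.numChunks(streamId); i++) {
            // The payload is ASCII here, so decoding chunk by chunk cannot split a character.
            out.append(new String(server.fetchChunk(streamId, i), StandardCharsets.US_ASCII));
        }
        server.closeStream(streamId);
        return out.toString();
    }

    public static void main(String[] args) {
        ToyStreamServer server = new ToyStreamServer();
        byte[] payload = "hello transport layer".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAll(server, payload, 8));
    }
}
```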
TransportServer
The server, which provides an efficient, low-level streaming service.
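Message Handling
The Message Handler Class: MessageHandler
Handles request or response messages coming from Netty. A `MessageHandler` instance is associated with exactly one Netty channel (though there may be multiple clients on the same channel). It defines the following abstract methods:
- `abstract void handle(T message)`: handles the receipt of a single message.
- `abstract void channelActive()`: invoked when the channel this handler is on becomes active.
- `abstract void exceptionCaught(Throwable cause)`: invoked when an exception is caught on the channel.
- `abstract void channelInactive()`: invoked when the channel this `MessageHandler` is on becomes inactive.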
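As an illustration of this contract, the sketch below declares an abstract handler with the same four callbacks and a trivial subclass that records what it sees. It is a simplified stand-in with hypothetical names (`SketchMessageHandler`, `RecordingHandler`), a `String` message type instead of Spark's `Message`, and a hand-written driver in place of Netty's channel events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for MessageHandler<T>; one instance per channel. */
abstract class SketchMessageHandler<T> {
    abstract void handle(T message) throws Exception;
    abstract void channelActive();
    abstract void exceptionCaught(Throwable cause);
    abstract void channelInactive();
}

/** Hypothetical subclass that records every callback in the order it fires. */
final class RecordingHandler extends SketchMessageHandler<String> {
    final List<String> events = new ArrayList<>();

    @Override
    void handle(String message) {
        if (message.isEmpty()) {
            throw new IllegalArgumentException("empty message");
        }
        events.add("handle:" + message);
    }

    @Override
    void channelActive() { events.add("active"); }

    @Override
    void exceptionCaught(Throwable cause) { events.add("error:" + cause.getMessage()); }

    @Override
    void channelInactive() { events.add("inactive"); }
}

final class HandlerDemo {
    /** Plays the role of the channel: drives the lifecycle and routes failures. */
    static <T> void drive(SketchMessageHandler<T> handler, List<T> messages) {
        handler.channelActive();
        for (T message : messages) {
            try {
                handler.handle(message);
            } catch (Exception e) {
                handler.exceptionCaught(e);
            }
        }
        handler.channelInactive();
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        drive(handler, Arrays.asList("ping", "", "pong"));
        System.out.println(handler.events);
    }
}
```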
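`MessageHandler` has two subclasses, `TransportRequestHandler` and `TransportResponseHandler`. The former runs on the server side and handles requests from the client; the latter runs on the client side and handles responses from the server.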
The `handle(RequestMessage request)` method of `TransportRequestHandler`:
```java
public void handle(RequestMessage request) {
  if (request instanceof RpcRequest) {
    // Handle an RPC request; relies on RpcHandler.receive()
    processRpcRequest((RpcRequest) request);
  } else if (request instanceof OneWayMessage) {
    // Handle an RPC request that needs no reply; relies on RpcHandler.receive()
    processOneWayMessage((OneWayMessage) request);
  } else if (request instanceof StreamRequest) {
    // Handle a stream request; relies on StreamManager.openStream() to obtain
    // the stream data wrapped in a ManagedBuffer
    processStreamRequest((StreamRequest) request);
  } else {
    // Unknown request types raise an exception
    throw new IllegalArgumentException("Unknown request type: " + request);
  }
}
```
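The `handle(ResponseMessage message)` method of `TransportResponseHandler`:
When the client sends a message, it calls the method on `TransportResponseHandler` that matches the message type to register a callback. The callback is stored, keyed by the request, in the corresponding cache.
When `TransportResponseHandler` later receives the server's response, its main working method, `handle()`, looks up the callback in the cache that matches the response type, removes it, and invokes it.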
```java
@Override
public void handle(ResponseMessage message) throws Exception {
  if (message instanceof ChunkFetchSuccess) {
    ChunkFetchSuccess resp = (ChunkFetchSuccess) message;
    ChunkReceivedCallback listener = outstandingFetches.get(resp.streamChunkId);
    if (listener == null) {
      logger.warn("Ignoring response for block {} from {} since it is not outstanding",
        resp.streamChunkId, getRemoteAddress(channel));
      resp.body().release();
    } else {
      outstandingFetches.remove(resp.streamChunkId);
      listener.onSuccess(resp.streamChunkId.chunkIndex, resp.body());
      resp.body().release();
    }
  } else if (message instanceof ChunkFetchFailure) {
    ChunkFetchFailure resp = (ChunkFetchFailure) message;
    ChunkReceivedCallback listener = outstandingFetches.get(resp.streamChunkId);
    if (listener == null) {
      logger.warn("Ignoring response for block {} from {} ({}) since it is not outstanding",
        resp.streamChunkId, getRemoteAddress(channel), resp.errorString);
    } else {
      outstandingFetches.remove(resp.streamChunkId);
      listener.onFailure(resp.streamChunkId.chunkIndex, new ChunkFetchFailureException(
        "Failure while fetching " + resp.streamChunkId + ": " + resp.errorString));
    }
  } else if (message instanceof RpcResponse) {
    RpcResponse resp = (RpcResponse) message;
    RpcResponseCallback listener = outstandingRpcs.get(resp.requestId);
    if (listener == null) {
      logger.warn("Ignoring response for RPC {} from {} ({} bytes) since it is not outstanding",
        resp.requestId, getRemoteAddress(channel), resp.body().size());
    } else {
      outstandingRpcs.remove(resp.requestId);
      try {
        listener.onSuccess(resp.body().nioByteBuffer());
      } finally {
        resp.body().release();
      }
    }
  } else if (message instanceof RpcFailure) {
    RpcFailure resp = (RpcFailure) message;
    RpcResponseCallback listener = outstandingRpcs.get(resp.requestId);
    if (listener == null) {
      logger.warn("Ignoring response for RPC {} from {} ({}) since it is not outstanding",
        resp.requestId, getRemoteAddress(channel), resp.errorString);
    } else {
      outstandingRpcs.remove(resp.requestId);
      listener.onFailure(new RuntimeException(resp.errorString));
    }
  } else if (message instanceof StreamResponse) {
    StreamResponse resp = (StreamResponse) message;
    Pair<String, StreamCallback> entry = streamCallbacks.poll();
    if (entry != null) {
      StreamCallback callback = entry.getValue();
      if (resp.byteCount > 0) {
        StreamInterceptor interceptor = new StreamInterceptor(this, resp.streamId, resp.byteCount,
          callback);
        try {
          TransportFrameDecoder frameDecoder = (TransportFrameDecoder)
            channel.pipeline().get(TransportFrameDecoder.HANDLER_NAME);
          frameDecoder.setInterceptor(interceptor);
          streamActive = true;
        } catch (Exception e) {
          logger.error("Error installing stream handler.", e);
          deactivateStream();
        }
      } else {
        try {
          callback.onComplete(resp.streamId);
        } catch (Exception e) {
          logger.warn("Error in stream handler onComplete().", e);
        }
      }
    } else {
      logger.error("Could not find callback for StreamResponse.");
    }
  } else if (message instanceof StreamFailure) {
    StreamFailure resp = (StreamFailure) message;
    Pair<String, StreamCallback> entry = streamCallbacks.poll();
    if (entry != null) {
      StreamCallback callback = entry.getValue();
      try {
        callback.onFailure(resp.streamId, new RuntimeException(resp.error));
      } catch (IOException ioe) {
        logger.warn("Error in stream failure handler.", ioe);
      }
    } else {
      logger.warn("Stream failure with unknown callback: {}", resp.error);
    }
  } else {
    throw new IllegalStateException("Unknown response type: " + message.type());
  }
}
```
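The register-then-dispatch pattern that `handle()` relies on can be isolated in a few lines. The sketch below is a simplification under stated assumptions, not Spark's code: `RpcCallback` and `PendingRpcs` are hypothetical names, request ids are plain `long` values, and response bodies are strings. The sending side registers a callback under the request id. The receiving side removes the callback and invokes it, and ignores responses whose id is no longer outstanding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical callback, loosely modelled on RpcResponseCallback. */
interface RpcCallback {
    void onSuccess(String response);
    void onFailure(Throwable cause);
}

/** Hypothetical registry of requests that are still waiting for a response. */
final class PendingRpcs {
    private final Map<Long, RpcCallback> outstanding = new ConcurrentHashMap<>();

    /** Called on the sending side, before the request is written to the channel. */
    void addRpcRequest(long requestId, RpcCallback callback) {
        outstanding.put(requestId, callback);
    }

    /** Called when a success response arrives; returns false if nobody was waiting. */
    boolean onResponse(long requestId, String body) {
        RpcCallback callback = outstanding.remove(requestId); // remove so it fires at most once
        if (callback == null) {
            return false;
        }
        callback.onSuccess(body);
        return true;
    }

    /** Called when a failure response arrives; returns false if nobody was waiting. */
    boolean onFailure(long requestId, String error) {
        RpcCallback callback = outstanding.remove(requestId);
        if (callback == null) {
            return false;
        }
        callback.onFailure(new RuntimeException(error));
        return true;
    }

    int numOutstanding() {
        return outstanding.size();
    }
}

final class PendingRpcsDemo {
    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        RpcCallback callback = new RpcCallback() {
            @Override
            public void onSuccess(String response) { log.add("ok:" + response); }

            @Override
            public void onFailure(Throwable cause) { log.add("fail:" + cause.getMessage()); }
        };
        PendingRpcs pending = new PendingRpcs();
        pending.addRpcRequest(1L, callback);
        pending.addRpcRequest(2L, callback);
        pending.onResponse(1L, "pong");
        pending.onFailure(2L, "timeout");
        boolean delivered = pending.onResponse(1L, "duplicate"); // id 1 is no longer outstanding
        System.out.println(log + " delivered=" + delivered);
    }
}
```

Removing the entry before invoking the callback is what guarantees that a late or duplicated response cannot trigger the same callback twice.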
Message Types
All messages processed by `MessageHandler` extend or implement the `Message` interface.
[Class diagram: the `Message` interface, with the `RequestMessage` and `ResponseMessage` interfaces and the abstract classes `AbstractMessage` and `AbstractResponseMessage`; request types `ChunkFetchRequest`, `OneWayMessage`, `RpcRequest`, and `StreamRequest`; response types `ChunkFetchSuccess`, `RpcResponse`, `StreamResponse`, `ChunkFetchFailure`, `RpcFailure`, and `StreamFailure`; `MessageHandler` with `abstract void handle(T message)`.]
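As the class diagram above shows, the messages fall into the following groups:
- `AbstractMessage`: an abstract class that holds the message body in a separate buffer. Most other message classes extend it.
- `RequestMessage`: the interface for messages sent from the client to the server.
  - `ChunkFetchRequest`: requests a single chunk of a stream. It is answered by one response message (success or failure).
  - `RpcRequest`: a generic RPC handled by the remote server's `org.apache.spark.network.server.RpcHandler`. It is answered by one response message (success or failure).
  - `OneWayMessage`: handled by the remote server's `org.apache.spark.network.server.RpcHandler`. No reply is sent to the client.
  - `StreamRequest`: requests that data be streamed from the remote end. The stream ID is an arbitrary string that the two endpoints must negotiate before the data can be streamed.
- `ResponseMessage`: the interface for messages sent from the server to the client.
  - `AbstractResponseMessage`: the abstract base class for response messages.
    - `ChunkFetchSuccess`: returned when a `ChunkFetchRequest` is processed successfully.
    - `RpcResponse`: returned when an `RpcRequest` is processed successfully.
    - `StreamResponse`: returned when a `StreamRequest` is processed successfully.
  - `ChunkFetchFailure`: returned when processing a `ChunkFetchRequest` fails.
  - `RpcFailure`: returned when processing an `RpcRequest` fails.
  - `StreamFailure`: returned when processing a `StreamRequest` fails.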
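The pairing between request and response types in the list above can be written down as a small lookup table. This sketch is illustrative only: the `MsgKind` enum and the `ResponsePairs` class are hypothetical and do not mirror Spark's own type enumeration. They encode nothing beyond what the list states.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical labels for the message classes listed above. */
enum MsgKind {
    CHUNK_FETCH_REQUEST, RPC_REQUEST, ONE_WAY_MESSAGE, STREAM_REQUEST,
    CHUNK_FETCH_SUCCESS, CHUNK_FETCH_FAILURE,
    RPC_RESPONSE, RPC_FAILURE,
    STREAM_RESPONSE, STREAM_FAILURE
}

/** Maps each request kind to the response kinds that may answer it. */
final class ResponsePairs {
    private static final Map<MsgKind, List<MsgKind>> REPLIES = new EnumMap<>(MsgKind.class);

    static {
        REPLIES.put(MsgKind.CHUNK_FETCH_REQUEST,
            Arrays.asList(MsgKind.CHUNK_FETCH_SUCCESS, MsgKind.CHUNK_FETCH_FAILURE));
        REPLIES.put(MsgKind.RPC_REQUEST,
            Arrays.asList(MsgKind.RPC_RESPONSE, MsgKind.RPC_FAILURE));
        REPLIES.put(MsgKind.STREAM_REQUEST,
            Arrays.asList(MsgKind.STREAM_RESPONSE, MsgKind.STREAM_FAILURE));
        // A one-way message is handled by the RpcHandler but never answered.
        REPLIES.put(MsgKind.ONE_WAY_MESSAGE, Collections.<MsgKind>emptyList());
    }

    private ResponsePairs() {}

    static List<MsgKind> repliesTo(MsgKind request) {
        List<MsgKind> replies = REPLIES.get(request);
        if (replies == null) {
            throw new IllegalArgumentException(request + " is not a request type");
        }
        return replies;
    }

    static boolean expectsReply(MsgKind request) {
        return !repliesTo(request).isEmpty();
    }
}

final class ResponsePairsDemo {
    public static void main(String[] args) {
        List<MsgKind> requests = Arrays.asList(MsgKind.CHUNK_FETCH_REQUEST, MsgKind.RPC_REQUEST,
            MsgKind.ONE_WAY_MESSAGE, MsgKind.STREAM_REQUEST);
        for (MsgKind request : requests) {
            System.out.println(request + " -> " + ResponsePairs.repliesTo(request));
        }
    }
}
```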
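Request and Response Flow on the Client Side
[Flow diagram, client side. Components shown in the transport layer: `TransportClient` (step 1: `addRpcRequest` or `addFetchRequest`; step 2: `writeAndFlush`), `MessageEncoder`, `TransportFrameDecoder`, `MessageDecoder`, `IdleStateHandler`, `TransportChannelHandler`, and `TransportResponseHandler`. The Request leaves through Netty and the Response returns through Netty.]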
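Request and Response Flow on the Server Side
[Flow diagram, server side. Components shown in the transport layer: `TransportFrameDecoder`, `MessageDecoder`, `IdleStateHandler`, `TransportChannelHandler`, `TransportRequestHandler`, `StreamManager`, `RpcHandler`, and `MessageEncoder`. The Request arrives through Netty and the Response is sent back through Netty.]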