Shell Task Scheduling in Practice
Requirements
The platform side actively schedules the execution of shell scripts and performs various business operations around them.
Machine configuration:
Shell execution side: several 16 GB, 16-core virtual machines
Platform side: a 64 GB, 120-core physical machine
Challenges:
- Shell script execution is process-level, so each run is relatively expensive
- Shell scripts run for a long time and need to connect to the Hive data warehouse
- The queue only has a limited number of connections available
- A message queue cannot be used for this task (it would add architectural cost)
Thoughts
- Since the platform side has ample resources, the task scheduling and allocation strategies should live on the platform side; the shell execution side should not carry much of the load.
- The concurrency on each shell execution machine must be controlled carefully; too many processes should not run at the same time.
- Parameters such as the queue name should be supplied by the user, with each department using its own queue.
- The thread pool allocated to these tasks should be sized sensibly and isolated from the thread pools of other business logic.
- The shell executor should collect execution logs; each task execution starts a separate thread that writes the logs in parallel, so the process output buffers never fill up and block the child process.
- The platform side needs to record the execution status of each task.
- For the first version, the platform side and the shell execution side communicate over HTTP.
Implementation
Core code on the shell execution side
java
private final ThreadPoolExecutor logExecutor = new ThreadPoolExecutor(10, 20,
        60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>(50));

@PostMapping("/execute")
public Boolean execute(@RequestBody List<String> paramMap) {
    StringBuilder command;
    if (StringUtils.isEmpty(SHELL_PATH)) {
        // path: locally configured fallback script path (field not shown in this snippet)
        command = new StringBuilder("sh " + path);
    } else {
        command = new StringBuilder("sh " + SHELL_PATH);
    }
    for (String s : paramMap) {
        command.append(" ").append(s);
    }
    Process process = null;
    try {
        // Run the command in a child process
        logger.info("command: {}", command);
        process = Runtime.getRuntime().exec(command.toString());
        // The child produces two streams: standard output and error output
        // (the child's output is the parent's input). Drain both on separate
        // threads so the child never blocks on a full output buffer.
        Process finalProcess = process;
        logExecutor.execute(() -> {
            String line;
            BufferedReader bufrIn = new BufferedReader(new InputStreamReader(finalProcess.getInputStream()));
            try {
                while ((line = bufrIn.readLine()) != null) {
                    logger.info("stdout: {}", line);
                }
            } catch (Exception e) {
                logger.error(e.getMessage());
            } finally {
                try {
                    bufrIn.close();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        });
        logExecutor.execute(() -> {
            String line;
            BufferedReader bufrError = new BufferedReader(new InputStreamReader(finalProcess.getErrorStream()));
            try {
                while ((line = bufrError.readLine()) != null) {
                    logger.info("stderr: {}", line);
                }
            } catch (Exception e) {
                logger.error(e.getMessage());
            } finally {
                try {
                    bufrError.close();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        });
        // Block until the command finishes; a zero exit code means success
        int exitVal = process.waitFor();
        if (0 != exitVal) {
            logger.error("Script execution failed");
        } else {
            logger.info("Script execution succeeded");
        }
    } catch (Exception e) {
        logger.error(e.getMessage());
    }
    return true;
}
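One design note: draining stdout and stderr on two separate threads can be collapsed into a single reader by merging the streams. Below is a minimal sketch using ProcessBuilder with redirectErrorStream(true); the ShellRunner class, its method signature, and the log format are illustrative assumptions, not part of the executor above.
java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ShellRunner {

    /**
     * Runs "sh <scriptPath> <args...>" and returns the exit code.
     * stderr is merged into stdout, so one reader (here: the calling thread)
     * is enough and the child can never block on a full pipe buffer.
     */
    public static int run(String scriptPath, List<String> args) throws Exception {
        List<String> command = new ArrayList<>();
        command.add("sh");
        command.add(scriptPath);
        command.addAll(args);

        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout

        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("script output: " + line); // or hand off to a log thread
            }
        }
        // The output is already exhausted, so waitFor() returns promptly
        return process.waitFor();
    }
}
Whether merging the streams is acceptable depends on whether the platform needs to tell normal script output apart from error output.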
Platform-side code
A user submits a task
java
taskCache.addTask(new ShellMonitorTask(byId.getCloud(), bizDay, configId, UserCache.getUserInfo()));
java
public class ShellMonitorTask implements Runnable {

    private static final Logger logger = LoggerFactory.getLogger(ShellMonitorTask.class);

    private final String cloud;
    private String up;
    private final String bizDay;
    private final String configId;

    private final HiveComplementClient client = SpringContextUtils.getBean(
            HiveComplementClient.class);
    private final HiveTableComplementLogMapper mapper = SpringContextUtils.getBean(
            HiveTableComplementLogMapper.class);
    private final ShellExecutionTaskCache taskCache = SpringContextUtils.getBean(
            ShellExecutionTaskCache.class);

    public ShellMonitorTask(String cloud, String bizDay, String configId, String up) {
        this.cloud = cloud;
        this.bizDay = bizDay;
        this.configId = configId;
        this.up = up;
    }

    @Override
    public void run() {
        logger.info("=== Starting backfill task, params: {}, {}, {}, {}", configId, bizDay, cloud, up);
        HiveTableComplementLog log = addLog();
        // Try to acquire an execution permit; block until one is available
        taskCache.acquire();
        try {
            // Execute the shell script
            executeShellScript(log);
        } finally {
            // Always release the permit, even if the execution throws
            taskCache.release();
        }
        // Update the execution log and record the final status
        updateLog(log);
        logger.info("=== Backfill task finished");
    }

    private void executeShellScript(HiveTableComplementLog log) {
        String res = "true";
        // Make a blocking HTTP/RPC request based on the registry and the other input fields
    }

    private HiveTableComplementLog addLog() {
        HiveTableComplementLog log = new HiveTableComplementLog();
        log.setStartTime(new Date());
        log.setConfigId(configId);
        log.setBizDay(bizDay);
        log.setCp(up);
        log.setStatus(HiveTableComplementLogStatus.RUNNING.getValue());
        mapper.add(log);
        return log;
    }

    private void updateLog(HiveTableComplementLog log) {
        log.setFinishTime(new Date());
        mapper.update(log);
    }
}
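executeShellScript is only stubbed out above. Below is a minimal sketch of the blocking HTTP call it might make, assuming Spring's RestTemplate and a hypothetical resolveExecutorHost(...) helper that looks the executor address up in the registry; the URL, port, and parameter order are illustrative, not the actual implementation, and imports for RestTemplate and Arrays are assumed.
java
// A possible body for the executeShellScript stub in ShellMonitorTask above
private void executeShellScript(HiveTableComplementLog log) {
    // Hypothetical registry lookup: map the task's cloud to an executor host
    String executorUrl = "http://" + resolveExecutorHost(cloud) + ":8080/execute";

    // The executor endpoint takes a plain list of script arguments
    // (@RequestBody List<String> paramMap in the controller shown earlier)
    List<String> params = Arrays.asList(bizDay, configId);

    RestTemplate restTemplate = new RestTemplate();
    // Blocking call: the monitor task keeps holding its semaphore permit
    // until the executor responds
    Boolean ok = restTemplate.postForObject(executorUrl, params, Boolean.class);
    logger.info("executor response: {}", ok);
}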
A single thread loops and pulls tasks from the queue. The idea is similar in spirit to how Nacos dispatches registration requests and how Netty funnels connection handling through one unified loop.
java
@Component
public class ShellExecutionTaskCache {

    private static final Logger logger = LoggerFactory.getLogger(ShellExecutionTaskCache.class);

    // Task container
    private static final BlockingDeque<ShellMonitorTask> SHELL_TASKS = new LinkedBlockingDeque<>();

    // Permits that throttle shell execution; the current concurrency limit is 7
    private static final Semaphore RUNNING_VOUCHER = new Semaphore(7);

    // Block on the semaphore until a permit is available
    public void acquire() {
        try {
            RUNNING_VOUCHER.acquire(1);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    // Return a permit to the semaphore
    public void release() {
        RUNNING_VOUCHER.release(1);
    }

    public void addTask(ShellMonitorTask task) {
        try {
            SHELL_TASKS.put(task);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    @PostConstruct
    public void init() {
        start();
    }

    // A single thread controls the order in which tasks are dispatched
    private void start() {
        Thread t = new Thread(() -> {
            for (; ; ) {
                ShellMonitorTask pop;
                try {
                    pop = SHELL_TASKS.take();
                } catch (InterruptedException e) {
                    logger.error("shell dispatcher interrupted: {}", e.getMessage());
                    continue;
                }
                // Hand the task to the singleton shell execution pool
                ShellExecetionPool.executor().execute(pop);
            }
        });
        // Daemon thread, so it does not keep the JVM alive
        t.setDaemon(true);
        t.start();
    }
}
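ShellExecetionPool.executor() is referenced above but never shown. A minimal sketch of what such a singleton might look like, assuming a small fixed-size pool dedicated to shell tasks so it stays isolated from other business thread pools; the pool sizes, queue capacity, and thread name are illustrative assumptions, not the original configuration.
java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShellExecetionPool {

    // Dedicated pool for ShellMonitorTask instances, so shell scheduling never
    // competes with other business thread pools. Core/max of 7 mirrors the
    // semaphore's concurrency limit; the exact numbers are assumptions.
    private static final ThreadPoolExecutor EXECUTOR = new ThreadPoolExecutor(
            7, 7,
            60L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(200),
            r -> {
                Thread t = new Thread(r, "shell-execution-pool");
                t.setDaemon(true);
                return t;
            });

    private ShellExecetionPool() {
    }

    public static ThreadPoolExecutor executor() {
        return EXECUTOR;
    }
}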
Room for improvement
Since we can control the concurrency on each machine, the execution side should not run into trouble under normal conditions, but plenty of failures are ones we never anticipate. If this needs improving later, we could bring in middleware such as Zookeeper or Nacos and use their heartbeat mechanisms to confirm the health of each execution node and keep it alive. That again adds architectural cost, though, so honestly, we're not doing it for now.
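If health checking is ever added, one lightweight option is for each executor to register itself with Nacos as an ephemeral instance and let the client's built-in heartbeat mark dead nodes unhealthy. A minimal sketch, assuming the nacos-client dependency and placeholder addresses; this is not part of the current implementation.
java
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;

public class ExecutorRegistration {

    public static void main(String[] args) throws NacosException {
        // Placeholder Nacos server address, service name, and executor address
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        // Ephemeral instances are kept alive by the client heartbeat; if an
        // executor dies, Nacos marks the instance unhealthy and the platform
        // side can stop dispatching tasks to it.
        naming.registerInstance("shell-executor", "192.168.0.10", 8080);
    }
}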