【maap-analysis】Complete development workflow of a Spark offline data warehouse project

The overall project is organized as a single project with multiple modules.

(For how this structure is built, see the CSDN blog post 单项目多模块的大数据项目框架构建.)

Tech stack: Scala + Spark RDD + Spark Dataset

RDD is used here mainly to learn the underlying RDD implementation logic and API usage.

In enterprise structured-data development, Spark SQL or the DataFrame API is generally used instead.

Gitee repository: https://gitee.com/sawa0725/data-ware-house

I. Utility class preparation

1. Environment preparation

Env: a utility trait for initializing the Spark environment

It lives in the parent project's Common module, because the Spark environment is needed by many projects and therefore belongs in the shared module; environment utilities used only by this project can go under this module's own common package instead.

Scala
// Spark environment preparation utility, used to obtain a SparkContext and a SparkSession
package com.dw.common.utils

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

trait Env {

  def getSparkContext(appName: String = "AppName", master: String = "local[*]"): SparkContext = {
    val conf: SparkConf = new SparkConf().setMaster(master).setAppName(appName)
    new SparkContext(conf)
  }

  def getSparkSession(appName: String = "AppName", master: String = "local[*]"): SparkSession = {
    SparkSession.builder().master(master).appName(appName).getOrCreate()
  }
}

EnvUtil: a thread-local wrapper class

The EnvUtil object is a utility for the Spark environment. Its core role is to manage thread-local instances of SparkContext and SparkSession in one place, avoiding resource conflicts in multi-threaded scenarios while providing convenient get, set, and clear methods.

It goes in the same location as above.

Scala
package com.dw.common.utils

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object EnvUtil {
  private val scLocal = new ThreadLocal[SparkContext]
  private val sparkLocal = new ThreadLocal[SparkSession]

  def getSc: SparkContext = scLocal.get()

  def setSc(sc: SparkContext): Unit = scLocal.set(sc)

  def clearSc(): Unit = scLocal.remove()

  def getSession: SparkSession = sparkLocal.get()

  def setSession(session: SparkSession): Unit = sparkLocal.set(session)

  def clearSession(): Unit = sparkLocal.remove()
}

These two pieces complete the Spark environment setup: just mix Env into the driver object (with Env) to get access to the trait's methods.

Sample implementation code:

Scala
private val sc: SparkContext = getSparkContext() // call the trait method to obtain a SparkContext
private val spark: SparkSession = getSparkSession() // call the trait method to obtain a SparkSession
EnvUtil.setSc(sc) // store the SparkContext in the thread-local holder
EnvUtil.setSession(spark) // store the SparkSession in the thread-local holder

Dao layer (business data access)

Here it mainly reads local files; other projects can add HDFS file reading and similar capabilities.

Scala
// Business data access
package com.dw.maap.job.dao

import com.dw.common.utils.EnvUtil
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{NewHadoopRDD, RDD}
import org.apache.spark.sql.{DataFrame, SparkSession}

class Dao {
  private val sc: SparkContext = EnvUtil.getSc
  private val spark: SparkSession = EnvUtil.getSession
  
  def  hadoopFile(path: String): RDD[String] = {
   val value: RDD[String] = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
    .asInstanceOf[NewHadoopRDD[LongWritable, Text]]
    .mapPartitionsWithInputSplit(
     (inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
      val file = inputSplit.asInstanceOf[FileSplit]
      val fileName = file.getPath.getName
      iterator.map(line => {
       line._2.toString + fileName
      })
     })
   value
  }
  def rddRead(path: String): RDD[String] = {
   sc.textFile(path)
  }
  def sessionRead(path: String): DataFrame = {
   val frame: DataFrame = spark.read.load(path)
   frame
  }
  def rddReadWhole(path: String): RDD[(String,String)] = {
   sc.wholeTextFiles(path)
  }

}

Key walkthrough of this piece of code:

Scala
def hadoopFile(path: String): RDD[String] = {
  // Read files through Spark's Hadoop API and append the source file name to every line.
  // newAPIHadoopFile: Spark's core method for reading files with the new Hadoop API. Type parameters:
  //   LongWritable: the byte offset of the line within the file (key type);
  //   Text: the line content (value type, one string per line);
  //   TextInputFormat: Hadoop's text input format (reads line by line), used here as the input format.
  val value: RDD[String] = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
    .asInstanceOf[NewHadoopRDD[LongWritable, Text]] // Cast to NewHadoopRDD: a plain RDD has no mapPartitionsWithInputSplit method; the cast is required to access the InputSplit (file split information), which carries the file name.
    .mapPartitionsWithInputSplit(
      (inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
        val file = inputSplit.asInstanceOf[FileSplit] // Cast the generic split to FileSplit to get the file path / name
        val fileName = file.getPath.getName
        iterator.map(line => {
          line._2.toString + fileName // append the source file name to the raw line
        })
      })
  value
}
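
A minimal usage sketch (the path below is hypothetical, and this assumes EnvUtil.setSc has already been called in the driver, as ApplicationAnalysis does later):

Scala

import com.dw.maap.job.dao.Dao
import org.apache.spark.rdd.RDD

// Hypothetical local path; each returned element is the raw line with its
// source file name appended at the end, which is later used to tell the business files apart.
val dao = new Dao
val lines: RDD[String] = dao.hadoopFile("maap-analysis/src/main/resources/cdr/*")
lines.take(5).foreach(println)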

entity: business entity classes

Case classes that encapsulate the business data structures (such as MaapInfo).

They enable type-safe Dataset operations and serve as the value type of the RDDs read from the files.

The code for these classes can be obtained from the repository if needed; an illustrative, cut-down sketch is shown below.
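
For illustration only, here is a simplified sketch of what such an entity might look like. The field names follow the ones referenced later in the services; the real MaapInfo in the repository has one field per fixed-width column of the CDR, so its field order and full field list differ:

Scala

// Illustrative, simplified sketch - NOT the full MaapInfo definition from the repository.
// The var fields are the ones the services fill in during preprocessing and rating.
case class MaapInfoSketch(
  sett_direction: Int,      // 1 / 0 / -1, derived from the source file name
  var prov: String,         // peer-province code, filled by the preprocessing steps
  var city: String,         // home-city code, filled by the preprocessing steps
  var is_bal_port: String,  // "port effective" flag, tagged from the bossmessage data
  var settle_fee: Double,   // settlement fee, filled during rating
  senduser: String,         // sending port number
  reserveuser: String,      // receiving phone number, matched against H1H2H3 segments
  messagelength: String,    // message length, used for MAAP_TXT rating
  input_file_name: String   // source file name appended by Dao.hadoopFile
)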

II. Business workflow

1. Application entry point: ApplicationAnalysis

It only bootstraps the job and configures the environment.

Scala
package com.dw.maap.job.application

import com.dw.common.utils.{Env, EnvUtil}
import com.dw.maap.job.controller.Controller
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ApplicationAnalysis extends App with Env {
  // Environment setup
  Logger.getLogger("org").setLevel(Level.ERROR)
  // Set the log level of all classes under the org package to ERROR: only error-level logs are printed, suppressing INFO/WARN/DEBUG noise
  private val sc: SparkContext = getSparkContext()
  private val spark: SparkSession = getSparkSession()
  EnvUtil.setSc(sc)
  EnvUtil.setSession(spark)
  
  private val controller = new Controller
  //controller.test()
  
  controller.mainProcess("ALL")

  EnvUtil.clearSc()
  EnvUtil.clearSession()
  sc.stop()
  spark.stop()
}

2. Flow control: Controller

It handles high-level flow control: it simply splits the overall business flow into a few modules and runs them in order.

Part one: overall flow control, including checking that the input parameter is valid.

Part two: configuration data (局数据) loading, split into the H1H2H3 number-segment data and the bossmessage port data.

Part three: logic processing, also split into the MAAP and BSMS flows (the loaded configuration data is passed in as input); finally the temporary views are aggregated into reports and printed.

There are four functions in total:

mainProcess
cfgDataLoad (both businesses call different functions of the same data-loading class)
maapProcess
BsmsProcess

Flow control code:

Scala
package com.dw.maap.job.controller

import com.dw.maap.job.config.CfgInfo.{bsmsTable, maapTable}
import com.dw.maap.job.entity.{BsmsInfo, H1H2H3, MaapInfo, bossmessageInfo}
import com.dw.maap.job.service.{BsmsAnalysisService, MaapAnalysisService, dataLoad, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

class Controller {
  private val maapService = new MaapAnalysisService
  private val loadService = new dataLoad
  private val bsmsAnalysisService = new BsmsAnalysisService
  private val statistics = new Statistics

  var bossmessage: RDD[bossmessageInfo] = null
  var H1H2H3: RDD[H1H2H3] = null

  // Main process
  def mainProcess(serviceType: String): Unit = {
    if (List("BSMS", "MAAP", "ALL") contains serviceType.toUpperCase) {
      cfgDataLoad() // load the configuration data (局数据)
      // then run the corresponding business process based on the parameter (ALL / MAAP / BSMS)
      if (serviceType.toUpperCase == "ALL") {
        maapProcess()
        BsmsProcess()
      }
      else if (serviceType.toUpperCase == "MAAP") maapProcess()
      else if (serviceType.toUpperCase == "BSMS") BsmsProcess()
    } else println("请输入正确的业务类型:ALL、MAAP、BSMS")
  }


  def cfgDataLoad(): Unit = {
    println("局数据加载...H1h2H3号段数据开始加载.....打印前十条")
    H1H2H3 = loadService.H1H2H3Load().persist(StorageLevel.MEMORY_AND_DISK)
    // persist is Spark's core caching API:
    // it persists the RDD/DataFrame/Dataset to memory and/or disk, avoiding recomputation and improving performance
    H1H2H3.take(10).foreach(println)
    println("局数据加载...bossmessage端口局数据开始加载.....打印前十条")
    bossmessage = loadService.bossMessageLoad().persist(StorageLevel.MEMORY_AND_DISK)
    bossmessage.take(10).foreach(println)
  }

  def maapProcess(): Unit = {
    println("MAAP详单数据开始加载.....打印前十条")
    val cdrRdd: RDD[MaapInfo] = loadService.maapCdrFileLoad()
    cdrRdd.take(10).foreach(println)
    println("MAAP话单对端省预处理开始执行.....打印前十条")
    val maapResult: RDD[MaapInfo] = maapService.maapCdrPreServerCode(bossmessage, cdrRdd)
    maapResult.take(10).foreach(println)

    println("MAAP话单归属地预处理开始执行.....打印前十条")
    val maapCdr: RDD[MaapInfo] = maapService.maapCdrPreH1H2H3(maapResult, H1H2H3).cache()
    maapCdr.take(10).foreach(println)

    println("##############################################")
    println("分地市报表数据:")
    statistics.statisticsCity(maapTable)

    println("##############################################")
    println("分对端省报表数据")
    statistics.statisticsProv(maapTable)
  }

  private def BsmsProcess(): Unit = {
    println("BSMS详单数据开始加载.....打印前十条")
    val cdrRdd: RDD[BsmsInfo] = loadService.bsmsCdrFileLoad().persist(StorageLevel.MEMORY_AND_DISK)
    cdrRdd.take(10).foreach(println)
    // Port preprocessing must run first to tag whether each sending port is effective
    println("BSMS话单端口预处理开始执行.....打印前十条")
    val bsmsResult: RDD[BsmsInfo] = bsmsAnalysisService.bsmsCdrPreServerCode(bossmessage, cdrRdd).persist(StorageLevel.MEMORY_AND_DISK)
    bsmsResult.take(100).foreach(println)
    println("BSMS话单手机号码预处理开始执行.....打印前十条")
    val bsmsCdr: RDD[BsmsInfo] = bsmsAnalysisService.bsmsCdrPreH1H2H3(H1H2H3, bsmsResult).persist(StorageLevel.MEMORY_AND_DISK)
    bsmsCdr.take(100).foreach(println)

    println("##############################################")
    println("分地市报表数据:")
    statistics.statisticsCity(bsmsTable)

    println("##############################################")
    println("分对端省报表数据")
    statistics.statisticsProv(bsmsTable)
  }


}

3. Data loading class: dataLoad

Purpose: load the configuration data and the detail (CDR) data; called from Controller.

(It mainly maps the file data onto the entity classes and then registers them as Dataset temporary views; a minimal sketch of this pattern is shown below, followed by the full code.)
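
A distilled sketch of the recurring load pattern, using a hypothetical two-column file and entity (the real methods below do the same thing with the project's entities and configured paths):

Scala

import com.dw.common.utils.EnvUtil
import org.apache.spark.rdd.RDD

// Hypothetical entity, defined at the top level like the real classes in the entity package
case class Demo(id: String, name: String)

val spark = EnvUtil.getSession
val sc = EnvUtil.getSc
import spark.implicits._ // needed for toDS()

val demoRdd: RDD[Demo] = sc.textFile("some/path/demo.txt") // hypothetical path
  .map(_.split("\\u0001"))                                 // split on the same control-character delimiter as the real config files
  .map(a => Demo(a(0).trim, a(1).trim))

demoRdd.toDS().createOrReplaceTempView("demo_table")       // register a temp view for the SQL reports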

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import com.dw.maap.job.config.CfgInfo.{H1H2H3FileName, H1H2H3Path, bBsmsTxtFileName, bBsmsTxtPath, bMaapTxtFileName, bMaapTxtPath, bossmessageinfoFilename, bossmessageinfoPath, bsmsHead10String, bsmsTable, bsmsTail10String, bsmsTxtFileName, bsmsTxtPath, dd_bossmessageinfo, dd_h1h2h3_code_allocate, maapHead10String, maapTable, maapTail10String, maapTxtFileName, maapTxtPath}
import com.dw.maap.job.entity.{BsmsInfo, H1H2H3, MaapInfo, bossmessageInfo}
import com.dw.maap.job.dao.Dao
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}



class dataLoad {
  
  private val spark: SparkSession = EnvUtil.getSession
  private val sc: SparkContext = EnvUtil.getSc
  private val dao = new Dao
  import spark.implicits._
  // Import SparkSession's implicit conversions (toDS/toDF, etc.); must come after the SparkSession has been created

  def bossMessageLoad(): RDD[bossmessageInfo] = {
   val bossmessageRdd: RDD[bossmessageInfo] = dao.rddRead(bossmessageinfoPath + bossmessageinfoFilename)
    .map(_.split("\\u0001"))
    .map(data => {
     bossmessageInfo(
      data(0), data(1), data(2), data(3), data(4), data(5), data(6), data(7), data(8), data(9), data(10), data(11), data(12), data(13), data(14), data(15)
     )
    })
   val bossmessageSet: Dataset[bossmessageInfo] = bossmessageRdd.toDS()
   bossmessageSet.createOrReplaceTempView(dd_bossmessageinfo)
   bossmessageRdd
  }
  
  def H1H2H3Load(): RDD[H1H2H3] = {
   val H1H2H3Rdd: RDD[H1H2H3] = dao.rddRead(H1H2H3Path + H1H2H3FileName)
    .map(_.split("\\u0001"))
    .map(data => {
     H1H2H3(
      data(0).trim,
      data(1).trim,
      data(2).trim,
      data(3).trim,
      data(4).trim,
      data(5).trim,
      data(6).trim,
      data(7).trim,
      data(8).trim,
      data(9).trim,
      data(10).trim,
      data(11).trim,
      data(12).trim,
      data(13).trim
     )
    })
   val H1H2H3Set: Dataset[H1H2H3] = H1H2H3Rdd.toDS()
   H1H2H3Set.createOrReplaceTempView(dd_h1h2h3_code_allocate)
   H1H2H3Rdd
  }


  def maapCdrFileLoad(): RDD[MaapInfo] = {
   val maapRdd: RDD[String] = dao.hadoopFile(bMaapTxtPath + bMaapTxtFileName + "*" + "," + maapTxtPath + maapTxtFileName + "*")


   val filHead: RDD[String] = maapRdd.filter(_.substring(0, 10) != maapHead10String)
   val filHeadTail: RDD[String] = filHead.filter(_.substring(0, 10) != maapTail10String)

   val maapInfo: RDD[MaapInfo] = filHeadTail.map(data => {
    MaapInfo(

     if (data.substring(1174).startsWith(bMaapTxtFileName)) 1
     else if (data.substring(1174).startsWith(maapTxtFileName)) 0
     else {
      -1
     },
     null,
     null,
     null,
     0.00,
     data.substring(0, 2).trim,
     data.substring(2, 22).trim,
     data.substring(22, 54).trim,
     data.substring(54, 56).trim,
     data.substring(56, 184).split(":")(1).split("@")(0),
     data.substring(184, 312).split("\\+86")(1).trim,
     data.substring(312, 314).trim,
     data.substring(314, 316).trim,
     data.substring(316, 352).trim,
     data.substring(352, 388).trim,
     data.substring(388, 424).trim,
     data.substring(424, 426).trim,
     data.substring(426, 428).trim,
     data.substring(428, 429).trim,
     data.substring(429, 443).trim,
     data.substring(443, 457).trim,
     data.substring(457, 521).trim,
     data.substring(521, 585).trim,
     data.substring(585, 649).trim,
     data.substring(649, 777).trim,
     data.substring(777, 905).trim,
     data.substring(905, 907).trim,
     data.substring(907, 909).trim,
     data.substring(909, 924).trim,
     data.substring(924, 939).trim,
     data.substring(939, 954).trim,
     data.substring(954, 974).trim,
     data.substring(974, 1034).trim,
     data.substring(1034, 1054).trim,
     data.substring(1054, 1114).trim,
     data.substring(1114, 1174).trim,
     data.substring(1174)
    )
   })

   val maapSet: Dataset[MaapInfo] = {
    maapInfo.toDS()
   }

   maapSet.createOrReplaceTempView(maapTable)
   maapInfo
  }
  def bsmsCdrFileLoad(): RDD[BsmsInfo] ={
   val bsmsRdd: RDD[String] = dao.hadoopFile(bBsmsTxtPath + bBsmsTxtFileName + "," + bsmsTxtPath + bsmsTxtFileName)
   val filHead: RDD[String] = bsmsRdd.filter(_.substring(0, 10) != bsmsHead10String)
   val filHeadTail: RDD[String] = filHead.filter(_.substring(0, 10) != bsmsTail10String)
   val bsmsInfo: RDD[BsmsInfo] = filHeadTail.map(data => {
    BsmsInfo(
     if (data.substring(198).substring(12,15)=="551") 1
     else if (data.substring(198).substring(16,19)=="551") 0
     else -1,
     null,
     null,
     null,
     0.00,
     data.substring(0, 2).trim,
     data.substring(2, 22).trim,
     data.substring(22, 24).trim,
     data.substring(24, 25).trim,
     data.substring(25, 40).trim,
     data.substring(40, 58).trim,
     data.substring(58, 70).trim,
     data.substring(70, 85).trim,
     data.substring(85, 106).trim,
     data.substring(106, 116).trim,
     data.substring(116, 123).trim,
     data.substring(123, 124).trim,
     data.substring(124, 127).trim,
     data.substring(127, 133).trim,
     data.substring(133, 139).trim,
     data.substring(139, 150).trim,
     data.substring(150, 164).trim,
     data.substring(164, 178).trim,
     data.substring(178, 198).trim,
     data.substring(198)
    )
   })

   val bsmsSet: Dataset[BsmsInfo] = bsmsInfo.toDS()

   bsmsSet.createOrReplaceTempView(bsmsTable)
   bsmsInfo
  }
}

III. Business logic development

Called from Controller; implements the concrete business logic.

There are two core businesses, BSMS (BsmsProcess) and MAAP (maapProcess), and their core logic is similar.

MAAP has two core steps:

maapCdrPreH1H2H3

maapCdrPreServerCode

BSMS has two core steps:

bsmsCdrPreH1H2H3

bsmsCdrPreServerCode

Here MAAP is analyzed in detail as the representative example.

1. maapCdrPreH1H2H3

Purpose:

Inputs: the MAAP CDRs and H1H2H3 (the number-segment configuration data).

It matches the number against the H1H2H3 segments to determine which city the number belongs to.

2. maapCdrPreServerCode

1> Match the port number against the configuration data to find the home province (segment matching: the port in the detail record is matched against the port codes in the configuration data, preferring a code of the same length; e.g. 123 matches 123, and if 123 is not in the configuration data then 1234 is tried, then 12345).

The core idea: first sort the port codes by length, then wrap the inner loop (over the port codes) in a breakable block inside the nested traversal. That way the shortest matching segment is found first; once a match is found, break() exits the inner loop, control returns to the outer messageInfos loop, and the next messageInfos element is processed. This satisfies the requirement that, once a serv_code is matched and handled, the remaining serverCodeInfo entries are skipped (see the standalone sketch after this list).

2> Rating: compute the fee from the message type and the corresponding unit price.
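
A minimal, standalone sketch of the sort-then-break matching pattern (plain Scala collections with hypothetical values; the real code applies the same idea to the collected bossmessage and CDR arrays):

Scala

import scala.util.control.Breaks.{break, breakable}

// Hypothetical port configuration: (serv_code, prov_code), sorted shortest first
val ports = Seq(("12345", "200"), ("1234", "100"), ("123", "551")).sortBy(_._1.length)

val sendUser = "1069012345678" // hypothetical sending port from a detail record

breakable {
  for ((servCode, provCode) <- ports) {
    if (sendUser.contains(servCode)) { // same containment test as senduser.indexOf(serv_code) != -1
      println(s"matched segment $servCode -> province $provCode")
      break() // stop scanning the remaining, longer segments
    }
  }
}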

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import com.dw.maap.job.config.CfgInfo._
import com.dw.maap.job.dao.Dao
import com.dw.maap.job.entity.{H1H2H3, MaapInfo, bossmessageInfo}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

import scala.util.control.Breaks.{break, breakable}

case class MaapAnalysisService() {
  private val dao = new Dao
  private val spark: SparkSession = EnvUtil.getSession
  private val sc: SparkContext = EnvUtil.getSc

  import spark.implicits._
  // Import SparkSession's implicit conversions (toDS/toDF, etc.); must come after the SparkSession has been created


  def maapCdrPreH1H2H3(maapInfo: RDD[MaapInfo], H1H2H3: RDD[H1H2H3]): RDD[MaapInfo] = {

    val h1h2h3: RDD[(String, String)] = H1H2H3
      .map(data =>
        (data.h1h2h3h4, data.long_code)
      )
    val maap: RDD[(String, String)] = maapInfo.map(
      data => (
        data.reserveuser.substring(0, 7), ""
      )
    )
    val h1h2h3Pre: Array[(String, String)] = h1h2h3.join(maap).map(data => {
      (data._1, data._2._1)
    }).collect()
    val filRdd: RDD[MaapInfo] = maapInfo
      .filter(_.is_bal_port != "0") // filter out CDRs whose sending-port flag is "0" (port not effective); is_bal_port is a String, so compare against the string "0"
    val frame: DataFrame = filRdd.map(data => {
      for (i <- h1h2h3Pre) {
        if (data.sett_direction == 0 && data.reserveuser.substring(0, 7) == i._1) {
          data.prov = i._2
          data.city = "551"
        }
      }
      data
    }
    ).toDF()

    frame.createOrReplaceTempView(maapTable)
    val value: Dataset[MaapInfo] = frame.as[MaapInfo]
    value.rdd
  }


  def maapCdrPreServerCode(bossmessage: RDD[bossmessageInfo], maap: RDD[MaapInfo]): RDD[MaapInfo] = {
    // Ports starting with 10657586 and 8 or 12 digits long are settled at 0.045 yuan per message
    // Rating
    val messageInfo: Array[MaapInfo] = maap.map(data => {
      if (data.input_file_name.contains("MAAP_TXT")) {
        data.settle_fee = maapTxtPrice * (data.messagelength.toInt / 270) // MAAP_TXT: if the message length exceeds 270, the fee is (length / 270, integer division) * unit price
      } else if (data.input_file_name.contains("MAAP_MMA")) {
        data.settle_fee = maapMmaPrice
      } else if (data.input_file_name.contains("MAAP_CSN")) {
        data.settle_fee = maapCsnPrice
      }
      data
    }).collect()

    val serverCodeInfo: Array[bossmessageInfo] = bossmessage.sortBy(_.serv_code.length) // sort by segment length so the shortest segment is tried first
      .collect()

    for (messageInfos <- messageInfo) {
      breakable {
        for (serverCodeInfos <- serverCodeInfo) {
          if (messageInfos.senduser.indexOf(serverCodeInfos.serv_code) != -1) {
            messageInfos.is_bal_port = serverCodeInfos.is_bal_port
            if (messageInfos.sett_direction == 1) {
              messageInfos.prov = "551"
              messageInfos.city = serverCodeInfos.prov_code
            }
            break() // matched: skip the remaining (longer) serverCodeInfo entries and move on to the next messageInfos
          }
        }
      }
    }

    val value: RDD[MaapInfo] = sc.makeRDD(messageInfo)
    value.toDS().createOrReplaceTempView(maapTable)
    value

  }


}

IV. Result output

Statistics: groups and sums the processed data, prints the reports, and writes them to files.

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import org.apache.spark.sql.{Dataset, Row, SparkSession}

import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}

class Statistics {
  private val spark: SparkSession = EnvUtil.getSession
  val currentDate: LocalDate = LocalDate.now()
  val currentDateTime: LocalDateTime = LocalDateTime.now()

  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss")
  val formattedDateTime: String = currentDateTime.format(formatter)

  def statisticsCity(tableName:String): Unit = {

   val value: Dataset[Row] = spark.sql(s"select '$tableName',sett_direction,city,sum(settle_fee) as fee,count(*) as count_num  from $tableName group by sett_direction,city " +
    "union " +
    s"select '$tableName',sett_direction,'全省' as prov,sum(settle_fee) as fee,count(*) as count_num  from $tableName group by sett_direction"
   ).orderBy("city")
   value.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(s"maap-analysis/src/main/output/${currentDate}/${formattedDateTime}/${tableName}_city.csv")
   
   value.show()
   
  }
  def statisticsProv(tableName:String): Unit = {
   
   val value: Dataset[Row] = spark.sql(s"select '$tableName',sett_direction,prov,sum(settle_fee) as fee ,count(*) as count_num   from $tableName group by sett_direction,prov " +
    "union " +
    s"select '$tableName',sett_direction,'全国' as prov,sum(settle_fee) as fee,count(*) as count_num  from $tableName group by sett_direction").orderBy("prov")
   value.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(s"maap-analysis/src/main/output/${currentDate}/${formattedDateTime}/${tableName}_prov.csv")
   
   value.show()
   
  }
  
}

Note: the data files to be read are under maap-analysis/src/main/resources (these files are for learning purposes only; please do not use them elsewhere).

Run results:

The BSMS results appear later in the log; the pattern is identical, so they are not shown here. Feel free to try it yourself.

V. Project extension and optimization directions

A local Hadoop, Hive, and Spark cluster is already set up, so the project can be improved to read files from HDFS and write the reports into Hive tables, simulating an enterprise development workflow (a hedged sketch of the Hive-output idea follows). This project will be upgraded when there is a chance; for now the plan is to design a properly structured, enterprise-style offline data warehouse project from scratch, so this one is temporarily shelved and will be optimized later when time allows. Stay tuned.
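
A minimal sketch of that Hive-output direction (assumes the session is built with Hive support on the cluster; the database and table names below are hypothetical, not part of the current project):

Scala

import org.apache.spark.sql.SparkSession

// Build the session with Hive support instead of the plain local session
val spark = SparkSession.builder()
  .appName("maap-analysis")
  .enableHiveSupport()
  .getOrCreate()

// "maap_table" stands in for the project's maapTable temp view name
val report = spark.sql("select * from maap_table")
report.write
  .mode("overwrite")
  .saveAsTable("dw_report.maap_city_report") // hypothetical database.table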

(For the full cluster setup process, see the CSDN column 从零开始部署本地大数据集群 by 是阿威啊.)
