[maap-analysis] The complete development workflow of a Spark offline data warehouse project

The overall project architecture uses a single-project, multi-module layout.

(For how this layout is built, see the CSDN blog post 单项目多模块的大数据项目框架构建 [building a single-project, multi-module big data project skeleton].)

Tech stack: Scala + Spark RDD + Spark Dataset

RDDs are used here mainly to study the underlying RDD implementation logic and API usage.

In enterprise structured-data development, Spark SQL or the DataFrame API is normally used instead.

Gitee repository: https://gitee.com/sawa0725/data-ware-house

I. Utility class preparation

1. Environment setup

Env: the utility trait that initializes the Spark environment

It goes into the Common module of the parent project, because the Spark environment is needed by many projects; environment code used only by this project can go into the module's own Common package instead.

Scala
// Spark environment utility trait, used to obtain a SparkContext and a SparkSession
package com.dw.common.utils

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

trait Env {

  def getSparkContext(appName: String = "AppName", master: String = "local[*]"): SparkContext = {
    val conf: SparkConf = new SparkConf().setMaster(master).setAppName(appName)
    new SparkContext(conf)
  }

  def getSparkSession(appName: String = "AppName", master: String = "local[*]"): SparkSession = {
    SparkSession.builder().master(master).appName(appName).getOrCreate()
  }
}

EnvUtil: a thread-local wrapper object

EnvUtil is a utility object for the Spark environment. Its core purpose is to centrally manage thread-local instances of SparkContext and SparkSession, avoiding resource conflicts when multiple threads run jobs, while providing convenient get/set/clear methods.

It lives in the same location as Env above.

Scala
package com.dw.common.utils

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object EnvUtil {
  private val scLocal = new ThreadLocal[SparkContext]
  private val sparkLocal = new ThreadLocal[SparkSession]

  def getSc: SparkContext = scLocal.get()

  def setSc(sc: SparkContext): Unit = scLocal.set(sc)

  def clearSc(): Unit = scLocal.remove()

  def getSession: SparkSession = sparkLocal.get()

  def setSession(session: SparkSession): Unit = sparkLocal.set(session)

  def clearSession(): Unit = sparkLocal.remove()
}

With these two steps the Spark environment is ready: any main object that mixes in Env (with Env) gains access to the trait's methods.

Example usage:

Scala
private val sc: SparkContext = getSparkContext()     // call the trait method to obtain a SparkContext
private val spark: SparkSession = getSparkSession()  // call the trait method to obtain a SparkSession
EnvUtil.setSc(sc)         // store the SparkContext in the thread-local holder
EnvUtil.setSession(spark) // store the SparkSession in the thread-local holder

DAO layer (business data access)

This layer mainly reads local files; other projects can extend it with HDFS reads and similar features.

Scala
// Business data access
package com.dw.maap.job.dao

import com.dw.common.utils.EnvUtil
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{NewHadoopRDD, RDD}
import org.apache.spark.sql.{DataFrame, SparkSession}

class Dao {
  private val sc: SparkContext = EnvUtil.getSc
  private val spark: SparkSession = EnvUtil.getSession

  // Read text files through the Hadoop new API and append the source file name to every line
  def hadoopFile(path: String): RDD[String] = {
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
      .asInstanceOf[NewHadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit(
        (inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
          val file = inputSplit.asInstanceOf[FileSplit]
          val fileName = file.getPath.getName
          iterator.map(line => line._2.toString + fileName)
        })
  }

  // Plain line-by-line text read
  def rddRead(path: String): RDD[String] = sc.textFile(path)

  // DataFrame read (spark.read.load uses the default data source, parquet, unless configured otherwise)
  def sessionRead(path: String): DataFrame = spark.read.load(path)

  // Read whole files as (file path, file content) pairs
  def rddReadWhole(path: String): RDD[(String, String)] = sc.wholeTextFiles(path)
}

A closer look at the key method:

Scala
def hadoopFile(path: String): RDD[String] = {
  // Read files through Spark's Hadoop new-API interface and append the file name to every line.
  // newAPIHadoopFile is Spark's core method for reading files via the Hadoop new API; type parameters:
  //   LongWritable    - the key type (byte offset of the line within the file);
  //   Text            - the value type (the line content as a string);
  //   TextInputFormat - Hadoop's text input format (reads the file line by line).
  val value: RDD[String] = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
    // Cast to NewHadoopRDD: a plain RDD has no mapPartitionsWithInputSplit method, and only
    // NewHadoopRDD exposes the InputSplit (file split information), which carries the file name.
    .asInstanceOf[NewHadoopRDD[LongWritable, Text]]
    .mapPartitionsWithInputSplit(
      (inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
        // Cast the generic split to FileSplit (a file-based split) to get the file path / name
        val file = inputSplit.asInstanceOf[FileSplit]
        val fileName = file.getPath.getName
        iterator.map(line => {
          line._2.toString + fileName
        })
      })
  value
}
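
For comparison, the same "line plus source file name" result can be obtained through the DataFrame API without casting to NewHadoopRDD: Spark SQL's built-in input_file_name() function returns the path of the file each row came from. This is only a sketch of an alternative (the input path is a placeholder); it is not how the project's Dao is implemented.

Scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// One row per text line, plus a column holding the full path of the file the line came from
val linesWithFile: DataFrame = spark.read
  .textFile("maap-analysis/src/main/resources/somefile*.txt") // placeholder path
  .withColumn("input_file", input_file_name())

linesWithFile.show(5, truncate = false)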

entity: business entity classes

These are case classes that encapsulate the business data structures (such as MaapInfo).

They enable type-safe Dataset operations and define the value types of the RDDs parsed from the input files.

The code for these classes can be taken from the repository if needed.
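
As an illustration only (the real MaapInfo in the repository has several dozen fields matching the fixed-width record layout), an entity case class has roughly the shape below; the var fields are the ones filled in later by the preprocessing and rating steps. The field names mirror those used in the service code, but the class itself is hypothetical:

Scala
// Hypothetical, trimmed-down entity class, not the project's real MaapInfo
case class DemoMaapInfo(
  sett_direction: Int,     // settlement direction flag derived from the source file name
  var prov: String,        // peer province, filled in during preprocessing
  var city: String,        // city code, filled in during preprocessing
  var is_bal_port: String, // whether the sending port is effective, taken from the config data
  var settle_fee: Double,  // settlement fee computed by the rating step
  senduser: String,        // sending port parsed from the fixed-width record
  reserveuser: String,     // receiving number parsed from the fixed-width record
  input_file_name: String  // source file name appended by Dao.hadoopFile
)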

II. Business workflow

1. Application entry point: ApplicationAnalysis

It only launches the job and sets up the environment.

Scala
package com.dw.maap.job.application

import com.dw.common.utils.{Env, EnvUtil}
import com.dw.maap.job.controller.Controller
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ApplicationAnalysis extends App with Env {
  // Environment setup
  // Set the log level for everything under the org package to ERROR, so only error-level
  // messages are printed and the INFO/WARN/DEBUG noise is suppressed
  Logger.getLogger("org").setLevel(Level.ERROR)
  private val sc: SparkContext = getSparkContext()
  private val spark: SparkSession = getSparkSession()
  EnvUtil.setSc(sc)
  EnvUtil.setSession(spark)

  private val controller = new Controller
  //controller.test()

  controller.mainProcess("ALL")

  EnvUtil.clearSc()
  EnvUtil.clearSession()
  sc.stop()
  spark.stop()
}

2. Flow control: Controller

The Controller only does coarse-grained flow control: it splits the overall job into a few business stages and runs them in order.

The first stage is overall flow control, including validating the input parameter.

The second stage loads the config data: the H1H2H3 number-segment data and the bossmessage port data.

The third stage is the business logic, again split into MAAP and BSMS processing (the loaded config data is passed in as input); finally the temporary views are aggregated into reports and printed.

There are four functions in total: mainProcess, cfgDataLoad (both businesses call different functions of the same data-loading class), maapProcess, and BsmsProcess.

The flow-control code:

Scala
package com.dw.maap.job.controller

import com.dw.maap.job.config.CfgInfo.{bsmsTable, maapTable}
import com.dw.maap.job.entity.{BsmsInfo, H1H2H3, MaapInfo, bossmessageInfo}
import com.dw.maap.job.service.{BsmsAnalysisService, MaapAnalysisService, dataLoad, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

class Controller {
  private val maapService = new MaapAnalysisService
  private val loadService = new dataLoad
  private val bsmsAnalysisService = new BsmsAnalysisService
  private val statistics = new Statistics

  var bossmessage: RDD[bossmessageInfo] = null
  var H1H2H3: RDD[H1H2H3] = null

  // Main process
  def mainProcess(serviceType: String): Unit = {
    if (List("BSMS", "MAAP", "ALL") contains serviceType.toUpperCase) {
      cfgDataLoad() // load the config data
      // then run the business processes according to the parameter (ALL, MAAP or BSMS)
      if (serviceType.toUpperCase == "ALL") {
        maapProcess()
        BsmsProcess()
      }
      else if (serviceType.toUpperCase == "MAAP") maapProcess()
      else if (serviceType.toUpperCase == "BSMS") BsmsProcess()
    } else println("Please pass a valid business type: ALL, MAAP or BSMS")
  }


  def cfgDataLoad(): Unit = {
    println("Loading config data... loading H1H2H3 number-segment data, printing the first 10 rows")
    H1H2H3 = loadService.H1H2H3Load().persist(StorageLevel.MEMORY_AND_DISK)
    // persist is Spark's core caching API: it keeps an RDD/DataFrame/Dataset in memory and/or on disk
    // so it is not recomputed on reuse, which improves performance
    H1H2H3.take(10).foreach(println)
    println("Loading config data... loading bossmessage port data, printing the first 10 rows")
    bossmessage = loadService.bossMessageLoad().persist(StorageLevel.MEMORY_AND_DISK)
    bossmessage.take(10).foreach(println)
  }

  def maapProcess(): Unit = {
    println("Loading MAAP detail records, printing the first 10 rows")
    val cdrRdd: RDD[MaapInfo] = loadService.maapCdrFileLoad()
    cdrRdd.take(10).foreach(println)
    println("Running MAAP peer-province preprocessing, printing the first 10 rows")
    val maapResult: RDD[MaapInfo] = maapService.maapCdrPreServerCode(bossmessage, cdrRdd)
    maapResult.take(10).foreach(println)

    println("Running MAAP home-location preprocessing, printing the first 10 rows")
    val maapCdr: RDD[MaapInfo] = maapService.maapCdrPreH1H2H3(maapResult, H1H2H3).cache()
    maapCdr.take(10).foreach(println)

    println("##############################################")
    println("Per-city report:")
    statistics.statisticsCity(maapTable)

    println("##############################################")
    println("Per-peer-province report:")
    statistics.statisticsProv(maapTable)
  }

  private def BsmsProcess(): Unit = {
    println("Loading BSMS detail records, printing the first 10 rows")
    val cdrRdd: RDD[BsmsInfo] = loadService.bsmsCdrFileLoad().persist(StorageLevel.MEMORY_AND_DISK)
    cdrRdd.take(10).foreach(println)
    // Port preprocessing must run first to flag whether the port is effective
    println("Running BSMS port preprocessing, printing the first 10 rows")
    val bsmsResult: RDD[BsmsInfo] = bsmsAnalysisService.bsmsCdrPreServerCode(bossmessage, cdrRdd).persist(StorageLevel.MEMORY_AND_DISK)
    bsmsResult.take(10).foreach(println)
    println("Running BSMS phone-number preprocessing, printing the first 10 rows")
    val bsmsCdr: RDD[BsmsInfo] = bsmsAnalysisService.bsmsCdrPreH1H2H3(H1H2H3, bsmsResult).persist(StorageLevel.MEMORY_AND_DISK)
    bsmsCdr.take(10).foreach(println)

    println("##############################################")
    println("Per-city report:")
    statistics.statisticsCity(bsmsTable)

    println("##############################################")
    println("Per-peer-province report:")
    statistics.statisticsProv(bsmsTable)
  }


}

3. Data-loading class: dataLoad

Purpose: load the config data and the detail data; it is called from Controller.

(Essentially it maps the file contents onto the entity classes and exposes them as Dataset temporary views; a distilled sketch of this pattern follows, then the full class.)
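
Every load function follows the same three-step pattern: parse each line into a case class, convert to a Dataset and register a temporary view (which the SQL in Statistics later queries), and return the RDD to the caller. Below is a distilled, runnable sketch of that pattern; the Demo case class, view name and path are made up for illustration.

Scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical two-field entity standing in for bossmessageInfo / H1H2H3 / MaapInfo
case class Demo(id: String, name: String)

object DemoLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    def demoLoad(path: String): RDD[Demo] = {
      val rdd: RDD[Demo] = spark.sparkContext.textFile(path)
        .map(_.split("\\u0001"))                          // the project's config files use the 0x01 separator
        .map(a => Demo(a(0), a(1)))
      rdd.toDS().createOrReplaceTempView("demo_table")    // later queried through spark.sql
      rdd
    }

    demoLoad("maap-analysis/src/main/resources/demo.txt").take(10).foreach(println)
  }
}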

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import com.dw.maap.job.config.CfgInfo.{H1H2H3FileName, H1H2H3Path, bBsmsTxtFileName, bBsmsTxtPath, bMaapTxtFileName, bMaapTxtPath, bossmessageinfoFilename, bossmessageinfoPath, bsmsHead10String, bsmsTable, bsmsTail10String, bsmsTxtFileName, bsmsTxtPath, dd_bossmessageinfo, dd_h1h2h3_code_allocate, maapHead10String, maapTable, maapTail10String, maapTxtFileName, maapTxtPath}
import com.dw.maap.job.entity.{BsmsInfo, H1H2H3, MaapInfo, bossmessageInfo}
import com.dw.maap.job.dao.Dao
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}



class dataLoad {
  
  private val spark: SparkSession = EnvUtil.getSession
  private val sc: SparkContext = EnvUtil.getSc
  private val dao = new Dao
  import spark.implicits._
  // Import the SparkSession's implicit conversions (toDS/toDF, etc.); this must come after the SparkSession is created

  def bossMessageLoad(): RDD[bossmessageInfo] = {
   val bossmessageRdd: RDD[bossmessageInfo] = dao.rddRead(bossmessageinfoPath + bossmessageinfoFilename)
    .map(_.split("\\u0001"))
    .map(data => {
     bossmessageInfo(
      data(0), data(1), data(2), data(3), data(4), data(5), data(6), data(7), data(8), data(9), data(10), data(11), data(12), data(13), data(14), data(15)
     )
    })
   val bossmessageSet: Dataset[bossmessageInfo] = bossmessageRdd.toDS()
   bossmessageSet.createOrReplaceTempView(dd_bossmessageinfo)
   bossmessageRdd
  }
  
  def H1H2H3Load(): RDD[H1H2H3] = {
   val H1H2H3Rdd: RDD[H1H2H3] = dao.rddRead(H1H2H3Path + H1H2H3FileName)
    .map(_.split("\\u0001"))
    .map(data => {
     H1H2H3(
      data(0).trim,
      data(1).trim,
      data(2).trim,
      data(3).trim,
      data(4).trim,
      data(5).trim,
      data(6).trim,
      data(7).trim,
      data(8).trim,
      data(9).trim,
      data(10).trim,
      data(11).trim,
      data(12).trim,
      data(13).trim
     )
    })
   val H1H2H3Set: Dataset[H1H2H3] = H1H2H3Rdd.toDS()
   H1H2H3Set.createOrReplaceTempView(dd_h1h2h3_code_allocate)
   H1H2H3Rdd
  }


  def maapCdrFileLoad(): RDD[MaapInfo] = {
   val maapRdd: RDD[String] = dao.hadoopFile(bMaapTxtPath + bMaapTxtFileName + "*" + "," + maapTxtPath + maapTxtFileName + "*")


   val filHead: RDD[String] = maapRdd.filter(_.substring(0, 10) != maapHead10String)
   val filHeadTail: RDD[String] = filHead.filter(_.substring(0, 10) != maapTail10String)

   val maapInfo: RDD[MaapInfo] = filHeadTail.map(data => {
    MaapInfo(

     if (data.substring(1174).startsWith(bMaapTxtFileName)) 1
     else if (data.substring(1174).startsWith(maapTxtFileName)) 0
     else {
      -1
     },
     null,
     null,
     null,
     0.00,
     data.substring(0, 2).trim,
     data.substring(2, 22).trim,
     data.substring(22, 54).trim,
     data.substring(54, 56).trim,
     data.substring(56, 184).split(":")(1).split("@")(0),
     data.substring(184, 312).split("\\+86")(1).trim,
     data.substring(312, 314).trim,
     data.substring(314, 316).trim,
     data.substring(316, 352).trim,
     data.substring(352, 388).trim,
     data.substring(388, 424).trim,
     data.substring(424, 426).trim,
     data.substring(426, 428).trim,
     data.substring(428, 429).trim,
     data.substring(429, 443).trim,
     data.substring(443, 457).trim,
     data.substring(457, 521).trim,
     data.substring(521, 585).trim,
     data.substring(585, 649).trim,
     data.substring(649, 777).trim,
     data.substring(777, 905).trim,
     data.substring(905, 907).trim,
     data.substring(907, 909).trim,
     data.substring(909, 924).trim,
     data.substring(924, 939).trim,
     data.substring(939, 954).trim,
     data.substring(954, 974).trim,
     data.substring(974, 1034).trim,
     data.substring(1034, 1054).trim,
     data.substring(1054, 1114).trim,
     data.substring(1114, 1174).trim,
     data.substring(1174)
    )
   })

   val maapSet: Dataset[MaapInfo] = {
    maapInfo.toDS()
   }

   maapSet.createOrReplaceTempView(maapTable)
   maapInfo
  }
  def bsmsCdrFileLoad(): RDD[BsmsInfo] ={
   val bsmsRdd: RDD[String] = dao.hadoopFile(bBsmsTxtPath + bBsmsTxtFileName + "," + bsmsTxtPath + bsmsTxtFileName)
   val filHead: RDD[String] = bsmsRdd.filter(_.substring(0, 10) != bsmsHead10String)
   val filHeadTail: RDD[String] = filHead.filter(_.substring(0, 10) != bsmsTail10String)
   val bsmsInfo: RDD[BsmsInfo] = filHeadTail.map(data => {
    BsmsInfo(
     if (data.substring(198).substring(12,15)=="551") 1
     else if (data.substring(198).substring(16,19)=="551") 0
     else -1,
     null,
     null,
     null,
     0.00,
     data.substring(0, 2).trim,
     data.substring(2, 22).trim,
     data.substring(22, 24).trim,
     data.substring(24, 25).trim,
     data.substring(25, 40).trim,
     data.substring(40, 58).trim,
     data.substring(58, 70).trim,
     data.substring(70, 85).trim,
     data.substring(85, 106).trim,
     data.substring(106, 116).trim,
     data.substring(116, 123).trim,
     data.substring(123, 124).trim,
     data.substring(124, 127).trim,
     data.substring(127, 133).trim,
     data.substring(133, 139).trim,
     data.substring(139, 150).trim,
     data.substring(150, 164).trim,
     data.substring(164, 178).trim,
     data.substring(178, 198).trim,
     data.substring(198)
    )
   })

   val bsmsSet: Dataset[BsmsInfo] = bsmsInfo.toDS()

   bsmsSet.createOrReplaceTempView(bsmsTable)
   bsmsInfo
  }
}

III. Business logic development

Called from Controller; this is where the concrete business logic is implemented.

There are two core businesses, BSMS (BsmsProcess) and MAAP (maapProcess), and their core logic is very similar.

MAAP has two core routines:

maapCdrPreH1H2H3

maapCdrPreServerCode

BSMS has two core routines:

bsmsCdrPreH1H2H3

bsmsCdrPreServerCode

We will walk through the MAAP side in detail.

1. maapCdrPreH1H2H3

Purpose:

Its inputs are the MAAP call detail records and H1H2H3 (the number-segment config data).

It matches each number against the H1H2H3 segments to work out which city the number belongs to.

2. maapCdrPreServerCode

1> Match the detail record's port number against the port numbers in the config data to find the owning province. The matching prefers the shortest configured port number that appears in the detail port: for example, 123 matches 123; if 123 is not in the config data, fall back to 1234; if there is no 1234 either, fall back to 12345.

The key is to sort the config port numbers by length first and then wrap the inner loop over the ports in breakable. That way the shortest port number is tried first; as soon as one matches, break() exits the inner loop, control returns to the outer messageInfos loop, and the next messageInfos element is processed. This satisfies the requirement that once a serv_code has matched, the remaining serverCodeInfo entries are skipped. (A minimal standalone sketch of this pattern appears after point 2>.)

2> Rating: compute the settlement fee from the message type and the corresponding per-message price.
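
Before the full service code, here is a minimal standalone sketch (with made-up port numbers and province codes) of the sort-then-breakable matching pattern from point 1>: the config ports are sorted by length, the first, i.e. shortest, port contained in the detail port wins, and break() skips the rest.

Scala
import scala.util.control.Breaks.{break, breakable}

// Hypothetical config entries: (port number, province code)
val servCodes = Seq(("1234", "100"), ("123", "551"), ("12345", "200"))
val sortedCodes = servCodes.sortBy(_._1.length) // shortest port number first

val detailPort = "1230001" // made-up detail-record port
var matchedProv: Option[String] = None

breakable {
  for ((code, prov) <- sortedCodes) {
    if (detailPort.indexOf(code) != -1) { // same contains-style match as the service code
      matchedProv = Some(prov)            // the shortest matching port wins
      break()                             // skip the remaining, longer ports
    }
  }
}
println(matchedProv) // Some(551)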

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import com.dw.maap.job.config.CfgInfo._
import com.dw.maap.job.dao.Dao
import com.dw.maap.job.entity.{H1H2H3, MaapInfo, bossmessageInfo}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

import scala.util.control.Breaks.{break, breakable}

case class MaapAnalysisService() {
  private val dao = new Dao
  private val spark: SparkSession = EnvUtil.getSession
  private val sc: SparkContext = EnvUtil.getSc

  import spark.implicits._
  // Import the SparkSession's implicit conversions (toDS/toDF, etc.); must come after the SparkSession is created


  def maapCdrPreH1H2H3(maapInfo: RDD[MaapInfo], H1H2H3: RDD[H1H2H3]): RDD[MaapInfo] = {

    val h1h2h3: RDD[(String, String)] = H1H2H3
      .map(data =>
        (data.h1h2h3h4, data.long_code)
      )
    val maap: RDD[(String, String)] = maapInfo.map(
      data => (
        data.reserveuser.substring(0, 7), ""
      )
    )
    val h1h2h3Pre: Array[(String, String)] = h1h2h3.join(maap).map(data => {
      (data._1, data._2._1)
    }).collect()
    val filRdd: RDD[MaapInfo] = maapInfo
      .filter(_.is_bal_port != '0') // drop records whose sending port has been flagged as ineffective
    val frame: DataFrame = filRdd.map(data => {
      for (i <- h1h2h3Pre) {
        if (data.sett_direction == 0 && data.reserveuser.substring(0, 7) == i._1) {
          data.prov = i._2
          data.city = "551"
        }
      }
      data
    }
    ).toDF()

    frame.createOrReplaceTempView(maapTable)
    val value: Dataset[MaapInfo] = frame.as[MaapInfo]
    value.rdd
  }


  def maapCdrPreServerCode(bossmessage: RDD[bossmessageInfo], maap: RDD[MaapInfo]): RDD[MaapInfo] = {
    // Rating: business ports starting with 10657586 and 8 or 12 digits long settle at 0.045 yuan per message
    val messageInfo: Array[MaapInfo] = maap.map(data => {
      if (data.input_file_name.contains("MAAP_TXT")) {
        // MAAP_TXT: if the message is longer than 270 characters, the fee is (length / 270) * unit price
        data.settle_fee = maapTxtPrice * (data.messagelength.toInt / 270)
      } else if (data.input_file_name.contains("MAAP_MMA")) {
        data.settle_fee = maapMmaPrice
      } else if (data.input_file_name.contains("MAAP_CSN")) {
        data.settle_fee = maapCsnPrice
      }
      data
    }).collect()

    // Sort the config ports by serv_code length so the shortest port number is tried first
    val serverCodeInfo: Array[bossmessageInfo] = bossmessage.sortBy(_.serv_code.length)
      .collect()

    for (messageInfos <- messageInfo) {
      breakable {
        for (serverCodeInfos <- serverCodeInfo) {
          if (messageInfos.senduser.indexOf(serverCodeInfos.serv_code) != -1) {
            messageInfos.is_bal_port = serverCodeInfos.is_bal_port
            if (messageInfos.sett_direction == 1) {
              messageInfos.prov = "551"
              messageInfos.city = serverCodeInfos.prov_code
            }
            break() // matched: skip the remaining (longer) config ports for this record
          }
        }
      }
    }

    val value: RDD[MaapInfo] = sc.makeRDD(messageInfo)
    value.toDS().createOrReplaceTempView(maapTable)
    value

  }


}

IV. Result output

Statistics: groups and sums the processed data, prints the report, and writes it to a file.

Scala
package com.dw.maap.job.service

import com.dw.common.utils.EnvUtil
import org.apache.spark.sql.{Dataset, Row, SparkSession}

import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}

class Statistics {
  private val spark: SparkSession = EnvUtil.getSession
  val currentDate: LocalDate = LocalDate.now()
  val currentDateTime: LocalDateTime = LocalDateTime.now()

  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss")
  val formattedDateTime: String = currentDateTime.format(formatter)

  def statisticsCity(tableName: String): Unit = {

    val value: Dataset[Row] = spark.sql(s"select '$tableName',sett_direction,city,sum(settle_fee) as fee,count(*) as count_num from $tableName group by sett_direction,city " +
      "union " +
      s"select '$tableName',sett_direction,'Province total' as city,sum(settle_fee) as fee,count(*) as count_num from $tableName group by sett_direction"
    ).orderBy("city")
    value.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true")
      .save(s"maap-analysis/src/main/output/${currentDate}/${formattedDateTime}/${tableName}_city.csv")

    value.show()

  }

  def statisticsProv(tableName: String): Unit = {

    val value: Dataset[Row] = spark.sql(s"select '$tableName',sett_direction,prov,sum(settle_fee) as fee,count(*) as count_num from $tableName group by sett_direction,prov " +
      "union " +
      s"select '$tableName',sett_direction,'National total' as prov,sum(settle_fee) as fee,count(*) as count_num from $tableName group by sett_direction").orderBy("prov")
    value.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true")
      .save(s"maap-analysis/src/main/output/${currentDate}/${formattedDateTime}/${tableName}_prov.csv")

    value.show()

  }
  
}

Note: the input data files are under maap-analysis/src/main/resources (they are provided for learning only; please do not use them anywhere else).

Run results:

Further down the log are the BSMS results; the pattern is identical, so they are not shown here. Feel free to run it yourself.

V. Project extension and optimization directions

A local Hadoop, Hive and Spark cluster is already set up, so the project could be upgraded to read its input files from HDFS and write the reports into Hive tables, mimicking an enterprise development flow. I plan to rework this project later; for now I am starting to design a properly structured, enterprise-style offline data warehouse project from scratch, so this one is temporarily on hold. I will come back and optimize it when time allows, so stay tuned.

(The full cluster setup is covered in my CSDN column 从零开始部署本地大数据集群 [deploying a local big data cluster from scratch].)
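
A rough sketch of that upgrade path, assuming a reachable HDFS namenode and a configured Hive metastore; the path, view name, database and table name below are placeholders rather than anything the current project defines:

Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maap-analysis")
  .enableHiveSupport() // needs hive-site.xml / a reachable metastore on the classpath
  .getOrCreate()

// Read the detail files from HDFS instead of the local resources directory (placeholder path)
val cdrLines = spark.read.textFile("hdfs://namenode:8020/data/maap/cdr/*")
cdrLines.createOrReplaceTempView("cdr_lines")

// After parsing and preprocessing as in dataLoad and the services, the report query would be
// written into a Hive table instead of a local CSV file (placeholder database/table name)
spark.sql("select count(*) as line_cnt from cdr_lines")
  .write
  .mode("overwrite")
  .saveAsTable("dw.maap_line_count")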
