I. Project Architecture
Gitee repository:
https://gitee.com/sawa0725/data-ware-house
Project: DataWareHouse
Modules
1. maap-analysis: offline module that runs Spark locally and applies RDD operators to SMS-related business data
2. bigdata-connect-demo: test module for connecting to the big-data cluster from the local machine
3. adlog-analysis: Spark Streaming real-time module for ad-log analysis
Tech stack
Scala 2.12.17 + Spark 3.4.1 + Hadoop 3.3.6 + Hive 3.1.2 + JDK 1.8
Directory structure overview
DataWareHouse/                      # root project (parent POM)
├── pom.xml                         # parent POM: dependency versions, plugins, shared config
├── common/                         # shared module (utilities, constants, base config)
│   ├── pom.xml
│   ├── src/main/scala/
│   │   ├── com/dw/common/          # shared package
│   │   │   ├── config/             # shared configuration (Spark/Hadoop/Hive connection config)
│   │   │   ├── utils/              # utility classes (DateUtils, JDBCUtils, KafkaUtils, ...)
│   │   │   ├── constant/           # constants (Kafka topics, MySQL table names, ...)
│   │   │   └── exception/          # custom exceptions
│   │   └── resources/              # shared resources (log4j2.xml, application.conf, ...)
│   └── src/test/scala/             # shared module tests
├── maap-analysis/                  # module 1: local Spark file processing
│   ├── pom.xml
│   ├── src/main/scala/
│   │   ├── com/dw/maap/            # module package
│   │   │   ├── core/               # core business logic (Spark file processing)
│   │   │   ├── job/                # job entry points (Spark jobs)
│   │   │   └── service/            # business service layer
│   │   └── resources/              # module config (e.g. file path settings)
│   └── src/test/scala/             # module tests
├── bigdata-connect-demo/           # module 2: big-data component connectivity tests
│   ├── pom.xml
│   ├── src/main/scala/
│   │   ├── com/dw/connect/         # module package
│   │   │   ├── hadoop/             # Hadoop connectivity tests (HDFS operations)
│   │   │   ├── spark/              # Spark cluster connectivity tests
│   │   │   ├── hive/               # Hive connectivity tests (Spark-to-Hive access)
│   │   │   └── test/               # unified test entry point
│   │   └── resources/              # connection config (core-site.xml, hive-site.xml, ...)
│   └── src/test/scala/             # module tests
└── adlog-analysis/                 # module 3: Spark Streaming + Kafka + MySQL
    ├── pom.xml
    ├── src/main/scala/
    │   ├── com/dw/adlog/           # module package
    │   │   ├── consumer/           # Kafka consumer logic
    │   │   ├── processor/          # log processing logic
    │   │   ├── sink/               # writes results to MySQL
    │   │   └── streaming/          # Spark Streaming entry point
    │   └── resources/              # module config (Kafka/MySQL connection info)
    └── src/test/scala/             # module tests
II. Maven POM Design
1. Root project POM
DataWareHouse/pom.xml
Purpose: centrally manage dependency versions, configure shared plugins, and declare the submodules, avoiding version conflicts.
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<!-- Root project information -->
<groupId>com.dw</groupId>
<artifactId>data-ware-house</artifactId>
<version>1.0.0</version>
<packaging>pom</packaging>
<name>DataWareHouse</name>
<description>Data warehouse project: Spark + Hadoop + Hive + Kafka + MySQL</description>
<!-- Declare submodules -->
<modules>
<module>common</module>
<module>maap-analysis</module>
<module>bigdata-connect-demo</module>
<module>adlog-analysis</module>
</modules>
<!-- Centralized version management -->
<properties>
<!-- Base environment -->
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<!-- Core dependency versions -->
<scala.version>2.12.17</scala.version>
<scala.compat.version>2.12</scala.compat.version>
<spark.version>3.4.1</spark.version>
<hadoop.version>3.3.6</hadoop.version>
<hive.version>3.1.2</hive.version>
<!-- Other dependency versions -->
<kafka.version>3.4.0</kafka.version> <!-- compatible with Spark 3.4.1 -->
<mysql-connector.version>8.0.33</mysql-connector.version>
<log4j2.version>2.20.0</log4j2.version>
<slf4j.version>1.7.36</slf4j.version>
<commons-lang3.version>3.12.0</commons-lang3.version>
<!-- Plugin versions -->
<maven-compiler-plugin.version>3.11.0</maven-compiler-plugin.version>
<scala-maven-plugin.version>4.8.1</scala-maven-plugin.version>
<maven-shade-plugin.version>3.4.1</maven-shade-plugin.version>
</properties>
<!-- Dependency management (submodules inherit these versions without specifying them) -->
<dependencyManagement>
<dependencies>
<!-- Scala core -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
<!-- On a cluster, exclude the bundled Hadoop; keep it for local testing -->
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Spark SQL -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark Streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark-Kafka integration -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark-Hive integration -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Hadoop core -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- Hive JDBC -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- MySQL driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql-connector.version}</version>
</dependency>
<!-- Logging -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<!-- Utilities -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>${commons-lang3.version}</version>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<version>3.2.15</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-testing-base_${scala.compat.version}</artifactId>
<version>3.4.1_1.4.0</version>
<scope>test</scope>
</dependency>
</dependencies>
</dependencyManagement>
<!-- Shared plugin configuration -->
<build>
<pluginManagement>
<plugins>
<!-- Scala compiler plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>${scala-maven-plugin.version}</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.8</arg>
</args>
</configuration>
</plugin>
<!-- Java compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
<!-- Packaging plugin (builds an executable fat JAR) -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${maven-shade-plugin.version}</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass></mainClass> <!-- each submodule sets its own main class -->
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/services/org.apache.spark.sql.sources.DataSourceRegister</resource>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
<!-- Plugins applied to every module -->
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>${scala-maven-plugin.version}</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
</plugin>
</plugins>
</build>
</project>
2. Common module POM
common/pom.xml
Purpose: inherits from the root POM and provides the shared dependencies (utilities, logging, configuration).
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>com.dw</groupId>
<artifactId>data-ware-house</artifactId>
<version>1.0.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>common</artifactId>
<name>common</name>
<description>Shared utility module</description>
<packaging>jar</packaging>
<!-- Dependencies (versions inherited from the parent POM) -->
<dependencies>
<!-- Scala core -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
</dependency>
<!-- Spark core (used by the shared configuration helpers) -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
</dependency>
<!-- Hadoop (shared HDFS utilities) -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
</dependency>
<!-- Logging -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
</dependency>
<!-- Utilities -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
</project>
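The utils/ package above lists helpers such as DateUtils, JDBCUtils and KafkaUtils. As a grounding example, here is a minimal sketch of what the DateUtils helper could look like; the pattern constant and method names are illustrative assumptions, not the repository's actual API.
Scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Illustrative sketch of a shared date helper in com.dw.common.utils.
// The pattern and method names are assumptions, not the project's real API.
object DateUtils {
  private val DefaultPattern = "yyyy-MM-dd HH:mm:ss"

  // Format the current time with the given pattern.
  def now(pattern: String = DefaultPattern): String =
    LocalDateTime.now().format(DateTimeFormatter.ofPattern(pattern))

  // Parse a timestamp string back into a LocalDateTime.
  def parse(text: String, pattern: String = DefaultPattern): LocalDateTime =
    LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern))
}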
3. maap-analysis module POM
maap-analysis/pom.xml
Purpose: depends on the common module and pulls in what local Spark file processing needs.
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>com.dw</groupId>
<artifactId>data-ware-house</artifactId>
<version>1.0.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>maap-analysis</artifactId>
<name>maap-analysis</name>
<description>Local Spark file processing module</description>
<packaging>jar</packaging>
<!-- Dependencies -->
<dependencies>
<!-- Common module -->
<dependency>
<groupId>com.dw</groupId>
<artifactId>common</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Spark core (local mode) -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<!-- Packaging (main class, e.g. MaapAnalysisJob) -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.dw.maap.job.MaapAnalysisJob</mainClass>
</transformer>
</transformers>
</configuration>
</plugin>
</plugins>
</build>
</project>
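For orientation, a minimal sketch of the kind of local RDD job the MaapAnalysisJob main class configured above might run; the input path and the "userId,status" record layout are assumptions, not the project's actual data format.
Scala
import org.apache.spark.{SparkConf, SparkContext}

object MaapAnalysisJob {
  def main(args: Array[String]): Unit = {
    // local[*] matches the module's "local Spark" positioning.
    val conf = new SparkConf().setAppName("maap-analysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed layout "userId,status,...": count SMS records per status.
      val statusCounts = sc.textFile("data/maap/sms.log")
        .map(_.split(","))
        .filter(_.length >= 2)
        .map(fields => (fields(1), 1))
        .reduceByKey(_ + _)
      statusCounts.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}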
4. bigdata-connect-demo module POM
bigdata-connect-demo/pom.xml
Purpose: depends on the common module and pulls in the Hadoop/Spark/Hive connectivity-test dependencies.
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>com.dw</groupId>
<artifactId>data-ware-house</artifactId>
<version>1.0.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>bigdata-connect-demo</artifactId>
<name>bigdata-connect-demo</name>
<description>Big-data component connectivity test module</description>
<packaging>jar</packaging>
<!-- Dependencies -->
<dependencies>
<!-- Common module -->
<dependency>
<groupId>com.dw</groupId>
<artifactId>common</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Spark + Hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.compat.version}</artifactId>
</dependency>
<!-- Hadoop client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</dependency>
<!-- Hive JDBC -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<!-- Packaging (connectivity-test main class) -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.dw.connect.test.ConnectTestMain</mainClass>
</transformer>
</transformers>
</configuration>
</plugin>
</plugins>
</build>
</project>
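A hedged sketch of what the ConnectTestMain entry point could look like. It assumes the core-site.xml / hive-site.xml kept under resources/ end up on the classpath so enableHiveSupport() can reach the cluster's metastore; the SHOW DATABASES probe is illustrative, not the module's actual test.
Scala
import org.apache.spark.sql.SparkSession

object ConnectTestMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bigdata-connect-demo")
      .enableHiveSupport() // picks up hive-site.xml from the classpath
      .getOrCreate()
    try {
      // Listing databases exercises the metastore connection end to end.
      spark.sql("SHOW DATABASES").show()
    } finally {
      spark.stop()
    }
  }
}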
5. adlog-analysis module POM
adlog-analysis/pom.xml
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>com.dw</groupId>
<artifactId>data-ware-house</artifactId>
<version>1.0.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>adlog-analysis</artifactId>
<name>adlog-analysis</name>
<description>Spark Streaming log analysis module (Kafka + MySQL)</description>
<packaging>jar</packaging>
<!-- Dependencies -->
<dependencies>
<!-- Common module -->
<dependency>
<groupId>com.dw</groupId>
<artifactId>common</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Spark Streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.compat.version}</artifactId>
</dependency>
<!-- Spark-Kafka integration -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.compat.version}</artifactId>
</dependency>
<!-- MySQL driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-testing-base_${scala.compat.version}</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<!-- Packaging (Streaming main class) -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.dw.adlog.streaming.AdLogStreamingJob</mainClass>
</transformer>
</transformers>
</configuration>
</plugin>
</plugins>
</build>
</project>
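The dependency set above implies a Kafka-to-MySQL pipeline. Below is a minimal sketch of how the AdLogStreamingJob main class might wire it together, using the "kafka" source from spark-sql-kafka-0-10 and a foreachBatch JDBC sink; the broker address, topic, table name, credentials and checkpoint path are placeholders, not the project's actual configuration.
Scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AdLogStreamingJob {

  // JDBC sink for one micro-batch; connection details are placeholders.
  private def writeBatchToMySql(batch: DataFrame, batchId: Long): Unit =
    batch.write
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/adlog?useSSL=false")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "ad_log_raw")
      .option("user", "dw")
      .option("password", "******")
      .mode("append")
      .save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("adlog-analysis").getOrCreate()

    // Read the raw ad-log topic; each Kafka value becomes one string line.
    val adLogs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "adlog")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // An explicitly typed Function2 sidesteps the foreachBatch overload
    // ambiguity that plain lambdas can hit on Scala 2.12.
    val sink: (DataFrame, Long) => Unit = writeBatchToMySql
    val query = adLogs.writeStream
      .foreachBatch(sink)
      .option("checkpointLocation", "/tmp/adlog-checkpoint")
      .start()

    query.awaitTermination()
  }
}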
III. Module Responsibilities
1. Parent project
| Responsibility | Description |
|------|------|
| Module aggregation | Acts as the Maven parent project that aggregates the common, maap-analysis and other submodules |
| Global configuration | Centrally manages dependency versions (e.g. Spark, Scala) and repository addresses for all submodules |
| Build governance | Defines the global build rules (packaging, compiler plugins) |
2. common submodule (shared utility layer)
| Directory / class | Responsibility | Details |
|------|------|------|
| utils/ and friends | Generic utilities | Date formatting, string handling, generic HDFS operations, exception-handling helpers, etc. (business-agnostic) |
| config/ | Global configuration loading | Generic helpers for reading configuration files such as application.conf (e.g. Spark cluster address, HDFS paths); see the sketch after this table |
| constant/ | Global constants | System-level constants (e.g. Spark job timeout, HDFS root path) and enums (e.g. task status) |
| exception/ | Common exceptions | Custom base exceptions (e.g. SparkEnvException) and a global exception handler |
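The config/ row above describes a generic reader for files such as application.conf. A minimal sketch, assuming a simple key=value properties file on the classpath; the real module may instead use a HOCON parser such as Typesafe Config, and the key names here are hypothetical.
Scala
import java.util.Properties

object ConfigLoader {
  // Load application.conf from the classpath once, lazily.
  private lazy val props: Properties = {
    val p = new Properties()
    val in = getClass.getClassLoader.getResourceAsStream("application.conf")
    if (in != null) {
      try p.load(in) finally in.close()
    }
    p
  }

  // Look up a key such as "spark.master" or "hdfs.root", with a fallback.
  def get(key: String, default: String = ""): String =
    props.getProperty(key, default)
}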
3. Business submodules (e.g. maap-analysis)
| Directory / class | Responsibility | Details |
|------|------|------|
| dao/ | Business data access | 1. Uses EnvUtil to fetch the thread-local Spark context and performs the module's reads and writes (e.g. reading Hive tables, writing Parquet files); 2. each Dao class handles exactly one business domain (e.g. MaapDataDao for maap data) |
| service/ | Business logic | 1. Calls the Dao layer for data; 2. encapsulates the Spark computation (RDD/DataFrame/Dataset processing); 3. relies on common utilities for generic operations |
| job/ | Job entry point | 1. The Spark job's main class (e.g. MaapAnalysisJob); 2. initializes the Spark context and binds it to the current thread via EnvUtil.setSc() / setSession(); 3. calls the Service layer, then EnvUtil.clear() once the job finishes (see the EnvUtil sketch after this table) |
| entity/ | Business entities | Case classes that model the business data (e.g. MaapInfo) for type-safe Dataset operations |
| config/ (optional) | Module-specific configuration | Settings owned by this module alone (e.g. maap analysis thresholds, filter rules) |
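The dao/ and job/ rows reference an EnvUtil that hands the Spark context around via ThreadLocal. Below is a minimal sketch consistent with that description; the setSc / setSession / clear names come from the table above, everything else is an assumption.
Scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// ThreadLocal holder so Dao/Service code can fetch the context created by
// the Job entry point without threading it through every method signature.
object EnvUtil {
  private val scHolder = new ThreadLocal[SparkContext]
  private val sessionHolder = new ThreadLocal[SparkSession]

  def setSc(sc: SparkContext): Unit = scHolder.set(sc)
  def getSc: SparkContext = scHolder.get()

  def setSession(session: SparkSession): Unit = sessionHolder.set(session)
  def getSession: SparkSession = sessionHolder.get()

  // Called when a Job finishes so contexts do not leak across pooled threads.
  def clear(): Unit = {
    scHolder.remove()
    sessionHolder.remove()
  }
}

// Typical Job skeleton implied by the table above (sketch):
//   val sc = new SparkContext(conf)
//   EnvUtil.setSc(sc)
//   try service.run() finally { EnvUtil.clear(); sc.stop() }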