1. case class学习

1.1 样例类（case class）适合用于不可变的数据。它是一种特殊的类，能够被优化以用于模式匹配。

scala 复制代码

case class MetaData(userId: String)

case class Book(name: String) {
  def printBookName(): Unit = {
    println(name)
  }
}

object BookTest {
  def main(args: Array[String]): Unit = {
    val book = Book("Java入门到放弃")
    book.printBookName()
  }
}

在实例化case class类时，不需要使用关键字New，case class类编译成class文件之后会自动生成apply方法，这个方法负责对象的创建。

1.2 模式匹配

一个模式匹配语句包括一个待匹配的值，match关键字，以及至少一个case语句。示例如下：

scala 复制代码

def matchTest(x: Int): String = x match {
    case 1 => "one"
    case 2 => "two"
    case _ => "many"
  }

1.3 case class的匹配

scala 复制代码

abstract class Notification # 抽象类，虚基类，父类

case class Email(sender: String, title: String, body: String) extends Notification
case class SMS(caller: String, message: String) extends Notification
case class VoiceRecording(contactName: String, link: String) extends Notification

Notification 是一个虚基类，它有三个具体的子类Email, SMS和VoiceRecording，我们可以在这些Case Class类上使用模式匹配：

scala 复制代码

def showNotification(notification: Notification): String = {
  notification match {
    case Email(email, title, _) =>
      s"You got an email from $email with title: $title"
    case SMS(number, message) =>
      s"You got an SMS from $number! Message: $message"
    case VoiceRecording(name, link) =>
      s"you received a Voice Recording from $name! Link: $link"
  }
}

val someSms = SMS("12345", "Are you there?")
val someVoiceRecording = VoiceRecording("Tom", "voicerecording.org/id/123")

println(showNotification(someSms))  // prints You got an SMS from 12345! Message: Are you there?
println(showNotification(someVoiceRecording))  // you received a Voice Recording from Tom! Click the link to hear it: voicerecording.org/id/123

2. 类转json

scala自带的Json解析工具，简单使用 ，参考Scala json字符串和json对象互相转换

scala 复制代码

import org.json4s.jackson.{JsonMethods, Serialization}

case class User(name: String, age: Int)

object Json4STest {
  def main(args: Array[String]): Unit = {

    //数据格式：user:{"name": "zs","age": 10}
    val json = "{\"name\": \"zs\",\"age\": 10}"

    //导隐式函数
    implicit val f = org.json4s.DefaultFormats
    /** ************************* json字符串->对象 ***************************************/

    val user: User = JsonMethods.parse(json).extract[User]
    println(user)

    /** ************************* 对象->json字符串 ***************************************/
    val userJStr: String = Serialization.write(user)
    println(userJStr)
  }
}

3. Array 和 List

scala 复制代码

val listTest = List(1,2,3,4)

val arrayTest = Array(1, 2, 3, 4) //正确
for (i <- arrayTest.indices){
  println(arrayTest(i))
}

scala> 1 until 10 by 3
res17: scala.collection.immutable.Range = Range(1, 4, 7)

scala> 1 to 10 by 3
res18: scala.collection.immutable.Range = Range(1, 4, 7, 10)

scala> arrayTest.indices
res19: scala.collection.immutable.Range = Range(0, 1, 2, 3)

4. Scala常用函数式编程

4.1 map

map方法可以将某个函数应用到集合中的每个元素并产出其结果的集合，比如

scala 复制代码

scala> val names = List("a", "b", "c")
names: List[String] = List(a, b, c)

scala> names.map(_.toUpperCase)
res1: List[String] = List(A, B, C)

4.2 foreach

foreach和map相似，只不过它没有返回值，foreach只要是为了对参数进行作用。

scala 复制代码

scala> names.foreach{name => println(name)}
a
b
c

4.3 flatten

flatten可以把嵌套的结构展开.

scala 复制代码

scala> List(List(1,2), List(3,4)).flatten
res3: List[Int] = List(1, 2, 3, 4)

4.4 flatmap

flatMap结合了map和flatten的功能。接收一个可以处理嵌套列表的函数，然后把返回结果连接起来。

scala 复制代码

scala> List(List(1,2), List(3,4)).flatMap(x => x.map(x => x *2))
res4: List[Int] = List(2, 4, 6, 8)

4.5 filter

滤除掉使函数返回false的元素

scala 复制代码

scala> def isEven(i:Int):Boolean = i %2 == 0
isEven: (i: Int)Boolean

scala> List(1,2,3,4,5,6).filter(isEven _)
res5: List[Int] = List(2, 4, 6)

4.6 zip

zip方法将两个集合结合在一起

scala 复制代码

scala> List("a","b", "c").zip(List(1,2,3))
res6: List[(String, Int)] = List((a,1), (b,2), (c,3))

4.7 zipWithIndex

将元素和下标结合在一起

scala 复制代码

scala> List(2,3,4,5).zipWithIndex
res8: List[(Int, Int)] = List((2,0), (3,1), (4,2), (5,3))

4.8 sort

函数学习

在scala集合中，可以使用以下几种方式来进行排序
- sorted默认排序
- sortBy指定字段排序
- sortWith自定义排序

scala 复制代码

//定义一个List集合
scala> val list=List(5,1,2,4,3)
list: List[Int] = List(5, 1, 2, 4, 3)

//默认就是升序
scala> list.sorted
res25: List[Int] = List(1, 2, 3, 4, 5)

scala 复制代码

//定义一个List集合
scala> val list=List("1 hadoop","2 spark","3 flink")
list: List[String] = List(1 hadoop, 2 spark, 3 flink)

//按照单词的首字母进行排序
scala> list.sortBy(x=>x.split(" ")(1))
res33: List[String] = List(3 flink, 1 hadoop, 2 spark)

scala 复制代码

scala> val list = List(2,3,1,6,4,5)
a: List[Int] = List(2, 3, 1, 6, 4, 5)

//降序
scala> list.sortWith((x,y)=>x>y)
res35: List[Int] = List(6, 5, 4, 3, 2, 1)

//升序
scala> list.sortWith((x,y)=>x<y)
res36: List[Int] = List(1, 2, 3, 4, 5, 6)

4.9 分组 - groupBy

scala 复制代码

scala> val a = List("张三"->"男", "李四"->"女", "王五"->"男")
a: List[(String, String)] = List((张三,男), (李四,女), (王五,男))

// 按照性别分组
scala> a.groupBy(_._2)
res0: scala.collection.immutable.Map[String,List[(String, String)]] = Map(男 -> List((张三,男), (王五,男)),
女 -> List((李四,女)))

// 将分组后的映射转换为性别/人数元组列表
scala> res0.map(x => x._1 -> x._2.size)
res3: scala.collection.immutable.Map[String,Int] = Map(男 -> 2, 女 -> 1)

4.10 聚合 - reduce

reduce表示将列表，传入一个函数进行聚合计算

scala 复制代码

scala> val a = List(1,2,3,4,5,6,7,8,9,10)
a: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> a.reduce((x,y) => x + y)
res5: Int = 55

// 第一个下划线表示第一个参数，就是历史的聚合数据结果
// 第二个下划线表示第二个参数，就是当前要聚合的数据元素
scala> a.reduce(_ + _)
res53: Int = 55

// 与reduce一样，从左往右计算
scala> a.reduceLeft(_ + _)
res0: Int = 55

// 从右往左聚合计算
scala> a.reduceRight(_ + _)
res1: Int = 55

4.11 折叠 - fold

fold与reduce很像，但是多了一个指定初始值参数

scala 复制代码

//定义一个List集合
scala> val a = List(1,2,3,4,5,6,7,8,9,10)
a: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

//求和
scala> a.sum
res41: Int = 55

//给定一个初始值，，折叠求和
scala> a.fold(0)(_+_)
res42: Int = 55

scala> a.fold(10)(_+_)
res43: Int = 65

//从左往右
scala> a.foldLeft(10)(_+_)
res44: Int = 65

//从右往左
scala> a.foldRight(10)(_+_)
res45: Int = 65

//fold和foldLet效果一致，表示从左往右计算
//foldRight表示从右往左计算

4.12 高阶函数

使用函数值作为参数，或者返回值为函数值的"函数"和"方法"，均称之为"高阶函数"

4.12.1 函数值作为参数

scala 复制代码

//定义一个数组
scala> val array=Array(1,2,3,4,5)
array: Array[Int] = Array(1, 2, 3, 4, 5)

//定义一个函数
scala> val func=(x:Int)=>x*10
func: Int => Int = <function1>

//函数作为参数传递到方法中
scala> array.map(func)
res0: Array[Int] = Array(10, 20, 30, 40, 50)

4.12.2 匿名函数

scala 复制代码

//定义一个数组
scala> val array=Array(1,2,3,4,5)
array: Array[Int] = Array(1, 2, 3, 4, 5)

//定义一个没有名称的函数----匿名函数
scala> array.map(x=>x*10)
res1: Array[Int] = Array(10, 20, 30, 40, 50)

5. spark程序

5.1 创建sparkSession

scala 复制代码

val spark = SparkSession
.builder()
.master("local")
.appName("test")
.config("spark.sql.execution.arrow.enabled", "true")
.getOrCreate()

spark.sql("select 1 as num").show()

master: 设置运行方式：local 代表本机单核运行，**local $4$ 代表在本机用4核跑， spark://master:7077**是以standalone方式运行；
appName: 设置spark程序的名字，可以在web UI界面看到；
config: 附加配置项；
getOrCreate：创建一个SparkSession。

5.2 读取和保存本地excel文件

xml 复制代码

<!--excel依赖-->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.13.1</version>
</dependency>

<fake_data.xlsx>

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-iJMKgw32-1689610033377)(C:\\Users\\86183\\AppData\\Roaming\\Typora\\typora-user-images\\image-20230109233201179.png)

读excel，字段取值都是string类型

scala 复制代码

val dataPath = "C:\\Users\\86183\\Desktop\\教程\\fake_data.xlsx"
//读excel
val df = spark.read.format("com.crealytics.spark.excel")
.option("header", "true").load(dataPath)

df.show(false)

println(df.dtypes.mkString("Array(", ", ", ")"))

val saveDataPath = "C:\\Users\\86183\\Desktop\\教程\\fake_data_1.xlsx"
//写excel
df.coalesce(1).write
.format("com.crealytics.spark.excel")
.option("dataAddress", "A1")
.option("header", "true")
.mode("overwrite")
.save(saveDataPath)

5.3 dataFrame注册hive表

scala 复制代码

df.createTempView("temp_table")
val sqlString =
"""
|select
|*
|from temp_table
|""".stripMargin

spark.sql(sqlString).show(false)

5.4 聚合函数groupBy用法

5.4.1 基础用法

scala 复制代码

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
 
object GroupbyExample extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
 
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._
 
  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.show()
 
  //Group By on single column
  df.groupBy("department").count().show(false)
  df.groupBy("department").avg("salary").show(false)
  df.groupBy("department").sum("salary").show(false)
  df.groupBy("department").min("salary").show(false)
  df.groupBy("department").max("salary").show(false)
  df.groupBy("department").mean("salary").show(false)
 
  //GroupBy on multiple columns
  df.groupBy("department","state")
    .sum("salary","bonus")
    .show(false)
  df.groupBy("department","state")
    .avg("salary","bonus")
    .show(false)
  df.groupBy("department","state")
    .max("salary","bonus")
    .show(false)
  df.groupBy("department","state")
    .min("salary","bonus")
    .show(false)
  df.groupBy("department","state")
    .mean("salary","bonus")
    .show(false)
 
  //Running Filter
  df.groupBy("department","state")
    .sum("salary","bonus")
    .show(false)
 
  //using agg function
  df.groupBy("department")
    .agg(
      sum("salary").as("sum_salary"),
      avg("salary").as("avg_salary"),
      sum("bonus").as("sum_bonus"),
      max("bonus").as("max_bonus"))
    .show(false)
 
  df.groupBy("department")
    .agg(
      sum("salary").as("sum_salary"),
      avg("salary").as("avg_salary"),
      sum("bonus").as("sum_bonus"),
      stddev("bonus").as("stddev_bonus"))
    .where(col("sum_bonus") > 50000)
    .show(false)
}

scala 复制代码

# 使用多个聚合函数前，先导入相关的包 "import org.apache.spark.sql.functions._"
import org.apache.spark.sql.functions._
df
.groupBy("first_class_id")
.agg(count("first_class_id").as("count_first_class_id"),sum("exposure_cnt").as("sum_exposure_cnt")).show()

选择多列

scala 复制代码

val groupByList = Array("item_id", "first_class_id")
# : _* 表示讲列名传入select
df.select(groupByList.map(col): _*).show
df.groupBy(groupByList.map(col): _*).agg(sum("exposure_cnt").as("sum_exposure_cnt")).show()

# agg方法
var map = scala.collection.immutable.Map.empty[String, String]
for (func <- aggFunc) {
    for (colName <- calculateCol) {
        map += (colName -> func)
    }
}
df.groupBy(groupByList.map(col): _*).agg(map).show(false)

Spark 指南：Spark SQL（二）------ 结构化操作

Spark中选取DataFrame多列的三种方式

SparkSQL 高级篇(一) 聚合操作

Spark：group by和聚合函数使用

【Saprk】Spark DataFrame 列的类型转换

Spark 指南：Spark SQL（三）------ 结构化类型

Explain different ways of groupBy() in spark SQL

Spark读取本地文件和HDFS文件

spark读取excel成dataframe的几种方式

scala 复制代码

package com.huawei.rcm.study

abstract class Notification

case class Email(sender: String, title: String, body: String) extends Notification

case class SMS(caller: String, message: String) extends Notification

case class VoiceRecording(contactName: String, link: String) extends Notification

object demo {
  def matchTest(x: Int): String = x match {
    case 1 => "one"
    case 2 => "two"
    case _ => "many"
  }

  def showNotification(notification: Notification): String = {
    notification match {
      case Email(email, title, _) =>
        s"You got an email from $email with title: $title"
      case SMS(number, message) =>
        s"You got an SMS from $number! Message: $message"
      case VoiceRecording(name, link) =>
        s"you received a Voice Recording from $name! Link: $link"
    }
  }

  def main(args: Array[String]): Unit = {
    val someSms = SMS("12345", "Are you there?")
    val someVoiceRecording = VoiceRecording("Tom", "voicerecording.org/id/123")

    println(showNotification(someSms)) // prints You got an SMS from 12345! Message: Are you there?
    println(showNotification(someVoiceRecording)) // you received a Voice Recording from Tom! Click the link to hear it: voicerecording.org/id/123
    println(matchTest(1))


  }
}


import org.apache.spark.sql.SparkSession
import org.json4s.jackson.{JsonMethods, Serialization}

case class User(name: String, age: Int)

object Json4STest {
  def main(args: Array[String]): Unit = {

    //数据格式：user:{"name": "zs","age": 10}
    val json = "{\"name\": \"zs\",\"age\": 10}"

    //导隐式函数
    implicit val f = org.json4s.DefaultFormats
    /** ************************* json字符串->对象 ************************************** */

    val user: User = JsonMethods.parse(json).extract[User]
    println(user)

    /** ************************* 对象->json字符串 ************************************** */
    val userJStr: String = Serialization.write(user)
    println(userJStr)
    val strings = new Array[String](3)
    val arrayTest = Array(1, 2, 3, 4) //正确

    val listTest = List(1, 2, 3, 4)

    for (i <- listTest.indices) {
      println(listTest(i))
    }


    // sort
    //    val list=List(5,1,2,4,3)
    //    list.sorted

    val list = List("1 hadoop", "2 spark", "3 flink")
    list.sortBy(x => x.split(" ")(1))
    // groupBy
    //    arrayTest.map()

    //    val a = List("张三"->"男", "李四"->"女", "王五"->"男")
    //    a.groupBy(x=>x._2)
    //    arrayTest.groupBy()
    //    arrayTest.sorted


    //    val a = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    //    a.reduce((x, y) => x + y)
    //    a.reduceLeft(_+_)

    val a = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    a.fold(0)(_ + _)

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("test")
      .config("spark.sql.execution.arrow.enabled", "true")
      .getOrCreate()

    spark.sql("select 1 as num").show()


    val dataPath = "C:\\Users\\86183\\Desktop\\教程\\fake_data.xlsx"
    //读excel
    val df = spark.read.format("com.crealytics.spark.excel")
      .option("header", "true").load(dataPath)

    df.show(false)

    println(df.dtypes.mkString("Array(", ", ", ")"))


    val saveDataPath = "C:\\Users\\86183\\Desktop\\教程\\fake_data_1.xlsx"
    //写excel
    df.coalesce(1).write
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "A1")
      .option("header", "true")
      .mode("overwrite")
      .save(saveDataPath)


    df.createTempView("temp_table")
    val sqlString =
      """
        |select
        |*
        |from temp_table
        |""".stripMargin

    spark.sql(sqlString).show(false)

    import org.apache.spark.sql.functions._
    df.groupBy("first_class_id")
      .agg(count("first_class_id").as("count_first_class_id"), sum("exposure_cnt").as("sum_exposure_cnt")).show()


    val groupByList = Array("item_id", "first_class_id")
    val aggFunc = Array("avg", "max")
    val calculateCol = Array("ctr", "exposure_cnt")



    //    val columns = df.columns
    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col


    val columns: Array[String] = df.columns
    val arrayColumn: Array[Column] = columns.map(column => col(column))
    val df2: DataFrame = df.select(arrayColumn: _*)

    df.select(groupByList.map(col): _*).show
    df.groupBy(groupByList.map(col): _*).agg(sum("exposure_cnt").as("sum_exposure_cnt")).show()

    df.groupBy(groupByList.map(col): _*).count().show()

    var map = scala.collection.immutable.Map.empty[String, String]
    for (func <- aggFunc) {
      for (colName <- calculateCol) {
        map += (colName -> func)
      }
    }
    df.groupBy(groupByList.map(col): _*).agg(map).show(false)


  }
}

6. spark ML之特征处理

spark ML之特征处理

1、VectorIndexer

根据源码注释，VectorIndexer是用于在"向量"的数据集中索引分类特征列的类（Class for indexing categorical feature columns in a dataset of Vector），这看起来不太好理解，直接看用法，举例说明就好了。

1.1 数据

我们用普通的数据格式即可：

data1.txt

复制代码

1,-1.0 1.0
0,0.0 3.0
1,-1.0 5.0
0,0.0 1.0

其中第一列为label,后面的为features

spark读取数据程序（供参考）：

scala 复制代码

import spark.implicits._
val data_path = "files/ml/featureprocessing/data1.txt"
val data = spark.read.text(data_path).map {
  case Row(line: String) =>
    var arr = line.split(',')
    (arr(0), Vectors.dense(arr(1).split(' ').map(_.toDouble)))
}.toDF("label", "features")
data.show(false)

7. Spark运行日志去除------log4j.properties

properties 复制代码

log4j.rootLogger=ERROR,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%-20c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-h5tlIh8Q-1689610033378)(C:\\Users\\86183\\AppData\\Roaming\\Typora\\typora-user-images\\image-20230115211719071.png)

spark.read.text(data_path).map {

case Row(line: String) =>

var arr = line.split(',')

(arr(0), Vectors.dense(arr(1).split(' ').map(_.toDouble)))

}.toDF("label", "features")

data.show(false)

复制代码

# 7. Spark运行日志去除------log4j.properties

```properties
log4j.rootLogger=ERROR,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%-20c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

scala学习手册

1. case class学习

2. 类转json

3. Array 和 List

4. Scala常用函数式编程

4.1 map

4.2 foreach

4.3 flatten

4.4 flatmap

4.5 filter

4.6 zip

4.7 zipWithIndex

4.8 sort

4.9 分组 - groupBy

4.10 聚合 - reduce

4.11 折叠 - fold

4.12 高阶函数

4.12.1 函数值作为参数

4.12.2 匿名函数

5. spark程序

5.1 创建sparkSession

5.2 读取和保存本地excel文件

5.3 dataFrame注册hive表

5.4 聚合函数groupBy用法

5.4.1 基础用法

6. spark ML之特征处理

1、VectorIndexer

1.1 数据

7. Spark运行日志去除------log4j.properties