DataStream Programming Model: Sources, Transformations, and Data Output

Flink DataStream sources, transformations, and sinks (Scala)

0. Preface: Data Sources

Before any transformation can be applied, the data must first be read in.

Data sources fall into four broad categories:

(1) Built-in sources, which come in three kinds: file sources, socket sources, and collection sources, sketched below;
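A minimal, self-contained sketch of all three built-in kinds (the file path, host, and port are placeholder assumptions):

scala

import org.apache.flink.streaming.api.scala._

object BuiltinSourcesTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // File source: read a text file line by line (placeholder path)
    val fileStream = env.readTextFile("file:///home/hadoop/input.txt")

    // Socket source: read lines from a TCP socket, e.g. one fed by `nc -lk 9999`
    val socketStream = env.socketTextStream("localhost", 9999)

    // Collection source: turn a local Scala collection into a stream
    val collectionStream = env.fromCollection(List("hadoop", "spark", "flink"))

    collectionStream.print()
    env.execute("BuiltinSourcesTest")
  }
}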

(2) Kafka sources

A FlinkKafkaConsumer is constructed from three arguments: the topic to read, a deserialization schema, and a Properties object, as the full example below shows. The second argument, a SimpleStringSchema instance, is a built-in DeserializationSchema that deserializes the raw bytes of each Kafka message into a String.
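To see what such a schema involves, here is a sketch of a hand-rolled equivalent of SimpleStringSchema (the class name MyStringSchema is hypothetical, not part of Flink):

scala

import java.nio.charset.StandardCharsets
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

// Hypothetical re-implementation of what SimpleStringSchema does: bytes -> String
class MyStringSchema extends DeserializationSchema[String] {
  // Called once per Kafka record
  override def deserialize(message: Array[Byte]): String =
    new String(message, StandardCharsets.UTF_8)
  // A Kafka topic is unbounded, so never signal end-of-stream
  override def isEndOfStream(nextElement: String): Boolean = false
  // Tell Flink which type this schema produces
  override def getProducedType: TypeInformation[String] = createTypeInformation[String]
}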

In addition, when a FlinkKafkaConsumer starts reading messages from Kafka, its start position can be configured; there are four options.
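These are the four start-position methods exposed on the consumer instance (the timestamp argument below is just an illustrative value):

scala

// Default: resume from the offsets last committed by this consumer group
kafkaSource.setStartFromGroupOffsets()
// Read every partition from its earliest available offset
kafkaSource.setStartFromEarliest()
// Read only records that arrive after the job starts
kafkaSource.setStartFromLatest()
// Start from the first record whose timestamp is at or after the given epoch milliseconds
kafkaSource.setStartFromTimestamp(1602031562148L)

The full word-count example below uses setStartFromLatest():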

scala

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.windowing.time.Time

object KafkaWordCount {
  def main(args: Array[String]): Unit = {

    val kafkaProps = new Properties()
    // Kafka connection properties
    kafkaProps.setProperty("bootstrap.servers", "localhost:9092")
    // The consumer group this consumer belongs to
    kafkaProps.setProperty("group.id", "group1")

    // Get the current execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Create the Kafka consumer; "wordsendertest" is the topic to consume
    val kafkaSource = new FlinkKafkaConsumer[String]("wordsendertest", new SimpleStringSchema, kafkaProps)
    // Start consuming from the latest offset
    kafkaSource.setStartFromLatest()
    // Commit offsets back to Kafka when checkpoints complete
    kafkaSource.setCommitOffsetsOnCheckpoints(true)
    // Bind the source to the environment
    val stream = env.addSource(kafkaSource)

    // Define the transformation logic
    val text = stream.flatMap { _.toLowerCase.split("\\W+").filter(_.nonEmpty) }
      .map { (_, 1) }
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)

    // Print the result
    text.print()

    // Trigger program execution
    env.execute("Kafka Word Count")
  }
}

(3) HDFS sources
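HDFS files can be read with the same readTextFile call used for local files, just with an hdfs:// URI. A one-line sketch (assuming the Hadoop filesystem dependency is on the classpath and a NameNode at localhost:9000):

scala

// Placeholder HDFS path; env is the StreamExecutionEnvironment from the examples above
val hdfsStream = env.readTextFile("hdfs://localhost:9000/input.txt")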

(4) Custom sources

An example:

scala

import java.util.Calendar
import org.apache.flink.streaming.api.functions.source.RichSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.scala._
import scala.util.Random

case class StockPrice(stockId: String, timeStamp: Long, price: Double)

object StockPriceStreaming {
  def main(args: Array[String]): Unit = {
    // Set up the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Stock price stream, randomly generated by the StockPriceSource class
    val stockPriceStream: DataStream[StockPrice] = env
      .addSource(new StockPriceSource)

    // Print the result
    stockPriceStream.print()

    // Trigger program execution
    env.execute("stock price streaming")
  }

  class StockPriceSource extends RichSourceFunction[StockPrice] {
    var isRunning: Boolean = true
    val rand = new Random()
    // Initial stock prices
    var priceList: List[Double] = List(10.0d, 20.0d, 30.0d, 40.0d, 50.0d)
    var stockId = 0

    override def run(srcCtx: SourceContext[StockPrice]): Unit = {
      while (isRunning) {
        // Pick one stock at random on each iteration
        stockId = rand.nextInt(priceList.size)
        // Perturb its price with a little Gaussian noise
        val curPrice = priceList(stockId) + rand.nextGaussian() * 0.05
        priceList = priceList.updated(stockId, curPrice)
        val curTime = Calendar.getInstance.getTimeInMillis
        // Emit the record through the SourceContext
        srcCtx.collect(StockPrice("stock_" + stockId.toString, curTime, curPrice))
        Thread.sleep(rand.nextInt(10))
      }
    }

    override def cancel(): Unit = {
      isRunning = false
    }
  }
}

1. Transformations: the map operation

1. The four kinds of transformation operators

Operating on single records: filter, map

Operating on windows: window

Merging multiple streams: union, join, connect

Splitting a stream: split

2. map(func) passes each element of a DataStream through the function func and returns the results as a new DataStream. The output stream type DataStream[OUT] may differ from the input stream type DataStream[IN].

Intuition: a strict one-to-one correspondence, one x in, one y out.

scala
val dataStream = env.fromElements(1,2,3,4,5)
val mapStream = dataStream.map(x=>x+10)

3. Demo code

scala

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

object MapFunctionTest {
  def main(args: Array[String]): Unit = {

    // Set up the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val dataStream: DataStream[Int] = env.fromElements(1, 2, 3, 4, 5, 6, 7)

    // Define the transformation logic
    val richFunctionDataStream = dataStream.map(new MyMapFunction())

    // Print the result
    richFunctionDataStream.print()

    // Trigger program execution
    env.execute("MapFunctionTest")
  }

  // Custom function extending RichMapFunction
  class MyMapFunction extends RichMapFunction[Int, String] {
    override def map(input: Int): String =
      "Input : " + input.toString + ", Output : " + (input * 3).toString
  }
}

2. Transformations: the flatMap operation

1. flatMap is similar to map, except that each input element can be mapped to zero or more output elements.

scala
val dataStream = env.fromElements("Hadoop is good","Flink is fast","Flink is better")
val flatMapStream = dataStream.flatMap(line => line.split(" "))

flatMap can be understood as map followed by a flatten step: map turns each input into a collection (here, the array of words in a line), and flatten then breaks those collections apart so that every element becomes its own record. One input element can therefore yield many outputs.
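A quick sketch of the difference on the dataStream defined above (result types spelled out for emphasis):

scala

// map keeps one output per input: each line becomes a single Array[String]
val arrays: DataStream[Array[String]] = dataStream.map(line => line.split(" "))
// flatMap flattens those arrays: every word becomes its own element
val words: DataStream[String] = dataStream.flatMap(line => line.split(" "))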

2. Demo code

scala

import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object FlatMapFunctionTest {
  def main(args: Array[String]): Unit = {

    // Set up the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val dataStream: DataStream[String] =
      env.fromElements("Hello Spark", "Flink is excellent")

    // Define the transformation logic
    val result = dataStream.flatMap(new WordSplitFlatMap(15))

    // Print the result
    result.print()

    // Trigger program execution
    env.execute("FlatMapFunctionTest")
  }

  // A FlatMapFunction with a filter built in: only strings longer than threshold are split into words
  class WordSplitFlatMap(threshold: Int) extends FlatMapFunction[String, String] {
    override def flatMap(value: String, out: Collector[String]): Unit = {
      if (value.size > threshold) {
        value.split(" ").foreach(out.collect)
      }
    }
  }
}

Expected output:

Flink
is
excellent

Only strings longer than the threshold of 15 characters are split: "Hello Spark" is 11 characters long and produces no output, while "Flink is excellent" (18 characters) is split into its words.

3. Transformations: the filter and keyBy operations

1. filter(func) keeps exactly those elements for which the function func returns true, and returns them as a new stream.

2. Example code

scala
val dataStream = env.fromElements("Hadoop is good","Flink is fast","Flink is better")
val filterStream = dataStream.filter(line => line.contains("Flink"))


3. keyBy (note the lowercase k and uppercase B): places records with the same key in the same partition.

The keyBy operator groups the data by key: elements with the same key end up together, where downstream operators can process each group uniformly.

For example, take a word count over these input lines:

				hello flink
				hello hadoop
				hello zhangsan

After mapping, the three occurrences of hello become the pairs (hello,1), (hello,1), (hello,1); keyBy then sends them all to the same partition, where they can be aggregated in one place.

Applying an aggregation function afterwards turns the KeyedStream back into a DataStream, as in the sketch below.
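A minimal word-count sketch over those three lines (keyBy(0) keys on the word field of the tuple):

scala

val lines = env.fromElements("hello flink", "hello hadoop", "hello zhangsan")
val counts = lines
  .flatMap(_.split(" ")) // split each line into words
  .map((_, 1))           // word -> (word, 1)
  .keyBy(0)              // group by the word (tuple field 0)
  .sum(1)                // running count per word
counts.print()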

4. The keyBy operator takes an argument that tells it what the key is; a numeric field position can be used to specify the key.

In the word count just shown, keyBy(0) keys on the first tuple field, i.e. the word itself (such as hello).

scala
val dataStream: DataStream[(Int, Double)] =
    env.fromElements((1, 2.0), (2, 1.7), (1, 4.9), (3, 8.5), (3, 11.2))
//Use a numeric field position to define the key: group by the first field
val keyedStream = dataStream.keyBy(0)

Here keyBy partitions the records by the first field, i.e. into the groups with key 1, 2, or 3.

5. A fuller keyBy example:

scala

import org.apache.flink.streaming.api.scala._

// A case class with three fields: stock ID, trade time, trade price
case class StockPrice(stockId: String, timeStamp: Long, price: Double)

object KeyByTest {
  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val stockList = List(
      StockPrice("stock_4", 1602031562148L, 43.4D),
      StockPrice("stock_1", 1602031562148L, 22.9D),
      StockPrice("stock_0", 1602031562153L, 8.2D),
      StockPrice("stock_3", 1602031562153L, 42.1D),
      StockPrice("stock_2", 1602031562153L, 29.2D),
      StockPrice("stock_0", 1602031562159L, 8.1D),
      StockPrice("stock_4", 1602031562159L, 43.7D),
      StockPrice("stock_4", 1602031562169L, 43.5D)
    )
    val dataStream = env.fromCollection(stockList)

    // Define the transformation logic: partition by the stockId field
    val keyedStream = dataStream.keyBy("stockId")

    // Print the result
    keyedStream.print()

    // Trigger program execution
    env.execute("KeyByTest")
  }
}

Nothing seems to change here: keyBy by itself only repartitions the data, and since no aggregation follows, the records are printed out unchanged.

Adding an aggregation function makes the effect visible:

scala

// The code above, shortened, with an aggregation function added
    val keyedStream = dataStream.keyBy("stockId")
    val aggre = keyedStream.sum(2) // sums the price (the third field)

    // keyedStream.print()
    aggre.print() // print after aggregating

Result

What changed compared with the run above?

The stockId order is 4-1-0-3-2-0-4-4: when stock_0 appears the second time, its price is added to the earlier stock_0 record, 8.2 + 8.1 = 16.3 (printed as 16.299... due to floating-point rounding), and the later stock_4 records likewise keep accumulating the earlier stock_4 prices.

4. Transformations: reduce and aggregation operations

1. reduce: the reduce operator applies a user-defined function to the input KeyedStream, aggregating the data in a rolling fashion, and produces a new DataStream, as in the following example:

scala

import org.apache.flink.streaming.api.scala._

// A case class with three fields: stock ID, trade time, trade price
case class StockPrice(stockId: String, timeStamp: Long, price: Double)

object ReduceTest {
  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val stockList = List(
      StockPrice("stock_4", 1602031562148L, 43.4D),
      StockPrice("stock_1", 1602031562148L, 22.9D),
      StockPrice("stock_0", 1602031562153L, 8.2D),
      StockPrice("stock_3", 1602031562153L, 42.1D),
      StockPrice("stock_2", 1602031562153L, 29.2D),
      StockPrice("stock_0", 1602031562159L, 8.1D),
      StockPrice("stock_4", 1602031562159L, 43.7D),
      StockPrice("stock_4", 1602031562169L, 43.5D)
    )
    val dataStream = env.fromCollection(stockList)

    // Define the transformation logic: per stock, keep the first timestamp and accumulate the price
    val keyedStream = dataStream.keyBy("stockId")
    val reduceStream = keyedStream
      .reduce((t1, t2) => StockPrice(t1.stockId, t1.timeStamp, t1.price + t2.price))

    // Print the result
    reduceStream.print()

    // Trigger program execution
    env.execute("ReduceTest")
  }
}

The reduce result is the same as the sum example above: the prices are accumulated per key.

2. Flink also lets you supply the reduce logic as a user-defined ReduceFunction class:

scala

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._

// A case class with three fields: stock ID, trade time, trade price
case class StockPrice(stockId: String, timeStamp: Long, price: Double)

object MyReduceFunctionTest {
  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val stockList = List(
      StockPrice("stock_4", 1602031562148L, 43.4D),
      StockPrice("stock_1", 1602031562148L, 22.9D),
      StockPrice("stock_0", 1602031562153L, 8.2D),
      StockPrice("stock_3", 1602031562153L, 42.1D),
      StockPrice("stock_2", 1602031562153L, 29.2D),
      StockPrice("stock_0", 1602031562159L, 8.1D),
      StockPrice("stock_4", 1602031562159L, 43.7D),
      StockPrice("stock_4", 1602031562169L, 43.5D)
    )
    val dataStream = env.fromCollection(stockList)

    // Define the transformation logic
    val keyedStream = dataStream.keyBy("stockId")
    val reduceStream = keyedStream.reduce(new MyReduceFunction)

    // Print the result
    reduceStream.print()

    // Trigger program execution
    env.execute("MyReduceFunctionTest")
  }

  class MyReduceFunction extends ReduceFunction[StockPrice] {
    override def reduce(t1: StockPrice, t2: StockPrice): StockPrice = {
      StockPrice(t1.stockId, t1.timeStamp, t1.price + t2.price)
    }
  }
}

The only real difference is that the reduce logic now lives in its own MyReduceFunction class.

3. Aggregation operators

KeyedStream comes with built-in aggregation operators such as sum, min, max, minBy, and maxBy, which behave much like the aggregate functions in a spreadsheet.
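On a keyedStream like the one built in the example below, the variants look like this (a sketch; field position 2 refers to price in the StockPrice layout):

scala

val sumStream   = keyedStream.sum(2)   // running sum of price per stockId
val minStream   = keyedStream.min(2)   // running minimum price per stockId
val maxByStream = keyedStream.maxBy(2) // the whole record holding the maximum price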

A full example using sum:

scala

import org.apache.flink.streaming.api.scala._

// A case class with three fields: stock ID, trade time, trade price
case class StockPrice(stockId: String, timeStamp: Long, price: Double)

object AggregationTest {
  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Set the program parallelism
    env.setParallelism(1)

    // Create the data source
    val stockList = List(
      StockPrice("stock_4", 1602031562148L, 43.4D),
      StockPrice("stock_1", 1602031562148L, 22.9D),
      StockPrice("stock_0", 1602031562153L, 8.2D),
      StockPrice("stock_3", 1602031562153L, 42.1D),
      StockPrice("stock_2", 1602031562153L, 29.2D),
      StockPrice("stock_0", 1602031562159L, 8.1D),
      StockPrice("stock_4", 1602031562159L, 43.7D),
      StockPrice("stock_4", 1602031562169L, 43.5D)
    )
    val dataStream = env.fromCollection(stockList)

    // Define the transformation logic
    val keyedStream = dataStream.keyBy("stockId")
    val aggregationStream = keyedStream.sum(2) // the difference is here: sum over field 2, the price

    // Print the result
    aggregationStream.print()

    // Trigger program execution
    env.execute("AggregationTest")
  }
}

The run produces the same running per-key price sums as the reduce examples above.

5. Data Output

1. The basic output options are file output, client (console) output, and socket output.

Concretely:

scala

val dataStream = env.fromElements("hadoop","spark","flink")

// Write the data to a local text file
dataStream.writeAsText("file:///home/hadoop/output.txt")

// Write the data to HDFS
dataStream.writeAsText("hdfs://localhost:9000/output.txt")

// Write the DataStream to a socket via writeToSocket
val outputHost = "localhost" // placeholder target host
val outputPort = 9999        // placeholder target port
dataStream.writeToSocket(outputHost, outputPort, new SimpleStringSchema())

2. Writing to Kafka

Example:

scala

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer

object SinkKafkaTest {
  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Load or create the data source
    val dataStream = env.fromElements("hadoop","spark","flink")

    // Write the data to the Kafka topic "sinkKafka"
    dataStream.addSink(new FlinkKafkaProducer[String]("localhost:9092", "sinkKafka", new SimpleStringSchema()))

    // Trigger program execution
    env.execute()
  }
}