Writing a Spark Program that Uses Kafka as a Data Source
Create a new file, KafkaWordProducer.scala. This program produces a stream of strings: it generates random sequences of integers, each integer is treated as a word, and the words are fed to the KafkaWordCount program for word-frequency counting. Enter the following code in KafkaWordProducer.scala:
```scala
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object KafkaWordProducer {
  def main(args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args

    // Kafka producer connection properties; the producer talks to the
    // brokers directly and does not go through ZooKeeper
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Every second, send messagesPerSec messages, each consisting of
    // wordsPerMessage random single-digit "words"
    while (true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt)
          .map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        println(str)
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
      Thread.sleep(1000)
    }
  }
}
```
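One design note on the code above: the send loop never terminates, so `producer.close()` is never called. If you adapt this program to send a finite stream, or want buffered messages flushed on Ctrl+C, it is worth closing the producer on shutdown; a minimal sketch (my addition, not part of the tutorial's code):

```scala
// Hypothetical addition: flush buffered messages and release the
// producer's resources when the JVM exits (e.g. on Ctrl+C)
sys.addShutdownHook {
  producer.close()
}
```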
Next, create a KafkaWordCount.scala source file in the same directory.
KafkaWordCount.scala performs the word-frequency counting: it counts the frequency of each word sent over by KafkaWordProducer. Its content is as follows:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sc, Seconds(10))
    // Set a checkpoint directory (stored on HDFS by default); checkpointing is
    // required because reduceByKeyAndWindow below uses an inverse reduce function
    ssc.checkpoint("/streaming/checkpoint")
    val zkQuorum = "master:2181" // ZooKeeper server address
    val group = "1"              // consumer group of the topic; any name works, e.g. val group = "test-consumer-group"
    val topics = "wordsender"    // topic name(s), comma-separated
    val numThreads = 1           // number of consumer threads per topic
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // createStream returns a DStream of (key, message) pairs; we only need the message
    val lineMap = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    val lines = lineMap.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val pair = words.map(x => (x, 1))
    // The meaning of this line is explained in the next section, on window operations
    val wordCounts = pair.reduceByKeyAndWindow(_ + _, _ - _, Minutes(2), Seconds(10), 2)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```
build.sbt:

```scala
name := "StreamKafka"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"
```
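To produce the jar used in the commands below, compile and package the two source files with sbt; a minimal sketch, assuming sbt is installed and that build.sbt sits in the same project directory as the two .scala files (here a hypothetical /root/streamkafka):

```
[root@master ~]# cd /root/streamkafka
[root@master streamkafka]# sbt package
```

sbt package writes the jar to target/scala-2.11/streamkafka_2.11-0.1.jar, which is the file name used below. Note that sbt package does not bundle dependencies: the spark-streaming-kafka-0-8 classes must already be available on your cluster's classpath; if they are not, add them at submit time, for example with --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0.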
Running the programs
First, open a new terminal and run the KafkaWordProducer program with the following command to start generating words (a stream of words in integer form):
```
[root@master ~]# spark2-submit --class KafkaWordProducer streamkafka_2.11-0.1.jar master:9092 wordsender 3 5
0 2 6 1 2
1 9 1 5 3
5 4 0 2 7
8 4 7 7 1
3 4 0 5 5
5 9 9 0 9
......
```
Note that in the command above, "master:9092 wordsender 3 5" supplies the four input arguments of KafkaWordProducer. The first argument, master:9092, is the address of the Kafka broker; the second, wordsender, is the topic name. Since the topic name is hard-coded in KafkaWordCount.scala, the KafkaWordCount program can only consume the topic named "wordsender". The third argument, 3, means three messages are sent per second, and the fourth argument, 5, means each message contains five words (in fact, five integers).
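If you would rather not hard-code the topic name, a minimal sketch of a variant (hypothetical, not part of the tutorial's code) is to read it from the command-line arguments in KafkaWordCount.scala:

```scala
// Hypothetical variant: take the topic name from the command line,
// falling back to "wordsender" when no argument is given
val topics = if (args.nonEmpty) args(0) else "wordsender"
```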
Leave this terminal window open and do not close it; let it keep sending words continuously.
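Before starting the Spark consumer, you can optionally verify that messages are actually reaching Kafka, independently of Spark, using Kafka's console consumer (a sketch assuming Kafka 0.8's bin directory is on your PATH):

```
[root@master ~]# kafka-console-consumer.sh --zookeeper master:2181 --topic wordsender
```

This should print lines of integers as they arrive; press Ctrl+C to stop it before continuing.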
Then open another new terminal and run the KafkaWordCount program with the following command to perform the word-frequency count:
```
[root@master ~]# spark2-submit --class KafkaWordCount streamkafka_2.11-0.1.jar
-------------------------------------------
Time: 1505456410000 ms
-------------------------------------------
(4,696)
(8,730)
(6,665)
(0,692)
(2,693)
(7,691)
(5,748)
(9,731)
(3,717)
(1,732)
-------------------------------------------
Time: 1505456420000 ms
-------------------------------------------
(4,703)
(8,745)
(6,687)
(0,711)
(2,704)
(7,699)
(5,761)
(9,752)
(3,730)
(1,753)
......
```
Congratulations, you have successfully integrated Spark Streaming with Kafka.
Note: the Kafka version used in this tutorial is 0.8.
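A related pointer: the spark-streaming-kafka-0-8 integration used here targets Kafka 0.8's APIs. Spark 2.1 also ships a separate spark-streaming-kafka-0-10 integration with a different, direct API (no receivers, no ZooKeeper); migrating to it would change both the code and the dependency, which would become:

```scala
// Hypothetical dependency change if migrating to the Kafka 0.10 integration
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
```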