spark 如何生成更多的task spark的update-摩杜云开发者社区

在Spark Streaming中，DStream的转换分为有状态和无状态两种。无状态的操作，即当前批次的处理不依赖于先前批次的数据，如map()、flatMap()、filter()、reduceByKey()、groupByKey()等等;而有状态的操作，即当前批次的处理需要依赖先前批次的数据，这样的话，就需要跨批次维护状态。

总结spark streaming中的状态操作:updateStateByKey、mapWithState、和基于window的状态操作。

updateStateByKey

//状态更新函数
 val updateFunc=(currentValues:Seq[Int], previousState:Option[Int])=>{
   //该批次结果
   val currentCount = currentValues.sum
   //先前状态
   val previousCount= previousState.getOrElse(0)
   //如何更新
   Some(currentCount+previousCount)
 }
 
 //updateStateByKey
 val result=kafkaDirectStream
   .map(_.value())
   //json string 转换成 case class
   .map(parse(_).extractOrElse[TwitterFeed](null))
   .filter(_!=null)
   .map(item=>(item.language,1))
   .updateStateByKey(updateFunc)

注意:

需要开启checkpoint,并设置checkpoint目录。用于存放元数据(任务元数据、kafka offset等)、数据(如Key的State)。
应通过StreamingContext.getOrCreate(checkpointPath、creatingFunc)来获取StreamingContext，业务逻辑都应封装在creatingFunc函数中。

mapWithState

mapWithState从spark 1.6.0开始出现，可以看做是updateStateByKey的升级版,有一些updateStateByKey所没有的特征:

支持输出只发生更新的状态和全量状态

mapWithState默认每个批次只会返回当前批次中有新数据的Key的状态，也可返回全量状态。updateStateByKey每个批次都会返回所有状态。

内置状态超时管理

内置状态超时管理，可对超时的Key单独处理。也可实现如会话超时(Session Timeout)的功能。在updateStateByKey中,如果要实现类似功能，需要大量编码。

初始化状态

可以选择自定义的RDD来初始化状态。

可以返回任何我们期望的类型

由mapWithState函数可知，可以返回任何我们期望的类型，而updateStateByKey做不到这一点。

性能更高

实现上，mapWithState只是增量更新，updateStateByKey每个批次都会对历史全量状态和当前增量数据进行cogroup合并，状态较大时，性能较低。

//状态更新函数
 val mappingFunction = (currentKey: String, currentValue: Option[Int], previousState: State[Int]) => {
   //超时时处理
   if(previousState.isTimingOut()){
     println("Key:"+currentKey+" is timing out!")
     None
   }else{
   //没有超时时处理
     val currentCount =currentValue.getOrElse(0)+previousState.getOption().getOrElse(0)
     previousState.update(currentCount)
     (currentKey,currentCount)
   }
 }

 //mapWithState
 val result=kafkaDirectStream
   .map(_.value())
   //json string 转换成 case class
   .map(parse(_).extractOrElse[TwitterFeed](null))
   .filter(_!=null)
   .map(item=>(item.language,1))
   .reduceByKey(_+_)
   .mapWithState(
     StateSpec
       //更新函数
       .function(mappingFunction)
       //初始状态
       .initialState(initialStateRDD)
       //超时时间
       .timeout(Seconds(10))
   )

 //输出全量状态
 //result.stateSnapshots().foreachRDD(_.foreach(println))

 //只输出有更新的状态
 result.foreachRDD(_.foreach(println))

注意:

需要开启checkpoint,并设置checkpoint目录。用于存放元数据(任务元数据、kafka offset等)、数据(如Key的State)。
应通过StreamingContext.getOrCreate(checkpointPath、creatingFunc)来获取StreamingContext，业务逻辑都应封装在creatingFunc函数中。

基于窗口的状态转换

reduceByKeyAndWindow

基于滑动窗口，对(K,V)类型的DStream按Key进行Reduce操作，返回新的DStream。

普通版的reduceByKeyAndWindow

def reduceByKeyAndWindow(reduceFunc: (V, V) => V,windowDuration: Duration,slideDuration: Duration): DStream[(K, V)]

增量版的reduceByKeyAndWindow

def reduceByKeyAndWindow(reduceFunc: (V, V) => V,invReduceFunc: (V, V) => V,windowDuration: Duration,slideDuration: Duration = self.slideDuration,numPartitions: Int = ssc.sc.defaultParallelism,filterFunc: ((K, V)) => Boolean = null): DStream[(K, V)]

对于普通版的reduceByKeyAndWindow，每个滑动间隔都对窗口内的数据做聚合，性能低效。
对于一个滑动窗口，每一个新窗口实际上是在之前窗口中减去一些数据，再加上一些新的数据，从而构成新的窗口。增量版的reduceByKeyAndWindow便采用这种逻辑，通过加新减旧实现增量窗口聚合。reduceFunc,即正向的reduce 函数(如_+_)；invReduceFunc，即反向的reduce 函数(如 _-_)。

注意:加新减旧版的reduceByKeyAndWindow需要checkpoint支持。

val result: DStream[(String, Int)] =kafkaDirectStream
    .map(_.value())
    //json string 转换成 case class
    .map(parse(_).extractOrElse[TwitterFeed](null))
    .filter(_!=null)
    .map(item=>(item.language,1))
    .reduceByKeyAndWindow(
      (v1:Int,v2:Int)=>{v1+v2},//加新
      (v1:Int,v2:Int)=>{v1-v2},//减旧
      Seconds(60),//窗口间隔:必须是batchDuration整数倍
      Seconds(10)//滑动间隔:必须是batchDuration整数倍
    )

reduceByWindow

基于滑动窗口，对非(K,V)类型的DStream进行Reduce操作，返回新的DStream。

普通版的reduceByWindow

def reduceByWindow(
      reduceFunc: (T, T) => T,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[T] = ssc.withScope {
    this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
  }

增量版的reduceByWindow

def reduceByWindow(
      reduceFunc: (T, T) => T,
      invReduceFunc: (T, T) => T,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[T] = ssc.withScope {
      this.map((1, _))
          .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
          .map(_._2)
  }

实现上调用了加新减旧版的reduceByKeyAndWindow。需要checkpoint支持。

countByValueAndWindow

统计窗口内每类元素个数。

def countByValueAndWindow(
      windowDuration: Duration,
      slideDuration: Duration,
      numPartitions: Int = ssc.sc.defaultParallelism)
      (implicit ord: Ordering[T] = null)
      : DStream[(T, Long)] = ssc.withScope {
    this.map((_, 1L)).reduceByKeyAndWindow(
      (x: Long, y: Long) => x + y,
      (x: Long, y: Long) => x - y,
      windowDuration,
      slideDuration,
      numPartitions,
      (x: (T, Long)) => x._2 != 0L
    )

countByValueAndWindow先将窗口内的每个元素转换成(_,1L),然后调用加新减旧版的reduceByKeyAndWindow进行计算。需要checkpoint支持。

countByWindow

统计窗口内所有元素个数。

def countByWindow(
      windowDuration: Duration,
      slideDuration: Duration): DStream[Long] = ssc.withScope {
    this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
  }

countByWindow先将窗口内的每个元素转换成1L，然后调用加新减旧版的reduceByWindow进行计算。需要checkpoint支持。