pyspark.streaming.DStream¶
class pyspark.streaming.DStream(jdstream, ssc, jrdd_deserializer)[source]¶
A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data (see RDD in the Spark core documentation for more details on RDDs).

DStreams can either be created from live data (such as data from TCP sockets) using a StreamingContext, or they can be generated by transforming existing DStreams using operations such as map, window and reduceByKeyAndWindow. While a Spark Streaming program is running, each DStream periodically generates an RDD, either from live data or by transforming the RDD generated by a parent DStream.

Internally, a DStream is characterized by a few basic properties:
- A list of other DStreams that the DStream depends on 
- A time interval at which the DStream generates an RDD 
- A function that is used to generate an RDD after each time interval 
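The creation-and-transformation flow described above can be sketched as follows. The per-record functions are plain Python; the driver function assumes a local Spark installation, and the hostname, port, and batch interval are illustrative choices, not requirements:

```python
# Sketch: build a DStream from live data and derive new DStreams from it.
# The record-level functions are defined separately so the transformation
# logic is visible on its own; run_driver() is not executed here and
# assumes pyspark is installed and a text source is listening on
# localhost:9999 (an illustrative address).

def tokenize(line):
    """Split one incoming text line into words."""
    return line.split()

def to_pair(word):
    """Map each word to a (word, 1) pair for counting."""
    return (word, 1)

def run_driver():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "DStreamSketch")
    ssc = StreamingContext(sc, 1)                     # 1-second batch interval

    lines = ssc.socketTextStream("localhost", 9999)   # DStream from live data
    pairs = lines.flatMap(tokenize).map(to_pair)      # DStreams from transformations
    pairs.pprint()

    ssc.start()
    ssc.awaitTermination()
```

Each batch interval, `lines` produces one RDD of raw lines, and `pairs` produces the RDD obtained by transforming it, matching the parent/child relationship the properties above describe.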
 
Methods

- cache(): Persist the RDDs of this DStream with the default storage level (MEMORY_ONLY).
- checkpoint(interval): Enable periodic checkpointing of RDDs of this DStream.
- cogroup(other[, numPartitions]): Return a new DStream by applying 'cogroup' between RDDs of this DStream and other DStream.
- combineByKey(createCombiner, mergeValue, …): Return a new DStream by applying combineByKey to each RDD.
- context(): Return the StreamingContext associated with this DStream.
- count(): Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
- countByValue(): Return a new DStream in which each RDD contains the counts of each distinct value in each RDD of this DStream.
- countByValueAndWindow(windowDuration, …[, …]): Return a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream.
- countByWindow(windowDuration, slideDuration): Return a new DStream in which each RDD has a single element generated by counting the number of elements in a window over this DStream.
- filter(f): Return a new DStream containing only the elements that satisfy the predicate.
- flatMap(f[, preservesPartitioning]): Return a new DStream by applying a function to all elements of this DStream, and then flattening the results.
- flatMapValues(f): Return a new DStream by applying a flatmap function to the value of each key-value pair in this DStream without changing the key.
- foreachRDD(func): Apply a function to each RDD in this DStream.
- fullOuterJoin(other[, numPartitions]): Return a new DStream by applying 'full outer join' between RDDs of this DStream and other DStream.
- glom(): Return a new DStream in which each RDD is generated by applying glom() to each RDD of this DStream.
- groupByKey([numPartitions]): Return a new DStream by applying groupByKey on each RDD.
- groupByKeyAndWindow(windowDuration, …[, …]): Return a new DStream by applying groupByKey over a sliding window.
- join(other[, numPartitions]): Return a new DStream by applying 'join' between RDDs of this DStream and other DStream.
- leftOuterJoin(other[, numPartitions]): Return a new DStream by applying 'left outer join' between RDDs of this DStream and other DStream.
- map(f[, preservesPartitioning]): Return a new DStream by applying a function to each element of this DStream.
- mapPartitions(f[, preservesPartitioning]): Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDD of this DStream.
- mapPartitionsWithIndex(f[, …]): Return a new DStream in which each RDD is generated by applying mapPartitionsWithIndex() to each RDD of this DStream.
- mapValues(f): Return a new DStream by applying a map function to the value of each key-value pair in this DStream without changing the key.
- partitionBy(numPartitions[, partitionFunc]): Return a copy of the DStream in which each RDD is partitioned using the specified partitioner.
- persist(storageLevel): Persist the RDDs of this DStream with the given storage level.
- pprint([num]): Print the first num elements of each RDD generated in this DStream.
- reduce(func): Return a new DStream in which each RDD has a single element generated by reducing each RDD of this DStream.
- reduceByKey(func[, numPartitions]): Return a new DStream by applying reduceByKey to each RDD.
- reduceByKeyAndWindow(func, invFunc, …[, …]): Return a new DStream by applying incremental reduceByKey over a sliding window.
- reduceByWindow(reduceFunc, invReduceFunc, …): Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream.
- repartition(numPartitions): Return a new DStream with an increased or decreased level of parallelism.
- rightOuterJoin(other[, numPartitions]): Return a new DStream by applying 'right outer join' between RDDs of this DStream and other DStream.
- saveAsTextFiles(prefix[, suffix]): Save each RDD in this DStream as a text file, using the string representation of elements.
- slice(begin, end): Return all the RDDs between 'begin' and 'end' (both included).
- transform(func): Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream.
- transformWith(func, other[, keepSerializer]): Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream.
- union(other): Return a new DStream by unifying data of another DStream with this DStream.
- updateStateByKey(updateFunc[, …]): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key.
- window(windowDuration[, slideDuration]): Return a new DStream in which each RDD contains all the elements seen in a sliding window of time over this DStream.
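The incremental form of reduceByKeyAndWindow takes both a reduce function and its inverse, so that as the window slides, the contribution of batches leaving the window can be subtracted instead of re-reducing the entire window. A minimal sketch, assuming a DStream of (key, count) pairs; the 30-second window and 10-second slide are illustrative values (both must be multiples of the batch interval):

```python
# Sketch: incremental windowed counting with reduceByKeyAndWindow.
# add is the forward reduce function; subtract is its inverse, which
# lets Spark remove values sliding out of the window incrementally.

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def windowed_counts(pairs_dstream):
    # pairs_dstream: a DStream of (key, count) pairs (not constructed
    # here; see socketTextStream/flatMap/map for how one is built).
    # Illustrative durations: 30-second window, recomputed every 10 s.
    return pairs_dstream.reduceByKeyAndWindow(add, subtract,
                                              windowDuration=30,
                                              slideDuration=10)
```

Using the inverse function keeps the per-slide work proportional to the batches entering and leaving the window rather than to the whole window, at the cost of requiring checkpointing.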
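updateStateByKey maintains arbitrary per-key state across batches: the update function receives all new values for a key in the current batch together with that key's previous state (None the first time the key appears). A sketch of a per-key running count, where the counting logic is an illustrative choice:

```python
# Sketch: per-key running counts with updateStateByKey.
# update_count receives the batch's new values for one key and the key's
# previous state; last_state is None on the key's first appearance.

def update_count(new_values, last_state):
    return sum(new_values) + (last_state or 0)

def running_counts(pairs_dstream, ssc, checkpoint_dir):
    # Stateful operations require checkpointing to be enabled on the
    # StreamingContext; checkpoint_dir is caller-supplied.
    ssc.checkpoint(checkpoint_dir)
    return pairs_dstream.updateStateByKey(update_count)
```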