The underlying JVM object is a SchemaRDD, not a PythonRDD, so we can
utilize the relational query API exposed by Spark SQL.

This class receives raw tuples from Java but assigns a class to them in
all of its data-collection methods (mapPartitionsWithIndex, collect, take,
etc.) so that PySpark sees them as Row objects with named fields.
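The Row-wrapping behavior described above can be sketched in plain Python with a namedtuple (illustrative only; the `Row` stand-in and the field names `name` and `age` are hypothetical, not PySpark internals):

```python
from collections import namedtuple

# Hypothetical schema: PySpark derives the field names from the SchemaRDD's schema.
Row = namedtuple("Row", ["name", "age"])

# Raw tuples as they would arrive from the JVM side.
raw_tuples = [("Alice", 1), ("Bob", 2)]

# Each data-collection method wraps the raw tuples so callers see named fields.
rows = [Row(*t) for t in raw_tuples]
```

Because the namedtuple is still a tuple, positional access keeps working while `rows[0].name` becomes available by field name.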
__init__(self, jschema_rdd, sql_ctx)
    x.__init__(...) initializes x; see help(type(x)) for signature
insertInto(self, tableName, overwrite=False)
    Inserts the contents of this SchemaRDD into the specified table.
saveAsTable(self, tableName)
    Creates a new table with the contents of this SchemaRDD.
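The difference between saveAsTable and insertInto can be sketched with a plain-Python stand-in for a table catalog (illustrative only; `catalog`, `save_as_table`, and `insert_into` are hypothetical helpers, not PySpark API):

```python
# A toy catalog mapping table names to lists of rows.
catalog = {}

def save_as_table(rows, table_name):
    # saveAsTable: creates a new table holding the given contents.
    catalog[table_name] = list(rows)

def insert_into(rows, table_name, overwrite=False):
    # insertInto: appends to an existing table, or replaces its
    # contents when overwrite=True.
    if overwrite:
        catalog[table_name] = list(rows)
    else:
        catalog[table_name].extend(rows)

save_as_table([("Alice", 1)], "people")
insert_into([("Bob", 2)], "people")                     # append
insert_into([("Carol", 3)], "people", overwrite=True)   # replace
```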
schemaString(self)
    Returns the output schema in the tree format.
printSchema(self)
    Prints out the schema in the tree format.
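A sketch of what the tree format looks like, using a plain-Python printer over a made-up schema (the `root` / `|--` layout mirrors Spark's tree output; the fields and the `schema_string` helper are hypothetical):

```python
# Hypothetical schema as (field_name, type, nullable) triples.
schema = [("name", "string", True), ("age", "integer", True)]

def schema_string(fields):
    # Mimics the tree format: a "root" line, then one "|--" line per field.
    lines = ["root"]
    for name, dtype, nullable in fields:
        lines.append(" |-- %s: %s (nullable = %s)"
                     % (name, dtype, str(nullable).lower()))
    return "\n".join(lines)

print(schema_string(schema))
```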
mapPartitionsWithIndex(self, f, preservesPartitioning=False)
    Return a new RDD by applying a function to each partition of this
    RDD, while tracking the index of the original partition.
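The semantics of mapPartitionsWithIndex can be sketched in plain Python over a list of partitions (illustrative; `partitions` and `map_partitions_with_index` stand in for the RDD's partitioned data and method):

```python
# Stand-in for an RDD split into three partitions.
partitions = [[1, 2], [3, 4], [5]]

def map_partitions_with_index(parts, f):
    # Apply f(index, iterator) to each partition, keeping partition order.
    return [list(f(i, iter(p))) for i, p in enumerate(parts)]

# Tag every element with the index of the partition it came from.
result = map_partitions_with_index(
    partitions, lambda i, it: ((i, x) for x in it))
```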
cache(self)
    Persist this RDD with the default storage level
    (MEMORY_ONLY_SER).
persist(self, storageLevel)
    Set this RDD's storage level to persist its values across operations
    after the first time it is computed.
unpersist(self, blocking=True)
    Mark the RDD as non-persistent, and remove all blocks for it from
    memory and disk.
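The compute-once behavior that persist/unpersist controls can be sketched with a toy wrapper in plain Python (illustrative only; real persistence is handled by Spark's block manager, and `CachedRDD` is a hypothetical stand-in):

```python
class CachedRDD:
    """Toy stand-in: the dataset is computed once and reused until unpersisted."""
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cached = None

    def collect(self):
        # First action materializes the data; later actions reuse the cached copy.
        if self._cached is None:
            self._cached = self._compute_fn()
        return self._cached

    def unpersist(self):
        # Drop the cached copy; the next action recomputes from scratch.
        self._cached = None

compute_calls = []

def expensive_compute():
    compute_calls.append(1)   # track how many times the dataset is built
    return [1, 2, 3]

rdd = CachedRDD(expensive_compute)
rdd.collect()
rdd.collect()                 # served from cache; no recomputation
```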
coalesce(self, numPartitions, shuffle=False)
    Return a new RDD that is reduced into `numPartitions` partitions.
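A plain-Python sketch of reducing the partition count (illustrative; with `shuffle=False` Spark merges existing partitions rather than redistributing individual elements, and the grouping scheme below is a simplification):

```python
# Stand-in for an RDD with four partitions.
partitions = [[1], [2, 3], [4], [5, 6]]

def coalesce(parts, num_partitions):
    # Merge whole input partitions into num_partitions output partitions;
    # here each output partition collects every num_partitions-th input.
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(parts):
        out[i % num_partitions].extend(part)
    return out

result = coalesce(partitions, 2)
```

Note that no element moves between the values themselves; only whole partitions are combined, which is why `shuffle=False` coalescing is cheap.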
subtract(self, other, numPartitions=None)
    Return each value in self that is not contained in other.
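The value-based semantics of subtract can be sketched in plain Python (illustrative; `subtract` here is a hypothetical stand-in operating on lists rather than distributed data):

```python
def subtract(self_values, other_values):
    # Keep every value in self that does not appear in other. Unlike a
    # set difference, surviving duplicates in self are preserved.
    other_set = set(other_values)
    return [v for v in self_values if v not in other_set]

result = subtract([1, 1, 2, 3], [2])
```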
Inherited from rdd.RDD:
    __add__, __repr__, aggregate, aggregateByKey, cartesian, cogroup,
    collectAsMap, combineByKey, context, countByKey, countByValue, filter,
    first, flatMap, flatMapValues, fold, foldByKey, foreach,
    foreachPartition, getNumPartitions, getStorageLevel, glom, groupBy,
    groupByKey, groupWith, histogram, id, join, keyBy, keys, leftOuterJoin,
    map, mapPartitions, mapPartitionsWithSplit, mapValues, max, mean, min,
    name, partitionBy, pipe, reduce, reduceByKey, reduceByKeyLocally,
    rightOuterJoin, sample, sampleByKey, sampleStdev, sampleVariance,
    saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset,
    saveAsNewAPIHadoopFile, saveAsPickleFile, saveAsSequenceFile,
    saveAsTextFile, setName, sortBy, sortByKey, stats, stdev, subtractByKey,
    sum, take, takeOrdered, takeSample, toDebugString, top, union, values,
    variance, zip, zipWithIndex, zipWithUniqueId
Inherited from object:
    __delattr__, __format__, __getattribute__, __hash__, __new__,
    __reduce__, __reduce_ex__, __setattr__, __sizeof__, __str__,
    __subclasshook__