pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets, col, *cols)
Buckets the output by the given columns. If specified, the output is laid out on the file system similarly to Hive’s bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive’s bucketing.

New in version 2.3.0.

Parameters
- numBuckets : int
    The number of buckets to save.
- col : str, list or tuple
    The name of a column, or a list of column names.
- cols : str
    Additional column names (optional). If col is a list or tuple, cols should be empty.
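
The two accepted calling forms (a first name plus extra names, or a single list or tuple of names) can be illustrated with a small pure-Python sketch. The helper `normalize_bucket_columns` is hypothetical and is not part of PySpark; it only mirrors the parameter rules described above and assumes that mixing a list with extra names is rejected.

```python
def normalize_bucket_columns(num_buckets, col, *cols):
    """Hypothetical helper (not a PySpark API): return (num_buckets, [names])."""
    if not isinstance(num_buckets, int):
        raise TypeError(f"numBuckets should be an int, got {type(num_buckets)}")

    if isinstance(col, (list, tuple)):
        # A list or tuple already carries every name, so extra names are an error.
        if cols:
            raise ValueError("col is a list or tuple, so cols must be empty")
        names = list(col)
    else:
        # A single name followed by any additional names.
        names = [col, *cols]

    if not names or not all(isinstance(name, str) for name in names):
        raise TypeError("all column names should be str")
    return num_buckets, names


# Both calling forms describe the same bucketing columns.
print(normalize_bucket_columns(100, "year", "month"))
print(normalize_bucket_columns(100, ["year", "month"]))
```

Under these assumptions, `bucketBy(100, 'year', 'month')` and `bucketBy(100, ['year', 'month'])` name the same columns, while `bucketBy(100, ['year'], 'month')` would be an invalid combination.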
 
Notes

Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().

Examples

>>> (df.write.format('parquet')
...     .bucketBy(100, 'year', 'month')
...     .mode("overwrite")
...     .saveAsTable('bucketed_table'))
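
To make the idea of bucketing by columns concrete, the following is a conceptual sketch in plain Python, not Spark code. It assigns each row to one of `num_buckets` buckets from a hash of its bucketing column values. The function `bucket_id`, the sample rows, and the use of `zlib.crc32` are all assumptions made for illustration; Spark uses its own hash function, which, as noted above, also differs from Hive's.

```python
import zlib
from collections import defaultdict


def bucket_id(row, columns, num_buckets):
    """Hypothetical helper: map a row to a bucket from its bucketing columns."""
    key = "|".join(str(row[c]) for c in columns).encode("utf-8")
    # zlib.crc32 is only a stand-in hash; it is NOT the function Spark or Hive uses.
    return zlib.crc32(key) % num_buckets


# Made-up sample data for illustration.
rows = [
    {"year": 2023, "month": 1, "amount": 10.0},
    {"year": 2023, "month": 1, "amount": 12.5},
    {"year": 2023, "month": 2, "amount": 7.0},
    {"year": 2024, "month": 1, "amount": 3.0},
]

buckets = defaultdict(list)
for row in rows:
    buckets[bucket_id(row, ["year", "month"], 4)].append(row)

# Rows that share the same (year, month) values always land in the same bucket.
for bucket, members in sorted(buckets.items()):
    print(bucket, [(r["year"], r["month"]) for r in members])
```

The property worth noticing is that the bucket depends only on the bucketing column values and the number of buckets, so rows with equal values are always written to the same bucket.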