ybunload Output Files

This section describes the output files that ybunload exports to the client.

Naming of Output Files

Use the --prefix option to give unique names to ybunload output files. If you do not use this option, files are named with the default unload prefix. When an unload operation produces multiple files, they are numbered consecutively. For example:
unload_1_1_.csv
unload_1_2_.csv
unload_1_3_.csv
...
The convention for naming files is as follows:
<prefix_><streamID_><partnumber_>.<extension>
  • prefix_: The value set with the --prefix option, or unload by default.
  • streamID_: A number assigned to each data stream from the workers. The streams are not in any particular order relative to a specific worker, and a single worker may provide multiple streams.
  • partnumber_: An incrementing number, starting from 1 for each stream.
  • .extension: The file type, such as .csv or .gz.
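As an illustration of the naming convention, the following sketch splits an output file name into its parts using POSIX shell parameter expansion. The function name parse_unload_name is hypothetical and is not part of ybunload. The sketch assumes that the name follows the convention exactly, with a trailing underscore before the extension.

```shell
# Hypothetical helper (not part of ybunload): split a name of the form
# <prefix_><streamID_><partnumber_>.<extension> into its components.
parse_unload_name() {
    name=$1
    ext=${name##*.}       # text after the last dot, e.g. csv
    stem=${name%.*}       # drop the extension, e.g. unload_1_2_
    stem=${stem%_}        # drop the trailing underscore, e.g. unload_1_2
    part=${stem##*_}      # last field is the part number
    stem=${stem%_*}       # drop the part number, e.g. unload_1
    stream=${stem##*_}    # next field is the stream ID
    prefix=${stem%_*}     # whatever remains is the prefix
    printf 'prefix=%s stream=%s part=%s ext=%s\n' "$prefix" "$stream" "$part" "$ext"
}

parse_unload_name unload_1_2_.csv
# prints: prefix=unload stream=1 part=2 ext=csv
```

Because the fields are peeled off from the right, a prefix that itself contains underscores (for example, my_prefix) is still recovered intact.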
Note: By default, unloaded GZIP compressed files do not have a file extension (such as .csv or .txt) when you unzip them. If necessary, you can use the gunzip command with the -c option to unzip and rename each file. For example:
$ gunzip -c unload_1_1_.gz > unload_1_1_.txt
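When an unload produces many compressed files, the same gunzip -c command can be applied in a loop. The following is a minimal sketch; the function name gunzip_with_ext and its two arguments (a directory and the extension to apply) are illustrative assumptions, not ybunload features.

```shell
# Hypothetical helper: decompress every .gz file in a directory and give
# each result the chosen extension. The original .gz files are kept.
gunzip_with_ext() {
    dir=$1
    ext=$2
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue              # nothing matched the glob
        gunzip -c "$f" > "${f%.gz}.$ext"     # unload_1_1_.gz -> unload_1_1_.txt
    done
}

# Example: decompress all .gz files in the current directory to .txt files.
# gunzip_with_ext . txt
```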

Number and Size of Output Files

The number and size of files generated to complete the unload depend on the following factors:
  • max_file_size setting for the ybunload command. If a file reaches this limit (by default, 50GB for regular file systems and 60GB for S3), ybunload closes that file and starts writing to a new file. This occurs as many times as necessary to complete the unload. The max_file_size value is not the maximum size of the unload; it is the maximum size of a single file generated by an unload stream from a single worker. Each worker unloads a separate data stream.
  • The plan that is generated for the unload query and how many worker nodes have data at the top of the plan. A ybunload plan is the same as a SELECT query plan except that the top of the plan tree has a "data output" node.
  • The compression (--compress) option that is chosen for the ybunload command.

Compressed and Uncompressed Files

Unload files are compressed in either "block mode" or "stream mode."

The GZIP_* compression options operate in block mode, which consolidates output files as much as possible, such that the number of files exported back to the client is no greater than it would be without the use of compression. The size of compressed versus uncompressed files will differ, but the number is the same. Block mode compression (or no compression) results in one file per worker node that has data at the top of the plan tree. If a worker node is not used in the plan at all or produces no data at the top of the plan, it does not contribute a file. Yellowbrick recommends the use of block mode GZIP options in most cases.

The GZIP_STREAM_* options operate in stream mode. Stream mode compression potentially results in more files being exported: typically one file per worker node per core. The GZIP_STREAM_* options are intended to be used only if your downstream workflow tools cannot handle gzip files that contain multiple compression blocks. Additionally, the GZIP_STREAM_* options consume significantly more network connections than the GZIP_* options, and many network routers cannot handle the increased number of connections reliably.

Where possible, all workers and CPU cores are used for unload queries. Regardless of the compression option that you use, the following exceptions apply:
  • Single-worker queries: Only one worker executes the plan, so only as many files as there are cores on that worker are exported.
  • Queries that select only part of a data set: If the unload query constrains a narrow segment of data (such as a date range or a set of low-cardinality values), only a subset of the workers may contain any data to be unloaded, resulting in fewer output files.
  • Queries involving sorts, when the --parallel option is not specified in the ybunload command: In this case, a final sort on a single worker and a single core yields only one output file. (To guarantee a single-file unload, make sure the unload query has an ORDER BY clause and do not specify the --parallel option in the command.)
  • Queries involving aggregates. The final aggregation phase of the plan is done on one worker, resulting in a reduced number of files, similar to the sort query case.

To summarize, in most cases the number of output files produced by an unload query will be one file per worker that has data at the top of the plan. If you change the max_file_size setting, the unload may generate more files or fewer files. For example, if you are doing a 6TB unload (uncompressed) from a 15-blade appliance with perfect data distribution, each blade unloads about 410GB (6TB is 6,144GB, and 6,144 / 15 ≈ 410). With the default max_file_size of 50GB, the unload would produce 9 files per blade (eight full 50GB files plus one partial file), for a total of 135 files. To reduce the number of output files for the unload to 15, you could specify a larger max_file_size, such as 500GB.
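The arithmetic in this example can be sketched as a small shell function. The function name estimate_files is hypothetical, and the estimate assumes perfectly even data distribution and a single stream per worker; actual file counts depend on the query plan and the compression option.

```shell
# Hypothetical estimate: number of output files for an unload, given the
# data per worker (GB), max_file_size (GB), and the number of workers.
# Assumes even distribution and one stream per worker.
estimate_files() {
    per_worker_gb=$1
    max_file_gb=$2
    workers=$3
    # Round up: a partially filled final file still counts as a file.
    files_per_worker=$(( (per_worker_gb + max_file_gb - 1) / max_file_gb ))
    echo $(( files_per_worker * workers ))
}

estimate_files 410 50 15     # prints 135
estimate_files 410 500 15    # prints 15
```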

Specific Compression Options

Block mode:
  • GZIP and GZIP_FAST are synonyms (for the fastest compression option).
  • GZIP_MORE provides better compression but slower unload performance.
  • GZIP_BEST provides the best compression but much slower performance.
Stream mode:
  • GZIP_STREAM and GZIP_STREAM_FAST are synonyms (for the fastest compression option).
  • GZIP_STREAM_MORE provides better compression but slower unload performance.
  • GZIP_STREAM_BEST provides the best compression but much slower performance.