Flushing Tables

Tables that are loaded via INSERT statements do not write rows directly to the file systems on the worker nodes. Instead, these rows are stored temporarily in the front-end database (the row store), and a background Java service periodically flushes them to the back-end storage system (the column store). This service scans all of your database tables, queues them for flushing, and issues the appropriate commands automatically, taking into account the number of workers on which the flush operation will run. You do not need to flush tables manually.
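As a rough illustration of the scan-and-queue behavior described above, the following Python sketch models a background service pass. All names here (`scan_and_queue`, `issue_flush_commands`, the command strings) are hypothetical and only illustrate the flow; they are not the actual service's code or API.

```python
import queue

# Hypothetical sketch of one pass of the background flush service:
# scan all tables, queue those with rows pending in the row store,
# then issue a flush command per queued table. Names are illustrative.

def scan_and_queue(tables, flush_queue):
    """Queue every table that has rows waiting in the row store."""
    for name, pending_rows in tables.items():
        if pending_rows > 0:
            flush_queue.put(name)

def issue_flush_commands(flush_queue, num_workers):
    """Drain the queue, issuing one flush command per table."""
    commands = []
    while not flush_queue.empty():
        name = flush_queue.get()
        commands.append(f"FLUSH {name} ({num_workers} workers)")
    return commands

q = queue.Queue()
scan_and_queue({"facts": 1_500_000, "dims": 0}, q)
print(issue_flush_commands(q, num_workers=15))  # ['FLUSH facts (15 workers)']
```

The point of the sketch is only that the service, not the user, decides when and where flushes run.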

For efficient storage in the column store, data is not flushed immediately but in batches, when a threshold of 100,000 rows per table per worker is reached (or a size of 500 MB, whichever comes first). For example, in a cluster with 15 operational workers, data for a given table is flushed when 1,500,000 of its rows accumulate in the row store. Note that for a replicated table, the effective limit that triggers the flush in this example is still 1,500,000 rows, but the same set of 100,000 rows is flushed to each worker.
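The threshold arithmetic above can be sketched as follows. The constants match the limits stated in the text; the function names (`row_threshold`, `should_flush`) are hypothetical, introduced only for illustration.

```python
# Sketch of the batch-flush trigger described above. The 100,000-row and
# 500 MB limits come from the text; the function names are hypothetical.

ROWS_PER_TABLE_PER_WORKER = 100_000   # global per-table, per-worker row limit
SIZE_LIMIT_BYTES = 500 * 1024**2      # 500 MB size limit

def row_threshold(num_workers: int) -> int:
    """Effective row-store limit for one table across the cluster."""
    return ROWS_PER_TABLE_PER_WORKER * num_workers

def should_flush(rows_pending: int, bytes_pending: int, num_workers: int) -> bool:
    """Flush when either the row or the size threshold is reached."""
    return (rows_pending >= row_threshold(num_workers)
            or bytes_pending >= SIZE_LIMIT_BYTES)

# With 15 operational workers, the row limit works out to 1,500,000 rows:
print(row_threshold(15))  # 1500000
```

Whichever limit is hit first (rows or size) triggers the flush for that table.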

To make sure that smaller batches of data (for example, updates to small dimension tables) do not remain in the row store for too long before being flushed, a time limit of 6 hours also applies. Data does not remain in the row store longer than 6 hours, regardless of the number of rows ready to be flushed.
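The 6-hour rule can be expressed as a simple age check, sketched below. The constant reflects the limit stated in the text; the function name and signature are hypothetical, not part of the actual flush service.

```python
import time

# Hypothetical check for the 6-hour row-store time limit described above.
MAX_ROW_STORE_AGE_SECONDS = 6 * 60 * 60  # global 6-hour limit

def overdue_for_flush(oldest_pending_row_ts, now=None):
    """True once the oldest unflushed row has waited 6 hours or more."""
    if now is None:
        now = time.time()
    return now - oldest_pending_row_ts >= MAX_ROW_STORE_AGE_SECONDS

# A batch whose oldest row arrived exactly 6 hours ago must be flushed,
# no matter how few rows are pending.
print(overdue_for_flush(0.0, now=6 * 60 * 60))  # True
```

This check runs independently of the row and size thresholds, so even a handful of rows is flushed once it ages past the limit.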

These row and time limits are set globally and apply to all tables.