Partitioning
Fast analytics in StarRocks begin with a table layout that matches your query patterns. This guide distills handsāon experience into clear rules for partitioning, helping you:
- Scan less data via aggressive partition pruning
- Manage lifecycle tasks (TTL, GDPR deletes, tiering) with metadataāonly ops;
- Scale smoothly as tenant counts, data volume, or retention windows grow.
- Controlled write amplificationāNew data lands in the āhotā partition; compaction happens in historical partitions
Keep this advice close when modeling a new table or refactoring an old oneāeach section gives purposeādriven criteria, design heuristics, and operational guardārails so you avoid costly reāpartitioning down the road.
Partitioning vs. Bucketingādifferent jobsā
Understanding the distinction between partitioning and bucketing is fundamental when designing performant StarRocks tables. While both help manage large datasets, they serve different purposes:
- Partitioning allows StarRocks to skip entire blocks of data at query time using partition pruning, and enables metadata-only lifecycle operations like dropping old or tenant-specific data.
- Bucketing, on the other hand, helps distribute data across tablets to parallelize query execution and balance load, especially when combined with hash functions.
| Aspect | Partitioning | Bucketing (Hash/Random) |
|---|---|---|
| Primary goal | Coarseāgrain data pruning and lifecycle control(TTL, archiving). | Fineāgrain parallelism and data locality inside each partition. |
| Planner visibility | Partitions are catalog objects; FE can skip them via predicates. | Only equality predicates support bucket pruning |
| Lifecycle ops | DROP PARTITION is metadataāonlyāideal for GDPR deletes, monthly rollāoff. | Buckets canāt be dropped; they change only with ALTER TABLE ⦠MODIFY DISTRIBUTED BY. |
| Typical count | 10^2ā10^4 per table (days, weeks, tenants). | 10ā120 per partition; StarRocks BUCKETS xxx tunes this. |
| Skew handling | Merge or split partitions; consider composite/hybrid scheme. | Raise bucket count, hash on compound key, isolate āwhalesā, or use random bucketing |
| Red flags | >100 k partitions can introduce significant memory footprint for FE | >200 k tablets per BE; tablets exceeding 10 GB may encounter compaction issues. |
When should I partition?ā
| Table type | Partition? | Typical key |
|---|---|---|
| Fact / event stream | Yes | date_trunc('day', event_time) |
| Huge dimension (billions rows) | Sometimes | Time or business key change date |
| Small dimension / lookup | No | Rely on hash distribution |
Choosing the partition keyā
- Timeāfirst defaultāIf 80āÆ% of queries include a time filter, lead with
date_trunc('day', dt). - Tenant isolationāAdd
tenant_idinto the key when you need to manage the data in tenant basis - Retention alignmentāPut the column you plan to purge on into the key.
- Composite keys:
PARTITION BY tenant_id, date_trunc('day', dt)prunes perfectly but creates#tenants Ć #dayspartitions. Keep below āāÆ100āÆk total or FE memory & BE compaction suffer.
Picking granularityā
The granularity of PARTITION BY date_trunc('day', dt) should be adjusted based on the use case. You can use "hour," "day," or "month," etc. See date_trunc
| Granularity | Use when | Pros | Cons |
|---|---|---|---|
| Daily (default) | Most BI & reporting | Few partitions (365/yr); simple TTL | Less precise for "last 3 h" queries |
| Hourly | > 2 Ć tablet per day; IoT bursts | Hotāspot isolation; 24 partitions/day | 8 700 partitions/yr |
| Weekly / Monthly | Historical archive | Tiny metadata; merges easy | Coarser pruning |
- Rule of thumb: Keep each partition ā¤āÆ100āÆGB and ā¤āÆ20āÆk tablets/partition across replicas.
- Mixed granularity: Starting from version 3.4, StarRocks supports mixed granularity by merging historical partitions into coarser granularity.
Example recipesā
Clickāstream fact (singleātenant)ā
CREATE TABLE click_stream (
user_id BIGINT,
event_time DATETIME,
url STRING,
...
)
DUPLICATE KEY(user_id, event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS xxx;
SaaS metrics (multiātenant, pattern A)ā
Recommended for most SaaS workloads. Prunes on time, keeps tenant rows coālocated.
CREATE TABLE metrics (
tenant_id INT,
dt DATETIME,
metric_name STRING,
v DOUBLE
)
PRIMARY KEY(tenant_id, dt, metric_name)
PARTITION BY date_trunc('DAY', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS xxx;
Whale tenant composite (pattern B)ā
When tenant-specific DML/DDL is necessary or large-scale tenants are present, be cautious of potential partition explosion.
CREATE TABLE activity (
tenant_id INT,
dt DATETIME,
id BIGINT,
....
)
DUPLICATE KEY(dt, id)
PARTITION BY tenant_id, date_trunc('MONTH', dt)
DISTRIBUTED BY HASH(id) BUCKETS xxx;