6.2 partitioning of key value data
- goal of partitioning -- spread the data & query load evenly across nodes
-
skewed -- an unfair partitioning, some partitions have more data / queries than others ==> less effective
-
hot pot -- a partition with disproportionately high load
- to avoid hot pot ==> the simplest approach would be to assign records to nodes randomly
- disadvantage of random assign -- when you read, you have to query all nodes in parallel as there's no way knowing which node the record is on
6.2.1 partitioning by key range
- assign a continuous range fo keys to each partition, like a paper encyclopedia
- with range boundaries + node partition assignment ==> you can make the request directly to the appropriate node (pick the correct book)
Figure 6-2. A print encyclopedia is partitioned by key range.
- key ranges may not be evenly spaced, because your data may not be evenly distributed
- to distribute the data evenly, the partition boundaries need to adapt to the data -- could be chosen manually/automatically
- this strategy is used by Bigtable ==> HBase, RethinkDB, MongoDB < 2.4
- within each partition, we can keep keys in sorted order ==> easy range scan, can fetch several related records in 1 query (e.g. key = timestamp, get all records for a month)
downside of key range partitioning
- certain access patterns can lead to hot spots
- e.g. key = timestamp, partitions => time range, then 1 day data of all sensors can go to the same partition while others sit idle
- to avoid the problem, use sth other than the timestamp as the key => e.g. partitioning by sensor name + time
6.2.2 partitioning by hash of key
- to avoid the risk of skew and hot spots, we could use the hash of a given key to determine the partition
suitable hash for partitioning
- take skewed data ==> make it uniformly distributed
- no need to be cryptographically strong
product |
hash function |
MongoDB |
MD5 |
Cassandra |
Murmur3 |
Voldemort |
Fowler-Noll-Vo |
unsuitable hash for partitioning
- the same key may have a different hash in different processes