11.2.2 Change Data Capture
problem of DB's replication log
- long been considered an internal implementation detail of the DB ==> not a public API
- many DBs don't document how to extract the log of changes ==> difficult to replicate changes to other systems
change data capture (CDC)
the process of observing all data changes written to a DB and extracting them in a form in which they can be replicated to other systems
Figure 11-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order.
11.2.2.1 implementing change data capture
derived data systems -- log consumers
CDC
- a mechanism for ensuring that all changes are reflected in the derived data systems
- makes 1 DB the leader and turns the others into followers
- log-based MQ is well suited for transporting the change events since it preserves the ordering of msgs
implementation of CDC
DB triggers
- observe changes and save to a changelog table
- tend to be fragile & have significant performance overheads
parsing the replication log
- more robust
- challenges - ex. handling schema changes
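The leader/follower idea above can be sketched in a few lines of Python (all names here are illustrative, not a real CDC API): the leader store captures every write into an ordered changelog, trigger-style, and a follower replays that log in order.

```python
class LeaderStore:
    """Toy key-value store that emits every write to a changelog (CDC sketch)."""
    def __init__(self):
        self.data = {}
        self.changelog = []  # ordered list of (key, value) change events

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))  # trigger-style capture of the change

def apply_changes(changelog, follower):
    """Replay change events in log order onto a derived data system."""
    for key, value in changelog:
        follower[key] = value

leader = LeaderStore()
leader.put("user:1", "alice")
leader.put("user:2", "bob")
leader.put("user:1", "alice2")  # later write for the same key wins

follower = {}
apply_changes(leader.changelog, follower)
assert follower == leader.data  # follower converges to the leader's state
```

Because the log is totally ordered, any number of followers replaying it independently end up in the same state — this is why order-preserving, log-based MQs fit CDC well.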
CDC tools
| company / DB | product            | comment                     |
|--------------|--------------------|-----------------------------|
| LinkedIn     | Databus            |                             |
| Facebook     | Wormhole           |                             |
| Yahoo        | Sherpa             |                             |
| PostgreSQL   | Bottled Water      | decodes the write-ahead log |
| MySQL        | Maxwell & Debezium | parse the binlog            |
| MongoDB      | mongoriver         | parses the oplog            |
| Oracle       | GoldenGate         |                             |
| various DBs  | Kafka Connect      |                             |
CDC is usually async
- pro -- adding a slow consumer doesn't affect the system of record too much
- con -- all the issues of replication lag apply
11.2.2.2 initial snapshot
- if you have the log of all changes ever made, you can reconstruct the entire state of the DB
==> but that requires too much disk space, and replaying takes too long
==> the log needs to be truncated
- use recent log changes + a consistent snapshot
==> the snapshot of the DB must correspond to a known position or offset in the change log
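A minimal sketch of the snapshot-plus-log idea (hypothetical names; real CDC tools track offsets per partition): restore from a snapshot taken at a known offset, then replay only the changes after that offset.

```python
def restore(snapshot, snapshot_offset, changelog):
    """Rebuild DB state from a consistent snapshot plus the change log after it.

    changelog: list of (offset, key, value), ordered by offset.
    The snapshot must correspond to snapshot_offset in the change log.
    """
    state = dict(snapshot)
    for offset, key, value in changelog:
        if offset > snapshot_offset:  # changes up to the offset are already in the snapshot
            state[key] = value
    return state

changelog = [(1, "a", 1), (2, "b", 2), (3, "a", 3), (4, "c", 4)]
snapshot = {"a": 1, "b": 2}  # DB state as of offset 2
assert restore(snapshot, 2, changelog) == {"a": 3, "b": 2, "c": 4}
```

If the snapshot's offset is unknown, there is no safe point to resume replay from, which is why the correspondence between snapshot and log position matters.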
11.2.2.3 log compaction
problem
if you can only keep recent log history, you need to go through the snapshot process every time you want to add a new derived data system
alternative: log compaction
- with log-structured DB, log compaction will retain only the latest value
==> the disk space required depends only on the current contents of the DB
- same idea works for log-based MQ & CDC
==> if the CDC system ensures every change has a PK, and every update for a key replaces the previous value for that key, then we only need to keep the most recent write per key
solution
whenever you want to rebuild a derived data system, you can start a new consumer from offset 0 of the log-compacted topic, and sequentially scan all msgs in the log
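The compaction idea can be sketched as a toy function (Kafka's real compactor works on log segments in the background and uses null values as deletion tombstones):

```python
def compact(changelog):
    """Keep only the most recent value per key; None acts as a tombstone."""
    latest = {}
    for key, value in changelog:
        latest[key] = value  # later writes for a key overwrite earlier ones
    # drop tombstoned keys entirely once compacted
    return [(k, v) for k, v in latest.items() if v is not None]

def replay(changelog):
    """Rebuild DB state by applying a change log in order."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone: delete the key
        else:
            state[key] = value
    return state

full_log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
# replaying the compacted log yields the same state as replaying the full log
assert replay(compact(full_log)) == replay(full_log) == {"a": 3, "c": 4}
```

This is why the compacted log's size depends only on the current contents of the DB, not on how many writes have ever happened.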
usage
- supported by Apache Kafka -- allows the MQ to be used for durable storage
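For reference, a Kafka topic can be created as log-compacted via the standard `kafka-topics.sh` tool (topic name, partition count, and broker address here are illustrative):

```shell
kafka-topics.sh --create \
  --topic db.changes \
  --partitions 1 --replication-factor 1 \
  --bootstrap-server localhost:9092 \
  --config cleanup.policy=compact
```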
11.2.2.4 API support for change streams
DBs are beginning to support change streams as a first-class interface, rather than retrofitted CDC efforts.
| product            | description                                                         |
|--------------------|---------------------------------------------------------------------|
| RethinkDB          | allows queries to subscribe to change notifications                 |
| Firebase & CouchDB | provide data sync based on a change feed                            |
| Meteor             | uses the MongoDB oplog to subscribe to data changes & update the UI |
| Kafka Connect      | integrates CDC for a wide range of DBs with Kafka                   |
Reference
Designing Data-Intensive Applications by Martin Kleppmann