11.2.2 Change Data Capture
problem of DB's replication log
- long been considered an internal implementation detail of the DB ==> not a public API
- many DBs don't document how to extract the log of changes ==> difficult to replicate changes to other systems
change data capture (CDC)
the process of observing all data changes written to a DB and extracting them in a form in which they can be replicated to other systems
Figure 11-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order.
11.2.2.1 implementing change data capture
derived data systems -- log consumers
CDC
- a mechanism for ensuring that all changes are reflected in the derived data systems
- makes 1 DB the leader and turns the others into followers
- log-based MQ is well suited for transporting the change events since it preserves the ordering of msgs
implementation of CDC
DB triggers
- observe changes and save to a changelog table
- tend to be fragile & have significant performance overheads
parsing the replication log
- more robust
- challenges - ex. handling schema changes
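The leader/follower idea above can be sketched in a few lines of Python (all names here are illustrative, not a real CDC API): the leader store captures every write into an ordered changelog, trigger-style, and a follower replays that log in order.

```python
class LeaderStore:
    """Toy key-value store that emits every write to a changelog (CDC sketch)."""
    def __init__(self):
        self.data = {}
        self.changelog = []  # ordered list of (key, value) change events

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))  # trigger-style capture of the change

def apply_changes(changelog, follower):
    """Replay change events in log order onto a derived data system."""
    for key, value in changelog:
        follower[key] = value

leader = LeaderStore()
leader.put("user:1", "alice")
leader.put("user:2", "bob")
leader.put("user:1", "alice2")  # later write for the same key wins

follower = {}
apply_changes(leader.changelog, follower)
assert follower == leader.data  # follower converges to the leader's state
```

Because the log is totally ordered, any number of followers replaying it independently end up in the same state — this is why order-preserving, log-based MQs fit CDC well.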
CDC tools
| company / DB | product            | comment                     |
|--------------|--------------------|-----------------------------|
| LinkedIn     | Databus            |                             |
| Facebook     | Wormhole           |                             |
| Yahoo        | Sherpa             |                             |
| PostgreSQL   | Bottled Water      | decodes the write-ahead log |
| MySQL        | Maxwell & Debezium | parse the binlog            |
| MongoDB      | mongoriver         | parses the oplog            |
| Oracle       | GoldenGate         |                             |
| various DBs  | Kafka Connect      |                             |
CDC is usually async
- pro -- adding a slow consumer doesn't affect the system of record too much
- con -- all the issues of replication lag apply
11.2.2.2 initial snapshot
- if you have the log of all changes ever made, you can reconstruct the entire state of the DB
==> but that requires too much disk space, and replaying takes too long
==> the log needs to be truncated
- use recent log changes + a consistent snapshot
==> the snapshot of the DB must correspond to a known position or offset in the change log
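A minimal sketch of the snapshot-plus-log idea (hypothetical names; real CDC tools track offsets per partition): restore from a snapshot taken at a known offset, then replay only the changes after that offset.

```python
def restore(snapshot, snapshot_offset, changelog):
    """Rebuild DB state from a consistent snapshot plus the change log after it.

    changelog: list of (offset, key, value), ordered by offset.
    The snapshot must correspond to snapshot_offset in the change log.
    """
    state = dict(snapshot)
    for offset, key, value in changelog:
        if offset > snapshot_offset:  # changes up to the offset are already in the snapshot
            state[key] = value
    return state

changelog = [(1, "a", 1), (2, "b", 2), (3, "a", 3), (4, "c", 4)]
snapshot = {"a": 1, "b": 2}  # DB state as of offset 2
assert restore(snapshot, 2, changelog) == {"a": 3, "b": 2, "c": 4}
```

If the snapshot's offset is unknown, there is no safe point to resume replay from, which is why the correspondence between snapshot and log position matters.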
11.2.2.3 log compaction
problem
if you can only keep recent log history, you need to go through the snapshot process every time you want to add a new derived data system
alternative: log compaction
- with log-structured DB, log compaction will retain only the latest value
==> the disk space required depends only on the current contents of the DB
- same idea works for log-based MQ & CDC
==> if the CDC system ensures every change has a PK, and every update for a key replaces the previous value for that key, then we only need to keep the most recent write per key
solution
whenever you want to rebuild a derived data system, you can start a new consumer from offset 0 of the log-compacted topic, and sequentially scan all msgs in the log
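The compaction idea can be sketched as a toy function (Kafka's real compactor works on log segments in the background and uses null values as deletion tombstones):

```python
def compact(changelog):
    """Keep only the most recent value per key; None acts as a tombstone."""
    latest = {}
    for key, value in changelog:
        latest[key] = value  # later writes for a key overwrite earlier ones
    # drop tombstoned keys entirely once compacted
    return [(k, v) for k, v in latest.items() if v is not None]

def replay(changelog):
    """Rebuild DB state by applying a change log in order."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone: delete the key
        else:
            state[key] = value
    return state

full_log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
# replaying the compacted log yields the same state as replaying the full log
assert replay(compact(full_log)) == replay(full_log) == {"a": 3, "c": 4}
```

This is why the compacted log's size depends only on the current contents of the DB, not on how many writes have ever happened.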
usage
- supported by Apache Kafka -- allows the MQ to be used for durable storage
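For reference, a Kafka topic can be created as log-compacted via the standard `kafka-topics.sh` tool (topic name, partition count, and broker address here are illustrative):

```shell
kafka-topics.sh --create \
  --topic db.changes \
  --partitions 1 --replication-factor 1 \
  --bootstrap-server localhost:9092 \
  --config cleanup.policy=compact
```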
11.2.2.4 API support for change streams
DBs are beginning to support change streams as a first-class interface, rather than retrofitted CDC efforts.
| product            | description                                                         |
|--------------------|---------------------------------------------------------------------|
| RethinkDB          | allows queries to subscribe to change notifications                 |
| Firebase & CouchDB | provide data sync based on a change feed                            |
| Meteor             | uses the MongoDB oplog to subscribe to data changes & update the UI |
| Kafka Connect      | integrates CDC for a wide range of DBs with Kafka                   |
Reference
Designing Data-Intensive Applications by Martin Kleppmann