
Designing DIA note 115 -- change data capture

星辰破 · 简书 · 2019-07-15 09:33

11.2.2 Change Data Capture

problems with the DB's replication log
  • long been considered an internal implementation detail of the DB ==> not a public API
  • many DBs had no documented way of getting the log of changes ==> difficult to replicate changes

change data capture (CDC)

the process of observing all data changes written to a DB and extracting them in a form in which they can be replicated to other systems

Figure 11-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order.

11.2.2.1 implementing change data capture

derived data systems -- log consumers

CDC
  • a mechanism for ensuring that all changes are reflected in the derived data systems
  • effectively makes one DB the leader and turns the others into followers
  • a log-based MQ is well suited for transporting the change events since it preserves the ordering of msgs (a minimal sketch follows this list)
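
As a hedged illustration of Figure 11-5, here is a minimal, self-contained sketch (no real MQ) of one ordered change log driving several derived systems; the event format and store names are hypothetical.

```python
change_log = [  # events in the order the leader DB committed them
    {"key": "user:1", "value": {"name": "alice"}},
    {"key": "user:2", "value": {"name": "bob"}},
    {"key": "user:1", "value": None},  # tombstone: user:1 was deleted
]

cache, search_index = {}, {}  # two derived "follower" systems

def apply_event(store, event):
    """Apply one change event to a derived store."""
    if event["value"] is None:
        store.pop(event["key"], None)
    else:
        store[event["key"]] = event["value"]

for event in change_log:              # every consumer sees the same order,
    apply_event(cache, event)         # so every derived store converges to
    apply_event(search_index, event)  # the same state as the leader

assert cache == search_index == {"user:2": {"name": "bob"}}
```

In a real deployment the change_log would live in a log-based MQ such as Kafka rather than a Python list; the point is only that applying the same events in the same order yields the same state.
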
implementation of CDC

DB triggers

  • observe changes and save them to a changelog table
  • tend to be fragile & have significant performance overheads (see the sketch below)
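
A minimal sketch of trigger-based CDC using Python's stdlib sqlite3; the table, column, and trigger names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changelog (              -- triggers append every write here
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves write order
    op   TEXT, id INTEGER, name TEXT
);
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO changelog (op, id, name) VALUES ('insert', NEW.id, NEW.name);
END;
CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO changelog (op, id, name) VALUES ('update', NEW.id, NEW.name);
END;
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("UPDATE users SET name = 'bob' WHERE id = 1")
for row in conn.execute("SELECT * FROM changelog ORDER BY seq"):
    print(row)  # (1, 'insert', 1, 'alice') then (2, 'update', 1, 'bob')
```

Note that every write now fires an extra INSERT into the changelog table, which is exactly the performance overhead the bullet above refers to.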

parsing the replication log

  • more robust
  • challenges -- e.g. handling schema changes (an event-handling sketch follows)
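
To make the consumer side concrete, here is a hedged sketch that applies one change event in the before/after/op envelope shape that log-parsing tools such as Debezium emit; the derived store and row fields are assumptions for illustration.

```python
import json

raw = """{
  "payload": {
    "op": "u",
    "before": {"id": 1, "name": "alice"},
    "after":  {"id": 1, "name": "bob"}
  }
}"""

derived = {1: {"id": 1, "name": "alice"}}  # hypothetical derived store

event = json.loads(raw)["payload"]
row = event["after"] or event["before"]    # deletes carry only "before"
if event["op"] in ("c", "u", "r"):         # create / update / snapshot read
    derived[row["id"]] = event["after"]
elif event["op"] == "d":
    derived.pop(row["id"], None)           # delete: drop the row

print(derived)  # {1: {'id': 1, 'name': 'bob'}}
```
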
CDC tools
company / DB   product              comment
LinkedIn       Databus
Facebook       Wormhole
Yahoo          Sherpa
PostgreSQL     Bottled Water        decodes the write-ahead log
MySQL          Maxwell & Debezium   parse the binlog
MongoDB        mongoriver           parses the oplog
Oracle         GoldenGate
various DBs    Kafka Connect

CDC is usually async

  • pro -- adding a slow consumer doesn't affect the upstream system too much
  • con -- all the issues of replication lag apply

11.2.2.2 initial snapshot

  • if you have the log of all changes, you can reconstruct the entire state of the DB
    ==> but replaying everything requires too much disk space and takes too long
    ==> the log needs to be truncated
  • instead, use recent log changes + a consistent snapshot (sketched below)
    ==> the snapshot of the DB must correspond to a known position or offset in the change log
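
A minimal sketch of the snapshot-plus-offset idea; all names here are hypothetical.

```python
log = []  # append-only change log; list index == log offset

def write(key, value):
    log.append((key, value))

def take_snapshot(entries):
    """Replay the log into a state dict and record the matching offset."""
    state = {}
    for key, value in entries:
        state[key] = value
    return {"state": state, "offset": len(entries)}

write("a", 1); write("b", 2)
snap = take_snapshot(log)   # consistent snapshot, corresponds to offset 2
write("a", 3)               # a change made after the snapshot

# Initializing a new derived system: load the snapshot, then replay
# only the changes after the recorded offset.
derived = dict(snap["state"])
for key, value in log[snap["offset"]:]:
    derived[key] = value

assert derived == {"a": 3, "b": 2}
```

The key design point is that take_snapshot records the offset together with the state; without it, a new consumer would have no safe place to resume reading the log.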

11.2.2.3 log compaction

problem

if you can only keep recent log history, you need to go through the snapshot process every time you want to add a new derived data system

alternative: log compaction
  • with a log-structured storage engine, log compaction retains only the latest value for each key
    ==> the disk space required depends only on the current contents of the DB
  • the same idea works for log-based MQs & CDC
    ==> if every change event carries the primary key of its record, and every update replaces the previous value for that key, then we only need to keep the most recent write per key
solution

whenever you want to rebuild a derived data system, you can start a new consumer from offset 0 of the log-compacted topic, and sequentially scan all msgs in the log
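
A hedged sketch of that idea, with a hypothetical event format (None acts as a deletion tombstone):

```python
log = [("user:1", "alice"), ("user:2", "bob"),
       ("user:1", "carol"), ("user:2", None)]

def compact(entries):
    """Retain only the most recent write for each key, preserving order."""
    latest = {}
    for offset, (key, _value) in enumerate(entries):
        latest[key] = offset            # later writes win
    keep = set(latest.values())
    return [e for i, e in enumerate(entries) if i in keep]

compacted = compact(log)  # [("user:1", "carol"), ("user:2", None)]

# Rebuilding a derived system = sequential scan from offset 0.
state = {}
for key, value in compacted:
    if value is None:
        state.pop(key, None)            # tombstone: key was deleted
    else:
        state[key] = value

assert state == {"user:1": "carol"}
```

Because compaction preserves the relative order of the surviving entries, the sequential scan from offset 0 still converges to the correct current state.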

usage
  • supported by Apache Kafka -- allows the MQ to be used for durable storage

11.2.2.4 API support for change streams

DBs are beginning to support change streams as a first-class interface, rather than relying on retrofitted CDC efforts.

product              description
RethinkDB            allows queries to subscribe to change notifications
Firebase & CouchDB   provide data sync based on a change feed
Meteor               uses the MongoDB oplog to subscribe to data changes & update the UI
Kafka Connect        integrates CDC for a wide range of DBs with Kafka

Reference
Designing Data-Intensive Applications by Martin Kleppmann



