Airbyte CDC: Moving from MySQL to BigQuery Reliably
Integrations Covered
A practical guide to using Airbyte CDC to move MySQL changes into BigQuery with reliable syncs, a clean warehouse layout, and predictable operations.
Airbyte CDC: Moving from MySQL to BigQuery Reliably
Change data capture is one of the fastest ways to make a warehouse useful without rebuilding every upstream system. If your team is still doing full refreshes from MySQL into BigQuery, the result is usually predictable: slow sync windows, wasted compute, inconsistent dashboards, and a backlog of manual fixes.
This guide shows how to set up a production-minded CDC pipeline using Airbyte as the ingestion layer and BigQuery as the destination. The goal is not to move rows only. The goal is to create a pipeline that is reliable, observable, and cheap enough to run every day.
When this pattern is a good fit
Use this pattern when your source data lives in MySQL, analysts need fresh data multiple times per day, and full refreshes are too slow or too expensive.
It is especially useful when tables are large, stakeholders expect near-real-time reporting, or source-schema changes make manual scripts too fragile.
- Operational data lives in MySQL.
- Business users need fresh data in BigQuery during the day.
- Full refreshes are too slow, too expensive, or too risky.
- You want a low-maintenance path from source database to warehouse.
Reference architecture
A practical setup stays simple: MySQL remains the operational source, Airbyte captures changes, BigQuery stores the raw load, dbt shapes the data, and monitoring protects freshness and trust.
- MySQL as the operational source.
- Airbyte to read changes and load them into BigQuery.
- BigQuery as the analytics warehouse.
- dbt for cleaning, typing, and modeling the raw data.
- Monitoring for failures, latency, and schema drift.
This keeps operational concerns separate from business logic, which makes the system easier to debug, cheaper to evolve, and safer to hand off.
Prepare MySQL first
Before you connect Airbyte, make sure MySQL is actually ready for CDC. That means binary logging is enabled, the replication user has the right permissions, primary keys are stable, and binlog retention is long enough to survive a temporary outage.
- Binary logging enabled.
- A replication user with the needed permissions.
- Stable primary keys on the tables you care about.
- Enough binlog retention to let Airbyte catch up.
If you skip this step, the connector may work in a demo and then fail the moment the source starts receiving real traffic.
Define the business scope
Do not move every table just because you can. Start with the data that directly supports reporting and decisions, then expand after the pipeline is stable.
- Orders
- Customers
- Payments
- Subscriptions
- Products
- Support tickets
For each table, decide whether the team needs every update or only the latest state. That choice changes how you model dashboards later.
Configure the Airbyte source
Create a MySQL source in Airbyte and verify the connection before you sync anything at scale. Start with one or two low-risk tables so you can validate freshness, row counts, and field mapping before expanding the connector scope.
- Host, port, username, and password.
- Database name and SSL settings.
- Replication method set to CDC.
- Primary-key validation for the selected tables.
Configure the BigQuery destination
Set up BigQuery with a clear dataset layout. Keep raw replicated data separate from staging and marts, and limit service-account permissions to the minimum required.
- Use a dedicated raw dataset.
- Separate raw, staging, and marts datasets.
- Match the region to your compliance and performance needs.
- Grant only the permissions the pipeline actually needs.
Choose the sync mode deliberately
Use full refresh only for small reference tables. Use incremental sync for large transactional or event tables. Use CDC for the tables where freshness matters enough to justify the extra complexity.
- Full refresh for small lookup data.
- Incremental sync for large but non-critical tables.
- CDC for critical tables that drive timely decisions.
Model the raw data in layers
Do not point dashboards directly at raw replicated tables. Keep the data layered so each step has one job: raw keeps the source shape, staging cleans and standardizes it, and marts serve reporting.
- Raw: exact replicated data.
- Staging: cleaned columns, standardized types, deduplicated keys.
- Marts: business-ready reporting tables.
Validate and monitor
Once the pipeline is running, watch freshness, sync failures, row-count drift, and schema changes. The point is not just to move data. The point is to trust the data enough to use it every day.
- Track freshness and latency.
- Alert on failed syncs.
- Compare source and destination row counts.
- Watch for schema drift before it reaches dashboards.
Common mistakes to avoid
- Skipping CDC prerequisites on MySQL.
- Syncing every table before proving freshness.
- Querying raw tables directly from BI dashboards.
- Ignoring monitoring until users report stale data.
Conclusion
A reliable Airbyte CDC pipeline is mostly about discipline: prepare MySQL correctly, keep the scope focused, land data into a clean BigQuery structure, and monitor freshness before issues reach the business. That is what makes the setup useful beyond a demo.
If you want the next step, model the raw tables in dbt and keep the reporting layer separate from ingestion.