## When to Use This Pattern
- **Data Residency**: Your compliance requirements mandate that raw data stays within your network.
- **Private Networks**: Your database is behind a firewall or in a private VPC with no public access.
- **Custom Extraction**: You need custom transformation logic before data leaves your environment.
- **Existing CDC Pipeline**: You already have Debezium, Fivetran, or another CDC tool running internally.
If your database is publicly accessible and you don’t have data residency requirements, consider using Definite’s built-in connectors instead. They’re fully managed and require no infrastructure on your end.
## Architecture Overview
With push-based ingestion, you run the data extraction within your own infrastructure. Only the extracted data is sent to Definite via HTTPS.

1. **Your Infrastructure**: Your database (Postgres, MySQL, etc.) lives in your VPC or private network.
2. **Your Extractor**: You run an extractor (Lambda, ECS, EC2, or any compute) that queries your database and formats the data.
3. **HTTPS to Definite**: Your extractor sends data via HTTPS POST to `api.definite.app/v2/stream`. Only outbound traffic; no inbound connections required.
4. **DuckLake**: Data lands in DuckLake (Iceberg tables on GCS) and is immediately queryable.
Benefits of this approach:

- Raw data never leaves your network
- You control the extraction schedule and logic
- Only outbound HTTPS traffic required (no inbound connections)
- Works with any database or data source
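Concretely, the only integration surface is an outbound HTTPS POST. The sketch below builds such a request with the standard library; the payload shape (`table`/`records`) and bearer-token authentication are assumptions here, so check the Stream API reference for the exact contract:

```python
import json
import os
import urllib.request

DEFINITE_API_URL = "https://api.definite.app/v2/stream"

def build_request(table: str, records: list[dict]) -> urllib.request.Request:
    """Build the single outbound HTTPS POST carrying one batch of records."""
    payload = json.dumps({"table": table, "records": records}).encode()
    return urllib.request.Request(
        DEFINITE_API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Bearer-token auth is an assumption; see the Stream API reference.
            "Authorization": f"Bearer {os.environ.get('DEFINITE_API_KEY', '')}",
        },
        method="POST",
    )

def push_batch(table: str, records: list[dict]) -> None:
    """Send one batch; any HTTP error surfaces as an exception."""
    with urllib.request.urlopen(build_request(table, records), timeout=30) as resp:
        resp.read()
```

Because the extractor only opens connections outward, no firewall holes or inbound routes are needed.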
## Example: Postgres Incremental Sync
This example shows a Python script that incrementally syncs data from Postgres to Definite. It uses a watermark column (like `updated_at`) to sync only the rows that have changed.
### Full Implementation
### Environment Variables
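The extractor reads its settings from the environment so no secrets live in code. A sketch of the configuration surface (the variable names here are illustrative, not required by Definite):

```python
import os

def load_config() -> dict:
    """Read extractor settings from the environment (names are illustrative)."""
    return {
        "api_key": os.environ.get("DEFINITE_API_KEY", ""),   # secret; required in production
        "database_url": os.environ.get("DATABASE_URL", ""),  # e.g. a Postgres connection string
        "table": os.environ.get("SYNC_TABLE", "orders"),     # table to sync
        "batch_size": int(os.environ.get("BATCH_SIZE", "1000")),  # rows per push
    }
```

In production you would fail fast when `api_key` or `database_url` is empty rather than silently falling back.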
## Handling Deletes and Updates
The Stream API uses append-only semantics. To handle updates and deletes from your source database, include CDC metadata in your records.

### Include CDC Operation Type
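For example, a small helper can tag each row with its operation type before pushing. The `_op`/`_synced_at` field names are a convention assumed here, not something the API requires:

```python
from datetime import datetime, timezone

def to_cdc_record(row: dict, op: str) -> dict:
    """Wrap a source row with CDC metadata.

    op is 'insert', 'update', or 'delete'. Since the stream is append-only,
    a delete is just one more appended record marked as such.
    """
    return {
        **row,
        "_op": op,
        "_synced_at": datetime.now(timezone.utc).isoformat(),
    }
```

Downstream, these fields let you reconstruct the current state of each row (see the view below).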
### Create a View for Current State
In DuckLake, create a view that shows only the current state of each record. This gives you:

- Full history in the `bronze` layer
- Current state in the `silver` layer
- The ability to query point-in-time snapshots
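The idea is to keep the latest record per key and drop deletes. The SQL string below uses DuckDB-style syntax with illustrative table/column names; the Python function implements the same logic as a plain reference:

```python
# DuckDB-style SQL for the silver view (table/column names illustrative):
CURRENT_STATE_SQL = """
CREATE VIEW silver_orders AS
SELECT * EXCLUDE (rn)
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY _synced_at DESC) AS rn
    FROM bronze_orders
)
WHERE rn = 1 AND _op != 'delete';
"""

def current_state(history: list[dict]) -> list[dict]:
    """The view's logic in plain Python: latest record per id, minus deletes."""
    latest: dict = {}
    for rec in sorted(history, key=lambda r: r["_synced_at"]):
        latest[rec["id"]] = rec  # later records overwrite earlier ones
    return [r for r in latest.values() if r["_op"] != "delete"]
```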
## Deployment Options
### AWS Lambda (Scheduled)
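A handler might look like the following sketch, where `sync_once` stands in for the incremental sync shown earlier:

```python
def sync_once() -> int:
    """Placeholder for the incremental sync shown above; returns rows pushed."""
    return 0

def handler(event, context):
    """Lambda entry point, invoked on a schedule by an EventBridge rule."""
    rows = sync_once()
    print(f"synced {rows} rows")  # lands in CloudWatch Logs
    return {"statusCode": 200, "rows_synced": rows}
```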
Run your sync on a schedule using an EventBridge rule.

### ECS Fargate (Scheduled Task)
### EC2 / On-Premise (Cron)
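When invoked by cron, the exit code is how failures surface (pair nonzero exits with cron's `MAILTO` or your own alerting). A minimal entry-point sketch, with `sync_once` standing in for the sync shown earlier:

```python
import sys

def sync_once() -> int:
    """Placeholder for the incremental sync shown above; returns rows pushed."""
    return 0

def main() -> int:
    try:
        rows = sync_once()
        print(f"synced {rows} rows")
        return 0  # cron treats a nonzero exit as failure
    except Exception as exc:
        print(f"sync failed: {exc}", file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```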
## Error Handling
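A sketch of a retry wrapper with exponential backoff and jitter; the `push` callable stands in for whatever HTTP client you use:

```python
import random
import time

def push_with_retry(push, records, max_attempts: int = 5, base_delay: float = 1.0):
    """Call push(records), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return push(records)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to your alerting
            # 1s, 2s, 4s, ... plus jitter so parallel extractors don't sync up
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```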
Implement retry logic with exponential backoff for transient failures.

### Dead Letter Queue
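Batches that exhaust their retries can be parked for later replay instead of being lost. A minimal sketch using a local JSONL file as the dead letter store (a stand-in for SQS or similar):

```python
import json
from pathlib import Path

DEAD_LETTER_FILE = Path("dead_letter.jsonl")  # stand-in for a real queue

def push_or_dead_letter(push, records: list[dict]) -> bool:
    """Try to push a batch; on failure, append it to the dead-letter file.

    Returns True on success, False if the batch was dead-lettered.
    """
    try:
        push(records)
        return True
    except Exception as exc:
        with DEAD_LETTER_FILE.open("a") as f:
            f.write(json.dumps({"error": str(exc), "records": records}) + "\n")
        return False
```

A separate replay job can read the file and re-push each batch once the underlying issue is fixed.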
For production deployments, consider sending failed batches to a dead letter queue.

## Monitoring
Track these metrics to ensure your sync is healthy:

| Metric | Description |
|---|---|
| `rows_extracted` | Rows read from the source database |
| `rows_pushed` | Rows sent to Definite |
| `push_latency_ms` | Time to push each batch |
| `sync_lag_seconds` | Time since the last successful sync |
| `errors_total` | Failed push attempts |
### Health Check Endpoint
If running as a service, expose a health endpoint.

## Alternative: Debezium + Kafka
For high-volume, real-time CDC, you can run Debezium with a Kafka HTTP Sink connector pointed at the Stream API. This gives you:

- True CDC from the Postgres WAL
- Exactly-once delivery semantics
- Lower latency than batch polling
## Related
- Stream API Reference - Full API documentation
- Extractors - Definite’s built-in managed connectors
- Webhooks - Trigger blocks from external events

