Performance Tuning Kafka JDBC Source Connector

Apache Kafka is widely used for building real-time data pipelines and streaming applications. One common scenario is streaming data from relational databases into Kafka. This can be achieved with the Kafka JDBC Source Connector, a Confluent-maintained connector that runs on the Kafka Connect framework. Tuning the performance of this connector is crucial to ensure that data is ingested efficiently and reliably. This blog will delve into the various configurations and strategies for optimizing the performance of the Kafka JDBC Source Connector.

Understanding the Kafka JDBC Source Connector

The Kafka JDBC Source Connector allows you to pull data from relational databases and publish it to Kafka topics. It uses JDBC to connect to the database, execute SQL queries, and push the result set to Kafka. Key components include:

  • JDBC URL: The connection string to your database.
  • Tasks: The number of parallel tasks that can be run.
  • Poll interval: The frequency with which the connector polls the database for new data.
  • Batch size: The number of records to fetch in each poll.
  • Mode: The strategy to detect new or updated rows (e.g., timestamp, incrementing, timestamp+incrementing).
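
A minimal configuration tying these components together might look like the following sketch (the connection details and table name are placeholders; the poll interval and batch size shown are the connector defaults):

{
  "name": "jdbc-source-minimal",
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:postgresql://db-host:5432/mydb",
  "connection.user": "myuser",
  "connection.password": "mypassword",
  "mode": "incrementing",
  "incrementing.column.name": "id",
  "tasks.max": "1",
  "poll.interval.ms": "5000",
  "batch.max.rows": "100",
  "topic.prefix": "jdbc_",
  "table.whitelist": "my_table"
}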

Key Performance Tuning Parameters

  1. Tasks (tasks.max)
  • This parameter sets the maximum number of tasks the connector will use. More tasks can increase throughput by parallelizing the work, but they also put more load on the database. Note that the JDBC source connector parallelizes by distributing tables across tasks, so setting tasks.max higher than the number of ingested tables buys nothing, and a custom query always runs in a single task (see the multi-table example after this list).
  • Sample Configuration:
"tasks.max": "10"
  2. Poll Interval (poll.interval.ms)
  • Defines how often the connector polls the database for new data. A lower interval means more frequent polling, which can lead to higher load but lower latency.
  • Sample Configuration:
 "poll.interval.ms": "50000" 
  3. Batch Size (batch.max.rows)
  • Controls the number of rows fetched in each poll. Larger batch sizes can improve throughput but require more memory and can increase processing time.
  • Sample Configuration:
"batch.max.rows": "50000"
  4. Mode (mode)
  • Determines how the connector identifies new or updated rows. Common modes include:
    • incrementing: Uses an incrementing column (e.g., an auto-increment primary key).
    • timestamp: Uses a timestamp column to detect changes.
    • timestamp+incrementing: Combines both strategies.
  • Sample Configuration:
"mode": "timestamp+incrementing", 
"timestamp.column.name": "last_modified", 
"incrementing.column.name": "id"
  5. Connection Pool Size (connection.pool.size)
  • Defines the number of JDBC connections in the pool. More connections can help handle more simultaneous tasks but may increase load on the database.
  • Sample Configuration:
"connection.pool.size": "10"

Additional Configuration Considerations

  • Database-Specific Optimizations: Each database has its own set of optimizations. For instance, for PostgreSQL, consider connection parameters like tcpKeepAlive and connectTimeout, and tune the fetch size (see the sample connection URL after this list).
  • Connector Logging: Enable detailed logging to understand performance bottlenecks. Log levels are set in the Connect worker's log4j configuration rather than in the connector config, and can also be changed at runtime through the worker's admin endpoint, e.g. a PUT to /admin/loggers/io.confluent.connect.jdbc with {"level": "DEBUG"}.
  • Error Handling: Proper error handling ensures that the connector doesn’t fail unexpectedly and can ride out transient issues gracefully.
  "errors.tolerance": "all",
  "errors.log.enable": "true"
  Note that the dead letter queue settings (errors.deadletterqueue.*) are honored only by sink connectors, so they have no effect on a JDBC source.
  • Producer Properties Overrides: In addition to the above configurations, overriding some of the underlying producer's settings can improve performance drastically (the Connect worker must permit this via connector.client.config.override.policy=All).
    • "producer.override.acks": "all": Makes the producer wait for acknowledgment from all in-sync replicas, trading a little latency for durability.
    • "producer.override.retries": "10": Caps how many times a record send is retried. In modern Kafka clients the default is effectively unlimited (Integer.MAX_VALUE), bounded by delivery.timeout.ms.
    • "producer.override.batch.size": "100000": Sets the producer batch size in bytes (default 16384). Because batch.max.rows counts rows while batch.size counts bytes, estimate it by multiplying the number of rows you want per batch by the average size of one row; e.g., 500 rows at roughly 200 bytes each suggests a batch size of about 100000 bytes.
    • "producer.override.max.request.size": Sets the maximum request size in bytes (default 1048576, i.e. 1 MB). Increasing it to 3000000 (roughly 3 MB) allows larger batches per request.
    • "producer.override.linger.ms": "100": Controls how long the producer waits for a batch to fill before sending it.
    • "producer.override.buffer.memory": "33554432": Sets the total memory the producer can use to buffer records.
    • "producer.override.compression.type": "lz4": Specifies the compression codec applied to record batches.

Example Configuration

Here’s an example configuration for a Kafka JDBC Source Connector optimized for performance:

{
  "name": "jdbc-source-connector",
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "tasks.max": "10",
  "connection.url": "jdbc:postgresql://localhost:5432/mydb",
  "connection.user": "myuser",
  "connection.password": "mypassword",
  "mode": "timestamp+incrementing",
  "timestamp.column.name": "last_modified",
  "incrementing.column.name": "id",
  "poll.interval.ms": "50000",
  "batch.max.rows": "50000",
  "topic.prefix": "jdbc_",
  "table.whitelist": "my_table",
  "connection.pool.size": "10",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.deadletterqueue.topic.name": "dlq_topic",
  "logging.level": "DEBUG",
  "producer.override.acks": "all",
  "producer.override.retries": "10"
  "producer.override.batch.size": "100000",
  "producer.override.max.request.size":"3000000",
  "producer.override.linger.ms": "100",
  "producer.override.compression.type": "lz4"
}
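
To deploy this, one option is to wrap the properties in the payload format Kafka Connect’s REST API expects for POST /connectors (a trimmed sketch of the same example; the nesting under "config" is the important part):

{
  "name": "jdbc-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/mydb",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "last_modified",
    "incrementing.column.name": "id",
    "topic.prefix": "jdbc_",
    "table.whitelist": "my_table"
  }
}

Alternatively, PUT the flat property map shown above to /connectors/jdbc-source-connector/config, which creates the connector if it does not exist and updates it otherwise.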

Monitoring and Scaling

To effectively tune and maintain performance, continuous monitoring is essential. Use Kafka Connect’s REST API to monitor connector and task statuses (see the sample status check after this list), and the worker’s JMX metrics, such as source-record-poll-rate in the source-task-metrics group, to track throughput. Scaling strategies may include:

  • Horizontal Scaling: Increase the number of connectors and tasks to distribute the load.
  • Vertical Scaling: Upgrade the underlying hardware or increase resources allocated to Kafka Connect.
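
As referenced above, a quick health check is a GET request to /connectors/jdbc-source-connector/status on the Connect worker (port 8083 by default); the response looks roughly like this (worker IDs are placeholders):

{
  "name": "jdbc-source-connector",
  "connector": { "state": "RUNNING", "worker_id": "connect-worker-1:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "connect-worker-1:8083" }
  ],
  "type": "source"
}

A task in the FAILED state additionally carries a trace field with the stack trace, which is usually the fastest way to spot database-side problems.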

Tuning the Kafka JDBC Source Connector involves balancing the load on your database, the connector’s performance, and Kafka’s throughput capabilities. By carefully configuring parameters like the number of tasks, poll intervals, batch sizes, and the detection mode, you can achieve optimal performance. Experiment with the producer overrides, and let continuous monitoring and adjustment based on real-time metrics keep your data pipeline running efficiently.
