As businesses continue to embrace event-driven architectures and tackle Big Data opportunities, they are finding great success integrating Apache Kafka and MongoDB. These two complementary technologies provide the power and flexibility to solve these large-scale challenges. Today, MongoDB continues to invest in the MongoDB Connector for Apache Kafka with the release of version 1.4! Over the past few months, we’ve been collecting feedback and learning how best to help our customers integrate MongoDB within the Apache Kafka ecosystem. This article highlights some of the key features of this new release.
Selective Replication in MongoDB
Being able to track just the data that has changed is an important use case in many solutions. Change Data Capture (CDC) has been available on the sink since the original version of the connector. However, up until version 1.4, CDC events could only be sourced from MongoDB via the Debezium MongoDB Connector. With the latest release you can specify the new MongoDB Change Stream Handler on the sink to read and replay MongoDB events produced by the MongoDB Connector for Apache Kafka itself. This feature enables you to record insert, update, and delete activity on a namespace in MongoDB and replay it on a destination MongoDB cluster. In effect, you have a lightweight way to perform basic replication of MongoDB data via Kafka.
Let’s dive in and see what is happening under the hood. Recall that when the connector is used as a source, it opens a change stream against a specific MongoDB namespace. Depending on how you configure the source connector, documents from this namespace that match your pipeline criteria are written to a Kafka topic. By default, these documents are in the change stream event format. Here is a partial message in the Kafka topic that was generated from the following statement: db.Source.insert({proclaim: "Hello World!"});
{
  "schema": {
    "type": "string",
    "optional": false
  },
  "payload": {
    "_id": {
      "_data": "82600B38...."
    },
    "operationType": "insert",
    "clusterTime": {
      "$timestamp": {
        "t": 1611348141,
        "i": 2
      }
    },
    "fullDocument": {
      "_id": {
        "$oid": "600b38ad6011ef6265c3acd1"
      },
      "proclaim": "Hello World!"
    },
    "ns": {
      "db": "Tutorial3",
      "coll": "Source"
    },
    "documentKey": {
      "_id": {
        "$oid": "600b38ad6011ef6265c3acd1"
      }
    }
  }
}
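For reference, a minimal source connector configuration that could generate events like the one above might look like the following sketch; the connector name and connection string are placeholders, while the database and collection match the Tutorial3.Source namespace from the example:

{"name": "mongo-source",
"config": {
"connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri":"mongodb://mongo1:27017,mongo2:27017,mongo3:27017",
"database":"Tutorial3",
"collection":"Source"
}}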

Now that our change stream message is in the Kafka topic, we can use the connector as a sink to read the stream of messages and replay them at the destination cluster. To set up the sink to consume these events, set the "change.data.capture.handler" property to the new com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler class.
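As a sketch, a sink configured this way to replay events into a destination cluster could look like the following; the connector name, destination connection string, and destination namespace are placeholders, and the topic matches the source example above:

{"name": "mongo-replay-sink",
"config": {
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"topics":"Tutorial3.Source",
"connection.uri":"mongodb://destination1:27017",
"database":"Tutorial3",
"collection":"Destination",
"change.data.capture.handler":"com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
}}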
Notice that one of the fields is "operationType". The sink connector supports only insert, update, and delete operations on the namespace; it does not replay actions such as the creation of database objects (users, namespaces, indexes, views, and other metadata) that more traditional replication solutions handle. In addition, this capability is not intended as a replacement for a full-featured replication system, as it cannot guarantee transactional consistency between the two clusters. That said, if all you need to do is move data and you can accept that limitation, the new ChangeStreamHandler gives you a simple solution.
To work through a tutorial on this new feature, check out Tutorial 3 of the MongoDB Connector for Apache Kafka Tutorials on GitHub.
Dynamic Namespace Mapping
When we use the MongoDB connector as a sink, we take data that resides on a Kafka topic and insert it into a collection. Prior to 1.4, once this mapping was defined it wasn’t possible to route topic data to another collection. In this release we added the ability to dynamically map a namespace based on the contents of the Kafka topic message.
For example, consider a Kafka topic "Customers.Orders" that contains the following messages:
{"orderid":1,"country":"ES"}

{"orderid":2,"country":"US"}

We would like these messages to be placed in their own collections based upon the country value. Thus, the message with "orderid": 1 will be copied into a collection called "ES", and the message with "orderid": 2 will be copied into a collection called "US".
To see how we configure this scenario, we will define a sink using the new namespace.mapper property configured with a value of "com.mongodb.kafka.connect.sink.namespace.mapping.FieldPathNamespaceMapper". Using this mapper, we can use a key or value field to determine the database and collection, respectively. In our example above, let’s define our config so that the value of the country field is used as the collection name to sink to:
'{"name": "mongo-dynamic-sink",
"config": {
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"topics":"Customers.Orders",
"connection.uri":"mongodb://mongo1:27017,mongo2:27017,mongo3:27017",
"database":"Orders",
"collection":"Other"
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false",
"namespace.mapper":"com.mongodb.kafka.connect.sink.namespace.mapping.FieldPathNamespaceMapper",
"namespace.mapper.value.collection.field":"country" }}

Messages that do not have a country value will, by default, be written to the namespace defined in the configuration, just as they would have been without the mapping. However, if you want messages that do not conform to the map to generate an error, simply set the property namespace.mapper.error.if.invalid to true. This will raise an error and stop the connector when messages cannot be mapped to a namespace due to missing fields or fields that are not strings.
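For example, the relevant portion of the sink configuration above would look like this with strict mapping enabled; only the last property is new:

"namespace.mapper":"com.mongodb.kafka.connect.sink.namespace.mapping.FieldPathNamespaceMapper",
"namespace.mapper.value.collection.field":"country",
"namespace.mapper.error.if.invalid":"true"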
If you’d like to have more control over the namespace, you can implement the getNamespace method of the interface com.mongodb.kafka.connect.sink.namespace.mapping.NamespaceMapper. Implementations of this method can apply more complex business rules and can access the SinkRecord or SinkDocument as part of the logic to determine the destination namespace.
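If you write such an implementation, you would register it by pointing the namespace.mapper property at your own class instead of the FieldPathNamespaceMapper. The class name below is purely hypothetical, just to illustrate the wiring:

"namespace.mapper":"com.example.kafka.RegionAwareNamespaceMapper"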
Dynamic Topic Mapping
Once the source connector is configured, change stream events flow from the namespace defined in the connector to a Kafka topic. The name of the Kafka topic is made up of three configuration parameters: topic.prefix, database, and collection. For example, if you had the following as part of your source connector configuration:
"topic.prefix":"Stocks",
"database":"Customers",
"collection":"Orders"

The Kafka topic that would be created would be "Stocks.Customers.Orders". However, what if you didn’t want the events in the Orders collection to always go to this specific topic? What if you wanted to determine at run-time which topic a specific message should be routed to?
In 1.4 you can now specify a namespace map that defines which Kafka topic a namespace should be written to. For example, consider the following map:
{"Customers": "CustomerTopic",
 "Customers.Orders": "Orders"}

This will map all change stream documents from the Customers database to CustomerTopic.<collectionName>, apart from any documents from the Customers.Orders namespace, which map to the Orders topic.
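The map itself is supplied to the source connector as an escaped JSON string. Here is a sketch of what that could look like, assuming the topic.namespace.map property introduced alongside this feature and placeholder connection details:

{"name": "mongo-source-mapped",
"config": {
"connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri":"mongodb://mongo1:27017,mongo2:27017,mongo3:27017",
"database":"Customers",
"topic.namespace.map":"{\"Customers\": \"CustomerTopic\", \"Customers.Orders\": \"Orders\"}"
}}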
If you need to use complex business logic to determine the route, you can implement the getTopic method in the new TopicMapper class to handle this mapping logic.
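As with the sink-side namespace mapper, a custom implementation would be plugged in by class name; the source connector's topic.mapper property is the assumed registration point here, and the class below is hypothetical:

"topic.mapper":"com.example.kafka.OrderRoutingTopicMapper"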
Also note that 1.4 introduces a topic.suffix configuration property in addition to topic.prefix. Using our example above, you can configure:
"topic.prefix":"Stocks",
"database":"Customers",
"collection":"Orders",
"topic.suffix":"US"

This will define the topic to write to as "Stocks.Customers.Orders.US".
Next Steps
Download the latest MongoDB Connector for Apache Kafka 1.4 from the Confluent Hub!
Read the MongoDB Connector for Apache Kafka documentation
Questions/Need help with the connector? Ask the Community
Have a feature request? Provide feedback or file a JIRA