Intro
Tailable cursors, and in particular tailing MongoDB’s oplog, are a popular feature with many uses, such as real-time notifications of all the changes to your database. A tailable cursor is conceptually the same as the Unix "tail -f" command: once you've reached the end of the result set, the cursor is not closed; rather, it continues to wait for new data and, when it arrives, returns that too.
Tailing the oplog is very simple for replica sets, but when it comes to sharded clusters things are a little more complex. In this post we explain how to tail MongoDB’s oplog in a sharded cluster.
Why tail the oplog?
Tailable cursors can be used on capped collections and are often used for publish-subscribe style data flows. In particular, MongoDB's oplog, which is used internally for replication, is a capped collection, and secondaries use a tailable cursor on it to fetch the operations they need to replicate.
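As a minimal, self-contained illustration of the pub-sub idea (the collection name and size here are made up for the example), you could create a small capped collection and tail it yourself from the mongo shell:

// create a 1 MB capped collection to act as a simple message queue (hypothetical name)
db.createCollection( "events", { capped : true, size : 1048576 } )
db.events.insert( { msg : "hello subscribers" } )

// tail it: the cursor keeps returning new documents as they are inserted
var cur = db.events.find().addOption( DBQuery.Option.tailable ).addOption( DBQuery.Option.awaitData )
while ( cur.hasNext() ) {
    printjson( cur.next() )
}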
Third-party tools in the ETL and heterogeneous replication space can also read events from the MongoDB oplog. For example, Mongo Connector and the MongoDB ElasticSearch River do exactly that.
But with such a powerful interface, there's more we can do than just replication! Reactive programming has become a dominant paradigm, especially in HTML5 / JavaScript applications. Several modern JavaScript frameworks will update the user interface immediately and automatically as you change some value in your data model.
Tailing a MongoDB collection, or the entire database by way of tailing the oplog, is a perfect match for such a programming model! It means the application server is notified in real time of any changes happening anywhere in the database.
In fact, one wonderful JavaScript framework is already doing this: Meteor. They have a cool video demo on their site; check it out! This makes Meteor a full stack reactive platform: changes propagate automatically all the way from the database to the UI.
Reading the oplog with a tailable cursor
Here's an example of how to open a tailable cursor from the mongo shell:
shard01:PRIMARY> c = db.oplog.rs.find( { fromMigrate : { $exists : false } } ).addOption( DBQuery.Option.tailable ).addOption( DBQuery.Option.awaitData )
{ "ts" : Timestamp(1422998530, 1), "h" : NumberLong(0), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : Timestamp(1422998574, 1), "h" : NumberLong("-6781014703318499311"), "v" : 2, "op" : "i", "ns" : "test.mycollection", "o" : { "_id" : 1, "data" : "hello" } }
{ "ts" : Timestamp(1422998579, 1), "h" : NumberLong("-217362260421471244"), "v" : 2, "op" : "i", "ns" : "test.mycollection", "o" : { "_id" : 3, "data" : "hello" } }
{ "ts" : Timestamp(1422998584, 1), "h" : NumberLong("7215322058367374253"), "v" : 2, "op" : "i", "ns" : "test.mycollection", "o" : { "_id" : 5, "data" : "hello" } }
shard01:PRIMARY> c.hasNext()
true
shard01:PRIMARY> c.next()
{
"ts" : Timestamp(1423049506, 1),
"h" : NumberLong("5775895302295493166"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 12,
"data" : "hello"
}
}
shard01:PRIMARY> c.hasNext()
false
As you can see, when used from the shell the cursor does not wait forever; it times out after a few seconds. You can then use the hasNext() and next() methods to check whether any new data has arrived. And it has!
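If you want the shell to keep following the oplog instead of stopping after the timeout, you can keep asking the same cursor for more. This is just a sketch of the idea:

// drain whatever is available, then wait and ask the same cursor again
while ( ! c.isExhausted() ) {
    while ( c.hasNext() ) {
        printjson( c.next() );
    }
    sleep( 1000 );   // the await period expired with no new data; pause briefly before retrying
}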
You can of course apply any filter to find() to capture only the events you want. For example, this is what Meteor's tailing cursor looks like as seen in db.currentOp() (a stand-alone sketch of a similar filter follows the output):
meteor:PRIMARY> db.currentOp()
{
"inprog" : [
{
"opid" : 345,
"active" : true,
"secs_running" : 4,
"op" : "getmore",
"ns" : "local.oplog.rs",
"query" : {
"ns" : {
"$regex" : "^meteor\\."
},
"$or" : [
{
"op" : {
"$in" : [
"i",
"u",
"d"
]
}
},
{
"op" : "c",
"o.drop" : {
"$exists" : true
}
}
],
"ts" : {
"$gt" : Timestamp(1422200128, 7)
}
},
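To build a similar filter yourself, you can combine a namespace match, the operation types you care about, and a starting timestamp in a single find(). The database prefix and timestamp below are placeholders for your own values:

// resume after the last event we have already processed (placeholder timestamp)
var lastTs = Timestamp(1422200128, 7)
var tail = db.getSiblingDB("local").oplog.rs.find( {
    ns : { $regex : "^test\\." },                      // only our application's database
    $or : [
        { op : { $in : [ "i", "u", "d" ] } },          // inserts, updates, deletes
        { op : "c", "o.drop" : { $exists : true } }    // collection drops
    ],
    ts : { $gt : lastTs }
} ).addOption( DBQuery.Option.tailable ).addOption( DBQuery.Option.awaitData )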
Tailing the Oplog on sharded clusters
But what happens when you use sharding? Well, first of all, you'll have to tail the oplog of each shard separately.
That's still doable, but there are more complications. In a sharded cluster the MongoDB balancer will occasionally move data from one shard to another. This means that on one shard you will see a bunch of deletes, and on another shard you'll simultaneously see a corresponding bunch of inserts. These, however, are purely a MongoDB-internal matter. If you were tailing the oplog to capture changes in the database, you most likely wouldn't want to see these internal events and might even be confused by them. For example, a Meteor app tailing the oplogs of a sharded cluster might mysteriously delete some data all of a sudden!
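As a side note, a tool that needs to tail every shard can discover them by reading the config database through the mongos; here's a sketch of the idea:

// run against the mongos: list the shards whose oplogs need to be tailed
db.getSiblingDB("config").shards.find().forEach( function (s) {
    // s.host looks like "<setName>/<host1:port>,<host2:port>,..." - open one tailing connection per shard
    print( s._id + " -> " + s.host );
} );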
Let me illustrate. First, let's set up a sharded cluster using the mlaunch utility:
$ mlaunch --sharded 2 --replicaset
launching: mongod on port 27018
launching: mongod on port 27019
launching: mongod on port 27020
launching: mongod on port 27021
launching: mongod on port 27022
launching: mongod on port 27023
launching: config server on port 27024
replica set 'shard01' initialized.
replica set 'shard02' initialized.
launching: mongos on port 27017
adding shards. can take up to 30 seconds...
Now I'll connect to the mongos, shard a collection and insert some data into it:
$ mongo
MongoDB shell version: 2.6.7
connecting to: test
mongos> sh.enableSharding( "test" )
{ "ok" : 1 }
mongos> sh.shardCollection( "test.mycollection", { _id : 1 }, true )
{ "collectionsharded" : "test.mycollection", "ok" : 1 }
mongos> db.mycollection.insert( { _id : 1, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 3, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 5, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 7, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 9, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 11, data : "hello" } )
WriteResult({ "nInserted" : 1 })
If we connect to the mongod on shard01, we can see that all the data is there. We can also see the insert events in the oplog:
$ mongo --port 27018
MongoDB shell version: 2.6.7
connecting to: 127.0.0.1:27018/test
shard01:PRIMARY> show collections
mycollection
system.indexes
shard01:PRIMARY> db.mycollection.find()
{ "_id" : 1, "data" : "hello" }
{ "_id" : 3, "data" : "hello" }
{ "_id" : 5, "data" : "hello" }
{ "_id" : 7, "data" : "hello" }
{ "_id" : 9, "data" : "hello" }
{ "_id" : 11, "data" : "hello" }
shard01:PRIMARY> use local
switched to db local
shard01:PRIMARY> show collections
me
oplog.rs
slaves
startup_log
system.indexes
system.replset
shard01:PRIMARY> db.oplog.rs.find().pretty()
{
"ts" : Timestamp(1422998530, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422998574, 1),
"h" : NumberLong("-6781014703318499311"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 1,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998579, 1),
"h" : NumberLong("-217362260421471244"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 3,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998584, 1),
"h" : NumberLong("7215322058367374253"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 5,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998588, 1),
"h" : NumberLong("-5372877897993278968"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 7,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998591, 1),
"h" : NumberLong("-243188455606213719"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 9,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998597, 1),
"h" : NumberLong("5040618552262309692"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 11,
"data" : "hello"
}
}
On shard02 there's nothing so far, because there's still so little data that the balancer hasn't run. Let's split the data into 2 chunks; this will trigger a balancer round:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54d13c0555c0347d23e33cdd")
}
shards:
{ "_id" : "shard01", "host" : "shard01/hingo-sputnik:27018,hingo-sputnik:27019,hingo-sputnik:27020" }
{ "_id" : "shard02", "host" : "shard02/hingo-sputnik:27021,hingo-sputnik:27022,hingo-sputnik:27023" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : true, "primary" : "shard01" }
test.mycollection
shard key: { "_id" : 1 }
chunks:
shard01 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard01 Timestamp(1, 0)
mongos> sh.splitAt( "test.mycollection", { _id : 6 } )
{ "ok" : 1 }
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54d13c0555c0347d23e33cdd")
}
shards:
{ "_id" : "shard01", "host" : "shard01/hingo-sputnik:27018,hingo-sputnik:27019,hingo-sputnik:27020" }
{ "_id" : "shard02", "host" : "shard02/hingo-sputnik:27021,hingo-sputnik:27022,hingo-sputnik:27023" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : true, "primary" : "shard01" }
test.mycollection
shard key: { "_id" : 1 }
chunks:
shard02 1
shard01 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : 6 } on : shard02 Timestamp(2, 0)
{ "_id" : 6 } -->> { "_id" : { "$maxKey" : 1 } } on : shard01 Timestamp(2, 1)
mongos>
As you can see, the collection is now split into 2 chunks, and the balancer has done its job and spread them evenly across the shards. If we go back to shard01, we can see that half of the records have disappeared (entries with "op" : "d" are deletions):
shard01:PRIMARY> use test
switched to db test
shard01:PRIMARY> db.mycollection.find()
{ "_id" : 7, "data" : "hello" }
{ "_id" : 9, "data" : "hello" }
{ "_id" : 11, "data" : "hello" }
shard01:PRIMARY> use local
switched to db local
shard01:PRIMARY> db.oplog.rs.find().pretty()
{
"ts" : Timestamp(1422998530, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422998574, 1),
"h" : NumberLong("-6781014703318499311"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 1,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998579, 1),
"h" : NumberLong("-217362260421471244"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 3,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998584, 1),
"h" : NumberLong("7215322058367374253"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 5,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998588, 1),
"h" : NumberLong("-5372877897993278968"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 7,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998591, 1),
"h" : NumberLong("-243188455606213719"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 9,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998597, 1),
"h" : NumberLong("5040618552262309692"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 11,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998892, 1),
"h" : NumberLong("3056127588031004421"),
"v" : 2,
"op" : "d",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 1
}
}
{
"ts" : Timestamp(1422998892, 2),
"h" : NumberLong("-7633416138502997855"),
"v" : 2,
"op" : "d",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 3
}
}
{
"ts" : Timestamp(1422998892, 3),
"h" : NumberLong("1499304029305069766"),
"v" : 2,
"op" : "d",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 5
}
}
shard01:PRIMARY>
And on shard02 we can see the same records appearing:
$ mongo --port 27021
MongoDB shell version: 2.6.7
connecting to: 127.0.0.1:27021/test
shard02:PRIMARY> db.mycollection.find()
{ "_id" : 1, "data" : "hello" }
{ "_id" : 3, "data" : "hello" }
{ "_id" : 5, "data" : "hello" }
shard02:PRIMARY> use local
switched to db local
shard02:PRIMARY> db.oplog.rs.find().pretty()
{
"ts" : Timestamp(1422998531, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422998890, 1),
"h" : NumberLong("-6780991630754185199"),
"v" : 2,
"op" : "i",
"ns" : "test.system.indexes",
"fromMigrate" : true,
"o" : {
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test.mycollection"
}
}
{
"ts" : Timestamp(1422998890, 2),
"h" : NumberLong("-165956952201849851"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 1,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998890, 3),
"h" : NumberLong("-7432242710082771022"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 3,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998890, 4),
"h" : NumberLong("6790671206092100026"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 5,
"data" : "hello"
}
}
If we again insert some more data...
mongos> db.mycollection.insert( { _id : 2, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 4, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 6, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 8, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.insert( { _id : 10, data : "hello" } )
WriteResult({ "nInserted" : 1 })
mongos> db.mycollection.find()
{ "_id" : 1, "data" : "hello" }
{ "_id" : 7, "data" : "hello" }
{ "_id" : 3, "data" : "hello" }
{ "_id" : 9, "data" : "hello" }
{ "_id" : 5, "data" : "hello" }
{ "_id" : 11, "data" : "hello" }
{ "_id" : 2, "data" : "hello" }
{ "_id" : 6, "data" : "hello" }
{ "_id" : 4, "data" : "hello" }
{ "_id" : 8, "data" : "hello" }
{ "_id" : 10, "data" : "hello" }
...then these inserts appear as expected on shard01...
shard01:PRIMARY> use local
switched to db local
shard01:PRIMARY> db.oplog.rs.find().pretty()
...beginning is the same as above, omitted for brevity ...
{
"ts" : Timestamp(1422998892, 3),
"h" : NumberLong("1499304029305069766"),
"v" : 2,
"op" : "d",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 5
}
}
{
"ts" : Timestamp(1422999422, 1),
"h" : NumberLong("-6691556866108433789"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 6,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999426, 1),
"h" : NumberLong("-3908881761176526422"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 8,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999429, 1),
"h" : NumberLong("-4997431625184830993"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 10,
"data" : "hello"
}
}
shard01:PRIMARY>
...and on shard02:
shard02:PRIMARY> use local
switched to db local
shard02:PRIMARY> db.oplog.rs.find().pretty()
{
"ts" : Timestamp(1422998531, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422998890, 1),
"h" : NumberLong("-6780991630754185199"),
"v" : 2,
"op" : "i",
"ns" : "test.system.indexes",
"fromMigrate" : true,
"o" : {
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test.mycollection"
}
}
{
"ts" : Timestamp(1422998890, 2),
"h" : NumberLong("-165956952201849851"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 1,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998890, 3),
"h" : NumberLong("-7432242710082771022"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 3,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998890, 4),
"h" : NumberLong("6790671206092100026"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 5,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999414, 1),
"h" : NumberLong("8160426227798471967"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 2,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999419, 1),
"h" : NumberLong("-3554656302824563522"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 4,
"data" : "hello"
}
}
shard02:PRIMARY>
Separating internal operations from application operations
So if an application like Meteor were reading the above, it would certainly be challenging to figure out what the end state of the data model is. If we simply combine the oplog events from both shards, it looks like there have been these inserts and deletes:
insert | 1 |
insert | 3 |
insert | 5 |
insert | 7 |
insert | 9 |
insert | 11 |
insert | 1 |
insert | 3 |
insert | 5 |
delete | 1 |
delete | 3 |
delete | 5 |
insert | 2 |
insert | 4 |
insert | 6 |
insert | 8 |
insert | 10 |
So, given the above sequence, do the documents with _id 1, 3 and 5 exist in the data or not?
Fortunately, it is possible to distinguish cluster-internal operations from application operations. You may have noticed that the operations caused by the migrations have a fromMigrate flag set:
{
"ts" : Timestamp(1422998890, 2),
"h" : NumberLong("-165956952201849851"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"fromMigrate" : true,
"o" : {
"_id" : 1,
"data" : "hello"
}
}
Since we are only interested in operations that actually alter the database state when taking the cluster as a whole, we can filter out everything where this flag is set. Note that the correct way is to test for the field's absence with $exists: regular operations don't carry the fromMigrate field at all, so matching on the value false returns nothing, as the first query below shows:
shard01:PRIMARY> db.oplog.rs.find( { fromMigrate : false } ).pretty()
shard01:PRIMARY> db.oplog.rs.find( { fromMigrate : { $exists : false } } ).pretty()
{
"ts" : Timestamp(1422998530, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422998574, 1),
"h" : NumberLong("-6781014703318499311"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 1,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998579, 1),
"h" : NumberLong("-217362260421471244"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 3,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998584, 1),
"h" : NumberLong("7215322058367374253"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 5,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998588, 1),
"h" : NumberLong("-5372877897993278968"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 7,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998591, 1),
"h" : NumberLong("-243188455606213719"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 9,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422998597, 1),
"h" : NumberLong("5040618552262309692"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 11,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999422, 1),
"h" : NumberLong("-6691556866108433789"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 6,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999426, 1),
"h" : NumberLong("-3908881761176526422"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 8,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999429, 1),
"h" : NumberLong("-4997431625184830993"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 10,
"data" : "hello"
}
}
shard01:PRIMARY>
And on shard02:
shard02:PRIMARY> db.oplog.rs.find( { fromMigrate : { $exists : false } } ).pretty()
{
"ts" : Timestamp(1422998531, 1),
"h" : NumberLong(0),
"v" : 2,
"op" : "n",
"ns" : "",
"o" : {
"msg" : "initiating set"
}
}
{
"ts" : Timestamp(1422999414, 1),
"h" : NumberLong("8160426227798471967"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 2,
"data" : "hello"
}
}
{
"ts" : Timestamp(1422999419, 1),
"h" : NumberLong("-3554656302824563522"),
"v" : 2,
"op" : "i",
"ns" : "test.mycollection",
"o" : {
"_id" : 4,
"data" : "hello"
}
}
shard02:PRIMARY>
Exactly what we want!
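Putting the pieces together, here is a minimal sketch of a per-shard tailing loop in the mongo shell; run it against each shard's primary (the namespace filter and sleep interval are just example choices):

// tail one shard's oplog, skipping chunk-migration traffic
var oplog = db.getSiblingDB("local").oplog.rs;
var cur = oplog.find( { fromMigrate : { $exists : false }, ns : { $regex : "^test\\." } } )
               .addOption( DBQuery.Option.tailable )
               .addOption( DBQuery.Option.awaitData );

while ( ! cur.isExhausted() ) {
    while ( cur.hasNext() ) {
        var doc = cur.next();
        // doc.op is "i"/"u"/"d"/"c"/"n", doc.ns is the namespace, doc.o is the document or change
        printjson( { ts : doc.ts, op : doc.op, ns : doc.ns, o : doc.o } );
    }
    sleep( 1000 );   // nothing new right now; pause briefly and keep tailing
}

A real change-capture tool would additionally remember the last ts it processed on each shard, so that it can resume from that point after a restart.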