Adam Comerford is manager of 10gen’s support team in Dublin.
MongoDB Management Service (MMS) is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this post, we'll outline 5 alerts you should set up in MMS to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.
The goal here is to make this a repeatable process for any and all alerts. First let’s describe the general validation process you should go through when picking an alert:
- Is there an absolute limit to alert on, regardless of context? (The answer is usually no; when it is yes, you will generally want to be alerted well before that limit is reached.)
- Determine what is normal (baseline)
- Determine what is worrying (warning)
- Determine what is a definite problem (critical)
- Determine the likelihood of false positives
The answers to these questions will dictate how to alert on something, the severity of that alert, and whether it is worth alerting on at all.
Each environment is different and will have different requirements for uptime, responsiveness, I/O, CPU, etc. There is no magic formula that fits everyone, nor a shortcut to truly relevant alerting. Getting to that point (without blind luck) always requires a mix of research, testing and tweaking over time.
5 Recommended Alerts
- Host Recovering (All, but by definition Secondary)
- Repl Lag (Secondary)
- Connections (All mongos, mongod)
- Lock % (Primary, Secondary)
- Replica (Primary, Secondary)
1. Host Recovering
The first alert is a basic one: it will notify you if any of your MongoDB instances enters the RECOVERING state (see this page for more information on states). This can be intentional, for example if you resync a secondary node, but if it happens outside of known work, then determining the cause and resolving the issue will be key to keeping a healthy replica set.
As such, there is no need to run through our general evaluation process. This is an example of that rare, definitive alert that everyone running a replica set should have as-is.
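MMS raises this alert for you, but the same check is easy to script. Below is a minimal sketch using pymongo (the connection string and the use of Python here are illustrative assumptions, not part of MMS): it runs replSetGetStatus and flags any member reporting the RECOVERING state.

```python
# Minimal sketch: flag replica set members in the RECOVERING state.
# Assumes pymongo is installed and a replica set member is reachable
# at localhost:27017 (illustrative connection string).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("replSetGetStatus")

for member in status["members"]:
    if member["stateStr"] == "RECOVERING":
        print("ALERT: %s is in RECOVERING state" % member["name"])
```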
2. Repl Lag
Outside of a node with a configured slaveDelay, you do not want to have any of your nodes falling behind. This is particularly important if you are serving reads from your secondary nodes, since a lagging node can present a vastly different view of the data than an up-to-date node.
This one is not as definitive as the first, so let’s go through the process:
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, in an ideal world the absolute limit is relatively low - about 1 or 2 seconds. However, at this threshold there would be false positives, and there can be spikes due to a wayward statistical sample. Hence, I would recommend any lag over 240 seconds as an absolute threshold. |
| Normal | Lag is ideally 0, and for most sets without load issues, this will stay at 0 (barring the aforementioned statistical anomalies). |
| Worrying | Any lag is potentially worrying, so if you can live with the spam you could set one alert low (<60 seconds) and be prepared for false positives. |
| Critical | Over 240 seconds will catch critical errors, and while it may catch an occasional false blip, it should eliminate most false positive noise. |
| False Positives | At the recommended 240 second threshold, the likelihood of false positives in a healthy set is low. |
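For reference, replication lag is essentially the difference between the primary’s last applied optime and each secondary’s. The sketch below (pymongo, with an assumed localhost connection string and the 240-second threshold from the table) shows one way to compute it outside of MMS.

```python
# Sketch: compute replication lag per secondary from replSetGetStatus,
# using the 240-second critical threshold suggested above.
# The connection string is an illustrative assumption.
from pymongo import MongoClient

LAG_CRITICAL = 240  # seconds

client = MongoClient("mongodb://localhost:27017")
members = client.admin.command("replSetGetStatus")["members"]

primary = next(m for m in members if m["stateStr"] == "PRIMARY")
for m in members:
    if m["stateStr"] == "SECONDARY":
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        flag = "CRITICAL" if lag > LAG_CRITICAL else "ok"
        print("%s lag: %.0fs (%s)" % (m["name"], lag, flag))
```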
3. Connections
Connection levels are important for a number of reasons. Each connection consumes a certain amount of resources (1MB of memory and a file descriptor), and that can add up to problems quickly. For example, imagine 5,000 connections consuming 5GB of RAM on a machine. At the very least, high connection counts must be budgeted for when determining appropriate resources on a host.
Overall, stability and predictability are the keys here - knowing how many connections you should have under low/normal/peak load allows you to set alerts appropriately.
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, but it is so high that we need to alert earlier, and the threshold should probably be different for mongos and mongod processes. |
| Normal | The normal level will depend on how many nodes (including mongos nodes in a sharded environment) you have, how many application servers you have, how busy things are, pool settings in the drivers, and the response time of the application. Therefore, to get a “normal” level will require monitoring the connection levels through busy times to get an idea of real world connection usage. For example purposes, let us consider a system that sees 500 connections at its busiest times. |
| Worrying | Worrying would be a 50% increase (750 total for our example) in the number of connections seen at peak. |
| Critical | Once the connection count has doubled versus our normal peak (1000 total for our example), it should be considered a serious issue. |
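To establish your own baseline, the current and available connection counts are reported by serverStatus. The sketch below (pymongo; the 500-connection peak is simply the example figure from the table, not a recommendation) compares the live count against the worrying and critical thresholds.

```python
# Sketch: compare the live connection count against thresholds derived
# from an observed normal peak (500 here, per the example above).
from pymongo import MongoClient

PEAK_NORMAL = 500                    # observed peak under normal load
WORRYING = int(PEAK_NORMAL * 1.5)    # 750: 50% above normal peak
CRITICAL = PEAK_NORMAL * 2           # 1000: double the normal peak

client = MongoClient("mongodb://localhost:27017")
conns = client.admin.command("serverStatus")["connections"]

print("current: %d, available: %d" % (conns["current"], conns["available"]))
if conns["current"] >= CRITICAL:
    print("CRITICAL: connections have doubled versus normal peak")
elif conns["current"] >= WORRYING:
    print("WARNING: connections are 50%+ above normal peak")
```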
4. Lock %
Lock contention can cause the performance of any database to degrade very quickly, and MongoDB is no exception to that rule. A high lock percentage has impacts across the board: it can starve replication and cause application calls to slow down, time out, or fail. Lock contention generally needs to be kept within a reasonable threshold for a database to function smoothly.
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, it is hard to imagine a system that will run smoothly with lock percentage (averaged over a minute) that is hitting more than 80%, and certainly 90% is going to see major impact on a regular basis. |
| Normal | The normal level will vary greatly from system to system, and at present it is only possible to alert on the global lock percentage (rather than per-database). Therefore normality will (like connection count above) depend very much on your usage. A write-heavy system might regularly see >60% lock whereas a read-heavy system may never see anything above 10%. |
| Worrying | Once you establish a baseline expectation, anything that doubles your lock percentage (unless it is really low, of course) should cause concern. |
| Critical | This is something of a judgement call, but the absolute limits mentioned above are a good starting point. Any lock percentage above 80% for a production system is certainly something you will want to know about as soon as possible. |
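If you want to sanity-check the MMS figure, the global lock percentage can be approximated by sampling serverStatus twice and comparing the lockTime and totalTime counters (these counters are present in the MongoDB 2.x releases current as of this writing; later versions report locking differently). A rough sketch:

```python
# Sketch: approximate global lock % over one minute from two
# serverStatus samples. Relies on globalLock.lockTime / totalTime,
# which exist in MongoDB 2.x; later versions changed these fields.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def sample():
    gl = client.admin.command("serverStatus")["globalLock"]
    return gl["lockTime"], gl["totalTime"]  # both in microseconds

lock1, total1 = sample()
time.sleep(60)
lock2, total2 = sample()

lock_pct = 100.0 * (lock2 - lock1) / (total2 - total1)
print("global lock %% over the last minute: %.1f" % lock_pct)
if lock_pct > 80:
    print("CRITICAL: sustained lock percentage above 80%")
```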
5. Replica
The name and purpose of this statistic are less obvious than those of the others we have looked at above. Replica is calculated by taking the timestamp of the most recent entry in the oplog and subtracting the timestamp of the oldest entry. The result is what is commonly referred to as the “oplog window” - the amount of time, at your current traffic levels, that it takes to completely roll over the oplog.
Hence this is directly derived from three factors: how many operations you are inserting into the oplog, what size those operations are, and the size of the oplog capped collection. A good rule of thumb is to take your normal maximum maintenance window (let’s say 8 hours) and then multiply it by 3 (24 hours) for safety. This should then be the minimum length of your oplog window.
Resizing the oplog is non-trivial, so you want to catch issues here as soon as possible and take remedial action before the value gets too low.
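Because the value is just the difference between the newest and oldest oplog entries, you can also compute it directly. The sketch below (pymongo, illustrative connection string) reads the first and last documents of local.oplog.rs by natural order and reports the window in hours.

```python
# Sketch: compute the oplog window (the "Replica" value) by subtracting
# the oldest oplog entry's timestamp from the newest one's.
import pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
oplog = client.local["oplog.rs"]

first = oplog.find_one(sort=[("$natural", pymongo.ASCENDING)])
last = oplog.find_one(sort=[("$natural", pymongo.DESCENDING)])

# "ts" is a BSON Timestamp; .time is seconds since the epoch
window_seconds = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_seconds / 3600.0))
```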
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes. The oplog window should never drop below the minimum calculated above (three times your longest maintenance window - 24 hours in our example); below that, a node taken offline for maintenance risks falling off the end of the oplog and needing a full resync. |
| Normal | Normality will depend on the factors mentioned. Similar to the connection rate, you will want to benchmark the range of replica values you get at low/normal/peak times. |
| Worrying | A drop of 25% below the replica value you observe during your peak traffic period would be worrying. |
| Critical | A drop of 50% below the replica value you observe during your peak traffic period should be considered critical and actioned immediately. |
Conclusion
The methods above are valid for evaluating any such alert in (or outside of) MMS. You may be surprised that a key metric (Queues, for example) is not included above. An alert can be added for any such metric in essentially the same way and, as with the levels you choose to alert on, every system will have the alerts that are most relevant to it.
As implied by the “worrying” versus “critical” evaluation, you are also free to have more than one alert, at different levels. Imagine an alert that only hits the ops team when lock percentage goes above 60% (perhaps as a prelude to a capacity discussion), and then an alert for a much wider audience when the more critical level of 80% is breached and all eyes are needed on the issue.
To get started with monitoring and alerting, you can create a free account for MongoDB Management Service at mms.10gen.com.