Adam Comerford is manager of 10gen’s support team in Dublin.
MongoDB Management Service (MMS) is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this post, we'll outline 5 alerts you should set up in MMS to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.
The goal here is to make this a repeatable process for any and all alerts. First let’s describe the general validation process you should go through when picking an alert:
- Is there an absolute limit to alert on, regardless of context? (The answer is usually no; when it is yes, you will generally want to be alerted well before that limit is reached.)
- Determine what is normal (baseline)
- Determine what is worrying (warning)
- Determine what is a definite problem (critical)
- Determine the likelihood of false positives
The answers to these questions will dictate how to alert on something, the severity of that alert, and whether it is worth alerting on at all.
Each environment is different and will have different requirements for uptime, responsiveness, I/O, CPU, etc. There is no magic formula that fits everyone, nor a shortcut to truly relevant alerting. Getting to that point (without blind luck) always requires a mix of research, testing and tweaking over time.
5 Recommended Alerts
- Host Recovering (All, but by definition Secondary)
- Repl Lag (Secondary)
- Connections (All mongos, mongod)
- Lock % (Primary, Secondary)
- Replica (Primary, Secondary)
1. Host Recovering
The first alert is a basic one: it will notify you if any of your MongoDB instances enters the RECOVERING state (see this page for more information on states). This can be intentional, for example if you resync a secondary node, but if it happens outside of known work, then determining the cause and resolving the issue will be key to keeping a healthy replica set.
As such, there is no need to run through our general evaluation process. This is an example of that rare, definitive alert that everyone running a replica set should have as-is.
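MMS raises this alert for you, but the same check is easy to script. Below is a minimal sketch using pymongo (the connection string and the use of Python here are illustrative assumptions, not part of MMS): it runs replSetGetStatus and flags any member reporting the RECOVERING state.

```python
# Minimal sketch: flag replica set members in the RECOVERING state.
# Assumes pymongo is installed and a replica set member is reachable
# at localhost:27017 (illustrative connection string).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("replSetGetStatus")

for member in status["members"]:
    if member["stateStr"] == "RECOVERING":
        print("ALERT: %s is in RECOVERING state" % member["name"])
```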
2. Repl Lag
Outside of a node with a configured slaveDelay, you do not want to have any of your nodes falling behind. This is particularly important if you are serving reads from your secondary nodes, since a lagging node can present a vastly different view of the data than an up-to-date node.
This one is not as definitive as the first, so let’s go through the process:
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, in an ideal world the absolute limit is relatively low - about 1 or 2 seconds. However, at this threshold there would be false positives, and there can be spikes due to a wayward statistical sample. Hence, I would recommend any lag over 240 seconds as an absolute threshold. |
| Normal | Lag is ideally 0, and for most sets without load issues, this will stay at 0 (barring the aforementioned statistical anomalies). |
| Worrying | Any lag is potentially worrying, so if you can live with the spam you could set one alert low (<60 seconds) and be prepared for false positives. |
| Critical | Over 240 seconds will catch critical errors, and while it may catch an occasional false blip, it should eliminate most false positive noise. |
| False Positives | At the recommended 240 second threshold, the likelihood of false positives in a healthy set is low. |
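For reference, replication lag is essentially the difference between the primary’s last applied optime and each secondary’s. The sketch below (pymongo, with an assumed localhost connection string and the 240-second threshold from the table) shows one way to compute it outside of MMS.

```python
# Sketch: compute replication lag per secondary from replSetGetStatus,
# using the 240-second critical threshold suggested above.
# The connection string is an illustrative assumption.
from pymongo import MongoClient

LAG_CRITICAL = 240  # seconds

client = MongoClient("mongodb://localhost:27017")
members = client.admin.command("replSetGetStatus")["members"]

primary = next(m for m in members if m["stateStr"] == "PRIMARY")
for m in members:
    if m["stateStr"] == "SECONDARY":
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        flag = "CRITICAL" if lag > LAG_CRITICAL else "ok"
        print("%s lag: %.0fs (%s)" % (m["name"], lag, flag))
```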
3. Connections
Connection levels are important for a number of reasons. Each connection consumes a certain amount of resources (1MB of memory and a file descriptor), and that can add up to problems quickly. For example, imagine 5,000 connections consuming 5GB of RAM on a machine. At the very least, high connection counts must be budgeted for when determining appropriate resources on a host.
Overall, stability and predictability are the keys here - knowing how many connections you should have under low/normal/peak load allows you to set alerts appropriately.
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, but it is so high that we need to alert earlier, and the threshold should probably be different for mongos and mongod processes. |
| Normal | The normal level will depend on how many nodes (including mongos nodes in a sharded environment) you have, how many application servers you have, how busy things are, pool settings in the drivers, and the response time of the application. Therefore, to get a “normal” level will require monitoring the connection levels through busy times to get an idea of real world connection usage. For example purposes, let us consider a system that sees 500 connections at its busiest times. |
| Worrying | Worrying would be a 50% increase (750 total for our example) in the number of connections seen at peak. |
| Critical | Once the connection count has doubled versus our normal peak (1000 total for our example), it should be considered a serious issue. |
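To establish your own baseline, the current and available connection counts are reported by serverStatus. The sketch below (pymongo; the 500-connection peak is simply the example figure from the table, not a recommendation) compares the live count against the worrying and critical thresholds.

```python
# Sketch: compare the live connection count against thresholds derived
# from an observed normal peak (500 here, per the example above).
from pymongo import MongoClient

PEAK_NORMAL = 500                    # observed peak under normal load
WORRYING = int(PEAK_NORMAL * 1.5)    # 750: 50% above normal peak
CRITICAL = PEAK_NORMAL * 2           # 1000: double the normal peak

client = MongoClient("mongodb://localhost:27017")
conns = client.admin.command("serverStatus")["connections"]

print("current: %d, available: %d" % (conns["current"], conns["available"]))
if conns["current"] >= CRITICAL:
    print("CRITICAL: connections have doubled versus normal peak")
elif conns["current"] >= WORRYING:
    print("WARNING: connections are 50%+ above normal peak")
```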
4. Lock %
Lock contention can cause the performance of any database to degrade very quickly, and MongoDB is no exception to that rule. A high lock percentage has impacts across the board: it can starve replication and cause application calls to slow down, time out, or fail. Lock contention generally needs to be kept within a reasonable threshold for a database to function smoothly.
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes, it is hard to imagine a system that will run smoothly with lock percentage (averaged over a minute) that is hitting more than 80%, and certainly 90% is going to see major impact on a regular basis. |
| Normal | The normal level will vary greatly from system to system, and at present it is only possible to alert on the global lock percentage (rather than per-database). Therefore normality will (like connection count above) depend very much on your usage. A write-heavy system might regularly see >60% lock whereas a read-heavy system may never see anything above 10%. |
| Worrying | Once you establish a baseline expectation, anything that doubles your lock percentage (unless it is really low, of course) should cause concern. |
| Critical | This is something of a judgement call, but the absolute limits mentioned above are a good starting point. Any lock percentage above 80% for a production system is certainly something you will want to know about as soon as possible. |
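If you want to sanity-check the MMS figure, the global lock percentage can be approximated by sampling serverStatus twice and comparing the lockTime and totalTime counters (these counters are present in the MongoDB 2.x releases current as of this writing; later versions report locking differently). A rough sketch:

```python
# Sketch: approximate global lock % over one minute from two
# serverStatus samples. Relies on globalLock.lockTime / totalTime,
# which exist in MongoDB 2.x; later versions changed these fields.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def sample():
    gl = client.admin.command("serverStatus")["globalLock"]
    return gl["lockTime"], gl["totalTime"]  # both in microseconds

lock1, total1 = sample()
time.sleep(60)
lock2, total2 = sample()

lock_pct = 100.0 * (lock2 - lock1) / (total2 - total1)
print("global lock %% over the last minute: %.1f" % lock_pct)
if lock_pct > 80:
    print("CRITICAL: sustained lock percentage above 80%")
```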
5. Replica
The name and purpose of this statistic are less obvious than those of the others we have looked at above. Replica is calculated by taking the timestamp of the most recent entry in the oplog and subtracting the timestamp of the oldest entry. The result is what is commonly referred to as the “oplog window” - the amount of time, at your current traffic levels, that it takes to completely roll over the oplog.
Hence this is directly derived from three factors: how many operations you are inserting into the oplog, what size those operations are, and the size of the oplog capped collection. A good rule of thumb is to take your normal maximum maintenance window (let’s say 8 hours) and then multiply it by 3 (24 hours) for safety. This should then be the minimum length of your oplog window.
Resizing the oplog is non-trivial, so you want to catch issues here as soon as possible and take remedial action before the value gets too low.
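Because the value is just the difference between the newest and oldest oplog entries, you can also compute it directly. The sketch below (pymongo, illustrative connection string) reads the first and last documents of local.oplog.rs by natural order and reports the window in hours.

```python
# Sketch: compute the oplog window (the "Replica" value) by subtracting
# the oldest oplog entry's timestamp from the newest one's.
import pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
oplog = client.local["oplog.rs"]

first = oplog.find_one(sort=[("$natural", pymongo.ASCENDING)])
last = oplog.find_one(sort=[("$natural", pymongo.DESCENDING)])

# "ts" is a BSON Timestamp; .time is seconds since the epoch
window_seconds = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_seconds / 3600.0))
```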
| Criterion | Assessment |
| --- | --- |
| Is there an absolute limit? | Yes. The oplog window should never drop below the minimum calculated above (three times your longest maintenance window - 24 hours in our example); below that, a node taken offline for maintenance risks falling off the end of the oplog and needing a full resync. |
| Normal | Normality will depend on the factors mentioned. Similar to the connection rate, you will want to benchmark the range of replica values you get at low/normal/peak times. |
| Worrying | A drop of 25% below the replica value you observe during your peak traffic period would be worrying. |
| Critical | A drop of 50% below the replica value you observe during your peak traffic period should be considered critical and actioned immediately. |
Conclusion
The methods above are valid for evaluating any such alert in (or outside of) MMS. You may be surprised that a key metric (Queues, for example) is not included above. An alert can be added for any such metric in essentially the same way and, as with the levels you choose to alert on, every system will have the alerts that are most relevant to it.
As implied by the “worrying” versus “critical” evaluation, you are also free to have more than one alert, at different levels. Imagine an alert that only hits the ops team when lock percentage goes above 60% (perhaps as a prelude to a capacity discussion), and then an alert for a much wider audience when the more critical level of 80% is breached and all eyes are needed on the issue.
To get started with monitoring and alerting, you can create a free account for MongoDB Management Service at mms.10gen.com.