Pythian Blog: Technical Track

MongoDB metrics and thresholds

In my earlier blog post about Replica Set Maintenance Activities (available here), I emphasized that monitoring is vital to the overall health of the system. In this follow-up post, I look more closely at specific monitoring metrics taken from the MongoDB serverStatus output, which database administrators (DBAs) rely on. Visualized as time series, these metrics offer valuable insight into database performance. Tools such as Cloud Manager and PMM can display them in various panel formats.

Connections 

This metric displays the number of current connections to the mongod process. The number of connections mongod accepts is capped by the net.maxIncomingConnections setting and by operating system limits such as the open file descriptor limit; in practice the upper limit is around 50k.

Current: The number of incoming connections from clients to the database server. The value includes all incoming connections, including shell connections and connections from other servers, such as replica set members or mongos instances. A stack is allocated per connection, so a very large number of connections can result in significant RAM usage.

Depending on your OS resources, track the number of connections and configure an alert that fires when it exceeds a threshold you have chosen for your environment.
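As a rough sketch of such an alert, the function below takes a serverStatus document (for example, the dict PyMongo returns from client.admin.command("serverStatus")) and reports what fraction of the connection limit is in use. The function name and the 80% threshold are illustrative assumptions, not values recommended by MongoDB.

```python
def connection_usage(status, warn_ratio=0.8):
    """Return (ratio_used, should_alert) from a serverStatus-style dict.

    warn_ratio is a hypothetical threshold; tune it for your environment.
    """
    conns = status["connections"]
    current = conns["current"]
    # current + available approximates the effective connection limit
    limit = current + conns["available"]
    ratio = current / limit if limit else 0.0
    return ratio, ratio >= warn_ratio


# Example with made-up numbers
sample = {"connections": {"current": 900, "available": 100}}
ratio, alert = connection_usage(sample)
print(f"{ratio:.0%} of connections in use, alert={alert}")
```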

Operations

This metric displays the number of the specified operations by type (commands, queries, deletes, etc.). This metric shows the trends for database load.

Command: The average rate of commands performed per second over the selected sample period.

Delete: The average rate of deletes performed per second over the selected sample period.

Update: The average rate of updates performed per second over the selected sample period.

Getmore: The average rate of getMores performed per second on any cursor over the selected sample period. On a primary, this number can be high even if the query count is low, because secondaries issue getMore operations against the primary as part of replication.

Insert: The average rate of inserts performed per second over the selected sample period.

Query: The average rate of queries performed per second over the selected sample period.

If you're noticing a spike in system load and your operations are hitting peak levels, it might be time to review your system resources and scale horizontally or vertically. Conducting benchmarks to gauge your system's capacity is a good initial step. Additionally, keep an eye out for unusual activity in inserts, updates, and deletes, as it could signal the introduction of new application features or even application bugs. 
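The per-second rates above are derived from cumulative counters. A minimal sketch of that calculation, assuming two samples of the opcounters section of serverStatus taken a known number of seconds apart (the function name and the restart handling are my own assumptions):

```python
OP_TYPES = ("insert", "query", "update", "delete", "getmore", "command")


def op_rates(prev, curr, elapsed_seconds):
    """Average operations per second between two opcounters samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for op in OP_TYPES:
        delta = curr[op] - prev[op]
        # Counters start again from zero when mongod restarts, so a
        # negative delta is treated as "counted since the restart".
        if delta < 0:
            delta = curr[op]
        rates[op] = delta / elapsed_seconds
    return rates


# Example with made-up counters sampled 60 seconds apart
prev = {"insert": 100, "query": 500, "update": 50, "delete": 5, "getmore": 900, "command": 2000}
curr = {"insert": 700, "query": 1100, "update": 110, "delete": 5, "getmore": 1500, "command": 3200}
print(op_rates(prev, curr, 60))
```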

Memory

This metric displays information on memory usage for the mongod process.

Resident: The amount of physical memory (RAM), in megabytes, currently used by the mongod process. In a standard deployment, resident is the amount of memory used by the WiredTiger cache plus the memory dedicated to other memory structures used by the mongod process.

Virtual: The virtual memory, in megabytes, used by the mongod process. The guidance that virtual should be only a little larger than mapped memory comes from memory-mapped storage (the legacy MMAPv1 engine); WiredTiger does not memory-map its data files, so there is no mapped value to compare against. In either case, if virtual memory is many gigabytes larger than expected, a large amount of memory is being used for something other than data, which is suboptimal. The most common cause is a very large number of connections to the database: each connection has a thread stack, and the memory for those stacks can add up to a considerable amount.

If the resident memory is approaching the system memory limit, it might be good to review your working set, analyze your query patterns, review the indexes, or consider scaling, either vertically or horizontally. It's advisable to set up alert thresholds for resident memory nearing the physical memory capacity on your server. 
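A resident-memory alert can be sketched as follows. It assumes a serverStatus-style dict, whose mem.resident field is reported in MiB, and a physical memory figure that you supply yourself; the 90% threshold and the function name are hypothetical.

```python
def resident_memory_alert(status, physical_mib, warn_ratio=0.9):
    """Return (ratio, should_alert) for resident memory vs. physical RAM.

    physical_mib is the server's RAM in MiB, obtained separately
    (serverStatus does not report it).
    """
    resident_mib = status["mem"]["resident"]
    ratio = resident_mib / physical_mib
    return ratio, ratio >= warn_ratio


# Example: 15000 MiB resident on a 16000 MiB host (made-up numbers)
print(resident_memory_alert({"mem": {"resident": 15000}}, 16000))
```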

Queues locked

This metric shows the number of operations queued while waiting for a lock. Queued readers and writers should be compared with active readers and writers; if the number of queued readers and writers is small, there is no cause for concern.

Readers: The number of operations that are currently queued and waiting for the read lock. A consistently small read-queue, particularly of shorter operations, should cause no concern.

Writers: The number of operations that are currently queued and waiting for the write lock. A consistently small write-queue, particularly for shorter operations, is no cause for concern. 

Total: The total number of operations queued waiting for the lock (i.e., the sum of globalLock.currentQueue.readers and globalLock.currentQueue.writers). A consistently small queue, particularly for shorter operations, should cause no concern.

If the reader or writer queue count remains consistently high, it could be a sign that the mongod process is taking longer to respond to clients, which makes it more likely that the server is overutilized. This suggests that some queries need optimization or that the server where mongod is running needs to be scaled up.
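Because a single spike is not a problem, an alert should look at several consecutive samples. The sketch below assumes a list of serverStatus-style dicts ordered oldest to newest; the limit of 10 queued operations and the window of 5 samples are placeholder values, not recommendations.

```python
def queue_total(status):
    """Sum of queued readers and writers from globalLock.currentQueue."""
    queue = status["globalLock"]["currentQueue"]
    return queue["readers"] + queue["writers"]


def is_consistently_queued(samples, limit=10, window=5):
    """True if each of the last `window` samples has more than `limit` queued ops."""
    if len(samples) < window:
        return False
    return all(queue_total(s) > limit for s in samples[-window:])


def _sample(readers, writers):
    # Helper for building made-up samples in this example
    return {"globalLock": {"currentQueue": {"readers": readers, "writers": writers}}}


spike = [_sample(0, 0), _sample(40, 2), _sample(0, 1), _sample(0, 0), _sample(1, 0)]
sustained = [_sample(20, 5)] * 5
print(is_consistently_queued(spike), is_consistently_queued(sustained))
```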

Network 

This metric displays data on MongoDB’s network use.

Bytes In: The amount of network traffic, in MB, received by the database.

Bytes Out: The amount of network traffic, in MB, sent from the database.

Num Requests: The total number of distinct requests that the server has received. 

This value provides context for the bytesIn and bytesOut values to ensure that MongoDB’s network utilization is consistent with expectations and application use.

When the volume of bytes transferred in and out nears your system's limits, it's likely a cue to evaluate your network capacity and explore options for scaling. Ensuring your network can accommodate the increasing traffic is crucial for maintaining optimal performance and avoiding potential bottlenecks.
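To relate traffic to request volume, the three counters can be combined. This sketch assumes two samples of the network section of serverStatus (bytesIn, bytesOut, and numRequests, all cumulative, with the byte counters in bytes); the function name and the returned keys are my own.

```python
def network_summary(prev, curr, elapsed_seconds):
    """Throughput and average bytes per request between two network samples."""
    d_in = curr["bytesIn"] - prev["bytesIn"]
    d_out = curr["bytesOut"] - prev["bytesOut"]
    d_req = curr["numRequests"] - prev["numRequests"]
    return {
        "in_bytes_per_sec": d_in / elapsed_seconds,
        "out_bytes_per_sec": d_out / elapsed_seconds,
        # Guard against an idle interval with no requests
        "avg_bytes_per_request": (d_in + d_out) / d_req if d_req else 0.0,
    }


# Example with made-up counters sampled 60 seconds apart
prev = {"bytesIn": 0, "bytesOut": 0, "numRequests": 0}
curr = {"bytesIn": 6000, "bytesOut": 54000, "numRequests": 600}
print(network_summary(prev, curr, 60))
```

A sudden rise in average bytes per request with a flat request rate usually points at larger result sets rather than more traffic.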

Cursors

This metric displays the average rate of cursors that have timed out per second over the selected sample period. The value of 0 timed out cursors is the desired scenario and means that all open cursors are being properly closed by the application.

Total Open: The number of cursors that the server is maintaining for clients. Because MongoDB exhausts unused cursors, typically this value is small or zero. However, if there is a queue, stale tailable cursors, or a large number of operations, this value may rise.

Timed Out: The average rate of cursors that have timed out per second over the selected sample period. If this number is large or growing at a regular rate, this may indicate an application error.

Keep an eye on the timed-out cursors and alert on sustained growth rather than on a single non-zero reading, since the usual cause is application code that opens cursors and neither exhausts nor closes them.
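A small sketch of how these two values can be read, assuming two serverStatus-style dicts that contain the metrics.cursor section (timedOut is cumulative, open.total is a point-in-time value); the function name is illustrative.

```python
def cursor_report(prev, curr, elapsed_seconds):
    """Open cursor count and timed-out cursors per second between two samples."""
    prev_cursor = prev["metrics"]["cursor"]
    curr_cursor = curr["metrics"]["cursor"]
    timed_out = curr_cursor["timedOut"] - prev_cursor["timedOut"]
    return {
        "open_total": curr_cursor["open"]["total"],
        "timed_out_per_sec": timed_out / elapsed_seconds,
    }


# Example with made-up values sampled 60 seconds apart
prev = {"metrics": {"cursor": {"timedOut": 120, "open": {"total": 4}}}}
curr = {"metrics": {"cursor": {"timedOut": 150, "open": {"total": 7}}}}
print(cursor_report(prev, curr, 60))
```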

Scan and Order

This metric shows the total number of queries that return sorted results but cannot use an index to perform the sort, so the sort is done in memory. Ideally, this should be as close to 0 as possible. The metric grows when many queries sort without a supporting index.

Query Executor

This metric displays data from the query execution system.

Scanned: The average rate per second over the selected sample period of index items scanned during queries and query-plan evaluation. This rate is driven by the same value as totalKeysExamined in the output of explain().

Scanned objects: The average rate per second over the selected sample period of documents scanned during queries and query-plan evaluation. This rate is driven by the same value as totalDocsExamined in the output of explain().

If the number of scanned objects is high, your queries may not be using suitable indexes and may be falling back to a collection scan (COLLSCAN). Analyze the affected queries, and review collection growth rates and any changes in query patterns.
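For an individual query, the same signal is available from explain() run with execution statistics. The sketch below assumes a dict shaped like that output, using its executionStats.totalDocsExamined and executionStats.nReturned fields; the function name is hypothetical, and what counts as "too high" is left to you.

```python
def examined_per_returned(explain_output):
    """Documents examined for each document returned (lower is better)."""
    stats = explain_output["executionStats"]
    examined = stats["totalDocsExamined"]
    # Treat zero results as one to avoid dividing by zero
    returned = max(stats["nReturned"], 1)
    return examined / returned


# Made-up example: 50,000 documents examined to return 10
unindexed = {"executionStats": {"nReturned": 10, "totalDocsExamined": 50000, "totalKeysExamined": 0}}
indexed = {"executionStats": {"nReturned": 10, "totalDocsExamined": 10, "totalKeysExamined": 10}}
print(examined_per_returned(unindexed), examined_per_returned(indexed))
```

A ratio close to 1 means the query touches roughly only the documents it returns; a ratio in the thousands is typical of a collection scan.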

Cache Usage

Starting in MongoDB 3.4, the default WiredTiger internal cache size is the larger of either:

50% of (RAM - 1 GB), or 256 MB.
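As a worked example of that rule, taking it as the larger of 50% of (RAM minus 1 GB) and 256 MB, the default can be computed like this. The function name is my own, and the sketch uses binary units (GiB and MiB).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_wt_cache_bytes(ram_bytes):
    """Default WiredTiger cache size: max(50% of (RAM - 1 GiB), 256 MiB)."""
    return max((ram_bytes - GIB) // 2, 256 * MIB)


for ram_gib in (1, 4, 16):
    cache = default_wt_cache_bytes(ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {cache / GIB} GiB cache")
```

On a 16 GiB host this gives 7.5 GiB, and on a 1 GiB host the 256 MiB floor applies.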

If the dirty bytes start approaching the value of used bytes in the WT cache, the mongod process is not flushing dirty pages promptly, which should trigger a review and more detailed analysis. The number of bytes used by the WT cache should not be greater than the maximum number of bytes configured for the WT cache. Under a normal workload, WT typically fills the cache to around 80% of the configured maximum.

Dirty: The size in bytes of the dirty data in the cache. This value should be less than the bytes currently in the cache. As a percentage of the WiredTiger cache, it is calculated as (wiredTiger.cache.tracked dirty bytes in the cache) / (wiredTiger.cache.maximum bytes configured).

Used: The size in bytes of the data currently in the cache. This value should not be greater than the maximum number of bytes configured. As a percentage of the WiredTiger cache, it is calculated as (wiredTiger.cache.bytes currently in the cache) / (wiredTiger.cache.maximum bytes configured).

Monitor the WT cache usage. If the cache is consistently between 80% and 100% of the configured size, your working set may be too large for it and you should review your query patterns. If your dirty bytes are also close to the number of bytes configured, the mongod process may not be flushing dirty pages promptly.
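The two percentages can be computed directly from the wiredTiger.cache section of serverStatus. This is a minimal sketch that assumes a serverStatus-style dict with the three field names quoted above; the function name and the returned keys are illustrative.

```python
def cache_percentages(status):
    """Used and dirty bytes as a percentage of the configured WiredTiger cache."""
    cache = status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used = cache["bytes currently in the cache"]
    dirty = cache["tracked dirty bytes in the cache"]
    return {
        "used_pct": 100.0 * used / max_bytes,
        "dirty_pct": 100.0 * dirty / max_bytes,
    }


# Example with made-up byte counts
sample = {"wiredTiger": {"cache": {
    "maximum bytes configured": 1000,
    "bytes currently in the cache": 850,
    "tracked dirty bytes in the cache": 50,
}}}
print(cache_percentages(sample))
```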

Tickets Available

By default, WT has 128 read and 128 write tickets available. As these represent the number of concurrent read or write operations allowed into the storage engine, a drop in available tickets below 30% of the total should raise concern.

Reads: The number of read tickets available to the WiredTiger storage engine. Read tickets represent the number of concurrent read operations allowed into the storage engine. When this value reaches zero, new read requests may queue until a read ticket becomes available.

Writes: The number of write tickets available in WiredTiger storage.

Keep an eye on the number of tickets available and set some alerts if the number of reader or writer tickets drops to zero. If the storage engine runs out of tickets, all operations will be on hold until a new ticket is released. 
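A sketch of the 30% check, assuming a serverStatus-style dict in which tickets are reported under wiredTiger.concurrentTransactions with available and totalTickets fields for read and write. Other MongoDB releases may expose these counters elsewhere, so treat the path as an assumption to verify against your own serverStatus output; the function name is hypothetical.

```python
def ticket_availability(status, warn_pct=30.0):
    """Percentage of read/write tickets still available, with a low flag."""
    tickets = status["wiredTiger"]["concurrentTransactions"]
    report = {}
    for kind in ("read", "write"):
        entry = tickets[kind]
        pct = 100.0 * entry["available"] / entry["totalTickets"]
        report[kind] = {"available_pct": pct, "low": pct < warn_pct}
    return report


# Example with made-up values: reads are under pressure, writes are idle
sample = {"wiredTiger": {"concurrentTransactions": {
    "read": {"out": 96, "available": 32, "totalTickets": 128},
    "write": {"out": 0, "available": 128, "totalTickets": 128},
}}}
print(ticket_availability(sample))
```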

Cache Activity

This metric displays the average rate of data read into and written from WiredTiger's cache over the selected sample period.

Read Into: The average rate of bytes per second read into WiredTiger's cache over the selected sample period.

Written From: The average rate of bytes per second written from WiredTiger's cache over the selected sample period.

Monitoring the cache activity should provide information about IO activity on the server. If there are frequent data changes and bulk operations, it’s expected to see an increase in cache activity. Poor query design or inefficient indexes might also cause too many pages to be read into the cache.

Document Metrics

This metric displays information that reflects document access and modification patterns. It is useful to compare the values to the data in the opcounters document, which tracks the total number of operations (inserts, updates, deletes, and queries).

Returned: The average rate per second of documents returned by queries over the selected sample period.

Inserted: The average rate per second of documents inserted over the selected sample period.

Updated: The average rate per second of documents updated over the selected sample period.

Deleted: The average rate per second of documents deleted over the selected sample period.

The document metrics are good for trend analysis. For example, you can compare how many documents your queries return, on average, with the opcounters value for queries. A very large number of returned documents per find operation might indicate that your queries are returning unnecessary documents or are otherwise inefficient.
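That comparison can be sketched as a ratio of two deltas, assuming two serverStatus-style dicts containing metrics.document.returned and opcounters.query. The function name is my own, and the ratio is only approximate because documents returned by getMore batches are counted as well.

```python
def docs_returned_per_query(prev, curr):
    """Average documents returned per query between two samples."""
    returned = (curr["metrics"]["document"]["returned"]
                - prev["metrics"]["document"]["returned"])
    queries = curr["opcounters"]["query"] - prev["opcounters"]["query"]
    # No queries in the interval: report 0 rather than dividing by zero
    return returned / queries if queries else 0.0


# Example with made-up counters
prev = {"metrics": {"document": {"returned": 10000}}, "opcounters": {"query": 200}}
curr = {"metrics": {"document": {"returned": 260000}}, "opcounters": {"query": 700}}
print(docs_returned_per_query(prev, curr))
```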

Summary

Monitoring MongoDB instances is crucial for ensuring optimal performance, reliability, and scalability. Effective monitoring provides real-time insight into the health and performance of MongoDB deployments, allowing administrators to identify and address potential issues before they escalate. By monitoring key metrics such as WiredTiger storage engine metrics, operation counts, and query execution, administrators can understand MongoDB's behavior and resource usage. This enables informed decisions about capacity planning, resource allocation, and performance optimization, which in turn improves the stability and responsiveness of MongoDB deployments.

 
