Pythian Blog: Technical Track

Troubleshooting Transactional Replication Latency Using Agent Statistics

In this blog post, we will look at how you can leverage the Agent statistics to troubleshoot the replication latency issues. Before understanding how to decode the agent statistics, let's take a look at the some of the basic things that will help us troubleshoot the replication performance issue in a better way. The following MSDN diagram depicts the transactional replication architecture in a simple manner. Transactional replication components and data flow Troubleshooting latency issues is a multi-step approach. The first step is to identify which agent is slow.
  • Log Reader Agent (Publisher to Distributor)
  • Distribution Agent (Distributor to Subscriber)
So, the problem can be either the log reader or distribution agent. We can identify the origin of the issue by simply inserting the tracer token. Once we determine the problematic agent, the next step is to identify within the agent which particular thread is causing the issue. Let me introduce you to the important threads and how they work on these replication agents in a nutshell. Log Reader Agent Reader Thread – It scans the publisher database transaction log using sp_replcmds Writer Thread - Add the queued transactions to the Distribution database using sp_MSadd_repl_commands Distribution Agent Reader Thread – It finds the watermark from the table Msreplication_subscriptions(on subscriber) and uses this information to retrieve pending commands from the Distribution database. It basically uses the stored procedure sp_MSget_replcommands to achieve it. Writer Thread – Writer thread uses the Batched RPC calls to write the information to subscriber database. Now that we understand the threads in the replication agents. Let's assume we've already identified which agent is slow by inserting the tracer token. Next is to dig deeper on the thread level, which is where our replication agent statistics come to rescue us. By default, agent statistics entries are appended to history tables every 5 minutes. The statistics provide the historical view of how the agent has been performing and keeps the last 3 days data. You can retain data longer by changing the history retention period. MSlogreader_history MSdistribution_history the above two tables are located in the Distribution database. The statistics information is added as XML blob in the comments column of these tables. Now, let's take a look at how to decipher this XML Data for each agent. Log Reader Agent statistics <stats state="1" work="948" idle="351940" > <reader fetch="859" wait="0"/> <writer write="822" wait="395390"/> <sincelaststats elapsedtime="300" work="49" cmds="176998" cmdspersec="3543.000000"><reader fetch="17" wait="0"/><writer write="29" wait="350833"/></sincelaststats></stats>

– State = 1 means stats after batch commit

–Work = cumulative time spent by the agent since restart - idle time ( seconds)

–Idle = Time spent waiting to call sp_replcmds

–Reader fetch = Time to do execute sp_replcmds

Wait = Time spent waiting on writer to release buffer

–Writer write = Time spent writing commands into distribution database

Wait = Time spent waiting on reader to populate buffer

Note: Each thread will have their own buffer with 40k in size. Here, we need to look at the wait time to understand where the bottleneck exists.For example, if you notice that the wait time for the Reader thread is high, then it essentially means your Writer thread is slow since the Reader thread is waiting for Writer to release the buffer. Similarly, if you notice high wait times for your Writer thread then your Reader thread is slow. The easiest way to decode this is, HIGH wait time on Reader thread = Writer thread is slow ( thread which writes the commands to distribution database) HIGH Wait time on Writer thread = Reader thread is slow ( thread which scans the transaction log)

Distribution Agent Statistics

<stats state="1" work="154" idle="351464"> <reader fetch="144" wait="11"/> <writer write="12" wait="338"/> <sincelaststats elapsedtime="305" work="10" cmds="81262" cmdspersec="8041.000000"><reader fetch="0" wait="9"/><writer write="10" wait="0"/></sincelaststats></stats> – State =1 means stats after a batch commit – Work = cumulative time spend by the agent since restart - idle time – Idle = Time spend waiting to call sp_msget_repl_commands – Reader fetch = Time to do execute sp_msget_repl_commands Wait = Time spent waiting on writer to release buffer. – Writer write = Time spend writing commands into distribution database Wait = Time spent waiting on reader to populate buffer. Similar to log reader agent, the decoding of wait time is same way we did for log reader agent. HIGH wait time on Reader thread = Writer thread is slow ( thread which writes the subscriber database using batched RPC Calls) HIGH wait time on Writer thread = Reader thread is slow ( thread which takes the pending commands from Distribution database) Distributor Writer Thread Slow Scenario We would be able to understand this concept better by looking at sample statistics. In the example below, I explicitly started the transaction on Subscriber table to simulate blocking on the subscriber side, making the writer thread of the distribution agent wait and build up latency. This is how the stats looked.
  • <stats state="1" work="755" idle="354505">
  • <reader fetch="153" wait="604"/>
  • <writer write="613" wait="346"/>
  • <sincelaststats elapsedtime="636" work="515" cmds="45033" cmdspersec="87.000000"><reader fetch="0" wait="515"/><writer write="515" wait="0"/></sincelaststats></stats>
We can clearly see Reader thread wait time is high (515) which means the Writer thread is slow since we simulated the blocking on subscriber side. Similarly,we can simulate the blocking on replication tables msrepl_commands and msrepl_transactions which will cause the Log reader writer thread to be slow and stats will show Reader thread wait time as high. Ok, we've isolated the source of the bottleneck in the Thread level. Now, we can just follow the standard performance troubleshooting approach described in this Whitepaper to troubleshoot the slowness of the replication session. For instance, check out the video where Joe Sack talks about using Extended events to troubleshoot the Distributor writer thread slowness. In Summary 1. Find which agent is causing slowness using the tracer token. 2. Leverage the Agent statistics to narrow down problems to the thread level. 3. Follow standard performance troubleshooting approach to resolve the issue. Thanks for reading! I hope this will help you troubleshoot the replication performance better next time.

No Comments Yet

Let us know what you think

Subscribe by email