Pythian Blog: Technical Track

SQL Server replication quick tips

There is a time in every SQL Server DBA career where a mail came in with a “replication is not working, please check” message. This article is intended to provide with quick tips on how to handle common replication errors and performance problems in a one way transactional replication topology

Oh boy, there is a data problem:

ID-10039897

You check replication monitor and get a :

“Transaction sequence number: 0x0003BB0E000001DF000600000000, Command ID: 1”

The row was not found at the Subscriber when applying the replicated command. (Source: MSSQLServer, Error number: 20598)

Note the sequential number will be used in the following scripts, also the commandID is important to note as not necessarily the whole sequential number has issues, it might be tied to just one command.

Go to the distributor database en run the following command to get the list of articles involved in this issue:

select * from dbo.MSarticles
where article_id in (
select article_id from MSrepl_commands
where xact_seqno = 0x0003BB0E000001DF000600000000)

To get the whole list of commands you can run below query

exec sp_browsereplcmds
@xact_seqno_start = ‘0x0003BB0E000001DF000600000000’,
@xact_seqno_end = ‘0x0003BB0E000001DF000600000000’

With this last query you can get to the exact command that is failing (by searching the command number in the commandID column)

You will notice that a transactional replication will typically(depending on setup) use insert, delete, update stored procedures to replicate the data, so the command you will see over here will look something like:

{CALL [sp_MSdel_dboMyArticle] (118)}

That is the stored procedure generated to process delete statement over dbo.MyArticle table, and in this case it is trying to delete ID 118. Based on the error reported you will now realize that the issue is that the replication is trying to delete MyArtcile on ID 118 and is not there, so it is trying to delete a non existent record.

Options:

  1. You can either check the publisher for this record and manually insert it in the subscriber, this will cause the replication command to succeed and will fix the issue.
  2. You can skip the command, for this specific example you can skip the command as there is no need to delete something that has been already deleted, by removing the command from the MSrepl_commands table. (Beware, only do this when you know what you are doing, manually removing transactions can result in a unstable replication.) In this example you would use something like
    Delete from MSrepl_commands
    where xact_seqno = 0x0003BB0E000001DF000600000000 and commandID=1
  3. Reinitialize, this option is the least famous, you should try to fix the issue before doing this, however if after skipping the command you still get new errors everywhere, something definitely went wrong and there is no easy way to guarantee that your subscription is up to date and stable, this can be indicator that someone or something messed around with the data, there was some type of modification at the subscription and this is causing issues with the replication. Remember most likely a one way transactional replication is intended to have a copy of the data so it can be queried, no modification should be made to the data as this won´t replicate back to the publisher.

Query time outs:

After checking the replication monitor you get a message like:ID-10054415

Query timeout expired
The process is running and is waiting for a response from the server
Initializing…

and then terminating with this error…
Agent ‘MyAgent’ is retrying after an error, YY retries attempted

This can be due to several reasons:

  • Your transaction is taking a long time and needs some tuning. If your transaction is touching too much data or is using a bad query plan it can result in a long running query, check your TSQL and see if the execution plan is optimal
  • There is a problem with the network. If you normally don´t have this issue and this just happened out of the blue, you can try to check the network, sometimes a network failure or saturated endpoint can increase transfer rates affecting your replication.
  • Server performance, either the publisher or subscriber can have a performance problem, either too much CPU or Memory usage can eventually impact a replication transaction causing it to timeout
  • The query just needs some more time to complete. If this is the case you can tweak the time out setting to give the transaction some more time so it can process properly. To do this:
  1. Right click the Replication folder
  2. Click Distributor Properties and select General
  3. Click ‘Profile Defaults’
  4. Choose ‘Distribution Agents’ on left
  5. Click ‘New’ to create a new default agent profile
  6. Choose ‘Default Agent Profile’ from the list displayed, (to copy this)
  7. Pick a name for your new profile and upate the QueryTimeout value in right column
  8. Save
  9. Choose to use this profile across all your replication sets. However I would recommend to only apply to the agent that requires this change
  10. To individually assign the profile, open Replication Monitor and then in the left pane click your replication set
  11. In the right pane, select your desired agent, right click and change the profile to the new one you just created

 Mini Hack on expired subscriptionsID-10098834

When a replication is marked as expired, it will tell you that you need to reinitialize.

To activate it “under the hood”, check your replication monitor last error, it will show you the last sequential number that tried to process, then run this command(using the corresponding seq_no):

update MSsubscriptions
set status=2
where subscription_seqno=0x0002AADE00005030000100000002

The status column means:

0 = Inactive.

1 = Subscribed.

2 = Active.

You can change it to Active and it will try to process again. Why would you use this? if the subscription expired but your distribution cleanup job haven´t run, then it can try to reprocess everything again, if the issue was related to a network time out and now you have your network back up, you can try this as it will try to start from the last sequential number. Also you can try to do this to reproduce the last error reported, so it will fail and eventually expire again but you will have a better idea on why it failed in the first place.

Multi threading or “Streams”

A slow replication, and by slow I mean when you know that your replication is experiencing a delay when your command goes from the distributor to the subscription, you can check this with performance counters or quickly insert a token(https://technet.microsoft.com/en-us/library/ms151846%28v=sql.105%29.aspx)

You can improve the performance by adding streams, normally a default setting will write sequentially the replication transactions one by one, with Streams you can add more threads, say you specify to use 4 strems, you will be processing 4 transactions at a time meaning a faster turnaround. This can work beautifully but it can also generate deadlocks and inconsistencies, I would recommend to start low and just add 1 stream at a time and stop when you start seeing a problem. Do not go crazy and feel this is a turbo button and add 30 streams, and like most features, test it in QA first!

To Enable this option follow these steps:

  1. Open Replication Monitor, expand the Publisher and select the Publication in the left pane.
  2. On the right pane window , under “All Subscriptions” , you will see a list of all the Subscribers.
  3. Right Click the Subscriber you want to modify and click on “View Details”. A new Window will appear with the distribution agent session details.
  4. Now click on “Action” in the Menu bar at top and select “Distribution Agent Job Properties”, this will open the corresponding job properties.ID-100203331
  5. Go to  “Steps” in the left pane window followed by highlighting “Run Agent” on the Right pane window, click Edit.
  6. A new Windows will popup , scroll to the right end of the command section and append this parameter “ -SubscriptionStreams 2”
  7. Save the settings and restart the Distribution Agent job.

You might encounter some issues when implementing this, you can read this KB for further info:

https://support.microsoft.com/kb/953199

Conclusion

There are many tips on how to fix a replication, sometimes is easier to just reinitialize, but sometimes this is not an option when critical systems depend on the subscription to be up to date or your database is so huge that it will take days to complete. When possible try to troubleshoot instead of just restarting the replication from scratch as it will give you a lot more insight on what is going on.

No Comments Yet

Let us know what you think

Subscribe by email