13045

Replication suspended because RSSD restarted

4.6 RSSD SQL Server Bounced

Symptoms

These error messages are displayed in the Replication Server error log:

E. 96/09/30 14:59:01. ERROR #13045 LTM USER(longdss.amercyp) - seful/cm.c(2597)
        FAILED to connect to server 'westss' as user 'westrs_rssd_prim'. See ct-lib and/or server error messages for more information.
I. 96/09/30 14:59:01. Trying to connect to server 'westss' as user 'westrs_rssd_prim' ......

After the Adaptive Server that holds the RSSD restarts, the following error messages are displayed in the Replication Server error log:

E. 96/09/30 14:59:01. ERROR #1027 DIST(westss.amerttp) - seful/cm.c(2593)
        Open Client Client-Library error: Error: 84083972, Severity: 5 -- 'ct_connect(): unable to get layer message string: unable to get origin message string: Net-Lib protocol driver call to connect two endpoints failed', Operating System error 0 -- 'Socket connect failed - errno 22'.
E. 96/09/30 14:59:01. ERROR #13045 DIST(westss.amerttp) - seful/cm.c(2597)
        FAILED to connect to server 'westss' as user 'westrs_rssd_prim'. See ct-lib and/or server error messages for more information.
I. 96/09/30 14:59:01. Trying to connect to server 'westss' as user 'westrs_rssd_prim' ......
E. 96/09/30 14:59:08. ERROR #1027 DSI(westss.amerttp) - /dsiutil.c(278)
        Open Client Client-Library error: Error: 84083974, Severity: 5 -- 'ct_results(): unable to get layer message string: unable to get origin message string: Net-Library operation terminated due to disconnect'.
E. 96/09/30 14:59:08. ERROR #5097 DSI(westss.amerttp) - /dsiutil.c(281)
        The ct-lib function 'ct_results' returns FAIL for database 'westss.amerttp'. The errors are retryable. The DSI thread will restart automatically. See ct-lib messages for more information.
...
E. 96/09/30 14:59:12. ERROR #13043 LTM USER(longdss.amercyp) - seful/cm.c(2796)
 Failed to execute the 'USE westrs_rssd' command on server 'westss'. See ct-lib and sqlserver error messages for more information.
E. 96/09/30 14:59:12. ERROR #1028 LTM USER(longdss.amercyp) - seful/cm.c(2796)
 Message from server: Message: 921, State: 1, Severity: 14 -- 'Database 'westrs_rssd' has not been recovered yet - please wait and try again.'.
I. 96/09/30 14:59:12. Message from server: Message: 5701, State: 1, Severity: 10 -- 'Changed database context to 'master'.'.
I. 96/09/30 14:59:15. LTM for longdss.amercyp connected in passthru mode.
E. 96/09/30 14:59:16. ERROR #13043 USER(westrs_ltm) - seful/cm.c(2796)
        Failed to execute the 'USE westrs_rssd' command on server 'westss'. See ct-lib and sqlserver error messages for more information.
E. 96/09/30 14:59:16. ERROR #1028 USER(westrs_ltm) - seful/cm.c(2796)
 Message from server: Message: 921, State: 1, Severity: 14 -- 'Database 'westrs_rssd' has not been recovered yet - please wait and try again.'.
I. 96/09/30 14:59:16. Message from server: Message: 5701, State: 1, Severity: 10 -- 'Changed database context to 'master'.'.
...
E. 96/09/30 14:59:23. ERROR #13043 DIST(westss.amerttp) - seful/cm.c(2796)
 Failed to execute the 'USE westrs_rssd' command on server 'westss'. See ct-lib and sqlserver error messages for more information.
E. 96/09/30 14:59:23. ERROR #1028 DIST(westss.amerttp) - seful/cm.c(2796)
        Message from server: Message: 921, State: 1, Severity: 14 -- 'Database 'westrs_rssd' has not been recovered yet - please wait and try again.'.
...
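
When the error log is long, it can help to tally the error headers by number and thread instead of reading them one at a time. The following is a minimal sketch, assuming that every error header follows the layout of the excerpts above. The function names and the rule "13045 and 13043 both present" are illustrative assumptions, not part of Replication Server.

```python
import re
from collections import Counter

# Header layout assumed from the excerpts above, for example:
#   E. 96/09/30 14:59:01. ERROR #13045 DIST(westss.amerttp) - seful/cm.c(2597)
HEADER = re.compile(
    r"^(?P<sev>[A-Z])\. (?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\. "
    r"ERROR #(?P<num>\d+) (?P<thread>.+?)\((?P<target>[^)]*)\) - "
)


def count_errors(lines):
    """Count (error number, thread name) pairs among error header lines."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        if m:
            counts[(int(m.group("num")), m.group("thread"))] += 1
    return counts


def rssd_restart_suspected(lines):
    """Heuristic (an assumption): failed connects (13045) and failed USE (13043) both seen."""
    nums = {num for num, _thread in count_errors(lines)}
    return 13045 in nums and 13043 in nums


sample = [
    "E. 96/09/30 14:59:01. ERROR #13045 DIST(westss.amerttp) - seful/cm.c(2597)",
    "I. 96/09/30 14:59:01. Trying to connect to server 'westss' as user 'westrs_rssd_prim' ......",
    "E. 96/09/30 14:59:12. ERROR #13043 LTM USER(longdss.amercyp) - seful/cm.c(2796)",
]
print(count_errors(sample))
print(rssd_restart_suspected(sample))
```

Informational lines and message continuation lines do not match the pattern and are skipped, so only the error headers are counted.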

Explanation

The Adaptive Server that controls the RSSD was shut down and restarted while the Replication Server was running. The DIST and SQT threads for the databases controlled by the Replication Server terminated, so replication to those databases stopped. Replication does not resume on its own, even after the RSSD becomes available again.

Running the admin who_is_down command at the Replication Server shows that both the DIST and SQT threads are down:

 Spid    Name    State   Info
 ----    ----    -----   ----------------------
         DIST    Down    westernDS.westDB
         SQT     Down    105:1 westernDS.westDB
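
If many databases are involved, the down threads can be pulled out of the admin who_is_down output with a short script. This is a sketch only: it assumes the four-column layout shown above, with the Spid column blank for down threads, and the function name parse_who_is_down is hypothetical.

```python
def parse_who_is_down(output):
    """Return (thread name, info) pairs for rows whose State column is 'Down'."""
    rows = []
    for line in output.splitlines():
        tokens = line.split()
        if "Down" not in tokens:
            continue  # skips the header, the dashed rule, and blank lines
        i = tokens.index("Down")
        name_tokens = tokens[:i]
        # Drop a leading numeric Spid if one is present (it is blank in the sample).
        if name_tokens and name_tokens[0].isdigit():
            name_tokens = name_tokens[1:]
        rows.append((" ".join(name_tokens), " ".join(tokens[i + 1:])))
    return rows


sample = """\
Spid    Name    State   Info
----    ----    -----   ----
        DIST    Down    westernDS.westDB
        SQT     Down    105:1 westernDS.westDB
"""
down = parse_who_is_down(sample)
for name, info in down:
    print(name, "->", info)
```

The last token of each Info value is the data_server.database name that the affected thread serves.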

Solution

To solve the problem:

  1. At the Replication Server, execute resume distributor for each affected database to restart its SQT and DIST threads.

  2. Run admin who_is_down at the Replication Server to verify that the SQT and DIST threads for each database are up.
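
For more than a handful of databases, the two steps above can be generated as a single batch of commands. The sketch below builds that batch as text, assuming it is then submitted to the Replication Server through isql, which uses go as the batch terminator. The function name resume_script is hypothetical, and the database name is taken from the sample output in the Explanation.

```python
def resume_script(databases):
    """Build a command batch: one 'resume distributor' per database, then a check."""
    lines = []
    for ds_db in databases:
        # Each entry is expected in data_server.database form.
        lines.append(f"resume distributor {ds_db}")
        lines.append("go")
    # Finish by listing any threads that are still down.
    lines.append("admin who_is_down")
    lines.append("go")
    return "\n".join(lines) + "\n"


script = resume_script(["westernDS.westDB"])
print(script, end="")
```

If admin who_is_down still lists a DIST or SQT thread after the batch runs, check the Replication Server error log before retrying.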