This section includes troubleshooting information about common errors.
The first time you shut down a companion, it is restarted on the same node instead of failing over; it fails over only on the second shutdown. This is an issue with the Sun Monitor.
As a workaround, set restart_delay to a large value (for example, 50000) when you issue hasybase insert (step 6 on page 158), so that the companion always fails over within the time specified by restart_delay if it is shut down. To use this workaround, you must start the companion with the hasybase start command; you cannot start the companion with the Sybase RUNSERVER file.
Sybase has not analyzed the hasybase_config_V1 file for Adaptive Server version 12.5.
If any of your nodes has a large number of remote NFS mounts, you may see NFS errors, and response time from that node may be slow while a logical host is being deported from it. Specifically, when you issue sp_companion...prepare_failback from the secondary node, and the primary companion's logical host is being deported to the primary host, you will see a slow response from the secondary node. This is temporary and should revert to the normal response time within a few minutes. To avoid this, make sure your secondary host is responding with a normal response time before you issue sp_companion...resume from the primary host.
If your cluster includes only two nodes and does not include any quorum disks, and a node in your cluster fails, a split-brain partition occurs and failover does not proceed without user intervention. Every 10 seconds, the system displays:
*** ISSUE ABORTPARTITION OR CONTINUEPARTITION ***
along with the commands you must issue to either abort or continue. To continue, issue:
scadmin continuepartition <localnode> <clustername>
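For example, if the local node is named node1 and the cluster is named MONEY_CLUSTER (hypothetical names used here for illustration), you would issue:
scadmin continuepartition node1 MONEY_CLUSTER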
To avoid this situation, make sure you have quorum disks defined on both nodes.
Error message 18750. If a companion server issues error message 18750, check the @@cmpstate of your servers. If your primary companion is in normal companion mode but the secondary companion is in secondary failover mode, your cluster is in an inconsistent state, and you must recover from it manually. This inconsistent state may be caused by an sp_companion 'prepare_failback' command failing on the secondary companion. To recover, perform the following steps manually:
Issue the following to stop monitoring both companion servers:
hasybase stop companion_name
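For example, assuming the primary companion is MONEY1 and the secondary companion is PERSONNEL1 (PERSONNEL1 is a hypothetical name used here for illustration), issue:
hasybase stop MONEY1
hasybase stop PERSONNEL1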
Shut down both the primary and the secondary companions.
As root, issue the following to move the primary logical host back to the secondary node:
haswitch secondary_host_name primary_log_host
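For example, assuming the secondary host is named FN1 and the primary companion's logical host is named loghost_MONEY1 (both hypothetical names used here for illustration), issue:
haswitch FN1 loghost_MONEY1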
Restart the secondary companion.
Repair all databases marked “suspect.” To determine which databases are suspect, issue:
select name, status from sysdatabases
Databases marked suspect have a status value of 320.
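If you want to list only the suspect databases, and assuming they show exactly the status value of 320 noted above, you can add a where clause:
select name, status from sysdatabases where status = 320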
Allow updates to system tables:
sp_configure "allow updates", 1
For each database that is marked suspect after failover, perform the following:
1> update sysdatabases set status=status-256 where name='database_name'
2> go
1> dbcc traceon(3604)
2> go
1> dbcc dbrecover(database_name)
2> go
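After all suspect databases have been repaired, you would typically disable direct updates to system tables again:
sp_configure "allow updates", 0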
From the secondary companion, issue:
sp_companion primary_companion_name, prepare_failback
For example, from primary companion MONEY1:
sp_companion MONEY1, prepare_failback
Make sure that this command executes successfully.
Issue the following to resume monitoring the primary companion:
hasybase start primary_companion_name
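For example, for the primary companion MONEY1:
hasybase start MONEY1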