Tuesday 8 November 2011

Replication P&T and Troubleshooting

Performance Tuning


Performance tuning is one of the important tasks a Replication Server DBA must undertake to improve throughput and avoid latency when transmitting data from one database to another.  To meet the needs and demands of your Replication Server system, you must manage resources effectively and optimize the performance of individual Replication Servers. You can improve the performance of a Replication Server by changing the values of configuration parameters, by using parallel DSI threads, or by choosing disk allocations. To manage these resources successfully, you should understand something about Replication Server’s internal processing.  During replication, data operations are carried out by several Replication Server threads. On UNIX platforms, these are POSIX threads; on Windows platforms, they are WIN32 threads.

Prerequisites: Readers should have an understanding of Sybase Replication Server and some experience with Sybase ASE databases.

Audience: This documentation is designed for everyone who uses Replication Server.

Processing in the primary Replication Server


This section describes how a transaction that originates in a primary data server is sent to the primary Replication Server and subsequently distributed to a replicate Replication Server, as illustrated in the figure below:



Replication Agent User Thread: RepAgent logs in to Replication Server through an Open Client interface. It scans the transaction log, converts log records directly into LTL (Log Transfer Language) commands, and sends them to Replication Server as soon as they are logged—either in batches or one at a time. Replication Server then distributes the transaction information to subscribing replicate databases.

Stable Queue Manager Thread: There is one Stable Queue Manager (SQM) thread for each stable queue accessed by the primary Replication Server, whether inbound or outbound. Each RepAgent user thread works with a dedicated SQM thread that reclaims stable queue space after a transaction is forwarded to a data server or to another Replication Server.

Stable Queue Transaction Thread: Commands stored in transaction log records and in the inbound queue are ordered according to the sequence in which they were committed—although they are not necessarily grouped by transaction. It is the task of the Stable Queue Transaction (SQT) thread to reassemble transactions and place the transactions in commit order. Transactions must be in commit order for final application on the destination’s data servers and for materialization processing.  The SQT thread reassembles transactions as it reads commands from its stable queue and keeps a linked list of transactions.

Distributor Thread and Related Modules: For each primary database managed by a Replication Server, there is a distributor (DIST) thread, which in turn uses an SQT thread to read from the inbound queue and SQM threads to write transactions to the outbound queue. Thus, for example, if there are three primary databases, then there are three inbound queues, and three DIST and SQT threads.

In determining the destination of each transaction row, the DIST thread makes calls to the following modules: Subscription Resolution Engine, Transaction Delivery, and Message Delivery.

Subscription Resolution Engine: The subscription resolution engine (SRE) matches transaction rows with subscriptions. When it finds a match, it attaches a destination-database ID to each row. It marks only rows required for subscriptions, thereby minimizing network traffic. If no subscriptions match, the DIST thread discards the row data.

Transaction Delivery Module: The Transaction Delivery (TD) module is called by the DIST thread to package transaction rows for distribution to data servers and other Replication Servers.

Message Delivery Module: The Message Delivery (MD) module is called by the DIST thread to optimize routing of transactions to data servers or other Replication Servers.

Data Server Interface Threads: Replication Server starts DSI threads to submit transactions to a replicate database to which it maintains a connection. Each DSI thread is composed of a scheduler thread (DSI-S) and one or more executor threads (DSI-E). Each DSI executor thread opens an Open Client connection to a database.

The DSI scheduler thread calls the SQT interface to:

  • Collect small transactions into groups by commit order
  • Dispatch transaction groups to the next available DSI executor thread

The DSI executor threads:

  • Map functions using the function strings defined for the functions, according to the function-string class assigned to the database connection
  • Execute the transactions in the replicate database
  • Take action on any errors returned by the data server; depending on the assigned error actions, also record any failed transactions in the exceptions log

Replication Server Interface Thread: RSI threads are asynchronous interfaces used to send messages from one Replication Server to another. One RSI thread exists for each destination Replication Server to which the source Replication Server has a direct route.

Suggestions for using tuning parameters


This section provides basic recommendations for improving Replication Server performance.  Note that only some of the important configuration parameters are described in this section.  For a complete list of tuning parameters, or for specific performance problems not addressed here, refer to the Sybase documentation.

Setting the amount of time SQM Writer waits


Replication Server configuration parameters: init_sqm_write_delay and init_sqm_write_max_delay

In a low-volume system, set init_sqm_write_delay and init_sqm_write_max_delay to a low value so that the SQM Writer need not wait long before writing a partially full block. In a high-volume system, set these parameters higher because the SQM Writer rarely waits to fill a block.
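For example, in a low-volume system the delays might be shortened as follows. The values are purely illustrative (both parameters are specified in milliseconds), and the commands are issued from isql connected to the Replication Server.

configure replication server
set init_sqm_write_delay to '100'

configure replication server
set init_sqm_write_max_delay to '1000'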

Caching system tables


Replication Server configuration parameters: sts_cachesize and sts_full_cache_table_name

You can fully cache certain system tables so that simple select statements on those tables do not require access to the RSSD. By default, rs_repobjs and rs_users are fully cached. Sybase recommends that you cache rs_objects, rs_columns, and rs_functions. Depending on the number of replication definitions and subscriptions used, fully caching these tables may significantly reduce RSSD access requirements. However, if the number of unique rows in rs_objects is approximately equal to the value for sts_cachesize, these tables may already be fully cached.
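As a sketch, the per-table row cache and full caching of rs_objects could be configured as follows. The cache size value is illustrative, and the sts_full_cache_rs_objects parameter name assumes the usual sts_full_cache_<table_name> naming convention.

configure replication server
set sts_cachesize to '1000'

configure replication server
set sts_full_cache_rs_objects to 'on'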

Sizing the SQT cache


Replication Server configuration parameter: sqt_max_cache_size

sqt_max_cache_size applies to all SQT caches supporting DIST clients, and provides a default value for SQT caches that support DSI clients. The DISTs can push through transactions rapidly; their SQT caches do not need to be as large as SQT caches for DSIs. Thus, it is advisable to set SQT cache sizes for DSIs individually using the connection configuration parameter dsi_sqt_max_cache_size, and using sqt_max_cache_size for DIST SQT caches only.
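A minimal sketch of this approach, using byte values that are purely illustrative: size the DIST SQT caches with the server-level parameter, and size a DSI SQT cache per connection (the SYDNEY_DS.pubs2 connection name is reused from the examples later in this document).

configure replication server
set sqt_max_cache_size to '2097152'

alter connection to SYDNEY_DS.pubs2
set dsi_sqt_max_cache_size to '4194304'

Connection-level parameters generally take effect only after the connection is suspended and resumed.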

Setting wake up intervals


Replication Server configuration parameters: rec_daemon_sleep_time, sub_daemon_sleep_time, and stats_daemon_sleep_time

By default, the recovery and subscription daemons wake up every two minutes to check the RSSD for messages. In a typical production environment, the subscription daemon is used rarely. As a consequence, you may be able to set the subscription daemon wake-up interval to the maximum value: 31,536,000 seconds. Similarly, you can evaluate whether you want to set the recovery and statistics daemon to a longer wake-up interval.
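For example, to stretch the subscription daemon’s wake-up interval to the maximum mentioned above:

configure replication server
set sub_daemon_sleep_time to '31536000'

The recovery and statistics daemons can be adjusted the same way with rec_daemon_sleep_time and stats_daemon_sleep_time.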

Controlling the number of network operations


Database connection configuration parameter: dsi_cmd_batch_size

dsi_cmd_batch_size controls the size of a DSI command batch. That is, it controls the size of the buffer a DSI uses to send commands to a replicate data server. When the DSI configuration batch is set on, the DSI places as many commands as will fit into a single command batch before sending it to the replicate. In some cases, increasing the value of dsi_cmd_batch_size improves throughput by reducing the number of network operations.
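A sketch of raising the batch size for a single connection; the 16384-byte value is illustrative, and the connection is suspended and resumed so the change takes effect.

suspend connection to SYDNEY_DS.pubs2

alter connection to SYDNEY_DS.pubs2
set dsi_cmd_batch_size to '16384'

resume connection to SYDNEY_DS.pubs2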

Controlling the number of outstanding bytes


Database connection configuration parameters: exec_sqm_write_request_limit and md_sqm_write_request_limit

exec_sqm_write_request_limit controls the maximum number of outstanding bytes the RepAgent User thread can hold before it must wait for some of those bytes to be written to the inbound queue. Similarly, md_sqm_write_request_limit controls the number of outstanding bytes a Distributor can hold before it must wait for some of those bytes to be written to the outbound queue.
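As an illustrative sketch, both limits can be raised for a primary database connection. The byte values are examples only, and TOKYO_DS.db1 is reused from the disk_affinity example later in this document.

alter connection to TOKYO_DS.db1
set exec_sqm_write_request_limit to '983040'

alter connection to TOKYO_DS.db1
set md_sqm_write_request_limit to '983040'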

Specifying the number of stable queue segments allocated


Replication Server configuration parameter: sqm_recover_segs

sqm_recover_segs specifies the number of stable queue segments Replication Server allocates before updating the RSSD with recovery QID information.

If sqm_recover_segs is set low, more RSSD updates are performed, possibly slowing performance. If sqm_recover_segs is set high, fewer RSSD updates are performed, possibly improving performance at the expense of longer recovery times.
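For example (the value is illustrative; as a server-level parameter, it may require a Replication Server restart to take effect):

configure replication server
set sqm_recover_segs to '10'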

Controlling the number of commands the RepAgent executor can process


Database connection configuration parameter: exec_cmds_per_timeslice

By default, the value of the exec_cmds_per_timeslice parameter is 5, which indicates that the RepAgent executor thread can process no more than five commands before it must yield the CPU to other threads. Depending on your environment, increasing or decreasing this value may improve performance.

If the inbound queue is slow to be processed, try increasing the value to give the RepAgent executor thread and the distributor thread more time to perform their work. If, however, the outbound queue is slow to be processed, try decreasing the value so that the DSI has more time to work.
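A sketch of raising the value for a primary database connection; the value 20 is illustrative.

alter connection to TOKYO_DS.db1
set exec_cmds_per_timeslice to '20'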

Selecting disk partitions for stable queues


Database connection configuration parameter: disk_affinity

You can choose the disk partition to which Replication Server allocates segments for stable queues. By choosing the stable queue placement, you can enhance load balancing and read/write distribution.

To choose the segment allocation, use the alter connection or alter route command with the “set disk_affinity” option.

The syntax is:

alter connection to dataserver.database
set disk_affinity to [ 'partition' | 'off' ]

alter route to replication_server
set disk_affinity to [ 'partition' | 'off' ]

partition is the logical name of the partition to which you want to allocate the next segment for the connection or route.

An example

You can allocate different disk partitions to different stable queues. You could, for example, make partitions of different sizes available to different database connections. In this example, we add partitions of 10MB and 20MB to the Replication Server and specify allocation hints for the TOKYO_DS and SEATTLE_DS data servers. The procedure is:

  • To make the partitions P1 and P2 on the device named /dev/rds0a available to Replication Server, enter:

add partition P1 on '/dev/rds0a' with size 20
add partition P2 on '/dev/rds0a' with size 10


  • To suspend the connections to the TOKYO_DS and SEATTLE_DS data servers, enter:

suspend connection to TOKYO_DS
suspend connection to SEATTLE_DS

  • To specify allocation hints for the connections to the TOKYO_DS and SEATTLE_DS data servers, enter:

alter connection to TOKYO_DS.db1
set disk_affinity to 'P1'

alter connection to SEATTLE_DS.db5
set disk_affinity to 'P2'

  • To resume the connections to the TOKYO_DS and SEATTLE_DS data servers, enter:

resume connection to TOKYO_DS
resume connection to SEATTLE_DS

  • To drop an allocation hint, set disk_affinity to 'off'. For example, enter:

alter connection to TOKYO_DS.db1
set disk_affinity to 'off'

Specifying the number of transactions in a group


You can use different configuration parameters to control the number of transactions in a group.

Database configuration parameter : dsi_max_xacts_in_group

dsi_max_xacts_in_group specifies the maximum number of transactions in a group. Larger numbers may reduce commit processing at the replicate database, and thereby improve throughput.
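For example, to raise the grouping limit for one replicate connection (the value 40 is illustrative):

alter connection to SYDNEY_DS.pubs2
set dsi_max_xacts_in_group to '40'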

Using parallel DSI threads


You can configure a database connection so that transactions are applied to a replicate data server using parallel DSI threads rather than a single DSI thread. Applying transactions in parallel increases the speed of replication while maintaining the serial order of the transactions as they occurred at the primary site.  When parallel DSI is used, there is one DSI scheduler thread and multiple DSI executor threads, as shown in the following figure.


DSI Scheduler: The DSI scheduler thread (shown as DSI-S in Figure) collects small transactions into groups by commit order. Once transactions are grouped, the DSI scheduler dispatches the groups to the next available DSI executor thread. The DSI scheduler attempts to dispatch groups for different origins in parallel, because they can commit in parallel.

DSI executor threads: The DSI executor threads (shown as DSI-E in Figure) map functions to function strings and execute the transactions on the replicate database. The DSI executor threads also take action on any errors the replicate data server returns.

Example to set up parallel DSI:

alter connection to SYDNEY_DS.pubs2
set parallel_dsi to 'on'

Configuration Parameters for Parallel DSI:


dsi_num_threads
The number of parallel DSI threads to be used for a connection. A value of 1 disables the parallel DSI feature.
Default: 1   Minimum: 1   Maximum: 255

dsi_large_xact_size
The number of statements allowed in a transaction before it is considered to be a large transaction.
Default: 100   Minimum: 4

dsi_num_large_xact_threads
The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads.
Default: 0

dsi_serialization_method
Specifies the method used to maintain serial consistency between parallel DSI threads when applying transactions to a replicate data server.
  • no_wait – specifies that a transaction can start as soon as it is ready, without regard to the state of other transactions.
  • wait_for_start – specifies that a transaction can start as soon as the transaction scheduled to commit immediately before it has started.
  • isolation_level_3 – specifies that a transaction can start as soon as the transaction scheduled to commit immediately before it has started, and that transaction isolation level 3 locking be used in the replicate data server.
  • wait_for_commit – specifies that a transaction cannot start until the transaction scheduled to commit immediately preceding it is ready to commit.
  • none – same as wait_for_start. Retained for backward compatibility.
  • single_transaction_per_origin – same as wait_for_start with dsi_partitioning_rule set to origin. Retained for backward compatibility.
Default: wait_for_commit

parallel_dsi
A shorthand method for configuring parallel DSI to default values. A value of “on” sets dsi_num_threads to 5, dsi_num_large_xact_threads to 2, dsi_serialization_method to wait_for_commit, and dsi_sqt_max_cache_size to 1 million bytes. A value of “off” sets the parallel DSI values to their defaults. You can set this parameter to “on” and then set individual parallel DSI configuration parameters to fine-tune your configuration.
Default: off






SMP Enable


Replication Server configuration parameter: smp_enable

To determine the number of processors required to make effective use of SMP, establish a base of two processors plus one more for every four queues. Processor speed may determine whether these numbers are correct to meet your performance needs. If you have outbound queues supporting parallel DSI, and there are more than 12 DSI Executor threads, you may want to increase the processor/thread ratio for outbound queues—one processor for every three or even two outbound queues.
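Enabling SMP itself is a single server-level setting, sketched below; a Replication Server restart is generally required for it to take effect.

configure replication server
set smp_enable to 'on'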

Replication Server always uses a finite number of threads, based on the number of supported connections and routes. Even if all threads are kept busy, making more and more processors available to Replication Server will eventually cause “CPU saturation”, beyond which additional processors will not increase performance. At that point, any performance issues you experience as a result of CPU resources may best be addressed by introducing CPUs running at faster speeds.

Although test results are inconclusive, there is evidence that making too many processors available to Replication Server can actually decrease performance. In such cases, the issue seems to be the amount of time taken to force thread context switches among the available processors. Use your operating system (OS) monitoring utilities to monitor the OS’s management of the Replication Server process and its threads. These utilities will help you determine if a reduction in CPUs made available to Replication Server reduces the number of such context switches.



Troubleshooting

Overview


A correctly configured Replication Server system is designed to be fault tolerant. However, in the event of a serious failure, it might be necessary to manually intervene and fix the problem. This guide will help you locate, identify, and fix the cause of the problem.  Use the following troubleshooting methods to narrow down the possible causes:

  • Identify recent tasks or events that might have adversely affected the replication system. You or a coworker could have performed a task that caused a replication failure or reduced the performance of the replication system. An event such as a temporary network outage may also have caused a replication system problem.
  • Analyze error logs.
  • Determine the general problem category (configuration, subscription, replication, recovery).
  • Use diagnostic tools, such as Replication Server programs or stored procedures, or isql, to analyze the replication system.

Analyzing Error Messages


Error messages provide important information for identifying the cause of replication system problems. When a Replication Server or RepAgent error occurs, an error message is recorded in an error log. Error messages are displayed in a standard format; once you learn this format, you will be able to read and understand all Replication Server and RepAgent error messages.

Replication Server error messages are recorded in text files called error logs or sent to the standard error output (stderr), which is usually a terminal window. In general, replication agent error messages are recorded in their own error logs, except for RepAgent error messages, which are recorded in the Adaptive Server data server error log. LTM error messages are recorded in the LTM’s error log. Data server error messages are recorded in the data server’s error log.

Replication Server errors


Replication Server errors are recorded in the Replication Server error log.  A Replication Server error log contains informational, warning, thread terminated, fatal, and internal error messages.

Example

Examine the following error message block in the Replication Server log:
H. 97/08/23 . THREAD FATAL ERROR #5049 DSI(WESTERNDS.westDB) - dsiqmint.c(3675) The DSI thread for database ‘westernDS.westDB’ is being shutdown. DSI received data server error #102 which is mapped to STOP_REPLICATION. See logged data server errors for more information. The data server error was caused by RS output command #1 mapped from input command #1 of the failed transaction.

I. 97/08/23 09:52:02. The DSI thread for database ‘WESTERNDS.westDB’ is shutdown.

Severity

I – An informational message.
W – A warning about a condition that has not yet caused an error, but may require attention; for example, running out of a resource.
E – An error that does not prevent further processing, such as a site that is unavailable.
H – A Replication Server thread has died; for example, because of a lost network connection.
F – Fatal. A serious error caused Replication Server to exit; for example, starting Replication Server with an incorrect configuration.
N – Internal error, caused by anomalies in the Replication Server software. Report these errors to Sybase Technical Support.


Error Message Text:
H. 97/08/23 . THREAD FATAL ERROR #5049 DSI (WESTERNDS.westDB) - dsiqmint.c(3675)

Explanation:
This is the first error. The H identifies the error as a thread termination error. This message shows that the DSI thread terminated. The error message includes the dataserver.database name where the error occurred, the internal Replication Server module (dsiqmint.c), and the line of code (3675) where the error occurred.

Error Message Text:
The DSI received data server error #102 which is mapped to STOP_REPLICATION. See the logged data server errors for more information.

Explanation:
This Adaptive Server error message identifies the cause of the problem. Find a description for Adaptive Server error 102 by:
  • Finding the error in the Adaptive Server error log (the error would occur at approximately the same time as it occurred in the Replication Server error log),
  • Performing a select * from sysmessages where error = 102 in the master database, or
  • Looking up the number in the Adaptive Server Error Messages or Troubleshooting and Error Messages Guide.

Error Message Text:
I. 97/08/23 09:52:02. The DSI thread for database ‘WESTERNDS.westDB’ is shutdown.

Explanation:
This last error message is informational (I) and is caused by the problem specified in the previous block. After fixing the Adaptive Server problem, you must restart the DSI thread for the specified database.


Example : RepAgent error message format

01/08/1998 14:47:58 PM RepAgent (10) 9209, EX_CMDFATAL, 1: Missing datarow in TEXT/IMAGE insert log record. Transaction log may be corrupt. Please contact SYBASE Technical Support. (current marker = (%d, %d)).

Common Errors


This section describes some of the most common Replication Server errors and solutions.

rs_init errors :


Error : Could not find entry for Adaptive Server

Message text : Error messages stating that the Adaptive Server entry does not exist in the interfaces file are displayed in the rs_init error log.

08/23/96 Running task to check the Adaptive Server.
08/23/96 02:06:49 PM Could not find entry for Adaptive Server ’westdss’ in the interfaces file.
08/23/96 02:06:49 PM Task to check the Adaptive Server failed; terminating configuration.
08/23/96 02:06:49 PM Configuration failed.
08/23/96 Exiting.

Solution: Check for unprintable characters (such as control characters) in the Adaptive Server name entry in the resource file. The Adaptive Server name entry is specified as the rs.rs_ds_name parameter.

Error: Unknown host machine name.

Message text: The following messages are displayed in the rs_init error log.

08/09/98 11:54:43 AM Running task to check the RSSD Adaptive Server.
08/09/98 CONNECTIVITY ERROR: DB-Library error: ‘Unknown host
machine name.’. 08/09/98 Unable to connect to the RSSD SQL Server
‘axptst’.Make sure that the Adaptive Server is running, and the ‘sa’ password
is correct.
08/09/98 Task to check the RSSD Adaptive Server failed;
terminating configuration.
08/09/98 Configuration failed. 08/09/98 11:54:45 AM Exiting.

Solution:  Add an additional query line for the machine to the interfaces file and specify the host machine’s IP address instead of its name. For example, for the bss.bsslt.ch.corma.com host name substitute its IP address, 2.41.100.35, as follows:

query tcp ether bss.bsslt.ch.corma.com 31440
query tcp ether 2.41.100.35 31440

Replication Server error message descriptions


1028    Data server errors

Explanation : Error 1028 is displayed in the Replication Server error log. An Adaptive Server error is reported in the text of error 1028.


E. 95/08/09 . ERROR #1028 DSI(westernDS.westDB)
- dsiqmint.c(3675) Message from server: Message: ...,
State: ..., Severity: ...-- ‘...’.

The most common error situations are described in these sections:

• DSI shuts down because of SQL error in transaction
• Adaptive Server and DB2 table names do not match
• Adaptive Server last-chance threshold passed
• Accessing a database while in recovery

Solution: In general, you must identify and fix the Adaptive Server problem that is causing the error. If you cannot correct the cause of the problem, as a last resort, you can resume the connection and skip the transaction. Skipping a transaction, however, causes inconsistencies between the primary and replicate databases, which you must manually fix in the replicate database.

Correcting Adaptive Server error

Checking Exceptions in RSSD


Attempt to fix the Adaptive Server error by analyzing the transaction text in the exceptions log as follows:

1. Log in to the RSSD and display the transaction ID by executing the rs_helpexception stored procedure.

            1> rs_helpexception
2> go

Summary of Logged Transactions on ‘westernRS’
Total # of Logged Transactions = 1

Xact ID   Org Site          Org User   Org Date      Dest Site       # Recs/Xact
-------   ---------------   --------   -----------   -------------   -----------
107       mil01hprdss.eur   eurian     Jan 26 1996   westernDS.eur   3

For Detailed Information on a Logged Xact., type ‘rs_helpexception {Xact
ID}’ (return status = 0)

2. Using the transaction ID displayed in the previous step, display the full transaction text by executing the rs_helpexception stored procedure with the v option. The v option includes the text of the transaction in the output.

1> rs_helpexception 107, v
2> go

You see this information:

Detailed Summary of Logged Transaction # 107 on ‘westernRS’

Origin Site        Origin User   Org. Commit Date   #Cmds in Xact
----------------   -----------   ----------------   -------------
westernDS.westDB   eurian        Jan 26 1996        3

Dest. Site         Dest. User    Date Logged
----------------   -----------   -----------
westernDS.westDB   eurian        Jan 29 1996

This transaction was logged by the ‘sysadmin log_first_tran’ command.

Rejected Records

textval
----------------------------------------------------------------
A0100distribute :origin_time=’Jan 26 1996 9:27:24:416AM’,:origin_user=
‘’,:mode=1
begin transaction ‘logexec’ for ‘eurian’/’******’
begin transaction
A0100distribute :origin_user=’’,:mode=1 exec “TT”.”so_req_rep_all_allco
n” @”p01”=80000709,@”p02”=’MIL’
execute tt_act_rep_all_allcon @p01 = 80000709, @p02 = ‘MIL’
A0100distribute :origin_time=’Jan 26 1996 9:27:24:416AM’,:origin_user=
‘’,:mode=1 commit transaction
execute rs_update_lastcommit @origin = 107, @origin_qid = 0x
00000001004620d300019296000effffffff000000008910009bd7cd000100000000000
1, @ secondary_qid =
0x00000000000000000000000000000000000000000000000000000000000000
0000000000, @origin_time = ‘Jan 26 1996 9:27:24:416AM’ commit transaction
(return status = 0)

This text corresponds to what is sent to the database (in this case, function strings have been applied).

3. Use this transaction information to manually apply the update to the replicate database.

4. After fixing the error in the database, resume the connection in the Replication Server:

1> resume connection to westernDS.westDB
2> go

5. Delete the transaction from the exceptions log to keep the log small:

1> exec rs_delexception 107
2> go

Skipping the transaction


If you cannot fix the Adaptive Server error, skip the transaction and manually apply the skipped transaction to the replicate table to resynchronize the primary and replicate tables.

Adaptive Server last-chance threshold passed


Symptoms:  The following errors, which include references to the Adaptive Server error 7415, are recorded in the Replication Server error log.


I. 96/08/08 10:45:07. Message from server: Message: 7415, State: 1,
Severity: 10 - - ‘The transaction log in database northDB is almost full.
Your transaction is being suspended until space is made available in the
log.’.


E. 96/08/08 . ERROR #5046 DSI(axp st.northDB) - /dsioqid.c(1638)
When executing the rs_get_lastcommit function in database ‘axp st.northDB’,
received data server errors. See logged data server errors for more
information.

Solutions: Reduce the size of the log by dumping or truncating.
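A sketch of dumping the log of the affected database; the dump file path is illustrative, and the with truncate_only form can be used instead if the log records are not needed for recovery.

1> dump transaction northDB to '/backups/northDB_tranlog.dmp'
2> go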

DSI shuts down because replicate database log is full


Symptoms : The DSI thread shuts down and the following error message is displayed in the Replication Server error log.

E. 96/07/17 . ERROR #1028 DSI EXEC(107(1) SYDNEY_DS.pubs2) -
dsiqmint.c( 2361) Message from server: Message: 1105, State 3, Severity 17
-- ‘Can’t allocate space for object ‘syslogs’ in database ‘pubs2’ because
the ‘logsegment’ segment is full. If you ran out of space in syslogs, dump
the transaction log. Otherwise, use ALTER DATABASE or sp_extendsegment to
increase the size of the segment.
H. 96/07/17 . THREAD FATAL ERROR #5049 DSI EXEC(107(1)
SYDNEY_DS.pubs2) - dsiqmint.c(2368) The DSI thread for database
‘SYDNEY_DS.pubs2’ is being shutdown.
DSI received data server error #1105 which is mapped to STOP_REPLICATION.
See logged data server errors for more information. The data server error
was caused by output command #1 mapped from input command #1 of the failed
transaction.

Solutions: Complete the following tasks (see the sketch after this list for step 1):

  1. Add space to the Adaptive Server log, after verifying that there is space in the Replication Server stable device.

  2. If there is not enough space in the stable device to increase the size of the log, add space to the stable device.

  3. Make sure the DSI thread to the data server and the RSI thread to the Replication Server are up and running.
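A sketch of step 1 on the replicate Adaptive Server, assuming a log device named logdev2 already exists and 100MB is to be added; both the device name and the size are illustrative.

1> alter database pubs2 log on logdev2 = 100
2> go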

Replication Server out of memory


Symptoms: After a connection is started, these error messages are displayed in the Replication Server error log:

I. 95/12/19 16:42:09. The DSI thread for database ‘westernDS.westDB’ is
started.
E. 95/12/19 . ERROR #7035 DIST(westernDS.westDB) - m/memseg.c(771)
.........Additional allocation would exceed the memory_limit of ‘20000’
specified in the configuration.
Increase the value of the memory_limit parameter in the rs_config table and restart the Repserver.

Solution: To fix this problem (a sketch follows these steps):

  1. Increase the value of the memory_limit parameter.
  2. Restart the Replication Server.
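A sketch of the fix, doubling the limit reported in the error message; treat the value as illustrative, since the units of memory_limit depend on your Replication Server version.

configure replication server
set memory_limit to '40000'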

Replication suspended because RSSD restarted


RSSD SQL Server Bounced

Symptoms: These error messages are displayed in the Replication Server error log:

E. 96/09/30 . ERROR #13045 LTM USER(longdss.amercyp) -
seful/cm.c(2597)
FAILED to connect to server ’westss’ as user ’westrs_rssd_prim’. See
ct-lib and/or server error messages for more information.
I. 96/09/30 14:59:01. Trying to connect to server ’westss’ as user
’westrs_rssd_prim’ ......

After the Adaptive Server with the RSSD has been started again, the following

error messages are displayed in the Replication Server error log:
E. 96/09/30 . ERROR #1027 DIST(westss.amerttp) - seful/cm.c(2593)
Open Client Client-Library error: Error: 84083972, Severity: 5 --
’ct_connect(): unable to get layer message string: unable to get origin
message string: Net-Lib protocol driver call to connect two endpoints
failed’, Operating System error 0 -- ’Socket connect failed - errno 22’.
E. 96/09/30 . ERROR #13045 DIST(westss.amerttp) - seful/cm.c(2597)
FAILED to connect to server ’westss’ as user ’westrs_rssd_prim’. See
ct-lib and/or server error messages for more information.
I. 96/09/30 14:59:01. Trying to connect to server ’westss’ as user
’westrs_rssd_prim’ ......
E. 96/09/30 . ERROR #1027 DSI(westss.amerttp) - /dsiutil.c(278)
Open Client Client-Library error: Error: 84083974, Severity: 5 --
’ct_results(): unable to get layer message string: unable to get origin
message string: Net-Library operation terminated due to disconnect’.
E. 96/09/30 . ERROR #5097 DSI(westss.amerttp) - /dsiutil.c(281)
The ct-lib function ’ct_results’ returns FAIL for database ’westss.amerttp’.
The errors are retryable. The DSI thread will restart automatically. See ctlib
messages for more information.
...
E. 96/09/30 . ERROR #13043 LTM USER(longdss.amercyp) -
seful/cm.c(2796)
Failed to execute the ’USE westrs_rssd’ command on server ’westss’. See ctlib
and sqlserver error messages for more information.
E. 96/09/30 . ERROR #1028 LTM USER(longdss.amercyp) -
seful/cm.c(2796)
Message from server: Message: 921, State: 1, Severity: 14 -- ’Database
’westrs_rssd’ has not been recovered yet - please wait and try again.’.
I. 96/09/30 14:59:12. Message from server: Message: 5701, State: 1,
Severity: 10 -- ’Changed database context to ’master’.’.
I. 96/09/30 14:59:15. LTM for longdss.amercyp connected in passthru mode.
E. 96/09/30 . ERROR #13043 USER(westrs_ltm) - seful/cm.c(2796)
Failed to execute the ’USE westrs_rssd’ command on
server ’westss’. See ct-lib and sqlserver error messages
for more information.
E. 96/09/30 . ERROR #1028 USER(westrs_ltm) - seful/cm.c(2796)
Message from server: Message: 921, State: 1, Severity: 14 -- ’Database
’westrs_rssd’ has not been recovered yet - please wait and try again.’.
I. 96/09/30 14:59:16. Message from server: Message: 5701, State: 1,
Severity: 10 -- ’Changed database context to ’master’.’.
...
E. 96/09/30 . ERROR #13043 DIST(westss.amerttp) - seful/cm.c(2796)
Failed to execute the ’USE westrs_rssd’ command on server ’westss’. See ctlib
and sqlserver error messages for more information.
E. 96/09/30 . ERROR #1028 DIST(westss.amerttp) - seful/cm.c(2796)
Message from server: Message: 921, State: 1, Severity: 14 -- ’Database
’westrs_rssd’ has not been recovered yet - please wait and try again.’.
...

Explanation: The Adaptive Server that controls the RSSD was shut down and restarted while the Replication Server was running. The DIST and SQT threads for the databases controlled by the Replication Server were terminated. Replication to those databases stops and does not resume even after the RSSD becomes available again.

Running the admin who_is_down command at the Replication Server shows that both the DIST and SQT threads are down, as follows:

Spid   Name   State   Info
----   ----   -----   ----------------------
       DIST   Down    westernDS.westDB
       SQT    Down    105:1 westernDS.westDB

Solution: To solve the problem:

  1. At the Replication Server, execute resume distributor for each database to resume the SQT and DIST threads (see the sketch after this list).

  2. Run admin who_is_down again to verify that the SQT and DIST threads are up.
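A sketch of step 1 for the database shown in the sample output above:

1> resume distributor westernDS.westDB
2> go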

log truncated (LTM 1079 with ASE 624)


Symptoms: This error message is displayed in the Adaptive Server error log:

97/10/17 11:57:19.34 RepAgent (10) Error: 9215,
Severity: EX_USER, State: 1
The Secondary Truncation Point is not valid. Please make sure that the Secondary Truncation Point is valid, and then restart the Rep Agent Thread for this database.

This error occurs when the log is truncated past the secondary truncation point and the RepAgent tries to reconnect to the database. RepAgent requests the log page number from the Replication Server and then accesses that page in the database. Because the log was truncated, the RepAgent will not start.

Attempts to retrieve rows from the page via RID fail because the requested RID is a higher value than the last RID on the page. Normally, truncating the log frees only those log pages that the LTM has sent to the Replication Server with confirmation.

Solution: Because you truncated the log, you must reset the information about the database log in the rs_locater table within the RSSD as follows.

1. Execute the following stored procedure in the RSSD:
> rs_zeroltm primss, primdb
> go

2. Activate the secondary truncation point again in the primary database.
> dbcc settrunc(ltm, valid)
> go

3. Restart RepAgent.

Subscription Problems


If all of the servers and Replication Server threads are running, but the subscription is not being created or dropped, use the check subscription command, which returns messages that describe the status of the materialization process.
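For example, the materialization status can be checked at both the replicate and primary Replication Servers; the subscription and replication definition names below are illustrative.

1> check subscription titles_sub for titles_rep with replicate at SYDNEY_DS.pubs2
2> go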

Atomic Materialization


This method, invoked using the default form of the create subscription command, copies subscription data through the network in a single atomic operation. Replication Server executes the rs_select_with_lock function to retrieve the primary data. This method provides complete consistency throughout the materialization process, but may temporarily obstruct transactions using the primary or replicate data. Do not use this method for large subscriptions if a long-running transaction is unacceptable in the primary database.

If the primary Replication Server is unavailable or any other problems occur, the recovering column in the rs_subscriptions system table is set to 1 to denote that the subscription requires recovery.

After the definition stage is complete, the replicate Replication Server builds the materialization queue for the subscription. You can monitor this activity with the admin who, sqm command.

Replication Server executes the rs_select_with_lock function to select subscription rows from the primary site. After the materialization queue is built, the replicate Replication Server sends an activation request to the primary Replication Server. This request is passed through the primary database via the rs_marker system function. When the primary Replication Server receives the activation request, the subscription is marked valid. All updates following the request are sent to the replicate database if they match the subscription.

The table below describes solutions for atomic materialization problems based on the status returned by check subscription.

Replicate Status: Defined/Pending
Primary Status: Invalid
Subscription State: Waiting for other subscriptions for the same replication definition and replicate database to be processed.
Suggested Actions: Check for other subscriptions being created and dropped for the same replication definition and replicate database. If there are no other subscriptions, wait five minutes.

Replicate Status: Defined/Recovering
Primary Status: Invalid
Subscription State: Cannot connect to the primary Replication Server to define the subscription.
Suggested Actions: Check the replicate Replication Server error log for messages. Make sure the user creating the subscription has the same login name and password at the primary Replication Server and the replicate Replication Server. Make sure the user has at least primary subscribe permission at the primary Replication Server.

Replicate Status: Defined/Recovering
Primary Status: Defined
Subscription State: Cannot build the materialization queue.
Suggested Actions: Verify that the primary data server is running. Make sure the user creating the subscription has the same login name and password at the primary data server and replicate Replication Server, select permission on the primary table, and execute permission for rs_marker. Investigate whether the replicate Replication Server is out of stable queue segments. Use the admin disk_space command. Use admin who, dsi and admin who, sqm to monitor the queues.

Nonatomic materialization


This method, invoked using the create subscription command with the without holdlock clause, is similar to the atomic method, except that consistency constraints during materialization are relaxed to allow clients at the primary database to process transactions during materialization. Replication Server executes the rs_select function to retrieve the primary data. Subscription data is copied in a series of transactions.

Because users are allowed to update primary data, this method may result in transactional inconsistency and incomplete data during materialization. When materialization is complete, all inconsistencies are fully corrected. Autocorrection for the replicate table must be enabled to resolve inconsistencies.

Replicate Status: Defined/Pending
Primary Status: Invalid
Subscription State: The replicate Replication Server is waiting for other subscriptions for the same replication definition and replicate database to be created or dropped.
Suggested Actions: Look for other subscriptions being created or dropped for the same replication definition and replicate database. If there are no other subscriptions, wait for five minutes.

Replicate Status: Defined/Recovering
Primary Status: Invalid
Subscription State: The replicate Replication Server cannot connect to the primary Replication Server to define the subscription.
Suggested Actions: Check the replicate Replication Server error log for messages. Make sure the user creating the subscription has the same login name and password at the primary Replication Server and the replicate Replication Server. The user should have at least primary subscribe permission on the primary Replication Server.

Replicate Status: Defined/Recovering
Primary Status: Defined
Subscription State: The replicate Replication Server cannot build the materialization queue or cannot send the activation request to the primary Replication Server.
Suggested Actions: Verify that the primary data server is running. Make sure the user creating the subscription has the same login name and password at the primary data server, select permission on the primary table, and execute permission for rs_marker. Investigate whether the replicate Replication Server is out of stable queue segments. Verify that the primary Replication Server is running and that the SQM, SQT, and DIST threads for the primary database are running. Investigate whether the primary Replication Server is out of segments.

Bulk Materialization


This method is appropriate when there is too much data to copy through the network. This is a “manual” materialization method that allows you to load the subscription data from media such as magnetic tape.  Bulk materialization uses these commands, which are executed at different points in the materialization process: define subscription, activate subscription, validate subscription. Use the check subscription command to check the status of the subscription.
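A sketch of the bulk materialization command sequence, using illustrative subscription and replication definition names; the data load itself (for example, from tape or via bcp) happens between activation and validation.

define subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2

activate subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
with suspension

After loading the subscription data at the replicate and resuming the connection, validate the subscription:

validate subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2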

Bulk materialization problems—activation phase

Replicate Status: Activating/Recovering
Primary Status: Defined
Subscription State: Cannot send the activation request to the primary Replication Server.
Suggested Actions: Verify that the primary Replication Server is running and that the SQM, SQT, and DIST threads for the primary database are running. Check the error log in the replicate Replication Server. Investigate whether the primary Replication Server has run out of segments.


Another diagnostic tool for comparing subscription data is rs_subcmp. The rs_subcmp utility is located in the Sybase installation’s bin directory.
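A hedged sketch of an rs_subcmp configuration file (saved here as titles_cmp.cfg) and invocation; the key names and the -f option are from memory and should be verified against the utility guide for your version before use.

PDS = TOKYO_DS
RDS = SYDNEY_DS
PDB = db1
RDB = pubs2
PTABLE = titles
RTABLE = titles
PSELECT = select title_id, title, price from titles order by title_id
RSELECT = select title_id, title, price from titles order by title_id
PUSER = sa
PPWD = sa_password
RUSER = sa
RPWD = sa_password
KEY = title_id
RECONCILE = Y
VISUAL = Y

rs_subcmp -f titles_cmp.cfg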

Data Server Interface Problems


The DSI applies transactions from a Replication Server stable queue to a data server, which can be an Adaptive Server or another data server with an interface to the Replication Server. Before a transaction is applied, Replication Server uses function strings to convert the commands to the format specified for the data server.

If the data server is not receiving updates, the DSI may be suspended or down. A down or suspended DSI can be caused by:

  • Incorrect login or permissions
  • A data server error
  • A Replication Server error

Use admin who, dsi and admin who_is_down to check the status.

The states of DSI scheduler threads in the output of admin who and admin who, dsi are:

  • Active – the thread is starting, restarting after an internal error, or logging an exception to the RSSD.

  • Awaiting Command – the thread is waiting for a transaction to become available in the stable queue for the database. The queue contains no complete transaction at this moment.

  • Awaiting Wakeup – the thread is sleeping before it restarts. It sleeps for two minutes after an error that can be retried. During the two-minute interval, the cause of the error may disappear or the System Administrator may correct the problem. If the problem is corrected, the thread can restart without error after the sleep interval.

A failed transaction is retried when it causes a data server error to which you have assigned the retry_stop or retry_log action using the assign action command.

  • Awaiting Message – the thread has dispatched transactions to the DSI executor threads and is waiting for them to complete.

  • Suspended – the DSI connection has been suspended by a user command, by an error, or by a drop subscription or activate subscription command with the with suspension clause. This state appears only in admin who output.

  • Down – the thread has not been started. This state appears only in admin who output.




The states of DSI executor threads in the output of admin who are:

  • Active – the thread is starting, executing a transaction at the data server, logging an exception to the RSSD, or restarting after an internal error.

  • Awaiting Command – the thread is waiting to receive another transaction from the DSI scheduler thread.

  • Awaiting Message – the thread is processing a transaction and is waiting for another thread to complete processing its transaction, or the thread is waiting to receive another statement from the Stable Queue Transaction interface (SQT).

  • Suspended – the connection has been suspended by a user command, an error, or a drop subscription or activate subscription command using the with suspension clause.

  • Down – the meaning of this state depends on the state of the DSI scheduler thread, as follows:

    - If the DSI scheduler thread is Down, the DSI executor thread connection was suspended when the Replication Server was started, and the connection has not been resumed.

    - If the DSI scheduler thread is Active or Awaiting Wakeup, the DSI executor thread connection is recovering from a retryable error and is starting or restarting.

Troubleshooting procedure


If changes made to a primary database are not arriving at a replicate database that has active subscriptions, it is possible that the DSI for the replicate database is suspended.

Follow these steps to troubleshoot the DSI for the replicate database:

  1. Log in to the Replication Server that controls the replicate database and execute the admin who, dsi command.

  2. If a database does not have an entry, check the Replication Server error log to see if the DSI for the database was not started or was suspended.

  3. If there is an entry for the database, but the state is Awaiting Command, the DSI is waiting for a transaction from the primary.

  4. If the state of the DSI thread is Active, then:
    - Unique transactions might be incorrectly resolved as duplicates, or
    - Transactions might be failing on the data server and being written to the exceptions log.

DSI down or suspended


This section describes how to solve these problems:

  • Incorrect login or permissions
  • A data server error
  • A Replication Server error

If a subscription problem causes the DSI thread for the replicate database to terminate abnormally, you can restart the thread using the resume connection command. If possible, fix whatever condition caused the problem before resuming the connection. For example, if the maintenance user does not have update permission on the replicate table, first grant the user update permission and then resume the connection.

If you cannot fix the problem, resuming the connection will cause the DSI thread to reexecute the command that failed, and suspend the DSI again. To prevent this cycle, assign a different action to the error returned to the DSI. You must assign error actions at the Replication Server where the error class is created.

To get more information, execute the following commands at the replicate Replication Server:

1> trace 'on', dsi, dsi_buf_dump
2> go
1> resume connection to SDS.SDB
2> go

Wait a few seconds, then turn the trace off and check the Replication Server error log:

1> trace 'off', dsi, dsi_buf_dump
2> go

Data server errors - Skipping transactions


Data server errors (for example, a permissions violation or duplicate key) are logged in the Replication Server error log.  If you cannot correct the data server error, restart the DSI by requesting that the first transaction (the one causing the error) be skipped. Do this by executing the resume connection command with the skip transaction option.
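For example, using the connection name from the earlier error samples:

1> resume connection to westernDS.westDB skip transaction
2> go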

Skipping transactions :

A transaction intended for a database is skipped when:

  • The resume connection command is executed with the skip transaction option.

  • A data server error is encountered and the assigned action for the error is log or retry_log.

Skipping a transaction that causes errors is a convenient way to force the DSI to resume applying transactions to the database. However, skipping a transaction has important implications for the database:

  • Replicate data may become inconsistent with primary data.

  • If an asynchronous transaction originates in a replicate database, and the transaction is skipped at the primary data server, the primary database is not updated.

  • The inconsistency caused by skipping a transaction may result in even more errors if subsequent transactions depend upon the unapplied actions of the skipped transaction.

  • Subscription materialization and dematerialization requests are special types of transactions. If you skip a subscription materialization request, the subscription may be invalidated and must be dropped.

  • A subscription materialization or dematerialization request may cause Replication Server to create a separate stable queue, called the materialization or dematerialization queue. You cannot explicitly skip transactions in this queue using the resume connection command. They can, however, be skipped due to data server errors that are assigned the log or retry_log action.

  • If a subscription materialization or dematerialization request is skipped, all transactions in the corresponding queue may be skipped. However, if a DSI is suspended in the middle of applying the transactions in a materialization or dematerialization queue, some of the transactions in the queue may already have been committed while others may not. Later, if you use the resume connection command to skip the request, the replicate database  will have the effects of the previously committed transactions.

Because of these implications, you should skip transactions only after other means of correcting the error have been exhausted. After skipping a transaction, determine what you must do to bring the replicated data back to a state of consistency.

DSI active – but not updating


When the DSI is active but the replicate data server is not receiving updates, unique transactions might be incorrectly resolved as duplicates, duplicate transactions might be reapplied, or the transactions might be failing on the data server. If the error’s action was set to log or retry_log using the assign action command, the error and failed transaction are written to the exceptions log. You must manually resolve the data inconsistency between the primary and replicate databases or reapply the transactions.

Incorrect duplicate transaction resolution

If the origin_qid values stored in a data server or in the rs_exceptslast table are modified by mistake, non-duplicate transactions can be ignored or duplicate transactions can be reapplied. If you suspect that this is happening in your system, check the stored values and compare them with the transactions in the database’s stable queue to determine the validity of the values. If the values are wrong, you must modify them directly.
The DSI records the last transaction committed or written into the exceptions log so that it can detect duplicates after a system restart. Each transaction is identified by a unique origin database ID and an origin queue ID that increases for each transaction.

The last transaction committed from each origin database is recorded on a data server through execution of the function strings defined for the data server’s function-string class. For the default function-string class, rs_sqlserver_function_class, this is done in the function string of a commit command, that is, the rs_commit function. The function-string class of every data server must support the rs_get_lastcommit function which returns the origin_qid and secondary_qid for each origin database. The secondary_qid is the ID of the queue used for subscription materialization or dematerialization.

The origin_qid and secondary_qid for the last transaction written into the exception log from each origin is recorded in the rs_exceptslast table in the RSSD. However, transactions logged explicitly by the sysadmin log_first_tran command are not recorded in this table. These transactions are logged, but they are not skipped.

When a data server interface is started or restarted, it gets the origin_qid returned by the rs_get_lastcommit function and the one stored in the rs_exceptslast table. It assumes that any transaction in the queue with an origin_qid less than the larger of these two values is a duplicate and ignores it.
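As a sketch, the two stored values can be inspected by hand. The query against rs_lastcommit assumes the default function-string class, where rs_get_lastcommit reads that table in the replicate database.

In the replicate database:

1> select origin, origin_qid, secondary_qid from rs_lastcommit
2> go

In the RSSD:

1> select * from rs_exceptslast
2> go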

Examining the exceptions log

If a transaction is skipped, either explicitly using the resume connection command with the skip transaction option, or as a result of the log action assigned to data server errors, the skipped transaction is written into the exceptions log. Orphan transactions and transactions logged by the sysadmin log_first_tran command are also in the exceptions log.

If a replicate database is not receiving updates, it is possible that the update transactions were skipped and written into the exceptions log. Examine the exceptions log to find out what transactions have been logged. The exception log consists of three tables: rs_exceptshdr, rs_exceptscmd, and rs_systext. The rs_exceptshdr table has one entry for each transaction. The rs_exceptscmd table has one entry for each command (either source or output) of the transaction. The rs_systext table stores the text of the commands.

To view the header information for all logged transactions intended for a database, log in to the system Adaptive Server for the Replication Server controlling the database, and execute the following query against the RSSD:

select * from rs_exceptshdr
where error_site = data_server
and error_db = database
order by log_time

The rows are returned in ascending order, by the time the transactions were logged. If you want the rows to appear in descending order, attach “desc” to the end of the query.

Join the three system tables to view all information about a transaction in the log. The following query gives the source commands and their corresponding output commands for each logged transaction:

select hdr.sys_trans_id, cmd_type, textval
from rs_exceptshdr hdr, rs_exceptscmd cmd, rs_systext
where error_site = data_server
and error_db = database
and hdr.sys_trans_id = cmd.sys_trans_id
and cmd_id = parentid
order by log_time, src_cmd_line, output_cmd_index, sequence

Adaptive Server Log Problems

Truncating an Adaptive Server log


When the database log is full, Adaptive Server cannot continue to process updates. To solve this problem you must truncate the log. However, you should not truncate transactions from the log before the RepAgent or LTM has passed them successfully to the Replication Server.

Adaptive Server uses truncation points to ensure that only transactions processed by the RepAgent or LTM are truncated. A secondary truncation point marks the place in the primary database log up to which the RepAgent or LTM has processed transactions. The RepAgent or LTM periodically updates the secondary truncation point to reflect transactions successfully passed to the Replication Server. Adaptive Server will not truncate the log past the secondary truncation point.

There are, however, times when it is necessary to truncate the log beyond the secondary truncation point.  In such cases, you need to temporarily remove the secondary truncation point and dump the transaction log.  This may result in the loss of updates to the replicate database; the primary database is not affected. You must reapply the lost updates to the replicate database.

Removing Secondary Truncation Point:

> use Parts
> go
> dbcc settrunc(ltm, ignore)
> go




Setting Secondary Truncation Point:

> use Parts
> go
> dbcc settrunc(ltm, valid)
> go

Verifying Secondary Truncation Point :

> use Parts
> go
> dbcc gettrunc
> go

The ltm_trunc_state column in the output from this command should contain a 1.

Log truncation problems


Orphaned transactions

An orphaned transaction is a transaction in an inbound stable queue that is missing a terminating commit or rollback command.

To check for orphaned transactions:

  1. Log in to the Replication Server and execute admin who, sqt to display information about the stable queues at the Replication Server.

  2. Use the output from admin who, sqt to identify the entry for the inbound queue of the database whose log was truncated.

The queue has a two-part name formed from the database ID and the queue-type identifier. For an inbound queue, the queue-type identifier is 1.  For example, if the database ID is 101, the inbound queue name is 101:1.

  3. If the output for the inbound queue shows an open transaction that does not change over long periods of time, an orphaned transaction is probably in the queue. However, because it is difficult to distinguish between an orphaned transaction and a very long transaction, you must further verify that an orphaned transaction exists.

To verify that there is an orphaned transaction:

·         Dump the stable queue and examine the information about the transaction. You can dump just the begin record to find the user name and the time the command was executed, and then see if the user still has open transactions in the database.

·         Even if you find that the user does not have open transactions, the queue may still have an orphaned transaction. If the queue is large, the RepAgent or LTM may not be keeping up with it. To determine whether this is the case, dump the last block in the queue and look at the date for the commands. Use admin who, sqm to find the last block.

  4. If the transaction is orphaned, use the sysadmin purge_first_open command to skip it.

  5. Use the output from dumping the stable queue to manually apply the orphaned transaction to the replicate database.

Dumping Stable Queues


Stable queues are composed of segments stored on disk partitions. Each segment has a megabyte of message space, divided into 64 blocks of 16KB. Segments store messages passed by the system. Each block in a segment corresponds to a physical data transfer between disk and memory. A segment is allocated to just one queue—different queues cannot share segments. Each block within a segment may store one or more messages. A map of the system segments is stored in the rs_segments table of the RSSD.

Replication Server deletes messages from queues on a segment basis. Internally, it deletes blocks from the segments. However, the freed space does not appear in the system tables until all of the blocks contained in the 1MB segment are deleted.

When a stable queue is dumped, a segment may contain deleted blocks along with the undeleted blocks. You can identify the undeleted blocks of a segment by checking the First Seg.Block column in the admin who, sqm output for the queue.

This list briefly describes each command:

  • sysadmin dump_queue – dumps the contents of the transaction cache of an inbound or outbound stable queue.
  • sysadmin sqt_dump_queue – dumps the contents of the transaction cache of an inbound queue.
  • sysadmin dump_file – specifies an alternate log file to be used by sysadmin dump_queue or sysadmin sqt_dump_queue.
  • sysadmin purge_all_open – purges all open transactions from an inbound queue.
  • sysadmin purge_first_open – purges only the first open transaction from an inbound queue.

Using trace

In the Replication Server:

1> trace 'on', 'dsi', 'dsi_buf_dump'
2> go

Using sysadmin dump_queue

The sysadmin dump_queue command dumps the contents of a Replication Server stable queue to the Replication Server log, an alternate log file, the client, or to tables in the RSSD.

sysadmin dump_queue, q_number, q_type, seg, blk, cnt [,RSSD | client]

q_number, q_type – use admin who, sqm to get the values

seg – -1: the starting segment is the first active segment; -2: the starting segment is the first segment in the queue, including any inactive segments retained by setting a save interval

blk (block 1–64) – -1: start with the first undeleted block; -2: start with the first unread block of the queue

cnt – the number of blocks to dump

RSSD – dumps to the RSSD; see the rs_queuemsg and rs_queuemsgtxt tables

client – sends the messages to the client

Or use sysadmin dump_file, file_name

Confirming Suspected Problems:  admin who, sqt

Check which inbound queue has an open transaction. In the Info column it is marked with “st:O”.

Ex: 103:1  st:O, cmds:3, qid:21:28:0

Ex: sysadmin dump_queue, 103, 1, 21, 28, 1, RSSD

select *
from rs_queuemsg
where q_number = 103
and q_type = 1
and q_seg = 21
and q_blk = 28
and q_row = 0

2.         sysadmin sqt_dump_queue – dumps the transaction (SQT) cache of an inbound queue


3.         sysadmin purge

To purge all open transactions: sysadmin purge_all_open, q_number, q_type
To purge only the first open transaction: sysadmin purge_first_open, q_number, q_type