Domino server clusters are an excellent way to provide failover in the event of a server outage, but they provide little benefit if the data on the servers is not kept in sync. This tip explains how to proactively monitor your clusters so you can tell whether you need to add cluster replicators to increase reliability.
"There is nothing there!" he whined. Well, actually, there was something there, just not anything that he had worked on in the last two hours. And from the pained look on his face, I half expected this big and burly man to softly murmur, "My precious ... my precious...."
But to my Domino Upgrade project manager, it sure felt like everything was missing. The cluster had protected him by letting him fail over to his backup server when his mail server took a nosedive, but where were all the messages he received from 10 a.m. to noon, and what happened to the meetings he had planned?
They were gone! Victims of his mail server's inability to keep up with cluster replication. He would see these precious messages and appointments only after his mail server was raised from the dead -- and who knew how long that was going to be?
It's not that his mail server was a feeble box. It was a new system on a platform that could kick butt, carefully selected to handle the consolidated load of four old mail servers. The only problem was that no one had seriously monitored the box while users were migrated, and the cluster replicator had been overtaken by force. It was a full two hours behind.
Fortunately, this is a situation that's easy to avoid if you don't mind being a little proactive. If cluster replication is falling behind, just add cluster_replicators=n (where n is the number of replicators to use) to the server's Notes.ini and restart the server.
Lotus legends tell us that the number of cluster replicators should be some tangible factor like the number of processors used by the Domino server plus one, but I've found it more effective to use double the number of processors. You don't want to use too many though, as they do take resources that might be better used elsewhere.
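The two rules of thumb above are simple arithmetic, and a minimal sketch makes the difference concrete. The function names here are hypothetical helpers invented for illustration; only the cluster_replicators setting itself comes from this tip.

```python
def suggested_replicators(processors: int) -> dict:
    """Return both rules of thumb for a given processor count.

    Hypothetical helper for illustration only; neither figure is an
    official sizing rule, just the two heuristics discussed in the tip.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return {
        "processors_plus_one": processors + 1,  # the "Lotus legend" rule
        "double_processors": processors * 2,    # the author's preference
    }


def notes_ini_line(count: int) -> str:
    """Format the Notes.ini entry for a chosen replicator count."""
    return f"cluster_replicators={count}"


if __name__ == "__main__":
    rules = suggested_replicators(4)
    print(rules)
    print(notes_ini_line(rules["double_processors"]))
```

Treat either number as a starting point and let the queue stats decide whether it is enough.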
You can tell if a server needs more cluster replicators by watching a pair of stats that are automatically generated and kept in StatRep.nsf. These are:
SecondsOnQueue: This tells you the total time, in seconds, that the last database replicated spent waiting on the cluster's replication work queue.
WorkQueueDepthCurrent: This tells you the number of databases awaiting replication by the Cluster Replicator.
There are actually three versions of each of these stats: the current value, the average and the maximum.
Watch the max as an indicator of how bad things got, even for a moment. The averages can vary a bit, but they are important trending indicators. The current snapshot should always be in the single digits. Anything other than that and the townspeople will fire up their torches, grab their pitchforks and storm your office when their server keels over, because the failover copy of their precious mail file or powerful all-knowing application will be missing their favorite part -- the part they added recently.
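The "single digits" rule is easy to check mechanically once you have the stats as text. Here is a rough sketch, assuming console output in a "name = value" layout. The sample text and the exact stat names in it are assumptions made for illustration, so adjust them to whatever your server actually reports.

```python
import re

# Assumed sample of console output; the names and values are illustrative only.
SAMPLE_OUTPUT = """\
  Replica.Cluster.SecondsOnQueue = 3
  Replica.Cluster.SecondsOnQueue.Avg = 12
  Replica.Cluster.SecondsOnQueue.Max = 7200
  Replica.Cluster.WorkQueueDepth = 14
  Replica.Cluster.WorkQueueDepth.Avg = 5
  Replica.Cluster.WorkQueueDepth.Max = 310
"""

# Matches lines such as "  Replica.Cluster.SecondsOnQueue.Max = 7,200".
STAT_LINE = re.compile(r"^\s*(Replica\.Cluster\.[\w.]+)\s*=\s*([\d,]+)\s*$")

# The "current snapshot" stats that should stay in the single digits.
WATCHED = ("Replica.Cluster.SecondsOnQueue", "Replica.Cluster.WorkQueueDepth")


def parse_stats(text: str) -> dict:
    """Collect integer-valued Replica.Cluster.* stats from console text."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2).replace(",", ""))
    return stats


def falling_behind(stats: dict, watched=WATCHED) -> list:
    """Return the watched stats whose current value is 10 or more."""
    return [name for name in watched if stats.get(name, 0) >= 10]


if __name__ == "__main__":
    print(falling_behind(parse_stats(SAMPLE_OUTPUT)))
```

A stat missing from the text is treated as healthy here, which is a design shortcut; a stricter check would report it as unknown instead.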
The problem is that getting to these stats is a pain in the neck. You need to bust open the console, and type: Show stat replica.cluster.* for each server in the joint. Or you need to go to Statrep and do a Document Properties and scroll through the field list.
And who has time for that nonsense? It's downright humiliating to be forced to actually work for a living! Administrators have better things to do, like dealing with users who accidentally delete their mail files.
Because this bit of cluster checking requires effort, it rarely gets done with enough regularity to spot cluster replication problems, so those problems rarely get fixed before they get out of hand.
So, as I always say, if you can't win, change the rules.
Spark up Designer and have a whack at Statrep. Make a copy of the "Statistics Reports Clusters" and call it something terribly creative, like "Statistics Reports Clusters w/queue info." Then add three columns as follows:
Showing these stats as minutes rather than seconds makes the output easier to understand at a glance.
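The conversion itself is just a division by 60. This small sketch shows the arithmetic in Python rather than in a view column formula, and the function name is a hypothetical one chosen for illustration.

```python
def seconds_to_minutes(seconds: float, digits: int = 1) -> float:
    """Convert a SecondsOnQueue-style value to minutes for display."""
    return round(seconds / 60, digits)


if __name__ == "__main__":
    # A queue that is two hours behind reads far more clearly as minutes.
    print(seconds_to_minutes(7200))
    print(seconds_to_minutes(90))
```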
Then add three more:
Throw in an additional column using the field Server.Availability.Index if you want to see how hard your servers are working. The closer this gets to zero, the more sweat your box is generating.
If you are using 6.x, you'll probably want to add an outline entry to the MainOutline that accommodates your cool new view.
Your easy bit of work will produce a beautiful and wonderfully informative view that looks like this:
You can impress your administrator friends and tell at a glance if you need more cluster replicators to keep everything in sync.
OK, OK ... you will still have to LOOK at Statrep occasionally. I didn't say it would be all peaches and cream. Your server won't adjust its own cluster_replicators number.
Or will it? Perhaps I could write some sort of agent ... one linked to an event that checks the stat and then adjusts the parameter automatically for me -- according to my demands as its master!
On second thought, I think I'd rather let my human side deal with a decision to add cluster replicators. You must take into consideration all sorts of things that might not be readily apparent, such as a temporary load on the box, or the initialization of new replicas.
After all, server resources are really limited. They're quite finite. And well, they are downright precious to me.
About the author: Andy Pedisich is President of Technotics, Inc. He has 20 years of experience in IT and has been working with Notes and Domino since Release 2. He has worked as a trainer, presenter and consultant on a variety of upgrade, migration and administration projects for clients ranging from small businesses to Fortune 500 organizations with 100,000 seats. Technotics provides strategic consulting and training on messaging and collaborative infrastructure projects for customers throughout the world. You can contact Technotics through their Web site at www.technotics.com.
What I'd really like to know is how to, or if Lotus will ever let us, turn off the annoying cluster replication messages that R6 writes to the console log!
I am trying to create the custom view within the statrep.nsf, but I don't see the fields listed in the databases to add to the custom view.
If the statrep is collecting statistics on clustered servers, the fields will be there. If you are trying to create a new statrep and there are no statistics yet, then the fields might not be there.
Go to the console and type:
show stat replica.*
to see if the server is properly clustered.