When a program to remove old users from the system went wrong, it started a process that threatened to delete thousands of user accounts. Find out if our administrator hero was able to stop the diabolical program and save the day.
Our company has approximately 20,000 Notes licenses worldwide. We have automated many administrator tasks by developing our own solutions on top of Notes using AdminP, C and C++ APIs, Notes script and more. The creation of a user, ID-file, mailbox etc. is therefore automatically carried out when a new employee is registered or a current employee changes departments or leaves the company.
A developer was working on some changes to the program that extracts information from the personnel systems. He was testing against the users in our test environment, so he added a condition to the program's query that restricted it to one specific user.
Once the test had succeeded, he recommended that the program be incorporated into the regular management procedures. But he hadn't removed the extra condition that restricted the query to that one user.
On the first day of the new program, all seemed normal at first. However, the program had run during the night, and because it selected only one user, it produced a delete request for every other employee in the company.
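The failure mode can be sketched in a few lines of Python. This is only an illustration and not the program from the story: it assumes the delete requests came from comparing the personnel extract against the list of known users, and every name in it (`extract_employees`, `delete_requests`, `only_user`, the sample users) is hypothetical.

```python
def extract_employees(personnel, only_user=None):
    """Return the set of employees the extract reports as active.

    only_user stands in for the leftover test condition (hypothetical name).
    """
    return {e for e in personnel if only_user is None or e == only_user}


def delete_requests(directory, extract):
    """Assumed rule: anyone in the directory but missing from the extract is deleted."""
    return sorted(directory - extract)


personnel = {"alice", "bob", "carol", "dave"}
directory = {"alice", "bob", "carol", "dave"}

# Normal run: the extract matches the directory, so nothing is deleted.
normal = delete_requests(directory, extract_employees(personnel))

# Faulty run: the test condition is still in place, so everyone else looks gone.
faulty = delete_requests(directory, extract_employees(personnel, only_user="alice"))

print(normal)  # []
print(faulty)  # ['bob', 'carol', 'dave']
```

Under that assumption, a filter that was harmless in the test environment turns into a near-total delete list the moment it runs against production data.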
This was automatically downloaded to Notes through MQ around 4:00. Our C program picked up the delete requests, deleted the users in our internal organizational database and created an AdminP request to carry out the first step in our deletion procedure. An agent then checked hourly whether each AdminP request had been carried out and, if so, created an AdminP document for the next step in the deletion process.
That morning, our organizational database contained only a couple of users. This didn't affect mail routing; it only prevented users from looking up other users' mail addresses manually.
By looking into the database that held all the delete requests added that morning, we quickly understood the size of the problem. And the developer quickly had an answer as to why we had received so many delete requests.
We rushed to halt the agent manager on the admin server and stopped the C program from running. After a short crisis meeting, we had a plan: shut down all Notes servers to prevent further deletions, because we saw that AdminP had a huge list of delete requests that had not yet been carried out. Within a couple of hours, we had all servers down and a full picture of how far the deletion process had progressed. Our mail databases are deleted only a couple of months after an employee leaves the company, so no mail files had been deleted yet.
AdminP had deleted only about 500 users from the ACLs of their mail files before we shut down the servers. We then rebuilt our organizational database and the NAB manually on the admin server by creating new doc IDs for every document. (Luckily one employee had replicated the NAB and organizational database to a laptop late the previous evening -- this sped up the process; otherwise we would have had to restore from backup.) We then manually added the users back to the ACLs of 300 to 500 mail databases. Other administrators used server console tools to remove all delete requests from the AdminP databases on every server worldwide. And, of course, we deleted all pending delete requests in our administration databases, as well as in the application database that received delete requests.
At lunchtime we started the admin server again. Then we began manually replicating the AdminP database, the NAB and the organizational database to every Notes server -- still without having started the Notes servers. Soon after that we started our intranet servers, because they are accessed by browsers and serve as our information channel to all users.
We then started the first mail server again and had the last one running an hour later, all while closely monitoring AdminP and every other process. In the meantime, the developers had corrected the program that caused the problem, and once all databases had been restored to the state they were in the previous evening, the program was moved to production.
My guess is that we had about 10 to 15 Notes and NT administrators working to correct this problem. We disabled the automatic process that night so that we could check what the program delivered in its next nightly run. The output turned out to be OK, so we turned the process on again and everything worked fine.
For the following two days, two administrators were assigned to clean up some minor leftovers from this incident. But we certainly put our capacity to manage a crisis to the test. And we have now built control points into our automatic procedures that take effect when the number of changes reaches a certain level, to prevent things like this from happening again.
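A control point of that kind might look something like the following Python sketch. It is an assumption about the general idea, not a description of our actual procedures: the function name, the exception class and the 1 percent threshold are all made up for illustration.

```python
class ChangeVolumeError(Exception):
    """Raised when a batch contains suspiciously many changes (hypothetical class)."""


def check_change_volume(delete_count, total_users, max_fraction=0.01):
    """Return the fraction of users affected, or raise if it exceeds max_fraction.

    max_fraction=0.01 is an illustrative threshold, not a value from the story.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    fraction = delete_count / total_users
    if fraction > max_fraction:
        raise ChangeVolumeError(
            f"{delete_count} of {total_users} users would be deleted "
            f"({fraction:.1%}); holding the batch for manual review"
        )
    return fraction


# A typical night: a handful of leavers passes the check.
ok_fraction = check_change_volume(50, 20_000)

# A night like the one in the story: the batch is held instead of processed.
try:
    check_change_volume(19_999, 20_000)
    blocked = False
except ChangeVolumeError as err:
    blocked = True
    print(err)
```

The point of the design is that an oversized batch stops before any delete request reaches AdminP, so a person can look at it first.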
Every story in our bloopers series comes to us directly from a SearchDomino.com administrator, developer or consultant. For obvious reasons, some contributors -- including this tale's author -- choose to remain anonymous.
This was first published in July 2003