This week we had a demo with a very large prospective customer. Instead of our usual approach of demonstrating the product in one of our test environments, they wanted us to provision a SQL Server clustered instance in their environment.
This is generally considered a risky proposition for a demo since you have no idea what their environment will look like and the very first time you try it is in front of the customer.
It's pretty funny, because before they started the operation the customer's tech guy was saying things like, "well, it's really not that hard to do in that you just have to copy the media over to the server, fill out a few wizard pages and hit the 'Finish' button".
Our product caught fifteen things they had not properly configured in their two-node environment. A number of them were obscure enough that Microsoft's SQL Server installer would not have caught them.
The problem with SQL Server is that at face value it appears to be easy to install. It's all the edge cases and bugs that are the real problem. To Microsoft's credit, many of them are documented in their knowledge base, but by the time you get there the installation has already failed in some obscure and difficult-to-understand way, and you're doing a post-mortem trying to figure out what went wrong. In most cases, if you're looking at the knowledge base you've already spent at least an hour going to each cluster node and digging through the log files trying to find an error message to search for.
Just for fun, here's a small sampling of SQL Server bugs that cause failed installs where it is difficult to tell why the install failed. None of this is secret or proprietary information; it's all published in Microsoft's knowledge base if you look hard enough. Also bear in mind that I've picked some of the easier-to-understand issues in the interest of clarity.
- If the NetBIOS hostname of any of your servers is longer than 13 characters, the SQL Server 2000 installer will crash (kb289828). That means you see "Access Violation" and setup exits with no other discernible error message.
- If you're logged into any of the other nodes when you run setup, the install will fail with the error message "Setup failed to start on the remote machine. Check the Task Scheduler event log on the remote machine." (kb910851). This includes being logged in via Remote Desktop but "disconnected" from the session.
- If you try to deploy SQL Server 2000 onto a cluster that has more than 4 nodes, the install will go through but then you can't start the instance (kb811054). It doesn't matter how many nodes you deploy the instance onto; what matters is the total number of nodes in the Microsoft cluster. If you attempt to install a two-node instance onto a five-node cluster, the install will go through with absolutely no errors, but the instance won't start up, and it won't tell you why. The only way we figured out why it failed was by digging through the log file and putting "[clushelp.cpp:150] : 259 (0x103): No more data is available" into Google.
There's a common theme to all of these issues: none of them tells you what is really wrong or what you have to do to fix the problem. In the first case the installer crashes, in the second it gives a useless error message, and in the third it gives no error at all, but you later find out that the instance you installed doesn't work.
The power of our product is that we have done the work for you. We've spent the time going through the knowledge base analyzing every documented SQL Server bug. We've written pre-checks that look for those cases, and we validate your entire environment before any operations are performed. We've deployed hundreds of instances in all sorts of test environments, analyzed the problems we found, and come up with checks to ensure you don't encounter the same thing. And when a pre-check finds a problem, our software tells you how to fix the issue or points you to the Microsoft knowledge base article describing it in more detail.
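To give a feel for what a pre-check looks like, here is a minimal Python sketch covering the three issues listed above. This is an illustration only, not our product's actual code: the `Node`, `Cluster`, and `Finding` types and all function names are hypothetical, and the sketch assumes the node inventory (names, logged-on session counts) has already been collected by some other means.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical snapshot of one cluster node, gathered before setup runs."""
    netbios_name: str
    logged_on_sessions: int = 0  # assumed to include disconnected Remote Desktop sessions


@dataclass
class Cluster:
    nodes: List[Node]   # every node in the Microsoft cluster, not just install targets
    setup_node: str     # NetBIOS name of the node where setup will be launched


@dataclass
class Finding:
    check: str
    message: str
    kb: str  # knowledge base article to point the user to


def check_hostname_length(cluster: Cluster) -> List[Finding]:
    # kb289828: names longer than 13 characters crash the SQL Server 2000 installer.
    return [
        Finding("hostname-length",
                f"NetBIOS name '{n.netbios_name}' is longer than 13 characters.",
                "kb289828")
        for n in cluster.nodes if len(n.netbios_name) > 13
    ]


def check_remote_sessions(cluster: Cluster) -> List[Finding]:
    # kb910851: any logon session on another node makes remote setup fail.
    return [
        Finding("remote-sessions",
                f"Log off all sessions on '{n.netbios_name}' before running setup.",
                "kb910851")
        for n in cluster.nodes
        if n.netbios_name != cluster.setup_node and n.logged_on_sessions > 0
    ]


def check_cluster_size(cluster: Cluster) -> List[Finding]:
    # kb811054: the total node count matters, not the number of install targets.
    if len(cluster.nodes) > 4:
        return [Finding("cluster-size",
                        f"Cluster has {len(cluster.nodes)} nodes; a SQL Server 2000 "
                        "instance installed here will not start.",
                        "kb811054")]
    return []


ALL_CHECKS: List[Callable[[Cluster], List[Finding]]] = [
    check_hostname_length,
    check_remote_sessions,
    check_cluster_size,
]


def run_prechecks(cluster: Cluster) -> List[Finding]:
    """Run every check up front and collect all findings before touching anything."""
    findings: List[Finding] = []
    for check in ALL_CHECKS:
        findings.extend(check(cluster))
    return findings


if __name__ == "__main__":
    demo = Cluster(
        nodes=[Node("SQLNODE01"), Node("SQLNODE02", logged_on_sessions=1)],
        setup_node="SQLNODE01",
    )
    for f in run_prechecks(demo):
        print(f"[{f.check}] {f.message} See {f.kb}.")
```

The point of the design is that every check runs before any install operation starts, and each finding carries both a plain-language message and the relevant knowledge base article, so you learn what is wrong up front instead of reconstructing it from log files afterwards.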
There's some personal satisfaction in seeing your product perform in a live environment, recognizing that it did its job well and solved a real problem.