XenDesktop SQL High Availability… What to use?!

The XenDesktop Site Configuration Database is an important part of your infrastructure, when it’s is down, users won’t be able to connect and IT won’t be able to make any configuration changes. Because of this you’ll probably want to implement some kind of high availability mechanism keeping your database up and running at all times, or at least to try and keep downtime at a minimum. During one of my recent presentations in which I talked about XD7 including it’s database dependency, a discussion around which type of (SQL) HA mechanism we should implement quickly formed… What options do we have?

A short intro

One of the main reasons why this database (which as of XD7 can only be SQL) has become, or is, so important, is because servers within a Flexcast Management Architecture (XenDesktop) don’t have a Local Host Cache as opposed to XenApp servers which are build upon the Independent Management Architecture. One of the main differences between IMA and FMA. With IMA all servers, although there are exceptions, have a Local Host Cache holding a recent copy of the central database, it gets synchronized every 30 minutes by default. So if the central database, for whatever reason might become unreachable, these servers will simply use the information stored in their LHC. This way new users can still connect and (almost) everything will keep working as normal, with the exception of IT being able to make any configuration changes to the Farm, no changes there.

Now that we’ve established it’s importance and the need to keep it online as close to a 100% as possible, let’s see what options we have in accomplishing this. Note that I just want to point out some of the technologies available, so no deep dives or detailed configurations. I’ll probably leave some out as well, so feel free to jump in anytime. There are tons of excellent Blogs and other types of websites that describe these technologies in great detail, I’ll reference some throughout this article.

Microsoft SQL solutions

Citrix recommends to implement SQL mirroring, with SQL Clustering coming in second. Mirroring the Database ensures that, should you lose the active database server, the automatic failover process happens in a matter of seconds, so that users are generally unaffected. Be aware that full SQL Server licenses are required on each database server, but this goes for some of the other solutions as well. Due note that SQL Server Express edition can’t be mirrored and clustering isn’t supported as well. But then again, who want’s to use SQL Express in a production environment anyway right?!

Microsoft’s SQL clustering technology can be used to automatically allow one server to take over the tasks and responsibilities of another server that has failed. However, setting up a clustered solution is more complicated, and the automatic failover process is typically slower than with alternatives such as SQL Mirroring. Again, as mentioned SQL Server Express doesn’t support clustering, at least not out of the box, but with a little creativity it sure is possible, Google is your friend!

When using SQL Server 2012 (Enterprise Edition) or later, you could also opt for something called AlwaysOn Availability Groups. A high-availability and disaster recovery solution first introduced in SQL Server 2012 to enable you to maximize availability for one or more user databases. AlwaysOn Availability Groups requires that the SQL Server instances reside on Windows Server Failover Clustering (WSFC) nodes, see below:

AlwaysOn Availability Groups

Using one of the above methods, in combination with regular (daily) back-ups, will ensure that your database will (almost) always be online. If anything, it will at least narrow your chances of running into any issues. Have a closer look and decide for yourself which one might work best for you. The above is partly referenced from the Citrix E-Docs website:

http://support.citrix.com/proddocs

VMware HA

At this point during the presentation some people were wondering if VMware HA could also be an option with regards to keeping SQL safe. After a short debate, the short and simple answer was no. Although VMware HA is a great feature on itself, it has no clue with regards to what’s running on the VM it’s protecting. I mean, your VM will stay online without to much trouble, even if one of the underlying physical host goes down, but HA doesn’t know when your SQL database goes down or stops responding due to updates gone bad, drives filled up, services stopped or something alike. You could say that VMware HA, and I guess the same goes for Hyper-V or XenServer as well, simply isn’t application oriented. That got me thinking, what other options do we have?!

VMware App HA

As part of the Enterprise Plus edition, VMware offers Application High Availability. App HA is more intelligent, it can restart failed application components or use the Application Awareness API through VMware HA to reset the VM if needed (no OS reboot). VMware App HA provides support for SQL, Tomcat, Apache, TC Server and IIS. For now no other ‘big’ applications, like Exchange, SharePoint, ShareFile and Oracle are supported. But since it’s SQL we’re interested in, it could work. When compared to similar solutions like Symantec Application HA it’s still considered a 1.0 product, as stated here:

http://virtuallylg.wordpress.com/2013/10/10/comparing-vmware-vsphere-app-ha-with-symantec-applicationha/

Make sure to check it out, it’s an excellent detailed article discussing both VMware App HA and Symantec Application HA. VMware App HA is relatively complex to set up, it requires multiple VM’s and points of management. Also, as an addition to the applications already mentioned, no custom or generic applications are supported and as far as I could tell it’s only supported on vSphere 5.5. On the other hand, if you are already using VMware combined with Enterprise Plus licenses you’ll have, not only App HA, but a whole bunch of other cool features and technologies as well. And if it’s SQL you’re supporting, then this is probably the most logical and cost effective step to take, especially for smaller and mid sized companies. However, when using ‘lighter’ editions of VMware or when real enterprise functionality is needed, more on this in a bit, then this probably won’t be your first choice.

VMware Fault Tolerance

One of my favorites, if it wasn’t for one big drawback, it only supports one vCPU! A shame, because this technology is lightning fast. This comes from the VMware website: Fault Tolerance (FT) provides continuous availability for applications in the event of server failures by creating a live shadow instance of a virtual machine that is always up-to-date with the primary virtual machine. In the event of a hardware outage, vSphere FT automatically triggers failover ensuring zero downtime and preventing data loss. After failover, vSphere FT automatically creates a new, secondary virtual machine to deliver continuous protection for the application. More information can be found here:

http://www.vmware.com/products/vsphere/features/fault-tolerance.html

If your SQL server is virtual and has only one vCPU (I’m sure it won’t be long till multiple vCPU’s will be supported) then I suggest to give this feature a closer look. It’s part of VMware’s Standard edition and higher. Again, if you already have VMware in place and your SQL server is up for it, and of course the rest of your infrastructure needs to be well, this is probably the way to go.

Symantec Application HA

Already mentioned earlier, this is probably the most Enterprise product of them all. It supports all major applications, including custom and or generic applications as well. Over 23 tier 1 applications in total. It’s based on Symantec Cluster Server, powered by Veritas agents to monitor and control applications, and we know, they know what they are doing right? It can restart failed application components, restart the underlying Operating System or restart the VM itself. And if all else fails it can use the last known good backup for restore purposes. The Symantec HA console communicates with VMware through a vCenter plugin providing centralized managed for all protected VM’s.

Of course Symantec App HA would be a separate purchase, and if you already have VMware running  (with proper licensing) you’ll probably need a strong business case to get the funds. If it’s more than SQL that needs to be supported, if you’re going ‘Greenfield’ or you may have a specific need for one, or multiple, of the enterprise class features that the Symantec product has to offer, like the number of supported applications or VM’s / agents for example (think big, one of my former colleague always just to say: www.kraak.com) then Symantec Application High Availability  might just fit the bill!

Conclusion

Again, an overview without to much technical details, I simply don’t know the products well enough. Hopefully this does give you an idea on some of the options we have when it comes to ‘securing’ the SQL Site database within our XenDesktop 7 deployments. Also, I’m sure I must have left out some other options, please feel free to comment and share your experience if any. Several factors will play a role in which solution will best fit your needs, with costs probably just being just one of them.

A lot of companies will already have SQL up and running, making the decision to implement or use existing mirroring and or clustering configurations a lot easier. The same can be said for VMware. Depending on the licenses used, implementing fault tolerance or App HA might be a logical and small step to take keeping expenses low and management known. However, even when some, or all, of the above is already in place there still might be some valid reasons to look at other solutions as well.

Bas van Kaam ©

Reference materials used: Microsoft.com, Virtuallylg.wordpress.com, VMware.com, SlideShare.net and the Citrix E-Docs website

Advertisements

13 thoughts on “XenDesktop SQL High Availability… What to use?!

  1. Ingmar Verheij

    Hi Bas,

    Great write up with a valid question, how to protect your XenDesktop environment from a SQL database failure? We’ve been discussing this sinds XD5 introduced FMA and the dependency on SQL, with XD7 “we XenApp guys” are now part of that discussion.

    The impact of loosing connectivity to the database is not as big as some people might think, the (desktop) delivery controllers keep op functioning and sessions are not affected. However, no new sessions can be started while the database is unavailable. This means that the time required to get the database connection back == the time no new sessions can be started.

    Personally I prefer to the application handle it’s own availability where possible instead of handling this on a different layer (like virtualisation). This has a number of reasons:
    1) the application vendor (Microsoft in this case) knows best how to act during “disasters ”
    2) the vendor (Microsoft) needs to support the failover mechanism from SQL. When – for instance – VMWare HA is used, why should Microsoft support you when this mechanism doesn’t work?
    3) without two nodes there’s no way to plan maintenance, unless you accept downtime. With a mirror / cluster / always on setup the passive node can be maintained while the active node serves the clients
    4) KISS – Keep It Simple Stupid : Don’t overdesign with fancy features that add complexity while there’s an acceptable solution

    Cheers
    Ingmar

    Reply
    1. Bas van Kaam Post author

      Thanks Ingmar,

      Totally agree, also on the ‘Don’t overdesign with fancy features that add complexity while there’s an acceptable solution’ although it doesn’t hurt to explore you’re options either. And in some cases, simplicity just doesn’t cut it (again, think big :-). I think there’s a use case for all options mentioned and beyond, it just depends on what you’re presented with. Good note on the maintenance!

      Regards,

      Bas.

      Reply
  2. DL

    Nice one, you should and could also enable “Offline database support”

    Offline Database (CTX129173) (Consideration if database connection is lost)

    o The following features, options, and processes remain unavailable.
    o AutoAdd target devices
    o vDisk updates
    o vDisk creation
    o Active Directory password changes
    o Stream Process startup
    o Image Update service
    o Management functions; PowerShell, MCLI, SoapServer and the Console

    I mean if you are using PVS…

    Reply
  3. Bas van Kaam Post author

    Thanks for your input DL. Hope you don’t mind, I merged both your replies :-) And you’re right, although slightly out of scope (it’s not used by default, especially in smaller deployments, and I was primarily focussing on the central Site database) I could’ve added a remark with regards to offline (PVS) database support as well. Good input none the less, thanks again.

    regards,

    Bas.

    Reply
  4. barryschiffer

    Great article!

    Just a sidestep on the actual importance of the database. I know everyone is teaching us to make it high available because oooow know what might happen if it fails:).
    When it fails, users that have an active connection will be able to keep using their session. The “only” thing is that new users or roaming users can’t (re)connect.
    The next question is, how long will it take before a crashed SQL server reboots and is up and running again? On a physical system this can take ages but on virtual systems this is a matter of a couple of minutes.
    An SQL cluster will probably take more time to do a proper fail-over, where the mirror (if whiteness is in place and working) will take seconds.
    So if a virtual database server crashes without HA, it will take only a couple of minutes for users to be able to (re)connect again.

    I’m not saying that you shouldn’t mirror your XenDesktop database, i’m just saying it depends on availability demands.

    Reply
    1. Bas van Kaam Post author

      Thanks Barry. Some valid statements, and I agree, it (for a great deal) depends on the availability demands / business case you’re presented with, although implementing some sort of HA mechanism is probably still considered a best practice by most. Manual intervention isn’t always preferred or possible.

      I was just wondering which options we have, and doing nothing like you suggest (with a solid monitoring solution in place) could be an option as well. Like many consultants will tell you… It all depends :-) I do think that Ingmar’s remark earlier, regarding maintenance, could also be of influence. Perhaps being able to perform rolling upgrades plus the fact that there’s no manual intervention needed (as mentioned above), in the case of failure, being another one.

      Mirroring, as well as AlwaysOn both support something called, automatic page repair, preventing corruption of your database, although I’m not sure how this will hold up in practice. In theory this could prevent an otherwise needed database restore, worst case scenario of course. (yeah, I got this from their website :-)

      I found this on the MS (2012) website regarding SQL mirroring: This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use AlwaysOn Availability Groups instead. Interesting since Citrix recommends using this feature :-)

      And: Database mirroring is a solution for increasing the availability of a SQL Server database. Mirroring is implemented on a per-database basis and works only with databases that use the full recovery model.

      Meaning huge transaction logs. Something Citrix likes to avoid by implementing the simple recovery model, as stated on their E-Doc website. Yes, they do also mention that, as an exception, with mirroring a full recovery mode is needed.

      Also, I think in practice a lot of SQL servers will also hold other databases as well, making the server even more important, although I don’t think that is a valid excuse, but it does happen.

      Thanks again Barry! Speak to you soon.

      Regards,

      Bas.

      Reply
      1. Ingmar Verheij

        To add on the “SQL Mirroring” solution: while it is a relatively easy to implement (and can be implemented per database basis) it has two major drawbacks:

        1) A third (or witness) site is required to avoid a split-brain (hence why the cluster has a quorum disk)
        2) The software needs to “understand” mirroring where a cluster is transparent. Since not all software have a “mirror” implementation this might force you to use a cluster (unless you want to implement both).

        Fortunately the AlwaysOn Availability Groups is a combination of both solutions making it a “best of breed” solution?

  5. Bas van Kaam Post author

    Thanks Ingmar.

    What do you mean with point 2 ‘The software needs to “understand” mirroring where a cluster is transparent. Since not all software have a “mirror” implementation this might force you to use a cluster (unless you want to implement both)’ ?

    We would only apply mirroring to SQL (SQL mirroring) right? So it will ‘understand’ for sure. Do you mean that because of this it’s less flexible then, for example, Clustering and or AlwaysOn? Although, AlwaysOn is designed just for SQL as well, right? Not sure, just asking.

    Yes, agreed on the ‘best of breed’ statement.

    Reply
    1. ingmarverheij

      What I meant was that the software needs to have the option to configure a failover host. Unlike with a cluster the software needs to detect if the primary hosts fails and contact the secondary (or failover) host. With a clustser this is done by MSCS and no additional configuration is required on the host / in the software.

      Not all software have the ability to provide a secondary host for mirror purposes. If you have software that can’t deal with a SQL mirror you might need to revert to a cluster (or AlwaysOn).

      Reply
  6. Remko Weijnen

    Bas, nice article but be aware that the known issues list for both XD7.0 and X7.1 lists an issue with AlwaysOn: In deployments where a SQL Server 2012 database is configured for high-availability using AlwaysOn and is used with XenDesktop, there may be an interval (while a database failover is in progress) when Site operations may fail; for example, VDA power management or session launches. If this occurs, retry the impacted operation after the failover is complete. [#394787]

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s