We experienced an outage of several Photon Cloud services today:
April 30th, 2015, from 01:25 AM UTC to 08:40 AM UTC.
End users (or players) affected:
– End users in North Europe and West Europe may have experienced failures connecting to services.
– Photon Chat: not reachable for new connections from 07:21 AM to 08:40 AM UTC.
– Photon Realtime & Turnbased: partially not reachable between 01:25 AM UTC and 07:50 AM UTC.
– License Server (for self-hosted Photon Server SDK): partially not reachable between 01:25 AM UTC and 07:50 AM UTC.
All services use our Name Server (ns.exitgames.com) to resolve to the service endpoint (IP address). The availability of this name server is essential, so we decided on an HA (high availability) setup:
a) We have Name Servers in 3 Regions: US, Europe and Asia
b) Each region has multiple load-balanced servers
c) The regions are load balanced via a geo load balancing service (DNS-based)
Our license server uses the same HA setup.
The root cause was an outage of Microsoft Azure’s Network Infrastructure:
“From 30 APR, 2015 01:25 UTC to 10:17 UTC customers in North Europe and West Europe may have experienced DNS resolution failures to their services. Engineering teams have identified the root cause and deployed a mitigation, and we have confirmed that normal service availability has been restored. Engineers are continuing to work on a resolution for other impacted regions.”
(by Azure Status page: http://azure.microsoft.com/en-
As described above, we at Exit Games rely on Azure’s high availability network infrastructure to host some of our Photon Cloud services.
We were able to mitigate the issue (by switching to static IPs, using fallback services at other hosters, etc.), so that by 8:40 AM UTC, all Photon Cloud services were reachable again.
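To illustrate the "static IP" style of mitigation in general terms, here is a minimal Python sketch of a lookup that tries DNS first and falls back to a fixed address list when resolution fails. It is not our production code: the function name `resolve_with_fallback`, the `STATIC_FALLBACK_IPS` table and the addresses in it (taken from the RFC 5737 documentation range) are assumptions made purely for illustration.

```python
import socket

# Hypothetical fallback table. The addresses come from the RFC 5737
# documentation range and are NOT real Photon endpoints.
STATIC_FALLBACK_IPS = {
    "ns.exitgames.com": ["192.0.2.10", "192.0.2.11"],
}


def resolve_with_fallback(hostname, resolver=socket.gethostbyname,
                          fallbacks=STATIC_FALLBACK_IPS):
    """Return candidate IPs: the DNS answer if available, else static ones."""
    try:
        # Normal path: ask DNS for the current address.
        return [resolver(hostname)]
    except OSError:
        # socket.gaierror (DNS failure) is a subclass of OSError.
        static = fallbacks.get(hostname, [])
        if not static:
            # No fallback configured for this host: surface the DNS error.
            raise
        return list(static)
```

A caller would try the returned addresses in order. The resolver is passed in as a parameter so that the fallback path can be exercised in a test without a real DNS outage.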
By 10:17 AM UTC, all Azure services were reachable again, and by 12:30 PM UTC, we had reverted our hotfix configurations. We are now using our original, stable configuration again.
We are planning the following measures to avoid issues like these in future:
– improve our monitoring to be more “region aware”. Only certain regions, or customers from certain regions, were affected, and we were not alerted to every incident quickly enough, so we lost time before we could react.
– set up critical services & network infrastructure (like geo-based traffic routing) at a different hoster as a fallback, and make it easy to switch to the fallback services automatically.
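As a rough illustration of what "region aware" monitoring could mean, the Python sketch below groups probe results by region and flags any region whose failure rate crosses a threshold, even when the global failure rate still looks healthy. The function name `regions_to_alert`, the region labels and the 50% threshold are hypothetical and chosen only for this example; they do not describe our actual monitoring stack.

```python
from collections import defaultdict


def regions_to_alert(probe_results, threshold=0.5):
    """Return the sorted regions whose failure rate is >= threshold.

    probe_results is an iterable of (region, ok) tuples, one per probe.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, ok in probe_results:
        totals[region] += 1
        if not ok:
            failures[region] += 1
    # Evaluate each region on its own instead of one global average.
    return sorted(r for r in totals if failures[r] / totals[r] >= threshold)


# Hypothetical probe data: two European regions failing, others healthy.
probes = (
    [("eu-north", False)] * 2
    + [("eu-west", False), ("eu-west", True)]
    + [("us", True)] * 4
    + [("asia", True)] * 4
)
print(regions_to_alert(probes))
```

In this sample only 3 of 12 probes fail, so a single global check with the same 50% threshold would stay silent, while the per-region check flags both European regions.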
We sincerely apologize for the issues.
If you have further questions, please drop us a mail: firstname.lastname@example.org