On the 15th of July, starting at 2PM AEST, we suffered an extended outage which lasted approximately 14 hours. My name is John Sherwood (@ponny on Twitter), and I'm the technical co-founder here at Gleam. I want to explain exactly what happened, the mistakes we made, and how we plan to address similar issues moving forward.
We run a reasonably large deployment of Postgres 9.3 for our backend databases, a version which has reached end of life (EOL).
Upgrading from Postgres 9.3 unavoidably involves some downtime: the database has to be shut down, all replication to the slaves must stop, and then the upgrade is performed. Once we're on 9.6 (or above), future upgrades can be performed with minimal or zero downtime.
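For context on what that downtime involves, the standard tool for this kind of major-version jump is pg_upgrade: stop the old cluster, initialise a new one, check compatibility, run the upgrade, then start the new cluster. The sketch below is a simplified illustration of that sequence rather than our actual runbook; the paths and versions are placeholders.

```python
# Hypothetical sketch of a pg_upgrade run (paths and versions are examples only).
import subprocess

OLD_BIN = "/usr/lib/postgresql/9.3/bin"
NEW_BIN = "/usr/lib/postgresql/9.6/bin"
OLD_DATA = "/var/lib/postgresql/9.3/main"
NEW_DATA = "/var/lib/postgresql/9.6/main"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop the old cluster; replication to the standbys stops with it.
run(f"{OLD_BIN}/pg_ctl", "-D", OLD_DATA, "-m", "fast", "stop")

# 2. Initialise the new cluster, then dry-run pg_upgrade to confirm compatibility.
#    NB: without explicit --encoding/--locale flags, initdb picks these up from
#    the environment.
run(f"{NEW_BIN}/initdb", "-D", NEW_DATA)
run(f"{NEW_BIN}/pg_upgrade",
    "-b", OLD_BIN, "-B", NEW_BIN,
    "-d", OLD_DATA, "-D", NEW_DATA,
    "--check")

# 3. Run the real upgrade (--link avoids copying data files) and start the new cluster.
run(f"{NEW_BIN}/pg_upgrade",
    "-b", OLD_BIN, "-B", NEW_BIN,
    "-d", OLD_DATA, "-D", NEW_DATA,
    "--link")
run(f"{NEW_BIN}/pg_ctl", "-D", NEW_DATA, "start")
```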
Knowing how important it was to get this process right, we consulted with Postgres experts 2ndQuadrant to ensure that our plan to upgrade to 9.6 aligned with their best practice recommendations.
We then ran test upgrades on other internal database clones to ensure everything would work as expected.
We scheduled our first maintenance window to perform this upgrade on April 27th. We anticipated downtime of 1 hour maximum and communicated this via our backend dashboard for 2 weeks leading up to the planned maintenance.
For any maintenance window we try to pick a time that has the least possible customer impact across all timezones. This is usually around 2pm AEST on a Monday as it's Sunday evening in the USA and Europe is still asleep.
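For anyone who wants to double-check what 2PM AEST on a Monday looks like elsewhere, Python's standard zoneinfo module makes the conversion easy. The date and cities below are just examples, not anything specific to this incident:

```python
# Quick check of what 2PM AEST on a Monday looks like in other regions
# (an arbitrary July Monday; the year and cities are illustrative only).
from datetime import datetime
from zoneinfo import ZoneInfo

window_start = datetime(2019, 7, 15, 14, 0, tzinfo=ZoneInfo("Australia/Sydney"))

for city in ["America/Los_Angeles", "America/New_York", "Europe/London", "Europe/Berlin"]:
    local = window_start.astimezone(ZoneInfo(city))
    print(f"{city:20} {local:%a %H:%M}")

# America/Los_Angeles  Sun 21:00
# America/New_York     Mon 00:00
# Europe/London        Mon 05:00
# Europe/Berlin        Mon 06:00
```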
At 2PM AEST we opened the maintenance window and began performing the upgrade.
During the first upgrade, the new cluster was created with an incorrect character set because we inadvertently relied on the environment's default settings. Despite this, the upgrade tool still reported the clusters as compatible even though the upgrade hadn't applied correctly.
Not realizing that, we performed the next step, which was to clean up the artifacts of the old cluster. This turned out to be premature: because the new cluster hadn't actually been created correctly, we ended up with no data on that server. We decided to call the upgrade off and restore access to our prior version via a separate cluster.
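One safeguard that would catch this kind of mismatch before anything is torn down is to compare the old and new clusters' encoding and locale settings directly, rather than trusting only the upgrade tool's compatibility check. A rough sketch of such a check, assuming both clusters are reachable and using the psycopg2 driver (connection details are placeholders):

```python
# Rough sketch: compare encoding/locale of two clusters before cleaning up the old one.
# Connection details are placeholders; assumes the psycopg2 driver is installed.
import psycopg2

def cluster_settings(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT datname, pg_encoding_to_char(encoding), datcollate, datctype
            FROM pg_database
            WHERE NOT datistemplate
            ORDER BY datname
        """)
        return cur.fetchall()

old = cluster_settings("host=127.0.0.1 port=5432 dbname=postgres")
new = cluster_settings("host=127.0.0.1 port=5433 dbname=postgres")

if old != new:
    raise SystemExit(f"Encoding/locale mismatch between clusters:\n old: {old}\n new: {new}")
print("Clusters match; safe to proceed with cleanup.")
```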
We kept this maintenance period within our 1 hour window and brought service back to normal; however, we knew that we'd need to schedule another window to perform the upgrade at a later date.
We prepared for a second window to perform the upgrade, leaving enough time between the two to minimise customer disruption. We performed more internal tests to ensure the previous upgrade issue wasn't going to resurface.
We sent an additional email with the maintenance window times (the same times as before) as well as an in-app alert letting customers know when it would take place. Again, we had allowed approximately 1 hour to complete the upgrade.
Our database clusters are spread across two data centres for redundancy. In one cluster we have a master (which we'll call M) plus two high availability standby servers (S1 and S2) in the same location, plus another slave (DR) in a redundant location. With this configuration, any of the slaves could take over in the event the master server fails.
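For a bit of extra context on that topology: the master knows which standbys are attached and streaming, so confirming that all three are connected and caught up before maintenance is a single query against pg_stat_replication. A minimal sketch, assuming psycopg2 and placeholder hostnames and standby names:

```python
# Sketch: confirm the expected standbys are streaming from the master before maintenance.
# Host/credentials and standby names are placeholders.
import psycopg2

EXPECTED_STANDBYS = {"s1", "s2", "dr"}

with psycopg2.connect("host=master.internal dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
    rows = cur.fetchall()

streaming = {name for name, state, _ in rows if state == "streaming"}
missing = EXPECTED_STANDBYS - streaming

for row in rows:
    print(row)
if missing:
    raise SystemExit(f"Standbys not streaming: {missing}")
```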
At 2PM AEST we started the maintenance period and began migrating S1. The upgrade completed and we ran a number of tests to ensure that data was being returned as expected. We then started the process of upgrading the other hot standby server, S2, as Postgres recommends.
Whilst watching the transfer we noticed that the rsync commands were deleting all the data on the standby servers instead of copying it. This left us with just one production-ready server, S1, in that data centre. Re-syncing all the data back to the hot slaves would now take hours. Short on options, we decided to bring the site back up with S1 as master and DR in reserve.
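In hindsight, the deletions would have been visible in a dry run of the same rsync invocation before it touched anything. The sketch below shows that kind of guard; the paths, host, and flags roughly mirror the rsync step from the Postgres standby-upgrade docs rather than the exact commands we ran, so treat them as illustrative:

```python
# Sketch: dry-run an rsync command and abort if it would delete files.
# Paths, host and flags are illustrative only.
import subprocess

RSYNC = ["rsync", "--archive", "--delete", "--hard-links", "--size-only",
         "/var/lib/postgresql/9.3", "/var/lib/postgresql/9.6",
         "standby1:/var/lib/postgresql"]

# --dry-run --itemize-changes prints what rsync *would* do; deletions show up
# as lines starting with "*deleting".
preview = subprocess.run(RSYNC + ["--dry-run", "--itemize-changes"],
                         capture_output=True, text=True, check=True)
deletions = [line for line in preview.stdout.splitlines() if line.startswith("*deleting")]

if deletions:
    print(f"Refusing to run: rsync would delete {len(deletions)} files, e.g.")
    print("\n".join(deletions[:10]))
else:
    subprocess.run(RSYNC, check=True)
```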
We brought Gleam back online at 2:25PM AEST, around 25 minutes after the maintenance period started. Soon after, we started seeing a huge number of indexing errors coming from S1, which forced us to take it offline to diagnose. The server had passed our earlier tests, but we weren't willing to risk data corruption.
Upon further testing we found that a number of the large indexes we rely on to keep data performant were corrupt. Rebuilding these would take hours, and even once complete we couldn't be sure that we had 100% data integrity. Even if we recovered from our hourly backups, the indexes would still need to be rebuilt.
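To give a sense of why rebuilding would take hours: each corrupt index has to be rebuilt from the underlying table data, and REINDEX takes strong locks while it does so. A hedged sketch of walking a schema and rebuilding its indexes with psycopg2 (host, database, and schema names are placeholders):

```python
# Sketch: rebuild every index in a schema. REINDEX rebuilds the index from the
# table data and takes strong locks, so on large tables this runs for hours.
# Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("host=s1.internal dbname=app_production")
conn.autocommit = True  # run each REINDEX in its own transaction

with conn.cursor() as cur:
    cur.execute("SELECT schemaname, indexname FROM pg_indexes WHERE schemaname = 'public'")
    indexes = cur.fetchall()

    for schema, index in indexes:
        print(f"Rebuilding {schema}.{index} ...")
        cur.execute(f'REINDEX INDEX "{schema}"."{index}"')

conn.close()
```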
We decided that, in order to prevent data loss of any sort, we had to recover from DR, which sits in another datacenter. This involved cloning and compressing over 4TB of data, then transferring it over a 1Gbps link back to our main datacenter. During this period we asked our datacenter provider whether they could upgrade our links to 10Gbps, but the account manager who handles such orders wouldn't be available until the morning.
The compressing, transferring, and restoring took close to 10 hours to complete. The data was restored to M first, then replicated much faster from M to S1, and we were able to restore service at 4:17AM AEST on the 16th without losing any customer data.
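To put that in perspective, the raw arithmetic of moving 4TB over a 1Gbps link accounts for most of those 10 hours on its own, which is also why the 10Gbps upgrade mentioned later matters. A back-of-the-envelope calculation, ignoring compression gains and protocol overhead:

```python
# Back-of-the-envelope: how long does 4TB take over a given link?
# Ignores compression gains, protocol overhead and disk speed.
def transfer_hours(terabytes, link_gbps):
    bits = terabytes * 1e12 * 8          # TB -> bits
    seconds = bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 3600

print(f"4TB over 1Gbps:  {transfer_hours(4, 1):.1f} hours")   # ~8.9 hours
print(f"4TB over 10Gbps: {transfer_hours(4, 10):.1f} hours")  # ~0.9 hours
```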
Total downtime was 14 hours and 16 minutes, from 2PM AEST on the 15th to 4:17AM AEST on the 16th (roughly 9PM to 11AM US Pacific time).
We are proud of the team who worked around the clock, some for 36 straight hours to restore service. It's important in situations like these to remain calm and work together towards the best solution.
When you're in a highly stressful situation that can potentially impact a large number of customers, there are always things you can look back on and know you could have done better.
- We made a mistake in promoting one of our existing hot slaves to master instead of having a redundant master in place first
- Having only a 1Gbps link to our redundant datacenter was the root cause of the majority of the downtime; we should have investigated a 10Gbps link sooner
- We should have cancelled the migration again as soon as S2 had problems, but we made the decision as a team and it was the wrong one
Here are the steps we're taking to ensure something of this nature doesn't happen again:
- We do still need to upgrade to Postgres 9.6 in order to do upgrades without downtime in the future
- We're investing in additional redundant hardware for easier rollbacks in the future
- We're improving our tests to check for database corruption
- We're upgrading the links between our datacenters to 10Gbps
- Once we're able to upgrade Postgres, we'll have additional options for even more robust high availability setups
We're extremely sorry that we've let you down. To everyone who was in contact during the outage, we appreciate your understanding. We're ashamed that this was by far our biggest outage in the 6 years that we've been in operation; we're continually trying to do better, but today proved that we still have a long way to go.
This outage was easily the most stressful day of my life to date. There's nothing worse than failing your customers and knowing how long recovery will take due to all the factors at play.
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand" - Norm Kerth
If you have any thoughts or advice on this incident or how we can improve going forward, please don't hesitate to email me personally or reach out to me or Stuart on Twitter.