On July 15th, starting at 2PM AEST, we suffered an extended outage that lasted approximately 14 hours. My name is John Sherwood (or @ponny on Twitter), and I'm the technical co-founder here at Gleam. I wanted to explain exactly what happened, the mistakes we made, and how we plan to address similar issues moving forward.
We run a reasonably large deployment of Postgres version 9.3 for our backend databases, a version which has reached EOL (End of Life).
Upgrading from Postgres 9.3 requires downtime: the database has to be shut down and all replication to slaves must stop before the upgrade is performed. Once upgraded to 9.6 (or above), future upgrades can be performed with minimal or zero downtime.
Knowing how important this process was to get right we consulted with Postgres experts 2ndQuadrant to ensure that our plan to upgrade to 9.6 aligned with their best practice recommendations.
We then ran test upgrades on other internal database clones to ensure everything would work as expected.
We scheduled our first maintenance window to perform this upgrade on April 27th. We anticipated downtime of 1 hour maximum and communicated this via our backend dashboard for 2 weeks leading up to the planned maintenance.
For any maintenance window we try to pick a time that has the least possible customer impact across all timezones. This is usually around 2pm AEST on a Monday as it's Sunday evening in the USA and Europe is still asleep.
At 2pm AEST we started our maintenance window and started to perform the upgrade.
During the first upgrade attempt, the new cluster was created with an incorrect character set because we inadvertently relied on the environment's default settings. Despite this, the upgrade tool still reported the clusters as compatible, even though the upgrade hadn't applied correctly.
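A mismatch like this is worth checking for explicitly before any upgrade is attempted. As an illustrative sketch (not our actual tooling), a pre-flight check might compare the locale-related settings of the old and new clusters and refuse to continue on any mismatch. The setting names below are real Postgres settings; the surrounding function and sample values are hypothetical:

```python
# Illustrative pre-flight check: compare locale/encoding settings between
# an old and a new Postgres cluster before attempting an upgrade.
# In practice these values would come from `SHOW server_encoding` etc.

CRITICAL_SETTINGS = ("server_encoding", "lc_collate", "lc_ctype")

def locale_mismatches(old: dict, new: dict) -> list:
    """Return the settings that differ between the two clusters."""
    return [s for s in CRITICAL_SETTINGS if old.get(s) != new.get(s)]

old_cluster = {"server_encoding": "UTF8",
               "lc_collate": "en_US.UTF-8",
               "lc_ctype": "en_US.UTF-8"}

# A new cluster initialised without explicit encoding/locale flags silently
# picks up the environment defaults, which may not match the old cluster:
new_cluster = {"server_encoding": "SQL_ASCII",
               "lc_collate": "C",
               "lc_ctype": "C"}

print(locale_mismatches(old_cluster, new_cluster))
```

A check like this turns a silent misconfiguration into a hard failure before any destructive step is taken.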
Not realizing this, we moved on to the next step and prematurely cleaned up the artifacts of the old cluster. Since the new cluster hadn't actually been created, this left us with no data on that server, so we decided to call the upgrade off and restore access to our prior version via a separate cluster.
We kept this maintenance period within our 1 hour window and brought service back to normal, however we knew that we'd need to schedule another window to perform the upgrade at a later date.
We prepared for a second window to perform the upgrade whilst giving enough time between to minimise customer disruption. We performed more internal tests to ensure the previous upgrade issue wasn't going to resurface.
We sent an additional email for the same maintenance window times and also an in-app alert letting customers know when it would be taking place. Again, we had allowed approximately 1 hour to complete the upgrade.
Our database clusters are spread across two data centres for redundancy. In one cluster we have a master (which we'll call M) plus two high availability standby servers (S1 & S2) in the same location, plus another slave (DR) in a redundant location. With this configuration, either of the slaves could take over if the master server fails.
At 2PM AEST we started the maintenance period and began migrating S1. The upgrade completed and we ran a number of tests to ensure that data was being returned as expected. We then started upgrading the other hot standby server, S2, as Postgres recommends.
Whilst watching the transfer we noticed that the rsync commands were deleting all the data on the standby servers instead of copying it. This left us with just one production-ready server, S1, in that data centre. Re-syncing all the data back to the hot slaves would now take hours. Stuck for options, we decided to bring the site back up with S1 as master and DR in reserve.
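Behaviour like this is exactly why rsync's delete pass is worth previewing before it touches real data. As a hedged sketch (the paths are made up, and this is not the exact command we ran), a helper that defaults to a dry run makes the safe path the default; the rsync flags shown are real ones:

```python
# Illustrative helper: assemble an rsync command, defaulting to --dry-run
# so the delete pass can be previewed before it touches the destination.
def build_rsync_cmd(src: str, dest: str, dry_run: bool = True) -> list:
    cmd = ["rsync", "--archive", "--delete", "--hard-links", "--size-only"]
    if dry_run:
        # With --dry-run and --verbose, rsync prints "deleting <file>"
        # lines instead of actually removing anything.
        cmd += ["--dry-run", "--verbose"]
    cmd += [src, dest]
    return cmd

# Hypothetical paths for a standby data-directory sync:
print(build_rsync_cmd("/var/lib/postgresql/9.6/main/",
                      "standby:/var/lib/postgresql/9.6/main/"))
```

Only after eyeballing the dry-run output would the same command be re-issued with `dry_run=False`.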
We brought Gleam back online around 25 minutes after the maintenance period started, at 2:25PM AEST. Soon after, we started seeing a huge number of indexing errors coming from S1, which forced us to take it offline to diagnose. The server had passed our earlier tests, but we couldn't risk data corruption.
Upon further testing we found that a number of the large indexes we rely on for query performance were corrupt. Rebuilding them would take hours, and even once complete we couldn't be sure we had 100% data integrity. Even recovering from our hourly backups, the indexes would still need to be rebuilt.
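Part of why rebuilds are so slow on this version: Postgres 9.6 has no `REINDEX ... CONCURRENTLY` (that arrived in Postgres 12), so rebuilding a suspect index without long table locks means creating a replacement with `CREATE INDEX CONCURRENTLY` and swapping it in. A minimal sketch that generates that two-step SQL, with hypothetical index and table names (not our schema):

```python
def rebuild_index_sql(index: str, table: str, column: str) -> list:
    """Generate SQL to rebuild a possibly-corrupt index on Postgres 9.6,
    where REINDEX ... CONCURRENTLY is not yet available."""
    tmp = f"{index}_rebuild"
    return [
        # Build a fresh copy without taking an exclusive lock on the table:
        f"CREATE INDEX CONCURRENTLY {tmp} ON {table} ({column});",
        # Once the new index is valid, drop the corrupt one and rename:
        f"DROP INDEX CONCURRENTLY {index};",
        f"ALTER INDEX {tmp} RENAME TO {index};",
    ]

for stmt in rebuild_index_sql("entries_email_idx", "entries", "email"):
    print(stmt)
```

Each concurrent build still has to scan the whole table, which is where the hours go on multi-terabyte data.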
We decided that in order to prevent data loss of any sort we had to recover from DR, which is in another datacenter. This involved cloning and compressing over 4TB of data, then transferring it over a 1Gbps link back to our main datacenter. During this period we asked our datacenter provider whether they could upgrade our links to 10Gbps, but the account manager who handles orders wouldn't be available until the morning.
The compression, transfer, and restore took close to 10 hours to complete. The data was restored to M, then replicated much faster from M to S1, and we were able to restore service at 4:17AM AEST on the 16th without losing any customer data.
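That ~10 hour figure is roughly what the link maths predicts. A quick back-of-envelope, assuming the full 4TB moves at a sustained 1Gbps and ignoring compression gains and protocol overhead:

```python
# Back-of-envelope: how long does 4 TB take over a 1 Gbps link?
data_bytes = 4 * 10**12        # 4 TB of data
link_bps   = 1 * 10**9         # 1 Gbps link

seconds = data_bytes * 8 / link_bps   # bytes -> bits, divided by line rate
print(f"{seconds / 3600:.1f} hours")  # ~8.9 hours at line rate

# At 10 Gbps the same transfer would drop to under an hour, which is
# why a mid-incident link upgrade would have mattered so much.
```

Real-world overhead pushes the line-rate figure up towards the ~10 hours we actually saw.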
Total downtime was 14 hours 16 minutes, from 9PM PST to 11:16AM PST the following day.
We are proud of the team who worked around the clock, some for 36 straight hours to restore service. It's important in situations like these to remain calm and work together towards the best solution.
When you're in a highly stressful situation that can potentially impact a large number of customers, there are always things you can look back on and know you could have done better.
Here's the steps we're taking to ensure something of this nature doesn't happen again:
We're extremely sorry that we've let you down. To anyone who was in contact during the outage, we appreciate your understanding. We're ashamed that this was our biggest outage by far in the 6 years we've been in operation; we're continually trying to do better, but today proved that we still have a long way to go.
This outage was easily the most stressful day of my life to date. There's nothing worse than failing your customers and knowing how long recovery will take due to all the factors at play.
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." - Norm Kerth
We're pleased to roll out our new Loyalty Bonus Action aimed at helping reward your loyal fans and entrants. This is the first iteration that allows you to create tiers that reward users if they've entered your previous campaigns (Competitions or Rewards).
This action is available on the Hobby plan and above.
Often if you include a Visit action first or high up in your campaign it can send the user away before they get a chance to convert, which can negatively impact your conversion rates and engagement.
We've looked at data and can see that Visit actions in a prominent position can reduce conversion rates from an average of 36% down to less than 5%.
Based on this data we've made an improvement to Visit actions and how they display when a user is logged out of the widget.
Visit actions will now prompt users with the User Details form if they are not logged in before allowing them to see and visit the link.
You can display results for:
You can now import text comments into your campaign using the Facebook Comment Import Action.
You can filter imported comments and accept:
You can now import text tweets that have been tagged with specific #hashtags using the Twitter Hashtag Import Action.
You can filter imported tweets and accept:
Our new Reddit actions allow you to award entries to users who:
Reddit Actions are available on the Hobby plan and above
You can now manually add items from Twitter, Instagram and YouTube to any Gallery.
Our new Subscribe to a Podcast Action allows you to ask users to subscribe to your podcast anywhere it's hosted. This includes Apple Podcasts, Stitcher, Spotify and even your own website.
Podcast Actions are available on the Hobby plan and above
We now support Instagram @mentions for both Competitions and Galleries. Once set up, we'll monitor new Mentions for your selected Accounts.
Please note this only works for Media that @mentions you and does not pull in Mentions made in comments.
We've improved sorting options for Galleries and now allow you to "Sort by Random" which can remove potential skew for Voting based campaigns.
Facebook has rectified the bug affecting some Tab installs; any page with over 2,000 Likes should be able to install these again.
We're aware of some issues with installing or managing Facebook Tabs for both Competitions and Rewards.
There are two bugs on Facebook's side at play here:
The second bug has been open for over a month on the Facebook side, so we're currently at their mercy for possible fixes on this issue - we do apologise for any inconvenience this may cause.
As an alternative we urge you to make use of your Hosted Landing Page, you can find the link on your Preview tab.
If you're having issues feel free to contact support and we can try a manual replace of an existing Tab for now.
We've added a new sharing type to Galleries which allows brands or users to embed their submissions on any page that supports HTML.
One use case for this would be allowing influencers to write posts around their submission and allow voting on their submission from within the post.
We've added the ability to import comments from your YouTube videos to our Import Actions.
The Submit URL action now allows for unique or duplicate URLs to be submitted, so if you want a lot of users to submit the same URL you have that option now ☺️
There are three new Capture rules for you to play with:
We've made it easier for you to link to your Terms & Conditions from anywhere in your campaign.
Simply create a link to #terms and it'll open the Terms & Conditions you've added.
Click here to view our <a href="#terms">Terms & Conditions</a>
We've added a new Import button on the Actions tab of Competitions & Rewards that allows you to import Actions via CSV if you've collected entries offline.
If you have any issues importing just shoot us a support ticket for help! This feature is available on all Business plans or above.
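If you're generating the import file programmatically, the sketch below shows the general shape of writing one with Python's csv module. The column names here (name, email, action) are hypothetical for illustration, not Gleam's actual CSV schema; check the import dialog for the expected headers:

```python
import csv

# Hypothetical offline entries collected at an event. The column names
# below are illustrative only -- not Gleam's actual CSV format.
rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "action": "Attended event"},
    {"name": "Alan Turing",  "email": "alan@example.com", "action": "Attended event"},
]

with open("offline_entries.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "action"])
    writer.writeheader()   # first row is the column headers
    writer.writerows(rows)
```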
On February 25th, YouTube updated its Artificial Views Policy to a new Fake Engagement Policy, for which enforcement will start on March 15th.
This policy prohibits anyone from using services to artificially inflate any YouTube metrics, including:
Unfortunately, YouTube has been quite vague about how this policy is being enforced, but to date we've seen ~20 users have their Gleam landing page link disabled from their video description and be given a warning.
You can see YouTube's official stance on giveaways via a few Twitter responses:
These updates do not punish the app; instead, they punish the creators using it. They will affect anyone who continues to use any 3rd party platform that incentivizes behaviour on YouTube.
Keeping customers safe and ensuring that we're compliant with guidelines is our number 1 priority. We've been through similar changes with Facebook, Instagram and Twitter over the years - platform policy and privacy changes are an inevitable side effect of any fast growing social network.
If you're receiving this email it means that you have at least 1 active campaign running with either of the following actions:
These Actions now violate YouTube's Fake Engagement Policy and may result in a Channel Strike similar to this:
YouTube's new Community Guidelines Strike System will issue a warning for your first offence and disable the offending external link.
From today, we've removed the ability to add any Actions to a campaign that violate this YouTube policy.
Subscribe to a YouTube Channel is now Visit a YouTube Channel. We've also added the ability to ask users to connect their YouTube account after the visit, so you can see their YouTube details.
These changes will not affect any existing campaigns that you have running, but copying an existing campaign will not include the old actions.
In order to become compliant with the Fake Engagement Policy you will need to remove any Actions that are mentioned above. Existing entries will still count after removal.
This also includes any mention of asking users to Subscribe, or requiring that they be Subscribers in order to be eligible to enter, via your Gleam description or Custom Actions.
With so many creators relying on Gleam, these sorts of changes can often make it difficult to understand the right and wrong ways to run a campaign.
But we're committed to supporting creators through these changes, and we're planning to roll out additional features soon to make it easier for you to continue growing:
YouTube has made updates to its Community Guidelines and is now enforcing them via Strikes on creators who link out to certain sites via their video descriptions.
One of the guidelines updates sits under Spam, deceptive practices & scams policies:
Incentivization Spam: Content that sells engagement metrics such as views, likes, comments, or any other metric on YouTube. This also includes content where the only purpose is to boost subscribers, views, or other metrics (e.g., “sub4sub” content).
Based on this update you are no longer allowed to link to a Gleam campaign that asks users to Subscribe, Comment or perform any other action that boosts metrics on YouTube.
We've contacted YouTube for more clarification on these policies before making changes to the platform. We do feel this update (& its enforcement) causes a number of problems for Creators (and for YouTube):
For now we recommend that you remove any Actions that incentivize users to perform an action on YouTube if you are linking to a Gleam campaign from your YouTube descriptions - until we have more clarity on what is & isn't allowed.
We're currently working on changes to bring campaigns into compliance & will send out an email to all Creators once they are ready to roll out.
Once that happens we'll try to roll out more features for Creators to continue running compliant & legal campaigns on YouTube.
Today we've rolled out a new Reporting tab for Business customers that allows you to see which landing page & referring sources are driving traffic and conversions for your campaign.
You can also use this report to track down valuable traffic sources, partner activity on campaigns via query strings or specific landing pages on certain domains.
We've built this in a way that also shows you the proper referring sources when the campaign is embedded on your own site or a partner's.
We were unable to get a resolution from Google as to why comments made via the API were being deleted; we believe this is possibly linked to the sunsetting of Google+.
However, to minimise impact to customers and move forward the YouTube Comment action now works in a similar fashion to the YouTube Subscribe action.
This means that comments must be made directly on YouTube and then we'll verify on behalf of the user.
This will also mean that if loyal users have already commented they'll be credited without having to perform the Action.
We've rolled out a new rule for Captures that allows you to display your Capture within a specific fenced area on a map. You can add multiple locations to each Capture.
You can check this rule out under: Geolocation > Geofence
We've become aware of a bug with YouTube comments over the last 24 hours that is causing any comments made via the YouTube API to disappear after a few hours.
We're not sure if this is simply a display issue or if the comments are gone entirely. We've filed a bug report with Google; other API users are reporting the same issue, so we hope it gets resolved quickly.
Please note that even though these comments aren't visible on the videos, the entries are still valid for users that completed this action.