Who said having a Quality System is a waste of time?
We are a scaleup company, providing an Application Lifecycle Management and Quality Management System application for Medical Device companies, on a SaaS platform. We started over 8 years ago as two focused software engineers who had decided that the little application we built for ourselves could be very useful to a lot of other companies. We had very little budget, no investors, and did not take a salary for 2 years. The future was bright.
Having lived in medical device companies for a big part of our careers, we felt that the methods and structure of this world were worth our time and energy, so we decided that our first hire would be an RA/QA manager. Ann helped us build our QMS management platform and our own QMS (following ISO 13485, since all our customers were). That was Year 3.
Year 4, we thought that with all the controls and procedures we had put in place in the QMS, we were close to being ISO 27001 compliant. That is the Information Security standard. So our QMS became a combined QMS+ISMS system.
Together with that we tackled all the mandatory provisions of that standard, including all the chapters related to business continuity and disaster recovery.
We proceeded with a formal certification of our ISO 27001 Information Security Management System. Internal audit, external certification.
ISO 13485 has included a requirement since its 2016 revision: the risk evaluation of processes.
This is for example one of the risks we declared in our QMS/ISMS:
In short: a server crash is really bad, but since we have lots of backups, alternate hosting providers, and a designed Disaster Recovery Process, we should be fine.
On March 10th, 2021, this happened in Strasbourg:
© Sapeurs Pompiers du Bas Rhin
OVH’s boss told us this without ambiguity:
That building (SBG2) was the home of:
- Our biggest production server, hosting the data for about 40 of our customers
- Our main backup server
Message from a customer to our support:
"no access to project site"
My early-bird colleagues called me saying “fire” (literally, this time), so we were all quickly in that mode some of you know: crossing your fingers for your contingency plan to work.
My team was extraordinary - as if we had done this all our lives, we very calmly started a whiteboard in a Google Doc with all the primary activities to perform and the checks that needed to happen, while Slack heated up by the minute with snap questions and answers as the team divided the work.
We started communicating on our statuspage. Then a campaign of emails sent to the admins of all the customer instances on that production server.
We have about 30 servers running worldwide and we are careful not to load them too heavily, so restoring all these databases and files to existing servers was not an issue. The restore itself took some time for some customers - the secondary servers at other providers are slower to transfer from. Our Disaster Recovery Plan didn’t have a chapter called “Fire destroying our main backup and our biggest production server” - but we were close enough.
We believe strongly in hard redundancy, so besides the main backup server at OVH, we also had a duplicate in Munich (Contabo) and one in Frankfurt (Hetzner).
We have ready-made scripts to restore anything from anywhere, so we launched them in the order defined by the Disaster Recovery Plan.
We needed to redirect the master backups on all European servers to one of the two remaining backup servers. Again, we have a procedure for that.
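The post doesn’t show what those restore scripts look like, so here is a minimal, purely hypothetical sketch of the idea: pull each customer’s latest dump from the first surviving secondary backup server, in the order the Disaster Recovery Plan defines. All server names, paths, and the priority scheme are illustrative assumptions, not the company’s actual tooling.

```python
# Hypothetical sketch of a "restore anything from anywhere" runner.
# Server names, paths, and the DR-plan ordering are illustrative only.

from dataclasses import dataclass

# Secondary backup locations, tried in the order the DR plan defines.
BACKUP_SOURCES = ["backup-contabo-munich", "backup-hetzner-frankfurt"]

@dataclass
class RestoreJob:
    customer: str    # customer instance identifier
    target: str      # production server to restore onto
    priority: int    # DR-plan priority (lower = restore first)

def plan_restore_commands(jobs, sources=BACKUP_SOURCES):
    """Yield shell commands in DR-plan order, pulling each customer's
    latest dump from the first available secondary backup server."""
    source = sources[0]  # first remaining backup server
    for job in sorted(jobs, key=lambda j: j.priority):
        yield (f"rsync -a {source}:/backups/{job.customer}/latest/ "
               f"{job.target}:/restore/{job.customer}/")

jobs = [
    RestoreJob("acme-med", "prod-eu-3", priority=2),
    RestoreJob("biotec",   "prod-eu-4", priority=1),
]
for cmd in plan_restore_commands(jobs):
    print(cmd)
```

Having the commands generated from a plan, rather than typed by hand at 7 a.m. during a fire, is the point: the ordering and the source selection are decided once, in the procedure.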
The restore itself did not take that long. Small customers took a few minutes to restore. The biggest data set for a single customer was around 25 GB, and took about 1.5 hours to transfer back to the OVH servers.
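As a quick sanity check on those figures (assuming decimal gigabytes and a roughly sustained rate):

```python
# Back-of-envelope check of the transfer figures above (decimal units).
size_gb = 25             # largest customer data set, GB
hours = 1.5              # observed transfer time
throughput_mb_s = size_gb * 1000 / (hours * 3600)  # MB/s
print(round(throughput_mb_s, 1))  # ~4.6 MB/s, i.e. roughly 37 Mbit/s
```

That order of magnitude is plausible for cross-provider transfers between budget data centers, which is why the biggest instance dominated the recovery timeline.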
We checked the consistency of all the restored instances - it was a go.
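A consistency check like that can be as simple as comparing a manifest written at backup time against what the restored instance actually contains. The sketch below assumes a per-table row-count manifest; the format and table names are hypothetical, not the actual system.

```python
# Hypothetical post-restore consistency check: compare a manifest
# written at backup time against counts measured on the restored
# instance. Manifest format and table names are illustrative.

def check_restore(manifest: dict, restored: dict) -> list:
    """Return a list of human-readable problems; empty list means 'go'."""
    problems = []
    for table, expected in manifest.items():
        actual = restored.get(table)
        if actual is None:
            problems.append(f"{table}: missing after restore")
        elif actual != expected:
            problems.append(f"{table}: expected {expected} rows, got {actual}")
    return problems

manifest = {"documents": 14230, "audit_trail": 90412}
restored = {"documents": 14230, "audit_trail": 90412}
assert check_restore(manifest, restored) == []  # it was a go
```

The value of returning a list of problems instead of a boolean is that a partial restore produces an actionable report, not just a red light.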
All our customers are running on their new servers. Next: the free trial instances that were hosted there as well.
All data restored
Now starts perhaps the most important part of our QMS: the CAPA (Corrective And Preventive Action) processing.
Ann created it with the first evidence.
This is where you ask yourself: what happened? What is the root cause? What did we do to correct the problem right away? What could we improve? How do we make sure it doesn’t happen again?
Then the second level remediation started: order new servers to replace the backup and the production ones. Start syncing the new backup with the other 2.
We ordered a new backup server in Poland, to be as disconnected as possible from all existing data centers, while still being on the same fast network.
We ordered 2 new production servers to replace the lost one.
March 13th, 14th
Started moving customer instances to our 2 new servers to ease the load and be in a balanced situation again.
CAPA work. After the dust (again, literally) settled, it's time to think about what you've done, and most importantly what you need to improve.
Lots of findings for us:
- We were right about making restore exercises every week - this is the only way to ensure that the whole process works.
- We created an analysis tool to see much more easily where attachments are used.
- We now have a better geographical / datacenter oriented network map, so that we will not have the backup servers in the same buildings as the production server anymore. Our main backup server is now in Poland, where we have nothing else.
- Our monitoring failed to warn us about the fire. A change made to the monitoring servers 2 days earlier had prevented it from working. We will completely revisit the way we do monitoring in the near future. We hired a new team member to help us with that, and will shortly hire a subcontractor to watch our systems 24/7.
- Even though we handled the disaster very well, we analyzed the event in multiple CAPA sessions. During these we had more ideas on how we can improve our processes in order to reduce risk further - and obviously we'll execute on them.
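The first finding above - weekly restore exercises being the only real proof the process works - can be sketched as a scheduled drill: sample one customer instance, restore it to a scratch environment, and verify the result. Everything here (function names, the sampling scheme, the toy restore/verify stand-ins) is an illustrative assumption, not the company's actual tooling.

```python
# Hypothetical weekly restore drill, sketching the CAPA finding above:
# restore a sampled instance to a scratch server and prove it verifies.
# All names here are illustrative, not the actual tooling.

import random

def weekly_drill(customers, restore, verify, rng=random):
    """Restore one randomly sampled customer instance and verify it.
    Returns (customer, ok) so a failure can open a CAPA automatically."""
    customer = rng.choice(customers)
    snapshot = restore(customer)  # e.g. restore to a scratch VM
    return customer, verify(snapshot)

# Toy stand-ins for the real restore/verify steps:
customer, ok = weekly_drill(
    ["acme-med", "biotec"],
    restore=lambda c: {"customer": c, "rows": 100},
    verify=lambda snap: snap["rows"] > 0,
)
print(customer, ok)
```

The key design choice is that the drill exercises the whole chain - backup retrieval, restore, verification - not just the existence of backup files; a backup that has never been restored is only a hypothesis.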
Our customers are incredibly nice. Example reactions from them:
- "Well done!"
- "We were very satisfied with how you handled the problem with OVH last week. Congratulations to the whole team."
- "First of all, congratulations on the successful disaster recovery!!!"
The customers who were cut off from their service got a rebate on their next invoice: this is part of our SLA conditions.
Without our QMS/ISMS and our Disaster Recovery Plan within, we would probably not have survived this incident.
Yes, there’s a price to pay: all these backups, restore exercises, and plans take time that you could think is lost. And it’s not sufficient to design them; you need to prove that they work on a regular basis.
This incident also brought a new dynamic to the team - convincing everyone of the vital importance of data security.
A big thank you to the whole team for their dedication and efficiency. It's so good to be in a team where you know the others have your back!