""Production" is code for "playground"" on https://aligot-death.space, available at https://aligot-death.space/txt/production-is-playground-en


"Production" is code for "playground"

You don't need that, do you?


One day we lost the connection to our production servers. After a few calls, we realised that the infrastructure team was using our servers as test beds because nobody had told them these servers were live. We later learned that the hardware was not compatible with the custom Red Hat image in use, which meant the servers could not be rebooted without deleting the NFS mounts and some other configuration files.

We knew about the no-reboot issue but not the actual reason for it. They were using our production servers to debug it because the CTO had given our two teams contradictory information.

We asked the backup team to restore the files, because, you know, we needed them. We got confirmation that it was done, but the files were still missing. After investigation, they told us that when the backup software prompted them to overwrite the still-present but empty directories, they simply hit "no" and closed the restoration ticket. Another attempt failed because they "let the workstation handling the operation go to sleep during the process".

But it didn't stop there: we allowed the OS team to run their tests on our performance test infrastructure. We got no news for a while, and when we contacted them they said that they were actually waiting for the machines to be shut down but hadn't asked for it.

We requested said shutdown from the infrastructure team. They then asked us for permission to fully wipe the machines, but we asked for a delay as we needed them again. The following week, they did it anyway. So we asked the backup team (again) to restore the servers: they failed to do so because in the meantime they had changed their infrastructure, so the backups were no longer properly assigned. Long story short, they ended up having to wipe their backup servers, and we lost our data.

A few months later, we were having connectivity issues on production. As it turned out, they were caused by the storage team: they were testing the new Kerberos version on our servers, because, you know, "you are the only team with this version of Red Hat".