Ordinary engineers create job security; the best ones create documentation

I’ll show you that being a great engineer who writes good code, makes sound architecture decisions, and delivers fast isn’t enough today.

Intro

Agree? No? Then let’s dive into the problem. I’ll try to show you, by telling a story, that being a great engineer who writes good code, makes sound architecture decisions, and delivers fast isn’t enough today. In my opinion, software engineering today and software engineering 20 years ago are completely different things. Back then, a project could be built by a team of three people who knew each other well, had worked together for ten years, and understood each other perfectly. Now, engineers stay at a particular company for one, maybe two, years on average. But first, let me tell the story I mentioned above.

The story

Today I work as an SRE at a mid-sized company. Honestly, though, I’m not quite sure what that means. Throughout my career, I have presented myself as a software engineer: not a developer, not DevOps, not even an SRE, but an engineer. In practice, this means I do everything from infrastructure maintenance to ordinary web development, and I often work at small and mid-sized companies. In companies like that, a person like me is highly valued, because there aren’t many departments yet, but the work still needs to be done. This story is entirely about SRE/DevOps/SysOps, whatever you prefer to call it. But it applies fully to classic development as well.

It all started when I got a task to add three servers to our infrastructure. We placed the order, waited a week, and the servers appeared in the management console (we are talking about dedicated servers). We run Debian, and one day I’ll tell you the story of why you should run away from that distro, but not today. I have been with the company for less than a year, and the previous infrastructure expansion was done by another engineer who, as you might have guessed, no longer works here. But this is just a server bootstrap, a routine, simple task. What could go wrong?

The first problem I faced was that I couldn’t install Debian through our server provider’s cloud management console. It’s pretty strange that such a well-known company offered only Ubuntu 18.04 from the Debian family in 2021. Anyway, I had iKVM (IPMI). To reach the IPMI interface, I first needed to connect to the provider’s infrastructure through an SSL VPN. I went to our provider’s documentation, downloaded the proprietary SSL VPN client for their network, and you know what? It didn’t work! The latest version for macOS Big Sur showed “connected” but actually wasn’t. Every other version from the official source gave me the same result. Good start. Then I tried my GNU/Linux laptop, and the result was slightly better. When it showed “connected”, it really was connected, but the client failed from time to time. I got a stable connection only on a Windows PC.

The second problem was attaching the ISO file with the installer. According to the official documentation, the easiest way to do that was to use a Samba (Windows) share. Unfortunately, this didn’t work. I had no choice but to download the Java applet for connecting to iKVM and mount the ISO from my PC.

That was the third problem. I downloaded the applet, but it would run only under Java 8, which was not mentioned anywhere. Java 8. I have nothing to add.

The fourth problem was UEFI. As the support team told me, those servers can’t boot in legacy mode by design. As you may guess, this “feature” was not mentioned when we were buying the servers. I prefer legacy mode because with UEFI I can’t make the EFI partition part of a software RAID. So I was forced to use hardware RAID.

There were more problems, such as being unable to set up an internet connection from the rescue system, or a mounted ISO that got stuck and could only be released by a KVM reset (according to the support team’s solution plan). But two days and three support tickets later, the servers were bootstrapped. The simple task took two days.

Why does it happen?

First of all, let’s put the server provider aside. Yes, I agree that, formally, almost all of these problems, especially the broken VPN and the ISO mounting issues, are entirely the provider’s fault. But we are in IT, and we work with what we have. How can we minimize the damage from stories like this?

I claim that the only way is to write documentation and create tickets. Afterwards, I found a message in the team channel about problems connecting to the IPMI network on macOS. It had been sent three months earlier. At our weekly sync-up meeting, another engineer said that he “remembers some issues with ISO mounting and installing Debian on the provider’s infrastructure.”

“Some issues”, a few chat messages, and not a single ticket or Confluence page. And so the simple task took two days.

How to solve it?

Write documentation! Create tickets. Yeah, I know it’s boring. But I am convinced it’s good for the project. Why did I start the article by comparing the present with the past? Because today it is critical to have good documentation, a ticket system, unit tests, and CI/CD processes so that newcomers can get started quickly. Very likely, an engineer will leave the company after a year. What does he leave behind? Code? OK, but what about the ideas and the reasoning behind the decisions? Of course, code is better than nothing. But code without the ideas behind it documented is close to useless.

I deliberately told a purely operational story, without code. Because in that case, without documentation, we have nothing, a zero! Other people had already done this before. With the same provider and the same VPN, they faced identical issues, but they didn’t document them. That’s called job security. And I think one of the main tasks of a leader is to prevent it. Job security makes it harder for new hires to get started and makes you dependent on the old ones.

For me, all of this turned into a firm rule: “Every issue you face must be converted into a ticket. Every ticket must be resolved with code, documentation, or, better, both.” It doesn’t matter how polished the documentation is. Just write something. Of course, the best option is to write documentation alongside the code or in a dedicated tool like Confluence, but even a few comments in the ticket are better than nothing. Just note what you did and how you did it.
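To show how little such a note needs, here is a minimal sketch in Python. The `IssueNote` class and its field names are hypothetical and made up for illustration; they are not part of any ticket system’s API. The example content is the VPN problem from the story above.

```python
from dataclasses import dataclass


@dataclass
class IssueNote:
    """Hypothetical minimal record: what went wrong and what helped."""
    title: str
    symptom: str
    workaround: str

    def render(self) -> str:
        """Return plain text that can be pasted into a ticket or a wiki page."""
        return "\n".join([
            f"Title: {self.title}",
            f"Symptom: {self.symptom}",
            f"Workaround: {self.workaround}",
        ])


# Example note built from the VPN problem described in this article.
note = IssueNote(
    title="Provider SSL VPN client on macOS Big Sur",
    symptom='The client shows "connected", but the connection does not work.',
    workaround="Use a Windows PC; the Linux client connects but fails from time to time.",
)
print(note.render())
```

Three lines like these in a ticket would have been enough to save the next engineer a good part of those two days.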

P.S. Yes, I wrote the documentation.

Engineering from hell with love.
