I guess I got bitten by some bug that makes me want to write and share our achievements. In this post I’ll try to explain the changes I’ve been promoting with my team over the past years, what we achieved and how I got there. Of course, I have to keep some details off the Internet.
A little history of this team, the Ops arm of a consulting company; I’ll keep to the remarkable facts. In 2009 the board decided it was time for a team (at that time a team of one, me) to be in charge of building environments (production or not) for the projects being developed. It was also around that time that I first looked for a repeatable process for creating those environments without needing me, which ended in huge PDFs full of parameter explanations, screenshots and commands.
A few months after the environment-building and firefighting services became “public”, I got pretty busy. That also made me more straightforward in my answers, which gave the wrong impression that I was the angry guy in the corner of the room. This reputation also made me the right guy for the impossible jobs, where we learned some new technology a day or two before standing in front of the customer. Since the logic behind those products was common, it was not that difficult.
Again, I think we were good enough at what we were doing, because together with the board we made our team officially 24×7. At that time we were around 6 people with a huge problem on our hands: the lowest possible monitoring budget, and the need to rotate 6 people across 24 hours a day, 7 days a week, without disrupting our daily activities. Proving that we’ve been creative since day 1, we built a really advanced monitoring setup with Zabbix and integrated it with our PBX (Asterisk), which was probably also the dumbest thing we’ve done. The workflow was pretty simple: everyone had to be on call at least one day per week and, to avoid staying awake all night watching the monitoring, we created a script that placed a phone call with a recorded message to the engineer on shift. That became the dumbest part: some situations generated a lot of alerts, creating a huge queue of phone calls, and sometimes the queue was still full after we had fixed all the problems, so our phones kept ringing. Still, we were automating and creating processes to work smartly with our own resources.
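The pile of phone calls described above is a classic case for deduplicating alerts before dialing. Here is a minimal Python sketch of that idea, not the script we actually ran: the `CallQueue` class, its method names and the 300-second cooldown are all hypothetical, and the Zabbix and Asterisk integration is left out entirely. The clock is passed in as `now` so the behavior is easy to check by hand.

```python
from dataclasses import dataclass, field


@dataclass
class CallQueue:
    """Hypothetical paging queue: at most one pending call per open alert."""

    cooldown: float = 300.0  # seconds between calls for the same alert (assumed value)
    pending: dict = field(default_factory=dict)      # alert_id -> message waiting to be dialed
    last_called: dict = field(default_factory=dict)  # alert_id -> time of the last call

    def trigger(self, alert_id: str, message: str) -> None:
        # A repeated alert overwrites its pending entry instead of queuing a new call.
        self.pending[alert_id] = message

    def resolve(self, alert_id: str) -> None:
        # A fixed problem must not ring anyone: drop its pending call and history.
        self.pending.pop(alert_id, None)
        self.last_called.pop(alert_id, None)

    def next_calls(self, now: float) -> list:
        """Return the messages to dial now, skipping alerts still in cooldown."""
        due = []
        for alert_id, message in list(self.pending.items()):
            last = self.last_called.get(alert_id)
            if last is None or now - last >= self.cooldown:
                due.append(message)
                self.last_called[alert_id] = now
                del self.pending[alert_id]
        return due
```

With this shape, two `trigger("db-disk", ...)` calls in a row produce a single entry from `next_calls(now=0)`, a re-trigger one minute later stays silent until the cooldown passes, and `resolve("db-disk")` empties whatever was still waiting, so nobody gets called about a problem that is already fixed.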
A few years (or months) later, the team had grown to more than 20 people, which still scares me, as I continue to be the center of the team, the one behind the scenes of all our major incidents. That left me with less time than ever, and on the busiest days my straightforward answers were probably reduced to something like YES/NO. At least now we don’t have the PBX calling us all night: we have a night shift that checks the status before calling us. As we built monitoring with a lot of detail, we were also creating alarms to prevent outages in production, not just to react to them. Most of those systems were not developed by our company and we barely had contact with the developers, which made any improvement difficult. I’ll come back to this subject later; for now, I’ll talk about how our Black Fridays went.
We had only one problematic Black Friday, the first one, and we used it to learn and avoid those problems on the other 5 or 6 that we “celebrated”. Back to the first one: we spent a whole month scaling environments manually and reducing the weight of our monitoring tools, and in the end some miscommunication with the business kept us really busy during the night. But the next morning we were all there and, instead of complaining, we were laughing and planning the next one. Yes, a year in advance. That was probably when I decided that not only deployments should be automated, as they had been for a long time, but that we also needed automation for most of the tasks we had configured wrongly during the first Black Friday, because we’re human and we make mistakes.
As the team grew, we started to have a bigger diversity of profiles and ways of working, and also bad communication. That made me worried, again, about having to spend more time managing and driving people instead of fixing problems, creating fun stuff or doing anything technical. I’ll try to tell a little about what I found on those first “issues”.
As I said, for convenience or lack of time, I became too straightforward in my answers, which made some of my team afraid of talking to me, and that brings me to the topic of “diversity of profiles”. When I realized that, I started to figure out how to get closer to the team. The first thing that came to my mind was... meetings with everyone, let’s do it weekly. In most of those I talked about the bad reactions we had during some incidents; in others I tried to make people share their daily challenges. After a few months it was clear that this didn’t work, as the rest of the company still had the image that talking to me was tough! I then tried to understand what made people think that, and I found some situations that created the perception. One of them was that for some time I was in charge of network security enforcement, being the bad guy. But the main reason was that I wasn’t considering that each person has their own timing and, last but not least, I wasn’t used to handing out compliments freely; those had to be earned (not that I’ve changed that much).
I was expecting everyone to match my timing or to take the same path to a decision; dropping that expectation was the main change I made to my own behavior to help the team. With that, I started distributing daily activities and R&D based on individual capacity, which is probably something that anyone managing a large-scale operation will condemn me for. I may have forgotten to mention that we specialize in problem solving, not in repeating tasks manually, which makes it far more important to know each person’s capabilities in order to address problems quickly.
My main question was how to break the traditional Ops way of working to get as close as possible to DevOps without the dev portion, while taking advantage of the mindset of worrying about the application rather than the infrastructure, which becomes a consequence. That mindset basically focuses on stability and performance. It also brings really good ownership, but may cause some communication issues.
As the team grew we tried the traditional way of handing out tasks, Service Desk software, which proved that human nature makes each individual pick the more comfortable tasks, leaving some tasks waiting a long time in the queue. We tried creating a Ticket Manager role, but that didn’t work, mainly because of the distributed nature and the size of the team: it overloaded those who solve tickets quickly, as they were the first choice for every escalated ticket. Back when we were 6 people in the team, we didn’t have those kinds of issues... then, after some reading, I had a “brilliant” idea: let’s create squads (like in the agile world) for Ops. Each squad must have at least one specialist in each of our main areas (DBA, OS, middleware), a level 1 and juniors who learn and help, and it takes care of a few customers and systems. This resulted in quicker internal communication, reduced handoff time, improved ownership of problems and let the management layer better understand how each person is progressing in their career. Now... I had some easy days. I started talking individually to the team again, understanding them even better and helping them progress. That wasn’t the final version; we’re always testing new processes and making changes, and the faster we can observe results, the faster we can improve.
The dynamic of our weekly meeting is still improving. I found that sometimes (most of the time) I get excited about some topic and talk too much, but the team has a single question to answer: “what was your challenge of the week?”. This shows where each team member is in their career and, more importantly, it increases collaboration, helping everyone in their careers.
We still specialize in problem solving, but we got way more creative: if we can’t find a tool on the market that helps us, we create it. I now personally spend most of my time encouraging the team to propose changes and optimizations that help us keep increasing our efficiency. I also keep setting challenges that let people understand a problem before touching it. Sometimes I already know the answer that would fix the problem quickly, but there is no fun in giving that away; from my point of view, the more difficult the challenge, the better the response. That’s how we practice for the worst: spending time understanding the problems when we have the time, and doing workarounds only when really needed.
With all that, we keep breaking the traditional Ops way of operating every day and, most importantly, we do it as a team: growing, learning from our mistakes and, above all, helping our customers keep their systems online and performing well.