Reliability Engineer Shares How Businesses Can Manage High Availability and Improve Resilience

by Aremu Adebisi, January 23rd, 2025

As a service becomes more popular, its creators face increasing challenges. A surge in visitors can overwhelm a website, cause downtime, and harm the brand's reputation. To avoid such scenarios, companies adopt modern approaches, including cloud technologies and a shift to microservices architecture. Recent research shows that the cloud microservices industry is reaching record-breaking growth levels.


Alexandr Hacicheant, Head of Reliability Engineering at Mayflower, ensures the stability of a live-streaming platform by leveraging cutting-edge developments and modern technologies. In this interview, Alexandr shares his experiences and insights into how companies can enhance infrastructure resilience, save resources, and foster effective knowledge sharing within their teams.

You lead the Reliability Engineering department at Mayflower. Could you share your previous experience and how you joined this role?

For several years, I worked remotely as a software developer for both Russian and international companies, focusing on technical challenges that required either the rapid implementation of new features in response to business needs or the resolution of issues related to performance or improper service behavior.


A key responsibility was improving the performance of server applications. For example, during promotional campaigns, surges in user activity led to a significant increase in requests to registration and purchasing services. My goal was to ensure the system could handle this load without failures at critical moments.


In 2016, I experienced a pivotal career shift when I moved to Cyprus and joined Mayflower, a company specializing in live-streaming platforms. The platform handles a large number of simultaneous video streams. My overarching task was to ensure the architecture's resilience, allowing it to seamlessly handle peak user traffic.

How has your career evolved at Mayflower?

I started with product-related tasks as a programmer and, about a year later, began leading a small backend development team of 2–4 engineers. These team members focused on server-side tasks: writing code to quickly process incoming data on the server and interact with client applications.


As a team lead, I spent less time coding and more on management, meetings, and collaboration with colleagues. The balance often leaned 60/40, with the majority being managerial tasks and the rest programming. My communication extended beyond my team, involving coordination with other managers as well.


After 18–24 months, I became a technical lead, broadening my responsibilities to include selecting technologies for projects, assessing the viability of new tools, and ensuring existing solutions met company requirements.


Eventually, I advanced to the position of CTO, overseeing the company's technical development. In this role, I managed technical leads from various departments and coordinated the backend, frontend, machine learning, and DevOps teams. I built a team to develop an in-house media server, improved approaches to information security, and assembled an IT security team. About three years later, I transitioned to my current role as Head of Reliability Engineering.

What does your current role involve?

My work focuses on ensuring that our services are stable, highly available, and resilient, with quick recovery in the event of failures. System downtime can occur for various reasons, most often poorly optimized code, misconfigurations, and human error. We identify these issues during profiling and analysis, then develop and share best practices with teams while automating the detection of similar issues in the future. For instance, we add load-testing pipelines to evaluate how code handles many concurrent users and how prepared the service is for peak traffic.
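
As a rough illustration of what a load-testing step in such a pipeline might check, here is a minimal Python sketch. It is not Mayflower's tooling: the handler, the request counts, and the pass/fail thresholds are all hypothetical, and a real pipeline would send traffic to a deployed service rather than call an in-process function.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(handler, payload):
    """Call the handler once and return (latency_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        handler(payload)
        ok = True
    except Exception:
        ok = False
    return time.perf_counter() - start, ok


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def run_load_test(handler, total_requests=200, concurrency=20):
    """Fire total_requests calls from `concurrency` worker threads."""
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda i: timed_call(handler, i),
                                range(total_requests)))
    latencies = sorted(latency for latency, _ in results)
    return {
        "requests": total_requests,
        "errors": sum(1 for _, ok in results if not ok),
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
    }


def passes_gate(report, max_error_rate=0.05, max_p95_s=0.5):
    """Hypothetical pipeline gate: fail the build on errors or slow p95."""
    if report["requests"] == 0:
        return False
    error_rate = report["errors"] / report["requests"]
    return error_rate <= max_error_rate and report["p95_s"] <= max_p95_s


def fake_registration_handler(user_id):
    """Stand-in for a real service call (assumption for this sketch)."""
    time.sleep(0.001)
    if user_id % 50 == 49:  # simulate an occasional failure
        raise RuntimeError("simulated failure")


if __name__ == "__main__":
    report = run_load_test(fake_registration_handler)
    print(report, "gate passed:", passes_gate(report))
```

Reporting percentiles rather than an average is a deliberate choice in this sketch: a slow tail during a traffic peak is hidden by the mean but shows up in p95.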


While it's challenging to predict all scenarios during rapid project launches, horizontal scaling—adding servers to distribute the load—can address user growth. However, this approach becomes expensive in the long term. Investing in application optimization is more cost-effective for businesses.
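
The trade-off can be made concrete with some back-of-the-envelope arithmetic. The Python sketch below uses invented numbers (peak load, per-server throughput, server price, and headroom are all assumptions, not figures from the interview) to show how an optimization that raises per-server throughput reduces the fleet needed for the same peak.

```python
import math


def servers_needed(peak_rps, rps_per_server, headroom=0.2):
    """Servers required for peak_rps while keeping a spare capacity margin."""
    usable_rps = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


def monthly_saving(peak_rps, rps_before, rps_after, cost_per_server, headroom=0.2):
    """Monthly cost difference after an optimization raises per-server throughput."""
    before = servers_needed(peak_rps, rps_before, headroom)
    after = servers_needed(peak_rps, rps_after, headroom)
    return (before - after) * cost_per_server


if __name__ == "__main__":
    # Hypothetical inputs: 12,000 requests/s at peak, 150 currency units per server.
    print(servers_needed(12000, 400))             # before optimization
    print(servers_needed(12000, 800))             # after doubling throughput
    print(monthly_saving(12000, 400, 800, 150))   # recurring monthly saving
```

With these assumed inputs, the fleet shrinks from 38 servers to 19, and the saving recurs every month, whereas the optimization work is paid for once.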


Resilience has become one of my core professional competencies, and I continue to refine my skills in this area. While most developers are more interested in creating features, I enjoy tackling complex technical challenges related to improving application performance. It's particularly rewarding to see the impact of this work reflected in dashboards and charts, where resource usage decreases, leading to cost savings for the business.

How does modern IT manage high availability and improve resilience?

Businesses are increasingly shifting from monolithic to microservices architectures. A monolithic application is built as a single, unified system where all components (UI, business logic, database, etc.) are interdependent. In contrast, microservices break the application into independent modules, each performing a specific function and capable of operating autonomously, communicating through APIs.


As monoliths grow in codebase and business logic, they become harder to manage, slowing down delivery cycles and complicating issue resolution. A failure in one part of the monolith can disrupt unrelated critical business processes, such as a payment system that hasn't been changed recently.

When validating a new business idea, it's easier to start with a monolithic structure, allowing for quicker prototyping and hypothesis testing. However, as the infrastructure grows and becomes more complex, splitting into independent services becomes essential for resilience and faster updates to individual components without risking the entire system.


Another trend is adopting cloud technologies, which provide computing resources (servers, data storage, software, and networking) over the internet rather than from local devices or office servers. The key advantages of cloud solutions include automatic scaling and high resilience. If one instance of a service fails, its replicas keep the service running. During traffic surges, the number of instances increases automatically, then decreases again as demand falls.
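
To illustrate the scaling behaviour described above, here is a minimal Python sketch of a proportional autoscaling rule. It is an assumption-laden example, not the mechanism of any particular cloud provider or of Mayflower's platform: the function name, the 60% utilization target, and the instance limits are hypothetical. The rule itself resembles the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_instances(current, observed_load, target_load,
                      min_instances=2, max_instances=20):
    """Scale the instance count in proportion to observed vs. target load.

    observed_load and target_load are per-instance metrics in the same unit,
    for example average CPU utilization in percent.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    raw = math.ceil(current * observed_load / target_load)
    # Keep at least min_instances for redundancy and cap spend at max_instances.
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    print(desired_instances(4, observed_load=90, target_load=60))   # surge: scale out
    print(desired_instances(6, observed_load=20, target_load=60))   # quiet: scale in
    print(desired_instances(6, observed_load=300, target_load=60))  # capped at the maximum
```

Keeping a floor of at least two instances reflects the resilience point made above: if one instance fails, another is already running to take over.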

How do you facilitate knowledge sharing on resilience practices within your team?

I have always attended many online and offline conferences to learn about new methods and technologies. When I found something promising, I shared it with the teams, and they decided how to integrate these innovations into our work.


This led to the idea of broader knowledge sharing. I initiated an internal community where we exchange ideas and discuss trends in server-side development. Initially, I organized the meetings myself, proposing discussion topics, but the community has since grown. We now have several internal groups focused on backend, frontend, machine learning, DevOps, and more. Group leaders independently manage meetings and implement useful practices in development and operations.


This initiative also evolved into internal conferences. Once a year, we gather to share experiences, highlight achievements, and discuss the technological future of the business. As the company has grown to 150 employees, the need for these events has increased, and I encourage colleagues to become speakers.


Interestingly, our conferences attracted not only employees but also external experts. Gradually, we expanded to a public format, discussing technical topics related to high-load service development, testing, and maintenance. What began as a small community became the foundation for both internal and external events, supporting knowledge exchange and team growth.


I also participate in open-source projects, an excellent way to gain new experience, explore best practices, understand diverse code review processes, and integrate into the broader community.

Could you share your secrets for managing a team?

Our company emphasizes internal growth, a philosophy I fully support. Instead of recruiting leaders externally, we prefer to nurture talent from within. Courses on people management and hands-on training—like setting goals, conducting one-on-one meetings, and addressing imposter syndrome—play a key role in this process. The company encourages these initiatives by funding courses, inviting experts, and organizing training sessions.


For example, my colleagues and I learned how to conduct effective one-on-one meetings, unlock employees' potential, listen actively, and collaboratively plan development paths. Additionally, I read books, follow professional chats, and attend various IT conferences to enhance my knowledge.


Overall, I believe open dialogue with colleagues is essential. It's important to inspire, support, and maintain an environment where everyone can thrive.