Managing Production Server Operating Systems - the CentOS Case Study

Red Hat's December 2020 CentOS announcement provides a good opportunity to think through the decision-making process for choosing Operating Systems in Service Provider networks.

Watch the video interview with Sherwin Crown and Mark Lindsey discussing this change.

Operating Systems Are Key Decisions

You might be committed to Linux for your application. Still, the selection of an Operating System is a critical decision, whether you run classic long-lived bare-metal servers or Virtual Machines, or temporary machines like AWS EC2 instances.

Long-Lived Machines

Some machines are built and intended to keep going for a long time. These follow in the classic Unix "minicomputer" philosophy that came to us from the 1970s, when a server would be built and then run for many years. Even if the OS is upgraded, the basic function and role of the computer would be maintained. In this model, computers have clever names like "sophrosyne" or "crwth."

Even in <a href="https://www.redhat.com/en/topics/virtualization/what-is-nfv">Network Function Virtualization (NFV)</a>, most Service Provider machines are in this category. Software vendors like the Cisco BroadWorks team and Neustar Once you choose the operating system distribution, such as Red Hat Enterprise Linux, or CentOS Linux, or Debian Linux, you're committed to that until you re-create that machine. And for that duration, you have to maintain it with software updates and configuration.

Long-Lived Machines often stay up and running for years at a time (though reboots during software updates should shorten that time).

Three reasons you're stuck with a Linux distribution long term

  1. The OS is key because it lives logically "underneath" the applications you add to it. You have to configure it first, and then you can install applications on it, e.g., BroadWorks, or SurgeMail, or Oracle Database. 
  2. The OS is also key because you have to keep installing software updates on it throughout its lifespan. More on that below.
  3. And the OS is key because automation, such as with <a href="https://www.ansible.com/integrations/infrastructure">Ansible</a>, Chef, or Puppet, has to be integrated with the OS distribution to work properly. Effective system administration varies substantially between distributions. In effect, this means that the API for managing certain functions on a Debian Linux server is different from the API for managing the same functions on a Red Hat Enterprise Linux server.
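
To make the "different API" point concrete, here is a minimal Python sketch, not a definitive implementation. It reads the ID and ID_LIKE fields that distributions publish in /etc/os-release and picks the matching package-install command. The function names and the two-family grouping are assumptions made for illustration; real automation tools handle many more cases.

```python
import shlex

# Illustrative mapping from distribution family to its install command.
# dnf applies to RHEL/CentOS 8 and later; older releases use yum instead.
INSTALL_COMMANDS = {
    "rhel": ["dnf", "install", "-y"],
    "debian": ["apt-get", "install", "-y"],
}

# Hypothetical grouping of os-release IDs into two families.
FAMILY_BY_ID = {
    "rhel": "rhel", "centos": "rhel", "fedora": "rhel",
    "debian": "debian", "ubuntu": "debian",
}


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be shell-quoted, e.g. ID_LIKE="rhel fedora".
        parts = shlex.split(value)
        fields[key] = parts[0] if parts else ""
    return fields


def distro_family(fields):
    """Return 'rhel' or 'debian' based on ID, then ID_LIKE; None if unknown."""
    candidates = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    for name in candidates:
        if name in FAMILY_BY_ID:
            return FAMILY_BY_ID[name]
    return None


def install_command(fields, package):
    """Build the argument list that installs `package` on this distribution."""
    family = distro_family(fields)
    if family is None:
        raise ValueError("unsupported distribution: %r" % fields.get("ID"))
    return INSTALL_COMMANDS[family] + [package]


# Sample os-release content as a CentOS Linux 8 host would report it.
centos = parse_os_release('ID="centos"\nID_LIKE="rhel fedora"\nVERSION_ID="8"\n')
print(install_command(centos, "postfix"))
```

The same request ("install postfix") produces a different command on each family, which is why automation written for one distribution rarely runs unchanged on another.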

Short Lived & Appliance / Template / Image Machines

There's another variety of machines: short-lived machines created for short-term use but created in bulk as needed. These are more common in elastic cloud environments, where Virtual Machines can be created and deleted as needed. In these situations, the automation of the platform is especially key. The decisions made about the Linux distribution are baked into the template that's used for many other machines.

  • Appliance: A machine built to provide a function, but not normally something that users log in to with SSH or a similar client. Typically appliances are not used as general-purpose servers, but instead provide a small set of applications. Examples: Metaswitch Perimeta Session Border Controller; pfSense Firewall.
  • Template or Image: A machine built not to be run itself, but to serve as an image for other machines. Templates are made to be replicated into new virtual machines. 

This means that while any individual VM doesn't exist for very long, the image or prototype machine may be maintained for a long time. Yes, the template can be replaced, but all of the complexity and effort are poured into that template.

Closely related to this are appliance systems -- which are built to be distributed. These include the images deployed on embedded devices, like cameras, routers, and home appliances. They are built as template machines and then distributed.

Two Reasons the Linux Distribution is Key for Short-Lived and Appliance Machines

  1. The Linux distribution chosen is tightly integrated with the application. We can't just log in and make a few fixes for the integration -- instead, we have to modify and update the template. Building a template or replicable software system is always more complex than building "in place."
  2. Software defects and improvements have to be rolled into the template and distributed to end-users. This means that applying updates can be trickier than simply installing them and rebooting, because the updates have to be baked into a completed package. Then all of the old instances of the machine must be updated or replaced.

Case Study: Telecom Service Providers and CentOS

Voice Service Providers that have adopted CentOS often use long-lived machines. In many cases, CentOS runs in a virtual machine that is deployed to process or route device registration, phone calls, billing, or other network management. Service Providers often run these in an N+1 configuration -- meaning they always have at least one spare server that can be down or restarted for maintenance.
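
The N+1 arithmetic can be sketched in a few lines of Python. The load and capacity figures below are hypothetical, chosen only to show the calculation; they do not describe any particular platform.

```python
import math


def servers_needed(peak_load, per_server_capacity, spares=1):
    """Return N + spares, where N servers are just enough to carry peak_load."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    n = math.ceil(peak_load / per_server_capacity)
    return n + spares


# Hypothetical figures: 9000 concurrent calls at peak, 2500 calls per server.
# Four servers carry the load; the fifth is the spare for maintenance.
print(servers_needed(9000, 2500))  # 5
```

With the spare in place, any one server can be taken down for a software update without dropping below the capacity needed at peak.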

CentOS grew in popularity in these environments after BroadSoft (now part of Cisco) approved CentOS as a supported platform for BroadWorks. That mattered: according to Cisco, some 26 million people get service through BroadWorks, which is used in mobile and business calling.

For these service providers, the decision about building new bare-metal or Virtual Machines is a big deal because the machines tend to last a long time. CentOS had a good reputation for reliability because it inherited the Red Hat Enterprise Linux (RHEL) reliability. But once a system is built with CentOS, it's going to be around for a while. It's not a small, quick project to replace a VM with another doing the same function, in part because the licenses from Cisco are connected to the specific installation instance. 

 

In the Era of Open Source, Good Engineering Can Be Free (But Somebody is Paying)

Ever since Open Source came into its own in the 1990s, when it began to be distributed widely on the Internet, system operators have known that they could get good, robust software for free. In every case, somebody was paying -- either by donating their time or by hiring a programmer.

Linux is really the chief example of a Good, Free project. Many system engineers have been using Linux without paying anything for the right to do so.

There are many other valuable benefits we use for free -- such as the RJ45 connector design. Certainly, somebody did pay for the "Registered Jacks" design, but we don't expect to pay them anything for it.

It turns out that building the 90%-perfect version of the software is something people are willing to do largely for free. That's probably because it involves building a set of features that the developer needs. But fixing a million tiny bugs is not something people do for free. The history of Open Source is one of a million great projects that are not being maintained anymore. When vulnerabilities and exposures are discovered, correcting those defects can be critical to the service's basic reliability.

Case Study: Red Hat Enterprise Linux and CentOS

Red Hat began business by assembling free software into a distribution. They've always released some form of their distribution at no cost. But Red Hat Enterprise Linux (RHEL) has long been a reliable way to get both the free software and its bug fixes. Because of the GNU General Public License requirements, they've been required to release the source code for those bug fixes. But they weren't required to distribute the tooling that assembles the distribution, or to run servers that make it easy to download their software.

CentOS was built from the RHEL code, but it required real work to make this a distribution separate from RHEL. In 2014, Red Hat formally hired the head developers and took control of the CentOS project. Since then, Red Hat (and IBM, its new owner) has been funding both its original RHEL distribution and a substantial portion of the work it takes to make CentOS.

In 2020, Red Hat announced that CentOS would no longer be available for free as a production-grade operating system. Since at least 2014, Red Hat's customers (and IBM's investors) had been covering the costs of organizing and managing CentOS.

For many CentOS users, this came as a bit of a shock. For some, it was merely an annoyance: they would be able to use other Linux distributions in the future. But for others, it was a major upset; it was as if they had been using the RJ45 connector for Ethernet and were suddenly told they'd have to pay to keep using it.

Software Selection Always Carries Risks

All software selections have risks. Here are some of the key areas. 

Fit

How many of the functions I need will this software provide? This assumes that we have some way of knowing what the needed functions actually are. That can be difficult if the problem space is not well understood.

But even if you have a good idea of what you need, you might struggle to determine whether some software can actually provide it. For example, if you need robocall mitigation software, you might struggle to know whether the software will provide what you really want.

Quality

Of the functions the software performs, what defects will be present? This includes both functional limitations and security limitations.

Long Term Maintenance (Support Lifespan)

As defects in the software become visible, will I be able to get the software fixed? One of the key factors here is vendor viability: will the software vendor continue to be relevant and support the product?  This is basically quality over time.

Improvements (Product Evolution)

As my needs change, will this software adjust with them? For example, it's common to change TLS versions and ciphers. Will the software allow me to adapt to these new versions?
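
As one illustration of what "adapting" looks like in practice, here is a minimal sketch using Python's standard ssl module. It raises the minimum accepted TLS version on a client context and lists the ciphers the local TLS library enables. The helper names are assumptions for illustration, and the choice of TLS 1.2 as the floor is only an example policy.

```python
import ssl


def make_client_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Build a client-side TLS context that refuses versions older than `minimum`."""
    context = ssl.create_default_context()
    context.minimum_version = minimum
    return context


def enabled_cipher_names(context):
    """List the cipher suites this context is willing to negotiate."""
    return [cipher["name"] for cipher in context.get_ciphers()]


context = make_client_context()
print(context.minimum_version.name)
# Whether TLS 1.3 is available depends on the TLS library shipped with the OS.
print(ssl.HAS_TLSv1_3)
print(len(enabled_cipher_names(context)), "cipher suites enabled")
```

Whether a newer version or cipher is available at all depends on the TLS library the operating system ships, which is one reason product evolution is tied to the OS choice.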

Integration

Related to Fit, will the software integrate with other components in my network, or is it intended to be a standalone component? 

Regulatory / Legal Risks

Will the software meet all of the regulatory requirements? Example: GDPR sets some privacy-related requirements for software used by residents of the European Union. A vast number of software components were already in place when the GDPR was approved, and each needed a path to compliance.

Implementation Risks

What are the costs in time and money to make the software actually function?

And there's more: the items above don't include some other key issues, such as the human difficulty of knowing the actual requirements.

Reversibility

Reversibility means we can change our decisions later. Will this software "lock us in"? For example, you can probably use several different apps for email on your smartphone, so choosing one of them is easy to reverse.

Case Study: Operating System Selection in Service Provider Networks with CentOS Linux and CentOS Stream

We can use the grid of risk categories shown above to think about the risks of choosing CentOS as a Service Provider Operating System. Before December 2020, CentOS Linux was a legitimate option for production servers, but after that point CentOS Stream replaced it under the same CentOS name.

  • Fitness For Function

    • CentOS Linux had the same software as Red Hat Enterprise Linux (RHEL). That doesn't mean it's good for every application, but it was well suited for servers in service providers.

    • CentOS Stream is a rolling-prerelease build showing what is likely to be included in RHEL in the future. This suggests that its fit is likely to be nearly identical to RHEL and CentOS Linux.

    • Winner: Tie

  • Quality

    • CentOS Linux included packages and errata (bug fixes, corrections) that had already been included in RHEL. RHEL has very high quality, and therefore CentOS had the same high quality.

    • CentOS Stream is a rolling-prerelease build, which means that its contents have not undergone the same extensive testing as RHEL. Red Hat is indicating that its quality is not expected to be suitable for a Production Server.

    • Winner: CentOS Linux (RIP)

  • Long Term Maintenance (Support Lifespan)

    • CentOS Linux had commitments from Red Hat for a substantial support lifespan -- but Red Hat evidently walked away from those commitments by changing the support schedule. This has left CentOS 8 users in a bad situation because they lost years of expected software updates.

    • CentOS Stream is likely to be continuously updated until Red Hat decides to dispense with the project.

    • Winner: CentOS Linux (RIP)

  • Improvements (Product Evolution)

    • CentOS Linux included changes that were adopted in RHEL. This meant that as the RHEL users' needs changed, CentOS got those improvements.

    • CentOS Stream will get changes before RHEL, which suggests that it will get improvements ahead of RHEL.

    • Winner: CentOS Stream

  • Integration

    • CentOS Linux had the same integration interfaces (postfix, subversion, Apache, MySQL, etc.) as other Linux distributions.

    • CentOS Stream should have identical interfaces to RHEL.

    • Winner: Tie

  • Regulatory / Legal Risks

    • CentOS Linux accommodates the same legal standards as RHEL, which is operated in numerous regulated environments such as Voice Telecom.

    • CentOS Stream should have identical functionality to RHEL.

    • Winner: Tie

  • Implementation Risks

    • CentOS Linux works exactly like RHEL, so software intended to integrate with RHEL works on CentOS Linux. For example, Cisco BroadWorks was built to run on RHEL and worked perfectly on CentOS.

    • CentOS Stream may include untested and unproven management interfaces.

    • Winner: CentOS Linux (RIP)

  • Reversibility

    • CentOS Linux is installed underneath applications.

    • CentOS Stream is also installed underneath applications.

    • Winner: Tie
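
The grid above can be tallied with a short Python sketch. The winners are copied from the grid; the optional per-category weights are a hypothetical addition, since the grid itself treats every category equally.

```python
from collections import Counter

# (risk category, winner) pairs copied from the grid above.
GRID = [
    ("Fitness For Function", "Tie"),
    ("Quality", "CentOS Linux"),
    ("Long Term Maintenance", "CentOS Linux"),
    ("Improvements", "CentOS Stream"),
    ("Integration", "Tie"),
    ("Regulatory / Legal Risks", "Tie"),
    ("Implementation Risks", "CentOS Linux"),
    ("Reversibility", "Tie"),
]


def tally(grid, weights=None):
    """Sum per-category weights for each winner; a category weighs 1 by default."""
    weights = weights or {}
    totals = Counter()
    for category, winner in grid:
        totals[winner] += weights.get(category, 1)
    return totals


unweighted = tally(GRID)
print(unweighted["CentOS Linux"], unweighted["CentOS Stream"], unweighted["Tie"])  # 3 1 4

# Hypothetical weighting: an operator who cares most about support lifespan.
weighted = tally(GRID, weights={"Long Term Maintenance": 3})
print(weighted["CentOS Linux"])  # 5
```

Unweighted, CentOS Linux wins three categories, CentOS Stream wins one, and four are ties. Weighting the categories an operator cares about most, such as support lifespan, only widens that gap.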

Death, Taxes, and Security Vulnerabilities

One of the key requirements for software in this era is to have a ready stream of software updates so that you can fix security vulnerabilities as they occur. As discussed in the section above on Software Risks, the Support Lifespan is a core consideration.

It's common for software users to expect improvements (product evolution). But you can keep using software without improvements. You can't safely keep using software -- connected to a network -- without remediating the vulnerabilities.

Case Study: CentOS Linux 8 End-of-Life Change

In December 2020, Red Hat announced that the support lifespan for CentOS Linux 8 would be shortened. Instead of getting support until May 2029, as originally announced, the support lifespan was shortened to the end of December 2021. This means that the Long Term Maintenance expectations of CentOS 8 could not be met.

This forces CentOS operators to change strategy. Perhaps they didn't expect to keep running their servers and VMs until 2029, but they may well have expected to get a few years out of the deal. 
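
A change like this is easier to act on when the fleet inventory can be checked against end-of-life dates. The following is a minimal sketch under stated assumptions: the CentOS Linux 8 date follows the end-of-December-2021 cutoff described above, the CentOS Linux 7 date is its published end of maintenance (June 30, 2024), and the host names and inventory format are hypothetical.

```python
from datetime import date

# End-of-life dates keyed by (distribution, major version).
END_OF_LIFE = {
    ("centos", "7"): date(2024, 6, 30),
    ("centos", "8"): date(2021, 12, 31),
}


def days_of_updates_left(distro, major, today):
    """Days until updates stop; negative once the release is past end of life."""
    return (END_OF_LIFE[(distro, major)] - today).days


def past_end_of_life(inventory, today):
    """Return the hosts whose OS release no longer receives updates."""
    return [
        host
        for host, distro, major in inventory
        if days_of_updates_left(distro, major, today) < 0
    ]


# Hypothetical inventory of (host name, distribution, major version).
inventory = [
    ("sophrosyne", "centos", "7"),
    ("crwth", "centos", "8"),
]
print(past_end_of_life(inventory, date(2022, 1, 15)))  # ['crwth']
```

Running a check like this on a schedule turns a surprise announcement into a dated work item for each affected host.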

Migrating Between Operating Systems Is Nontrivial

Another essential element is Reversibility: software decisions that can be reversed later are not nearly as crucial as those that are irreversible. The choice of OS is largely irreversible for many installations because it is deep in the stack, serving as a foundation for many other tools.

Some of the big areas that vary between distributions are:

  • Package Management
  • Startup and Shutdown scripts
  • Network Configuration
  • Standard File Paths
  • Kernel and sysctl defaults
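
For the last item in the list, one way to see what changes between two systems is to capture the output of `sysctl -a` on each and compare the results. The sketch below parses that "key = value" format and reports the differences; the sample values are made up for illustration and are not the real defaults of any distribution.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ('key = value' lines) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for keys that differ or exist on one side only."""
    diffs = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            diffs[key] = (old.get(key), new.get(key))
    return diffs


# Made-up captures from a current host and a candidate replacement.
current = parse_sysctl("vm.swappiness = 30\nnet.ipv4.ip_forward = 0\n")
candidate = parse_sysctl(
    "vm.swappiness = 60\nnet.ipv4.ip_forward = 0\nfs.file-max = 100000\n"
)
for key, (before, after) in diff_settings(current, candidate).items():
    print(key, before, "->", after)
```

A value of None on one side means the key exists on only one of the two systems, which is itself worth reviewing before a migration.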

When you choose an Operating System, you're making decisions related to three big factors: Vendor Support, Training, and Automation.

Vendor Support

Software vendors build and test on particular Linux installations. This allows them to ensure they use the proper paths, integrate with the proper automation for startup and shutdown scripts, and integrate properly with the kernel defaults. As a result, they can mandate the use of certain Linux distributions to maintain compatibility. You may be able to use their software on another Linux system, but they may not be willing to help you when it breaks if you have not followed their requirements.

In contrast, appliance-like systems, such as Oracle Enterprise Operation Monitor (formerly Oracle Communications Operation Monitor, OCOM), include Linux in their distribution and are updated by replacing the entire software template or image.

Training or Experience

Humans have to configure and manage Linux systems. While all Linux distributions run similar software, the proper method of managing a server can vary substantially between them. A system administrator who doesn't know the distribution's proper methods may do things in a fragile or temporary way.

Automation

Automating routine tasks is really essential for scalable, cloud-based applications where new Virtual Machines are created and deleted routinely. But every system can benefit from a consistent configuration regime to implement all of the system changes and specialized configurations necessary for applications. 

Because of the big areas that vary between systems, programmers customize their configurations for specific platforms. If you build automation for Red Hat Enterprise Linux but need to move it to Debian, you should expect to rebuild part of that automation to accommodate the differences in the areas described above.
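
A common way to contain that rework is to isolate the platform-specific facts in one table so the rest of the automation stays the same. The sketch below does this for the Apache web server, whose package name, service name, and main configuration path differ between the two families. The class and function names are hypothetical, and the sketch assumes both platforms use systemd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebServerFacts:
    """Platform-specific names for the same piece of software."""
    package: str
    service: str
    config_path: str


# Apache httpd as packaged by each distribution family.
APACHE = {
    "rhel": WebServerFacts("httpd", "httpd", "/etc/httpd/conf/httpd.conf"),
    "debian": WebServerFacts("apache2", "apache2", "/etc/apache2/apache2.conf"),
}


def restart_command(family):
    """Build the systemctl command that restarts Apache on the given family."""
    return ["systemctl", "restart", APACHE[family].service]


for family in ("rhel", "debian"):
    print(family, APACHE[family].config_path, restart_command(family))
```

Moving from one family to the other then means adding or correcting a row in the table, although differences in configuration layout and defaults still have to be reviewed by hand.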

Case Study: CentOS Migration Options

Red Hat's change to the CentOS project forces Service Providers to migrate to another Linux system to maintain access to security updates, at least after June 2024, when CentOS 7 updates end.

To use our reference example, Cisco BroadWorks was built to operate on Red Hat Enterprise Linux. In this case, the vendor supported RHEL and two derivative products, CentOS and Oracle Linux. To maintain compatibility with the vendor's requirements, users would have to migrate to one of the other supported platforms.

Other systems in Service Provider networks do not have a vendor integration requirement, and so the operators have more flexibility. They still have to account for Training/Experience and Automation.

Conclusions

An engineer should consider numerous factors when choosing the Operating System for a Service Provider network. There is a mixture of human factors (such as training), business risks (such as the longevity and viability of vendors), and technical risks (such as integration and defects).

The Red Hat announcement in December 2020 provides some important lessons. The most straightforward option is to switch from CentOS to Red Hat Enterprise Linux (RHEL) because of the proven quality of Red Hat's support. Other groups, such as ZDNet and Ars Technica, identify other viable options.