<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>qi.bookwar.info</title>
    <link>https://qi.bookwar.info/</link>
    <description>Notes on Release, Build and CI Engineering</description>
    <pubDate>Wed, 06 May 2026 12:39:24 +0000</pubDate>
    <image>
      <url>https://i.snap.as/B5tGWZ1X.ico</url>
      <title>qi.bookwar.info</title>
      <link>https://qi.bookwar.info/</link>
    </image>
    <item>
      <title>CI System vs CI Pipeline</title>
      <link>https://qi.bookwar.info/ci-system-vs-ci-pipeline?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[(This is a repost of the article I wrote in 2017. I think it is as relevant now as it was then)&#xA;&#xA;For any code you write, you need several steps to transform it from a set of text files to a certain release artifact or a running service. You go through these steps manually at first, but sooner or later you decide to automate them. And this is how the CI/CD pipeline of a project is born.&#xA;&#xA;But there are different ways how you can organize the automation.&#xA;&#xA;!--more--&#xA;&#xA;The classical approach (CI System) is formed around a standalone CI engine (for example Jenkins, Buildbot, Bamboo..). In the CI system you configure scenarios (jobs, build factories, plans..) and triggers. You also can add dependencies, manual triggers, parametrization, post-build hooks, dashboards and much more. CI system usually becomes the world of its own, with limited number of almighty CI administrators, managing multiple interdependent projects at once.&#xA;&#xA;There is also a new, “postmodern” way of setting up a CI, which is essentially inspired by the Travis CI and its integration into GitHub. And if we would follow the common trend, we’d call it System-less CI Pipeline.&#xA;&#xA;In this approach the recipes, triggers and actions of the CI/CD pipeline are stored in a configuration file together with the project code. The CI system (yes, there still is one) is hidden somewhere behind your repository frontend. You do not interact with the CI system directly, rather see it responding to your actions on the codebase: pull-requests, merges, review comments, tags — every action can trigger a certain recipe, configured inside the project. Feedback from the CI system is also provided via the repository frontend in a form of comments, tags and labels.&#xA;&#xA;To highlight the difference between these setups, let us consider multiple projects - A, B and C (see diagram).&#xA;&#xA;As soon as the code enters the CI system, it is no longer owned by the project. It becomes the part of a larger infrastructure, and follows the lifecycle defined on the system level. This allows complex dependency graphs, common integration gates and cross-project communication.&#xA;&#xA;In the pipeline case, the pipeline carries the project code through the set of steps defined on the project level. Each project gets an isolated, independent lifecycle, which doesn’t generally interact with the outside world.&#xA;&#xA;And which one is better&#xA;&#xA;You probably can guess my answer: yes, you need both.&#xA;&#xA;But let’s look into one common discussion point.&#xA;&#xA;CI system silo&#xA;&#xA;Silo is one of the scariest words in the modern IT environment, and also one of the most deadliest weapons. And the argument goes as follows:&#xA;&#xA;  Standalone CI system creates a silo, cutting out the dev team from the project lifecycle. Pipeline is better because it allows developers to keep the full control of the project&#xA;&#xA;As a Quantum Integration Engineer I would love to dismiss this argument altogether as it compares apples to oranges, but it does have a point. Or, better to say, I can see where it comes from:&#xA;&#xA;Widely known and widely adopted classical CI systems, like Jenkins, were not designed for collaboration and therefore are extremely bad at it. There are no “native” tools for code review, tests, approvals or visualization of the CI system configuration. 
Projects like Jenkins Job Builder, which address some of these issues, are still considered to be alien for the classical CI ecosystem. High entry barrier, outdated and unusable visualizations, lack of code reuse, no naming policies and common practices.. together with generic code-centric development culture (in which CI organization is not worth any effort) .. all of this leads to complex, legacy CI systems. And none of us wants to be there.&#xA;&#xA;Thus, from a developer point of view the choice looks as follows: either you work with some unpredictable unknown third-party CI system or you bring every piece of the infrastructure into your project and manage it yourself.&#xA;&#xA;Now, given that I admit the problem, there are some bad news:&#xA;&#xA;Switching to CI Pipeline doesn’t make you immune to the silo issue, rather it encourages you to create one. In fact, the pipeline is a silo by the very definition of it:&#xA;&#xA;  An information silo, or a group of such silos, is an insular management system in which one information system or subsystem is incapable of reciprocal operation with others that are, or should be, related. — Wikipedia&#xA;&#xA;As a project developer you might have a feeling that pipeline improves the situation, while the only thing which has improved is your position with respect to the silo wall: you are inside your own silo now:&#xA;&#xA;While you may feel good getting the full control over the pipeline definition for your project, you significantly reduce the possibility to integrate with other projects: each of them now has their own pipeline, which is not visible to you.&#xA;&#xA;The other way&#xA;&#xA;Let me add the disclaimer first: if you run a project, which produces a single artifact (a binary or a container), if you expect every project contributor to be fully aware of its entire lifecycle, then it might be the CI Pipeline approach is enough for your use case. Do use it, it is better than nothing.&#xA;&#xA;But if we talk about continuous integration of hundreds of components.. we need more than that.&#xA;&#xA;Of course there are certain project-level tasks which project team can control on its own. But then the control needs to be passed further to the shared system where representatives of different projects can design, discuss and develop common integration workflows together with the people understanding the whole.&#xA;&#xA;Rather than treating CI as a simple collection of pipeline configurations hidden in the individual project repositories, we need to see a shared CI as a standalone open project, which needs architecture, governance, configuration as a code, change process, code review, documentation, on-boarding guides and simplified drive-by contributions.&#xA;&#xA;To solve the collaboration problem you don’t give every collaborator a personal isolated playground. Instead you give collaborators tools and processes to work on the shared environment. And it is what good CI System is supposed to be. It is an integration system, after all.]]&gt;</description>
      <content:encoded><![CDATA[<p><em>(This is a repost of the article I wrote in 2017. I think it is as relevant now as it was then)</em></p>

<p>For any code you write, you need several steps to transform it from a set of text files to a certain release artifact or a running service. You go through these steps manually at first, but sooner or later you decide to automate them. And this is how the <em>CI/CD</em> pipeline of a project is born.</p>

<p>But there are different ways to organize the automation.</p>



<p>The classical approach (<strong>CI System</strong>) is formed around a standalone CI engine (for example Jenkins, Buildbot, Bamboo..). In the CI system you configure scenarios (jobs, build factories, plans..) and triggers. You can also add dependencies, manual triggers, parametrization, post-build hooks, dashboards and much more. The CI system usually becomes a world of its own, with a limited number of almighty CI administrators managing multiple interdependent projects at once.</p>
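<p>To make it concrete, here is a minimal sketch of how such a centrally managed scenario could be described, using the Jenkins Job DSL as one possible notation (the job name, repository URL and build script below are made-up placeholders):</p>

<pre><code>// A hypothetical freestyle job, defined and managed inside the CI system
// by its administrators, not inside the project repository.
job('project-a-build') {
    scm {
        // placeholder repository URL and branch
        git('https://example.org/project-a.git', 'main')
    }
    triggers {
        // poll the repository for changes every five minutes
        scm('H/5 * * * *')
    }
    steps {
        shell('./build.sh')
    }
}
</code></pre>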

<p>There is also a new, “postmodern” way of setting up CI, which is essentially inspired by Travis CI and its integration with GitHub. And if we were to follow the common trend, we’d call it <strong>System-less CI Pipeline</strong>.</p>

<p>In this approach the recipes, triggers and actions of the CI/CD pipeline are stored in a configuration file together with the project code. The CI system (yes, there still is one) is hidden somewhere behind your repository frontend. You do not interact with the CI system directly; rather, you see it responding to your actions on the codebase: pull requests, merges, review comments, tags — every action can trigger a certain recipe, configured inside the project. Feedback from the CI system is also provided via the repository frontend in the form of comments, tags and labels.</p>
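<p>As a rough sketch (the stage names and scripts are assumptions, not a recipe), such an in-repo pipeline definition might look like a declarative Jenkinsfile committed next to the sources:</p>

<pre><code>// Jenkinsfile at the repository root; the hidden CI system picks it up
// on every pull request or merge and reports back via the frontend.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './build.sh'
            }
        }
        stage('Test') {
            steps {
                sh './run-tests.sh'
            }
        }
    }
    post {
        failure {
            // feedback surfaces as a status on the pull request
            echo 'Build failed'
        }
    }
}
</code></pre>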

<p>To highlight the difference between these setups, let us consider multiple projects – A, B and C (see diagram).</p>

<p><img src="https://i.snap.as/6xx5FsVx.webp" alt=""/></p>

<p>As soon as the code enters the CI system, it is no longer owned by the project. It becomes part of a larger infrastructure and follows the lifecycle defined on the system level. This allows complex dependency graphs, common integration gates and cross-project communication.</p>

<p>In the pipeline case, the pipeline carries the project code through the set of steps defined on the project level. Each project gets an isolated, independent lifecycle, which doesn’t generally interact with the outside world.</p>

<h2 id="and-which-one-is-better">And which one is better?</h2>

<p>You probably can guess my answer: yes, you need both.</p>

<p>But let’s look into one common discussion point.</p>

<h2 id="ci-system-silo">CI system silo</h2>

<p><em>Silo</em> is one of the scariest words in the modern IT environment, and also one of the deadliest weapons. And the argument goes as follows:</p>

<blockquote><p><em>A standalone CI system creates a silo, cutting the dev team out of the project lifecycle. A pipeline is better because it allows developers to keep full control of the project.</em></p></blockquote>

<p>As a Quantum Integration Engineer I would love to dismiss this argument altogether as it compares apples to oranges, but it does have a point. Or, better to say, I can see where it comes from:</p>

<p>Widely known and widely adopted classical CI systems, like Jenkins, were not designed for collaboration and therefore are extremely bad at it. There are no “native” tools for code review, tests, approvals or visualization of the CI system configuration. Projects like Jenkins Job Builder, which address some of these issues, are still considered alien to the classical CI ecosystem. A high entry barrier, outdated and unusable visualizations, lack of code reuse, no naming policies or common practices.. together with a generic code-centric development culture (in which CI organization is not worth any effort).. all of this leads to complex, legacy CI systems. And none of us wants to be there.</p>

<p>Thus, from a developer’s point of view the choice looks as follows: either you work with some unpredictable, unknown third-party CI system, or you bring every piece of the infrastructure into your project and manage it yourself.</p>

<p><img src="https://i.snap.as/RmxcmT8J.webp" alt=""/></p>

<p>Now, given that I admit the problem, there is some bad news:</p>

<p>Switching to a CI Pipeline doesn’t make you immune to the silo issue; rather, it encourages you to create one. In fact, the pipeline is a silo by the very definition of it:</p>

<blockquote><p><em>An information silo, or a group of such silos, is an insular management system in which one information system or subsystem is incapable of reciprocal operation with others that are, or should be, related.</em> — Wikipedia</p></blockquote>

<p>As a project developer you might have a feeling that the pipeline improves the situation, while the only thing which has improved is your position with respect to the silo wall: you are <em>inside</em> your own silo now:</p>

<p><img src="https://i.snap.as/12laPLUT.webp" alt=""/></p>

<p>While you may feel good about getting full control over the pipeline definition for your project, you significantly reduce the possibility of integrating with other projects: each of them now has its own pipeline, which is not visible to you.</p>

<h2 id="the-other-way">The other way</h2>

<p>Let me add a disclaimer first: if you run a project which produces a single artifact (a binary or a container), and if you expect every project contributor to be fully aware of its entire lifecycle, then the CI Pipeline approach might be enough for your use case. Do use it, it is better than nothing.</p>

<p>But if we talk about continuous integration of hundreds of components.. we need more than that.</p>

<p>Of course there are certain project-level tasks which the project team can control on its own. But then the control needs to be passed further to the shared system, where representatives of different projects can design, discuss and develop common integration workflows together with the people who understand the whole.</p>

<p>Rather than treating CI as a simple collection of pipeline configurations hidden in the individual project repositories, we need to see a shared CI as a standalone open project, which needs architecture, governance, configuration as code, a change process, code review, documentation, on-boarding guides and simplified drive-by contributions.</p>

<p><img src="https://i.snap.as/TlvNm5zK.webp" alt=""/></p>

<p>To solve the collaboration problem you don’t give every collaborator a personal isolated playground. Instead you give collaborators tools and processes to work on the <em>shared</em> environment. And that is what a good CI System is supposed to be. It is an <strong>integration system</strong>, after all.</p>
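<p>One possible shape of such a shared system is a Jenkins shared pipeline library: the common workflow steps live in their own reviewed and documented repository, and projects call them instead of copy-pasting pipeline code. A minimal sketch, with hypothetical library and step names:</p>

<pre><code>// vars/standardBuild.groovy in the shared CI repository:
// a reusable step, versioned and code-reviewed like any other project.
def call(Map config = [:]) {
    node {
        checkout scm
        stage('Build') {
            sh(config.buildCommand ?: './build.sh')
        }
        stage('Test') {
            sh(config.testCommand ?: './run-tests.sh')
        }
    }
}
</code></pre>

<p>A project’s own Jenkinsfile then shrinks to <code>@Library('shared-ci') _</code> followed by <code>standardBuild(buildCommand: 'make')</code>, while the workflow itself stays in the shared, visible CI project.</p>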
]]></content:encoded>
      <guid>https://qi.bookwar.info/ci-system-vs-ci-pipeline</guid>
      <pubDate>Thu, 10 Oct 2024 11:27:24 +0000</pubDate>
    </item>
    <item>
      <title>Integration engineer in a Java shop</title>
      <link>https://qi.bookwar.info/integration-engineer-in-a-java-shop?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Or some ramblings on the difference between the mindset of developers and integration engineers, based on a totally unscientific observation.&#xA;&#xA;Some time ago I worked as a Build/CI Engineer in a Java development team. I came there with zero knowledge of the Java ecosystem. I had to learn a lot, including at least three kinds of Java garbage collection and how many brackets one needs to write a “Hello world”.&#xA;&#xA;I have several favorite stories from those times, and this one is about how I untangled the mess of a Gradle the Java developers managed to create in their repository.&#xA;&#xA;!--more--&#xA;&#xA;Development vs Integration&#xA;&#xA;Let’s start with a question. Why was there a mess?&#xA;&#xA;My answer to that is: they were senior experienced developers, but they were developers.&#xA;&#xA;And a typical developer is driven by a desire to code things the way they want them to be. They have power, they have control, they have the ability to make things for themselves and no one should tell them how it should be done. At least theoretically. In their minds.&#xA;&#xA;While Build/CI/Integration Engineers rarely ever see or use things coded the way they want them (I am looking at you, Jenkins).&#xA;&#xA;Integration engineer&#39;s desire or, better to say, necessity is to use the pieces you are given and to adapt them to the environment you are forced into.&#xA;&#xA;You don’t expect the environment to bend to your will. Which also often means you don’t feel especially bad, when it doesn’t. You find joy in building a puzzle, so that it fits.&#xA;&#xA;DevOps vs SysAdmin vs User sidenote&#xA;&#xA;I blame this development mindset as a main reason for the DevOps fever.&#xA;&#xA;If you visit a DevOps conference, half of the talks there are about people writing a generic tool to manage all the other generic infrastructure management tools. There are rarely talks about how you used tools written by someone else. Because despite its best intent, DevOps culture put development on the top of the “food chain” and turned DevOps into an idea that every SysAdmin should write code now.&#xA;&#xA;I often feel out of place in DevOps crowds, and I explicitly asked my management to change my title from a DevOps to a CI Engineer some years ago. Because I often want to cry out loud:&#xA;&#xA;\- We are supposed to be USERS!&#xA;&#xA;Not the kind of users who are being mocked for not understanding the basics. But the users who are proud of mastering the tools and integrating them one into another and using them to solve real-world problems.&#xA;&#xA;And yes, being a responsible, effective, collaborative and knowledgeable Professional User, who is able to learn and understand new tools, is a highly valuable and rare engineering skill.&#xA;&#xA;Back to the story&#xA;&#xA;There was me, Java developers, and a repository with 200 000 lines of Java code and 1000 lines of Gradle scripts.&#xA;&#xA;Gradle is an opinionated build tool. It is supposed to be declarative, it is written in Groovy, but occasionally a build script can go as deep as to underlying Java factories of containers of abstract interfaces and things like that..&#xA;&#xA;As every other build tool, Gradle is also a constant source of frustration for any development team.&#xA;&#xA;On my team only one, the most senior, person was able to work with it. When he saw Gradle not doing something correctly he would write a Groovy function to do it the right way. And eventually he wrote about 1000 lines of Groovy code. 
And then he left.&#xA;&#xA;Thus, when I joined there was no one to ask about the state of the build scripts, but there was also no authority to tell what should be done about them.&#xA;&#xA;I opened the scripts and tried to read them.&#xA;&#xA;100500 opened tabs of Gradle documentation and three months later, I realized that in those scripts we keep overriding some Groovy class methods to change the way we handle subcomponents and their dependencies. We redefine paths where to look for their configs, tests and versions. And we are doing it again and again in a build task, in a test task, in each of the helper tasks…&#xA;&#xA;Instead of trying to refactor and generalize the code I did a very simple thing: I moved the java code of each component into a path, where Gradle expects it to be by design. I put tests in the default tests path and sources in the default source path as it is recommended by the Gradle documentation. And by doing that I cleared up the build scripts to somewhat 50 lines of the configuration data(name, owner, bugtracker url, this sort of thing) and exactly zero Groovy scripting. Apparently, once you put the files in places where Gradle expects them to be, you don’t need to tell it what to do anymore, it all just works out of the box.&#xA;&#xA;The moral of the story&#xA;&#xA;I almost hear a person reading this far and asking:&#xA;&#xA;Wait, instead of configuring the build tool to take files from where they were, you changed the layout of the whole project?!&#xA;&#xA;And yes, this is the point.&#xA;&#xA;To use a tool effectively it is not enough to look into its documentation and learn how to configure what you want to configure. It is important to learn how the tool wants to be configured.&#xA;&#xA;I am not a big fan of Gradle. I find Groovy annoying. I find the ability to reference the same variable by at least three different names confusing. Yet, if the choice has been made to use it to build a project, then I better adjust my project so that it is easier to use the tool. Unless I have a really good reason to fight it.&#xA;&#xA;And it will save everyone a lot of trouble.]]&gt;</description>
      <content:encoded><![CDATA[<p><em>Or some ramblings on the difference between the mindset of developers and integration engineers, based on a totally unscientific observation.</em></p>

<p>Some time ago I worked as a Build/CI Engineer in a Java development team. I came there with zero knowledge of the Java ecosystem. I had to learn a lot, including at least three kinds of Java garbage collection and how many brackets one needs to write a “Hello world”.</p>

<p>I have several favorite stories from those times, and this one is about how I untangled the Gradle mess the Java developers managed to create in their repository.</p>



<h1 id="development-vs-integration">Development vs Integration</h1>

<p>Let’s start with a question. Why was there a mess?</p>

<p>My answer to that is: they were senior experienced developers, but they were <em>developers</em>.</p>

<p>And a typical developer is driven by a desire to code things the way <em>they</em> want them to be. They have power, they have control, they have the ability to make things for themselves and no one should tell them how it should be done. At least theoretically. In their minds.</p>

<p>While Build/CI/Integration Engineers rarely ever see or use things coded the way they want them (I am looking at you, Jenkins).</p>

<p>An integration engineer’s desire or, better to say, necessity is to use the pieces you are given and to adapt them to the environment you are forced into.</p>

<p>You don’t expect the environment to bend to your will. Which also often means you don’t feel especially bad, when it doesn’t. You find joy in building a puzzle, so that it fits.</p>

<h3 id="devops-vs-sysadmin-vs-user-sidenote">DevOps vs SysAdmin vs User sidenote</h3>

<p>I blame this development mindset as the main reason for the <em>DevOps fever</em>.</p>

<p>If you visit a DevOps conference, half of the talks there are about people writing a generic tool to manage all the other generic infrastructure management tools. There are rarely talks about how you used tools written by someone else. Because despite its best intent, DevOps culture put development at the top of the “food chain” and turned DevOps into an idea that every SysAdmin should write code now.</p>

<p>I often feel out of place in DevOps crowds, and I explicitly asked my management to change my title from DevOps to CI Engineer some years ago. Because I often want to cry out loud:</p>

<p>- We are supposed to be USERS!</p>

<p>Not the kind of users who are being mocked for not understanding the basics. But the users who are proud of mastering the tools and integrating them one into another and using them to solve real-world problems.</p>

<p>And yes, being a responsible, effective, collaborative and knowledgeable Professional User, who is able to learn and understand new tools, is a highly valuable and rare <em>engineering skill</em>.</p>

<h1 id="back-to-the-story">Back to the story</h1>

<p>There was me, the Java developers, and a repository with 200 000 lines of Java code and 1000 lines of Gradle scripts.</p>

<p>Gradle is an opinionated build tool. It is supposed to be declarative, it is written in Groovy, but occasionally a build script can go as deep as the underlying Java factories of containers of abstract interfaces and things like that..</p>

<p>Like every other build tool, Gradle is also a constant source of frustration for any development team.</p>

<p>On my team only one person, the most senior one, was able to work with it. When he saw Gradle not doing something correctly, he would write a Groovy function to do it the right way. And eventually he wrote about 1000 lines of Groovy code. And then he left.</p>

<p>Thus, when I joined there was no one to ask about the state of the build scripts, but there was also no authority to tell what should be done about them.</p>

<p>I opened the scripts and tried to read them.</p>

<p>100500 open tabs of Gradle documentation and three months later, I realized that in those scripts we keep overriding some Groovy class methods to change the way we handle subcomponents and their dependencies. We redefine the paths where to look for their configs, tests and versions. And we are doing it again and again in the build task, in the test task, in each of the helper tasks…</p>
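<p>To give a flavor of it, the overrides looked roughly like this (the component names and paths here are invented for the illustration):</p>

<pre><code>// Repeated, with variations, in the build task, the test task and every
// helper task: telling Gradle to look for code in non-standard places.
sourceSets {
    main {
        java {
            srcDirs = ['components/payment/javasrc']
        }
    }
    test {
        java {
            srcDirs = ['components/payment/javatests']
        }
    }
}
</code></pre>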

<p>Instead of trying to refactor and generalize the code I did a very simple thing: I moved the Java code of each component into the path where Gradle expects it to be by design. I put tests in the default tests path and sources in the default sources path, as recommended by the Gradle documentation. And by doing that I cleared up the build scripts to some 50 lines of configuration data (name, owner, bugtracker URL, this sort of thing) and exactly zero Groovy scripting. Apparently, once you put the files in places where Gradle expects them to be, you don’t need to tell it what to do anymore, it all just works out of the box.</p>
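<p>After the move, a component’s build script could be as small as this sketch (the plugin choice and coordinates are illustrative):</p>

<pre><code>// build.gradle after the cleanup: sources sit in src/main/java and
// tests in src/test/java, so no sourceSets overrides are needed.
plugins {
    id 'java'
}

group = 'org.example.payment'
version = '1.0.0'
description = 'Payment component'
</code></pre>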

<h1 id="the-moral-of-the-story">The moral of the story</h1>

<p>I can almost hear a person who has read this far asking:</p>

<p><strong>Wait, instead of configuring the build tool to take files from where they were, you changed the layout of the whole project?!</strong></p>

<p>And yes, this is the point.</p>

<p>To use a tool effectively it is not enough to look into its documentation and learn how to configure what <em>you want</em> to configure. It is important to learn how <em>the tool wants</em> to be configured.</p>

<p>I am not a big fan of Gradle. I find Groovy annoying. I find the ability to reference the same variable by at least three different names confusing. Yet, if the choice has been made to use it to build a project, then I’d better adjust my project so that it is easier to use the tool. Unless I have a really good reason to fight it.</p>

<p>And it will save everyone a lot of trouble.</p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/integration-engineer-in-a-java-shop</guid>
      <pubDate>Wed, 03 Apr 2024 20:19:37 +0000</pubDate>
    </item>
    <item>
      <title>Pi Day Special - Manifold</title>
      <link>https://qi.bookwar.info/pi-day-special-manifold?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[There are several very generic mathematical abstractions, which I miss a lot in my daily life. The issue is that those concepts do exist, and I meet them all the time, but I do not know how to refer to them without using a mathematical term, that most people will not understand.&#xA;&#xA;So, to celebrate the Pi Day, let&#39;s try to close the gap.&#xA;&#xA;And today we are going to talk about the idea of a Manifold.&#xA;&#xA;!--more--&#xA;&#xA;The elephant in the room&#xA;&#xA;The Wikipedia article for manifold shares the same problem as many other mathematical pages - it dives directly into the technical details without trying to explain the idea in “layman’s terms”, which makes it not accessible for anyone without a math degree.&#xA;&#xA;So let’s take another path and start by recalling the Parable of Blind Men and an Elephant:&#xA;&#xA;  “a story of a group of blind men who have never come across an elephant before and who learn and imagine what the elephant is like by touching it. Each blind man feels a different part of the animal&#39;s body, but only one part, such as the side or the tusk. They then describe the animal based on their limited experience and their descriptions of the elephant are different from each other”&#xA;&#xA;What if I tell you that topological manifold is a representation of this very idea?&#xA;&#xA;Since I don’t like torturing neither people nor animals for the sake of an imaginary argument, let’s change a setting a bit and assume that we have a large rock in a dark room. We have a small torchlight, so at any moment in time we can observe a single small spot on that rock.&#xA;&#xA;And the big question of topology is - how much I can figure out about the whole object, given the knowledge of small neighborhoods of every single point.&#xA;&#xA;The ants&#xA;&#xA;Now the torchlight analogy is actually cheating. If you are running around the object with a torchlight, you already have quite a lot of information. You can step back and take a better look, or you can measure distance between any two points with a ruler. The real fun starts when you can not look at the object from the safety and comfort of a familiar (often Euclidean) embedding space.&#xA;&#xA;So rather than being a person observing the rock from the side, imagine to be the ant living on it. You don’t have a universal ruler, you are living on your small area of the rock, tracing your own tiny steps and measuring them with your own tiny feet. The thick fog covers everything, so that you can not look at the distance, therefore you live in the down-to-earth two-dimensional world.&#xA;&#xA;And then there are other ants, living in their own areas.&#xA;&#xA;And in the evening all those ants from all over the place join their federated social network and discuss the topics like “is the Earth flat?” and “what is the shortest path from the Green Moss Valley to the Hard Rock Mountain?”.&#xA;&#xA;To prove the point they draw charts of the area, each their own. But then they quickly realize that they do not really know how to compare the foot of a red ant with a foot of the green one, and whether the Green Moss Valley one ant has drawn on their chart is the same Green Valley as drawn by the other.&#xA;&#xA;Because creating the collection of local charts is not enough.&#xA;&#xA;I mean, do not get me wrong, it is definitely better than nothing. 
At least you can confirm that you are ants living on a two dimensional surface, and you are not talking with a jellyfish floating in a three dimensional liquid. But that’s basically all you can get.&#xA;&#xA;This stage of our ant community is called a topological manifold (“n-manifold is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of n-dimensional Euclidean space”).&#xA;&#xA;The atlas&#xA;&#xA;  The use of the word &#34;atlas&#34; in a geographical context dates from 1595 when the German-Flemish geographer Gerardus Mercator published Atlas Sive Cosmographicae Meditationes de Fabrica Mundi et Fabricati Figura (&#34;Atlas or cosmographical meditations upon the creation of the universe and the universe as created&#34;). This title provides Mercator&#39;s definition of the word as a description of the creation and form of the whole universe, not simply as a collection of maps.&#xA;    Wikipedia, etymology of the word “atlas”.&#xA;&#xA;The simple collection of charts doesn’t give you any idea about how the charts relate to each other. We are in that blind elephant situation, where I can talk about my map, you can talk about yours, but there is no connection between them. And we don’t have a third-party, the embedding space, which would provide us a shared frame of reference.&#xA;&#xA;How do we fix it? We create the connection.&#xA;&#xA;Every ant needs to reach their neighbors - those other ants whose area overlaps with this one. And for that overlapping part they have to create a mapping between their respective charts - to calculate how many red feet are in a green foot, and to pin the shared known locations.&#xA;&#xA;It is important that all the charts are compatible, so that those mappings, transitional functions between every pair of overlapping charts, are well-defined. This is not a given. There are, in fact, topological manifolds which are covered by charts in such a way that there is no good mapping between them.&#xA;&#xA;If our ant community manages to achieve that level of agreement, they are upgraded from the status of a topological manifold to a differentiable or a smooth one.&#xA;&#xA;The happy end&#xA;&#xA;And this is the moment when we reach some shared level of understanding:&#xA;&#xA;if one ant wants to compare notes with a remote ant somewhere far away, they first establish a path via chain of neighborhoods. And then those transition functions allow them to carry notions along the path.&#xA;&#xA;What I like about the manifold object, as opposed to the elephant parable, is that rather than just talking about a problem of misunderstanding it gives you a recipe to resolve it:&#xA;&#xA;Find the overlapping areas, create the mappings, ensure these are compatible, connect your local charts with the path, and walk it step by step applying transitions, when required.&#xA;&#xA;There is more to it, of course. Do we always get the path? Do we get the same result if we choose another path to the same destination? How ants identify valleys and mountains? Can a donut-shaped Earth still be flat?..&#xA;&#xA;This is what Geometry and Topology is about.&#xA;&#xA;---&#xA;&#xA;And every so often when I see an argument, I want to ask: folks, have you checked your transition functions? :)]]&gt;</description>
      <content:encoded><![CDATA[<p>There are several very generic mathematical abstractions, which I miss a lot in my daily life. The issue is that those concepts do exist, and I meet them all the time, but I do not know how to refer to them without using a mathematical term, that most people will not understand.</p>

<p>So, to celebrate the Pi Day, let&#39;s try to close the gap.</p>

<p>And today we are going to talk about the idea of a <strong>Manifold.</strong></p>



<h2 id="the-elephant-in-the-room">The elephant in the room</h2>

<p>The Wikipedia article for <a href="https://en.wikipedia.org/wiki/Manifold" title="Manifold" rel="nofollow">manifold</a> shares the same problem as many other mathematical pages – it dives directly into the technical details without trying to explain the idea in “layman’s terms”, which makes it inaccessible to anyone without a math degree.</p>

<p>So let’s take another path and start by recalling the <a href="https://en.wikipedia.org/wiki/Blind_men_and_an_elephant" title="Parable of Blind Men and an Elephant" rel="nofollow">Parable of Blind Men and an Elephant</a>:</p>

<blockquote><p>“a story of a group of blind men who have never come across an elephant before and who learn and imagine what the elephant is like by touching it. Each blind man feels a different part of the animal&#39;s body, but only one part, such as the side or the tusk. They then describe the animal based on their limited experience and their descriptions of the elephant are different from each other”</p></blockquote>

<p>What if I told you that a topological manifold is a representation of this very idea?</p>

<p>Since I don’t like torturing either people or animals for the sake of an imaginary argument, let’s change the setting a bit and assume that we have a large rock in a dark room. We have a small torchlight, so at any moment in time we can observe a single small spot on that rock.</p>

<p>And the big question of topology is – how much can I figure out about the whole object, given the knowledge of small neighborhoods of every single point?</p>

<p><img src="https://i.snap.as/LbfJ3bFe.png" alt=""/></p>

<h2 id="the-ants">The ants</h2>

<p>Now, the torchlight analogy is actually cheating. If you are running around the object with a torchlight, you already have quite a lot of information. You can step back and take a better look, or you can measure the distance between any two points with a ruler. The real fun starts when you can not look at the object from the safety and comfort of a familiar (often Euclidean) embedding space.</p>

<p>So rather than being a person observing the rock from the side, imagine being an ant <em>living</em> on it. You don’t have a universal ruler; you are living on your small area of the rock, tracing your own tiny steps and measuring them with your own tiny feet. A thick fog covers everything, so that you can not look into the distance, therefore you live in a down-to-earth two-dimensional world.</p>

<p>And then there are other ants, living in their own areas.</p>

<p><img src="https://i.snap.as/McLa7RHs.png" alt=""/></p>

<p>And in the evening all those ants from all over the place join their federated social network and discuss topics like “is the Earth flat?” and “what is the shortest path from the Green Moss Valley to the Hard Rock Mountain?”.</p>

<p>To prove the point they draw charts of the area, each their own. But then they quickly realize that they do not really know how to compare the foot of a red ant with the foot of a green one, and whether the Green Moss Valley one ant has drawn on their chart is the same Green Moss Valley as drawn by the other.</p>

<p>Because creating the collection of local charts is not enough.</p>

<p>I mean, do not get me wrong, it is definitely better than nothing. At least you can confirm that you are ants living on a two dimensional surface, and you are not talking with a jellyfish floating in a three dimensional liquid. But that’s basically all you can get.</p>

<p><img src="https://i.snap.as/MxhIjuAc.png" alt=""/></p>

<p>This stage of our ant community is called a <strong>topological manifold</strong> (<a href="https://en.wikipedia.org/wiki/Manifold" title="Wikipedia - Manifold" rel="nofollow">“n-manifold is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of n-dimensional Euclidean space”</a>).</p>
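<p>In symbols (a light formalization of the ant picture): every patch of the rock comes with a chart that flattens it onto the plane, and the patches together cover the whole surface:</p>

<pre><code>% each chart is a homeomorphism from a patch U_i of the surface M
% onto an open piece of the plane (n = 2 for our ants)
\varphi_i : U_i \to \varphi_i(U_i) \subseteq \mathbb{R}^n,
\qquad M = \bigcup_i U_i
</code></pre>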

<h2 id="the-atlas">The atlas</h2>

<blockquote><p>The use of the word “atlas” in a geographical context dates from 1595 when the German-Flemish geographer Gerardus Mercator published Atlas Sive Cosmographicae Meditationes de Fabrica Mundi et Fabricati Figura (“Atlas or cosmographical meditations upon the creation of the universe and the universe as created”). This title provides Mercator&#39;s definition of the word as a description of the creation and form of the whole universe, not simply as a collection of maps.</p>

<p>Wikipedia, <a href="https://en.wikipedia.org/wiki/Atlas#Etymology" rel="nofollow">etymology of the word “atlas”</a>.</p></blockquote>

<p>The simple collection of charts doesn’t give you any idea about how the charts relate to each other. We are in that blind elephant situation, where I can talk about my map, you can talk about yours, but there is no connection between them. And we don’t have a third party, the embedding space, which would provide us with a shared frame of reference.</p>

<p>How do we fix it? We create the connection.</p>

<p>Every ant needs to reach their neighbors – those other ants whose area overlaps with this one. And for that overlapping part they have to create a mapping between their respective charts – to calculate how many red feet are in a green foot, and to pin the shared known locations.</p>

<p>It is important that all the charts are compatible, so that those mappings, the transition functions between every pair of overlapping charts, are well-defined. This is not a given. There are, in fact, topological manifolds which are covered by charts in such a way that there is no good mapping between them.</p>
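<p>Formally, a transition function is the change of coordinates between two overlapping charts, and “compatible” is a condition on these functions:</p>

<pre><code>% the transition function between charts i and j, defined on their overlap
\tau_{ij} = \varphi_j \circ \varphi_i^{-1} :
    \varphi_i(U_i \cap U_j) \to \varphi_j(U_i \cap U_j)
% the atlas is smooth when every \tau_{ij} is infinitely differentiable
</code></pre>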

<p>If our ant community manages to achieve that level of agreement, they are upgraded from the status of a topological manifold to a <em>differentiable</em> or a <em>smooth</em> one.</p>

<h2 id="the-happy-end">The happy end</h2>

<p>And this is the moment when we reach some shared level of understanding:</p>

<p>if one ant wants to compare notes with a remote ant somewhere far away, they first establish a path via a chain of neighborhoods. And then those transition functions allow them to carry notions along the path.</p>
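<p>In the same notation, carrying a notion from the first chart to the last one along the chain is just composing the transition functions step by step:</p>

<pre><code>% following a path through overlapping neighborhoods 1, 2, ..., k
\tau_{1k} = \tau_{(k-1)k} \circ \dots \circ \tau_{23} \circ \tau_{12}
</code></pre>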

<p>What I like about the manifold object, as opposed to the elephant parable, is that rather than just talking about a problem of misunderstanding it gives you a recipe to resolve it:</p>

<p><em>Find the overlapping areas, create the mappings, ensure these are compatible, connect your local charts with the path, and walk it step by step applying transitions, when required.</em></p>

<p>There is more to it, of course. Do we always get the path? Do we get the same result if we choose another path to the same destination? How do ants identify valleys and mountains? Can a donut-shaped Earth still be flat?..</p>

<p>This is what Geometry and Topology is about.</p>

<hr/>

<p>And every so often when I see an argument, I want to ask: folks, have you checked your transition functions? :)</p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/pi-day-special-manifold</guid>
      <pubDate>Wed, 13 Mar 2024 23:49:54 +0000</pubDate>
    </item>
    <item>
      <title>Container Registry Infrastructure</title>
      <link>https://qi.bookwar.info/container-registry-infrastructure?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Once you get exposed to the Cloud, DevOps and Everything As A Service world, you realize that almost every piece of software you might need has already been written by someone (or provided as a service).&#xA;&#xA;Why is it not all rainbows and unicorns then?&#xA;&#xA;Because you do not need a tool, a software or a service.&#xA;You need an infrastructure. !--more--&#xA;&#xA;Disclaimer: I wrote this article in 2017, and this is a repost with minor edits.&#xA;&#xA;Start with the right problem&#xA;&#xA;The most common issue in dealing with the infrastructure setup is that you forget about the task.&#xA;&#xA;Of course it was hanging somewhere in the beginning, when you gathered the initial list of tools, which might be worth looking at. But then, once you get your hands on the list, you ask yourself:&#xA;&#xA;  Which one is better - DockerHub, Quay, Artifactory or Nexus?&#xA;&#xA;And you are doomed.&#xA;&#xA;As soon as you ask that question, you start to look into feature lists and marketing materials for the mentioned products. And then you build your entire choice strategy on figuring out how to get more by paying less.&#xA;&#xA;The truth is: you do not need more. Usually you need just right amount of stuff to do the right task. And you can not replace one feature with five others you have got on a summer sale.&#xA;&#xA;So let&#39;s do our homework first and get back to the task.&#xA;&#xA;The task&#xA;&#xA;Again, we start with the question:&#xA;&#xA;  We need a Container Registry, how do we get one?&#xA;&#xA;And again, we are wrong.&#xA;&#xA;No one needs a Container Registry for the sake of having a Container Registry alone. It is an implementation detail we have slipped in our problem statement. So let&#39;s remove it and find out what we are truly looking for:&#xA;&#xA;the infrastructure to build, store and distribute container images&#xA;&#xA;Now this is something to talk about.&#xA;&#xA;Build&#xA;&#xA;Container images are widely known as a delivery mechanism. And every delivery can work in two ways: it might be you who packages and delivers the software to the customer (or target environment), or it might be some third-party which packages the software and delivers it to you.&#xA;&#xA;In the latter case you might get out just fine without the need to build a single container image on your own. So you can skip this part and focus on storing and distributing. Even better, you may forget about container images. Treat them as any other external binary artifact: download the tarball, install, backup and watch for the updates.&#xA;&#xA;If building container images is your thing, then welcome to the club and keep reading.&#xA;&#xA;Store&#xA;&#xA;Let&#39;s say you use container images for packaging the apps to deploy them to your production environment. You have Containerfiles stored in your source code, and CI/CD pipelines are configured to build them in your build service. And your release schedule is close to one release per hour.&#xA;&#xA;With that kind of setup your container infrastructure is, in fact, stateless. If by some reason you lose all the container images currently available, you can rebuild them one by one directly from sources. 
Then you do not really need the storage function and can focus on building and distributing.&#xA;&#xA;If, on the other hand, you need a long-term storage for container images, for example those which you have released and delivered to your customers, things get slightly more complicated.&#xA;&#xA;In real life you are most likely to need both: there are base images you update once per year, there are images which you can not rebuild at all (there shouldn&#39;t be, but there usually are) and there are images which you build daily in hundreds.&#xA;&#xA;And it totally makes sense to manage them differently. Maybe with different tools?&#xA;&#xA;Distribute&#xA;&#xA;In the end it all comes to the distributing of images to the target environments. And target environment could be a developer workstation, CI system, production or customer environment. And these environments are usually distributed all over the globe.&#xA;&#xA;Now container registry should connect to all three by the closest distance possible. Can you find a place for it? I guess not. And here is where caching, mirroring and proxying are brought into play.&#xA;&#xA;With all that in mind, the natural question to ask is:&#xA;&#xA;how many registries do you actually need?&#xA;&#xA;It appears, more than one.&#xA;&#xA;Build, store and distribute, shaken, not stirred&#xA;&#xA;Let set up the context and consider the following situation: we develop an application, package it into container images, and then deploy them to several remote production environments (for example eu and us).&#xA;&#xA;Registries and their users&#xA;&#xA;Users working within the infrastructure can be divided into three groups: contributors (Dev), build and test systems (CI), and the customer or consumer (Prod).&#xA;&#xA;Dev users create and share random containers all day long. They do not have rules and restrictions. Dev containers should never go anywhere but the workstations and temporary custom environments. These containers are generally disposable and do not need a long-term storage.&#xA;&#xA;CI users are automated and reliable, thus they can follow strict rules on naming, tagging and metadata. Similarly to Dev, CI users generate and exchange huge amounts of disposable images. But as CI provides the data to make decisions (be it decision to approve a pull-request or decision to release the product), CI images must be verifiable and authoritative.&#xA;&#xA;Prod users do not create containers, they work in a read-only mode and must have a strict policy on what is provided to them. The prod traffic is much lower in terms of variety of images consumed, while it might be high in terms of the amount of data fetched.&#xA;&#xA;We could have tried to implement these use cases in one container registry: set up the naming rules, configure user and group permissions, enforce strict policy to add new repository, enable per-repository cleanup strategies.. But unless you invest huge amount of time in setting up the workflows, it is going to be complex and error-prone. It is also going to be fragile and hard to maintain. Simply imagine the update of a single service which blocks your entire infrastructure.&#xA;&#xA;The other way to do it is to setup several registries with separate scopes, concerns and SLA&#39;s. &#xA;&#xA;Dev users get the Sandbox registry: no rules, free access, easy to setup, easy to clean, easy to redeploy.&#xA;&#xA;CI users get the CI Pool registry: high traffic, readable by everyone, writable by CI only. 
It should be located close to the CI workers as it is going to be heavily used by CI builds and test runs.&#xA;&#xA;There should be also Prod registry: only CI users can publish to it via promotion pipeline. This is the registry which you need to &#34;export&#34; to remote locations. It is also the registry you probably want to backup.&#xA;&#xA;Depending on your workflows, you might also want to split the CI Pool registry into CI Tests and CI Snapshots. CI Tests would be used for images you build in pre-merge checks and in the test runs, while CI Snaphots are going to be the latest snapshot builds for each of the applications you develop, working or not.&#xA;&#xA;In the following example of the infrastructure layout we&#39;ve added also the remote Archive registry. Its primary purpose is to be a replica of certain important DockerHub images. It allows you to be flexible in keeping and overriding upstream images:&#xA;&#xA;![Map of the registry infra: boxes are registries, circles are consumers](https://files.quantum-integration.org/images/1501591844554-docker-registry-infrastructure.png&#xA;&#34;Map of the registry infra: boxes are registries, circles are consumers&#34;)&#xA;&#xA;Tools&#xA;&#xA;Finally, we come to that tool question again. But now we are equipped with the knowledge of how we want to use them.&#xA;&#xA;We have high network throughput requirement for CI registries, thus we probably don&#39;t want to use a managed service priced by network usage. For random images in the Sandbox we do not want to pay for the number of repositories we create (in container registries, image name represents a repository). For caching instances we&#39;d like to have an easy to maintain and easy to test open-source solution, which we can automate and scale. For archive we may want an independent remote registry with low network requirements but reliable storage options.&#xA;&#xA;We now can choose and build the hybrid solution which is tailored for our needs, rather then go for the largest and most feature-reach service available and then unavoidably get into building up the pile of workarounds and integration connectors around it.&#xA;&#xA;Conclusion&#xA;&#xA;It looks scary at first: you have just wanted a place to put the image and now you get multiple registries, scopes, network connections and interactions. But the truth is: we have not created the additional complexity, we&#39;ve exposed the layout, which has been there in your infrastructure from day one. Instead of hiding the complexity under the hood, we are making it explicit and visible. And by that we can design the infrastructure, which is effective and maintainable. Infrastructure which solves the problem at hand, rather than uses those shiny tools everyone is talking about.&#xA;&#xA;infrastructure&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>Once you get exposed to the Cloud, DevOps and Everything As A Service world, you realize that almost every piece of software you might need has already been written by someone (or provided as a service).</p>

<p>Why is it not all rainbows and unicorns then?</p>

<p>Because you do not need a tool, a piece of software or a service.
You need an <em>infrastructure</em>.</p>

<p><em>Disclaimer: I wrote this article in 2017, and this is a repost with minor edits.</em></p>

<h2 id="start-with-the-right-problem">Start with the right problem</h2>

<p>The most common issue in dealing with the infrastructure setup is that you forget about the task.</p>

<p>Of course the task was hanging somewhere there in the beginning, when you gathered the initial list of tools which might be worth looking at. But then, once you get your hands on the list, you ask yourself:</p>

<p>  <em>Which one is better – DockerHub, Quay, Artifactory or Nexus?</em></p>

<p>And you are doomed.</p>

<p>As soon as you ask that question, you start to look into feature lists and marketing materials for the mentioned products. And then you build your entire choice strategy on figuring out how to get more by paying less.</p>

<p>The truth is: you do not need <em>more</em>. Usually you need just the right amount of stuff to do the right task. And you can not replace one feature with five others you have got on a summer sale.</p>

<p>So let&#39;s do our homework first and get back to the task.</p>

<h3 id="the-task">The task</h3>

<p>Again, we start with the question:</p>

<p>  <em>We need a Container Registry, how do we get one?</em></p>

<p>And again, we are wrong.</p>

<p>No one needs a Container Registry for the sake of having a Container Registry alone. It is an implementation detail we have slipped into our problem statement. So let&#39;s remove it and find out what we are truly looking for:</p>

<p><strong>the infrastructure to build, store and distribute container images</strong></p>

<p>Now this is something to talk about.</p>

<h3 id="build">Build</h3>

<p>Container images are widely known as a delivery mechanism. And every delivery can work in two ways: it might be you who packages and delivers the software to the customer (or target environment), or it might be some third party which packages the software and delivers it to you.</p>

<p>In the latter case you might do just fine without building a single container image on your own. So you can skip this part and focus on storing and distributing. Even better, you may forget about container images. Treat them as any other external binary artifact: download the tarball, install, back up and watch for the updates.</p>

<p>If building container images is your thing, then welcome to the club and keep reading.</p>

<h3 id="store">Store</h3>

<p>Let&#39;s say you use container images for packaging the apps to deploy them to your production environment. You have Containerfiles stored in your source code, and CI/CD pipelines are configured to build them in your build service. And your release schedule is close to one release per hour.</p>

<p>With that kind of setup your container infrastructure is, in fact, stateless. If for some reason you lose all the container images currently available, you can rebuild them one by one directly from sources. Then you do not really need the storage function and can focus on building and distributing.</p>

<p>If, on the other hand, you need long-term storage for container images, for example those which you have released and delivered to your customers, things get slightly more complicated.</p>

<p>In real life you are most likely to need both: there are base images you update once per year, there are images which you can not rebuild at all (there <em>shouldn&#39;t be</em>, but there usually <em>are</em>) and there are images which you build daily by the hundreds.</p>

<p>And it totally makes sense to manage them differently. Maybe with different tools?</p>

<h3 id="distribute">Distribute</h3>

<p>In the end it all comes down to distributing images to the target environments. And a target environment could be a developer workstation, a CI system, production or a customer environment. And these environments are usually distributed all over the globe.</p>

<p>Now, the container registry should connect to all of them over the shortest distance possible. Can you find such a place for it? I guess not. And here is where caching, mirroring and proxying are brought into play.</p>

<p>With all that in mind, the natural question to ask is:</p>

<p><em>how many registries do you actually need?</em></p>

<p>It appears, more than one.</p>

<h2 id="build-store-and-distribute-shaken-not-stirred">Build, store and distribute, shaken, not stirred</h2>

<p>Let&#39;s set up the context and consider the following situation: we develop an application, package it into container images, and then deploy them to several remote production environments (for example <code>eu</code> and <code>us</code>).</p>

<h3 id="registries-and-their-users">Registries and their users</h3>

<p>Users working within the infrastructure can be divided into three groups: contributors (<strong>Dev</strong>), build and test systems (<strong>CI</strong>), and the customer or consumer (<strong>Prod</strong>).</p>

<p>Dev users create and share random containers all day long. They do not have rules and restrictions. Dev containers should never go anywhere but the workstations and temporary custom environments. These containers are generally disposable and do not need long-term storage.</p>

<p>CI users are automated and reliable, thus they can follow strict rules on naming, tagging and metadata. Similarly to Dev, CI users generate and exchange huge amounts of disposable images. But as CI provides the data to make decisions (be it decision to approve a pull-request or decision to release the product), CI images must be verifiable and authoritative.</p>

<p>Prod users do not create containers, they work in a read-only mode and must have a strict policy on what is provided to them. The prod traffic is much lower in terms of variety of images consumed, while it might be high in terms of the amount of data fetched.</p>

<p>We could have tried to implement these use cases in one container registry: set up naming rules, configure user and group permissions, enforce a strict policy for adding new repositories, enable per-repository cleanup strategies.. But unless you invest a huge amount of time in setting up the workflows, it is going to be complex and error-prone. It is also going to be fragile and hard to maintain. Simply imagine an update of this single service blocking your entire infrastructure.</p>

<p>The other way to do it is to set up several registries with separate scopes, concerns and SLAs.</p>

<p>Dev users get the <strong>Sandbox</strong> registry: no rules, free access, easy to set up, easy to clean, easy to redeploy.</p>
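
<p>A Sandbox like this can be as simple as a stock registry container; a sketch (no persistence, which is the point):</p>

<pre><code># Throw-away Sandbox registry
podman run -d --name sandbox -p 5000:5000 docker.io/library/registry:2

# cleaning the Sandbox means destroying and recreating it
podman rm -f sandbox
</code></pre>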

<p>CI users get the <strong>CI Pool</strong> registry: high traffic, readable by everyone, writable by CI only. It should be located close to the CI workers as it is going to be heavily used by CI builds and test runs.</p>

<p>There should also be a <strong>Prod</strong> registry: only CI users can publish to it, via a <em>promotion</em> pipeline. This is the registry which you need to “export” to remote locations. It is also the registry you probably want to back up.</p>
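
<p>Promotion here means copying an already-built image, not rebuilding it. One possible sketch with <code>skopeo</code> (the registry names are examples):</p>

<pre><code># Promote a tested image from the CI Pool to Prod
skopeo copy \
    docker://ci-pool.example.com/app-a:1.2.3 \
    docker://prod.example.com/app-a:1.2.3
</code></pre>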

<p>Depending on your workflows, you might also want to split the CI Pool registry into <strong>CI Tests</strong> and <strong>CI Snapshots</strong>. CI Tests would be used for images you build in pre-merge checks and in test runs, while CI Snapshots would hold the latest snapshot builds for each of the applications you develop, working or not.</p>

<p>In the following example of the infrastructure layout we&#39;ve also added a remote <strong>Archive</strong> registry. Its primary purpose is to be a replica of certain important DockerHub images. It allows you to be flexible in keeping and overriding upstream images:</p>

<p><img src="https://files.quantum-integration.org/images/1501591844554-docker-registry-infrastructure.png" alt="Map of the registry infra: boxes are registries, circles are consumers" title="Map of the registry infra: boxes are registries, circles are consumers"/></p>
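
<p>One possible way to populate such an Archive registry is <code>skopeo sync</code>, which mirrors repositories between registries (the names below are examples):</p>

<pre><code># Mirror a DockerHub repository into the Archive registry
skopeo sync --src docker --dest docker \
    docker.io/library/postgres:15 archive.example.com/dockerhub
</code></pre>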

<h3 id="tools" id="tools">Tools</h3>

<p>Finally, we come to that tool question again. But now we are equipped with the knowledge of how we want to use them.</p>

<p>We have a high network throughput requirement for CI registries, thus we probably don&#39;t want a managed service priced by network usage. For random images in the Sandbox we do not want to pay per repository we create (in container registries, an image name represents a repository). For caching instances we&#39;d like an easy-to-maintain, easy-to-test open-source solution, which we can automate and scale. For the archive we may want an independent remote registry with low network requirements but reliable storage options.</p>

<p>We can now choose and build a hybrid solution tailored to our needs, rather than go for the largest and most feature-rich service available and then unavoidably end up building a pile of workarounds and integration connectors around it.</p>

<h2 id="conclusion" id="conclusion">Conclusion</h2>

<p>It looks scary at first: you just wanted a place to put the image, and now you get multiple registries, scopes, network connections and interactions. But the truth is: we have not <em>created</em> the additional complexity, we&#39;ve <em>exposed</em> the layout which has been there in your infrastructure from day one. Instead of hiding the complexity under the hood, we are making it explicit and visible. And with that we can design an infrastructure which is effective and maintainable. An infrastructure which solves the problem at hand, rather than uses those shiny tools everyone is talking about.</p>

<p><a href="https://qi.bookwar.info/tag:infrastructure" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">infrastructure</span></a></p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/container-registry-infrastructure</guid>
      <pubDate>Fri, 15 Dec 2023 18:48:56 +0000</pubDate>
    </item>
    <item>
      <title>dist-git and exploded SRPMS - demystified</title>
      <link>https://qi.bookwar.info/dist-git-and-exploded-srpms-demystified?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[In this article we address another topic which appeared in multiple discussions recently. We take a look at the difference between the SRPM and the so called dist-git repository of a package. And why do we indeed prefer the dist-git.&#xA;&#xA;!--more--&#xA;&#xA;How RPM packages work?&#xA;&#xA;In simple words RPM packages need three things:&#xA;&#xA;archive of the original sources of the upstream application;&#xA;set of patches which needs to be applied to the original sources;&#xA;recipe (RPM spec file), which describes how to apply the patches, how to build the code and how to install it on a target system.&#xA;&#xA;When developing an RPM package you treat the upstream sources as a read-only object. You can not change the upstream sources, they should match the exact content upstream has released.&#xA;&#xA;To diverge from upstream, for example to backport a fix or to integrate the software better in the system, you create and maintain patches as separate files next to your upstream sources.&#xA;&#xA;Then to build a package the build system needs to fetch the archive with original sources, unpack it, apply patches as described in the spec, run the build scripts again as described in the spec, arrange the resulting files in a specific way and pack them into archive together with the installation recipe.&#xA;&#xA;This archive is the final &#34;binary RPM&#34; which you can install on your system using rpm  or dnf commands.&#xA;&#xA;As we build software for multiple architectures, we can produce several binary RPMs from the same source data by building them on different workers with different architectures (one for x8664, one for aarch64 and so on).&#xA;&#xA;What is dist-git&#xA;&#xA;dist-git is a git repository with a specific layout, which Fedora, CentOS Stream and RHEL use to develop RPM packages.&#xA;&#xA;The very minimal dist-git repo would look like this:&#xA;.&#xA;├── my-app.spec                      // spec file&#xA;├── sources                          // reference to the sources&#xA;└── patch-for-some-feature.patch     // patch to apply to the sources&#xA;&#xA;The important feature of the dist-git is that it doesn&#39;t store the unpacked sources of the application. It only stores a reference to the tarball of original upstream sources in a so-called lookaside cache.&#xA;&#xA;This reference is stored in the file which is called ./sources in the root of the git repository. See for example sources of a glibc package in Fedora Rawhide&#xA;&#xA;The lookaside cache of Fedora and CentOS (Stream or not Stream) is public and you can download any of its content.&#xA;&#xA;Now, since dist-git is the main repository where package development is happening, package maintainers often use it to store all sorts of additional things (scripts, readme files, infra configurations..) which can help them to do the work.&#xA;&#xA;There is also a recommended way to write tests in dist-git (see TMT). 
These integration tests are not part of the RPM package, but they are used in CI workflows and we recommend to put them in the dist-git repository, so that people can contribute to the package and the test development via the same interface.&#xA;&#xA;Example - keepalived dist-git&#xA;&#xA;Let&#39;s take a random package build, for example keepalived-2.2.4-6.el9.&#xA;&#xA;dist-git for the package has the following structure:&#xA;.&#xA;├── bz2028351-fix-dbus-policy-restrictions.patch  // patches&#xA;├── bz2102493-fix-variable-substitution.patch&#xA;├── bz2134749-fix-memory-leak-https-checks.patch&#xA;├── gating.yaml         //  CI configuration&#xA;├── .gitignore          //  standard gitignore&#xA;├── keepalived.init     // additional sources&#xA;├── keepalived.service  // additional sources&#xA;├── keepalived.spec     // spec file&#xA;├── rpminspect.yaml     //  rpminspect checks configuration  &#xA;├── sources             // reference to the lookaside cache &#xA;└── tests               //  dist-git test scenarios, run on every merge request&#xA;    ├── keepalived.conf.in&#xA;    ├── runtests.sh&#xA;    └── tests.yml&#xA;&#xA;Here I marked with asterisk the files which are not relevant to the RPM package build.&#xA;&#xA;What is SRPM&#xA;&#xA;As explained above, RPM package build requires multiple inputs. While the inputs are stored in dist-git and lookaside cache, you need to fetch them and carry around the build system to the build workers.&#xA;&#xA;Instead of fetching data from the internet during the build process (no build systems should ever do this!), we fetch all of the sources at the beginning, pack them in a tarball (SRPM file) and then use that self-contained tarball to run the builds in the isolated build environment. &#xA;&#xA;The SRPM then serves as a record of what build system got as input to produce the binary files.&#xA;&#xA;Example - keepalived SRPM&#xA;&#xA;SRPM for the package contains the following data:&#xA;bz2028351-fix-dbus-policy-restrictions.patch&#x9;1.58 KB&#xA;bz2102493-fix-variable-substitution.patch&#x9;929.00 B&#xA;bz2134749-fix-memory-leak-https-checks.patch&#x9;1.87 KB&#xA;keepalived-2.2.4.tar.gz&#x9;1.10 MB&#xA;keepalived.service&#x9;392.00 B&#xA;keepalived.spec&#x9;20.47 KB&#xA;You can see how the SRPM was produced by the build system together with binary RPMs via the Koji build task https://kojihub.stream.centos.org/koji/buildinfo?buildID=27965 &#xA;&#xA;The build task used dist-git commit as the input:&#xA;Source:  git+https://gitlab.com/redhat/centos-stream/rpms/keepalived#fc07f81c047dca49df2fc9d20513a7f52005a54d&#xA;&#xA;Note how the SRPM contains full tarball of the original upstream sources (1.10 MB of it). This tarball was fetched from the dist-git lookaside cache during the SRPM build step.&#xA;&#xA;What is exploded SRPM&#xA;&#xA;Fedora and RHEL use dist-git repositories for a very long time. Fedora dist-git has always been public, while RHEL dist-git repositories were internal and not available for people outside of Red Hat.&#xA;&#xA;So the only way for CentOS Project to rebuild RHEL code was to take the SRPM files and use them as the source of the rebuild.&#xA;&#xA;Since CentOS Project needed to rebrand or adjust certain packages, they didn&#39;t take RHEL SRPMs as is, rather they unpacked them and put the unpacked sources in git repository. 
This way they got access to at least some history of the changes, were able to apply their own patches and generally increased the visibility of the content.&#xA;&#xA;Example - keepalived exploded SRPM&#xA;&#xA;&#34;Exploded SRPM&#34; at git.centos.org for this package looks like:&#xA;.&#xA;├── .gitignore&#xA;├── .keepalived.metadata  // same as ./sources in dist-git&#xA;├── SOURCES&#xA;│   ├── bz2028351-fix-dbus-policy-restrictions.patch  // patches&#xA;│   ├── bz2102493-fix-variable-substitution.patch&#xA;│   ├── bz2134749-fix-memory-leak-https-checks.patch&#xA;│   └── keepalived.service  // additional sources&#xA;└── SPECS&#xA;    └── keepalived.spec  // spec file&#xA;&#xA;Exploded SRPM git again doesn&#39;t store the upstream tarball in the repository and references the lookaside cache via .keepalived.metadata file.&#xA;&#xA;You can see the same files as included in the SRPM, though they are put into a different directory structure. And none of the additional files (tests, scripts, configs) are available.&#xA;&#xA;Take away&#xA;&#xA;dist-git repository is the original source of an RPM package build. Fedora, CentOS Stream and RHEL packages are all built directly from dist-git repositories.&#xA;&#xA;SRPM is an artifact of the build process. It is produced from the commit in dist-git and then stored alongside the binary RPM.&#xA;&#xA;Exploded SRPM is an attempt to recover the original git structure from the SRPM in case there is no access to the dist-git repository. It does contain the same source files and spec as in dist-git, but it is not able to recover additional non-packaged data, like configuration files, tests and so on.&#xA;&#xA;We recommend to use dist-git for any collaboration and development purposes.&#xA;&#xA;----&#xA;&#xA;P.S. You can also take a look at the Source Git initiative which aims to change the approach to RPM sources to make upstream source code more accessible.&#xA;&#xA;ELdevelopment&#xA;]]&gt;</description>
<content:encoded><![CDATA[<p>In this article we address another topic which appeared in multiple discussions recently. We take a look at the difference between the SRPM and the so-called <em>dist-git repository</em> of a package. And explain why we indeed prefer dist-git.</p>



<h2 id="how-rpm-packages-work" id="how-rpm-packages-work">How RPM packages work?</h2>

<p>In simple words, an RPM package needs three things:</p>
<ul><li>an archive of the original sources of the upstream application;</li>
<li>a set of patches which need to be applied to the original sources;</li>
<li>a recipe (the RPM spec file), which describes how to apply the patches, how to build the code and how to install it on a target system.</li></ul>

<p>When developing an RPM package you treat the upstream sources as a read-only object. You can not change the upstream sources; they should match the exact content upstream has released.</p>

<p>To diverge from upstream, for example to backport a fix or to integrate the software better in the system, you create and maintain patches as separate files next to your upstream sources.</p>

<p>Then, to build a package, the build system needs to fetch the archive with the original sources, unpack it, apply the patches as described in the spec, run the build scripts (again, as described in the spec), arrange the resulting files in a specific way and pack them into an archive together with the installation recipe.</p>

<p>This archive is the final “binary RPM” which you can install on your system using the <code>rpm</code> or <code>dnf</code> commands.</p>
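
<p>For example (the package name is made up):</p>

<pre><code>dnf install ./my-app-1.0.0-1.x86_64.rpm
# or, with the lower-level tool:
rpm -ivh ./my-app-1.0.0-1.x86_64.rpm
</code></pre>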

<p>As we build software for multiple architectures, we can produce several binary RPMs from the same source data by building them on different workers with different architectures (one for <code>x86_64</code>, one for <code>aarch64</code> and so on).</p>

<h2 id="what-is-dist-git" id="what-is-dist-git">What is dist-git</h2>

<p><em>dist-git</em> is a git repository with a specific layout, which Fedora, CentOS Stream and RHEL use to develop RPM packages.</p>

<p>The very minimal dist-git repo would look like this:</p>

<pre><code>.
├── my-app.spec                      // spec file
├── sources                          // reference to the sources
└── patch-for-some-feature.patch     // patch to apply to the sources
</code></pre>

<p>The important feature of dist-git is that it doesn&#39;t store the unpacked sources of the application. It only stores a reference to the tarball of the original upstream sources in a so-called lookaside cache.</p>

<p>This reference is stored in a file called <code>./sources</code> in the root of the git repository. See, for example, the <a href="https://src.fedoraproject.org/rpms/glibc/blob/rawhide/f/sources" rel="nofollow"><code>sources</code> file of the glibc package in Fedora Rawhide</a>.</p>
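
<p>For illustration, a <code>sources</code> file holds one checksum entry per tarball (the name and hash below are made up), and Fedora&#39;s <code>fedpkg</code> tool can fetch the files it references:</p>

<pre><code>$ cat sources
SHA512 (my-app-1.0.0.tar.gz) = 0b3f19...e9a1

$ fedpkg sources    # downloads everything listed in ./sources
</code></pre>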

<p>The lookaside cache of Fedora and CentOS (Stream or not) is public and you can download any of its contents.</p>

<p>Now, since dist-git is the main repository where package development happens, package maintainers often use it to store all sorts of additional things (scripts, readme files, infra configurations..) which help them do the work.</p>

<p>There is also a recommended way to write tests in dist-git (see <a href="https://docs.fedoraproject.org/en-US/ci/tmt/" rel="nofollow">TMT</a>). These integration tests are not part of the RPM package, but they are used in CI workflows and we recommend to put them in the dist-git repository, so that people can contribute to the package and the test development via the same interface.</p>

<h3 id="example-keepalived-dist-git" id="example-keepalived-dist-git">Example – keepalived dist-git</h3>

<p>Let&#39;s take a random package build, for example <code>keepalived-2.2.4-6.el9</code>.</p>

<p><a href="https://gitlab.com/redhat/centos-stream/rpms/keepalived/-/tree/fc07f81c047dca49df2fc9d20513a7f52005a54d" rel="nofollow">dist-git for the package</a> has the following structure:</p>

<pre><code>.
├── bz2028351-fix-dbus-policy-restrictions.patch  // patches
├── bz2102493-fix-variable-substitution.patch
├── bz2134749-fix-memory-leak-https-checks.patch
├── gating.yaml         // * CI configuration
├── .gitignore          // * standard gitignore
├── keepalived.init     // additional sources
├── keepalived.service  // additional sources
├── keepalived.spec     // spec file
├── rpminspect.yaml     // * rpminspect checks configuration  
├── sources             // reference to the lookaside cache 
└── tests               // * dist-git test scenarios, run on every merge request
    ├── keepalived.conf.in
    ├── run_tests.sh
    └── tests.yml
</code></pre>

<p>Here I marked with an asterisk the files which are not relevant to the RPM package build.</p>

<h2 id="what-is-srpm" id="what-is-srpm">What is SRPM</h2>

<p>As explained above, an RPM package build requires multiple inputs. While the inputs are stored in dist-git and the lookaside cache, the build system needs to fetch them and deliver them to the build workers.</p>

<p>Instead of fetching data from the internet during the build process (no build system should ever do this!), we fetch all of the sources at the beginning, pack them in a tarball (the SRPM file) and then use that self-contained tarball to run the builds in the isolated build environment.</p>
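
<p>A minimal sketch of this step using Fedora tooling (the package name is made up):</p>

<pre><code>fedpkg clone my-app &amp;&amp; cd my-app
fedpkg sources    # fetch tarballs from the lookaside cache
fedpkg srpm       # pack spec, patches and tarballs into an SRPM

# the self-contained SRPM can be rebuilt in an isolated environment:
mock -r fedora-rawhide-x86_64 --rebuild ./my-app-*.src.rpm
</code></pre>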

<p>The SRPM then serves as a record of what the build system got as input to produce the binary files.</p>

<h3 id="example-keepalived-srpm" id="example-keepalived-srpm">Example – keepalived SRPM</h3>

<p><a href="https://kojihub.stream.centos.org/koji/rpminfo?rpmID=722889" rel="nofollow">SRPM for the package</a> contains the following data:</p>

<pre><code>bz2028351-fix-dbus-policy-restrictions.patch	1.58 KB
bz2102493-fix-variable-substitution.patch	929.00 B
bz2134749-fix-memory-leak-https-checks.patch	1.87 KB
keepalived-2.2.4.tar.gz	1.10 MB
keepalived.service	392.00 B
keepalived.spec	20.47 KB
</code></pre>

<p>You can see how the SRPM was produced by the build system, together with the binary RPMs, via the Koji build task <a href="https://kojihub.stream.centos.org/koji/buildinfo?buildID=27965" rel="nofollow">https://kojihub.stream.centos.org/koji/buildinfo?buildID=27965</a>.</p>

<p>The build task used a dist-git commit as its input:</p>

<pre><code>Source:  git+https://gitlab.com/redhat/centos-stream/rpms/keepalived#fc07f81c047dca49df2fc9d20513a7f52005a54d
</code></pre>

<p>Note how the SRPM contains the full tarball of the original upstream sources (1.10 MB of it). This tarball was fetched from the dist-git lookaside cache during the SRPM build step.</p>

<h2 id="what-is-exploded-srpm" id="what-is-exploded-srpm">What is exploded SRPM</h2>

<p>Fedora and RHEL have used dist-git repositories for a very long time. Fedora dist-git has always been public, while RHEL dist-git repositories were internal and not available to people outside of Red Hat.</p>

<p>So the only way for CentOS Project to rebuild RHEL code was to take the SRPM files and use them as the source of the rebuild.</p>

<p>Since the CentOS Project needed to rebrand or adjust certain packages, they didn&#39;t take RHEL SRPMs as-is; rather, they unpacked them and put the unpacked sources in a git repository. This way they got access to at least some history of the changes, were able to apply their own patches and generally increased the visibility of the content.</p>

<h3 id="example-keepalived-exploded-srpm" id="example-keepalived-exploded-srpm">Example – keepalived exploded SRPM</h3>

<p><a href="https://git.centos.org/rpms/keepalived/tree/c9" rel="nofollow">“Exploded SRPM” at git.centos.org</a> for this package looks like:</p>

<pre><code>.
├── .gitignore
├── .keepalived.metadata  // same as ./sources in dist-git
├── SOURCES
│   ├── bz2028351-fix-dbus-policy-restrictions.patch  // patches
│   ├── bz2102493-fix-variable-substitution.patch
│   ├── bz2134749-fix-memory-leak-https-checks.patch
│   └── keepalived.service  // additional sources
└── SPECS
    └── keepalived.spec  // spec file
</code></pre>

<p>The exploded SRPM repository again doesn&#39;t store the upstream tarball; it references the lookaside cache via the <code>.keepalived.metadata</code> file.</p>

<p>You can see the same files as included in the SRPM, though they are put into a different directory structure. And none of the additional files (tests, scripts, configs) are available.</p>

<h2 id="take-away" id="take-away">Take away</h2>

<p>The dist-git repository is the <strong>original source</strong> of an RPM package build. Fedora, CentOS Stream and RHEL packages are all built directly from dist-git repositories.</p>

<p>SRPM is an artifact of the build process. It is produced from the commit in dist-git and then stored alongside the binary RPM.</p>

<p>Exploded SRPM is an attempt to recover the original git structure from the SRPM in case there is no access to the dist-git repository. It does contain the same source files and spec as in dist-git, but it is not able to recover additional non-packaged data, like configuration files, tests and so on.</p>

<p>We recommend using dist-git for any collaboration and development purposes.</p>

<hr/>

<p>P.S. You can also take a look at the <a href="https://packit.dev/source-git" rel="nofollow">Source Git</a> initiative which aims to change the approach to RPM sources to make upstream source code more accessible.</p>

<p><a href="https://qi.bookwar.info/tag:ELdevelopment" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">ELdevelopment</span></a></p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/dist-git-and-exploded-srpms-demystified</guid>
      <pubDate>Sun, 23 Jul 2023 20:08:58 +0000</pubDate>
    </item>
    <item>
      <title>Continuity of Linux distributions</title>
      <link>https://qi.bookwar.info/continuously-built-linux-distributions?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[I know people who imagine distribution development as the process of piling up the code in the git repository for 6 months and then building it all in one go at the end of those 6 months, so that it can finally be shipped. This is very far from reality. And it is impossible to explain things like CentOS Stream without addressing this confusion.&#xA;&#xA;!--more--&#xA;&#xA;Linux distributions are not just developed continuously, they are built continuously.&#xA;&#xA;When you have your generic application, you have your sources. You contribute changes to sources, integrating them into the main branch. And then you decide to build the tip of the branch and you get an artifact - a binary. When later you make changes to the sources, you throw away the previous binary, and build yourself a new one.&#xA;&#xA;Linux distributions are different.&#xA;&#xA;When you change a distribution, you apply a change to a one part of the sources, then you build those sources into a package and add the package to a shared pool of latest packages. And then this shared pool of packages (we call it buildroot) is used to build a next change in the distribution.&#xA;&#xA;Linux distribution is &#34;self-hosted&#34;), it grows by updating its buildroot and using its new state to build its next updates. And updates are applied individually per package.&#xA;&#xA;Why am I focusing on this? Because it has practical consequences.&#xA;&#xA;Packaged Linux distribution is not a single binary, it is a “compound” artifact, where different parts of it (packages) are built at different times using different states of the buildroot.&#xA;&#xA;Imagine we have two packages A and B in the distribution. Package A is more static, it doesn’t get that many updates, so in four weeks it got updated once. While package B is more actively developed and gets updates every week.&#xA;&#xA;This will look somewhat like this:&#xA;&#xA;Time      | Week 1 | Week 2 | Week 3 | Week 4 |&#xA;Package A | v1.0.0 | v1.0.1                   |  ---  A-v1.0.1, B-v18&#xA;Package B | v15    | v16    | v17    | v18    |&#xA;&#xA;So if you look at the result of the 4-weeks development, you see that package A has been updated from v1.0.0 to v1.0.1 and package B has been updated from v15 to v18. Yet package A was not built on week 4. It was built on week 2 using the state of the buildroot available at that week 2. If that package A has a build dependency on Package B, then it was using that v16 version of the dependency. And it was not updated after the change to v17 in B.&#xA;&#xA;There are several points to take from here.&#xA;&#xA;First point: packaged distributions do not appear out of nowhere.&#xA;&#xA;Linux distributions are either continuously developed from some origin, like Fedora Rawhide is developed for 20 years from its original Fedora 1 state (I&#39;ll ask people more knowledgeable than me to explain how that original state was created). Or they are branched (aka forked) from another distribution. Or they are bootstrapped (forked, but with much more work) using another distribution.&#xA;&#xA;Second point: packaged distribution is not fully defined by the static snapshot of its sources.&#xA;&#xA;If on week 4 you use the latest git sources of the distribution, check them out and build packages from them, you will get package A of version v1.0.1 built using a build dependency on the package B of v18. 
Which may or may not lead to a different result.&#xA;&#xA;And if this doesn’t scare you, please read again.&#xA;&#xA;You can not reproduce the exact state of the Linux distribution from a snapshot of its sources. And not because something is hidden from the sources (dist-git repository state has more information about the package than SRPM of that package does, and we&#39;ll talk about it another day). The issue is that distribution is not just its sources. Distribution is its sources and its buildroot with all its complex history.&#xA;&#xA;In my opinion it is a big fail for the entire industry to think about a Linux distribution as a fixed set of RPMs and SRPMs on a DVD(or the iso image). I understand where it comes from, but it is a fail anyway.&#xA;&#xA;When I think about a distribution, I think about the buildroot of a distribution as a sort of git repository: it has history, it has merge requests_. When we update a package, we add a new binary package (RPM) to the buildroot. In other words, we make a new commit to the buildroot state.&#xA;&#xA;And it is not just an abstraction, I think we can literally implement the changes to the distribution buildroot as merge requests to a git repository with the list of packages (And if you are interested and want to try and help make it happen - let&#39;s talk about it).&#xA;&#xA;Third point: branching of a Linux distribution is not just branching of the sources, it is branching of its binaries.&#xA;&#xA;Again this is something that application developers won&#39;t expect. When we create a branch of the distribution we don&#39;t just create a branch in every git source of every package included in the distribution. We also create a &#34;branch&#34; of the buildroot. All binary packages built till a certain day in the mainline of a distribution are copied into a branch. They form the buildroot of the branch, which is then can be updated via a standard update procedure.&#xA;&#xA;We do not rebuild a package after branching, unless there is a new change which we want to land in this specific package.&#xA;&#xA;---&#xA;&#xA;This is a heavy-weight article, and thank you for getting this far.&#xA;&#xA;But let me reiterate the main message:&#xA;&#xA;Packaged Linux distribution is not built from scratch from the snapshot of its sources.&#xA;&#xA;We accumulate changes as they happen in different packages, and we inherit, merge and branch the pool of binary packages the same way we inherit, merge and branch their sources.&#xA;&#xA;We will look into how this applies to RHEL and CentOS conversation in next articles.&#xA;&#xA;---&#xA;&#xA;As folks pointed out, the process described in this article applies to packaged Linux distributions like Fedora, Debian or RHEL. There are other ways to build a distribution and you can check the article by Colin Walters, where he discusses the alternatives.&#xA;&#xA;ELdevelopment]]&gt;</description>
      <content:encoded><![CDATA[<p>I know people who imagine distribution development as the process of piling up the code in the git repository for 6 months and then building it all in one go at the end of those 6 months, so that it can finally be shipped. This is very far from reality. And it is impossible to explain things like CentOS Stream without addressing this confusion.</p>



<p>Linux distributions are not just developed continuously, they are <strong>built continuously</strong>.</p>

<p>When you have your generic application, you have your sources. You contribute changes to sources, integrating them into the main branch. And then you decide to <em>build</em> the tip of the branch and you get an artifact – a binary. When later you make changes to the sources, you throw away the previous binary, and build yourself a new one.</p>

<p>Linux distributions are different.</p>

<p>When you change a distribution, you apply a change to one part of the sources, then you build those sources into a package and add the package to a shared pool of latest packages. And then this shared pool of packages (we call it the <em>buildroot</em>) is used to build the next change in the distribution.</p>

<p>A Linux distribution is <a href="https://en.wikipedia.org/wiki/Self-hosting_(compilers)" rel="nofollow">“self-hosted”</a>: it grows by updating its buildroot and using its new state to build its next updates. And updates are applied individually, per package.</p>

<p>Why am I focusing on this? Because it has practical consequences.</p>

<p>A packaged Linux distribution is not a single binary; it is a “compound” artifact, where different parts of it (packages) are built at different times using different states of the buildroot.</p>

<p>Imagine we have two packages A and B in the distribution. Package A is more static, it doesn’t get that many updates, so in four weeks it got updated once. While package B is more actively developed and gets updates every week.</p>

<p>This will look somewhat like this:</p>

<pre><code>Time      | Week 1 | Week 2 | Week 3 | Week 4 |
Package A | v1.0.0 | v1.0.1                   |  ---&gt;    A-v1.0.1, B-v18
Package B | v15    | v16    | v17    | v18    |
</code></pre>

<p>So if you look at the result of the four weeks of development, you see that package A has been updated from v1.0.0 to v1.0.1 and package B has been updated from v15 to v18. Yet package A was not built on week 4. It was built on week 2, using the state of the buildroot available on week 2. If package A has a build dependency on package B, then it was built against the v16 version of that dependency. And it was not rebuilt after the change to v17 in B.</p>

<p>There are several points to take from here.</p>

<p><strong>First point:</strong> packaged distributions do not appear out of nowhere.</p>

<p>Linux distributions are either continuously developed from some origin, like Fedora Rawhide, which has been developed for 20 years from its original Fedora 1 state (I&#39;ll ask people more knowledgeable than me to explain how that original state was created). Or they are branched (aka forked) from another distribution. Or they are bootstrapped (forked, but with much more work) using another distribution.</p>

<p><strong>Second point:</strong> packaged distribution is not fully defined by the static snapshot of its sources.</p>

<p>If on week 4 you use the latest git sources of the distribution, check them out and build packages from them, you will get package A of version v1.0.1 built using a build dependency on the package B of v18. Which may or may not lead to a different result.</p>

<p>And if this doesn’t scare you, please read again.</p>

<p>You can not reproduce the exact state of a Linux distribution from a snapshot of its sources. And not because something is hidden from the sources (the dist-git repository state has more information about a package than the SRPM of that package does, and we&#39;ll talk about it <a href="https://quantum-integration.org/dist-git-and-exploded-srpms-demystified" rel="nofollow">another day</a>). The issue is that a distribution is not just its sources. A distribution is its sources plus its buildroot, with all of its complex history.</p>

<p>In my opinion it is a big fail for the entire industry to think about a Linux distribution as a fixed set of RPMs and SRPMs on a DVD (or an iso image). I understand where it comes from, but it is a fail anyway.</p>

<p>When I think about a distribution, I think about the buildroot of a distribution as a sort of git repository: it has history, it has <em>merge requests</em>. When we update a package, we add a new binary package (RPM) to the buildroot. In other words, we make a new commit to the buildroot state.</p>

<p>And it is not just an abstraction, I think we can literally implement the changes to the distribution buildroot as merge requests to a git repository with the list of packages (And if you are interested and want to try and help make it happen – let&#39;s talk about it).</p>
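
<p>As an illustration only (the file format is hypothetical, not an existing tool), a buildroot update could then be a one-line merge request against a git-tracked package list:</p>

<pre><code># packages.list — one line per binary package build in the buildroot
 package-a  1.0.1-1
-package-b  17-1
+package-b  18-1
</code></pre>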

<p><strong>Third point:</strong> branching of a Linux distribution is not just branching of the sources, it is branching of its binaries.</p>

<p>Again, this is something that application developers won&#39;t expect. When we create a branch of the distribution we don&#39;t just create a branch in every git source of every package included in the distribution. We also create a “branch” of the buildroot. All binary packages built up to a certain day in the mainline of the distribution are copied into the branch. They form the buildroot of the branch, which can then be updated via the standard update procedure.</p>

<p>We do not rebuild a package after branching, unless there is a new change which we want to land in this specific package.</p>

<hr/>

<p>This is a heavy-weight article, and thank you for getting this far.</p>

<p>But let me reiterate the main message:</p>

<p>A packaged Linux distribution is not built from scratch from a snapshot of its sources.</p>

<p>We accumulate changes as they happen in different packages, and we inherit, merge and branch the pool of binary packages the same way we inherit, merge and branch their sources.</p>

<p>We will look into how this applies to the RHEL and CentOS conversation in the next articles.</p>

<hr/>

<p>As folks pointed out, the process described in this article applies to packaged Linux distributions like Fedora, Debian or RHEL. There are other ways to build a distribution and you can check <a href="https://blog.verbum.org/2012/10/13/building-everything-from-source-vs-self-hosting/" rel="nofollow">the article</a> by Colin Walters, where he discusses the alternatives.</p>

<p><a href="https://qi.bookwar.info/tag:ELdevelopment" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">ELdevelopment</span></a></p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/continuously-built-linux-distributions</guid>
      <pubDate>Sat, 22 Jul 2023 16:10:49 +0000</pubDate>
    </item>
    <item>
      <title>What is CI</title>
      <link>https://qi.bookwar.info/what-is-ci?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[The Ultimate Source of Truth, Wikipedia, defines continuous integration as the practice of merging all developer working copies to a shared mainline several times a day.&#xA;&#xA;My version goes a bit deeper.&#xA;&#xA;!--more--&#xA;&#xA;Definition: Continuous Integration (CI) is a practice of reaching the goal by doing small changes one at a time, while keeping the main artifact in a releasable state at all times.&#xA;&#xA;Why do I need a different definition? Let&#39;s take a closer look into it.&#xA;&#xA;First of all, by this definition CI is not limited to the area of the programming or software development. Indeed, one can and should consider CI practice applied to any kind of production workflow:&#xA;&#xA;Production starts with an artifact. It can be a software, but it may as well be a book, or a picture, or a building. Artifact has a certain current state, for example, the concept of a book written on a napkin. And then there is a goal - the target state of an artifact we plan to reach (i.e. 450 pages in a hard cover).&#xA;&#xA;Production workflow is the process of modifying the artifact, carrying it through the sequence of intermediate states to the predefined goal.&#xA;&#xA;To add continuous integration to the picture we need a certain notion of quality: the way to differentiate the releasable state of an artifact from the unreleasable one. For a book we might say, for example, that book is in a releasable state if all its chapters are complete. The continuous integration of a book then would be publishing a new chapter as soon as it is ready.&#xA;&#xA;Remark: As you may see the nature of continuous integration is actually quantum: continuous integration is performed by applying small atomic changes - &#34;quants&#34;. And all continuous integration workflows work with discrete chunks of data, rather than continuous streams._&#xA;&#xA;Another important note is that the above definition doesn&#39;t imply a specific implementation of the production workflow.&#xA;&#xA;In day to day conversations &#34;doing CI&#34; often refers to setting up a build service which runs certain tasks, build scripts and tests, triggered by certain events. But the CI concept itself does not require deployment automation, test coverage or Slack notifications of failed builds.&#xA;&#xA;These technicalities are indeed useful and come naturally from applying the CI approach in software development, but these are helper methods, which should always be measured by and aligned to the generic idea.&#xA;&#xA;In other words: while running automated tests is a good practice for implementing continuous integration development, continuous integration approach can not be reduced to just running automated tests.&#xA;&#xA;The third thing I want to point out is that in the definition there is nothing said about some third-party, another actor, which you need to integrate with. There is a simple explanation for that: I believe that even when there is one and only acting entity in the workflow (one developer working on the codebase, one author writing the book..) 
there is still place for integration.&#xA;&#xA;Generally speaking, whenever you are involved in any kind of a long-term process, you should treat &#34;yourself today&#34;, &#34;yourself yesterday&#34; and &#34;yourself tomorrow&#34; as independently acting third-parties, which might work in parallel on different parts of the project and should exchange their work via documentation, code-review and comments as usual collaborators do.]]&gt;</description>
      <content:encoded><![CDATA[<p>The Ultimate Source of Truth, Wikipedia, <a href="https://en.wikipedia.org/wiki/Continuous_integration" rel="nofollow">defines</a> continuous integration as the <em>practice of merging all developer working copies to a shared mainline several times a day</em>.</p>

<p>My version goes a bit deeper.</p>



<p><strong>Definition:</strong> <em>Continuous Integration (CI) is a practice of reaching the goal by doing small changes one at a time, while keeping the main artifact in a releasable state at all times.</em></p>

<p>Why do I need a different definition? Let&#39;s take a closer look into it.</p>

<p>First of all, by this definition CI is not limited to the area of programming or software development. Indeed, one can and should consider the CI practice applied to any kind of production workflow:</p>

<p>Production starts with an <em>artifact</em>. It can be software, but it may as well be a book, or a picture, or a building. The artifact has a certain current state, for example, the concept of a book written on a napkin. And then there is a <em>goal</em> – the target state of the artifact we plan to reach (i.e. 450 pages in a hard cover).</p>

<p><em>Production workflow</em> is the process of modifying the artifact, carrying it through the sequence of intermediate states to the predefined goal.</p>

<p>To add continuous integration to the picture we need a certain notion of quality: a way to differentiate the <em>releasable</em> state of an artifact from the unreleasable one. For a book we might say, for example, that a book is in a releasable state if all its chapters are complete. The continuous integration of a book then would be publishing a new chapter as soon as it is ready.</p>

<p><strong>Remark:</strong> <em>As you may see, the nature of continuous integration is actually quantum: continuous integration is performed by applying small atomic changes – “quanta”. And all continuous integration workflows work with discrete chunks of data, rather than continuous streams.</em></p>

<p>Another important note is that the above definition doesn&#39;t imply a specific implementation of the production workflow.</p>

<p>In day-to-day conversations “doing CI” often refers to setting up a build service which runs certain tasks, build scripts and tests, triggered by certain events. But the CI concept itself does not require deployment automation, test coverage or Slack notifications of failed builds.</p>

<p>These technicalities are indeed useful and come naturally from applying the CI approach in software development, but they are helper methods, which should always be measured against and aligned with the generic idea.</p>

<p>In other words: while running automated tests is a good practice for implementing continuous integration development, the continuous integration approach can not be reduced to just running automated tests.</p>

<p>The third thing I want to point out is that the definition says nothing about some third party, another actor, which you need to integrate with. There is a simple explanation for that: I believe that even when there is one and only one acting entity in the workflow (one developer working on the codebase, one author writing the book..) there is still room for integration.</p>

<p>Generally speaking, whenever you are involved in any kind of a long-term process, you should treat “yourself today”, “yourself yesterday” and “yourself tomorrow” as independently acting third-parties, which might work in parallel on different parts of the project and should exchange their work via documentation, code-review and comments as usual collaborators do.</p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/what-is-ci</guid>
      <pubDate>Fri, 14 Jul 2023 11:57:18 +0000</pubDate>
    </item>
    <item>
      <title>The curse of bug-to-bug compatibility</title>
      <link>https://qi.bookwar.info/the-curse-of-bug-to-bug-compatibility?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Disclaimer: I am a Senior Principal Engineer in Red Hat. I was a member of the RHEL 9/CentOS Stream 9 Bootstrap team. Opinions are my own.&#xA;&#xA;Tl;dr&#xA;&#xA;The chase for &#34;bug-to-bug compatibility&#34; hurts community, hurts RHEL customers and hurts the industry as a whole. The real innovation behind the CentOS Stream is the attempt to change it.&#xA;&#xA;!--more--&#xA;&#xA;What is ABI compatibility?&#xA;&#xA;ABI compatibility is a requirement for certain interfaces to not change for a certain length of time, so that you can safely rely on the availability of a certain function and certain library which behaves in a predictable way.&#xA;&#xA;It is important to know that RHEL ABI compatibility doesn&#39;t say &#34;all ABIs and APIs are stable forever&#34;. The real ABI compatibility of RHEL is described in details in the official ABI Compatibility guidelines:&#xA;&#xA;https://access.redhat.com/articles/rhel9-abi-compatibility&#xA;&#xA;I recommend to take a look and check which compatibility level is assigned to your favorite library according to this guide.&#xA;&#xA;What is bug to bug compatibility?&#xA;&#xA;The term bug-to-bug compatibility when applied in isolation means that for this specific issue when we implement a new system we carry the bug over to a new implementation. Your users rely on the broken behavior so much that they require it on the new system, see xkcd/Workflow.&#xA;&#xA;The way the term is applied to RHEL conversation is very generic, and therefore has much less sense. It implies that one Linux distribution has the same bugs as the another.&#xA;&#xA;Linux distribution is not a static thing though. Linux distribution is a pool of package builds, each versioned on its own, and updated on its own, which are then combined into different subsets based on different rules. It is both the power and the weakness of a Linux distribution. The power as the ability to combine and mix and match packages allows you to create solutions to a large range of tasks. The weakness as different combinations of components may lead to different behavior.&#xA;&#xA;Add the branching structure of RHEL to it, with all of its powerful minor-stream complexity, and you&#39;ll realize that there is simply no single state of RHEL, which you can be bug-to-bug compatible to.&#xA;&#xA;Thus, at the distribution level the &#34;bug-to-bug compatibility&#34; concept does not exist. It is an overhyped buzzword people use without putting too much thought into it.&#xA;&#xA;Why does it hurt&#xA;&#xA;ABI compatibility guidelines is the open formal standard for the RHEL-compatible ecosystem. It is the .odt kind of specification for Linux distributions. On the other hand, the mythical bug-to-bug compatibility is the .docx. You chase the always moving target which you can not control or predict.&#xA;&#xA;The huge amount of issues coming to RHEL support, and the demand from RHEL customers for longer and more extensive and never-ending support cycles, come from the fact that the ecosystem doesn&#39;t follow the standard, and relies on &#34;undefined behavior&#34; of the specific implementation of it.&#xA;&#xA;Think about it: Even a single RHEL minor release, as the most stable and most restricted flow of updates we have at hand, is not bug-to-bug compatible to itself. Yes, we change things and we fix bugs. That&#39;s what updates do. &#xA;&#xA;More to that, even a static snapshot of a RHEL minor release is not a good reference for anything. 
Paying RHEL customers never have a system deployed exactly in the same state how RHEL engineers test it. Every single customer changes the system so that they cherry-pick certain updates, freeze some other and install custom compatible versions of certain things. And that is generally OK. Until it isn&#39;t and generates the issue and a support case.&#xA;&#xA;The ABI-compatibility standard, which we enforce, gives us both - the limits which we shouldn&#39;t cross, but also the flexibility to adjust within those limits. And any kind of &#34;pinning&#34; to the undefined behavior of a specific shapshot of RHEL at a specific point in time implemented by a third-party or ourselves creates an issue for future updates not knowing about that hidden requirement. &#xA;&#xA;So yes, relying on undefined behavior is bad for business. For all businesses. As well as for community. It is simply bad for all people on Earth who use whatever those businesses and non-businesses create.&#xA;&#xA;What is CentOS Stream really?&#xA;&#xA;CentOS Stream is sort of &#34;RHEL Stable Proposed Updates&#34;. Yes, Red Hat Marketing and Branding folks do not want it to be explained this way, and probably cringe at this very moment. Thus, instead of looking at labels, look at the tech side: It is an ABI-stable and continuous Linux distribution, the mainline of RHEL, from which we branch RHEL minor releases.&#xA;&#xA;But CentOS Stream also represents an open reference implementation of the ABI compatibility standard of the RHEL-compatible ecosystem.&#xA;&#xA;If your RHEL-compatible application or service doesn&#39;t work on CentOS Stream, you do it wrong.&#xA;&#xA;And I&#39;ll rephrase it: people think that they need a RHEL &#34;clone&#34;, because it gives them the access to vendors who develop and test for RHEL. The point though is that to get access to RHEL ecosystem community needs vendors to start develop and test for CentOS Stream. And this is where Red Hat&#39;s interest and community interest overlap.&#xA;&#xA;----&#xA;&#xA;Now some may ask,&#xA;&#xA;but what if my requirements are not covered by that standard?&#xA;&#xA;Then let&#39;s adjust the standard. Bring those requirements in. Do not assume that the requirements, which do not work on CentOS Stream, will somehow magically be fullfilled by RHEL. Because they won&#39;t.&#xA;&#xA;And some may say,&#xA;&#xA;ok, it is interesting, but all this talk about standards and requirements looks really complicated and time-consuming.&#xA;&#xA;Then bring your tests. The easy way to write a standard is to turn it into a distribution-agnostic test.&#xA;&#xA;If you are worried that CentOS Stream will break a certain behavior, write a test and let&#39;s gate all CentOS Stream updates with it (And while we are at it, we can also gate all Fedora updates and even upstream updates using the same test, see Packit )&#xA;&#xA;And then some may say,&#xA;&#xA;but what if my requirements are so special and custom that they can not be a standard for all.&#xA;&#xA;Then you don&#39;t need bug-to-bug compatibility. You need the power of remixing.&#xA;&#xA;Do your own customization via a Special Interest Group SIG, see for example Hyperscale SIG. Make a version which fits your goals, your schedule, your workflow and your quality requirement, but based on the open shared standard of the ecosystem.&#xA;&#xA;Eldevelopment]]&gt;</description>
      <content:encoded><![CDATA[<p><em>Disclaimer: I am a Senior Principal Engineer in Red Hat. I was a member of the RHEL 9/CentOS Stream 9 Bootstrap team. Opinions are my own.</em></p>

<h2 id="tl-dr" id="tl-dr">Tl;dr</h2>

<p>The chase for “bug-to-bug compatibility” hurts the community, hurts RHEL customers and hurts the industry as a whole. The real innovation behind CentOS Stream is the attempt to change this.</p>



<h2 id="what-is-abi-compatibility" id="what-is-abi-compatibility">What is ABI compatibility?</h2>

<p>ABI compatibility is a requirement that certain interfaces do not change for a certain length of time, so that you can safely rely on the availability of a given function in a given library behaving in a predictable way.</p>

<p>It is important to know that RHEL ABI compatibility doesn&#39;t say “all ABIs and APIs are stable forever”. The real ABI compatibility of RHEL is described in detail in the official ABI Compatibility guidelines:</p>

<p><a href="https://access.redhat.com/articles/rhel9-abi-compatibility" rel="nofollow">https://access.redhat.com/articles/rhel9-abi-compatibility</a></p>

<p>I recommend taking a look and checking which compatibility level is assigned to your favorite library according to this guide.</p>
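
<p>On the practical side, the interfaces in question are things like library sonames and exported symbols. You can inspect what your application actually relies on (the binary and library paths below are just examples):</p>

<pre><code># which shared libraries a binary needs
readelf -d /usr/bin/curl | grep NEEDED

# the soname a library promises to keep stable
objdump -p /usr/lib64/libcrypto.so.3 | grep SONAME
</code></pre>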

<h2 id="what-is-bug-to-bug-compatibility" id="what-is-bug-to-bug-compatibility">What is bug to bug compatibility?</h2>

<p>The term <em>bug-to-bug compatibility</em>, when applied in isolation, means that for a specific issue, when we implement a new system, we carry the bug over to the new implementation. Your users rely on the broken behavior so much that they require it on the new system, see <a href="https://xkcd.com/1172/" rel="nofollow">xkcd/Workflow</a>.</p>

<p>The way the term is applied to the RHEL conversation is very generic, and therefore makes much less sense. It implies that one Linux distribution has the same bugs as another.</p>

<p>A Linux distribution is not a static thing though. A Linux distribution is a pool of package builds, each versioned on its own and updated on its own, which are then combined into different subsets based on different rules. This is both the power and the weakness of a Linux distribution. The power, because the ability to combine and mix and match packages allows you to create solutions for a large range of tasks. The weakness, because different combinations of components may lead to different behavior.</p>

<p>Add the branching structure of RHEL to it, with all of its powerful minor-stream complexity, and you&#39;ll realize that there is simply no single state of RHEL, which you can be <em>bug-to-bug compatible to</em>.</p>

<p>Thus, at the distribution level the “bug-to-bug compatibility” concept does not exist. It is an overhyped buzzword people use without putting too much thought into it.</p>

<h2 id="why-does-it-hurt" id="why-does-it-hurt">Why does it hurt</h2>

<p>The ABI compatibility guidelines are the open formal standard for the RHEL-compatible ecosystem. They are the <code>.odt</code> kind of specification for Linux distributions. The mythical bug-to-bug compatibility, on the other hand, is the <code>.docx</code>: you chase an always moving target which you can not control or predict.</p>

<p>The huge number of issues coming to RHEL support, and the demand from RHEL customers for longer, more extensive, never-ending support cycles, come from the fact that the ecosystem doesn&#39;t follow the standard and relies on the “undefined behavior” of a specific implementation of it.</p>

<p>Think about it: Even a single RHEL minor release, as the most stable and most restricted flow of updates we have at hand, is not bug-to-bug compatible to itself. Yes, we change things and we fix bugs. That&#39;s what updates do.</p>

<p>More than that, even a static snapshot of a RHEL minor release is not a good reference for anything. Paying RHEL customers never have a system deployed in exactly the same state in which RHEL engineers test it. Every single customer changes the system: they cherry-pick certain updates, freeze some others and install custom <em>compatible</em> versions of certain things. And that is generally OK. Until it isn&#39;t, and generates an issue and a support case.</p>

<p>The ABI-compatibility standard, which we enforce, gives us both the limits which we shouldn&#39;t cross and the flexibility to adjust within those limits. And any kind of “pinning” to the undefined behavior of a specific snapshot of RHEL at a specific point in time, implemented by a third party or by ourselves, creates an issue for future updates, which know nothing about that hidden requirement.</p>

<p>So yes, relying on undefined behavior is bad for business. For all businesses. As well as for community. It is simply bad for all people on Earth who use whatever those businesses and non-businesses create.</p>

<h2 id="what-is-centos-stream-really" id="what-is-centos-stream-really">What is CentOS Stream really?</h2>

<p>CentOS Stream is sort of “RHEL Stable Proposed Updates”. Yes, Red Hat Marketing and Branding folks do not want it to be explained this way, and probably cringe at this very moment. Thus, instead of looking at labels, look at the tech side: It is an <a href="https://archive.fosdem.org/2022/schedule/event/centos_stream_stable_and_continuous/" rel="nofollow"><em>ABI-stable</em> and <em>continuous</em></a> Linux distribution, the mainline of RHEL, from which we branch RHEL minor releases.</p>

<p>But CentOS Stream also represents an open <strong>reference implementation</strong> of the ABI compatibility standard of the RHEL-compatible ecosystem.</p>

<p>If your RHEL-compatible application or service doesn&#39;t work on CentOS Stream, you do it wrong.</p>

<p>And I&#39;ll rephrase it: people think that they need a RHEL “clone”, because it gives them access to vendors who develop and test for RHEL. The point, though, is that to get access to the RHEL ecosystem, the community needs vendors to start developing and testing for CentOS Stream. And this is where Red Hat&#39;s interest and the community interest overlap.</p>

<hr/>

<p>Now some may ask,</p>
<ul><li>but what if my requirements are not covered by that standard?</li></ul>

<p>Then let&#39;s adjust the standard. Bring those requirements in. Do not assume that the requirements which do not work on CentOS Stream will somehow magically be fulfilled by RHEL. Because they won&#39;t.</p>

<p>And some may say,</p>
<ul><li>ok, it is interesting, but all this talk about standards and requirements looks really complicated and time-consuming.</li></ul>

<p>Then bring <em>your tests</em>. The easy way to write a standard is to turn it into a distribution-agnostic test.</p>

<p>If you are worried that CentOS Stream will break a certain behavior, write a test and let&#39;s gate all CentOS Stream updates with it. (And while we are at it, we can also gate all Fedora updates and even upstream updates using the same test, see <a href="https://packit.dev/" rel="nofollow">Packit</a>.)</p>
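
<p>Such a test doesn&#39;t have to be fancy. A made-up sketch of a distribution-agnostic check for one specific behavior:</p>

<pre><code>#!/bin/sh -eu
# Hypothetical gating test: tar must round-trip file content exactly.
tmp=$(mktemp -d)
echo "payload" &gt; "$tmp/file"
tar -C "$tmp" -cf "$tmp/archive.tar" file
mkdir "$tmp/out"
tar -C "$tmp/out" -xf "$tmp/archive.tar"
cmp "$tmp/file" "$tmp/out/file" &amp;&amp; echo PASS
</code></pre>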

<p>And then some may say,</p>
<ul><li>but what if my requirements are so special and custom that they can not be a standard for all.</li></ul>

<p>Then you don&#39;t need bug-to-bug compatibility. You need the power of remixing.</p>

<p>Do your own customization via a Special Interest Group <a href="https://sigs.centos.org/guide/" rel="nofollow">SIG</a>, see for example the <a href="https://wiki.centos.org/SpecialInterestGroup/Hyperscale" rel="nofollow">Hyperscale SIG</a>. Make a version which fits your goals, your schedule, your workflow and your quality requirements, but based on the open shared standard of the ecosystem.</p>

<p><a href="https://qi.bookwar.info/tag:Eldevelopment" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">Eldevelopment</span></a></p>
]]></content:encoded>
      <guid>https://qi.bookwar.info/the-curse-of-bug-to-bug-compatibility</guid>
      <pubDate>Tue, 11 Jul 2023 14:14:06 +0000</pubDate>
    </item>
  </channel>
</rss>