My experience in large systems automation

Most of the literature in the field of automation, CI/CD, and testing in general refers to small to medium systems. Not that such systems can't be highly complex, especially critical ones in medicine, the military, or banking. However, the challenge in systems that consist of dozens of servers and multidisciplinary computations is different: it often includes all the complexity of the smaller systems, certainly if it is a critical system, plus the complexity of different servers developed by different teams when the solution is supposed to be integrative. Add real-time processing and data streaming at petabyte scale. All of this beauty sits in vast laboratories, and in testing we use simulators, some of which, judging by their price, could fund a comfortable retirement.


How do you build automation for such large systems?

The first question is what kind of automation this is: unit tests? Component tests? End-to-end (E2E) tests?

Testing a component (which is itself an integration of different features and processes) is one type of test; examining several components as part of a complete E2E system requires a different strategy.

In our case, the programmers took responsibility for the unit tests and the component tests. We, the automation team that I managed, took responsibility for the E2E tests that would be part of the CI process. Although classic CI is a single-lab process, in our case, since we also wanted independent running capabilities (for example, at the client's site), we had to plan for different types of labs to work on (different configurations).

The goal was to build sanity-level tests that determine whether a version can be passed on to the testers. Without such a gate, time is wasted handing versions to the testers and getting them back after failed tests.

Strategy

I've already talked a lot about strategy elsewhere (e.g., Automation Strategy). The difference was that in previous cases I was part of the automation team; here I ran it myself, and the responsibility was mine. Would we know how to translate the knowledge and theory into practice?

Here is the "translation":

Reliability

Here reliability goes beyond the "regular" issues of flakiness of various kinds. This is a product with sophisticated configuration capabilities, and configurations can sometimes change from build to build. What can we do? Fail the tests?

The answer we found: build an infrastructure that knows how to read the configuration and (in the first stage, due to time constraints) looks only at whether the feature is enabled. If it is, the test runs; if not, an orderly message is issued rather than a failure.
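As a rough illustration, here is a minimal sketch of this idea in NUnit (the framework we eventually used); the configuration file name, its JSON shape, and the feature name are illustrative assumptions, not our actual infrastructure:

```csharp
using System.IO;
using System.Text.Json;
using NUnit.Framework;

// Sketch of a feature-aware test: the configuration file name, its JSON shape,
// and the feature name are illustrative assumptions, not the real infrastructure.
[TestFixture]
public class DataEnrichmentTests
{
    private JsonDocument _labConfig;

    [OneTimeSetUp]
    public void LoadLabConfiguration() =>
        _labConfig = JsonDocument.Parse(File.ReadAllText("lab-config.json"));

    [OneTimeTearDown]
    public void Cleanup() => _labConfig?.Dispose();

    private void RequireFeature(string featureName)
    {
        bool enabled = _labConfig.RootElement
            .GetProperty("features")
            .GetProperty(featureName)
            .GetBoolean();

        if (!enabled)
        {
            // Orderly message instead of a red failure: the test is reported as skipped.
            Assert.Ignore($"Feature '{featureName}' is disabled in this lab's configuration.");
        }
    }

    [Test]
    public void EnrichmentPipeline_ProducesExpectedRecord()
    {
        RequireFeature("enrichment");
        // ... actual test steps run only when the feature is enabled ...
    }
}
```

With `Assert.Ignore`, a disabled feature shows up in the report as a skipped test with its own message rather than as a failure.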

We ran the tests in as many laboratories as we could to develop the ability to deal with various problems.

Before running, the automation checked its environment to verify that everything was in place and that no files or resources were missing.
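A minimal sketch of such a pre-run check, assuming an NUnit `[SetUpFixture]` and an illustrative list of required files:

```csharp
using System.IO;
using NUnit.Framework;

// Sketch of the pre-run environment check; the file list and the SetUpFixture
// placement are assumptions, not the original infrastructure.
[SetUpFixture]
public class EnvironmentPreCheck
{
    private static readonly string[] RequiredFiles =
    {
        "lab-config.json",        // lab configuration (illustrative name)
        "simulator/scenario.dat"  // simulator input (illustrative name)
    };

    [OneTimeSetUp]
    public void VerifyEnvironment()
    {
        foreach (var file in RequiredFiles)
        {
            if (!File.Exists(file))
            {
                // Fail fast with a clear message before any test runs.
                Assert.Fail($"Missing required resource: {file}");
            }
        }
    }
}
```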

As a manager, despite the pressure, the direction was clear: build a system that would work for years, not just for managers' demos. Each time the team came up with a proposal to improve the infrastructure or the tests, we considered it together and decided accordingly.

Reports

On this topic an automation project can easily fail, even if the tests are written in the best form possible. If the person who is supposed to debug after the run feels that it takes too long to figure out what the results mean, he or she will simply stop using the automation.

Here the split into small tests helped. In addition, we introduced a report server with modern dashboards and the ability to catalog results.

We sat with the developers to make sure the report presented problems clearly.

Beyond the reports for specific runs, we built a table that accumulates daily results, so that problems that develop over time can be seen.
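For illustration only, a sketch of accumulating one summary row per run into a cumulative file; the file name and columns are assumptions:

```csharp
using System;
using System.IO;

// Sketch only: append one summary row per run to a cumulative CSV, so that
// trends over time are visible. The file name and columns are assumptions.
public static class DailyResultsTable
{
    public static void AppendRunSummary(string runId, int passed, int failed, int ignored)
    {
        const string path = "daily-results.csv";

        // Write the header once, then a row per run.
        if (!File.Exists(path))
            File.AppendAllText(path, "date,run,passed,failed,ignored" + Environment.NewLine);

        File.AppendAllText(path,
            $"{DateTime.Now:yyyy-MM-dd},{runId},{passed},{failed},{ignored}{Environment.NewLine}");
    }
}
```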


Debuggability

A bit more explanation about debuggability: the automation streams input data using simulators. The entire data-manipulation process can eventually be monitored through the endpoint in the UI, or better, through the APIs. There's no problem streaming data and checking whether the manipulation worked or not. But what if it doesn't work?



There are two options: go for the positive - run the test and check the route only if there is a problem - or follow the route in advance. We preferred to follow the route in advance. It may be slower, but it is also a system constraint (for example, in Kafka it is easier to subscribe to the topics before the process starts), and it also fits our point of view.
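A sketch of that "follow the route in advance" idea, assuming the Confluent.Kafka .NET client; the broker address, group id, topic, and payload check are illustrative:

```csharp
using System;
using Confluent.Kafka;

// Sketch: subscribe to an intermediate topic before the process is triggered,
// then confirm the record passed through that station on the route.
// Topic names, group id, and broker address are assumptions.
public class RouteMonitor
{
    public static bool WaitForRecord(string topic, string expectedKeyFragment, TimeSpan timeout)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "kafka-lab:9092",   // illustrative address
            GroupId = "e2e-route-monitor",
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe(topic);   // register before the data is streamed

        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            var result = consumer.Consume(TimeSpan.FromSeconds(1));
            if (result != null && result.Message.Value.Contains(expectedKeyFragment))
            {
                return true;   // the record reached this station on the route
            }
        }
        return false;
    }
}
```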

We connect to different types of databases, including big-data stores, to Kafka as mentioned above, and to our code written in Java and C++ on various operating systems.

Beyond monitoring and reporting at the important stations, we collected logs along the way. Of course, we added screenshots and videos where relevant.
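A minimal sketch of collecting such artifacts for failed tests with NUnit's attachment mechanism; the log file name and the base-class approach are assumptions:

```csharp
using System.IO;
using NUnit.Framework;
using NUnit.Framework.Interfaces;

// Sketch of attaching collected artifacts to a failed test via NUnit's
// attachment mechanism; the log path and base-class approach are assumptions.
public abstract class ArtifactCollectingTest
{
    [TearDown]
    public void CollectArtifactsOnFailure()
    {
        if (TestContext.CurrentContext.Result.Outcome.Status != TestStatus.Failed)
            return;

        // Attach the on-the-way log collected during the run, if it exists.
        var logPath = Path.Combine(TestContext.CurrentContext.WorkDirectory, "route.log");
        if (File.Exists(logPath))
            TestContext.AddTestAttachment(logPath, "Route log for the failed run");
    }
}
```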


Ease of running

The automation can be run by the Windows scheduler, triggered easily by hand, or run as a Jenkins task. You can decide which tests will run, and you can run them in parallel.
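A sketch of how such test selection and parallelism can be expressed in NUnit; the category name and the level of parallelism are assumptions, not our actual settings:

```csharp
using NUnit.Framework;

// Sketch of test selection and parallel execution in NUnit; category name and
// parallelism level are illustrative assumptions.
[assembly: LevelOfParallelism(4)]                  // up to 4 workers
[assembly: Parallelizable(ParallelScope.Fixtures)] // fixtures may run side by side

[TestFixture, Category("Sanity")]
public class IngestionSanityTests
{
    [Test]
    public void Streamed_Record_Is_Visible_Through_The_Api()
    {
        // ... test body ...
        Assert.Pass();
    }
}
```

A Jenkins task, a scheduled run, or a manual invocation can then choose what to run by category, for example with the NUnit console runner's `--where "cat == Sanity"` filter.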


Efficiency

We divided the tests into the shape of a triangle: at the base are tests that, if they fail, leave no point in continuing. If everything works, we move on to the next level. The last level, in my vision, was a regression level.

In the end, the idea was nice, but a question came up: if a level fails, isn't it better to run the rest anyway and see the state of the system?

In addition, it was possible to use several instances and run all the levels simultaneously. The most basic tests still matter (the developers still get a quick answer), but everything is flexible.

For this parallelism to work, we had to change strategy: instead of each level depending on the previous one, we made every test independent.
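A sketch of what an independent test looks like under this strategy; `SimulatorClient` and `ApiClient` are hypothetical stand-ins for the real clients:

```csharp
using NUnit.Framework;

// Sketch of the "every test is independent" change: the test creates its own
// preconditions instead of relying on a previous level having run first.
[TestFixture]
public class IndependentEnrichmentTests
{
    [Test]
    public void Enriched_Record_Reaches_The_Endpoint()
    {
        // Arrange: stream this test's own input record.
        var recordId = SimulatorClient.StreamSingleRecord("enrichment-scenario");

        // Act + Assert: the record is visible at the endpoint.
        Assert.That(ApiClient.RecordExists(recordId), Is.True);
    }

    // Minimal stubs so the sketch compiles; the real infrastructure replaces these.
    private static class SimulatorClient
    {
        public static string StreamSingleRecord(string scenario) => scenario + "-001";
    }

    private static class ApiClient
    {
        public static bool RecordExists(string recordId) => recordId.Length > 0;
    }
}
```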


Speed

This is a serious problem, because the simulator's processing and display pass through many stations. We solved it partly by using APIs and partly by front-loading everything we could to the start of the run, and doing it in parallel.
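A small sketch of the "front-load and parallelize" idea; the three warm-up steps are illustrative placeholders:

```csharp
using System.Threading.Tasks;

// Sketch of front-loading slow preparations and running them in parallel at the
// start of the run; the three warm-up steps are illustrative placeholders.
public static class RunWarmUp
{
    public static async Task PrepareAsync()
    {
        // Start everything that can be prepared up front, concurrently.
        var simulatorScenarios = Task.Run(() => { /* load simulator scenarios */ });
        var apiSessions        = Task.Run(() => { /* authenticate API sessions */ });
        var dbConnections      = Task.Run(() => { /* open database connections */ });

        // Wait for all of them before the first test starts.
        await Task.WhenAll(simulatorScenarios, apiSessions, dbConnections);
    }
}
```

In a setup like ours, a warm-up of this kind would be called once before the tests start, for example from a one-time setup step.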

Robust Programming

As stated, any proposal for improvement was welcomed, such as extracting shared functions to avoid duplication.

We brought in one of the country's leading automation experts to review the infrastructure and the code.


The most important thing in strategy

The late Jerry Weinberg, one of the well-known software development consultants, said it is all about people. This project was assigned to only part of the automation team, but that didn't stop us from working as one group. When a problem arose, all the automation people joined the discussion. Since we had a strong tailwind from senior management, and we understood the importance of the matter, we were highly motivated. People on the team came up with ideas, got help from outside interfaces when we needed it, and took ownership.


A few more details

With regard to infrastructure, we had an excellent infrastructure at the beginning, but without much running time behind it. From that point we developed it only according to real needs from the field. We moved from MSTest to NUnit for maximum flexibility.



The code is built as building blocks: not at the BDD level, but a test engineer who knows how to script can easily use them. This was part of my regression vision.
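A sketch of the building-block style; every class and method name here is a hypothetical stand-in for the real infrastructure:

```csharp
// Sketch of the building-block style: small, plainly named operations that a
// test engineer can script into a scenario without a BDD layer.
public static class Blocks
{
    public static string StreamScenario(string name) { /* drive the simulator */ return name + "-run"; }
    public static bool RecordVisibleInApi(string run) { /* query the endpoint  */ return true; }
    public static void SaveEvidence(string run)       { /* store logs, screenshots */ }
}

public static class ScriptedScenario
{
    public static bool Run()
    {
        // A test engineer composes a scenario by chaining blocks.
        var runId = Blocks.StreamScenario("basic-ingestion");
        var visible = Blocks.RecordVisibleInApi(runId);
        Blocks.SaveEvidence(runId);
        return visible;
    }
}
```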



The tests were designed by the test engineers, and I asked them to add verifications along the way, as well as the timing of actions and results.
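A sketch of one way such intermediate verifications with timing could look; the step names and time budgets used by callers are illustrative:

```csharp
using System;
using System.Diagnostics;
using NUnit.Framework;

// Sketch of an intermediate verification that also records how long the step
// took; the step names and time budgets are illustrative.
public static class TimedStep
{
    public static void Verify(string stepName, Func<bool> check, TimeSpan maxDuration)
    {
        var stopwatch = Stopwatch.StartNew();
        bool passed = check();
        stopwatch.Stop();

        // The timing goes into the test output so the report shows it per step.
        TestContext.WriteLine($"{stepName}: {(passed ? "OK" : "FAILED")} in {stopwatch.Elapsed}");

        Assert.That(passed, Is.True, $"Verification failed at step '{stepName}'");
        Assert.That(stopwatch.Elapsed, Is.LessThan(maxDuration),
            $"Step '{stepName}' exceeded its time budget of {maxDuration}");
    }
}
```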


Summary

It is clear that there will be surprises in such systems. At first, the runs were full of red failures. After intensive work, the runs became greener and greener, and real bugs were found. Interfacing teams started to trust the automation and asked to use it themselves.

Even when the automation itself failed, or the report was unclear (something that can't be completely avoided), the point was that we were aware of the importance of the matter and the handling was quick.
