Machine Learning on PCs – Use mixed precision and look out for super-convergence to save energy

People doing Machine Learning [ML] experiments on their own Linux PCs or laptops know that numerical training runs put a heavy load on the graphics card and, as a direct consequence, consume a lot of energy. Especially in a hot summer like the one we are presently having in Germany, cooling your systems may become a problem. And as energy carries a high price tag here, any method to reduce the load and/or the power consumption is welcome.

But I think that caring about energy consumption is a topic which we as Linux and ML enthusiasts should keep in mind in general. Some big tech companies will probably not do it – as long as their money machinery works and as some heads follow fantasies about building small nuclear power plants for their big AI data centers. But we Open Source people would like to see more AI and ML services become independent of the monopolists and their infrastructure anyway – and not only for reasons of data and privacy protection.

However, as soon as we proclaim and work for a development that favors local, resource-optimized installations of AI and ML tools, both for private people and for companies, we have to care about side effects: In parallel, we have to bring the energy consumption of these many local installations down substantially. Otherwise, centralized solutions may end up with a better energy efficiency than decentralized ones.

For me as a retired person in Germany the general financial pressure is high enough to enforce a careful use of my private resources. With this post I want to draw your attention to two points which may help you, too, to save energy during your ML experiments – in addition to standard measures like saving certain model states during training runs to get better starting points for new runs.

Mixed precision with Keras (and other frameworks)

As soon as you privately start with optimized versions of transformer-based LLMs, you touch the limits of what can be done on private PCs – even when you use pre-trained models and only adapt them to your special data and purposes. But also for other, more mundane ML algorithms based on Deep Neural Networks, such as ResNets, DenseNets or the U-Nets of Stable Diffusion models, your graphics card(s) may feel the load. My RTX 4060 Ti sometimes shows a power consumption of up to 150 Watts over training runs which may last for some hours.

If you use TensorFlow 2, a recent CUDA version and Keras 3 on an Nvidia RTX GPU, something that helps here is a setting which enforces so-called “mixed precision” computations. The point, in short, is that some layer variables of your models can be handled with 16-bit precision, while more critical variables are kept at 32-bit precision. You can even set a global policy which applies this scheme to all layers of a model. Have a look at [1] for more information. Relatively detailed introductions can be found in [3] and [4].

And doing some tests is really easy. The basic thing you have to do is to import the respective module:

from tensorflow.keras import mixed_precision

Then call e.g.

tf.keras.mixed_precision.set_global_policy("mixed_float16")

at the beginning of your program. But be careful! Intermediate calls to tf.keras.backend.clear_session() may reset the policy to the default of 32-bit precision. To be on the safe side you should also call

tf.keras.mixed_precision.set_global_policy("mixed_float16")

directly ahead of building your Keras model.

You can check the present policy by calling the following function

tf.keras.mixed_precision.dtype_policy()

It should give you something like

<FloatDTypePolicy "mixed_float16">
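Putting these pieces together, here is a minimal sketch of how a small Keras model can be built under the mixed_float16 policy. The toy classifier below is just an assumed example for illustration, not a model from my experiments; note the recommendation from the Keras mixed precision guide to keep the final softmax layer in float32 for numerical stability.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, mixed_precision

# Set the global policy before building the model
mixed_precision.set_global_policy("mixed_float16")

# Toy classifier: under mixed_float16 the layer computations run in
# float16, while the layer weights are kept in float32
dense1 = layers.Dense(256, activation="relu")
model = keras.Sequential([
    keras.Input(shape=(784,)),
    dense1,
    layers.Dense(256, activation="relu"),
    # Keep the final softmax in float32, as recommended by the Keras guide
    layers.Dense(10, activation="softmax", dtype="float32"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Sanity check: should print "float16 float32"
print(dense1.compute_dtype, dense1.variable_dtype)

When you compile() and fit() a model under this policy, Keras also applies dynamic loss scaling for you; more on that further below.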

In theory, you expect a performance boost (regarding GPU time per iteration) during training (or inference) and/or a drop in energy consumption per iteration. I have experienced both effects, to different degrees. The outcome depends predominantly on the number of parameters (i.e. the depth of your Deep Learning network) and on how efficiently your algorithm already uses the GPU. Another point that may impact your results is intermediate data transfers between GPU VRAM and CPU RAM.

Other ML frameworks: I presently do not use the PyTorch and fast.ai libraries much (shame on me!), but these ML frameworks also offer functionality to make use of mixed precision computations.
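For illustration only – I have not tested this myself – here is a rough sketch of what the analogous setup looks like with PyTorch's automatic mixed precision (AMP). The tiny model and the random data are placeholders; recent PyTorch versions expose the same functionality via torch.amp as well.

import torch
from torch import nn

# Placeholder model, data and optimizer - replace with your own setup
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling against float16 underflow

x = torch.randn(64, 784, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(100):
    optimizer.zero_grad(set_to_none=True)
    # Selected operations of the forward pass run in float16 inside autocast
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscale the gradients, then apply them
    scaler.update()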

Positive results from private applications of “mixed precision”

How big were the effects in concrete ML experiments?

Reduced GPU time for Stable Diffusion by a factor of 2:
An interesting thing I have seen on the RTX 4060 Ti is a drastic performance increase for Stable Diffusion with the Keras CV modules after activating “mixed precision”. Image generation became faster by a factor of 2.1 – with the GPU running at a load of 98% both with and without “mixed precision”. See [6].

Reduced power consumption for CAEs and ResNets by a factor of 1.65:
I found another positive result for ResNetV2 networks with up to 56 convolutional layers and for CAEs with about 10 layers, but with relatively many and large maps. In these cases I only got a minor performance factor of 1.1 up to 1.2. The reason was probably the use of an old-fashioned ImageDataGenerator to transfer the image data to the GPU (with Keras 3, special preprocessing layers should instead be used to feed the networks with e.g. image tensors). However, the power consumption of the GPU dropped significantly, from 106 Watt down to 64 Watt, at a load of around 70%. See [7].

In both cases I saved energy – either due to a reduced GPU time per iteration or due to a lower power consumption of the GPU.

Be careful and test the impact on accuracy – in particular during the training of sensitive models

Though Keras tries to compensate for underflows and other unwelcome effects of 16-bit computations during gradient descent (e.g. by loss scaling), you should always check the impact of “mixed precision” on the eventual results of your model. Working with reduced precision during weight optimization may affect convergence and accuracy – in particular with momentum-driven optimizers. This is important if you have to fight for the last percentages of accuracy.
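The compensation mentioned above is mainly dynamic loss scaling, which Keras handles automatically when you use compile() and fit() under the mixed_float16 policy. If you write a custom training loop, however, you have to wrap the optimizer yourself. Below is a minimal sketch following the TensorFlow mixed precision guide for tf.keras (the stand-alone Keras 3 API names differ slightly); the model, the data and the loss are assumed placeholders.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import mixed_precision

# Assumption: model was built under the mixed_float16 policy
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
optimizer = mixed_precision.LossScaleOptimizer(optimizer)  # dynamic loss scaling
loss_fn = keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        # Scale the loss up so that small float16 gradients do not underflow
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    # Scale the gradients back down before the weight update
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss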

So, please check your models via at least some comparative runs with 32-bit precision during and at the end of a development phase for an ML algorithm. This is e.g. important when you have optimized your model to achieve super-convergence within only around 20 epochs. In such cases, deviations during the initial steps with relatively large learning rates, or close to the loss minimum during the last epochs, may become important for the overall quality of the training results. I have experienced this in particular when using Adam or AdamW (with decoupled weight decay) as optimizers. For such very short training runs, 32-bit runs gave better results in the sense of a consistent reproduction of high-accuracy convergence, independent of statistical initial conditions like the weight initialization or the evaluation data sets. With 16 bit one missed the highest accuracy values more often.

In general, testing for a negative impact of mixed precision is an important step of quality assurance when you work in a professional ML development environment – in particular when high accuracy matters for the productive use of your ML algorithms in industry or medicine. This should, however, not prevent the use of “mixed precision” during the development phase, during which you normally do a lot of test runs anyway. There you already save a lot of energy. But, at certain intervals, you should do some comparative training runs with 32-bit precision. In many cases you may find that you still get tolerable accuracy values.
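To make such comparisons a routine step, a small test harness can run the same training twice – once with float32 and once with mixed_float16 – and report the best validation accuracy of each run. The sketch below is only an illustration; build_model and the data arrays are hypothetical placeholders for your own model factory (compiled with an accuracy metric) and data.

import tensorflow as tf
from tensorflow.keras import mixed_precision

def compare_policies(build_model, x_train, y_train, x_val, y_val, epochs=20):
    results = {}
    for policy in ("float32", "mixed_float16"):
        tf.keras.backend.clear_session()           # fresh session per run ...
        mixed_precision.set_global_policy(policy)  # ... so set the policy afterwards
        tf.keras.utils.set_random_seed(42)         # identical weight init for both runs
        model = build_model()                      # hypothetical model factory
        hist = model.fit(x_train, y_train,
                         validation_data=(x_val, y_val),
                         epochs=epochs, verbose=0)
        results[policy] = max(hist.history["val_accuracy"])
    return results

# e.g.: print(compare_policies(build_model, x_train, y_train, x_val, y_val))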

So far I have spoken about training. But mixed precision can also be used at inference time of an already trained ML model. You should perform some tests for this case, too.

Super-Convergence as a second option to save training time

Something one could read a lot about in the past two years is the so-called “super-convergence” of Deep Neural Networks (see [8] up to [13]). The basic idea is that relatively large learning rates [LR] at the beginning of a training run, weight decay, an optimal batch size and a suitable LR and momentum schedule (balancing large LRs with a proper decay curve down to much smaller values) may lead to convergence of a Deep Network within only a few epochs – around or below 20 – without losing much of the accuracy achieved in much longer runs.
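To give an impression of what such a schedule may look like in Keras, here is a rough, simplified sketch of a one-cycle-style learning rate schedule as a callback: a linear warm-up to a maximum LR, followed by a cosine decay down to a small final value. The original 1cycle policy also schedules the momentum inversely, which I leave out here. max_lr, the warm-up fraction and the other numbers are assumptions you would have to tune, and the callback presumes an optimizer created with a plain float learning rate.

import math
import tensorflow as tf

class OneCycleLR(tf.keras.callbacks.Callback):
    # Simplified one-cycle schedule: linear warm-up to max_lr,
    # then cosine decay down to max_lr * final_lr_frac
    def __init__(self, max_lr, total_steps, warmup_frac=0.3, final_lr_frac=0.01):
        super().__init__()
        self.max_lr = max_lr
        self.total_steps = total_steps
        self.warmup_steps = max(1, int(warmup_frac * total_steps))
        self.final_lr = max_lr * final_lr_frac
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        if self.step < self.warmup_steps:
            lr = self.max_lr * (self.step + 1) / self.warmup_steps
        else:
            progress = (self.step - self.warmup_steps) / max(
                1, self.total_steps - self.warmup_steps)
            lr = self.final_lr + 0.5 * (self.max_lr - self.final_lr) * (
                1.0 + math.cos(math.pi * progress))
        self.model.optimizer.learning_rate.assign(lr)
        self.step += 1

# Usage, e.g.:
# model.fit(..., callbacks=[OneCycleLR(max_lr=1e-3,
#                                      total_steps=epochs * steps_per_epoch)])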

Super-convergence has been demonstrated by numerical experiments for some types of Residual Networks, classification problems and certain test data sets. A theoretical explanation has been given in [13]. However, from the literature it is unclear (at least to me) whether this is typical only for networks built upon convolutional layers. Even for such convolutional networks it may be tricky to find the right parameters (see e.g. [12]). Sometimes a breakthrough may also come with some minor modifications of a network (see e.g. [12]). But a reduction from e.g. 100 epochs down to only 20 epochs is really worth a try.

I take the idea of super-convergence as an appeal to always try out whether we can achieve faster convergence of our neural networks by changing the LR schedule and by shifting to pure decoupled weight decay instead of L2 regularization (see the short sketch below). As one relatively often changes elements of the network or of the training data during the development phase of an ML solution, repetitions of training runs from scratch happen quite frequently, until both the developers and the customers are satisfied with the results. In addition, we may need to retrain the productive network from time to time to include new data in its optimization. In such cases the reduction of epochs (and of elementary batch-related iterations) will in the end save energy – despite some extra runs to test for super-convergence.
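As a concrete starting point for such trials, Keras ships an AdamW implementation with decoupled weight decay, which can be combined with a schedule like the OneCycleLR callback sketched above. A short, hedged snippet – the numbers are placeholders, not tuned values, and "model" stands for your own Keras model.

import tensorflow as tf

# AdamW with decoupled weight decay (the values below are placeholders)
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)
# Commonly, bias and normalization parameters are excluded from weight decay;
# recent Keras versions offer a helper for this:
optimizer.exclude_from_weight_decay(var_names=["bias", "gamma", "beta"])

model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])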

So, in my opinion, trial runs to find out whether an ML model, in combination with published LR schedules and weight decay parameters, is fit for super-convergence should be a standard step of any development phase for an ML solution – not only for classification tasks.

Conclusion

The energy consumption of solutions based on Machine Learning is a problem that will grow bigger – also when we favor decentralized, local implementations of AI and ML algorithms in private IT environments (based on Linux). A simple method to save energy is to enable “mixed precision” computations during the training of ML models. Another method, which requires a bit more testing and experience, is to look out for learning rate schedules and optimizers with decoupled weight decay that may reduce the number of training iterations substantially and in some cases even provide “super-convergence”. Respective steps should become a fixed part of ML development and testing phases.

Links and literature