The Simulated Annealing Algorithm 🌡️

An optimization algorithm rooted in Statistical Physics



The simulated annealing algorithm is an optimization technique widely used for finding the global minimum of a given function. The method was initially proposed by Kirkpatrick et al. 1 and by Černý 2 in analogy with the annealing procedure carried out in metallurgy, where a metal is heated up to a specific temperature and then cooled down slowly to remove internal stresses, as well as to improve its appearance and overall quality.

The core idea is that finding the low-temperature state of a physical system whose energy we can compute is, in fact, an optimization problem.

An example drawn from Statistical Mechanics

Let’s consider a so-called 1D spin-lattice as a starting example, which is a fancy name for describing a system composed of a chain of “particles” that can be either in a spin-up or spin-down state as represented below:

The energy of the system can be written as3: $$ E = -\sum_{\langle ij \rangle} J_{ij} s_i s_j $$ where $s_i$ can be either $+1$ (spin-up) or $-1$ (spin-down), $J_{ij}$ represents the coupling constant between particles $i$ and $j$, and the sum runs over nearest-neighbor pairs. Since each particle can be either in the spin-up or spin-down state, a system of $N$ particles has $2^N$ possible configurations. The probability of finding our system in a particular configuration $C$, with energy $E(C)$, is given by the Boltzmann-Gibbs statistics: $$P(C) = \frac{e^{-\frac{E(C)}{k_B T}}}{\sum_C e^{-\frac{E(C)}{k_B T}}}$$ where $T$ is the system's temperature and $k_B$ is the Boltzmann constant.
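
The energy above is straightforward to compute numerically. The sketch below assumes periodic boundary conditions and a uniform coupling $J$ (a common simplification of the general $J_{ij}$):

```python
import numpy as np

def ising_energy(spins, J=1.0):
    """Energy of a 1D Ising chain with uniform nearest-neighbor coupling J
    and periodic boundary conditions: E = -J * sum_i s_i * s_{i+1}."""
    return -J * np.sum(spins * np.roll(spins, -1))

# A small chain of N = 4 spins has 2**4 = 16 possible configurations.
spins = np.array([1, 1, 1, 1])   # spin-up = +1, spin-down = -1
print(ising_energy(spins))       # all-aligned chain: lowest-energy state
```
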

Physicists are usually interested in finding the average energy at a given temperature, which is provided by the following equation: $$ \langle E \rangle = \dfrac{\sum_{C} P(C) \cdot E(C)}{\sum_C P(C)} $$ where the sum runs over all possible configurations, whose number grows exponentially with $N$. Back in the day, one common approach was to sample a set of random configurations4 and then take the weighted average: $$ \langle E \rangle \approx \sum_i E_i p_i $$ However, by sampling configurations uniformly at random, most of them carry very low weights, making the whole approach inefficient. And here is where the Metropolis algorithm makes its entrance into the scene.

The Metropolis-Hastings Algorithm

What if, instead of uniformly sampling the configurations, we were able to sample the configurations according to their actual weight?

The paper by Metropolis and colleagues 5, called “Equation of State Calculations by Fast Computing Machines”, describes a general method for drawing samples from an arbitrary probability distribution. It is one of the pillars of computational physics, and the paper has, according to Google Scholar, more than 50k citations.

The implementation of the algorithm is very straightforward. We start with a random configuration with a specific energy $E$, and then make a “move”; in our 1D spin-lattice example, a move consists of selecting a particle at random and flipping its spin. At this point, we recompute the configuration's energy. If the new energy is smaller than before, we accept the move and start again. However, if the new energy is bigger, we retain the new configuration only with a probability that depends on the energy difference between the new and the old configuration.

The algorithm can be summarized as follows:

Algorithm : Metropolis-Hastings

  1. Let's start with an initial configuration C.
  2. Evaluate its energy, E
  3. Pick a new configuration starting from the previous one
  4. Evaluate the new energy, $E_{\rm{new}}$; if $E_{\rm{new}}<E$, accept the new configuration and go to 3
  5. Sample a random number, $u$ between 0 and 1
  6. If $e^{-\Delta E / k_BT} > u$, accept the new configuration otherwise keep the old one and go to 3.
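
The steps above can be sketched for the 1D spin lattice as follows (a minimal sketch, assuming periodic boundaries, uniform coupling $J$, and units with $k_B = 1$; note that $\Delta E$ for a single flip can be computed locally from the two neighboring spins instead of recomputing the full energy):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(spins, T, J=1.0):
    """One Metropolis update of a 1D Ising chain: flip a random spin
    and accept the flip with probability min(1, exp(-dE / T))."""
    N = len(spins)
    i = rng.integers(N)
    # Energy change from flipping spin i (only its two neighbors matter).
    dE = 2.0 * J * spins[i] * (spins[(i - 1) % N] + spins[(i + 1) % N])
    if dE < 0 or rng.random() < np.exp(-dE / T):
        spins[i] *= -1          # accept the move
    return spins

spins = rng.choice([-1, 1], size=50)
for _ in range(5000):
    spins = metropolis_step(spins, T=1.0)
```

After enough iterations the chain samples configurations according to the Boltzmann-Gibbs weights at temperature $T$, so the running energy oscillates around $\langle E \rangle$.
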

However, from these steps it is not clear why this procedure ensures that the sampled configurations are distributed according to the Boltzmann-Gibbs statistics. Mathematically, the procedure performs a random walk according to a Markov chain whose stationary distribution is our target one. A formal justification of the procedure is very well described in a blog post by Gregory Gundersen 6, which is worth a read.

By applying this algorithm to our 1D lattice, we observe that after some iterations the energy of the system oscillates around its average value. It's worth noticing that the energy of the system can be lowered by decreasing the temperature.


Simulated Annealing

Let’s suppose we want to minimize a function $f(x)$; how can we apply the concepts from above?

As mentioned previously, annealing is a metallurgical process involving heating and gradual cooling of a crystal to obtain a clear crystalline structure with minimal defects, achieved through a slow cooling process.

In the context of our task, we can view our function as the system’s energy and aim to locate its low-temperature state by introducing an effective temperature. This concept enables us to simulate the annealing process.

To summarize, the simulated annealing algorithm can be outlined in the following steps:

Algorithm : Simulated Annealing

  1. Let's start with an initial configuration, $C$, and an initial temperature $T_0$
  2. Pick a new configuration starting from the previous one
  3. Accept it or not based on the Metropolis criterion
  4. Update the temperature according to a schedule of choice (e.g. exponential decay, $T_{i+1} = \alpha \cdot T_i$)
  5. Repeat for a given number of iterations.

Let’s apply this to the case of an optimization benchmark function $f(x)$7 : $$ f(x) = -20\cdot e^{\frac{x}{\sqrt{2}}}-e^{\frac{1}{2}\cos(2\pi x)+1} + 15 $$

In this case, we can start by choosing a random $x_0$ and then pick the new “configuration” by moving from $x_i$ by a certain amount, as follows: $$ x_{i+1} = x_{i} + \alpha \cdot \mathbb{U}(0,1) $$ where $\mathbb{U}(0,1)$ represents a uniform random distribution between 0 and 1.
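
Putting the pieces together, here is a self-contained sketch of the full loop. Since the benchmark above relies on the author's periodic-boundary setup, this sketch minimizes a stand-in multimodal test function (Rastrigin-like, global minimum $f(0)=0$) instead, and uses a symmetric uniform step $\mathbb{U}(-1,1)$ rather than the one-sided move in the text so that no boundary wrapping is needed; the values of $T_0$, $\alpha$, and the step size are illustrative:

```python
import math
import random

random.seed(42)

def f(x):
    # Stand-in multimodal test function: many local minima, global min f(0) = 0.
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))

def simulated_annealing(f, x0, T0=10.0, alpha=0.99, n_iter=5000, step=1.0):
    """Minimize f starting from x0, with an exponential cooling schedule
    T_{i+1} = alpha * T_i and a symmetric uniform move."""
    x, T = x0, T0
    fx = f(x)
    best_x, best_f = x, fx
    for _ in range(n_iter):
        x_new = x + step * random.uniform(-1.0, 1.0)   # propose a move
        f_new = f(x_new)
        dE = f_new - fx
        # Metropolis criterion with the effective temperature T.
        if dE < 0 or random.random() < math.exp(-dE / T):
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx
        T *= alpha                                     # cooling schedule
    return best_x, best_f

x_best, f_best = simulated_annealing(f, x0=4.5)
print(x_best, f_best)
```

Tracking the best configuration seen so far (`best_x`, `best_f`) is a common practical addition, since late low-temperature moves can no longer escape a poor basin.
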

It is important to remark that using simulated annealing requires tuning both the cooling schedule of our effective temperature and the “transition rule” between our configurations.

In the example below, we can see the iterations of simulated annealing. At “high temperature”, the algorithm favors exploration of the “configuration” space, since the temperature makes moves with a relatively high $\Delta E$ likely to be accepted, resulting in random-walk-like behavior. In the low-temperature regime, however, the algorithm only favors transitions between favorable “configurations”.

Some concluding notes

Although simulated annealing can be used as a universal optimization procedure, succeeding in our task requires appropriately tuning both the cooling scheme and the move between configurations. In the case of combinatorial optimization, this can be challenging to do, and it is thus common to run the algorithm several times with different cooling schedules and different starting temperatures.


  1. S. Kirkpatrick et al., “Optimization by Simulated Annealing.” Science 220, 671-680 (1983). DOI: 10.1126/science.220.4598.671 ↩︎

  2. Černý, Vladimír. “Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm.” Journal of optimization theory and applications 45 (1985): 41-51. ↩︎

  3. In Statistical Mechanics this is the Ising Hamiltonian in absence of any external magnetic fields. ↩︎

  4. In more Statistical Mechanics terms, the idea is to sample the system phase space assuming that all the configurations are equiprobable and then weighting them for the Boltzmann-Gibbs statistics. ↩︎

  5. Metropolis, Nicholas, et al. “Equation of state calculations by fast computing machines.” The journal of chemical physics 21.6 (1953): 1087-1092. ↩︎

  6. Gregory Gundersen's blog post on the Metropolis-Hastings algorithm. ↩︎

  7. For our example we will consider periodic boundary conditions. ↩︎