# What Is Amdahl’s Law?

Amdahl’s law provides a best-case estimate of how much you can improve system performance by optimizing a specific part of the system. Here’s how to use it.

Written by Fred Mora
Published on Jul. 08, 2024

Gene Amdahl (1922–2015) was a computer scientist who specialized in hardware architecture. His best-known contribution to computer science is his eponymous law, proposed in 1967.

## Amdahl’s Law Definition

Amdahl’s law formalizes how much parallelism can speed up an application, describing how much you can improve the overall system performance by optimizing a certain part of the system.


## Amdahl’s Law Formula

Take a program that processes a known quantity of input data. Programs like this often have a sequential set of statements that are difficult to parallelize and a part that can be parallelized.

Let `P` be the portion of the code that can be parallelized. If we throw `N` computing units at the problem instead of a single thread, the speedup factor `S` that we can expect is given by the formula below.

`S = 1 / (1 - P + P/N)`
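In code, the formula is a one-liner. Here is a minimal sketch in Python (the function name is ours, not part of any standard library):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup when a fraction p of the work runs on n computing units.

    The remaining (1 - p) of the work stays sequential.
    """
    return 1.0 / ((1.0 - p) + p / n)

# If 75% of the runtime parallelizes and we use 3 units:
print(amdahl_speedup(0.75, 3))  # 2.0
```

Note that `p` is a fraction of *runtime*, not of lines of code; it has to come from measurement.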

## Benefits of Amdahl’s Law

A lot of modern computing problems are of the “embarrassingly parallel” variety, which is when a problem requires little to no effort to split into different parallel tasks. They consist of processing disjoint data elements in a way that doesn’t require access to the other elements.

It’s tempting to deploy more hardware to attempt to speed up execution.

Amdahl’s law helps by providing a quick estimate of the available trade-offs: it tells us what speedup factor we can achieve by devoting more hardware to a given portion of a computation chain.

Conversely, we can use the law to quickly estimate which section of a program we can profitably optimize. A code portion where sequential operations dominate will only give mediocre return on investment if we attempt to speed it up via parallelization.
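To make the diminishing returns concrete, here is a small worked example with hypothetical numbers: if only 20 percent of the runtime can be parallelized, even eight cores barely help.

```python
# Hypothetical profile: only 20% of the runtime is parallelizable.
p, n = 0.2, 8
s = 1 / ((1 - p) + p / n)
print(f"speedup with 8 cores: {s:.2f}x")  # about 1.21x

# Even infinitely many cores cap the speedup at 1 / (1 - p):
cap = 1 / (1 - p)
print(f"theoretical ceiling: {cap:.2f}x")  # about 1.25x
```

Eight times the hardware for roughly a fifth more throughput is rarely a good deal; the sequential 80 percent dominates.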

## Limitations of Amdahl’s Law

Running a code fragment in parallel on `N` processors rarely provides a linear `N`-fold speed increase. Scaling problems start appearing as you increase the number of parallel execution threads in a program.

The best case is when the runtime environment lets you map the threads of your programming model to actual physical central processing unit threads and cores, a feature that is not supported by all programming languages.

If you have more threads than the hardware threads available on your system’s processor(s), you’ll have to deal with context switches. This means that a thread is interrupted, its associated internal state is saved and its cache lines — the portions of the internal CPU cache used by the thread — are flushed, i.e., written back to memory.

The opposite operations are then performed for the new thread. Considering that main memory is orders of magnitude slower than the CPU’s caches and registers, each context switch introduces delays.

Additionally, memory bandwidth becomes a factor when many CPU cores are actively running in parallel. The data they process has to be loaded from random access memory and the results written back, at the very least. Any context switch means extra memory bandwidth consumed for swapping thread data.

This is why performance improvements from parallelization never achieve a linear gain even on the parallelized sections of the code. Amdahl’s law provides a best-case estimate.

As explained above, these limiting factors prevent real systems from achieving this best-case performance gain.

## Examples Using Amdahl’s Law

Suppose your program reads some data and then performs an independent calculation on each data item, a common case in image processing. Say loading the data takes one minute and processing it then takes another three minutes on a single CPU thread.

Each element, e.g., an image frame, is independent and can be processed separately. How much hardware would it take to speed up the execution by a factor of two?

The image processing part is the `P` in the formula. It represents three minutes out of four. So `P` is 3/4, or 0.75. The desired speedup factor is `S = 2`. If we use the above formula and solve for `N`, we get the formula below.

`N = P / (P - 1 + 1/S) = 3`

Here, we throw three processors at the problem and reduce the data processing part from three minutes to one. The data loading, however, still takes one minute. We tripled our hardware costs but only doubled our speed.
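The algebra above can be checked in a few lines. This sketch solves Amdahl’s law for `N` (the function name is ours):

```python
def units_needed(p: float, s: float) -> float:
    """Computing units needed to reach overall speedup s when a
    fraction p of the work parallelizes.

    Derived by solving S = 1 / ((1 - P) + P/N) for N.
    """
    return p / (p - 1 + 1 / s)

# 75% parallelizable, target 2x speedup:
print(units_needed(0.75, 2))  # 3.0
```

Note that the denominator goes to zero (and `N` to infinity) as the target `s` approaches `1 / (1 - p)`; no finite hardware reaches that ceiling.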

The law formalizes an intuitively correct observation: If you take the part of your code that can be optimized and speed it up infinitely, you will still have to deal with the execution time of the sequential part, which will become the bottleneck.

It’s instructive to trace how the speedup factor `S` changes as `P` goes from zero to one, and to explore the effect of `N` computing units. If `N` becomes very high (or tends to infinity, in math jargon), `S` will asymptotically tend to `1/(1 - P)`, which is four in our example.

The loading part, which is one-fourth of the total time, will not go away even if the rest of the code executes infinitely fast, that is, in zero seconds.
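A short loop makes the asymptote visible for the example’s parallel fraction of 0.75:

```python
p = 0.75  # parallel fraction from the example above
for n in (1, 2, 4, 16, 256, 1_000_000):
    s = 1 / ((1 - p) + p / n)
    print(f"N = {n:>9}: speedup {s:.4f}")
# The speedup creeps toward 1 / (1 - p) = 4 as N grows,
# but never reaches it: the sequential quarter always remains.
```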

## How Amdahl’s Law Impacts System Architecture

Memory bandwidth bottlenecks are only one performance consideration. Latency introduced by mass storage and networks is another dominant factor.

For example, even with the best possible index, accessing a database table requires an irreducible number of I/O operations, each with its storage or network latency.

To overcome this limitation, system architects can either increase the I/O capacity of a system by adding more hardware channels between the memory and external I/O devices, or scale the system horizontally by adding distinct physical hosts.

Splitting a system between distinct hosts provides no benefits if the hosts have to be tightly coupled. Communication latency, even with fast interconnect, dwarfs memory access latency.

Such horizontal scaling only makes sense if hosts can process data records or pieces independently. You can achieve this through “sharding,” a commonly used mechanism that splits data into fragments that can be processed independently.

For example, an airplane passenger database could be split by the first digit of the ticket number, providing ten shards and a theoretical tenfold performance increase.
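A routing function for this hypothetical scheme is trivial, which is part of sharding’s appeal. A minimal sketch, assuming ticket numbers are digit strings:

```python
def shard_for(ticket_number: str) -> int:
    """Route a passenger record to one of ten shards by the first
    digit of the ticket number (hypothetical scheme from the text)."""
    return int(ticket_number[0])

print(shard_for("3140159265"))  # shard 3
```

In practice, a hash of the key is usually preferred over a leading digit, because real-world identifiers are rarely uniformly distributed across their first character.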

Persistent storage to hard drives, either spinning disks or solid-state drives, introduces large latencies in a system. Parallelization can help to an extent.

For example, RAID 1 drive arrays store the same data on multiple drives, called mirrors, to allow faster reads. When accessing data, the system can spread read requests among the mirrored drives, speeding up the data access part of a program.


## How to Optimize Your Systems With Amdahl’s Law

Devoting even infinite resources to a portion of a processing chain will do nothing to help speed up the rest. The first step to optimize a complex system is to observe it carefully and ensure we understand where it spends its time.

The answer is often surprising.

Adding metrics in the code and making latency observable is the first step. Exploiting these metrics can require a non-trivial setup.

The most basic method consists of periodically dumping metric data to storage. In an enterprise environment, coders often have access to monitoring and telemetry systems that they should use to understand the behavior of their own code.

Barring this, developers will not know what to optimize. Or worse, they’ll spend effort blindly improving sections of code that, once measured, turn out not to be a determining performance factor, a practice often decried as “premature optimization.”

Once equipped with metrics, developers should try to picture their code paths as a sequence of time intervals.

For example, say a request comes in from the network. Processing this request creates a number of database accesses, followed by some calculation loops.

Each of these operations could have its own metric. It is helpful to identify the longest code paths, decompose them into separate computing and latency time intervals, then zero in on the best candidates for optimization.
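One lightweight way to get such per-stage timings is a context manager around each interval. A minimal sketch (stage names and workloads are stand-ins, not a real request handler):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent in a named stage of the code path."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Hypothetical request processing, broken into measurable intervals:
with timed("db_access"):
    time.sleep(0.05)  # stand-in for database queries (latency-bound)
with timed("compute"):
    sum(i * i for i in range(100_000))  # stand-in for calculation loops

# Report stages, longest first, to spot optimization candidates:
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>10}: {seconds * 1000:.1f} ms")
```

With these numbers in hand, the parallel fraction `P` for Amdahl’s law is no longer a guess but a measurement.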