Information: from art to entropy
A brief introduction to Shannon’s information theory
The First World War seemed to plunge the world into a collapse of civilisation. The ideals of the Enlightenment, international agreements, and humanist values were destroyed overnight, leaving society feeling disenchanted and robbed of its dreams. In this dystopian scenario, new ideas emerged, proposing radical changes to our understanding of the world and challenging long-held certainties in the arts, philosophy and science. One of these was modern art, which rejected academic conventions and explored new forms of expression.
In 1915, Russian artist Kazimir Malevich presented a piece that would profoundly influence the development of a novel aesthetic in 20th-century art. The painting in question — a simple black square on a white background — broke radically with everything that had come before. Despite its minimalist appearance, the work provoked bewilderment and controversy, challenging both the public and critics and breaking with centuries of artistic tradition based on the representation of the visible world.
Malevich considered the painting, entitled “Black Square”, to be the “zero point” of art, representing its reduction to the purest and most essential form — devoid of object, narrative, perspective or representation. Through this work, Malevich proposed a definitive break with traditional figurative art, inaugurating a new visual language which he termed Suprematism. For Malevich, Suprematism represented the “supremacy of pure sensibility”: an attempt to express emotions and inner states not through recognisable images, but through basic geometric shapes and colours. In essence, it was a way of taking art to its conceptual limit by reducing painting to a minimum of visual information.
Suprematism (Russian: супрематизм) was an artistic movement that originated in Russia. It is characterised by its use of basic geometric shapes, particularly squares and circles, and is widely regarded as the first organised movement of abstract painting in modern art.
Malevich’s work contained a fundamental ingredient: the absence of information. When we compare “Black Square” and subsequent minimalist works with Renaissance pieces by artists such as Raphael Sanzio and Leonardo da Vinci, a key difference emerges. The Renaissance celebrated richness of detail, technical mastery of perspective and faithful depiction of the visible world; minimalism and its precursors concentrated on formal reduction, eliminating ornamentation in search of a pure aesthetic expression that was often abstract and impersonal. Where Renaissance art prized technical complexity and the imitation of nature, minimalism deliberately rejects illusion, narrative and symbolism. Instead, it proposes art that represents nothing but itself: form, colour and space in their most basic state. In this radical shift, the emptying of content becomes a powerful aesthetic statement.
To gain a better understanding of the differences between these art schools, let’s consider an artwork that came after Malevich’s: Yves Klein’s “Monochrome Painting” (1962). It is simply a blue painting. In other words, it is easy to imagine. We just need to close our eyes to picture it. We don’t even need to see the painting to know what it looks like, but it is reproduced below.
Now, let’s take a look at Raphael Sanzio’s 1511 painting, “The School of Athens”: an arched fresco in the Greco-Roman style with Corinthian columns, Roman arches, and a central dome that creates a sense of depth and grandeur, making it seem as if we are inside a classical temple or Roman palace. In the centre of the composition, under the main arch, two men walk side by side. On the left is a man with a white beard and long hair: Plato. He wears a red and purple tunic and raises the index finger of his right hand towards the sky, symbolising the world of ideas. In his left hand, he holds his book, the Timaeus…
There are many other details we could describe about this scene: dozens of characters, gestures, expressions and symbolic relationships. Clearly, imagining this work from its description is much more difficult than imagining Yves Klein’s blue painting. In other words, while we can generate a vast number of different images based on the description of “The School of Athens”, Yves Klein’s work generates only one. We can verify this idea in practice using artificial intelligence by generating images through text commands (prompts) in systems such as ChatGPT or DALL·E. Below are two examples generated with ChatGPT from a very simple prompt, one that is nevertheless sufficient to produce different interpretations. In the case of the blue painting, the generated images are essentially identical, so there is no need to reproduce them here.
Therefore, while a description of a blue painting generates only one image in our minds (perhaps a few if we consider different shades of blue), a more detailed description, such as that of a Raphael painting, can generate an unlimited number of possibilities. This is because Raphael’s work contains far more visual, symbolic and narrative information than Klein’s. While minimalism reduces things to their essentials, Renaissance art expands on details, layers and meanings. Thus, when considering these examples, we gain a clearer understanding of what is meant by “information”.
“The future is uncertain… but this uncertainty is at the heart of human creativity.” — Ilya Prigogine.
Information and uncertainty
Put simply, information is related to the variety, complexity and unpredictability of the elements present in a system or environment. The more possibilities and interpretations an image offers, the more information it contains. Therefore, we can say that the amount of information is related to the level of uncertainty — the more difficult it is to predict an outcome (or to interpret an image), the more information that outcome contains.
For example, suppose you flip a coin. If it is perfectly balanced, there are two possible outcomes: heads or tails. As the uncertainty is small, the amount of information resulting from the coin toss is also small. Now imagine you roll a six-sided die — the uncertainty increases because there are six possible outcomes, so each roll provides more information. It is harder to predict the outcome of rolling a die than flipping a coin, as there are more possible outcomes and therefore greater uncertainty. If we try to guess a random word from a 100,000-entry dictionary, the uncertainty (and therefore the amount of information) will be even greater. Therefore, the greater the number of possibilities, the greater the amount of information involved.
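To make this comparison concrete, here is a minimal Python sketch (our illustration; the article itself presents no code) that measures the uncertainty of each experiment as the base-2 logarithm of its number of equally likely outcomes, anticipating the formula introduced further below.

```python
import math

# Number of equally likely outcomes in each experiment
experiments = {
    "fair coin": 2,
    "six-sided die": 6,
    "word from a 100,000-entry dictionary": 100_000,
}

for name, outcomes in experiments.items():
    # With equally likely outcomes, the information content is log2(N) bits
    bits = math.log2(outcomes)
    print(f"{name}: {bits:.2f} bits")

# fair coin: 1.00 bits
# six-sided die: 2.58 bits
# word from a 100,000-entry dictionary: 16.61 bits
```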
“Information is the resolution of uncertainty.” — Claude Shannon.
Similarly, when looking at a simple image such as Yves Klein’s blue painting, there are few possible outcomes for our interpretation. In contrast, in “The School of Athens”, with its dozens of characters, symbols, and relationships, there is a huge range of possibilities to consider, and therefore a high informational content. We could say that interpreting the blue painting is like guessing the outcome of a single coin toss: there are very few possibilities. The Renaissance painting, on the other hand, is like guessing the outcome of flipping 100 coins: a task with vastly more possible outcomes, and therefore much more difficult.
When we compare an image generated by our imagination with the original scene, the greater the amount of information contained in the real scene, the greater the surprise — and, consequently, the greater the difference between what we imagine and what is actually present. Thus, we can associate the concept of information with the degree of surprise caused by the outcome of an experiment.
Therefore, according to this line of reasoning, so-called “mind readers” — often advertised with great enthusiasm as devices capable of decoding thoughts — would not really be able to reconstruct a mental image accurately. The most they could achieve would be an approximation, one among countless plausible possibilities. It would be as we saw above: the image generated would be just one among many possible ones.
This limitation is directly related to Immanuel Kant’s philosophy. In his work, “Critique of Pure Reason”, he argues that we do not have direct access to the world itself (the noumenon) — that is, reality as it exists independently of our perception — but only to the representations that our mind constructs from sensory experience. In other words, we do not perceive the world objectively; rather, it is presented to us through cognitive structures that shape our perception. A useful analogy would be to imagine that we are constantly wearing blue-tinted glasses, so everything we see has a bluish hue, regardless of the actual colours of the objects. Similarly, we never perceive things as they really are, but only as they appear to us. Therefore, all knowledge is conditioned — filtered through the categories of the human mind.
Similarly, we do not have direct access to the images generated by our thoughts, or our internal world of ideas. Everything we observe or decode, whether from the external world or from within ourselves, is always a representation mediated by cognitive structures or interpretive languages. This is largely due to the vast quantity of information present in the external world and in mental processes. Therefore, accurately reading thoughts will always be challenging — it may even be impossible to know exactly what is going on in another person’s mind. Unless, of course, they are imagining a blue picture.
“Imagination often takes us to worlds that never existed, but without it, we go nowhere.” — Carl Sagan.
The idea of associating information with the level of surprise led to the development of a theory that quantifies the information contained in a signal: Information theory.
Information theory
In the mid-20th century, Claude Shannon — an engineer and mathematician working at Bell Labs at the time — published an article entitled “A Mathematical Theory of Communication” (1948). In this work, he proposed an abstract model in which a transmitter encodes information into a signal that travels through a noisy channel to be decoded later by the receiver. In other words, imagine a source of information, such as a message or text, that needs to be sent from one point to another. This message is converted into a signal, such as electrical impulses, sound waves or digital signals, which travels through a transmission medium, such as a fibre optic cable or the air. During this journey, the signal can be corrupted by noise and interference, making it difficult for the receiver to interpret the original message. Shannon’s work aimed to formalise this process by defining how to measure the amount of information and establishing limits for reliable transmission in noisy channels.
The central concept of Shannon’s theory is uncertainty, or information entropy. This probability-based metric measures how much surprise (information) is contained in a message, regardless of its meaning. This enables us to quantify and optimise data transmission. The more unlikely or unexpected a result is, the more information it conveys. Note that this idea is analogous to the one we discussed earlier.
To measure the amount of information in a signal, Shannon defined a quantity called entropy, from the Greek word for “transformation”. Shannon entropy is defined as the sum of the surprise (or amount of information) associated with each possible outcome, weighted by the probability of its occurrence — meaning that the most probable outcomes have the greatest influence on the final value. The value obtained from this sum represents the average uncertainty before observing the outcome of a random experiment.
Mathematical formalism
In mathematical terms, Shannon entropy is described by the following equation:

$$H(X) = -\sum_{x} P(X = x)\,\log P(X = x)$$
where X is a random variable and P(X=x) is the probability of observing the value x. A random variable is nothing more than a function that associates the possible values of an experiment with real numbers. For example, a variable can represent the values that appear on the upper face of a die or even a person’s weight. In the case of the die, P(X = 4) = 1/6, since the die has six faces. In the equation, if we use base 2 in the logarithm function, the information will be measured in bits.
For example, when tossing a fair coin, P(X = x) = 1/2, where X = 1 if the coin lands on heads and X = 0 if it lands on tails. The Shannon entropy associated with this toss is therefore:

$$H(X) = -\left(\tfrac{1}{2}\log_2 \tfrac{1}{2} + \tfrac{1}{2}\log_2 \tfrac{1}{2}\right) = 1 \text{ bit}$$
Thus, when we toss a fair coin, we obtain one bit of information. This means that a single binary question is sufficient to find out which side came up. For example, “Did it come up heads?” The answer — “yes” or “no” — contains one bit of information, which is enough to eliminate the uncertainty between two equally probable possibilities.
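As an illustration of the calculation (a sketch of ours, with a hypothetical helper name `shannon_entropy`), the same result can be reproduced in a few lines of Python:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin: two outcomes, each with probability 1/2
print(shannon_entropy([0.5, 0.5]))  # 1.0 -> one bit of information
```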
More generally, if we have a die with N sides, the entropy will be given by the formula:

$$H(X) = -\sum_{k=1}^{N} \frac{1}{N}\,\log_2 \frac{1}{N} = \log_2 N$$
Therefore, the greater the number of sides on a die, the greater its entropy, and the more difficult it is to predict the value on the uppermost face after a roll. For a six-sided die (P(X = k) = 1/6, k = 1, …, 6), H = 2.58 bits; for an eight-sided die (P(X = k) = 1/8, k = 1, …, 8), H = 3 bits. A die with more sides therefore has greater uncertainty and unpredictability in its outcome, and carries more information on average.
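Reusing the `shannon_entropy` helper sketched above, the values quoted for the dice can be checked directly:

```python
# Fair six-sided die: H = log2(6)
print(shannon_entropy([1/6] * 6))  # ≈ 2.58 bits

# Fair eight-sided die: H = log2(8)
print(shannon_entropy([1/8] * 8))  # 3.0 bits
```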
In the case of a coin, if we assume that the probability of it landing on heads is p, the entropy is given by the binary entropy function H(p) = −p log₂ p − (1 − p) log₂(1 − p), whose graph peaks at p = 0.5: a fair coin offers the highest level of information. This is because if p = 1, the coin has two heads and would always land the same way up, so the result would be predictable — in other words, the information content would be zero. We would not need to ask a question to know the result of the toss. The same applies if p = 0 (a coin with two tails). The more balanced the probabilities between the sides, the greater the uncertainty and, consequently, the greater the amount of information generated by the result of the toss.
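The claim about the biased coin can be checked numerically. The sketch below (again our own illustration, with a hypothetical `binary_entropy` function) evaluates the entropy of a coin that lands on heads with probability p:

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin that lands on heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain, so there is no information to gain
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")

# The entropy peaks at 1 bit for p = 0.5 and falls to 0 at p = 0 and p = 1.
```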
Let’s gain a more intuitive understanding of what lies behind Shannon’s formula. Since the more unpredictable the outcome of a random experiment is, the greater the amount of information it generates, we define the amount of information (or surprise) I(x) associated with an event as follows:

$$I(x) = \log_2 \frac{1}{P(X = x)} = -\log_2 P(X = x)$$
In this equation, note that the closer the probability P(X=x) is to zero, the greater the value inside the logarithm function — which means that rare events carry more information. On the other hand, when P(X=x) = 1, the logarithm function returns zero (log(1) = 0), indicating that no new information is transmitted, as there is no surprise. Shannon entropy is nothing more than the expected value of I(x). Thus,

$$H(X) = \mathbb{E}[I(X)] = \sum_{x} P(X = x)\, I(x) = -\sum_{x} P(X = x)\,\log_2 P(X = x)$$
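To see this “surprise” behaviour in numbers, here is a short illustrative snippet (ours, with a hypothetical `surprise` function) that evaluates I(x) for events of decreasing probability and then checks that the probability-weighted average of the surprises recovers the entropy of a fair die:

```python
import math

def surprise(p):
    """Information, in bits, carried by an event that occurs with probability p."""
    return math.log2(1 / p)

for p in (1.0, 0.5, 0.1, 0.01):
    print(f"P = {p:<4} ->  I = {surprise(p):.2f} bits")
# A certain event (P = 1) carries 0 bits; a rare event (P = 0.01) carries about 6.64 bits.

# Entropy is the expected surprise: for a fair six-sided die,
probs = [1/6] * 6
H = sum(p * surprise(p) for p in probs)
print(f"H(fair die) = {H:.2f} bits")  # ≈ 2.58 bits
```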
Shannon entropy can be generalised to measure the relationship between variables. In this case, we can define conditional entropy and the concept of mutual information:

$$H(Y \mid X) = -\sum_{x,\,y} P(x, y)\,\log_2 P(y \mid x)$$

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)$$
Mutual information is a measure of the amount of information shared between two variables, X and Y. It is calculated as the sum of the individual entropies of each variable, minus their joint entropy, which accounts for the redundancy present when both are considered together. For example, suppose X represents a meteorologist’s rain forecast and Y represents whether or not it actually rains on a given day. If the forecast is very accurate, knowledge of X significantly reduces uncertainty about Y, indicating high mutual information between the two variables.
Therefore, mutual information is a measure of dependence between variables: the higher the value of I(X; Y), the greater the reduction in uncertainty about one variable when knowledge of the other is considered. When there is no relationship between the variables, the mutual information value is zero.
Note that I(X; Y) will only be equal to zero when X and Y are independent — that is, when knowledge of the value of X provides no additional information about Y, and vice versa. For example, suppose that:
- X is the result of rolling a fair die (values from 1 to 6)
- Y is the result of tossing a fair coin (heads or tails).
These two experiments are completely independent of each other: knowing the result of the die roll does not affect the likelihood of the coin landing on heads or tails. Therefore, I(X; Y) = 0.
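As an illustration (a sketch of ours; the joint distributions and the `mutual_information` helper are invented for the example), mutual information can be computed directly from a joint probability table. Below, the independent die-and-coin pair yields I(X; Y) = 0, while a toy forecast-versus-rain table, in which the forecast is right 90% of the time, shares about half a bit of information with the weather:

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits, from a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Independent fair die (1..6) and fair coin ("H", "T"): every pair has probability 1/12
die_and_coin = {(k, side): 1 / 12 for k in range(1, 7) for side in ("H", "T")}
print(mutual_information(die_and_coin))  # 0 bits (up to floating-point noise)

# Toy forecast vs. actual weather: the forecast is correct 90% of the time
forecast_and_rain = {
    ("rain forecast", "rain"): 0.45, ("rain forecast", "dry"): 0.05,
    ("dry forecast", "rain"): 0.05,  ("dry forecast", "dry"): 0.45,
}
print(mutual_information(forecast_and_rain))  # ≈ 0.53 bits of shared information
```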
This definition of mutual information is fundamental to several areas, including causality theory. This is because it allows us to quantify how much knowledge of one variable reduces uncertainty about another. In causal models, particularly those based on Bayesian networks and non-parametric causal inference, mutual information can be employed to identify dependencies between variables, thereby helping to distinguish correlation from causation. While high mutual information does not imply a causal relationship, it can suggest a path of dependence that merits further investigation using suitable methods, such as Judea Pearl’s formalism or conditional independence tests.
Information and physics
Although the concept of entropy has since become central to probability and data analysis, it originally came from physics: Ludwig Boltzmann gave it its statistical formulation in the 1870s, within thermodynamic theory, and in this form it is known as Boltzmann entropy. In this context, entropy measures the degree of disorder or randomness in a physical system. More specifically, it quantifies the number of possible microstates — that is, the different ways in which the particles of a system can organise themselves — that correspond to the same observable macrostate, such as temperature and pressure.
The greater the number of microstates compatible with a macrostate, the greater the system’s entropy. In other words, highly organised systems have low entropy, while more disordered systems have high entropy. Boltzmann’s famous equation expresses this directly:

$$S = k \ln W$$
where S is entropy, k is Boltzmann’s constant, and W is the number of microstates compatible with the macrostate.
Put simply, the greater a system’s entropy, the more disorganised it is. This is why, when we open a bottle of perfume in a room, the scent spreads — the perfume molecules, which were previously concentrated in a small space, begin to move around freely and fill all the available space. This increases the disorder of the system, i.e. its entropy. The molecules become more evenly distributed, corresponding to a much larger number of possible microstates — that is, a much larger number of ways in which the molecules can organise themselves. From a thermodynamic point of view, this spreading is natural because it tends towards the most probable states, which are the most disordered. Therefore, entropy tends to increase over time in isolated systems, as stated in the Second Law of Thermodynamics.
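To put rough numbers on this picture, here is a deliberately simplified sketch of ours (the model, which treats each molecule as occupying either half of the room, is an assumption made purely for illustration): letting N molecules spread from one half of the room into the whole room multiplies the number of microstates W by 2^N, so Boltzmann’s formula gives an entropy increase of N·k·ln 2.

```python
import math

k_B = 1.380649e-23  # Boltzmann's constant, in joules per kelvin

# Simplified model: each molecule may occupy either half of the room.
# Confined to one half, there is 1 choice per molecule; spread out, there are 2.
N = 6.022e23  # roughly one mole of molecules, chosen purely for illustration

# S = k * ln(W); spreading multiplies W by 2**N, so the entropy change is:
delta_S = N * k_B * math.log(2)
print(f"Entropy increase: {delta_S:.2f} J/K")  # about 5.76 J/K
```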
“All things physical are information-theoretic in origin and this is a participatory universe… Observer-participancy gives rise to information; and information gives rise to physics.” — John Archibald Wheeler.
Although the two formulas appear distinct, Shannon’s can be understood as a generalisation of Boltzmann’s. When the microstates of a physical system are equally probable — that is to say, when each microstate has the same probability of occurring — the two formulas coincide, up to the constant k and the base of the logarithm. The concepts involved are nonetheless different: Boltzmann entropy is a quantity from statistical mechanics that relates a system’s entropy to the number of its microstates, whereas Shannon entropy measures the uncertainty, or “information content”, of a message or system.
Information content
Thus, we can conclude that Shannon’s theory provides a powerful, quantitative approach to analysing uncertainty in data. However, this definition does not consider the meaning of information; it only considers its statistical structure. This raises the question: is this the only definition of information, or can we use other theories?
Throughout the 20th century, various other conceptions of information emerged, each adapted to specific contexts. To address semantic content, for instance, approaches such as semantic information theory have been proposed. This theory seeks to quantify the informational value of a message in relation to whether it is true or false. Unlike Shannon’s information theory, this theory incorporates meaning, not just the structure and transmission of signals.
Another important concept is that of Kolmogorov complexity, also known as algorithmic information. This measures the informational content of an object, such as a sequence of bits, based on the size of the smallest program capable of generating it. Unlike Shannon, who works with probability distributions, Kolmogorov considers individual cases and explores the concepts of compressibility and randomness. Consider, for example, two binary sequences of the same length:
- 1010101010101010
- 1100110010101110
The first pattern is simple: it alternates between 1 and 0. A very short program could describe it: “repeat ‘10’ eight times”. This short description generates the complete sequence, so the sequence has low Kolmogorov complexity. Conversely, the second sequence appears random, with no obvious pattern. The smallest program that could generate it would probably have to include the entire sequence explicitly: “print ‘1100110010101110’”. This means that it cannot be compressed or summarised — the minimum description is practically the same size as the sequence itself. Therefore, this sequence has a higher Kolmogorov complexity than the first one. This measure helps us to understand the structure and implicit “order” in the data, going beyond the statistical averages of Shannon’s theory.
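Kolmogorov complexity itself cannot be computed exactly, but compression provides a practical upper bound. As a rough illustration (our sketch, not part of the article), the snippet below compresses longer versions of the two kinds of sequence with Python’s standard zlib module: the repetitive pattern collapses to a handful of bytes, while a random one barely shrinks at all.

```python
import random
import zlib

# A highly regular sequence: the byte pattern 10101010 repeated 1024 times
regular = bytes([0b10101010]) * 1024

# A random sequence of the same length (seeded so the example is reproducible)
random.seed(0)
noisy = random.randbytes(1024)

for name, data in [("regular", regular), ("random", noisy)]:
    compressed = zlib.compress(data)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes after compression")

# The regular sequence compresses enormously; the random one hardly compresses
# (it may even grow slightly), mirroring their different Kolmogorov complexities.
```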
What is information?
The various definitions of the concept of information reflect the difficulty of capturing and defining what “information” means in a unique and universal way, given that its meaning varies depending on the context in which it is applied. Indeed, like time, information is something that everyone recognises, yet it defies a simple, unambiguous definition.
“What is time? If no one asks me, I know; if I want to explain it to a questioner, I don’t know.” — Saint Augustine, in ‘Confessions’.
Nevertheless, even with its limitations, Shannon’s entropy theory transformed our understanding of information and how it is quantified, providing a robust mathematical foundation for fields such as communication, data encoding, and signal processing. His concept transformed the abstract notion of information into a concrete, operational measure, driving the development of essential modern-world technologies. Without Shannon’s entropy, neither the internet nor advanced tools such as ChatGPT would exist.
Although it is fundamental from both scientific and philosophical perspectives, defining the concept of information remains challenging. Information is ubiquitous in the universe — it has existed since the beginning of time and permeates all aspects of our existence. It is found in the physical laws that govern the cosmos, in the genetic codes that shape living beings, and in the daily signals we use to communicate thoughts, emotions, and knowledge. Understanding information is ultimately understanding our own existence.
To find out more:
- The Seven Ages of Information, Francisco Rodrigues, Medium.
- The Information: A History, a Theory, a Flood, James Gleick.
- Information Theory, Inference, and Learning Algorithms, David MacKay.
- A Mini-Introduction To Information Theory, Edward Witten.
