The Power of Explanation

What I’m hoping to do over the course of several posts is to lay out a foundation for a new way of thinking about ethics and reality, definitively casting a vote in favor of one interpretation of reality. My goal is to persuade popular opinion in favor of what is currently considered a niche outlook.

How does one move an idea from the fringes to mainstream, though? Why should you change your mind about how you think about reality? This question of persuasion seems like an important place to start.

In a lot of ways, I am retracing the footsteps of David Deutsch, the British physicist who penned a couple of books around the topic of reality. Much of my own thinking on reality and explanation is derivative of his. So let’s start at a similar place, then, with a movement towards understanding the role of explanation, of narrative, as a method of transmitting and cementing paradigmatic thought.

A Survey of Theories about Information

I tweeted at @vgr a few days back about Deutsch’s theory of explanatory power, and he responded with a list of some other theories of information. 

vgr’s informative list of information theories

Admittedly, Venkat doesn’t (yet) have a good grasp of what I mean by Deutsch’s theory of the power of explanation, so the theories he offers up aren’t exactly comparable. But since these are the theories of information that he’s using to judge how to change his mind, it feels in scope to at least walk through what they can tell us about information, and how humans process it.

As a way of classifying these theories (Kuhn’s paradigm shifts, Schmidhuber’s compression progress, Occam’s razor, Deutsch’s explanatory power, Kolmogorov information), we can roughly divide them into two groups: 1) those concerning the measurement and encodeable size of information, and 2) those concerning the validity of an explanation. The first group, in other words, classifies ideas and thoughts based on how many bits I need to send you in order to wholly and completely communicate an idea (measurement and density of encodeability); the second deals with what information I need to present you with, and possibly in what order or framework, in order to change your mind about a topic (explanatory validity).

With these two buckets and a bit of explanation as to what the theories entail, we can now categorize these quite effectively.

Let’s start with Kolmogorov. In a way, Kolmogorov wholly defines the first category of theories: the encodeability of an idea. What does it mean, though, to be able to ‘encode’ an idea? The classic, computer science focused explanation usually involves saying something along the lines of “consider a photograph”. You can either represent it as a matrix of color points or in JPG format. The first, or matrix representation, often takes up orders of magnitude more space, in terms of bits, than the ‘condensed’ JPG format. Kolmogorov was concerned with the ultimate limit of this: taking a complex idea such as a photo and representing it in the smallest number of bits, which he framed as the length of the shortest program that reproduces it. You can then judge information or an idea based on how compactly it can be expressed. The ‘complexity’ of an idea is measured by how few bits you need to transmit over the wire. The lower the number of bits needed to represent it, the lower its stated complexity.
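To make "judging an idea by its encoded size" a little more concrete, here is a minimal Python sketch. True Kolmogorov complexity can't be computed in general, so this uses zlib's compressed size as a rough stand-in (that substitution is my assumption, not something Kolmogorov proposed), and the two byte buffers are made-up data rather than a real photograph.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Bytes used by zlib at its highest level: a crude upper bound, not true Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

# A highly regular "image": one 16-byte row pattern repeated 4096 times (hypothetical data).
regular = bytes(range(16)) * 4096
# An equally long buffer of random bytes, with no pattern for the compressor to exploit.
noisy = os.urandom(len(regular))

regular_size = compressed_size(regular)
noisy_size = compressed_size(noisy)

# Raw size, then the compressed size of each buffer.
print(len(regular), regular_size, noisy_size)
```

Both buffers are the same length on disk, but the regular one shrinks to a small fraction of its raw size while the random one barely shrinks at all. In the terms used here, the regular buffer is the ‘simpler’ idea.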

Schmidhuber took this concept of the compressibility of information and developed several examples of how ‘low Kolmogorov complexity’ ideas can still, in reality, be quite complex.

As an aside, my favorite example of these is his Femme Fractale, a recipe for a drawing that, when executed, creates a set of intersecting lines. Schmidhuber then goes on to explain how insight or creativity can be derived from this relatively simple pattern, eventually highlighting one particular pathway that, to him, is evocative of women’s silhouettes. The syntactic expression of the original drawing (top left, below) is as follows: “The frame is a circle; its leftmost point is the center of another circle of the same size. Wherever two circles of equal size touch or intersect are centers of two more circles with equal and half size, respectively. Each line of the drawing is a segment of some circle, its endpoints are where circles touch or intersect”.

Tracing femmes in a low Kolmogorov complexity fractal, source http://people.idsia.ch/~juergen/femmefractale.html

Schmidhuber’s compression progress builds upon Kolmogorov’s idea of information compressibility. Schmidhuber’s contribution was the insight that the extent to which an idea can be compressed is dependent upon the existing context which an information storage system has accumulated or that exists between communication partners. For example, when developing an encoding mechanism to use between two parties, the density of the encoding that is possible is based on how accurately you can predict, given a pattern of input or output, what any series of bits expands to represent. In the Femme Fractale example above, you as the decoder can “predict” what a circle looks like, so the English description doesn’t need to encode a definition of a circle. The definition of circle is a part of your shared context.

In an incredibly general sense, compression algorithms, then, are a manner of building and then transmitting data within a shared context. As the shared context becomes more descriptive, the encoding required for an idea decreases.
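One rough way to see shared context at work is zlib's preset dictionary, which both the sender and the receiver must already hold before any message is sent. In this sketch the `context` and `message` strings are borrowed from the Femme Fractale description, and the helper names are my own inventions; treat it as an illustration under those assumptions, not as how any particular system negotiates its context.

```python
import zlib

def compress_with_context(message: bytes, context: bytes = b"") -> bytes:
    """Compress a message, optionally priming zlib with a shared context (preset dictionary)."""
    if context:
        compressor = zlib.compressobj(level=9, zdict=context)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(message) + compressor.flush()

def decompress_with_context(blob: bytes, context: bytes = b"") -> bytes:
    """Decompress a message; the receiver must supply the same context the sender used."""
    if context:
        decompressor = zlib.decompressobj(zdict=context)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(blob) + decompressor.flush()

# What both parties already "know" before the message is sent.
context = b"The frame is a circle; its leftmost point is the center of another circle of the same size."
# The idea we want to get across the wire.
message = b"its leftmost point is the center of another circle of the same size"

without_context = compress_with_context(message)
with_context = compress_with_context(message, context)

# Raw length, then the encoded length without and with the shared context.
print(len(message), len(without_context), len(with_context))
```

The message is identical in both cases; only the shared context differs. With the context in place, the encoder can mostly say "the thing we both already have" instead of spelling the words out again.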

Schmidhuber proposes that as the amount of information that a system has seen increases, the amount of space or number of bits that an encoder needs in order to transmit or store a new idea decreases. This is ‘compression progress’, or the ability to more greatly compress ideas as you progress deeper into a context.
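Here is a toy sketch of that shrinking cost. It is not Schmidhuber's formalism: I'm assuming a very simple learner that only counts how often it has seen each character, and I charge it -log2(p) bits for each new character given everything it has seen so far. The function name and the repeated "circle " stream are hypothetical.

```python
import math
from collections import Counter

def adaptive_cost_bits(text: str, alphabet_size: int = 256) -> list:
    """Bits needed for each character under a model that keeps learning as it reads."""
    counts = Counter()
    seen = 0
    costs = []
    for ch in text:
        # Laplace-smoothed probability estimated from everything observed so far.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        costs.append(-math.log2(p))
        # Only after paying for the character does the model update its context.
        counts[ch] += 1
        seen += 1
    return costs

# A stream that keeps repeating the same seven-character "idea".
stream = "circle " * 200
costs = adaptive_cost_bits(stream)

first_cost = sum(costs[:7])   # bits to encode the first "circle "
last_cost = sum(costs[-7:])   # bits to encode the 200th "circle "
print(round(first_cost, 1), round(last_cost, 1))
```

The same seven characters cost far fewer bits the two-hundredth time than the first, because the model's accumulated context has made them predictable. A richer learner would pick up on whole words and patterns rather than single characters, but the direction of travel is the same.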

If you’re not incredibly up to speed on computer science primitives and what it means for an image to be ‘represented as a matrix’, there’s another, less pure yet more revelatory example I can present you with: Dawkins’s term ‘meme’.

As I mentioned earlier, there’s a bit of nuance here with information encoding measurement: all information encoding requires a decoder. How tightly you can pack information is a function of the pre-negotiated symbolism between two parties. An alphabet, for example, encodes information more or less uniformly, through the construction of words. These words themselves encode meaning, however, such as ‘meme’. In terms of information density, broadcasting the word ‘meme’ costs very little. It’s four ASCII letters. On a typical late-2010s transmission line, at one byte per letter, we can get that idea ‘across’ a wire in 32 bits, without compression (ignoring any transmission control or protocol data).

These 32 bits, however, are only enough to transmit the word itself. The concept of what I mean by ‘meme’ is assumed to be encoded already within the recipient of the information. There’s already a shared understanding between the message sender and the message recipient.

Is it possible to encode the bigger question, namely ‘what is a meme’, in 32 bits? That, again, depends on the existing, shared context of the two parties wishing to communicate. If I have to explain what a meme is, in English, that will most definitely require more than 32 bits of information, given that at the least it requires a sentence or two of explanation. If I need to explain it in German, it’ll take even more bits, as I’ll need to do a fair amount of translation in addition.
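The arithmetic is easy to check. In this small sketch, the one-sentence explanation is a hypothetical stand-in of my own (not Dawkins's wording), and sizes are counted at 8 bits per ASCII character with no compression.

```python
def ascii_bits(text: str) -> int:
    """Uncompressed size in bits, at 8 bits per ASCII character."""
    return len(text.encode("ascii")) * 8

word = "meme"
# Hypothetical explanation for a reader who lacks the shared context.
explanation = "A meme is an idea or behavior that spreads from person to person within a culture."

word_bits = ascii_bits(word)
explanation_bits = ascii_bits(explanation)

# The word alone is 32 bits; the explanation is many times larger.
print(word_bits, explanation_bits)
```

The word only stays this cheap as long as the recipient can already unpack it. Take the shared context away and the explanation has to travel over the wire as well.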

This progression, from building a shared understanding out of an agreed-upon alphabet, to shared words such as English, to paragraph-long explanations, to just sending you the short, four-character word ‘meme’, is compression of expression: an encoding of information into progressively smaller and smaller amounts of signal that need to be transmitted as the base context of how and what we’re communicating becomes richer and, in some sense, more predictive.

Both Schmidhuber’s and Kolmogorov’s theories about information transmission can be loosely classified as theories about measurement and condensibility of information. These theories give us guidelines for how to measure the encodeability of an idea or information, as well as some general guidelines for understanding how we might be able to better compress information.

Let’s look now at Occam’s Razor and Explanatory Power, theories that deal with the explanatory validity of an idea. That is, how do we judge the validity of an idea when using it as a framing for how to understand our reality?

This question is related to the context question. A “valid framing” is a context that enables a compacted representation of reality. In fact, compactness is what Occam’s Razor expresses: that the simplest explanation is usually the right one.

Consider the following quote from Ludwig Wittgenstein’s Tractatus Logico-Philosophicus:

If a sign is not necessary then it is meaningless. That is the meaning of Occam’s Razor.

Ludwig Wittgenstein, Tractatus Logico-Philosophicus 3.328

An unnecessary sign would be one that does nothing to contribute to the compactness of an idea, or that does not lend itself to the task of further compressing additional information or observation. Thus, the task of scientific endeavor is to find theories about reality that allow us to maximally compress the way that we represent it. We do this through the construction of the minimally required context.

But what is this context constructed of? How does this context get built? This is where Deutsch’s theory of Explanatory Power fills the gap. Deutsch, in his book The Fabric of Reality, extends Karl Popper’s anti-inductivism to conclude that all scientific theory is the building of a story-like explanation. Deutsch gives several guidelines for how explanatory power can be observed or identified:

  • if more facts or observations are accounted for; (compression)
  • if it changes more “surprising facts” into “a matter of course”; (prediction)
  • if it offers greater predictive power, i.e., if it offers more details about what we should expect to see, and what we should not; (prediction)
  • if it depends less on authorities and more on observations; (?)
  • if it makes fewer assumptions; (compression)
  • if it is more falsifiable, i.e., more testable by observation or experiment; (prediction)
  • if it is hard to vary.

Taken from the Wikipedia entry on Explanatory Power.

This transition, from ‘surprising facts’ to ‘matter of course’, closely mirrors the language used by Schmidhuber to describe the process of building a more compressible context for information.

Let’s apply this concept of compressibility to the example that Deutsch gives in his TED talk on the subject: why it gets colder in the winter. He uses the ancient Greek explanation, that of Demeter growing sad because her daughter Persephone must go down into the underworld for part of each year. Deutsch gives a good account of why, with additional observations, this explanation fails several of the criteria for a good explanation outlined above.

Let’s consider this same story in terms of compressibility. Knowing the story of Demeter and Persephone only gives us the ability to understand this one, specific phenomenon: why the sun weakens in the winter. It’s not very compact, as it doesn’t give us much ability to predict any other observations about the world.

Our current explanation, of the Earth orbiting the Sun while spinning on a tilted axis, on the other hand, does a lot to predict many things that we can observe about our reality. It predicts seasons. It predicts the changes in day length that you observe at the equator as opposed to the poles. It predicts why the moon’s appearance is different on different parts of the planet. The mere concept of being on a planet that circles the Sun on a tilted axis gives us the context to situate and condense other phenomena that we observe about the world. In the terminology of Deutsch, we’d say that this idea of planetary motion has high explanatory power, but it also goes a long way towards contributing to our ability, as humans, to compress what we know about the world into tighter and more succinct representations. Deutsch is right that we create explanations about the world. Those explanations become the context that we can situate our observations and predictions into.

This just leaves us with Kuhn’s paradigm shifts, which is largely an observation that Kuhn made about the fact that a shift in the broader contextualization of information happens. In other words, Kuhn points out that when a compression algorithm (be it a human’s understanding or an actual programmatic decision matrix) discovers a new, more predictive way of organizing information, it will re-encode its prior observations using this new, denser understanding.

In Exitus

So, how does one change their mind about a thing? These theories tell us that it is by finding new explanations that lend themselves to more condensible encodings, which then allow us to communicate and store our understanding of our reality in richer and more meaningful ways.

Errata

This condensed video of Schmidhuber talking about it is pretty good; if you’ve got time, the whole video might be worth watching. Schmidhuber’s work is largely descriptive of how learning systems learn new things and the representations in which they then store that information, but he’s also got some really great observations about the mechanism of discovery and curiosity, or why humans are driven to look for more compressible interpretations.