
Hi, in another thread John mentioned the connection between two different formulations of Occam's razor in modelling. It's one of those things where I can sort of see that there ought to be a connection, but I'm not completely clear on the details.

One way of describing the MDL principle is as a trade-off between machinery and parameters. For example, suppose you've got a set of data and you want to find the model with the best generalization ability. You can do this by choosing some class of models, say Gaussian distributions centred at various points. Then, given a collection of $N$ Gaussians, you can specify a point by saying which Gaussian you're using and then giving some *finite length* binary description of the parameter value. Since Gaussians are mostly concentrated near the middle, you need fewer bits to specify a point near the centre of a Gaussian, while for points much further from the centre you need more bits. So there's a trade-off between using a lot of bits to describe a point relative to a distant Gaussian, or actually placing a Gaussian close to your point (requiring bits to specify its centre) and then using far fewer bits to describe the point that way. It generally takes slightly more bits to specify part of the machinery than a parameter (since the machinery is generally complex), but not hugely so. The rough idea behind MDL is that the "machinery" can be used for all points, while a parameter describes only one individual point. So bits allocated to "machinery" are more effective *if* they'll be used for multiple points, but if you've got just one point that isn't "outlying", it's more effective to spend quite a few bits on its parameter.
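The machinery-versus-parameters trade-off can be sketched numerically. This is a toy illustration with made-up costs (the fixed 32-bit centre cost and the specific data are my own assumptions, not anything from the thread): each point is encoded relative to its cheapest available unit Gaussian at a cost of roughly $-\log_2$ of its density, so adding a second Gaussian "pays for itself" exactly when the machinery bits it costs are outweighed by the parameter bits it saves.

```python
import math

# Toy two-part MDL sketch (illustrative, with assumed costs):
# total bits = bits to describe the "machinery" (Gaussian centres)
#            + bits to describe each data point relative to its nearest Gaussian.
# Encoding a point x under a unit Gaussian centred at mu costs about
# -log2 N(x; mu, 1) bits, so points far from every centre are expensive.

CENTRE_COST = 32  # assumed fixed cost (bits) to specify one centre


def point_cost(x, centres):
    """Cheapest encoding of x relative to any available Gaussian, in bits."""
    def nll_bits(x, mu):
        return -math.log2(math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi))
    return min(nll_bits(x, mu) for mu in centres)


def total_bits(data, centres):
    """Two-part description length: machinery plus per-point parameters."""
    return CENTRE_COST * len(centres) + sum(point_cost(x, centres) for x in data)


data = [0.1, -0.2, 0.3, 9.9, 10.2, 10.0]   # two clear clusters
print(total_bits(data, [0.0]))             # one Gaussian: far cluster is costly
print(total_bits(data, [0.0, 10.0]))       # extra machinery pays for itself here
```

With only the three points near zero, the second centre would not be worth its 32 bits; with a whole second cluster, it is.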

This embodies the often-stated form of Occam's razor in the sense that the continuous analogue of "not multiplying entities needlessly" is "not adding entities which don't pay for themselves in terms of the reduction in parameter specification complexity". It's a bit less clear to me how fitting the maximum entropy distribution connects with this: presumably a distribution with any more "features" than necessary is adding new entities, and the Gibbs distribution that maximises the entropy is the most featureless distribution fitting the data that you can get?

## Comments

David Tweed wrote:

> the Gibbs distribution that maximises the entropy is the most featureless distribution fitting the data that you can get?

That's one way to think about it. There are many more, which all interlock.

First, Shannon showed that entropy is a good measure of "information", for example by proving his [coding theorem](http://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem), which shows that the optimal coding scheme for a random signal can compress it to a length proportional to its entropy. (For a less sloppy statement, click the link.)

Second, relative entropy, also known as "Kullback-Leibler divergence" or "information gain", lets us include the idea of a non-uniform prior. Given any prior, meaning any probability distribution $q$, the amount of information we *gain* by discovering that the probability distribution is actually $p$ is

$$ I(p,q) = \sum_i p_i \log(p_i/q_i) $$

I explained this [here](http://johncarlosbaez.wordpress.com/2011/01/21/information-geometry-part-6/). So, starting from some prior $q$ and then learning some new facts, which posterior $p$ should we pick? Arguably we should pick the one that contains no more information than needed to match the new facts we've learned. So, we should minimize $I(p,q)$ subject to the constraints on $p$ given by our new facts.
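The formula above is easy to compute directly. Here is a minimal sketch (the particular distributions are my own illustrative choices), using base-2 logarithms so the answer comes out in bits:

```python
import math

# Relative entropy (KL divergence) I(p, q) = sum_i p_i log(p_i / q_i),
# computed in bits (log base 2). Terms with p_i = 0 contribute nothing.

def relative_entropy(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = [1 / 4] * 4                  # prior q: nothing known about 4 outcomes
posterior = [1 / 2, 1 / 4, 1 / 8, 1 / 8]  # p: what we discover is actually the case

print(relative_entropy(uniform, uniform))    # 0.0: learning nothing gains nothing
print(relative_entropy(posterior, uniform))  # 0.25 bits gained over the uniform prior
```

Note the asymmetry: $I(p,q)$ is not a distance, and the roles of prior and posterior matter.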

This was argued extensively and brilliantly by Jaynes, though he focused on the case where the prior $q$ was uniform. I should warn you, though: there's an annoying issue of minus signs that runs through this whole discussion, so he would insert a minus sign in the above formula, call that entropy, and *maximize* it. Hence, he called this idea the Maximum Entropy Principle.

(There's no contradiction; the minus signs ultimately arise from the fact that entropy is a measure of *information you don't know*. A very uniform probability distribution has high entropy because there's a lot you don't know about a sample chosen randomly according to this distribution.)

Third, Kolmogorov, Levin, Chaitin, Solomonoff and others showed that we could define a kind of entropy of a single string of bits (or letters in any alphabet) to be the length of the shortest program printing it out. This is called its [Kolmogorov complexity](http://en.wikipedia.org/wiki/Kolmogorov_complexity). These folks proved that this concept is closely connected to entropy: roughly, given a random signal with a certain entropy, the expected Kolmogorov complexity of the bit strings it emits is close to its entropy!
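One can get a rough feel for the entropy/shortest-description connection with an off-the-shelf compressor. This is only a sketch: `zlib` is of course a crude stand-in for "the shortest program printing the string", but the qualitative contrast between a patterned (low-entropy) string and a uniformly random (high-entropy) one comes through.

```python
import random
import zlib

# A highly patterned string compresses to almost nothing; a uniformly
# random string of the same length barely compresses at all.

random.seed(0)
patterned = b"abab" * 2500                                  # 10000 bytes, low entropy
rand = bytes(random.randrange(256) for _ in range(10000))   # 10000 bytes, high entropy

print(len(zlib.compress(patterned)))  # a few dozen bytes
print(len(zlib.compress(rand)))       # close to the original 10000 bytes
```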

So, minimum description length complexity, information, and entropy are all different faces of the same thing! This permits a quantitative approach to Occam's razor, though there are certainly many problems.

I've tried to give links that will let you see the details that make this vague outline nice and solid without overwhelming you. This subject has always been one of my favorites, but in the last few years it's become an obsession.


Thanks; I'll cogitate on this. (I'm not so much thinking about the compression/informational aspects; it's more a question of quite what the analogue of the "entities" in "entities should not be multiplied beyond necessity" is in this case. I suspect it's just a case of me reading around those links to get the right perspective. The Minimum Description Length is a relatively direct analogue that's easy for me to see.)


I think the "entities should not be multiplied unnecessarily" formulation of Occam's razor is risky - people could use it to argue against a model of the universe with lots of atoms in it. I much prefer the Solomonoff/Jaynes formulation, which gives preference to the *simplest* theory that fits the data, meaning the one containing the least information.

I could see this as a matter of economy of description, but is there any reason to believe that the simpler description is more likely to be true?


Longer descriptions imply more information. MDL is a technique to avoid thinking you've got more information than you actually have in the data, i.e. to guard against redundancy.


That's a good question, but it also goes straight into the very deep waters of what it would mean for a model of the world to be "true". So I'll sidestep that and give a reason why one might think an "overall simpler" theory is likely to have greater accuracy. In machine learning (i.e., predictive statistical modelling) one would like the model to have [generalization](http://en.wikipedia.org/wiki/Machine_learning#Generalization), i.e. to avoid over-fitting. Any particular finite sample of data from some process will exhibit patterns that are due to the process, but it is very likely, just from being a finite set, to have some other patterns which don't apply to the process in general, particularly as the patterns become more complex. So if the "model" were to fit absolutely all the data exactly, it's likely to be fitting spurious things, which means it's going to be poor when applied to predict new data because the complex patterns (which don't hold in the general process) "mislead" it. In contrast, a simpler explanation that fits the data set almost as well is likely to have fewer complex rules fitting these spurious patterns, and so will be less "misled". On the other hand, when the model gets too simple, the bits saved on the model are outweighed by the extra bits needed to specify the data explicitly wherever the model doesn't fit it well -- so that the *overall* description length goes up -- your model has missed real features and hence will predict poorly on new data.

This is a criterion with a justification that's actually used in practice, and although it's not generally the case that the MDL model learned from a finite data sample has *the best* generalization, the MDL model is generally close to the model with the best generalization. (Yeah, there are some thorny logical issues with what that statement means that I'm going to ignore.)
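The over-fitting half of this argument is easy to demonstrate. Below is a small sketch (the quadratic ground truth, noise level, and choice of degrees are all my own illustrative assumptions): as the polynomial degree grows, training error keeps shrinking, but error on fresh data from the same process eventually deteriorates because the high-degree fits chase noise.

```python
import numpy as np

# Fit polynomials of increasing degree to noisy samples of a quadratic.
# Training error always falls with degree, but error on held-out data
# from the same process eventually rises: the extra flexibility is
# spent fitting the spurious patterns in the finite training sample.

rng = np.random.default_rng(0)
f = lambda x: 1.0 + 2.0 * x - 3.0 * x ** 2      # the "true" process

x_train = rng.uniform(-1, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.3, 15)   # small noisy sample
x_test = rng.uniform(-1, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)    # fresh data, same process

for degree in [1, 2, 6, 12]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(float(train_err), 3), round(float(test_err), 3))
```

An MDL-style criterion penalizes the high-degree fits for the extra bits their coefficients cost, steering the choice back toward the middle.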

People sometimes advocate using maximum entropy priors in Bayesian statistics. This seems a dangerous practice to me.

A classic example is that you have some sort of device or process which can produce the numbers 1 to 6 with probabilities $p_1 \dots p_6$, and the only thing you know about the device or process is that the mean is $m$. You can then use the principle of maximum entropy to choose the $p_i$. The result is of form $p_i = \alpha \beta^i$ and using $\sum p_i = 1$ and $\sum i p_i = m$ you can solve (numerically) for $\alpha$ and $\beta$.

Why dangerous? Suppose $m=1.01$. The result has $p_6 \approx 10^{-10}$. Although the method is based on the idea of assuming as little as possible, it manages to produce, from very little input, the rather dramatic conclusion that you're almost certain never to see a 6. Humans have a tendency to underestimate the probability of rare events and get into trouble because of it, and this principle seems to encourage such mistakes.

What's going wrong here?
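The numerical solve described above is short enough to write out. A minimal sketch: absorbing $\alpha$ into the normalization, the constraint $\sum_i p_i = 1$ fixes $p_i = \beta^i / Z(\beta)$, and since the mean is monotone increasing in $\beta$, simple bisection recovers $\beta$ from $m$ (the bracket and iteration count are my own choices):

```python
# Maximum-entropy distribution on {1,...,6} given only the mean m.
# The solution has the Gibbs form p_i = beta^i / Z(beta); the mean is
# monotone in beta, so we can solve for beta by bisection.

def maxent_die(m):
    def mean(beta):
        ps = [beta ** i for i in range(1, 7)]
        return sum(i * p for i, p in zip(range(1, 7), ps)) / sum(ps)

    lo, hi = 1e-12, 20.0          # mean(lo) ~ 1, mean(hi) ~ 6
    for _ in range(200):          # bisection to floating-point precision
        mid = (lo + hi) / 2
        if mean(mid) < m:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2
    ps = [beta ** i for i in range(1, 7)]
    z = sum(ps)
    return [p / z for p in ps]


p = maxent_die(1.01)
print(p[5])   # p_6: astonishingly tiny for such weak input
```

Running this for $m = 1.01$ confirms the point in the comment: $p_6$ comes out around $10^{-10}$.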


> is there any reason to believe that the simpler description is more likely to be true?

It is less likely to be wrong, since it contains fewer points of failure.

Haven't been here for a while, being totally overworked... But the Occam thing reminded me of the unfinished introduction to my crazy paper on Riemannian Geometry, where I cut away tangent space with Occam's razor and advocate starting in cotangent space. I need some good quote. The best I found is Wittgenstein's formulation: "If a sign is not necessary then it is meaningless. That is the meaning of Occam’s razor." (Tractatus Logico-Philosophicus, 3.328)
