No Free Lunch Theorems

Broadly speaking, there are two no free lunch theorems. One for supervised machine learning and one for search/optimization. The two seminal publications are:


No Free Luch for Supervised Machine Learning

Wolpert (2001) shows that in a noise-free scenario where the loss function is the misclassification rate, if one is interested in off-training-set error, then all algorithms are equivalent, on average.

d = training set
m = number of elements in training set
f = "target" input-output relationships
h = hypothesis (the algorithm's guess for f made in response to d)
C = off-training-set "loss" associated with f and h ("generalization error")

'...how well you do is determined by how "aligned" your learning algorithm P(h|d) is with the actual posterior, P(f|d).'

'...by any of the measures E(C|d), E(C|m), E(C|f,d), or E(C|f,m) (all generically known as "risks"), all algorithms are equivalent, on average. Or to put it another way, for any two learning algorithms, there are just as many situations (appropriately weighted) in which algorithm one is superior to algorithm two as vice versa, according to any of the measures of "superiority" in Theorem 2.'
Wolpert (2001)

No Free Luch for Search/Optimization

The no free lunch theorem for search and optimization applies to finite spaces and algorithms that do not resample points. All algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions. So, for any search/optimization algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. See Wolpert and Macready (1997).

Free Lunches


"We show that all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions. In particular, if algorithm A outperforms algorithm B on some cost functions, then loosely speaking there must exist exactly as many other functions where B outperforms A. Starting from this we analyze a number of the other a priori characteristics of the search problem, like its geometry and its information-theoretic aspects. This analysis allows us to derive mathematical benchmarks for assessing a particular search algorithm's performance. We also investigate minimax aspects of the search problem, the validity of using characteristics of a partial search over a cost function to predict future behavior of the search algorithm on that cost function, and time-varying cost functions. We conclude with some discussion of the justifiability of biologically-inspired search methods."
Wolpert and Macready (1995)

"An algorithmicist looks at no free lunch."
Culberson (1996)

'The "no free lunch" theorems (Wolpert and Macready) have sparked heated debate in the computational learning community. A recent communication, (Zhu and Rohwer, 1996) attempts to demonstrate the ine fficiency of cross-validation on a simple problem. We elaborate on this result by considering a broader class of cross-validation. We show that when used more strictly, cross-validation can yield the expected results on simple examples.'
Goutte (1997)

"A framework is developed to explore the connection between effective optimization algorithms and the problems they are solving. A number of "no free lunch" (NFL) theorems are presented that establish that for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. These theorems result in a geometric interpretation of what it means for an algorithm to be well suited to an optimization problem. Applications of the NFL theorems to information theoretic aspects of optimization and benchmark measures of performance are also presented. Other issues addressed are time-varying optimization problems and a priori "head-to-head" minimax distinctions between optimization algorithms, distinctions that can obtain despite the NFL theorems' enforcing of a type of uniformity over all algorithms."
Wolpert and Macready (1997)

"Here different scenarios of optimization are presented. It is argued why the scenario on which the No Free Lunch Theorem is based does not model real life optimization. For more realistic scenarios it is argued why optimization techniques differ in their efficiency. For a small example this claim is proved."
Droste, Jansen and Wegener (1998)

"We show that with a uniform prior on models having the same training error, early stopping at some fixed training error above the training error minimum results in an increase in the expected generalization error."
Cataltepe, Abu-Mostafa and Magdon-Ismail (1999)

The problem with the no-free-lunch theorems in machine learning is that they are rather complicated and difficult to follow because they are proven under the most general conditions possible (Wolpert 1992, 1995, 1996). The purpose of this note is to present the idea behind them in a very simple example.
Forster (1999)

"No Free Lunch theorems have shown that learning algorithms cannot be universally good. We show that No Free Lunch exists for noise prediction as well. We show that when the noise is additive and the prior over target functions is "uniform", a prior on the noise distribution cannot be updated, in the Bayesian sense, from any finite data set. We emphasize the importance of a prior over the target function in order to justify superior performance for learning systems."
Magdon-Ismail (2000)

'This letter discusses the recent paper "Some technical remarks on the proof of the 'No Free Lunch' theorem". In that paper some technical issues related to the formal proof of the "No Free Lunch" (NFL) theorem for search [2], [3] were given. As a result of a discussion among the authors, this letter explores the issues raised in that paper more thoroughly. This includes the presentation of a simpler version of the NFL proof, in accord with a sugggestion made explicitly in [1] and implicitly in [3]. It also includes the correction of an incorrect claim made in [1] of a limitation of the NFL theorem. Finally, some thoughts on future research directions for research into algorithm performance are given.'
Köppen, Wolpert and Macready (2001)

Slides for a tutorial presented at the 2001 Congress on Evolutionary Computation (CEC2001), Seoul, March 27, 2001.
English (2001)

"It is seen that No Free Lunch results are independent from whether or not the set of functions over which a No Free Lunch result holds is compressible."
Schumacher, Vose and Whitley (2001)

"The No Free Lunch theorem has had considerable impact in the field of optimization research. A terse definition of this theorem is that no algorithm can outperform any other algorithm when performance is amortized over all functions. Once that theorem has been proven, the next logical step is to characterize how effective optimization can be under reasonable restrictions. We operationally define a technique for approaching the question of what makes a function searchable in practice. This technique involves defining a scalar field over the space of all functions that enables one to make decisive claims concerning the performance of an associated algorithm. We then demonstrate the effectiveness of this technique by giving such a field and a corresponding algorithm; the algorithm performs better than random search for small values of this field. We then show that this algorithm will be effective over many, perhaps most functions of interest to optimization researchers. We conclude with a discussion about how such regularities are exploited in many popular optimization algorithms."
Christensen and Oppacher (2001)

"In a recent paper [3] it was shown that No Free Lunch results [5] hold for any subset F of the set of all possible functions from a finite set X to a finite set Y iff F is closed under permutation of X. In this article, we prove that the number of those subsets can be neglected compared to the overall number of possible subsets. Further, we present some arguments why problem classes relevant in practice are not likely to be closed under permutation."
Igel and Toussaint (2001)

"an Almost No Free Lunch (ANFL) theorem shows that for each function which can be optimized e)ciently by a search heuristic there can be constructed many related functions where the same heuristic is bad. As a consequence, search heuristics use some idea how to look for good points and can be successful only for functions “giving the right hints”.'
Droste, Jansen and Wegener (2002)

"The classic NFL theorems are invariably cast in terms of single objective optimization problems. We confirm that the classic NFL theorem holds for general multiobjective fitness spaces, and show how this follows from a ‘single-objective’ NFL theorem. We also show that, given any particular Pareto Front, an NFL theorem holds for the set of all multiobjective problems which have that Pareto Front. It follows that, given any ‘shape’ or class of Pareto fronts, an NFL theorem holds for the set of all multiobjective problems in that class. These findings have salience in test function design. Such NFL results are cast in the typical context of absolute performance, assuming a performance metric which returns a value based on the result produced by a single algorithm. But, in multiobjective search we commonly use comparative metrics, which return performance measures based non-trivially on the results from two (or more) algorithms. Closely related to but extending the observations in the seminal NFL work concerning minimax distinctions between algorithms, we provide a ‘Free Leftovers’ theorem for comparative performance of algorithms over permutation functions; in words: over the space of permutation problems, every algorithm has some companion algorithm(s) which it outperforms, according to a certain well-behaved metric, when comparative performance is summed over all problems in te space."
Corne and Knowles (2003)

"The No-Free-Lunch (NFL) theorems hold for general multiobjective fitness spaces, in the sense that, over a space of problems which is closed under permutation, any two algorithms will produce the same set of multiobjective samples. However, there are salient ways in which NFL does not generally hold in multiobjective optimization. Previously we have shown that a ‘free lunch’ can arise when comparative metrics (rather than absolute metrics) are used for performance measurement. Here we show that NFL does not generally apply in multiobjective optimization when absolute performance metrics are used. This is because multiobjective optimizers usually combine a generator with an archiver. The generator corresponds to the ‘algorithm’ in the NFL sense, but the archiver filters the sample generated by the algorithm in a way that undermines the NFL assumptions. Essentially, if two multiobjective approaches have different archivers, their average performance may differ. We prove this, and hence show that we can say, without qualification, that some multiobjective approaches are better than others."
Corne and Knowles (2003)

"The sharpened No-Free-Lunch-theorem (NFL-theorem) states that the performance of all optimization algorithms averaged over any finite set F of functions is equal if and only if F is closed under permutation (c.u.p.) and each target function in F is equally likely. In this paper, we first summarize some consequences of this theorem, which have been proven recently: The average number of evaluations needed to find a desirable (e.g., optimal) solution can be calculated; the number of subsets c.u.p. can be neglected compared to the overall number of possible subsets; and problem classes relevant in practice are not likely to be c.u.p. Second, as the main result, the NFL-theorem is extended. Necessary and sufficient conditions for NFL-results to hold are given for arbitrary, non-uniform distributions of target functions. This yields the most general NFL-theorem for optimization presented so far."
Igel and Toussaint (2003)

"Genetic Algorithms (GAs) and Genetic Programming (GP) are often considered as seperate but related fields. Typically, GAs use a fixed length linear representation, whereas GP uses a variable size tree representation. This paper argues that the differences are unimportant. Firstly, variable length actually means variable length up to some fixed limit, so can really be considered as fixed length. Secondly, the representations and genetic operators of GA and GP appear different, however ultimately it is a population of bit strings in the computers memory which is being manipulated whether it is GA or GP which is being run on the computer.
The important difference lies in the interpretation of the representation; if there is a one to one mapping between the description of an object and the object itself (as is the case with the representation of numbers), or a many to one mapping (as is the case with the representation of programs). This has ramifications for the validity of the No Free Lunch theorem, which is valid in the first case but not in the second. It is argued that due to the highly related nature of GAs and GP, that many of the empirical results discovered in one field will apply to the other field, for example maintaining high diversity in a population to improve performance." Woodward (2003)

"This paper has three aims. Firstly, to clarify the poorly understood No Free Lunch Theorem (NFL) which states all search algorithms perform equally. Secondly, search algorithms are often applied to program induction and it is suggested that NFL does not hold due to the universal nature of the mapping between program space and functionality space. Finally, NFL and combinatorial problems are examined. When evaluatig a candidate solution, it can be discarded without being fully examined. A stronger version of NFL is established for this class of problems where the goal is to minimize a quantity."
Woodward and Neil (2003)

Tom Minka's Statistical Learning/Pattern Recognition Glossary entry

New Publications

Top 10 Publications

  1. WOLPERT, D.H., et al., 1997. No free lunch theorems for optimization. Evolutionary Computation, IEEE Transactions on. [Cited by 1057] (97.81/year)
  2. WOLPERT, D.H. and W.G. MACREADY, 1997. No free lunch theorems for search. IEEE Transactions on Evolutionary Computation. [Cited by 481] (44.51/year)
  3. SCHUMACHER, C., M.D. VOSE and L.D. WHITLEY, 2001. The no free lunch and problem description length. Proceedings of the Genetic and Evolutionary Computation …. [Cited by 60] (8.82/year)
  4. DEMBSKI, W.A., 2006. No Free Lunch: Why Specified Complexity Cannot Be Purchased Without Intelligence. books.google.com. [Cited by 76] (42.07/year) (N.B. Intelligent design is utter nonsense!)
  5. JACKSON, R., et al., 2005. Alcohol and ischaemic heart disease: probably no free lunch. The Lancet. [Cited by 42] (14.97/year)
  6. IGEL, C. and M. TOUSSAINT, 2003. On classes of functions for which No Free Lunch results hold. Information Processing Letters. [Cited by 33] (6.87/year)
  7. CHRISTENSEN, S. and F. OPPACHER, 2001. What can we learn from No Free Lunch? A First Attempt to Characterize the Concept of a Searchable …. Proceedings of the Genetic and Evolutionary Computation …. [Cited by 28] (4.11/year)
  8. CATALTEPE, Z., Y.S. ABU-MOSTAFA and M. MAGDON-ISMAIL, 1999. No Free Lunch for Early Stopping. Neural Computation. [Cited by 24] (2.73/year)
  9. WOLPERT, D.H., 2001. The supervised learning no-free-lunch theorems. Proc. 6th Online World Conference on Soft Computing in …. [Cited by 27] (3.97/year)
  10. ZHU, H. and R. ROHWER, 1996. No free lunch for cross-validation. Neural Computation. [Cited by 25] (2.12/year)

Bibliography

Webmaster: Martin Sewell