Abstract:
The relentless advance of computer technology, a gift of Moore's Law, and the data deluge available via the Internet and other sources, has been a gift to both scientific research and business/industry. Researchers in many fields are hard at work exploiting this data. The discipline of "machine learning," for instance, attempts to automatically classify, interpret and find patterns in big data. It has applications as diverse as supernova astronomy, protein molecule analysis, cybersecurity, medicine and finance. However, with this opportunity comes the danger of "statistical overfitting," namely attempting to find patterns in data beyond prudent limits, thus producing results that are statistically meaningless.
The problem of statistical overfitting has recently been highlighted in mathematical finance. A just-published paper by the present author, Jonathan Borwein, Marcos Lopez de Prado and Jim Zhu, entitled "Pseudo-Mathematics and Financial Charlatanism," draws into question the present practice of using historical stock market data to "backtest" a new proposed investment strategy or exchange-traded fund. We demonstrate that in fact it is very easy to overfit stock market data, given powerful computer technology available, and, further, without disclosure of how many variations were tried in the design of a proposed investment strategy, it is impossible for potential investors to know if the strategy has been overfit. Hence, many published backtests are probably invalid, and this may explain why so many proposed investment strategies, which look great on paper, later fall flat when actually deployed.
In general, we argue that not only do those who directly deal with "big data" need to be better aware of the methodological and statistical pitfalls of analyzing this data, but those who observe these problems of this sort arising in their profession need to be more vocal about them. Otherwise, to quote our "Pseudo-Mathematics" paper, "Our silence is consent, making us accomplices in these abuses."