Frequently Asked Questions about Symbolic Regression
Material on this page is still evolving. It is inspired by, and largely drawn from, the FAQ document on symbolic regression that is one of the tutorials of the Evolved Analytics' DataModeler symbolic regression package.
What makes Symbolic Regression different from Regression?
Conventional regression involves assuming a model form and then determining the parameters that make the assumed model best fit the observed data. Symbolic regression searches for the model form which best describes the data behavior; essentially, we are letting the data tell us the appropriate model form rather than imposing our a priori assumptions.
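The contrast can be illustrated with a small sketch. It is only a toy: the data, the list of candidate forms, and the helper name fit_linear_in_feature are all assumptions made for this example, and the "search" is a brute-force enumeration over five hand-written forms rather than the evolutionary search a real symbolic regression package would perform.

```python
import math

def fit_linear_in_feature(xs, ys, feature):
    """Least-squares fit of y = a * feature(x) + b. Returns (a, b, sse)."""
    fs = [feature(x) for x in xs]
    n = len(xs)
    mean_f = sum(fs) / n
    mean_y = sum(ys) / n
    var_f = sum((f - mean_f) ** 2 for f in fs)
    cov = sum((f - mean_f) * (y - mean_y) for f, y in zip(fs, ys))
    a = cov / var_f
    b = mean_y - a * mean_f
    sse = sum((a * f + b - y) ** 2 for f, y in zip(fs, ys))
    return a, b, sse

# Synthetic data (an assumption for this example): y = 3*x^2 + 1
xs = [0.5 * i for i in range(1, 11)]
ys = [3.0 * x * x + 1.0 for x in xs]

# Conventional regression: the form y = a*x + b is assumed up front,
# and only the parameters a and b are estimated.
a_line, b_line, sse_line = fit_linear_in_feature(xs, ys, lambda x: x)

# Toy stand-in for symbolic regression: the form itself is searched for.
candidate_forms = {
    "a*x + b": lambda x: x,
    "a*x^2 + b": lambda x: x * x,
    "a*sqrt(x) + b": math.sqrt,
    "a*log(x) + b": math.log,
    "a*exp(x) + b": math.exp,
}
results = {
    name: fit_linear_in_feature(xs, ys, feature)
    for name, feature in candidate_forms.items()
}
best_form = min(results, key=lambda name: results[name][2])

print("assumed linear form, SSE:", sse_line)
print("best form found by search:", best_form, results[best_form])
```

With the assumed linear form the fit leaves a large residual, while the search over forms recovers the quadratic structure that generated the data.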
The search space in Symbolic Regression is infinite. How do you find THE model?
There are an infinite number of models which will fit a finite data set, so THE model does not exist. There is also an infinite number of models which approximately fit a data set. Our goal is to discover a variety of "good enough" models. In some circumstances, we may want to examine this collection and select THE most plausible model; however, generally we should prefer to examine the collection for insight and actually use multiple models as the basis of making predictions against new data.
By embracing the multi-objective philosophy for model search (i.e. by explicitly performing complexity control) we can reduce the set of all possible solutions fitting the data to the set of plausible and simple ones. To discover usable models we should prefer simplicity: a good model should be both accurate in capturing the observed data behavior AND free of unnecessary structures and variables.
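One common way to express this trade-off is to keep only the models that are not dominated in both complexity and error (a Pareto front). The following is a minimal sketch of that idea; the model names, complexity scores, and error values are invented for illustration, and the function pareto_front is a hypothetical helper, not part of any particular package.

```python
def pareto_front(models):
    """Return the models not dominated in (complexity, error).

    A model is dominated if some other model is no worse in both
    objectives and strictly better in at least one of them.
    """
    front = []
    for name, complexity, error in models:
        dominated = any(
            (c <= complexity and e <= error) and (c < complexity or e < error)
            for _, c, e in models
        )
        if not dominated:
            front.append((name, complexity, error))
    return sorted(front, key=lambda m: m[1])

# (name, complexity, error): illustrative values only
models = [
    ("m1", 3, 0.40),
    ("m2", 5, 0.25),
    ("m3", 5, 0.30),   # same complexity as m2 but less accurate
    ("m4", 9, 0.10),
    ("m5", 12, 0.10),  # same error as m4 but more complex
    ("m6", 15, 0.09),
]

front = pareto_front(models)
for name, complexity, error in front:
    print(name, complexity, error)
```

Models m3 and m5 drop out because another model matches or beats them on both objectives; the remaining models represent the available trade-offs between accuracy and simplicity.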
Why should I use Symbolic Regression instead of Non-linear regression?
You probably shouldn't if you have reliable knowledge about the system and strong evidence for a particular model form. You should use Symbolic Regression if you don't want to make an a priori assumption about the model form. Instead, the evolutionary process will identify the important variables and their relationships with each other as well as with the targeted response behavior. (You would still have to define the set of function operators that you allow in the models. If your models must not contain trigonometric functions of the input variables, you will need to exclude those functions from the primitive set.)
Another reason is that if the system is truly linear, that will be revealed; otherwise, we will have a level of insight, understanding and trustability that is difficult to attain with classic regression. Symbolic regression algorithms do the heavy lifting to generate a more complete picture of the problem and reduce the risk of human error.
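The restriction on function operators mentioned above can be sketched as follows. This is an illustrative toy, not the API of any symbolic regression package: the names ALL_PRIMITIVES, EXCLUDED, random_expression, used_primitives and evaluate are all assumptions of this example, and expressions are represented as nested tuples.

```python
import math
import random

# name -> (arity, implementation); a deliberately small, hypothetical set
ALL_PRIMITIVES = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "mul": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
    "cos": (1, math.cos),
}

# Trigonometric functions are not allowed in this (assumed) application.
EXCLUDED = {"sin", "cos"}
primitives = {k: v for k, v in ALL_PRIMITIVES.items() if k not in EXCLUDED}

def random_expression(variables, depth, rng):
    """Build a random expression tree using only the allowed primitives."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(variables)          # leaf: an input variable
    name = rng.choice(sorted(primitives))
    arity, _ = primitives[name]
    children = tuple(
        random_expression(variables, depth - 1, rng) for _ in range(arity)
    )
    return (name,) + children

def used_primitives(expr):
    """Collect the primitive names that appear in an expression tree."""
    if isinstance(expr, str):
        return set()
    found = {expr[0]}
    for child in expr[1:]:
        found |= used_primitives(child)
    return found

def evaluate(expr, env):
    """Evaluate an expression tree for the variable values in env."""
    if isinstance(expr, str):
        return env[expr]
    _, func = primitives[expr[0]]
    return func(*(evaluate(child, env) for child in expr[1:]))

rng = random.Random(0)
population = [random_expression(["x1", "x2"], 3, rng) for _ in range(50)]
print(population[0], evaluate(population[0], {"x1": 1.0, "x2": 2.0}))
```

Because candidate expressions are only ever assembled from the filtered primitive set, no model in the population can contain an excluded function.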
How are the results of Symbolic Regression likely to differ from those of other forms of regression?
Other forms of regression either make simplifying assumptions (e.g., the model is polynomial of less than 3rd order with no cross-terms) or use a model form constructed based upon domain knowledge, analysis, and human biases. Using these supplied model structures (templates), other forms of regression attempt to optimize the parameters.
Conversely, symbolic regression attempts to discover both the structure and the associated parameters. Both a strength and a weakness of Symbolic Regression is that there is no requirement for physical reasonableness. This can (and often does) lead to surprising and insightful results. Another difference is that, since an infinite number of models can fit a data set, symbolic regression develops multiple solutions, and the user can select the model(s) that best suit their needs in post-processing.
What about the physics?
A strength of symbolic regression is that it lets the data reveal the appropriate models rather than imposing a model structure that is mathematically tractable from a human perspective. Key drivers in the targeted response behavior will be uncovered, even if they are correlated with other inputs. If we expect the system to have exponential or sinusoidal behavior, those building blocks, as well as any other specialized domain-specific functions, may be included in the model search.
What is a trustable model? This does not sound like it is possible!
Trustable models are possible because of the diverse model structures that are hypothesized, explored and refined during the model search. When we get to post-processing, we have many models from which to choose. If we select an ensemble of diverse models which are also quality models in that they are low in both error and complexity, we have something quite special.
The models in this ensemble will agree when they operate in the data regions used in their development (otherwise, they would not be good models); however, if they are asked to operate outside that data region, or if the underlying system has undergone fundamental changes, they will tend to disagree (otherwise, they would not be diverse models). That disagreement can serve as a signal that a prediction should be treated with caution.
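A minimal sketch of this behavior, under stated assumptions: the three models below are hand-picked approximations of sin(x) that are close to each other on the interval [0, 1] (standing in for the region covered by the development data), not the output of an actual model search. The helper ensemble_predict and the use of the prediction range as the disagreement measure are likewise choices made for this example.

```python
import statistics

# Three structurally different stand-in models that agree on [0, 1]
models = [
    lambda x: x - x**3 / 6,                  # truncated series, 2 terms
    lambda x: x - x**3 / 6 + x**5 / 120,     # truncated series, 3 terms
    lambda x: x / (1 + x**2 / 6),            # rational form
]

def ensemble_predict(models, x):
    """Return (median prediction, spread) for the ensemble at x.

    The spread (max minus min prediction) is used here as a simple
    measure of how strongly the models disagree.
    """
    predictions = [model(x) for model in models]
    return statistics.median(predictions), max(predictions) - min(predictions)

# Inside the assumed development region the models agree closely...
pred_inside, spread_inside = ensemble_predict(models, 0.5)
# ...while far outside it they diverge.
pred_outside, spread_outside = ensemble_predict(models, 4.0)

print("x = 0.5:", pred_inside, "spread", spread_inside)
print("x = 4.0:", pred_outside, "spread", spread_outside)
```

A small spread does not prove the prediction is right, but a large spread is a cheap warning that the ensemble is being asked to extrapolate.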
We should always remember what George Box said about models: "Essentially, all models are wrong, but some are useful."



