We report the synthesis of an open-source test data set and work in progress to expand it. Sophisticated data mining and Machine Learning (ML) techniques can discover statistical associations among variables that may or may not reflect actual causal dependencies. In many applications, systems must discriminate between associations that are mere coincidences and those that are at least plausibly causal. Further, a graph of causal relationships may be complex, combining fan-in, fan-out, and transitive dependencies in various ways. Testing a system’s power to filter out non-causal associations and untangle the causal web requires suitable synthetic data. We report the development, in Wolfram Mathematica, of code that synthesizes data with subtle, complex causal dependencies among some, but not all, of the generated observable variables. We implement ten (10) simple dissipative chaotic flows: four (4) are autonomous and six (6) are driven. Among the resulting ten (10) observable state vectors there are forty-five (45) potential pairwise (1:1) relationships. Twelve (12) of these are actually causal: four (4) strong, five (5) moderate, and three (3) weak. The remaining thirty-three (33) are mere statistical artifacts that a tool under test should reject. Each system’s observables are corrupted by additive Gaussian noise, and each system’s hidden dynamics are disturbed by a Wiener process. The levels of these stochastic components are parameterized so that problem difficulty is tunable. A set of generated data, and the code for generating more, will be released openly online.
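The generation scheme described above can be illustrated with a minimal sketch. It is not the Mathematica code reported here. It is written in Python, and it assumes the Lorenz system as the chaotic flow, a one-way diffusive coupling term as the driving mechanism, and Euler-Maruyama integration. The function names, parameter names, and default values (`simulate`, `coupling`, `process_noise`, `obs_noise`) are all hypothetical. The sketch produces three observables: a driver, a system driven by it, and an independent system, so that exactly one pairwise relationship is causal.

```python
import math
import random


def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Deterministic Lorenz drift (assumed flow; the source does not name its systems)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def simulate(n_steps=5000, dt=0.005, coupling=2.0,
             process_noise=0.1, obs_noise=0.5, seed=0):
    """Return noisy x-observations of (driver, driven, independent) systems.

    Hypothetical parameters:
      coupling      - strength of the one-way driver -> driven link
      process_noise - scale of the Wiener disturbance on hidden states
      obs_noise     - standard deviation of additive Gaussian observation noise
    """
    rng = random.Random(seed)
    driver = [1.0, 1.0, 1.0]
    driven = [-3.0, 2.0, 20.0]
    independent = [5.0, -5.0, 15.0]
    sqrt_dt = math.sqrt(dt)
    rows = []
    for _ in range(n_steps):
        d_driver = lorenz(driver)
        d_driven = lorenz(driven)
        d_independent = lorenz(independent)
        # One-way coupling: the driver's x pulls on the driven system's x.
        # Nothing feeds back into the driver, and the third system is untouched.
        d_driven = (d_driven[0] + coupling * (driver[0] - driven[0]),
                    d_driven[1], d_driven[2])
        # Euler-Maruyama step: drift * dt plus a Wiener increment ~ N(0, dt).
        for state, drift in ((driver, d_driver),
                             (driven, d_driven),
                             (independent, d_independent)):
            for i in range(3):
                state[i] += drift[i] * dt + process_noise * sqrt_dt * rng.gauss(0.0, 1.0)
        # Observables: hidden x components corrupted by additive Gaussian noise.
        rows.append(tuple(s[0] + obs_noise * rng.gauss(0.0, 1.0)
                          for s in (driver, driven, independent)))
    return rows


if __name__ == "__main__":
    data = simulate()
    print(len(data), data[-1])
```

In this sketch, raising `process_noise` or `obs_noise`, or lowering `coupling`, would make the single causal link harder to detect, which is one way the tunable difficulty described above could be realized. A tool under test would be expected to report the driver-to-driven link and reject any association involving the independent system.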