Two Dimensions of Data: Newsletter #25

Jan 26, 2015

What was that old saw: in God we trust, everyone else bring data? Data and information are the bedrock of modern society. Money, numbers, bits; however you count the beads, it’s data everywhere.

Yet there’s no real understanding of data among scientists and scholars, let alone the general public. Even the experts view information from within their specialization (say, machine learning or information visualization) rather than from an understanding of the science as a whole. Imagine a world in which people learned numerical simulations for space travel without learning classical mechanics. Physics is a great science because its basic concepts (not its foundations, but the concepts that all physicists need to know in order to apply their methods to problems in the world) are drilled into physicists from Mechanics 101 onward.

There are two sciences of information: computer science and statistics. Both are backed by mathematical theory, but both go well beyond mathematics in their real-world applicability. Still, there’s a tendency to identify these subjects with their (current) mathematical foundations, i.e., the theory of computation and probability theory. A physicist would find that strange; physics is mathematical, but no physicist would confuse the foundations of physics with the foundations of mathematics.

Until our understanding of information makes that transition, we won’t have a robust science of form. I believe that transition will require a deeper unification of computing and statistics than is on offer today. To get there, we will have to look at the two disciplines from a bird’s-eye view first and then narrow in on the important questions for unification. It’s a topic that’s beginning to concern me more and more, so I am going to use these newsletters to talk about my ideas every so often. Bear with me if you think I am going all technical.

Let’s first note that computing and statistics bite off different chunks of the information universe. Computing helps us engineer information systems, the most important being desktop, laptop and mobile computers and computer networks. Computing (and once again, let me emphasize that I care more about computer engineering than computer science) integrates information vertically, i.e., it’s about engineering information systems from logic gates all the way to iPhone apps.

Statistics, on the other hand, helps us with experimentation: getting data from the world. The integration is horizontal. Statisticians care about experimental designs and survey techniques and, as the data is brought in for analysis, about techniques for crunching and visualizing the numbers.

Computing and statistics have stayed away from each other for most of their history, starting with training and ending with their typical applications. Statisticians learn continuous mathematics, and most of the important applications of statistics have been in unsexy fields such as agricultural genetics and psychology. Computer scientists learn discrete mathematics, and from the beginning the science and engineering have been very sexy, from their involvement in code breaking to the foundations of mathematics.

The proliferation of data is the main reason the two fields are beginning to come together. In particular, we need the vertical engineering of computing systems to be driven by the horizontal flow of data. Incidentally, this is exactly what my PhD supervisor, Whitman Richards, was advocating several decades ago. He got the germ of that idea from David Marr’s work on Vision. The marriage of the vertical and the horizontal is not only interesting as engineering; it’s arguably the best way to understand the relationship between the mind and the brain as well. Machine learning is at the forefront of the marriage of vertical information and horizontal information. I believe that merger will expand to more and more fields in the future. To be continued.

Ranganaut
