I’m part of a small group of mathematics enthusiasts in Kansas City who meet about once a month on Saturday mornings to drink coffee and discuss mathematics. This past weekend it was my turn to do a presentation to the rest of the group and I chose to speak on the mathematical foundations of the Support Vector Machine algorithm in Oracle Data Mining. While I wasn’t surprised that some in the group had a better handle on Vapnik-Chervonenkis theory than I and gently “guided” me a few times, I was somewhat surprised at their positive reaction to my characterization of the “Oracle” approach to data mining in contrast with the “SAS” approach. While gross simplifications are always “gross”, here is my take on what I believe to be very different philosophies. Let’s use classification as an example since we’re talking about SVMs.
I think of the “SAS” approach as similar to that of a “statistician” or classic data scientist. That is, there is a desire to understand the algorithm in the context of the data set. The main objective is to identify and understand the source(s) of error in the model and to characterize the algorithm through the use of various coefficients and ratios. A good deal of effort is spent on evaluating the algorithm and on understanding the impact of different choices in methodology. The SAS perspective emphasizes understanding the data preparation and the algorithm. The more detail, the better.
The “Oracle” approach to data mining is characterized by a broader set of business concerns, from data security and processing time to the business value of results. An “ODM” approach would be to invest time in developing and testing several different models to see which ones have more predictive power, and to spend relatively less time evaluating the individual models. Another way to say this is that error is error, and rather than trying to understand its sources, we should take that same time and try to understand the business implications of the positive “working” part of the model. The more business value, the better.
The fundamental concept behind Support Vector Machines is to take a high-dimensional data set and to separate two classes of data with a hyperplane that maximizes the margin, that is, the distance from the hyperplane to the nearest data points of each class (i.e. the hyperplane sits perfectly in the middle). Let’s contrast this with a “neural net” algorithm, which is another classification approach. Neural nets use iteration and a wide variety of statistical techniques to “tune” their algorithm to minimize the predictive error across a particular training data set. Support Vector Machines tend to be more robust and work well across a broad range of new data sets. Neural nets can fit their training data more precisely, but they are also prone to “over fitting” that training data set and typically are less robust. I think of the “Oracle” data mining approach as being like the Support Vector Machine. It uses a very strong mathematical foundation that is highly generalizable and works well across a broad range of data sets. The “SAS” data mining approach uses highly complex mathematical techniques specific to a given situation. The “Oracle” approach minimizes the “structural risk” of classification; that is, it uses the approach that is least likely to produce error on new data. The “SAS” approach minimizes the “empirical risk” of classification; that is, it uses the approach that minimizes the total error on the data at hand.
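To make the margin idea a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the four toy points, the two candidate lines, and the helper name `geometric_margin`. None of it comes from Oracle Data Mining. Both lines separate the toy training points perfectly, so they are tied on training (“empirical”) error; they differ only in margin, which is the quantity a Support Vector Machine tries to maximize.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0.

    A positive result means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data with labels -1 and +1 (invented for this illustration).
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating lines, each given as (w, b).
# Both classify every toy point correctly.
centered = ((1.0, 1.0), -5.0)   # the line x + y = 5, midway between the classes
skewed = ((1.0, 1.0), -2.5)     # the line x + y = 2.5, hugging the -1 class

margin_centered = geometric_margin(*centered, points, labels)
margin_skewed = geometric_margin(*skewed, points, labels)

print(f"centered margin: {margin_centered:.3f}")
print(f"skewed margin:   {margin_skewed:.3f}")
```

The centered line keeps every point much farther away than the skewed one does, so a new point that lands slightly off from the training data is less likely to fall on the wrong side. That is the intuition behind preferring the larger margin even though both lines make zero errors on the training set.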
I have nothing but respect for SAS practitioners. They are true experts and tend to personify the “if it’s worth doing, it’s worth doing right” approach to analytics. Of course the contrasting position is, “the perfect is the enemy of the good.” Time spent perfecting a model is forever lost and often more value is delivered by moving more quickly and accomplishing more. The Support Vector Machine algorithm deployment in Oracle Data Mining is good, very good. It recognizes that there is an inherent tradeoff between algorithmic complexity and the ability to generalize across new data sets. It uses a sensible automatic data preparation process that makes good choices and then leverages a replicable, explainable, foundationally solid methodology for balancing tradeoffs. In short, even proof-obsessed, dyed-in-the-wool mathematicians can recognize the inherent value of Oracle’s Support Vector Machine strategy.