We have discussed theoretical support vector machine (SVM) topics in previous articles. In this article, as we continue our dive into applying advanced techniques, we will take a more practical look at how SVMs are used.

The biggest difference between SVMs and most other learning algorithms lies in the optimization goal. Most learning algorithms attempt to minimize the errors, or empirical risk, generated by the model on the training data. They rest on a principle that establishes theoretical bounds on algorithm performance.

SVM is different. It does not merely seek to reduce empirical risk on the training data. Rather, it aims to build reliable models, following a principle called structural risk minimization: The SVM searches for a model that has little risk of making mistakes on future data. SVM models can be used for everything from regression to classification to clustering.

### Different tools

SVMs can be used for both classification and regression, depending on the configuration of the algorithm. For example, you can solve regression-type problems with “epsilon-SVR” or “nu-SVR.” Similarly, “C-SVC” and “nu-SVC” can be used for classification tasks. The one-class SVM type is for distribution estimation. It learns from just one class of examples and later tests whether new examples match the known ones.
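Because scikit-learn wraps libsvm, the five SVM formulations named above map directly onto its estimators. A minimal sketch (the parameter values are illustrative, not recommendations):

```python
# The five libsvm SVM types, as exposed by scikit-learn (which wraps libsvm).
from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

c_svc = SVC(C=1.0)                  # C-SVC: classification with penalty parameter C
nu_svc = NuSVC(nu=0.5)              # nu-SVC: classification, nu in (0, 1]
eps_svr = SVR(C=1.0, epsilon=0.1)   # epsilon-SVR: regression with insensitivity zone epsilon
nu_svr = NuSVR(C=1.0, nu=0.5)       # nu-SVR: regression, nu replaces epsilon
one_cls = OneClassSVM(nu=0.5)       # one-class SVM: distribution estimation
```

Each estimator is fit with the usual `fit(X, y)` call (`fit(X)` for the one-class type), so switching among the formulations is largely a matter of swapping the estimator and its penalty parameter.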

There are a number of parameters that must be set when using SVMs to solve regression problems. Two of the most important are the insensitivity zone, epsilon, and the penalty parameter, C. The C parameter determines the trade-off between training error and the VC dimension (roughly, the complexity) of the model. Both parameters are chosen by the user.

The original SVM for regression (SVR for short) used a combination of two parameters, C and epsilon, to penalize incorrectly predicted points in the optimization. A different version was later developed in which the epsilon parameter was replaced by nu, which takes values in (0, 1].

The main motivation for the nu-SVM version is that nu has a more meaningful interpretation. Nu represents an upper bound on the fraction of training samples that are errors (poorly predicted) and a lower bound on the fraction of samples that are support vectors. To many users, nu is more intuitive than C or epsilon. In any case, epsilon and nu serve a similar function as penalty parameters, and the same underlying optimization problem can be solved either way. We always need to configure our parameters correctly before solving any problem with SVMs.
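The bound interpretation of nu is easy to check empirically. The sketch below (synthetic data, assumed parameter values) fits a nu-SVR with nu = 0.3 and confirms that the fraction of training points kept as support vectors comes out at or above roughly that level:

```python
# Illustration of nu as a lower bound on the support-vector fraction in nu-SVR.
# Data and parameters are synthetic assumptions, not from the article.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=200)  # noisy sine target

model = NuSVR(nu=0.3, C=1.0).fit(X, y)
sv_fraction = len(model.support_) / len(X)  # fraction of samples that are support vectors
print(sv_fraction)
```

With nu = 0.3, roughly 30% or more of the samples end up as support vectors; raising nu trades a sparser model for a tighter fit, which is the intuition the text describes.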

The next important SVM “parameter” is the kernel type. The most popular types are linear, polynomial, RBF (radial basis function), sigmoid and precomputed. Choosing the right kernel is critical. While kernel selection matters a great deal, its nuances are beyond the scope of this article. For now, understand that the RBF kernel tends to be the most popular starting point. This kernel non-linearly maps samples into a higher-dimensional space, so it can handle cases where class labels and attributes are non-linearly related. The RBF kernel also presents fewer numerical difficulties than the other kernels. While RBF kernels are often the best choice, exceptions apply. When the number of features is very large, for example, the linear kernel is typically a better choice.

Gamma is another parameter that must be set when the polynomial, RBF or sigmoid kernel is chosen. Changing the value of gamma can improve or degrade the accuracy of the resulting model, and it is good practice to use cross-validation to find an optimal value. In libsvm, which is used by a plugin developed by this author for TradersStudio, the default value for gamma is 1 divided by the number of features.

Here’s how gamma affects a model. When you use the Gaussian RBF kernel, the decision surface is a combination of bell-shaped surfaces centered at each support vector. Each surface’s width is roughly inversely proportional to gamma. If the width is smaller than the minimum pair-wise distance in your data, you will overfit. If it is larger, all of your points will fall into one class, and performance suffers. The optimal width lies between these two extremes.

### Just one class?

Imagine a factory with heavy machinery under the constant surveillance of an advanced monitoring system. The task of the controlling system is to determine when something goes wrong: The products are below quality, a machine is faulty or the temperature is rising, and we want to know when any of these is the case. That is, we want to determine whether test data are a member of a specific class (defined by our training data).

It’s relatively easy to gather training data for situations that are OK; this is just an average day in the factory. Conversely, collecting data on a faulty system state can be expensive or impossible. Even if faulty states could be simulated, there would be no way to guarantee that all possible faulty states were covered. Therefore, we cannot treat this as a traditional two-class problem.

One-class classification was introduced to cope with this problem. By providing only the “normal operations” training data, an algorithm creates a model that attempts to represent these data. If new data differ, according to some measure, they are labeled out-of-class. SVMs can help solve this one-class problem.

### One-class SVM

The one-class SVM, according to Schölkopf, essentially separates all the data points from the origin in feature space and maximizes the distance from the separating hyperplane to that origin. The result is a binary function that captures the regions of the input space where the probability density of the data lives. The function returns +1 in a “small” region containing the training data points and -1 elsewhere.
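The factory scenario above can be sketched directly with scikit-learn’s `OneClassSVM` (the libsvm one-class type). The data, nu and gamma values here are synthetic assumptions chosen only to make the in-class/out-of-class split visible:

```python
# One-class SVM: train on "normal" data only, then flag departures as -1.
# Synthetic "factory sensor" data; nu and gamma are illustrative assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # an average day in the factory

# nu bounds the fraction of training points treated as outliers (about 5% here)
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(normal)

faults = rng.normal(loc=6.0, scale=1.0, size=(5, 2))    # readings far from anything seen
print(model.predict(faults))        # out-of-class points are labeled -1
print((model.predict(normal) == 1).mean())  # most training points remain in-class
```

Note that only "normal" examples were needed to train; the faulty readings are rejected purely because they fall outside the region the model learned, which is exactly the property the text motivates.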

Within the quadratic program for this SVM, the parameter nu characterizes the solution much as C did in our earlier discussion. It sets an upper bound on the fraction of training samples that are outliers and a lower bound on the fraction of training samples used as support vectors. Because the parameter nu is so important, this approach is often referred to as nu-SVM.

A modified version of the function creates a hyperplane, characterized by w and rho, that has maximal distance from the origin in feature space F. That plane separates all the data points from the origin, thereby letting you see what is and isn’t considered part of the class.

### Multi-class SVM

When SVM is used for classification, it is inherently a binary model, but we can use libsvm, the library used by TradersStudio, to handle multiple classes. A simple strategy is to decompose the problem into several binary classifications. Here, we will use a one-vs.-rest approach: Create one binary model per class. Let’s assume we have the following classes:

- Strong uptrend
- Uptrend
- Neutral
- Downtrend
- Strong downtrend

We will implement these as five separate binary models:

- Strong uptrend: Yes/No
- Uptrend: Yes/No
- Neutral: Yes/No
- Downtrend: Yes/No
- Strong downtrend: Yes/No

We then combine these in post-processing to produce a multi-class SVM identification. Each separate model needs to be tested and cross-validated.
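The one-vs.-rest scheme above can be sketched in a few lines: one Yes/No SVM per trend class, combined in post-processing by taking the class whose model is most confident. The feature data and labels below are synthetic placeholders, not a real trend data set:

```python
# One-vs.-rest multi-class SVM: one binary Yes/No model per trend class,
# combined by picking the class with the highest decision score.
# Features and labels are synthetic placeholders for illustration.
import numpy as np
from sklearn.svm import SVC

classes = ["strong uptrend", "uptrend", "neutral", "downtrend", "strong downtrend"]

rng = np.random.default_rng(3)
X = rng.normal(size=(250, 2))
y = rng.integers(0, 5, size=250)  # stand-in class labels 0..4

# Five separate binary models: "is this class k? Yes/No"
models = [SVC(kernel="rbf").fit(X, (y == k).astype(int)) for k in range(len(classes))]

def classify(x):
    # Post-processing: the class whose Yes/No model is most confident wins.
    scores = [m.decision_function(x.reshape(1, -1))[0] for m in models]
    return classes[int(np.argmax(scores))]

print(classify(X[0]))
```

Each of the five binary models can be cross-validated independently, as the text notes, before the post-processing step combines them.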

Consider our first example, which draws on a paper published in 2013, “Different Stock Market Models using Support Vector Machines” by Rafael Rosillo, Javier Giner, Javier Puente and Borja Ponte. One example they discuss is predicting whether the market is going up or down using two traditional technical indicators as inputs: the relative strength index (RSI) and moving average convergence/divergence (MACD).

Let’s work through a similar example. We will use RSI and (Open - Average(Close, 50)) as our inputs and try to classify whether the next trading day’s open will be greater than its close. We will train this SVM model on Australian dollar futures.

To make a prediction, we need to shift the inputs back in time. Let’s take a look at a simple script in TradersStudio, using the Neural Genius plugin (see “Visualizing performance,” below). This code prints to the terminal output, which we then export to Microsoft Excel. The output tests the SVM model over the range of the two input values to depict the model’s predictions. We train the model using the last 1,000 days of data ending Dec. 11, 2015, and generate a grid with RSI on the Y axis and the moving-average difference on the X axis.
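The grid-sweep idea generalizes beyond TradersStudio: train the model, then evaluate it over a lattice of the two input values and tabulate the predictions. The sketch below uses synthetic data and an assumed placeholder target; only the structure (RSI on the Y axis, moving-average difference on the X axis) mirrors the article’s example:

```python
# Sweeping two inputs over a grid to visualize what a trained SVM learned,
# in the spirit of the RSI / moving-average example. Data and the target
# rule are synthetic assumptions, not the article's actual model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
rsi = rng.uniform(0, 100, 1000)
ma_diff = rng.normal(0, 1, 1000)
# Placeholder target: next day closes below the open when RSI is stretched
y = ((rsi > 70) | ((rsi > 50) & (ma_diff > 0))).astype(int)

model = SVC(kernel="rbf", gamma="scale").fit(np.column_stack([ma_diff, rsi]), y)

# RSI on the Y axis, moving-average difference on the X axis
grid = [(x, r, model.predict([[x, r]])[0])
        for r in np.linspace(0, 100, 11)   # RSI rows
        for x in np.linspace(-2, 2, 9)]    # MA-difference columns
```

Dumping `grid` to a spreadsheet and shading the prediction column reproduces the kind of “output analysis” table the article describes: The boundaries between shaded regions are the rules the SVM has effectively learned.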

The goal of this example is to depict graphically the rules that the SVM has modeled from the data set. This methodology is also a simplified version of how we can generate rules from any SVM. The grid and table it generates are shown in “Output analysis” (below).

This basic example showed how we can use an SVM to understand the logic of a simple data set while filtering out noise. It also shows how rules can be produced from an SVM model.

Rule-extraction techniques that treat the model as a black box are classified as “pedagogical” or “learning-based.” Normally a black-box model has no way to explain its logic; we can address this by combining it with machine learning methods that produce rules as output.

A third group in this classification scheme is a hybrid approach that incorporates elements of both the “transparent” and pedagogical rule-extraction techniques. This methodology was originally used to extract rules from neural networks and can also be applied to SVMs.

Pedagogical rule-extraction techniques use methods such as tree algorithms and rough sets that simply take the trained model’s outputs and induce rules from them. If there are no conflicts in the trained SVM, these techniques can produce reliable rules. Another rule-extraction method is decomposition, which takes the internal elements of the model (the hidden nodes, in the neural network case) and tries to reverse-engineer them into rules using techniques such as sensitivity analysis.
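The pedagogical approach can be demonstrated in a few lines: relabel the training inputs with the SVM’s own predictions, then fit a shallow decision tree to those labels and read off its splits as if/then rules. Everything here (data, depth, feature names) is an illustrative assumption:

```python
# Pedagogical rule extraction: treat the trained SVM as a black box,
# relabel the data with its predictions, and fit a decision tree to those
# labels. The tree's splits are then human-readable rules.
# Synthetic data; feature names f0/f1 are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

svm = SVC(kernel="rbf").fit(X, y)
svm_labels = svm.predict(X)  # the black box's own outputs, not the true labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, svm_labels)
rules = export_text(tree, feature_names=["f0", "f1"])
print(rules)  # indented if/then rule text
fidelity = (tree.predict(X) == svm_labels).mean()  # how faithfully the rules mimic the SVM
```

The `fidelity` figure is the standard check for this technique: It measures how often the extracted rules agree with the black-box model, which is what “no conflicts” amounts to in practice.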

### Predicting t-bonds

We can use a simple SVM model of intermarket analysis to predict the 30-year Treasury bond. One of the most reliable markets for predicting bonds is the Philadelphia Electrical Utility Average (UTY). Our model is as follows:

**If Tbonds < Average(TBonds,MKLen,0) and UTY > Average(UTY,IntLen,0) Then Buy(“”,1,0,market,day)**

**If Tbonds > Average(TBonds,MKLen,0) and UTY < Average(UTY,IntLen,0) Then Sell(“”,1,0,market,day)**

First published in 1998, this model has continued to do well since then. It originally used a value of 8 for MKLen and a value of 18 for IntLen. This model has made more than $140,000 with these parameters since release and $225,000 overall. This is not the best performing set of parameters today. Some sets make more than $270,000 with $50 deducted per trade for slippage and commission.

There is a lot to learn in this example. The first concept is that making money with a system based on a model is minimally correlated with the model’s ability to predict market direction. Consider the two models discussed earlier. One model used IntLen = 18, while the model that made more money used IntLen = 20. If we look at how well we predict market direction, we find that for short-term prediction (one to five bars) we are just around 50%, and that only rises to 53% looking 20 bars into the future.

The reason that being right more often on market direction doesn’t always make more money lies in the distribution of errors. We could have a more accurate model of direction that has large misses when it’s wrong. When developing a model, we need to pick the target and optimization criteria carefully. Optimizing for profit or profit/drawdown is good for trading but could show low percentages on market predictions.

With respect to the general design of our system, our first input will use our simple intermarket divergence model. Here is a code fragment showing our starting inputs.

**MK_Osc = Close - Average(Close, MKLen, 0)**

**Int_Osc = Close Of independent1 - Average(Close Of independent1, IntLen, 0)**

**TrendMode = CDbl(Close - Average(Close, TrendLen, 0))**

**MarkMode = CDbl(0.0)**

**If MK_Osc < 0 And Int_Osc > 0 Then MarkMode = CDbl(1)**

**If MK_Osc > 0 And Int_Osc < 0 Then MarkMode = CDbl(-1)**

**EquityCurve = PLSimulatorLong(MarkMode, True)**

**EquityCurveDiff = EquityCurve - EquityCurve[EQLen]**

**MKMode = CDbl(Sign(MK_Osc))**

**IntMode = CDbl(Sign(Int_Osc))**

**XCorrel = CDbl(Corel(Close, Close Of independent1, CorLen, 0))**
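For readers outside TradersStudio, the same inputs can be rendered in pandas. The column names, window lengths and random-walk price series below are assumptions for illustration; `PLSimulatorLong` is plugin-specific, so the equity-curve input is noted but not reproduced:

```python
# A pandas rendering of the TradersStudio input calculations above.
# Price series, column names and window lengths are illustrative assumptions.
import numpy as np
import pandas as pd

MKLen, IntLen, TrendLen, CorLen = 8, 18, 20, 30  # assumed lookback lengths

rng = np.random.default_rng(6)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())      # market we are trading
ind_close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())  # the intermarket

mk_osc = close - close.rolling(MKLen).mean()          # MK_Osc
int_osc = ind_close - ind_close.rolling(IntLen).mean()  # Int_Osc
trend_mode = close - close.rolling(TrendLen).mean()   # TrendMode

mark_mode = pd.Series(0.0, index=close.index)         # MarkMode
mark_mode[(mk_osc < 0) & (int_osc > 0)] = 1.0         # bullish divergence
mark_mode[(mk_osc > 0) & (int_osc < 0)] = -1.0        # bearish divergence

mk_mode = np.sign(mk_osc)                             # MKMode
int_mode = np.sign(int_osc)                           # IntMode
x_correl = close.rolling(CorLen).corr(ind_close)      # XCorrel

# EquityCurveDiff would require the plugin's P&L simulator (PLSimulatorLong)
# and is omitted from this sketch.
```

The rolling windows leave NaN values at the start of each series, which would be dropped before feeding the inputs to an SVM.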

The core inputs we will use for the SVM model are:

**MarkMode:** This input is the one used in the intermarket system. A value of 1 means long and -1 means short. This is a very predictive input.

**MKMode:** This is the sign of the price minus its moving average for the market we are trading.

**IntMode:** This is the sign of the price minus the moving average for the intermarket.

**EquityCurveDiff:** This is the difference over a given window of the current equity value and a previous one. This is a very good input because it tells you how well the core intermarket divergence concept currently is working.

**XCorrel:** This is the correlation between the market we are trading and the intermarket. The correlation is predictive of how well the intermarket relationship is working.

When developing these models, you might wonder how we create a target that looks into the future. We shift the inputs back; that is, if we are predicting five bars into the future, we access each input five bars earlier and use the current value to calculate the target.
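This shift-alignment step is easy to get wrong, so here is a small pandas sketch. The price series, lookback length and five-bar horizon are illustrative assumptions; the point is the pairing of the input known at bar t with the outcome realized at bar t + horizon:

```python
# Aligning inputs with a forward-looking target by shifting, as described
# above. The series, lookback and horizon are illustrative assumptions.
import numpy as np
import pandas as pd

horizon = 5  # predicting five bars into the future

rng = np.random.default_rng(7)
close = pd.Series(100 + rng.normal(0, 1, 100).cumsum())
feature = close - close.rolling(8).mean()   # an example input, known at bar t

target = close.shift(-horizon) - close      # price change realized at bar t + horizon
# Each row now pairs the input available at bar t with the future outcome;
# rows with incomplete lookback or an unrealized future are dropped.
frame = pd.DataFrame({"x": feature, "y": target}).dropna()
```

Equivalently, one can shift the inputs back by the horizon and compute the target on the current bar, which is how the text phrases it; the two views produce the same aligned rows.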

In our next article, we will start with these inputs and develop a model for T-bonds using SVM. We will try several different combinations and targets. We will try both classification and regression using SVM and explain how to use both of these in SVM models to classify market modes as well as predicting price change. Also, we will see how all of these concepts can come together and be integrated cohesively into a trading system.