
Saturday, 31 August 2013

Naive Bayes Classification

Although it is unlikely to yield a competitive advantage in its standard form, the Naive Bayes classification algorithm is relatively simple and intuitive, which makes it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of the algorithm are powerful and, with suitable extension and modification, have the potential to form the basis of a profitable trading system.

As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return over some horizon into the future, denoted \tau, and defined as Y_{T_0 + \tau} = \ln \left( S_{T_0 + \tau} \right) - \ln \left( S_{T_0} \right), where S denotes the price of the asset we are attempting to predict, \ln denotes the natural logarithm, and T_0 indicates the current period. The aim of the algorithm is to predict Y conditional on the state of a set of signals. Let X = \langle X_1, X_2, \ldots, X_k \rangle be a vector of real-valued stochastic signals. Both Y and X are standardized by subtracting a rolling mean and dividing by some rolling measure of dispersion. Bayes' Theorem allows us to relate Y and X and is defined as
P(Y=y | X = x) = \frac{ P(Y=y)P(X = x | Y = y) }{ P(X = x) }
where x = \langle x_1, x_2, \ldots, x_k \rangle is an observed signal vector, X=x is shorthand for X_1 = x_1 \wedge X_2 = x_2 \wedge \ldots \wedge X_k = x_k, and y is an observed value of Y. Since the denominator P(X = x) is invariant across values of Y, the formula can be simplified to P(Y=y | X = x) \propto P(Y=y)P(X = x | Y = y). However, there is an issue with this formulation. Given that Y and X are real-valued variables, it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define C as a categorical variable which takes the value c=1 when Y < -2\sigma, the value c=3 when Y > 2\sigma, and the value c=2 otherwise. We apply the same discretization procedure to each signal X_i to form a corresponding categorical variable X_i^*. Thus, states 1 and 3 indicate 2-sigma events in the left and right tails, respectively. The new formula reads
P(C=c | X^* = x^*) \propto P(C=c)P(X^* = x^* | C = c)
Assuming two predictor signals, the expression P(C=3 | X_1^*=1 \wedge X_2^*=3) reads "the probability of Y experiencing a positive 2-sigma event conditional on a negative 2-sigma event in signal X_1 and a positive 2-sigma event in X_2". Thus, the probabilistic expression maps to a simple English sentence, which I prefer over more cryptic, black-box algorithms. I have set the thresholds at the 2-sigma levels in this example. In practice these thresholds are parameters which can be optimized and recursively adjusted to account for changes in the underlying dynamics of a signal's distribution.
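To make the discretization step concrete, here is a minimal sketch in Python (pandas and numpy are assumed to be available; the helper names, the 60-period rolling window, and the 2-sigma default threshold are my own illustrative choices, not fixed parts of the method):

import numpy as np
import pandas as pd

def future_log_return(prices, tau):
    # Y_{T0 + tau} = ln(S_{T0 + tau}) - ln(S_{T0}), aligned to the current period T0
    return np.log(prices.shift(-tau)) - np.log(prices)

def discretize(series, window=60, threshold=2.0):
    # Standardize with a rolling mean and rolling standard deviation,
    # then map to states: 1 = left-tail event, 3 = right-tail event, 2 = otherwise.
    z = (series - series.rolling(window).mean()) / series.rolling(window).std()
    states = pd.Series(2, index=series.index)   # default: no tail event
    states[z < -threshold] = 1                  # negative 2-sigma event
    states[z > threshold] = 3                   # positive 2-sigma event
    return states

For example, C = discretize(future_log_return(prices, tau)) and X1_star = discretize(signal_1) would produce the discretized target and one discretized signal, where prices and signal_1 are hypothetical pandas Series.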

The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We can't guarantee that an out-of-sample value of the input vector x^* exists in our training set, which means that we may not be able to estimate P(X^*=x^* | C = c). The so-called Naive Bayes assumption addresses this issue by assuming that the signals X_1, X_2, \ldots, X_k are conditionally independent of each other given a particular class c. This assumption allows us to express P(X^*=x^* | C = c) in the following form: P(X^*=x^* | C = c) = P\left( \bigwedge_i X_i^* = x_i^* | C=c \right) = \prod_i P(X_i^*=x_i^* | C=c). This means that P(C=c | X^* = x^*) can now be expressed as P(C=c | X^* = x^*) \propto P(C=c) \prod_i P(X_i^*=x_i^* | C=c).

"Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). Note, however, that Naive Bayes assumes the covariance matrix is diagonal within each class, whereas standard PCA does not consider class when rotating the data; an alternative procedure, class-conditional Independent Component Analysis, is needed for that. Finally, in typical applications there may be a large number of signals, and PCA can be used to extract a set of orthogonal features smaller than the original signal set, at the cost of some loss of variance. I will investigate this issue in a future post!
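To illustrate how the class priors and conditional probabilities might be estimated from the discretized training set, here is a minimal sketch under the same assumptions as above; the function names, the use of a plain dict for the observed signal vector, and the Laplace smoothing constant alpha (added to avoid zero counts) are my own illustrative choices:

import numpy as np
import pandas as pd

def fit_naive_bayes(X_star, C, states=(1, 2, 3), alpha=1.0):
    # X_star: DataFrame of discretized signals (one column per signal, values in states)
    # C:      Series of discretized target classes (values in states)
    priors = C.value_counts(normalize=True).to_dict()   # P(C = c)
    likelihoods = {}
    for c in states:
        rows = X_star[C == c]
        likelihoods[c] = {
            col: {s: (np.sum(rows[col] == s) + alpha) / (len(rows) + alpha * len(states))
                  for s in states}                       # smoothed P(X_i* = s | C = c)
            for col in X_star.columns
        }
    return priors, likelihoods

def predict_class(x_star, priors, likelihoods, states=(1, 2, 3)):
    # Return the class c maximizing P(C = c) * prod_i P(X_i* = x_i* | C = c)
    scores = {}
    for c in states:
        score = priors.get(c, 0.0)
        for col, value in x_star.items():
            score *= likelihoods[c][col][value]
        scores[c] = score
    return max(scores, key=scores.get)

An off-the-shelf implementation such as scikit-learn's CategoricalNB could be substituted for this hand-rolled estimator; the point of the sketch is simply that the model reduces to counting the frequency of each signal state within each class.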
