Saturday, 31 August 2013

Naive Bayes Classification

Although unlikely to gain a competitive advantage in standard form, the fact that the Naive Bayes classification algorithm is relatively simple and intuitive make it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of this algorithm are powerful and with suitable extension and modification has potential to form the basis of a profitable trading system.

 As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return to some horizon into the future, denoted $\tau$, and defined as $$ Y_{T0 + \tau} = ln \left( S_{T0 + \tau}\right) - ln \left( S_{T0} \right) $$ where $S$ denotes the price of the asset we are attempting to predict, $ln$ denotes the natural logarithm, and $T0$ indicates the current period. The aim of the algorithm is to predict Y, conditional on the state of a set of signals.  Let $X$ < $ X_1, X_2,... , X_k $ > be a vector of real valued stochastic signals. Both Y and X are standardized by subtracting a rolling mean and dividing by some rolling measure of dispersion.   Bayes' Theorem allows us to relate Y and X and is defined as
$$ P(Y=y | X = x) = \frac { P(Y=y)P(X = x | Y = y) } { P (X | x) } $$
where $x$ <$x_1,x_2$> is an observed signal vector, $X=x$ is shorthand for $X_1 = x_1 \wedge X_2 = x_2 \wedge ...\wedge X_k = x_k $, and $y$ is an observed value of Y. Since the denominator $ P (X | x) $ is invariant across values of Y, the formula can be simplified to $$ P(Y=y | X = x) \propto P(Y=y)P(X = x | Y = y) $$ However, there is an issue with this formulation. Given that Y and X are real valued variables it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define C as a categorical variable which takes the value $c=1$ when $ Y < -2*\sigma$, a value of $c=3$ when $ Y > 2*\sigma$, and a value $c=2$ otherwise. We apply the same discretization procedure to each signal $X_i$ to form a corresponding categorical variable $X_i^*$. Thus, states 1 and 3 indicate 2 sigma events in the left and right tails, respectively. The new formula reads
$$ P(C=c | X^* = x^*) = P(C=c)P(X^* = x^* | C = c)  $$
Assuming two predictor signals, the expression $P(C=3 | X_1^*=1 \wedge X_2^*=3 )$ reads "the probability of Y experiencing a positive 2 sigma event conditional on a negative 2 sigma event in signal $X_1$ and a positive 2 sigma event in $X_2$". Thus, the probabilistic expression maps to a simple english sentence, which I prefer over more cryptic and black box algorithms. I have set thresholds at the 2 sigma levels in my example. In practice these thresholds are parameters which can be optimized and recursively adjusted to account for changes in the underlying dynamics of a signal's distribution.

The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We can't guarantee that an out-of-sample value for the input vector $x^*$ exists in our training set, which means that we may not be able to estimate $P(X^*=x^*| C = c)$. The so called Naive Bayes' assumption addresses this issue by assuming that the signals $X_1,X_2,...,X_k$ are conditionally independent of each other for a particular class c. This assumption allows us to express $P(X^*=x^*| C = c)$ in the following form $$ P(X^*=x^*| C = c) = P( \wedge X_i^* = x_i^* | C=c) = \prod p(X_i^*=x_i^* | C=c) $$ This means that $P(C=c | X^* = x^*)$ can now be expressed as $$ P(C=c | X^* = x^*) \propto P(C=c) \prod P(X_i^*=x_i^* | C=c) $$ "Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). PS Naive Bayes assumes that the covariance matrix is diagonal for a particular class. Standard PCA does not consider class when rotating the data. We need an alternative procedure called class-conditional Independent Component Analysis. Finally, in typical applications there may be a large number of signals, PCA can be used to extract a set of orthogonal features smaller than the original signal set, with some loss of variance. I will investigate this issue in a future post!

No comments:

Post a Comment