Sunday 29 December 2013

Optimal Betting Strategy for Acey Deucey (card game)

In this post I develop a betting strategy for the popular card game Acey Deucey. Favourable situations where the player has a positive expectation are identified, and optimal bet sizes are determined using the Kelly criterion. Thus, the resulting strategy maximises the long-run growth rate of the player's wealth. Although full Kelly is optimal in the sense of growth rate maximisation, returns can be quite volatile in the short-term. One common approach to reduce volatility is to bet a fraction of full Kelly at the expense of a lower expected growth rate.

Having determined the optimal strategy, I carried out a numerical simulation to estimate the distribution of terminal wealth following a fixed number of rounds for a specified starting wealth. I implemented the simulation in both Matlab and Python. Although the game is simple relative to a real world financial investment strategy, it serves as a very useful prototype for exploring the problem of position sizing and risk management.

Rules of the Game

The player is dealt two cards and bets that a third card is numerically between the first two. Cards 2 to 10 are numbered by their face value, while Jack, Queen, and King are numbered 11, 12, and 13, respectively. If card one is an Ace then the player can make this card low with a value of 1 or high with a value of 14. If an Ace is dealt for the second card then a high value of 14 is automatically assumed.

If the third card lands between the first two then the player takes an amount from the pot equal to the bet. In the case that the third card is equal to the first or second cards, the player must contribute an amount to the pot equal to twice the bet. Finally, if the third card is less than the first card or greater than the second card (cards one and two are always assumed to be in ascending order), then the player must contribute an amount to the pot equal to the bet. Thus, there are three possible outcomes for each round, assuming that the player makes a bet. At the beginning of each round the player contributes a fixed amount called an ante to the pot.

Note that there are many variations on the rules of this game. The results presented below may vary when additional rules are included. I make the following assumptions:

  • Wealth is measured in dollars.
  • Dollars are infinitely divisible.
  • Infinite pot.
  • Infinite deck.
  • Ante of 1 dollar per round.

Expected Value

Each situation in the game is associated with an expected value. Let A be the event that the third card is between the first two cards. The probability of event A occurring is defined as  $ P(A) = \frac{j-i-1}{13} = \frac{g}{13} $, where i and j are the numerical values of the first and second cards, respectively. The payoff for this event is defined by $ Q(A)= x $, where x is the bet size. Let B be the event that the third card is outside the first two cards. The probability of B is defined as $ P(B)= \frac{13-g-2}{13} $ and the payoff is $ Q(B)= -x $. Finally, let C be the event that the third card is equal to either of the first two. This event occurs with probability $ P(C)= \frac{2}{13}$. The payoff for C is a loss equal to twice the bet and defined as $ Q(C)= -2x $. Thus, $ P(A)+P(B)+P(C) = 1$. The table below lists the event probabilities and the associated payoffs.


Event   Probability                 Payoff
A       $P(A) = \frac{g}{13}$       $Q(A) = x$
B       $P(B) = \frac{11-g}{13}$    $Q(B) = -x$
C       $P(C) = \frac{2}{13}$       $Q(C) = -2x$

The expected value of a round is generally defined as  $ EV = P(A)Q(A) + P(B)Q(B) + P(C)Q(C) $. Inserting the expressions from the table above, the expected value can be expressed as $ EV = \frac{g}{13} x - \frac{11-g}{13} x - \frac{4}{13} x $ and finally simplified to 

$$ EV = \frac{(2g-15)}{13} x $$

where g is the gap between the cards (j-i-1). For example, if the first card is a 3 and the second card is a Jack then g is 7. 
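These expected values can be checked with a few lines of Python, using exact fractions (a sketch following the definitions above):

```python
from fractions import Fraction

def expected_value(g, x=1):
    """Expected value of a bet of size x given a gap of g between the cards."""
    p_win = Fraction(g, 13)        # third card lands inside the gap
    p_lose = Fraction(11 - g, 13)  # third card lands outside
    p_post = Fraction(2, 13)       # third card matches one of the first two
    return p_win * x - p_lose * x - 2 * p_post * x

# Agrees with the simplification EV = (2g - 15)/13 * x for every gap
for g in range(13):
    assert expected_value(g) == Fraction(2 * g - 15, 13)

print(expected_value(8))  # prints 1/13 -- the first favourable gap
```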

Favourable Situations

Favourable situations are those which offer positive expectation. The table below lists the expected value for each situation. It is clear that the player should only be betting when the gap is greater than seven.

g     EV
0     $-\frac{15}{13} x$
1     $-x$
2     $-\frac{11}{13} x$
3     $-\frac{9}{13} x$
4     $-\frac{7}{13} x$
5     $-\frac{5}{13} x$
6     $-\frac{3}{13} x$
7     $-\frac{1}{13} x$
8     $+\frac{1}{13} x$
9     $+\frac{3}{13} x$
10    $+\frac{5}{13} x$
11    $+\frac{7}{13} x$
12    $+\frac{9}{13} x$

Optimal Bet Size


So, we will only bet when the gap is greater than seven. However, we still haven't addressed the question of how much to bet in each of the five favourable situations identified. Maximising the expected value of our decision is not sensible, as this would mean betting all of our wealth and running a substantial risk of ruin (wealth going to zero). Instead, our goal should be to maximise the growth rate of our wealth over the long run. Thus, there is a trade-off between aggressiveness and the probability of ruin. We want to maximise the following function with respect to the bet size x:
$$ f(x) = p_1 \ln (1 + x) + p_2 \ln(1-x) + p_3 \ln(1-2x) $$
where $p_1$, $p_2$, and $p_3$ are $P(A)$, $P(B)$, and $P(C)$, respectively. The five figures below illustrate the concavity of this function over the maximisation region, which varies depending on the gap g. The value of x which maximises this function is the optimal Kelly bet. Intuitively, the recommended bet increases with the gap.






We can obtain the maximum of f(x) algebraically by setting the first derivative equal to zero and solving for x. This yields the following expression for the optimal bet, denoted $x^*$:

$$ x^* = \frac{ 3p_1-p_2-\sqrt{p_1^2-6p_1p_2+8p_1p_3+9p_2^2+24p_2p_3+16p_3^2} } { 4(p_1+p_2+p_3)} $$
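The closed-form solution can be sanity-checked against a brute-force grid search over $f(x)$ (a sketch; probabilities as defined above):

```python
import numpy as np

def kelly_bet(g):
    """Closed-form optimal Kelly fraction x* for gap g (favourable: g >= 8)."""
    p1, p2, p3 = g / 13, (11 - g) / 13, 2 / 13
    disc = p1**2 - 6*p1*p2 + 8*p1*p3 + 9*p2**2 + 24*p2*p3 + 16*p3**2
    return (3*p1 - p2 - np.sqrt(disc)) / (4 * (p1 + p2 + p3))

def growth_rate(x, g):
    """The log-growth function f(x) being maximised."""
    p1, p2, p3 = g / 13, (11 - g) / 13, 2 / 13
    return p1*np.log(1 + x) + p2*np.log(1 - x) + p3*np.log(1 - 2*x)

for g in range(8, 13):
    xs = np.linspace(0.0, 0.49, 100001)
    x_grid = xs[np.argmax(growth_rate(xs, g))]
    assert abs(kelly_bet(g) - x_grid) < 1e-3   # closed form matches grid search
    print(g, round(kelly_bet(g), 4))
```

For example, `kelly_bet(8)` is about 5.1% of wealth, rising to roughly 31% for g = 12.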

The figure below shows the value of $x^*$ for each of the favourable situations. 


Likely Range in Wealth

Assuming one follows the optimal strategy outlined above, what range might one expect in one's wealth over a specific number of rounds? To address this question I implemented an application to simulate the game. In the two figures below the equity curves for 30 simulations are presented, each consisting of 300 rounds. A starting wealth of 100 dollars is assumed. As expected, the returns in the second figure, which uses quarter Kelly, are less volatile than the first, which uses full Kelly.
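The simulation can be sketched in a few lines of Python. Note that this is a simplified model rather than the exact game: it draws both card ranks uniformly from 1 to 13 with an infinite deck and ignores the Ace high/low rules, so its output will differ somewhat from the figures.

```python
import numpy as np

def simulate_equity(rounds=300, wealth=100.0, kelly_frac=1.0, ante=1.0, rng=None):
    """Simulate one equity curve under the gap > 7 betting rule.

    Simplified model: infinite deck, ranks drawn uniformly from 1..13,
    Ace high/low rules ignored.
    """
    rng = rng if rng is not None else np.random.default_rng()
    path = [wealth]
    for _ in range(rounds):
        wealth -= ante                              # ante paid every round
        i, j = sorted(rng.integers(1, 14, size=2))  # first two cards, ascending
        g = j - i - 1                               # gap between the cards
        if g > 7 and wealth > 0:                    # favourable situation: bet
            p1, p2, p3 = g / 13, (11 - g) / 13, 2 / 13
            disc = p1**2 - 6*p1*p2 + 8*p1*p3 + 9*p2**2 + 24*p2*p3 + 16*p3**2
            x = kelly_frac * (3*p1 - p2 - np.sqrt(disc)) / 4   # Kelly fraction
            bet = x * wealth
            third = rng.integers(1, 14)
            if i < third < j:
                wealth += bet                       # third card inside: win
            elif third == i or third == j:
                wealth -= 2 * bet                   # post: pay double the bet
            else:
                wealth -= bet                       # outside: lose the bet
        path.append(wealth)
    return np.array(path)

# 30 full-Kelly equity curves of 300 rounds from a 100-dollar bankroll
curves = [simulate_equity(rng=np.random.default_rng(s)) for s in range(30)]
```

Plotting `curves` reproduces the qualitative picture; passing `kelly_frac=0.25` gives the less volatile quarter-Kelly version.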



However, we can't tell a lot from the above figures as the number of simulations is too small to make an accurate estimate of the distribution. To do this I simulate 50 thousand paths and for each round I plot an empirical estimate of the mean (in red) and the 10th to 90th percentiles (in green). For these simulations I assume a starting wealth of 1000 dollars. Thus, assuming full Kelly betting, we expect to turn 1000 dollars into about 6000 dollars over 300 rounds. However, the difference between the 10th and 90th percentiles is about 16000 dollars.


In contrast, quarter Kelly results in a lower expected wealth of about 1200 dollars, with a difference of about 1000 dollars between the 10th and 90th percentiles.


I have many ideas for extending this analysis and I may post again on this topic in the future. Thanks for reading - and don't forget to apply this strategy the next time you sit down to play Acey Deucey with friends ;)

Friday 4 October 2013

Modeling Mean Reversion for Statistical Arbitrage

The goal of a classical statistical arbitrage strategy is to extract returns by holding combinations of instruments that follow a mean-reverting process. A classic example is a so-called Pairs Trading Strategy that benefits from the mean-reverting relationship between correlated stocks such as Coca-Cola and Pepsi. A second example is a Yield Curve Butterfly Strategy that attempts to capture the mean-reverting relationship between short, medium, and long-term interest rates. Mean reversion is an attractive property as one can extract returns from the market by selling the peaks and buying the troughs of the wave. The amplitude and frequency are two characteristics of a wave that are important in stat arb. The amplitude determines the profit potential of a single cycle in the wave and the frequency determines how many of these opportunities arise over a period of time. We can simulate a sine wave to illustrate these ideas with the following form $$ y(t) = A \cdot \sin ( 2 \pi f t ) $$

where A is the amplitude and f is the frequency. The following figure shows two sine waves. The red wave has a frequency of 1 while the yellow wave has a frequency of 2. Thus, the latter oscillates more frequently and so presents more trading opportunities.


Of course, the greater the amplitude of the wave the greater the profit potential from an individual trade. The following plot shows a sine wave with an amplitude of 1 in red, and one with an amplitude of 2 in yellow, both with a frequency of 2. In this case we prefer to trade the yellow series as we can make more profit in the same amount of time.


Although it is possible to find mean-reverting trades, in practice the waves are noisy. The introduction of noise makes it more difficult to statistically identify mean-reverting behaviour and also makes risk management a more challenging task. We can simulate noisy waves by including a noise term in the above equation $$ y(t) = A \cdot \sin ( 2 \pi f t) + \epsilon_t $$
where $\epsilon_t$ is a noise variable that follows some distribution. The following figure shows a sine wave with and without normally distributed noise.



In my experience the noise tends to be non-normal. For example, the Bund-Bobl-Schatz butterfly spread is mean-reverting, but it is subject to irregular extreme outliers that can result in large losses. To illustrate this, the following figure shows a sine wave with and without Student t distributed noise with 3 degrees of freedom. From the figure it is clear that there are multiple occasions where fat-tailed events could have resulted in margin calls on positions that were fading the spread at the time.
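Generating the two kinds of noisy wave takes only a few lines (a sketch; the amplitude, frequency, and noise scale are arbitrary choices):

```python
import numpy as np

t = np.linspace(0, 1, 1000)
A, f = 1.0, 2.0
signal = A * np.sin(2 * np.pi * f * t)

rng = np.random.default_rng(0)
noisy_normal = signal + 0.2 * rng.standard_normal(t.size)
# Student t noise with 3 degrees of freedom: same centre, much fatter tails
noisy_t = signal + 0.2 * rng.standard_t(3, t.size)
```

Plotting `noisy_t` next to `noisy_normal` shows the occasional multi-sigma spikes that make the t-distributed series a better stand-in for spreads like Bund-Bobl-Schatz.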




In finance the most popular model for describing mean-reverting behaviour is the Ornstein-Uhlenbeck (OU) model. The stochastic differential equation (SDE) for the OU process is given by

$$ d S_t = \lambda ( \mu - S_t ) dt + \sigma d W_t $$
with $\lambda$ the mean reversion rate, $\mu$ the long-run mean, and $\sigma$ the volatility. A useful metric is the half-life, denoted H, which indicates the number of periods it takes the process to revert halfway back to the mean. The half-life is defined as
$$ H = \frac{\ln 2}{\lambda} $$
All other things held constant, we prefer a low half-life as this results in more trading signals over a given period of time.

A previous blog post outlines simple closed-form solutions for calibrating the OU model parameters to a given dataset using the least squares regression method and the maximum likelihood method. The author provides Matlab code for the calibrations. I decided to implement his functions in Python. The Python implementation of the least squares calibration is given below

def OU_Calibrate_LS(S,delta):
    # Least-squares calibration: fit the AR(1) regression S[t+1] = a*S[t] + b + eps
    # and map the regression coefficients back to the OU parameters.
    # S is a numpy array of observations, delta the time step between them.
    from numpy import sum,power,sqrt,log
    n = len(S)-1

    Sx  = sum(S[0:-1])
    Sy  = sum(S[1: ])
    Sxx = sum(power(S[0:-1],2))
    Sxy = sum( S[0:-1] * S[1: ])
    Syy = sum(power(S[1: ],2))

    # regression slope a, intercept b, and residual standard deviation sd
    a  = ( n*Sxy - Sx*Sy ) / ( n*Sxx - power(Sx,2) )
    b  = ( Sy - a*Sx ) / n
    sd = sqrt( (n*Syy - power(Sy,2) - a*(n*Sxy - Sx*Sy) )/n/(n-2) )

    lambda_ = -log(a)/delta
    mu      = b/(1-a)
    sigma   = sd * sqrt( -2*log(a)/delta/(1-power(a,2)) )
    return (mu,sigma,lambda_)
And the maximum likelihood implementation is given by
def OU_Calibrate_ML(S,delta):
    # Maximum-likelihood calibration using the closed-form ML estimators.
    # S is a numpy array of observations, delta the time step between them.
    from numpy import sum,power,sqrt,log,exp
    n = len(S)-1

    Sx  = sum(S[0:-1])
    Sy  = sum(S[1: ])
    Sxx = sum(power(S[0:-1],2))
    Sxy = sum( S[0:-1] * S[1: ])
    Syy = sum(power(S[1: ],2))

    mu  = (Sy*Sxx - Sx*Sxy) / ( n*(Sxx - Sxy) - (power(Sx,2) - Sx*Sy) )

    lambda_ = -log( (Sxy - mu*Sx - mu*Sy + n*power(mu,2)) / (Sxx -2*mu*Sx + n*power(mu,2)) ) / delta
    a = exp(-lambda_*delta)
    # variance of the one-step conditional density, then the diffusion volatility
    sigmah2 = (Syy - 2*a*Sxy + power(a,2)*Sxx - 2*mu*(1-a)*(Sy - a*Sx) + n*power(mu,2)*power(1-a,2))/n
    sigma = sqrt(sigmah2*2*lambda_/(1-power(a,2)))

    return (mu,sigma,lambda_)
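A quick sanity check of these routines is to simulate an OU path with known parameters and confirm that the calibration recovers them. This is a sketch: the parameter values are invented, and the ML function is restated compactly with NumPy so the snippet is self-contained.

```python
import numpy as np

def OU_Calibrate_ML(S, delta):
    # Maximum-likelihood calibration, restated compactly (same formulas as above)
    S = np.asarray(S, dtype=float)
    n = len(S) - 1
    Sx, Sy = S[:-1].sum(), S[1:].sum()
    Sxx, Sxy, Syy = (S[:-1]**2).sum(), (S[:-1]*S[1:]).sum(), (S[1:]**2).sum()
    mu = (Sy*Sxx - Sx*Sxy) / (n*(Sxx - Sxy) - (Sx**2 - Sx*Sy))
    lambda_ = -np.log((Sxy - mu*Sx - mu*Sy + n*mu**2)
                      / (Sxx - 2*mu*Sx + n*mu**2)) / delta
    a = np.exp(-lambda_*delta)
    sigmah2 = (Syy - 2*a*Sxy + a**2*Sxx - 2*mu*(1-a)*(Sy - a*Sx)
               + n*mu**2*(1-a)**2) / n
    sigma = np.sqrt(sigmah2*2*lambda_/(1 - a**2))
    return mu, sigma, lambda_

def simulate_OU(mu, sigma, lambda_, delta, n, seed=0):
    """Simulate an OU path using its exact discrete-time transition."""
    rng = np.random.default_rng(seed)
    a = np.exp(-lambda_*delta)
    sd = sigma*np.sqrt((1 - a**2)/(2*lambda_))   # conditional std over one step
    S = np.empty(n)
    S[0] = mu
    for t in range(1, n):
        S[t] = a*S[t-1] + mu*(1 - a) + sd*rng.standard_normal()
    return S

# Recover invented parameters (mu=1.0, sigma=0.3, lambda=3.0) from 20000 daily steps
S = simulate_OU(mu=1.0, sigma=0.3, lambda_=3.0, delta=1/252, n=20000)
mu_hat, sigma_hat, lambda_hat = OU_Calibrate_ML(S, 1/252)
```

The half-life of the fitted process is then `np.log(2) / lambda_hat`; for the true $\lambda = 3$ per year this is about 58 trading days.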


In a future post I will investigate a number of potential mean-reverting opportunities within the futures market using the OU model as a guide. Thanks for reading!



Saturday 31 August 2013

Naive Bayes Classification

Although unlikely to yield a competitive advantage in its standard form, the Naive Bayes classification algorithm is relatively simple and intuitive, which makes it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of the algorithm are powerful and, with suitable extension and modification, have the potential to form the basis of a profitable trading system.

As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return over some horizon into the future, denoted $\tau$, and defined as $$ Y_{T_0 + \tau} = \ln \left( S_{T_0 + \tau}\right) - \ln \left( S_{T_0} \right) $$ where $S$ denotes the price of the asset we are attempting to predict, $\ln$ denotes the natural logarithm, and $T_0$ indicates the current period. The aim of the algorithm is to predict Y, conditional on the state of a set of signals. Let $X = \langle X_1, X_2, \ldots, X_k \rangle$ be a vector of real-valued stochastic signals. Both Y and X are standardized by subtracting a rolling mean and dividing by some rolling measure of dispersion. Bayes' Theorem allows us to relate Y and X and is defined as
$$ P(Y=y | X = x) = \frac { P(Y=y)P(X = x | Y = y) } { P (X = x) } $$
where $x = \langle x_1, x_2, \ldots, x_k \rangle$ is an observed signal vector, $X=x$ is shorthand for $X_1 = x_1 \wedge X_2 = x_2 \wedge \ldots \wedge X_k = x_k $, and $y$ is an observed value of Y. Since the denominator $ P (X = x) $ is invariant across values of Y, the formula can be simplified to $$ P(Y=y | X = x) \propto P(Y=y)P(X = x | Y = y) $$ However, there is an issue with this formulation. Given that Y and X are real-valued variables it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define C as a categorical variable which takes the value $c=1$ when $ Y < -2\sigma$, the value $c=3$ when $ Y > 2\sigma$, and the value $c=2$ otherwise. We apply the same discretization procedure to each signal $X_i$ to form a corresponding categorical variable $X_i^*$. Thus, states 1 and 3 indicate 2-sigma events in the left and right tails, respectively. The new formula reads
$$ P(C=c | X^* = x^*) \propto P(C=c)P(X^* = x^* | C = c)  $$
Assuming two predictor signals, the expression $P(C=3 | X_1^*=1 \wedge X_2^*=3 )$ reads "the probability of Y experiencing a positive 2-sigma event conditional on a negative 2-sigma event in signal $X_1$ and a positive 2-sigma event in $X_2$". Thus, the probabilistic expression maps to a simple English sentence, which I prefer over more cryptic, black-box algorithms. I have set the thresholds at the 2-sigma levels in my example. In practice these thresholds are parameters which can be optimized and recursively adjusted to account for changes in the underlying dynamics of a signal's distribution.

The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We can't guarantee that an out-of-sample value for the input vector $x^*$ exists in our training set, which means that we may not be able to estimate $P(X^*=x^*| C = c)$. The so-called Naive Bayes assumption addresses this issue by assuming that the signals $X_1,X_2,\ldots,X_k$ are conditionally independent of each other given a particular class c. This assumption allows us to express $P(X^*=x^*| C = c)$ in the following form $$ P(X^*=x^*| C = c) = P( \wedge_i \, X_i^* = x_i^* | C=c) = \prod_i P(X_i^*=x_i^* | C=c) $$ This means that $P(C=c | X^* = x^*)$ can now be expressed as $$ P(C=c | X^* = x^*) \propto P(C=c) \prod_i P(X_i^*=x_i^* | C=c) $$ "Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). Note, however, that Naive Bayes assumes the covariance matrix is diagonal within each class, while standard PCA does not consider class when rotating the data; an alternative procedure called class-conditional Independent Component Analysis addresses this. Finally, in typical applications there may be a large number of signals; PCA can be used to extract a set of orthogonal features smaller than the original signal set, at the cost of some loss of variance. I will investigate this issue in a future post!
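A toy version of the discretised classifier can be written directly from the counting definition. This is a sketch with made-up state data and Laplace smoothing to avoid zero counts; the (1, 3) pattern is an invented stand-in for "signal one down 2 sigma, signal two up 2 sigma".

```python
import numpy as np
from collections import Counter

def fit_nb(X, y, n_states=3, alpha=1.0):
    """Estimate P(C=c) and P(X_i = s | C = c) from discretised data,
    with Laplace smoothing parameter alpha to avoid zero counts."""
    y = np.asarray(y)
    classes = sorted(set(y.tolist()))
    n, k = X.shape
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}                     # (signal index, class) -> state probabilities
    for c in classes:
        Xc = X[y == c]
        for i in range(k):
            counts = Counter(Xc[:, i].tolist())
            total = len(Xc) + alpha * n_states
            cond[(i, c)] = {s: (counts.get(s, 0) + alpha) / total
                            for s in range(1, n_states + 1)}
    return priors, cond

def predict_nb(x, priors, cond):
    """argmax_c of P(C=c) * prod_i P(X_i = x_i | C = c)."""
    scores = {c: priors[c] * np.prod([cond[(i, c)][xi] for i, xi in enumerate(x)])
              for c in priors}
    return max(scores, key=scores.get)

# Toy data: class 3 (an up 2-sigma move) always follows signal state (1, 3)
rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(2000, 2))
y = [3 if (a, b) == (1, 3) else 2 for a, b in X]
priors, cond = fit_nb(X, y)
print(predict_nb((1, 3), priors, cond))  # prints 3
```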

Friday 30 August 2013

Pairwise Linear Correlation (CL/RB/HO)


The prices of crude oil, gasoline, and heating oil are highly correlated for obvious reasons. In this post I present an estimate of the pairwise linear correlations between the futures contracts on these products. The correlations are estimated using a short sample of two weeks. The following figure shows the cumulative return over the sample period, where the flat segment represents Saturday and Sunday, when the market is closed.


I estimate the pairwise correlation for a range of sampling frequencies from 1 second out to 120 seconds. I use samples of the midpoint to avoid the impact of the bid-ask bounce effect. The correlations break down as we approach the highest sampling frequency, a well documented phenomenon called the Epps Effect. That said, the correlations remain above 50% at the one second frequency, which is consistent with high frequency spreader algorithms trading these contracts at sub-second frequencies and in the process strengthening the correlations.


Of course, these correlations evolve over time and will vary depending on the sample used to estimate them. However, the same pattern emerges, that is, diminishing correlations as the sampling frequency increases.
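This diminishing-correlation pattern can be reproduced with a stylised two-asset simulation in which both midpoints share a common random-walk component and each carries independent microstructure noise. All parameter values here are invented; note this captures only the noise-driven part of the effect, whereas the genuine Epps effect also involves asynchronous trading.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000                                    # one-second observations
common = np.cumsum(rng.standard_normal(n))    # shared efficient price
a = common + 3.0 * rng.standard_normal(n)     # midpoint A = common + noise
b = common + 3.0 * rng.standard_normal(n)     # midpoint B = common + noise

def return_corr(x, y, k):
    """Correlation between k-second returns of two price series."""
    rx, ry = np.diff(x[::k]), np.diff(y[::k])
    return np.corrcoef(rx, ry)[0, 1]

for k in (1, 5, 30, 120):
    print(k, round(return_corr(a, b, k), 2))
```

The printed correlations rise steadily with the sampling interval, because the common component's variance grows with the interval while the noise variance stays fixed.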

Wednesday 28 August 2013

Intraday Volatility (CL/RB/HO)

I implemented a high frequency volatility estimator that takes one second snapshots of the midpoint as input. The following figure shows minute by minute volatility estimates (scaled to one hour) for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The median estimate for each minute over 250 trading days is used to illustrate the diurnal volatility pattern. The most significant volatility spikes can easily be matched with well known market events such as the open, the close, and economic figure announcements, as well as less obvious events such as minute markers and settlements. Some of the volatility spikes are caused by weekly (as in the case of the crude oil inventory figure) or monthly (as in the case of German Manufacturing PMI) announcements.



In this post I go through the Python code I developed to create this figure. The steps include querying a MongoDB database, aggregating results, computing volatility, and plotting.

We begin by connecting to MongoDB via the PyMongo package.
    client = MongoClient('localhost', 27017)
    db = client.taq
    quote_collection = db['CL_quotes']
I then query the available list of dates
    dates = quote_collection.find({},{"d" : 1 }).distinct("d")
    dates = sorted(dates)
I don't want to include exchange holidays or expiration dates in my analysis. The following function calls retrieve the relevant dates.
    holidays= sd.holidays()
    expirationdays= sd.CL_expirationdates()
I then filter the date list.
    dates_filtered = [x for x in dates if x not in holidays]
    dates_filtered = [x for x in dates_filtered if x not in expirationdays]
I also filter Sundays.
    dates_filtered = [x for x in dates_filtered if x.weekday() != 6]
For each date "dt" in dates_filtered I specify an end date to be used in the database query following.
    start_dt = dt
    end_dt = dt
    end_dt = end_dt + datetime.timedelta(days=1)
I then write an aggregation query to sample bid and ask quotes at one second intervals. The aggregation framework is a recent addition to MongoDB. It is pretty fast, but apparently the Mongo developers are in the process of making it faster. This query aggregates bid and ask quotes into 1 second buckets and returns the closing quotes for each second. The aggregation process occurs at database level. The alternative is to load the quotes into a Python data structure to perform the aggregation.

    docs = collection.aggregate([
        {"$match": {"dt": {"$lt" : end_dt, "$gte" : start_dt  }}},
        {"$project": {
             "year":       {"$year": "$dt"},
             "month":      {"$month": "$dt"},
             "day":        {"$dayOfMonth": "$dt"},
             "hour":       {"$hour": "$dt"},
             "minute":     {"$minute": "$dt"},
             "second":     {"$second": "$dt"},
             "dt": 1,
             "ap": 1,
             "bp": 1 }},
         {"$sort": {"dt": 1}},
         {"$group":
           {"_id" : {"year": "$year", "month": "$month", "day": "$day", "hour": "$hour","minute":"$minute","second":"$second" },
                        "bp_close":  {"$last": "$bp"},
                        "ap_close":  {"$last": "$ap"} }} ] )

The following code iterates through the result set and creates lists of closing bid and ask quotes and corresponding timestamps.
    timestamp = []
    bp_close=[]
    ap_close=[]

    for doc in docs['result']:
        year = doc['_id']['year']
        month = doc['_id']['month']
        day = doc['_id']['day']
        hour = doc['_id']['hour']
        minute = doc['_id']['minute']
        second = doc['_id']['second']
        timestamp.append(datetime.datetime(year,month,day,hour,minute,second))
        bp_close.append(doc['bp_close'])
        ap_close.append(doc['ap_close'])

I then wrap these lists in Pandas Series objects.
    timestamp_series = pd.Series(data=timestamp)
    timestamp_series.name='Timestamp'

    bp_series = pd.Series(data=bp_close)
    bp_series.name='BidPrice'

    ap_series = pd.Series(data=ap_close)
    ap_series.name='AskPrice'
I then concatenate the series to form a Pandas DataFrame object. The DataFrame is my data structure of choice for working with time series data in Python.
    df_in = pd.concat([pd.DataFrame(timestamp_series),pd.DataFrame(bp_series),pd.DataFrame(ap_series)], axis=1)
    df_in = df_in.set_index('Timestamp')
    df_in = df_in.sort_index()

The aggregation process omits 1 second intervals where no quote was available. However, I want a full range from 00:00:00 to the final 1 second period of the day 23:59:59.
    start = datetime.datetime(year,month,day,0,0,0)
    end = start + datetime.timedelta(days=1)
    step = datetime.timedelta(seconds=1)
    range = []  # note: shadows the builtin; kept to match the code below
    t = start
    while t < end:
        range.append(t)
        t += step
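As an aside, pandas can build the same full-second index in a single call with `pd.date_range` (a sketch; the date is an arbitrary example):

```python
import datetime
import pandas as pd

start = datetime.datetime(2013, 8, 28)   # arbitrary example date
# 86400 one-second timestamps covering 00:00:00 through 23:59:59
full_range = pd.date_range(start=start, periods=24 * 60 * 60, freq='s')
```

Reindexing the empirical DataFrame against such an index (e.g. `df_in.reindex(full_range)`) achieves the same effect as the column-by-column paste below.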
I then create an empty Pandas Series indexed on this range followed by a Dataframe.
    range_series = pd.Series(data=range)
    range_series.name='Timestamp'
    df = pd.DataFrame(index=range_series,columns=['BidPrice','AskPrice' ])
I then paste the empirical Dataframe into the full range column by column.
    for col in df_in.columns:
        df[col] = df_in[col]
The missing intervals will contain NaN values which I fill by copying forward previous quotes.
    df = df.ffill()
The midpoint is then calculated and a new column created in the Dataframe.
    df['m'] = pd.Series( (df['BidPrice']+df['AskPrice'])/2,index=df.index)
1 second log returns are then calculated
    df_r = np.log(df).diff()
    df_r = df_r.fillna(0)
Following this I square the returns.
    df_r2 = np.power(df_r,2)
And aggregate the squared returns into 1 minute intervals.
    df_r2_agg = df_r2.resample('1min').sum()
The above steps are applied for each of the dates resulting in millions of rows. I then create a pivot table on date and minute of the day.
    df_r2_agg['minute'] = t
    df_r2_agg['date'] = dates
    df_r2_agg_pivoted = df_r2_agg.pivot(index='date',columns='minute',values='m')
where t is a time-only series in increments of one minute, replicated for each date. I then calculate the median variance across dates.
    df_r2_agg_median = df_r2_agg_pivoted.median()
To calculate the volatilities I take the square root of the 1 minute variances, and scale to one hour.
    df_vol = np.sqrt(df_r2_agg_median)
    df_vol = df_vol*np.sqrt(60)
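The resample-square-aggregate pipeline can be checked end to end on synthetic data: a random-walk midpoint with a known one-second log-return volatility `s` should give back an hourly volatility of roughly `s * 60`. This is a sketch; all numbers are invented for the check.

```python
import numpy as np
import pandas as pd

s = 1e-4                                   # true one-second log-return volatility
idx = pd.date_range('2013-08-28', periods=86400, freq='s')
rng = np.random.default_rng(0)
mid = pd.Series(np.exp(np.cumsum(s * rng.standard_normal(86400))), index=idx)

r2 = np.log(mid).diff().fillna(0) ** 2     # squared one-second log returns
var_1min = r2.resample('1min').sum()       # one-minute realised variance
vol_hour = np.sqrt(var_1min.median()) * np.sqrt(60)   # median minute vol, scaled

print(round(vol_hour / s))                 # close to 60: vol scales with sqrt(time)
```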
Finally, I plot the volatilities. I have left out some code, but you get the idea! ;)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    colors = itertools.cycle(["#00FA9A", "c", "y","r","m","g","w" ])
    for col in df_vol.columns:
        ax.plot_date(df_vol.index,df_vol[col]/.0001,color=next(colors),linestyle='-',marker='' )

Monday 26 August 2013

Volume at Expiration (CL/RB/HO)

As a futures contract approaches its expiration date, open interest rolls to the back month and volume in the front month dwindles. Traders holding large positions in the front month with the intention of rolling prior to expiration are faced with the decision of when to roll. Waiting until the expiration date is not advisable due to illiquidity. One day prior to expiration is also relatively illiquid. The following figures show the median intraday volume aggregated over 15 minute intervals for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The analysis suggests that large traders should roll at least 2 days prior to expiration to avoid serious liquidity constraints. The legend in these figures shows the number of days to expiration, with 0 being the expiration date.


Crude Oil (CME:CL)




Gasoline (CME:RB)



Heating Oil (CME:HO)


Sunday 18 August 2013

Crude Oil Inventory

Each week (usually on Wednesdays) the U.S. Energy Information Administration publishes the number of barrels of crude oil held in inventory by commercial firms during the past week. The figure is eagerly anticipated by energy traders as it is interpreted as an indicator of supply. Thus, an increase (decrease) in the figure is typically associated with bearish (bullish) sentiment toward crude oil.


The following chart shows the trading action around the Jan 30th 2013 release over a three minute window starting one minute prior to the release, which occurs at 15:30 GMT, and ending two minutes after. A green dot indicates an up tick, a red dot indicates a down tick, and a white dot is displayed when a trade occurs at the same price as the previous trade.



The next figure shows the bid ask spread at 50 millisecond intervals over the same time horizon.



And the following figure shows the midpoint, which is defined as the average of the bid and ask prices.







Very little trading takes place over the one minute period prior to the release. The next plot shows the trading volume aggregated over 5 second intervals.




To estimate volatility I aggregate 50 millisecond returns into 5 second intervals, where return is defined as the change in the log of the midpoint. The volatility graph below shows basis points on the y-axis. As expected there is a spike in volatility on the release.




Each week market analysts forecast the inventory figure. The following plot shows the consensus forecast over 2012 versus the actual figure.




As the following plot highlights, there is usually a significant differential between the consensus forecast and the actual outcome. The differential, or forecast error, is measured in millions of barrels of crude oil.





In the next plot I show the cumulative log returns of the midpoint for each announcement over a 30 second window starting 10 seconds prior to the release. As expected there is little activity prior to the announcement and a significant volatility regime switch at 15:30.



The volatility switch is better illustrated in the following figure, which shows the absolute 50 millisecond returns.


Given that the market typically prices in the consensus forecast prior to the announcement, one might expect there to be a positive correlation between forecast error and post release volatility.

To test this hypothesis I investigate the correlation between the consensus forecast error and post announcement volatility. The volatility estimate is scaled to a one hour horizon. The figure below suggests that an unexpected increase in supply has a bigger impact on volatility than an unexpected decrease.



The final figure shows the relationship between the volatility estimate and the absolute forecast error. I find a significant linear correlation of 45%.




I plan to post more analysis on this topic soon! Thanks for reading!


Sunday 4 August 2013

Unix Commands

In this post I catalogue the Unix commands that I find myself using most frequently.

-bash$ history | awk '{a[$2]++}END{for(i in a){print a[i] " " i}}' | sort -rn | head

List of commands you use most often

-bash$ ssh -X user@host
ssh remote login with X11 forwarding

-bash$ strace -ff -e trace=write -e write=1,2 -p SOME_PID
intercept stdout/stderr of another process

-bash$  strace -p PID
trace the system calls made by process PID

-bash$  ls -l /proc/PID/fd
list the file handles that process PID has open.

-bash$  Ctrl + z
-bash$  bg
-bash$  disown -h [job-spec]
pause already running process, put into background, and detach.

-bash$  top -u user
list running processes for a given user

-bash$ echo "ls -l" | at 1800
Execute a command at a given time

-bash$ man ascii 
quick access to the ascii table

-bash$ !!:gs/foo/bar 
Runs previous command replacing foo by bar every time that foo appears

-bash$ curl ifconfig.me
get your external ip address

-bash$ sshfs name@server:/path/to/folder /path/to/mount/point
Mount folder/filesystem through SSH

-bash$ ctrl-l 
clear the terminal screen

-bash$ (cd /tmp && ls)
Jump to a directory, execute a command and jump back to current dir

-bash$ du -h --max-depth=1 | sort -n
List the size (in human readable form) of all sub folders from the current location, sorted by size.

-bash$ disown -a && exit
Close shell keeping all subprocess running

-bash$ ps aux | sort -nk +4 | tail
Display the top ten running processes - sorted by memory usage

-bash$ find . -name "*.py" -exec grep -i -H "search pattern" {} \;
Search recursively to find a word or phrase in certain file types, such as Python code.
-H tells grep to print the filename.
-i makes the match case-insensitive.
-bash$ awk '/start_pattern/,/stop_pattern/' file.txt
Display a block of text with AWK.

-bash$ grep -A # pattern file.txt
Print a specific number of lines (#) following each match of your pattern

-bash$ grep -n --color pattern * -r
Recursively search for a pattern, printing line numbers and highlighting matches

-bash$ lsof -i
Watch Network Service Activity in Real-time

-bash$ some_very_long_and_complex_command # label
Using the above trick it's possible to label your commands and access them easily by pressing ^R and typing the label

-bash$ lsof -P -i -n
Show apps that use internet connection at the moment.

-bash$ mv filename.{old,new}
quickly rename a file

-bash$ rm !(*.foo|*.bar|*.baz)
Delete all files in a folder that don't match a certain file extension (requires shopt -s extglob)

-bash$ wget -qO - "http://www.tarball.com/tarball.gz" | tar zxvf -
Extract tarball from internet without local saving

-bash$ cat /etc/issue
Display which distro is installed

-bash$ man hier
Show File System Hierarchy

-bash$ sudo dd if=/dev/mem | strings
Show all plain-text string values currently in RAM

-bash$ ls -R | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/   /' -e 's/-/|/'
Graphical tree of sub-directories

-bash$ mkdir -p a/long/directory/path
Make directory including intermediate directories

-bash$ sed -n '10,20p' file.txt
Print lines 10 through 20 of a file


-bash$ pv sourcefile > destfile
Copy a file using pv and watch its progress

-bash$ rename 'y/ /_/' *
replace spaces in filenames with underscores

-bash$ vim scp://username@host//path/to/somefile
Edit a file on a remote host using vim

-bash$ du -s * | sort -n | tail
Get the 10 biggest files/folders for the current directory

-bash$ mkdir /home/foo/doc/bar && cd $_
mkdir & cd into it as single command

-bash$ chmod --reference file1 file2
Makes the permissions of file2 the same as file1

-bash$ sed -n 5p file.txt
Print a specific line (here line 5) from a file.

-bash$ tar -tf archive.tar.gz | xargs rm -r
Remove all files previously extracted from a tar(.gz) archive.

-bash$ ls -d */
List only the subdirectories of the current directory.

-bash$ ps awwfux | less -S
Show a 4-way scrollable process tree with full details.

-bash$ find . -type d -empty -delete
Recursively remove all empty directories

-bash$ leave +15 ; leave 1855
Set an alarm for a particular time: +15 means 15 minutes from now, 1855 means 18:55.

-bash$ grep ^Dirty /proc/meminfo
Find out how much data is waiting to be written to disk

-bash$ mkdir -p work/{d1,d2}/{src,bin,bak}
make directory tree

-bash$ cp file.txt{,.bak}
Create a quick back-up copy of a file

-bash$ echo ${SSH_CLIENT%% *}
Get the IP of the host you're coming from when logged in remotely

-bash$ showkey -a
Quick access to ASCII code of a key

-bash$ history -d N
Delete line N from your shell history.

-bash$ find ./ -type f -exec chmod 644 {} \;
Recursively change permissions on files, leave directories alone.

-bash$ alias dush="du -sm *|sort -n|tail"
Easily find megabyte-eating files or directories: dush gives the above output, while dush -n 3 shows only the 3 biggest.

-bash$ :mksession!  
Save your sessions in vim to resume later

-bash-3.2$  ./myapp &> log.txt & disown -h ;  tail -f log.txt
Run background process. Redirect stdout and stderr to log.txt, disown the process, and tail the log file.

-bash-3.2$  tar -zxvf data.tar.gz
where the options zxvf are:
-z : Uncompress the resulting archive with gzip command.
-x : Extract to disk from the archive.
-v : Produce verbose output i.e. show progress and file names while extracting files.
-f data.tar.gz : Read the archive from the specified file called data.tar.gz.


-bash-3.2$  ls -lhS
List files sorted by size, with option l for a long listing format, h for human-readable file sizes, and S to sort by file size (largest first).

-bash-3.2$  ls -lSr
The same listing in reverse order (smallest first).

-bash-3.2$ wc -l *
Count the number of lines in each file in the current directory.

-bash-3.2$  ls -1 targetdir | wc -l
Count the number of files in a directory

 -bash-3.2$  top
List running processes. Press 1 to list cpus.

 -bash-3.2$  find /home/me | grep foo.c
Locate file foo.c

 -bash-3.2$ netstat -tlnp
Lists all listening ports together with the PID of the associated process

-bash-3.2$ mtr HOSTNAME
Better than traceroute and ping combined. Investigates the network connection between the host mtr runs on and HOSTNAME.

-bash-3.2$ > file.txt
Empty (truncate) a file without deleting it.

 -bash-3.2$ Ctrl-r
Allows you to reverse search through your command history.

-bash-3.2$ du -sh .
Outputs the total size of files and directories within the current directory.

-bash-3.2$ df -h .
Outputs the used and available space for the filesystem containing the current directory.

Friday 2 August 2013

U.S. Petroleum Balance Sheet

I developed a Python script to parse the U.S. Petroleum Balance Sheet Excel file and generate a figure for each statistic. All reported statistics are weekly and are measured in thousands of barrels per day. The statistics fall under two categories: Petroleum Stocks and Petroleum Supply. The figures are up-to-date for the week ending August 4th 2013.
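The parsing script itself isn't reproduced here, but the plotting step can be sketched as follows. This is a minimal illustration, not the original code: it assumes the sheet has already been loaded into a pandas DataFrame (e.g. via pd.read_excel), and plot_statistic is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_statistic(df, column, outfile):
    """Plot one weekly statistic and save it as a PNG."""
    ax = df[column].plot(title=column)
    ax.set_xlabel("Week Ending")
    ax.set_ylabel("Thousand Barrels per Day")
    plt.savefig(outfile)
    plt.close()


# Synthetic weekly data stands in for the real balance sheet here.
dates = pd.date_range("2013-06-07", periods=8, freq="W-FRI")
df = pd.DataFrame(
    {"U.S. Ending Stocks of Crude Oil": range(360000, 368000, 1000)},
    index=dates)
plot_statistic(df, "U.S. Ending Stocks of Crude Oil", "crude_stocks.png")
```

The real script simply loops this helper over every statistic column in the workbook.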



Petroleum Stocks

U.S. Ending Stocks of Crude Oil

U.S. Ending Stocks excluding SPR of Crude Oil

U.S. Ending Stocks of Crude Oil in SPR

U.S. Ending Stocks of Total Gasoline

Thursday 25 July 2013

Managing Big Data with MongoDB


Update: Since writing this post I have changed my mind about MongoDB being an effective tick data storage solution. In my experience selection speeds are far too slow. MongoDB is document oriented and inherently unsuitable for storing time series. A more fitting solution may be a column oriented database. I know that KDB is a very fast column oriented database widely used for tick data storage but unfortunately it is only available on an expensive commercial license - I'm currently trying to find a good open source alternative to KDB.




Liquid exchange-traded instruments typically generate millions of ticks per trading session in the form of trades and quote changes. Storing such vast quantities of data in CSV files is not practical. There is a range of alternatives to CSV files, including binary file formats like HDF5, traditional relational databases such as PostgreSQL and MySQL, and NoSQL databases, with MongoDB and Cassandra being two of the most popular. I decided on MongoDB for my data storage needs, and in this post I catalogue my experiences setting up a database. Please note that the database schema I present is for illustrative purposes only; there is a large amount of online content that advises on how best to design a MongoDB schema.

The mongo installation includes a shell which allows the user to interact with the database from the command line. The shell can be started with the command mongo.
-bash$ mongo
MongoDB shell version: 2.4.5
connecting to: test
>

A database can be selected with the use command. In this example I select a new database called mydb. Please note that the database is not actually created until a document is inserted. For those users more familiar with traditional SQL databases, a collection is analogous to a table, and a document is equivalent to a row.
> use mydb
switched to db mydb

I then add 4 documents to mydb within a collection named trades, using the insert command. Each document represents a trade, with four fields: date, time, price, and volume. I use short field names (d, t, p, and v) because each document stores a copy of the field names and this can increase disk usage substantially when there are hundreds of millions of documents in a database.
> db.trades.insert( {d: "20130122", t: "08:01:22.298",  p: 96.23, v: 1 } )
> db.trades.insert( {d: "20130122", t: "08:01:22.698",  p: 96.24, v: 2 } )
> db.trades.insert( {d: "20130122", t: "08:01:23.256",  p: 96.23, v: 1 } )
> db.trades.insert( {d: "20130122", t: "08:01:23.557",  p: 96.24, v: 3 } )

The contents of the trades collection can be displayed using the find function with no parameters.
> db.trades.find().pretty()

{
 "_id" : ObjectId("51f2a8df6047fc37731a8f0a"),
 "d" : "20130122",
 "t" : "08:01:22.298",
 "p" : 96.23,
 "v" : 1
}
{
 "_id" : ObjectId("51f2a91c6047fc37731a8f0b"),
 "d" : "20130122",
 "t" : "08:01:22.698",
 "p" : 96.24,
 "v" : 2
}
{
 "_id" : ObjectId("51f2a9336047fc37731a8f0c"),
 "d" : "20130122",
 "t" : "08:01:23.256",
 "p" : 96.23,
 "v" : 1
}
{
 "_id" : ObjectId("51f2a9446047fc37731a8f0d"),
 "d" : "20130122",
 "t" : "08:01:23.557",
 "p" : 96.24,
 "v" : 3
}

The count function also correctly indicates that there are 4 documents in the trades collection.
> db.trades.count()

4

I then query the trades collection for all documents where field v is equal to 1, and two records are returned as expected.
> db.trades.find({v: 1})

{ "_id" : ObjectId("51f2a8df6047fc37731a8f0a"), "d" : "20130122", "t" : "08:01:22.298", "p" : 96.23, "v" : 1 }
{ "_id" : ObjectId("51f2a9336047fc37731a8f0c"), "d" : "20130122", "t" : "08:01:23.256", "p" : 96.23, "v" : 1 } 

However, the explain function shows that all four documents were scanned during the search. This isn't very efficient.
> db.trades.find({v: 1}).explain()

{
 "cursor" : "BasicCursor",
 "isMultiKey" : false,
 "n" : 2,
 "nscannedObjects" : 4,
 "nscanned" : 4,
 "nscannedObjectsAllPlans" : 4,
 "nscannedAllPlans" : 4,
 "scanAndOrder" : false,
 "indexOnly" : false,
 "nYields" : 0,
 "nChunkSkips" : 0,
 "millis" : 0,
 "indexBounds" : {
 },
 "server" : "yeats:27017"
}

The reason for this is that the trades collection has only one index by default, an auto-generated unique identifier.
> db.trades.getIndexes()
[
 {
  "v" : 1,
  "key" : {
   "_id" : 1
  },
  "ns" : "mydb.trades",
  "name" : "_id_"
 }
]

I then create an index for the trades collection based on the volume field by calling the ensureIndex function. A subsequent call to getIndexes confirms that a second index has been created.
> db.trades.ensureIndex({v: 1})

> db.trades.getIndexes()

[
 {
  "v" : 1,
  "key" : {
   "_id" : 1
  },
  "ns" : "mydb.trades",
  "name" : "_id_"
 },
 {
  "v" : 1,
  "key" : {
   "v" : 1
  },
  "ns" : "mydb.trades",
  "name" : "v_1"
 }
]

I run the same query again, and this time we can see that only two documents were scanned to find two matches. This may not seem like a big deal now, but if you don't create indexes in your database then your queries will be cripplingly slow when the collections grow in size. You can create multiple indexes for each collection in your database depending on which fields you expect to include in your queries.
> db.trades.find({v: 1}).explain()

{
 "cursor" : "BtreeCursor v_1",
 "isMultiKey" : false,
 "n" : 2,
 "nscannedObjects" : 2,
 "nscanned" : 2,
 "nscannedObjectsAllPlans" : 2,
 "nscannedAllPlans" : 2,
 "scanAndOrder" : false,
 "indexOnly" : false,
 "nYields" : 0,
 "nChunkSkips" : 0,
 "millis" : 38,
 "indexBounds" : {
  "v" : [
   [
    1,
    1
   ]
  ]
 },
 "server" : "myserver:27017"
}

Rather than manually creating indexes for each collection, I wrote a Python script that automatically iterates through all collections, calling the create_index function on each. Interoperability between Python and MongoDB is enabled by the PyMongo package. The following script creates three indexes for each collection, one based on date, a second on time, and a third compound index based on date and time.
from pymongo import MongoClient, ASCENDING

client = MongoClient('localhost', 27017)
db = client.mydb
collections = sorted(db.collection_names())

for s in collections:
    if s != "system.indexes":
        db[s].create_index([("d", ASCENDING)])
        db[s].create_index([("t", ASCENDING)])
        db[s].create_index([("d", ASCENDING),
                            ("t", ASCENDING)])
        print("Indexes created for collection", s)

Having executed this script on my database, queries based on date and time ranges are completed in a timely fashion. The following command returns all trades between 8am and 4pm on Jan 4th 2012, sorted in ascending order by date and time.
> db.trades.find( {d: "20120104", t: {$gte: "08:00:00.000", $lt: "16:00:00.000"} } ).sort( {d: 1, t: 1} )

Of course, updating your database document by document is not feasible. There are a number of options for automatically populating the database. One option is to use the mongoimport command from bash to import a CSV file to the database where each row is interpreted as a document.
-bash$ mongoimport -d taq -c trades --type csv --file trades.csv --headerline 

A second option is to write a batch insert function in C++ and connect to MongoDB using the C++ drivers. I developed a C++ function called insertTrades which parses the contents of a CSV file and creates BSONObj objects, which are then inserted into MongoDB.
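The original insertTrades source was embedded as a snippet and isn't shown here. As an illustration of the same parse-then-batch-insert idea, here is a sketch in Python with PyMongo instead of the C++ drivers; parse_trades and the CSV layout (date,time,price,volume) are my assumptions based on the d/t/p/v schema above.

```python
import csv
import io


def parse_trades(csv_text):
    """Parse CSV rows of date,time,price,volume into trade documents
    using the short field names (d, t, p, v) from the schema above."""
    reader = csv.reader(io.StringIO(csv_text))
    return [{"d": d, "t": t, "p": float(p), "v": int(v)}
            for d, t, p, v in reader]


docs = parse_trades("20130122,08:01:22.298,96.23,1\n"
                    "20130122,08:01:22.698,96.24,2\n")

# With a running mongod, the documents could then be inserted in one batch:
# from pymongo import MongoClient
# MongoClient().mydb.trades.insert(docs)
```

Passing a list of documents in a single insert call avoids one network round trip per trade, which is the whole point of the batch approach.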


Querying and parsing trades from the database can also be achieved using the C++ Drivers. The following C++ function returns a vector of trade custom type objects for a single trading session specified by the key parameter.
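That C++ function isn't reproduced either, but the query-and-parse step can be sketched in Python. Trade here is a hypothetical container standing in for the C++ custom trade type, and the commented query uses the d/t/p/v schema from above.

```python
from collections import namedtuple

# Hypothetical container mirroring the C++ custom trade type.
Trade = namedtuple("Trade", ["date", "time", "price", "volume"])


def docs_to_trades(docs):
    """Convert raw trade documents (d/t/p/v schema) into Trade objects."""
    return [Trade(doc["d"], doc["t"], doc["p"], doc["v"]) for doc in docs]


# With a live database the docs would come from a query keyed on the session:
# docs = MongoClient().mydb.trades.find({"d": key}).sort([("t", 1)])
docs = [{"d": "20130122", "t": "08:01:22.298", "p": 96.23, "v": 1}]
trades = docs_to_trades(docs)
```

Returning a vector (here, a list) of typed objects keeps the rest of the research code independent of the database driver.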


So far my MongoDB experiences have been very positive! I will post more on this topic in the near future. Thanks for reading.

Monday 22 July 2013

Python for Quant Research

I have been a daily user of Matlab for the last five years and find it to be an excellent tool for carrying out quantitative research. However, having recently started to learn Python, I am now officially a Python convert and favour this intuitive language as my tool of choice for quickly prototyping ideas. The Python community is open source and growing rapidly, with no shortage of excellent packages on offer. Python also possesses a high level of interoperability with C/C++, and with R via the RPy2 package.

There are a number of open source Machine Learning packages available, including Statsmodels, PyML, PyBrain, MLPy, milk, and scikit-learn. These packages come with implementations of various regression and classification algorithms, from Logistic Regression and Support Vector Machines to Random Forests and Neural Networks. Each package has its own pros and cons, and some focus on particular algorithms. scikit-learn is my personal favourite ML Python package. Some other useful packages that I have installed include SciPy, a library of scientific computing routines; NumPy, an N-dimensional array package; Matplotlib for 2D plotting; Pandas for data structures and analysis; and IPython, an interactive console for editing and running Python code.
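To give a taste of the scikit-learn interface mentioned above, here is a minimal sketch (not from the original post): it fits a Logistic Regression on a tiny made-up one-feature dataset just to show the fit/predict pattern shared by all of the library's estimators.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature dataset: class 0 below the boundary, class 1 above it.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Every scikit-learn estimator exposes the same fit/predict pair.
clf = LogisticRegression()
clf.fit(X, y)

preds = clf.predict(np.array([[0.5], [2.5]]))
```

Swapping in a Support Vector Machine or Random Forest is a one-line change to the estimator class, which is what makes the package so convenient for prototyping.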

R has a mature set of statistical packages on offer which can be called from Python. Thus, I set about installing R, then RPy2, and finally RMetrics, which is the most comprehensive R package suite for analysing financial time series. I decided to build R from source rather than download a binary. Use wget to download the source.

-bash$ wget http://ftp.heanet.ie/mirrors/cran.r-project.org/src/base/R-3/R-3.0.1.tar.gz

Extract using tar, and enter the source directory.

-bash$ tar -zxvf R-3.0.1.tar.gz ; cd R-3.0.1

You must configure the build with the --enable-R-shlib option as this makes R a shared library, which is a prerequisite for the RPy2 installation.

-bash$ ./configure --prefix=$HOME/.local  --enable-R-shlib

The R make process can take a while so I put it into the background, and detach from the process with disown so that it does not terminate if I close my shell. I pipe the stdout and stderr to a text file.

-bash$ make  &> make.txt &
-bash$ disown -h

I can then keep track of the make progress by tailing this file.

-bash$ tail -f make.txt
Once the make is complete I install.

-bash$ make install

With R successfully installed I download the latest rpy2 package and extract.

-bash$ wget -O rpy2-2.3.1.tar.gz "http://sourceforge.net/projects/rpy/files/latest/download?source=files"
-bash$ tar -zxvf rpy2-2.3.1.tar.gz ; cd rpy2-2.3.1

Next, update the relevant environment variables in your .bash_profile. This will vary depending on your installation, check the installation guidelines for more. Finally, install!
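As an example of the environment variable step, assuming the $HOME/.local prefix used in the configure command above (your paths will differ), the .bash_profile additions might look like:

```shell
# Example only - adjust paths to match your installation prefix ($HOME/.local here).
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib64/R/lib:$LD_LIBRARY_PATH
```

The LD_LIBRARY_PATH entry is what lets RPy2 locate the shared R library built with --enable-R-shlib.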

-bash$ python setup.py install

I then ran a test Python script from the rpy2 introduction.
import rpy2.robjects as robjects
pi = robjects.r['pi']
print(pi[0])
And the script output the value of pi as expected!

-bash$ python rp2_test.py
3.141592653589793
Next I installed the excellent RMetrics from the R shell.

-bash$ R
> source("http://www.rmetrics.org/Rmetrics.R")
> install.Rmetrics()

Terminal Setup


In this post I showcase a number of screenshots that give insight into my setup. I carry out my research and development work within a Unix environment, typically working on a remote machine via ssh. Vim is my editor of choice. My standard setup is a 2x2 grid of shells, which allows me to work on multiple tasks simultaneously.




Python is my tool of choice for quickly prototyping ideas, while C/C++ is my favoured core development language. For data I use the open source NoSQL database MongoDB.

Working with big datasets is achievable due to the RAM specification of the remote machine.