Sunday 29 December 2013

Optimal Betting Strategy for Acey Deucey (card game)

In this post I develop a betting strategy for the popular card game Acey Deucey. Favourable situations where the player has a positive expectation are identified, and optimal bet sizes are determined using the Kelly criterion. The resulting strategy maximises the long-run growth rate of the player's wealth. Although full Kelly is optimal in the sense of growth rate maximisation, returns can be quite volatile in the short term. One common approach to reducing volatility is to bet a fraction of the full Kelly stake at the expense of a lower expected growth rate.

Having determined the optimal strategy, I carried out a numerical simulation to estimate the distribution of terminal wealth after a fixed number of rounds for a specified starting wealth. I implemented the simulation in both Matlab and Python. Although the game is simple relative to a real-world financial investment strategy, it serves as a very useful prototype for exploring the problem of position sizing and risk management.

Rules of the Game

The player is dealt two cards and bets that a third card will fall numerically between the first two. Cards 2 to 10 take their face value, while Jack, Queen, and King are valued 11, 12, and 13, respectively. If the first card is an Ace then the player can declare it low with a value of 1 or high with a value of 14. If an Ace is dealt as the second card then the high value of 14 is automatically assumed.

If the third card lands between the first two then the player takes an amount from the pot equal to the bet. If the third card is equal to either of the first two cards, the player must contribute an amount to the pot equal to twice the bet. Finally, if the third card is less than the first card or greater than the second card (cards one and two are always assumed to be in ascending order), then the player must contribute an amount to the pot equal to the bet. Thus, there are three possible outcomes for each round, assuming that the player makes a bet. At the beginning of each round the player contributes a fixed amount, called an ante, to the pot.

Note that there are many variations on the rules of this game, and the results presented below may change when additional rules are included. I make the following assumptions (a short sketch of how a round settles follows the list):

  • Wealth is measured in dollars.
  • Dollars are infinitely divisible.
  • Infinite pot.
  • Infinite deck.
  • Ante of 1 dollar per round.
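
To make the rules and assumptions concrete, here is a minimal sketch of how a single round settles. The function name, its signature, and the outcome labels are my own; it simply encodes the three payoffs and the 1 dollar ante described above.

def settle_round(wealth, bet, outcome, ante=1.0):
    """Update wealth for one round of Acey Deucey. outcome is 'between', 'match', or 'outside'."""
    wealth -= ante                      # the ante is paid into the pot every round
    if outcome == 'between':            # third card lands strictly between the first two
        wealth += bet
    elif outcome == 'match':            # third card equals either of the first two
        wealth -= 2*bet
    else:                               # third card falls outside the first two
        wealth -= bet
    return wealth

wealth = settle_round(100.0, 5.0, 'between')   # e.g. dealt a 3 and a Jack, bet 5 dollars, and won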

Expected Value

Each situation in the game is associated with an expected value. Let A be the event that the third card is between the first two cards. The probability of event A occurring is $ P(A) = \frac{j-i-1}{13} = \frac{g}{13} $, where i and j are the numerical values of the first and second cards, respectively, and $g = j - i - 1$ is the gap between them. The payoff for this event is defined by $ Q(A)= x $, where x is the bet size. Let B be the event that the third card is outside the first two cards. The probability of B is $ P(B)= \frac{13-g-2}{13} = \frac{11-g}{13} $ and the payoff is $ Q(B)= -x $. Finally, let C be the event that the third card is equal to either of the first two. This event occurs with probability $ P(C)= \frac{2}{13}$. The payoff for C is a loss equal to twice the bet, $ Q(C)= -2x $. Thus, $ P(A)+P(B)+P(C) = 1$. The table below lists the event probabilities and the associated payoffs.


Event    Probability P                  Payoff Q
A        $P(A) = \frac{g}{13}$          $Q(A) = x$
B        $P(B) = \frac{11-g}{13}$       $Q(B) = -x$
C        $P(C) = \frac{2}{13}$          $Q(C) = -2x$

The expected value of a bet is defined as $ EV = P(A)Q(A) + P(B)Q(B) + P(C)Q(C) $. Inserting the expressions from the table above, the expected value can be written as $ EV = \frac{g}{13} x - \frac{11-g}{13} x - \frac{4}{13} x $, which simplifies to

$$ EV = \frac{(2g-15)}{13} x $$

where g is the gap between the cards (j-i-1). For example, if the first card is a 3 and the second card is a Jack then g is 7. 

Favourable Situations

Favourable situations are those which offer positive expectation. The table below lists the expected value for each situation. It is clear that the player should only be betting when the gap is greater than seven.

g     EV
0     $-\frac{15}{13} x$
1     $-x$
2     $-\frac{11}{13} x$
3     $-\frac{9}{13} x$
4     $-\frac{7}{13} x$
5     $-\frac{5}{13} x$
6     $-\frac{3}{13} x$
7     $-\frac{1}{13} x$
8     $+\frac{1}{13} x$
9     $+\frac{3}{13} x$
10    $+\frac{5}{13} x$
11    $+\frac{7}{13} x$
12    $+\frac{9}{13} x$

 Optimal Bet Size


So, we will only bet when the gap is greater than seven. However, we still haven't addressed the question of how much to bet in each of the five favourable situations identified. Maximising the expected value of our decision is not sensible, as this would mean betting all of our wealth and running a substantial risk of ruin (wealth going to zero). Instead, our goal should be to maximise the growth rate of our wealth over the long run. Thus, there is a trade-off between aggressiveness and the probability of ruin. We want to maximise the following function with respect to x, the bet expressed as a fraction of current wealth:
$$ f(x) = p_1 \ln (1 + x) + p_2 \ln(1-x) + p_3 \ln(1-2x) $$
where $p_1$, $p_2$, and $p_3$ are $P(A)$, $P(B)$, and $P(C)$, respectively. The five figures below illustrate the concavity of this function over the maximisation region, which varies depending on the gap g. The value of x which maximises this function is the optimal Kelly bet. Intuitively, the recommended bet increases with the gap.






We can obtain the maximum of f(x) algebraically by setting the first derivative equal to zero and solving for x. This yields a quadratic equation; taking the root that lies in the admissible region $0 < x < \tfrac{1}{2}$ gives the following expression for the optimal bet, denoted $x^*$

$$ x^* = \frac{ 3p_1-p_2-\sqrt{p_1^2-6p_1p_2+8p_1p_3+9p_2^2+24p_2p_3+16p_3^2} } { 4(p_1+p_2+p_3)} $$
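
As a sanity check, the closed-form bet can be compared with a direct numerical maximisation of f(x). The sketch below is mine (the function names are arbitrary) and assumes SciPy is available for the one-dimensional optimisation.

import numpy as np
from scipy.optimize import minimize_scalar

def kelly_closed_form(p1, p2, p3):
    """Closed-form optimal bet fraction from the expression above."""
    disc = p1**2 - 6*p1*p2 + 8*p1*p3 + 9*p2**2 + 24*p2*p3 + 16*p3**2
    return (3*p1 - p2 - np.sqrt(disc)) / (4*(p1 + p2 + p3))

def kelly_numeric(p1, p2, p3):
    """Maximise f(x) = p1*ln(1+x) + p2*ln(1-x) + p3*ln(1-2x) over 0 < x < 1/2."""
    f = lambda x: -(p1*np.log(1+x) + p2*np.log(1-x) + p3*np.log(1-2*x))
    return minimize_scalar(f, bounds=(0.0, 0.499), method='bounded').x

for g in range(8, 13):                                  # the five favourable gaps
    p1, p2, p3 = g/13.0, (11-g)/13.0, 2/13.0
    print("g=%d  closed form=%.4f  numeric=%.4f"
          % (g, kelly_closed_form(p1, p2, p3), kelly_numeric(p1, p2, p3)))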

The figure below shows the value of $x^*$ for each of the favourable situations. 


Likely Range in Wealth

Assuming one follows the optimal strategy outlined above, what range might one expect in one's wealth over a given number of rounds? To address this question I wrote a simulation of the game. The two figures below present the equity curves of 30 simulations, each consisting of 300 rounds and starting from a wealth of 100 dollars. As expected, the returns in the second figure, which uses quarter Kelly, are less volatile than those in the first, which uses full Kelly.
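
A minimal sketch of the simulation logic is given below. The function names, the ace handling (a first-card ace is set to whichever value gives the wider gap), and the fraction argument (1.0 for full Kelly, 0.25 for quarter Kelly) are my own; the outcome of each round is drawn directly from the infinite-deck probabilities developed above rather than by dealing an explicit third card.

import numpy as np

def optimal_bet_fraction(g):
    """Closed-form Kelly fraction for gap g; zero when the situation is unfavourable."""
    if g < 8:
        return 0.0
    p1, p2, p3 = g/13.0, (11-g)/13.0, 2/13.0
    disc = p1**2 - 6*p1*p2 + 8*p1*p3 + 9*p2**2 + 24*p2*p3 + 16*p3**2
    return (3*p1 - p2 - np.sqrt(disc))/4.0              # p1 + p2 + p3 = 1

def simulate_path(rounds=300, wealth=100.0, ante=1.0, fraction=1.0):
    """Simulate one equity curve; fraction scales the Kelly bet (0.25 = quarter Kelly)."""
    path = [wealth]
    for _ in range(rounds):
        wealth -= ante                                  # the ante is paid every round
        c1, c2 = np.random.randint(1, 14, size=2)       # two ranks in 1..13, ace = 1
        if c2 == 1:                                     # a second-card ace is always high
            c2 = 14
        if c1 == 1:                                     # first-card ace: low or high, whichever gap is wider
            c1 = 1 if (c2 - 2) >= (13 - c2) else 14
        i, j = min(c1, c2), max(c1, c2)
        g = j - i - 1
        bet = fraction*optimal_bet_fraction(g)*wealth
        u = np.random.rand()
        if u < g/13.0:                                  # third card lands between: win the bet
            wealth += bet
        elif u < (g + 2)/13.0:                          # third card matches: pay twice the bet
            wealth -= 2*bet
        else:                                           # third card falls outside: lose the bet
            wealth -= bet
        path.append(wealth)
    return path

paths = [simulate_path(fraction=1.0) for _ in range(30)]    # 30 full Kelly equity curves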



However, we can't tell a lot from the figures above as the number of simulations is too small to make an accurate estimate of the distribution. To do this I simulate 50 thousand paths and, for each round, plot an empirical estimate of the mean (in red) and the 10th to 90th percentiles (in green). For these simulations I assume a starting wealth of 1000 dollars. Assuming full Kelly betting, we expect to turn 1000 dollars into about 6000 dollars over 300 rounds. However, the difference between the 10th and 90th percentiles is about 16000 dollars.


In contrast, quarter Kelly results in a lower expected wealth of about 1200 dollars, with a difference of about 1000 dollars between the 10th and 90th percentiles.


I have many ideas for extending this analysis and I may post again on this topic in the future. Thanks for reading - and don't forget to apply this strategy the next time you sit down to play Acey Deucey with friends ;)

Friday 4 October 2013

Modeling Mean Reversion for Statistical Arbitrage

The goal of a classical statistical arbitrage strategy is to extract returns by holding combinations of instruments that follow a mean reverting process. A classic example is a so-called Pairs Trading Strategy that benefits from the mean reverting relationship between correlated stocks such as Coca-Cola and Pepsi. A second example is a Yield Curve Butterfly Strategy that attempts to capture the mean reverting relationship between short, medium, and long-term interest rates. Mean reversion is an attractive property as one can extract returns from the market by selling the peaks and buying the troughs of the wave. The amplitude and frequency are two characteristics of a wave that are important in stat arb: the amplitude determines the profit potential of a single cycle in the wave and the frequency determines how many of these opportunities arise over a period of time. We can simulate a sine wave to illustrate these ideas with the following form $$ y(t) = A \cdot \sin ( 2 \pi f t ) $$

where A is amplitude and f is frequency. The following figure shows two sine waves. The red wave has a frequency of 1 while the yellow wave has a frequency of 2. Thus, the latter wave oscillates more frequently and so presents more trading opportunities.


Of course, the greater the amplitude of the waves the greater the profit potential of an individual trade. The following plot shows a sine wave with an amplitude of 1 in red, and an amplitude of 2 in yellow, with both waves having a frequency of 2. In this case we prefer to trade the yellow series as we can make more profit in the same amount of time.


Although it is possible to find mean-reverting trades, in practice the waves are noisy. The introduction of noise makes it more difficult to statistically identify mean reverting behaviour and also makes risk management a more challenging task. We can simulate noisy waves by including a noise term in the above equation $$ y(t) = A \cdot \sin ( 2 \pi f t) + \epsilon_t $$
where $\epsilon_t$ is a noise variable that follows some distribution. The following figure shows a sine wave with and without normally distributed noise.



In my experience the noise tends to be non-normal. For example, the Bund-Bobl-Schatz butterfly spread is mean-reverting, but it is subject to irregular extreme outliers that can result in large losses. To illustrate this, the following figure shows a sine wave with and without Student t distributed noise with 3 degrees of freedom. From the figure it is clear that there are multiple occasions where huge fat tailed events could have resulted in margin calls on positions that, at the time, were fading the spread.
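
For completeness, here is a minimal sketch of how such noisy waves can be simulated. The amplitude, frequency, and noise scale are arbitrary choices of mine; the figures in this post were not necessarily produced with this exact code.

import numpy as np
import matplotlib.pyplot as plt

A, f = 1.0, 2.0                                   # amplitude and frequency
t = np.linspace(0.0, 1.0, 500)                    # one unit of time, 500 samples
clean = A*np.sin(2*np.pi*f*t)

np.random.seed(42)
normal_noise  = 0.2*np.random.standard_normal(len(t))        # thin-tailed noise
student_noise = 0.2*np.random.standard_t(3, size=len(t))     # fat-tailed noise, 3 degrees of freedom

plt.plot(t, clean, label='clean')
plt.plot(t, clean + normal_noise, label='normal noise')
plt.plot(t, clean + student_noise, label='Student t noise (3 df)')
plt.legend()
plt.show()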




In finance the most popular model for explaining mean reverting behaviour is the Ornstein-Uhlenbeck (OU) model. The stochastic differential equation (SDE) for the OU process is given by

$$ d S_t = \lambda ( \mu - S_t ) dt + \sigma d W_t $$
with $\lambda$ the mean reversion rate, $\mu$ the mean, and $\sigma$ the volatility. A useful metric is the Half-Life, denoted H, which is the expected time it takes a deviation from the mean to decay to half its size. The Half-Life is defined as
$$ H = \frac{\ln (2) } {\lambda} $$
For example, a mean reversion rate of $\lambda = 0.1$ per day implies a Half-Life of roughly 6.9 days. All other things held constant, we prefer a low Half-Life as this results in more trading signals over a given period of time.

A previous blog post outlines simple closed form solutions for calibrating the OU model parameters to a given dataset using the least squares regression method and the maximum likelihood method. The author provides Matlab code for the calibrations. I decided to implement his functions in Python. The Python implementation of the least squares calibration is given below

def OU_Calibrate_LS(S,delta):
    """Least squares calibration of the OU parameters from a series S sampled every delta units of time."""
    from numpy import asarray,sum,power,sqrt,log
    S = asarray(S,dtype=float)   # ensure element-wise products work even if S is a plain list
    n = len(S)-1

    # sums used in the regression of S[t+1] on S[t]
    Sx = sum(S[0:-1])
    Sy = sum(S[1: ])
    Sxx = sum(power(S[0:-1],2))
    Sxy = sum( S[0:-1] * S[1: ])
    Syy = sum(power(S[1: ],2))

    # slope, intercept, and residual standard deviation of the AR(1) regression
    a  = ( n*Sxy - Sx*Sy ) / ( n*Sxx - power(Sx,2) )
    b = ( Sy - a*Sx ) / n
    sd = sqrt( (n*Syy - power(Sy,2) - a*(n*Sxy - Sx*Sy) )/n/(n-2) )

    # map the discrete-time AR(1) coefficients back to the continuous-time OU parameters
    lambda_ = -log(a)/delta
    mu     = b/(1-a)
    sigma  =  sd * sqrt( -2*log(a)/delta/(1-power(a,2)) )
    return (mu,sigma,lambda_)
And the maximum likelihood implementation is given by
def OU_Calibrate_ML(S,delta):
    """Maximum likelihood calibration of the OU parameters from a series S sampled every delta units of time."""
    from numpy import asarray,sum,power,sqrt,log,exp
    S = asarray(S,dtype=float)   # ensure element-wise products work even if S is a plain list
    n = len(S)-1

    Sx = sum(S[0:-1])
    Sy = sum(S[1: ])
    Sxx = sum(power(S[0:-1],2))
    Sxy = sum( S[0:-1] * S[1: ])
    Syy = sum(power(S[1: ],2))

    # closed-form ML estimate of the long-run mean
    mu  = (Sy*Sxx - Sx*Sxy) / ( n*(Sxx - Sxy) - (power(Sx,2) - Sx*Sy) )

    # ML estimate of the mean reversion rate
    lambda_ = -log( (Sxy - mu*Sx - mu*Sy + n*power(mu,2)) / (Sxx -2*mu*Sx + n*power(mu,2)) ) / delta

    # variance of the discrete-time innovations, then the implied SDE volatility
    a = exp(-lambda_*delta)
    sigmah2 = (Syy - 2*a*Sxy + power(a,2)*Sxx - 2*mu*(1-a)*(Sy - a*Sx) + n*power(mu,2)*power(1-a,2))/n
    sigma = sqrt(sigmah2*2*lambda_/(1-power(a,2)))

    return (mu,sigma,lambda_)
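
As a quick sanity check of the calibration routines, one can simulate an OU path with known parameters using the exact discretisation $ S_{t+\delta} = S_t e^{-\lambda \delta} + \mu (1 - e^{-\lambda \delta}) + \epsilon_t $, where $\epsilon_t$ is Gaussian with variance $\sigma^2 (1 - e^{-2\lambda\delta}) / (2\lambda)$, and confirm that the estimates are recovered. The sketch below is mine and the parameter values are arbitrary.

def simulate_OU(mu,sigma,lambda_,delta,n,S0):
    """Simulate an OU path using the exact discretisation of the SDE."""
    from numpy import exp,sqrt,empty
    from numpy.random import standard_normal
    a = exp(-lambda_*delta)
    sd = sigma*sqrt((1-a**2)/(2*lambda_))   # standard deviation of the discrete-time innovation
    S = empty(n+1)
    S[0] = S0
    for t in range(n):
        S[t+1] = S[t]*a + mu*(1-a) + sd*standard_normal()
    return S

S = simulate_OU(mu=1.0,sigma=0.3,lambda_=2.5,delta=1.0/250,n=5000,S0=1.0)
print(OU_Calibrate_LS(S,1.0/250))    # estimates should be close to (1.0, 0.3, 2.5)
print(OU_Calibrate_ML(S,1.0/250))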


In a future post I will investigate a number of potential mean reverting opportunities within the futures market, using the OU model as a guide. Thanks for reading!



Saturday 31 August 2013

Naive Bayes Classification

Although it is unlikely to provide a competitive advantage in its standard form, the Naive Bayes classification algorithm is relatively simple and intuitive, which makes it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of the algorithm are powerful and, with suitable extension and modification, have the potential to form the basis of a profitable trading system.

As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return to some horizon $\tau$ into the future, defined as $$ Y_{T_0 + \tau} = \ln \left( S_{T_0 + \tau}\right) - \ln \left( S_{T_0} \right) $$ where $S$ denotes the price of the asset we are attempting to predict, $\ln$ denotes the natural logarithm, and $T_0$ indicates the current period. The aim of the algorithm is to predict Y, conditional on the state of a set of signals. Let $X = \langle X_1, X_2, \dots, X_k \rangle$ be a vector of real valued stochastic signals. Both Y and X are standardized by subtracting a rolling mean and dividing by some rolling measure of dispersion. Bayes' Theorem allows us to relate Y and X and is defined as
$$ P(Y=y | X = x) = \frac { P(Y=y)\,P(X = x | Y = y) } { P (X = x) } $$
where $x = \langle x_1, x_2, \dots, x_k \rangle$ is an observed signal vector, $X=x$ is shorthand for $X_1 = x_1 \wedge X_2 = x_2 \wedge \dots \wedge X_k = x_k $, and $y$ is an observed value of Y. Since the denominator $ P(X = x) $ is invariant across values of Y, the formula can be simplified to $$ P(Y=y | X = x) \propto P(Y=y)P(X = x | Y = y) $$ However, there is an issue with this formulation. Given that Y and X are real valued variables it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define C as a categorical variable which takes the value $c=1$ when $ Y < -2\sigma$, the value $c=3$ when $ Y > 2\sigma$, and the value $c=2$ otherwise. We apply the same discretization procedure to each signal $X_i$ to form a corresponding categorical variable $X_i^*$. Thus, states 1 and 3 indicate 2 sigma events in the left and right tails, respectively. The new formula reads
$$ P(C=c | X^* = x^*) \propto P(C=c)\,P(X^* = x^* | C = c)  $$
Assuming two predictor signals, the expression $P(C=3 | X_1^*=1 \wedge X_2^*=3 )$ reads "the probability of Y experiencing a positive 2 sigma event conditional on a negative 2 sigma event in signal $X_1$ and a positive 2 sigma event in $X_2$". Thus, the probabilistic expression maps to a plain English sentence, which I prefer over more cryptic, black-box algorithms. I have set the thresholds at the 2 sigma levels in this example. In practice these thresholds are parameters which can be optimized and recursively adjusted to account for changes in the underlying dynamics of a signal's distribution.

The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We can't guarantee that an out-of-sample value of the input vector $x^*$ exists in our training set, which means that we may not be able to estimate $P(X^*=x^*| C = c)$. The so-called Naive Bayes assumption addresses this issue by assuming that the signals $X_1,X_2,\dots,X_k$ are conditionally independent of each other given a particular class c. This assumption allows us to express $P(X^*=x^*| C = c)$ in the following form $$ P(X^*=x^*| C = c) = P\big( \bigwedge_i X_i^* = x_i^* \,\big|\, C=c\big) = \prod_i P(X_i^*=x_i^* | C=c) $$ This means that $P(C=c | X^* = x^*)$ can now be expressed as $$ P(C=c | X^* = x^*) \propto P(C=c) \prod_i P(X_i^*=x_i^* | C=c) $$ "Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). Note that Naive Bayes assumes the covariance matrix is diagonal within each class, whereas standard PCA does not consider class when rotating the data; an alternative procedure such as class-conditional Independent Component Analysis is needed for that. Finally, in typical applications there may be a large number of signals; PCA can be used to extract a set of orthogonal features smaller than the original signal set, at the cost of some lost variance. I will investigate this issue in a future post!
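
A bare-bones sketch of the estimation and scoring is shown below. It uses simple frequency counts with add-one smoothing (the smoothing is my addition), and the synthetic training data and variable names are purely illustrative.

import numpy as np

np.random.seed(1)
n, k = 5000, 2                                   # training examples and signals
C = np.random.randint(1, 4, size=n)              # discretised target states 1..3
X = np.random.randint(1, 4, size=(n, k))         # discretised signal states 1..3

states = [1, 2, 3]
prior = {c: np.mean(C == c) for c in states}     # P(C = c)

# P(X_i* = s | C = c) with add-one (Laplace) smoothing
cond = {}
for c in states:
    rows = X[C == c]
    for i in range(k):
        for s in states:
            cond[(i, s, c)] = (np.sum(rows[:, i] == s) + 1.0) / (len(rows) + len(states))

def posterior(x_star):
    """P(C = c | X* = x*) under the Naive Bayes assumption, normalised over the classes."""
    scores = {c: prior[c] * np.prod([cond[(i, x_star[i], c)] for i in range(k)])
              for c in states}
    total = sum(scores.values())
    return {c: scores[c] / total for c in states}

print(posterior([1, 3]))   # e.g. P(C = 3 | X1* = 1, X2* = 3)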

Friday 30 August 2013

Pairwise Linear Correlation (CL/RB/HO)


The prices of crude oil, gasoline, and heating oil are highly correlated for obvious reasons. In this post I present an estimate of the pairwise linear correlations between the futures contracts on these products. The correlations are estimated using a short sample of two weeks. The following figure shows the cumulative return over the sample period, where the flat segment represents Saturday and Sunday, when the market is closed.


I estimate the pairwise correlations for a range of sampling frequencies from 1 second out to 120 seconds. I use samples of the midpoint to avoid the impact of the bid-ask bounce effect. The correlations break down as we approach the highest sampling frequency, which is a well documented phenomenon called the Epps effect. That said, the correlations remain above 50% at the one second frequency, which is consistent with high frequency spreader algorithms trading these contracts at sub-second frequencies and, in the process, strengthening the correlations.
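
A sketch of the estimation logic is given below. It assumes the one-second midpoints are already loaded into a pandas DataFrame called mid with columns 'CL', 'RB', and 'HO' indexed by timestamp; the function name and input layout are my assumptions.

import numpy as np

def correlations_by_frequency(mid, seconds):
    """Pairwise correlation of log returns for each sampling interval in `seconds`."""
    out = {}
    for s in seconds:
        sampled = mid.iloc[::s]                      # keep every s-th one-second midpoint
        returns = np.log(sampled).diff().dropna()    # log returns at that sampling frequency
        out[s] = returns.corr()                      # 3 x 3 pairwise correlation matrix
    return out

corrs = correlations_by_frequency(mid, range(1, 121))
print(corrs[1].loc['CL','RB'])                       # CL/RB correlation at the one second frequency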


Of course, these correlations evolve over time and will vary depending on the sample used to estimate them. However, the same pattern emerges: correlations diminish as the sampling frequency increases.

Wednesday 28 August 2013

Intraday Volatility (CL/RB/HO)

I implemented a high frequency volatility estimator that takes one second snapshots of the midpoint as input. The following figure shows minute by minute volatility estimates (scaled to one hour) for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The median estimate for each minute over 250 trading days is used to illustrate the diurnal volatility pattern. The most significant volatility spikes can be easily matched with well known market events such as the open, the close, and economic figure announcements, as well as less obvious events such as minute markers and settlements. Some of the volatility spikes are caused by weekly (as in the case of the crude oil inventory figure) or monthly (as in the case of German Manufacturing PMI) announcements.



In this post I go through the Python code I developed to create this figure. The steps include querying a MongoDB database, aggregating results, computing volatility, and plotting.

We begin by connecting to MongoDB via the PyMongo package.
    client = MongoClient('localhost', 27017)   # connect to the local MongoDB instance
    db = client.taq                            # the TAQ database
    quote_collection = db['CL_quotes']         # collection of CL quote documents
I then query the available list of dates
    dates = quote_collection.find({},{"d" : 1 }).distinct("d")
    dates = sorted(dates)
I don't want to include exchange holidays or expiration dates in my analysis. The following function calls retrieve the relevant dates.
    holidays= sd.holidays()
    expirationdays= sd.CL_expirationdates()
I then filter the date list.
    dates_filtered = [x for x in dates if x not in holidays]
    dates_filtered = [x for x in dates_filtered if x not in expirationdays]
I also filter Sundays.
    dates_filtered = [x for x in dates_filtered if x.weekday() != 6]
For each date "dt" in dates_filtered I specify an end date to be used in the database query following.
    start_dt = dt
    end_dt = dt
    end_dt = end_dt + datetime.timedelta(days=1)
I then write an aggregation query to sample bid and ask quotes at one second intervals. The aggregation framework is a recent addition to MongoDB. It is pretty fast, but apparently the Mongo developers are in the process of making it faster. This query aggregates bid and ask quotes into 1 second buckets and returns the closing quotes for each second. The aggregation process occurs at database level. The alternative is to load the quotes into a Python data structure to perform the aggregation.

    docs = quote_collection.aggregate([
        {"$match": {"dt": {"$lt" : end_dt, "$gte" : start_dt  }}},
        {"$project": {
             "year":       {"$year": "$dt"},
             "month":      {"$month": "$dt"},
             "day":        {"$dayOfMonth": "$dt"},
             "hour":       {"$hour": "$dt"},
             "minute":     {"$minute": "$dt"},
             "second":     {"$second": "$dt"},
             "dt": 1,
             "ap": 1,
             "bp": 1 }},
         {"$sort": {"dt": 1}},
         {"$group":
           {"_id" : {"year": "$year", "month": "$month", "day": "$day", "hour": "$hour","minute":"$minute","second":"$second" },
                        "bp_close":  {"$last": "$bp"},
                        "ap_close":  {"$last": "$ap"} }} ] )

The following code iterates through the result set and creates lists of closing bid and ask quotes and corresponding timestamps.
    timestamp = []
    bp_close=[]
    ap_close=[]

    for doc in docs['result']:
        year = doc['_id']['year']
        month = doc['_id']['month']
        day = doc['_id']['day']
        hour = doc['_id']['hour']
        minute = doc['_id']['minute']
        second = doc['_id']['second']
        timestamp.append(datetime.datetime(year,month,day,hour,minute,second))
        bp_close.append(doc['bp_close'])
        ap_close.append(doc['ap_close'])

I then wrap these lists in Pandas Series objects.
    timestamp_series = pd.Series(data=timestamp)
    timestamp_series.name='Timestamp'

    bp_series = pd.Series(data=bp_close)
    bp_series.name='BidPrice'

    ap_series = pd.Series(data=ap_close)
    ap_series.name='AskPrice'
I then concatenate the series to form a Pandas Dataframe object. The Dataframe is my data structure of choice for working with time series data in Python.
    df_in = pd.concat([pd.DataFrame(timestamp_series),pd.DataFrame(bp_series),pd.DataFrame(ap_series)], axis=1)
    df_in = df_in.set_index('Timestamp')
    df_in = df_in.sort()

The aggregation process omits 1 second intervals where no quote was available. However, I want a full range from 00:00:00 to the final 1 second period of the day 23:59:59.
    day_start = datetime.datetime(year,month,day,0,0,0)
    day_end = day_start + datetime.timedelta(days=1)
    step = datetime.timedelta(seconds=1)
    full_range = []
    ts = day_start
    while ts < day_end:
        full_range.append(ts)
        ts += step
I then create a Pandas Series from this range and an empty DataFrame indexed on it.
    range_series = pd.Series(data=full_range)
    range_series.name='Timestamp'
    df = pd.DataFrame(index=range_series,columns=['BidPrice','AskPrice' ])
I then paste the empirical Dataframe into the full range column by column.
    for col in df_in.columns:
        df[col] = df_in[col]
The missing intervals will contain NaN values which I fill by copying forward previous quotes.
    df = df.fillna(method='pad')
The midpoint is then calculated and a new column created in the Dataframe.
    df['m'] = pd.Series( (df['BidPrice']+df['AskPrice'])/2,index=df.index)
1 second log returns are then calculated
    df_r = np.log(df).diff()
    df_r = df_r.fillna(0)
Following this I square the returns.
    df_r2 = np.power(df_r,2)
And aggregate the squared returns into 1 minute intervals.
    df_r2_agg = df_r2.resample('1min',how='sum')
The above steps are applied for each of the dates resulting in millions of rows. I then create a pivot table on date and minute of the day.
    df_r2_agg['minute'] = t
    df_r2_agg['date'] = dates
    df_r2_agg_pivoted = df_r2_agg.pivot(index='date',columns='minute',values='m')
where t is a time-only series in increments of one minute, replicated for each date. I then calculate the median variance across dates.
    df_r2_agg_median = df_r2_agg_pivoted.median()
To calculate the volatilities I take the square root of the 1 minute variances, and scale to one hour.
    df_vol = np.sqrt(df_r2_agg_median)
    df_vol = df_vol*np.sqrt(60)
Finally, I plot the volatilities. I have left out some code, but you get the idea! ;)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    colors = itertools.cycle(["#00FA9A", "c", "y","r","m","g","w" ])
    for col in df_vol.columns:
        ax.plot_date(df_vol.index,df_vol[col]/.0001,color=next(colors),linestyle='-',marker='' )

Monday 26 August 2013

Volume at Expiration (CL/RB/HO)

As a futures contract approaches its expiration date, open interest rolls to the back month and volume in the front month dwindles. Traders holding large positions in the front month with the intention of rolling prior to expiration are faced with the decision of when to roll. Waiting until the expiration date is not advisable due to illiquidity. One day prior to expiration is also relatively illiquid. The following figures show the median intraday volume aggregated over 15 minute intervals for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The analysis suggests that large traders should roll at least 2 days prior to expiration to avoid serious liquidity constraints. The legend in these figures shows the number of days to expiration, with 0 being the expiration date.
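
A sketch of the aggregation is shown below. It assumes the trade records are in a pandas DataFrame called trades with a DatetimeIndex, a 'volume' column of trade sizes, and a precomputed 'days_to_exp' column; these names are my assumptions and the original implementation may differ.

def median_intraday_volume(trades):
    """Median volume per 15-minute interval of the day, keyed by days to expiration."""
    df = trades.copy()
    # 15-minute bucket within the day, expressed as minutes past midnight (0, 15, 30, ...)
    df['bucket'] = df.index.hour*60 + (df.index.minute//15)*15
    # total volume traded in each bucket on each calendar day
    daily = df.groupby([df.index.date, 'days_to_exp', 'bucket'])['volume'].sum()
    # median of those daily totals across expiration cycles
    return daily.groupby(level=[1, 2]).median()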


Crude Oil (CME:CL)




Gasoline (CME:RB)



Heating Oil (CME:HO)


Sunday 18 August 2013

Crude Oil Inventory

Each week (usually on Wednesday) the U.S. Energy Information Administration publishes the number of barrels of crude oil held in inventory by commercial firms during the past week. The figure is eagerly anticipated by energy traders as it is interpreted as an indicator of supply. Thus, an increase (decrease) in the figure is typically associated with bearish (bullish) sentiment toward crude oil.


The following chart shows the trading action around the 30 January 2013 release over a three minute window starting one minute prior to the release, which occurs at 15:30 GMT, and ending two minutes after it. A green dot indicates an up tick, a red dot indicates a down tick, and a white dot is displayed when a trade occurs at the same price as the previous trade.



The next figure shows the bid ask spread at 50 millisecond intervals over the same time horizon.



And the following figure shows the midpoint, which is defined as the average of the bid and ask prices.







Very little trading takes place over the one minute period prior to the release. The next plot shows the trading volume aggregated over 5 second intervals.




To estimate volatility I aggregate 50 millisecond returns into 5 second intervals, where the return is defined as the change in the log of the midpoint. The volatility graph below shows basis points on the y-axis. As expected, there is a spike in volatility at the release.
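
A sketch of the volatility calculation, assuming mid is a numpy array of midpoints sampled every 50 milliseconds (the name and input layout are my assumptions):

import numpy as np

# 50 millisecond log returns of the midpoint
r = np.diff(np.log(mid))

# sum the squared returns in non-overlapping blocks of 100 observations (100 x 50 ms = 5 s)
n = (len(r)//100)*100
rv_5s = np.sum((r[:n]**2).reshape(-1, 100), axis=1)

# realised volatility per 5 second interval, expressed in basis points
vol_bp = np.sqrt(rv_5s)/0.0001

# the same estimate scaled to a one hour horizon (720 five-second intervals per hour)
vol_bp_1h = vol_bp*np.sqrt(720)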




Each week market analysts forecast the inventory figure. The following plot shows the consensus forecast over 2012 versus the actual figure.




As the following plot highlights, there is usually a significant differential between the consensus forecast and the actual outcome. The differential, or forecast error, is measured in millions of barrels of crude oil.





In the next plot I show the cumulative log returns of the midpoint for each announcement over a 30 second window starting 10 seconds prior to the release. As expected, there is little activity prior to the announcement and a significant volatility regime switch at 15:30.



The volatility switch is better illustrated in the following figure, which shows the absolute 50 millisecond returns.


Given that the market typically prices in the consensus forecast prior to the announcement, one might expect there to be a positive correlation between forecast error and post release volatility.

To test this hypothesis I investigate the correlation between the consensus forecast error and post announcement volatility. The volatility estimate is scaled to a one hour horizon. The figure below suggests that an unexpected increase in supply has a bigger impact on volatility than an unexpected decrease.



The final figure shows the relationship between the volatility estimate and the absolute forecast error. I find a significant linear correlation of 45%.




I plan to post more analysis on this topic soon! Thanks for reading!