Saturday 31 August 2013

Naive Bayes Classification

Although unlikely to provide a competitive advantage in its standard form, the Naive Bayes classification algorithm is relatively simple and intuitive, which makes it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of the algorithm are powerful and, with suitable extension and modification, have the potential to form the basis of a profitable trading system.

As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return to some horizon into the future, denoted $\tau$, and defined as $$ Y_{T_0 + \tau} = \ln \left( S_{T_0 + \tau} \right) - \ln \left( S_{T_0} \right) $$ where $S$ denotes the price of the asset we are attempting to predict, $\ln$ denotes the natural logarithm, and $T_0$ indicates the current period. The aim of the algorithm is to predict Y, conditional on the state of a set of signals. Let $X = \langle X_1, X_2, \ldots, X_k \rangle$ be a vector of real-valued stochastic signals. Both Y and X are standardized by subtracting a rolling mean and dividing by some rolling measure of dispersion. Bayes' Theorem allows us to relate Y and X and is stated as
$$ P(Y=y \mid X = x) = \frac { P(Y=y)\,P(X = x \mid Y = y) } { P(X = x) } $$
where $x = \langle x_1, x_2, \ldots, x_k \rangle$ is an observed signal vector, $X=x$ is shorthand for $X_1 = x_1 \wedge X_2 = x_2 \wedge \ldots \wedge X_k = x_k$, and $y$ is an observed value of Y. Since the denominator $P(X = x)$ is invariant across values of Y, the formula can be simplified to $$ P(Y=y \mid X = x) \propto P(Y=y)\,P(X = x \mid Y = y) $$ However, there is an issue with this formulation. Given that Y and X are real-valued variables, it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define C as a categorical variable which takes the value $c=1$ when $Y < -2\sigma$, the value $c=3$ when $Y > 2\sigma$, and the value $c=2$ otherwise. We apply the same discretization procedure to each signal $X_i$ to form a corresponding categorical variable $X_i^*$. Thus, states 1 and 3 indicate 2 sigma events in the left and right tails, respectively. The discretized formula reads
$$ P(C=c \mid X^* = x^*) \propto P(C=c)\,P(X^* = x^* \mid C = c) $$
Assuming two predictor signals, the expression $P(C=3 \mid X_1^*=1 \wedge X_2^*=3)$ reads "the probability of Y experiencing a positive 2 sigma event conditional on a negative 2 sigma event in signal $X_1$ and a positive 2 sigma event in $X_2$". Thus, the probabilistic expression maps to a simple English sentence, which I prefer over more cryptic, black-box algorithms. I have set the thresholds at the 2 sigma levels in this example. In practice these thresholds are parameters which can be optimized and recursively adjusted to account for changes in the underlying dynamics of a signal's distribution.
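As a concrete illustration, here is a minimal sketch of the discretization step, assuming the input series has already been standardized so that the threshold can be expressed in sigmas; the function name and the default threshold are illustrative.

    import numpy as np

    def discretize(z, k=2.0):
        """Map a standardized series to states 1 (left tail), 2 (body), 3 (right tail)."""
        z = np.asarray(z)
        return np.where(z < -k, 1, np.where(z > k, 3, 2))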

The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We cannot guarantee that an out-of-sample value of the input vector $x^*$ exists in our training set, which means that we may not be able to estimate $P(X^*=x^* \mid C = c)$. The so-called Naive Bayes assumption addresses this issue by assuming that the signals $X_1, X_2, \ldots, X_k$ are conditionally independent of each other given a particular class c. This assumption allows us to express $P(X^*=x^* \mid C = c)$ in the following form $$ P(X^*=x^*\mid C = c) = P\left( \bigwedge_i X_i^* = x_i^* \;\Big|\; C=c \right) = \prod_i P(X_i^*=x_i^* \mid C=c) $$ This means that $P(C=c \mid X^* = x^*)$ can now be expressed as $$ P(C=c \mid X^* = x^*) \propto P(C=c) \prod_i P(X_i^*=x_i^* \mid C=c) $$ "Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). Strictly speaking, Naive Bayes assumes that the covariance matrix is diagonal within each class, and standard PCA does not consider class when rotating the data, so an alternative procedure such as class-conditional Independent Component Analysis is needed. Finally, in typical applications there may be a large number of signals, and PCA can be used to extract a set of orthogonal features smaller than the original signal set, with some loss of variance. I will investigate this issue in a future post!
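To make the estimation step concrete, the following sketch fits the class priors and per-signal conditional probabilities from a discretized training set and scores a new observation with the product rule above. The function and variable names are illustrative, and Laplace smoothing is included as one common way to guard against zero counts; the post itself does not prescribe a smoothing scheme.

    import numpy as np

    def fit_naive_bayes(C, X):
        """Estimate P(C=c) and P(X_i*=s | C=c) from discretized training data.

        C is an array of class labels (1, 2, 3); X is an (n_samples, k) array of
        discretized signal states.
        """
        classes, states = np.unique(C), np.unique(X)
        priors = {c: np.mean(C == c) for c in classes}
        cond = {}
        for c in classes:
            Xc = X[C == c]
            for i in range(X.shape[1]):
                for s in states:
                    # Laplace-smoothed P(X_i* = s | C = c)
                    cond[(i, s, c)] = (np.sum(Xc[:, i] == s) + 1.0) / (len(Xc) + len(states))
        return priors, cond

    def score(x_new, priors, cond):
        """Unnormalized P(C=c | X*=x*) via the conditional independence product."""
        return {c: priors[c] * np.prod([cond[(i, s, c)] for i, s in enumerate(x_new)])
                for c in priors}

The class with the highest score is then taken as the predicted state of the future return.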

Friday 30 August 2013

Pairwise Linear Correlation (CL/RB/HO)


The prices of crude oil, gasoline, and heating oil are highly correlated, for obvious reasons. In this post I present an estimate of the pairwise linear correlations between the futures contracts on these products. The correlations are estimated using a short sample of two weeks. The following figure shows the cumulative return over the sample period, where the flat segments represent Saturday and Sunday, when the market is closed.


I estimate the pairwise correlations for a range of sampling frequencies from 1 second out to 120 seconds. I use samples of the midpoint to avoid the impact of the bid-ask bounce effect. The correlations break down as we approach the highest sampling frequency, a well documented phenomenon known as the Epps effect. That said, the correlations remain above 50% at the one second frequency, which is consistent with high frequency spreader algorithms trading these contracts at sub-second frequencies and, in the process, strengthening the correlations.


Of course, these correlations evolve over time and will vary depending on the sample used to estimate them. However, the same pattern emerges: diminishing correlations with increasing sampling frequency.
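For reference, the following sketch shows one way the frequency-dependent correlations could be computed from one-second midpoint series; the Series names, the resampling calls, and the frequency grid are assumptions rather than the exact code behind the figures.

    import numpy as np
    import pandas as pd

    def correlation_by_frequency(mid_a, mid_b, frequencies=(1, 5, 10, 30, 60, 120)):
        """Correlation of log midpoint returns at several sampling frequencies (in seconds).

        mid_a and mid_b are midpoint Series sampled at one-second intervals on a
        shared DatetimeIndex; returns a Series of correlations indexed by frequency.
        """
        corr = {}
        for f in frequencies:
            ra = np.log(mid_a.resample('%ds' % f).last()).diff()
            rb = np.log(mid_b.resample('%ds' % f).last()).diff()
            corr[f] = ra.corr(rb)  # pandas aligns the two return series on their index
        return pd.Series(corr).sort_index()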

Wednesday 28 August 2013

Intraday Volatility (CL/RB/HO)

I implemented a high frequency volatility estimator that takes one second snapshots of the midpoint as input. The following figure shows minute-by-minute volatility estimates (scaled to one hour) for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The median estimate for each minute over 250 trading days is used to illustrate the diurnal volatility pattern. The most significant volatility spikes can easily be matched with well known market events such as the open, the close, and economic figure announcements, as well as less obvious events such as minute markers and settlements. Some of the volatility spikes are caused by weekly announcements (as in the case of the crude oil inventory figure) or monthly announcements (as in the case of German Manufacturing PMI).



In this post I go through the Python code I developed to create this figure. The steps include querying a MongoDB database, aggregating results, computing volatility, and plotting.

We begin by connecting to MongoDB via the PyMongo package.
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)  # connect to the local MongoDB instance
    db = client.taq                           # quotes database
    quote_collection = db['CL_quotes']        # crude oil quote collection
I then query the available list of dates
    dates = quote_collection.find({}, {"d": 1}).distinct("d")
    dates = sorted(dates)
I don't want to include exchange holidays or expiration dates in my analysis. The following function calls retrieve the relevant dates.
    holidays = sd.holidays()                  # exchange holidays (from a custom helper module)
    expirationdays = sd.CL_expirationdates()  # CL expiration dates (from a custom helper module)
I then filter the date list.
    dates_filtered = [x for x in dates if x not in holidays]
    dates_filtered = [x for x in dates_filtered if x not in expirationdays]
I also filter Sundays.
    dates_filtered = [x for x in dates_filtered if x.weekday() != 6]
For each date "dt" in dates_filtered I specify an end date to be used in the database query that follows.
    start_dt = dt
    end_dt = dt + datetime.timedelta(days=1)
I then write an aggregation query to sample bid and ask quotes at one second intervals. The aggregation framework is a recent addition to MongoDB. It is pretty fast, but apparently the Mongo developers are in the process of making it faster. This query groups bid and ask quotes into 1 second buckets and returns the closing quote for each second. The aggregation happens at the database level; the alternative is to load the quotes into a Python data structure and perform the aggregation there.

    docs = quote_collection.aggregate([
        {"$match": {"dt": {"$lt" : end_dt, "$gte" : start_dt  }}},
        {"$project": {
             "year":       {"$year": "$dt"},
             "month":      {"$month": "$dt"},
             "day":        {"$dayOfMonth": "$dt"},
             "hour":       {"$hour": "$dt"},
             "minute":     {"$minute": "$dt"},
             "second":     {"$second": "$dt"},
             "dt": 1,
             "ap": 1,
             "bp": 1 }},
         {"$sort": {"dt": 1}},
         {"$group":
           {"_id" : {"year": "$year", "month": "$month", "day": "$day", "hour": "$hour","minute":"$minute","second":"$second" },
                        "bp_close":  {"$last": "$bp"},
                        "ap_close":  {"$last": "$ap"} }} ] )

The following code iterates through the result set and creates lists of closing bid and ask quotes and corresponding timestamps.
    timestamp = []
    bp_close=[]
    ap_close=[]

    for doc in docs['result']:
        year = doc['_id']['year']
        month = doc['_id']['month']
        day = doc['_id']['day']
        hour = doc['_id']['hour']
        minute = doc['_id']['minute']
        second = doc['_id']['second']
        timestamp.append(datetime.datetime(year,month,day,hour,minute,second))
        bp_close.append(doc['bp_close'])
        ap_close.append(doc['ap_close'])

I then wrap these lists in Pandas Series objects.
    timestamp_series = pd.Series(data=timestamp)
    timestamp_series.name='Timestamp'

    bp_series = pd.Series(data=bp_close)
    bp_series.name='BidPrice'

    ap_series = pd.Series(data=ap_close)
    ap_series.name='AskPrice'
I then concatenate the series to form a Pandas DataFrame object. The DataFrame is my data structure of choice for working with time series data in Python.
    df_in = pd.concat([pd.DataFrame(timestamp_series),pd.DataFrame(bp_series),pd.DataFrame(ap_series)], axis=1)
    df_in = df_in.set_index('Timestamp')
    df_in = df_in.sort()

The aggregation process omits 1 second intervals where no quote was available. However, I want a full range of timestamps from 00:00:00 to the final 1 second period of the day, 23:59:59.
    # Build the full range of one-second timestamps for the day
    # (year, month, day hold the values parsed from the query results above)
    start_ts = datetime.datetime(year, month, day, 0, 0, 0)
    end_ts = start_ts + datetime.timedelta(days=1)
    step = datetime.timedelta(seconds=1)
    timestamps = []
    t = start_ts
    while t < end_ts:
        timestamps.append(t)
        t += step
I then wrap this range of timestamps in a Pandas Series and use it as the index of an empty DataFrame.
    range_series = pd.Series(data=timestamps)
    range_series.name='Timestamp'
    df = pd.DataFrame(index=range_series,columns=['BidPrice','AskPrice'])
I then paste the empirical DataFrame into the full range, column by column.
    for col in df_in.columns:
        df[col] = df_in[col]
The missing intervals will contain NaN values which I fill by copying forward previous quotes.
    df = df.fillna(method='pad')
The midpoint is then calculated and a new column created in the Dataframe.
    df['m'] = pd.Series( (df['BidPrice']+df['AskPrice'])/2,index=df.index)
One-second log returns are then calculated.
    df_r = np.log(df).diff()
    df_r = df_r.fillna(0)
Following this I square the returns.
    df_r2 = np.power(df_r,2)
And aggregate the squared returns into 1 minute intervals.
    df_r2_agg = df_r2.resample('1min',how='sum')
The above steps are applied for each of the dates resulting in millions of rows. I then create a pivot table on date and minute of the day.
    df_r2_agg['minute'] = t
    df_r2_agg['date'] = dates
    df_r2_agg_pivoted = df_r2_agg.pivot(index='date',columns='minute',values='m')
where t is a time-only series in increments of one minute, replicated for each date. I then calculate the median variance across dates.
    df_r2_agg_median = df_r2_agg_pivoted.median()
To calculate the volatilities I take the square root of the 1 minute variances, and scale to one hour.
    df_vol = np.sqrt(df_r2_agg_median)
    df_vol = df_vol*np.sqrt(60)
Finally, I plot the volatilities. I have left out some code, but you get the idea! ;)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    colors = itertools.cycle(["#00FA9A", "c", "y","r","m","g","w" ])
    for col in df_vol.columns:
        ax.plot_date(df_vol.index,df_vol[col]/.0001,color=next(colors),linestyle='-',marker='' )

Monday 26 August 2013

Volume at Expiration (CL/RB/HO)

As a futures contract approaches its expiration date, open interest rolls to the back month and volume in the front month dwindles. Traders holding large positions in the front month with the intention of rolling prior to expiration are faced with the decision of when to roll. Waiting until the expiration date is not advisable due to illiquidity, and one day prior to expiration is also relatively illiquid. The following figures show the median intraday volume aggregated over 15 minute intervals for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The analysis suggests that large traders should roll at least 2 days prior to expiration to avoid serious liquidity constraints. The legend in these figures shows the number of days to expiration, with 0 being the expiration date. A sketch of how such a volume profile might be computed is given after the figures.


Crude Oil (CME:CL)




Gasoline (CME:RB)



Heating Oil (CME:HO)
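For completeness, a rough sketch of how such a median volume profile might be assembled is shown below. It assumes a trades DataFrame with a DatetimeIndex, a 'volume' column, and a 'days_to_expiration' column; the column names and grouping approach are illustrative rather than the exact code behind the figures.

    import pandas as pd

    def median_volume_profile(trades):
        """Median volume per 15-minute interval, one column per days-to-expiration."""
        grouped = trades.groupby([trades.index.normalize(),           # trading date
                                  trades['days_to_expiration'],       # 0 = expiration day
                                  trades.index.floor('15min').time])  # 15-minute bucket
        per_day = (grouped['volume'].sum()
                   .rename_axis(['date', 'dte', 'interval'])
                   .reset_index())
        # Median across dates for each (interval, days-to-expiration) pair
        return per_day.pivot_table(index='interval', columns='dte',
                                   values='volume', aggfunc='median')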


Sunday 18 August 2013

Crude Oil Inventory

Each week (usually on a Wednesday) the U.S. Energy Information Administration publishes the number of barrels of crude oil held in inventory by commercial firms during the past week. The figure is eagerly anticipated by energy traders as it is interpreted as an indicator of supply. Thus, an increase (decrease) in the figure is typically associated with bearish (bullish) sentiment toward crude oil.


The following chart shows the trading action around the January 30th 2013 release over a three minute window starting one minute prior to the release, which occurs at 15:30 GMT, and ending two minutes after it. A green dot indicates an up tick, a red dot indicates a down tick, and a white dot is displayed when a trade occurs at the same price as the previous trade.



The next figure shows the bid ask spread at 50 millisecond intervals over the same time horizon.



And the following figure shows the midpoint, which is defined as the average of the bid and ask prices.







Very little trading takes place over the one minute period prior to the release. The next plot shows the trading volume aggregated over 5 second intervals.




To estimate volatility I aggregate 50 millisecond returns into 5 second intervals, where a return is defined as the change in the log of the midpoint. The volatility graph below shows basis points on the y-axis. As expected, there is a spike in volatility at the release.
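A minimal sketch of this estimator, assuming a 50-millisecond midpoint Series called mid with a DatetimeIndex (the name and resampling call are illustrative):

    import numpy as np
    import pandas as pd

    def realized_vol_5s(mid):
        """Realized volatility per 5-second interval, in basis points."""
        r = np.log(mid).diff().fillna(0)        # 50 ms log returns of the midpoint
        var_5s = (r ** 2).resample('5s').sum()  # realized variance per 5-second bucket
        return np.sqrt(var_5s) * 1e4            # square root, expressed in basis points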




Each week market analysts forecast the inventory figure. The following plot shows the consensus forecast over 2012 versus the actual figure.




As the following plot highlights, there is usually a significant differential between the consensus forecast and the actual outcome. The differential, or forecast error, is measured in millions of barrels of crude oil.





In the next plot I show the cumulative log returns of the midpoint for each announcement over a 30 second window starting 10 seconds prior to the release. As expected, there is little activity prior to the announcement and a significant volatility regime switch at 15:30.



The volatility switch is better illustrated in the following figure, which shows the absolute 50 millisecond returns.


Given that the market typically prices in the consensus forecast prior to the announcement, one might expect there to be a positive correlation between forecast error and post release volatility.

To test this hypothesis I investigate the correlation between the consensus forecast error and post announcement volatility. The volatility estimate is scaled to a one hour horizon. The figure below suggests that an unexpected increase in supply has a bigger impact on volatility than an unexpected decrease.



The final figure shows the relationship between the volatility estimate and the absolute forecast error. I find a significant linear correlation of 45%.
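The correlation itself is straightforward to compute once the two series are aligned; a minimal sketch with illustrative argument names:

    import numpy as np

    def error_vol_correlation(abs_forecast_error, post_release_vol):
        """Pearson correlation between absolute forecast errors (millions of barrels)
        and the matching post-announcement volatility estimates (scaled to one hour)."""
        return np.corrcoef(abs_forecast_error, post_release_vol)[0, 1]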




I plan to post more analysis on this topic soon! Thanks for reading!


Sunday 4 August 2013

Unix Commands

In this post I catalogue most of the Unix commands that I have been using on a frequent basis.

-bash$ history | awk '{a[$2]++}END{for(i in a){print a[i] " " i}}' | sort -rn | head

List of commands you use most often

-bash$ ssh -X user@host
ssh remote login with X11 forwarding

-bash$ strace -ff -e trace=write -e write=1,2 -p SOME_PID
intercept stdout/stderr of another process

-bash$  strace -p PID
trace the system calls made by process PID

-bash$  ls -l /proc/PID/fd
list the file handles that process PID has open.

-bash$  Ctrl + z
-bash$  bg
-bash$  disown -h [job-spec]
pause already running process, put into background, and detach.

-bash$  top -u user
list running processes for user.

-bash$ echo "ls -l" | at 1800
Execute a command at a given time

-bash$ man ascii 
quick access to the ascii table

-bash$ !!:gs/foo/bar 
Runs previous command replacing foo by bar every time that foo appears

-bash$ curl ifconfig.me
get your external ip address

-bash$ sshfs name@server:/path/to/folder /path/to/mount/point
Mount folder/filesystem through SSH

-bash$ ctrl-l 
clear the terminal screen

-bash$ (cd /tmp && ls)
Jump to a directory, execute a command and jump back to current dir

-bash$ du -h --max-depth=1 | sort -h
List the size (in human readable form) of all sub folders from the current location, sorted by size.

-bash$ disown -a && exit
Close shell keeping all subprocess running

-bash$ ps aux | sort -nk +4 | tail
Display the top ten running processes - sorted by memory usage

-bash$ find . -name "*.py" -exec grep -i -H "search pattern" {} \;
Search recursively to find a word or phrase in certain file types, such as Python code.
-H tells grep to print the filename.
-i tells grep to ignore case.
-bash$ awk '/start_pattern/,/stop_pattern/' file.txt
Display a block of text with AWK.

-bash$ grep -A # pattern file.txt
to see a specific number of lines following your pattern

-bash$ grep -n --color pattern * -r
Recursively search for a pattern in the current directory, showing line numbers and highlighting matches

-bash$ lsof -i
Watch Network Service Activity in Real-time

-bash$ some_very_long_and_complex_command # label
Using the above trick it's possible to label your commands and access them easily by pressing ^R and typing the label

-bash$ lsof -P -i -n
Show apps that use internet connection at the moment.

-bash$ mv filename.{old,new}
quickly rename a file

-bash$ rm !(*.foo|*.bar|*.baz)
Delete all files in a folder that don't match a certain file extension

-bash$ wget -qO - "http://www.tarball.com/tarball.gz" | tar zxvf -
Extract tarball from internet without local saving

-bash$ cat /etc/issue
Display which distro is installed

-bash$ man hier
Show File System Hierarchy

-bash$ sudo dd if=/dev/mem | cat | strings
this command will show you all the string (plain text) values in ram

-bash$ ls -R | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/   /' -e 's/-/|/'
Graphical tree of sub-directories

-bash$ mkdir -p a/long/directory/path
Make directory including intermediate directories

-bash$ sed -n '10,20p' file.txt
Print lines 10 through 20 of a file


-bash$ pv sourcefile > destfile
Copy a file using pv and watch its progress

-bash$ rename 'y/ /_/' *
replace spaces in filenames with underscores

-bash$ vim scp://username@host//path/to/somefile
Edit a file on a remote host using vim

-bash$ du -s * | sort -n | tail
Get the 10 biggest files/folders for the current directory

-bash$ mkdir /home/foo/doc/bar && cd $_
mkdir & cd into it as single command

-bash$ chmod --reference file1 file2
Makes the permissions of file2 the same as file1

-bash$ sed -n 5p file.txt
Print a specific line (here line 5) from a file

-bash$ tar -tf data.tar.gz | xargs rm -r
Remove all files previously extracted from a tar(.gz) file.

-bash$ ls -d */
List only the directories in the current directory

-bash$ ps awwfux | less -S
Show a 4-way scrollable process tree with full details.

-bash$ find . -type d -empty -delete
Recursively remove all empty directories

-bash$ leave +15 ; leave 1855
set an alarm for a particular time (in 15 minutes, or at 18:55)

-bash$ grep ^Dirty /proc/meminfo
Find out how much data is waiting to be written to disk

-bash$ mkdir -p work/{d1,d2}/{src,bin,bak}
make directory tree

-bash$ cp file.txt{,.bak}
Create a quick back-up copy of a file

-bash$ echo ${SSH_CLIENT%% *}
Get the IP of the host you're coming from when logged in remotely

-bash$ showkey -a
Quick access to ASCII code of a key

-bash$ history -d OFFSET
delete a line from your shell history

-bash$ find ./ -type f -exec chmod 644 {} \;
Recursively change permissions on files, leave directories alone.

-bash$ alias dush="du -sm *|sort -n|tail"
easily find megabyte eating files or directories. dush for the above output. dush -n 3 for only the 3 biggest files

-bash$ :mksession!  
Save your sessions in vim to resume later

-bash-3.2$  ./myapp &> log.txt & disown -h ;  tail -f log.txt
Run background process. Redirect stdout and stderr to log.txt, disown the process, and tail the log file.

-bash-3.2$  tar -zxvf data.tar.gz
where the options zxvf are:
-z : Uncompress the resulting archive with gzip command.
-x : Extract to disk from the archive.
-v : Produce verbose output i.e. show progress and file names while extracting files.
-f data.tar.gz : Read the archive from the specified file called data.tar.gz.


-bash-3.2$  ls -lhS
List files by size, with option l for a long listing format, h for human readable file sizes, and S to sort by size.

-bash-3.2$  ls -lSr
To list in reverse order.

-bash-3.2$ wc -l *
Count the number of lines in each file in the current directory.

-bash-3.2$  ls -1 targetdir | wc -l
Count the number of files in the directory targetdir

 -bash-3.2$  top
List running processes. Press 1 to list cpus.

 -bash-3.2$  find /home/me | grep foo.c
Locate file foo.c

 -bash-3.2$ netstat -tlnp
Lists all listening ports together with the PID of the associated process

 -bash-3.2$ mtr HOSTNAME
Better than traceroute and ping combined. Investigates the network connection between the host mtr runs on and HOSTNAME.

 -bash-3.2$ > file.txt
Empty the contents of a file without deleting the file itself

 -bash-3.2$ Ctrl-r
Allows you to reverse search through your command history.

 -bash-3.2$ du -sh .
Outputs the total size of files and directories within the current directory.

 -bash-3.2$ df -kh .
Outputs the used and available space for the current drive.

Friday 2 August 2013

U.S. Petroleum Balance Sheet

I developed a Python script to parse the U.S. Petroleum Balance Sheet Excel file and generate a figure for each statistic. All reported statistics are weekly and are measured in thousands of barrels per day. The statistics fall under two categories: Petroleum Stocks and Petroleum Supply. The figures are up to date for the week ending August 4th 2013.
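The script itself is not reproduced here, but a minimal sketch of the approach might look like the following; the file name, sheet name, and header layout are assumptions about the workbook rather than its actual structure.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file and sheet names; the real workbook layout may differ.
    df = pd.read_excel('petroleum_balance_sheet.xls', sheet_name='Data 1',
                       skiprows=2, index_col=0)

    # One figure per reported statistic
    for col in df.columns:
        fig, ax = plt.subplots()
        df[col].plot(ax=ax, title=col)
        fig.savefig(col.replace('/', '_') + '.png')
        plt.close(fig)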



Petroleum Stocks

U.S. Ending Stocks of Crude Oil




U.S. Ending Stocks excluding SPR of Crude Oil




U.S. Ending Stocks of Crude Oil in SPR




U.S. Ending Stocks of Total Gasoline