Wednesday, 28 August 2013

Intraday Volatility (CL/RB/HO)

I implemented a high frequency volatility estimator that takes one second snapshots of the midpoint as input. The following figure shows minute by minute volatility estimates (scaled to one hour) for Crude Oil (CME:CL), Gasoline (CME:RB), and Heating Oil (CME:HO). The median estimate for each minute over 250 trading days is used to illustrate the durnial volatility pattern.  The most significant volatility spikes can be easily matched with well known market events such as open, close, economic figure announcements, and less obvious events such as minute markers and settlements.  Some of the volatility spikes are caused by weekly (as in the case of the crude oil inventory figure) or monthly (as in the case of German Manufacturing PMI) announcements.



In this post I go through the Python code I developed to create this figure. The steps include querying a MongoDB database, aggregating results, computing volatility, and plotting.

We begin by connecting to MongoDB via the PyMongo package.
    client = MongoClient()
    client = MongoClient('localhost', 27017)
    db = client.taq
    quote_collection = db['CL_quotes']
I then query the available list of dates
    dates = t_collection.find({},{"d" : 1 }).distinct("d")
    dates = sorted(dates)
I don't want to include exchange holidays or expiration dates in my analysis. The following function calls retrieve the relevant dates.
    holidays= sd.holidays()
    expirationdays= sd.CL_expirationdates()
I then filter the date list.
    dates_filtered = [x for x in dates if x not in holidays]
    dates_filtered = [x for x in dates_filtered if x not in expirationdays]
I also filter Sundays.
dates_filtered = [x for x in dates_filtered if x.weekday() != 6]
For each date "dt" in dates_filtered I specify an end date to be used in the database query following.
    start_dt = dt
    end_dt = dt
    end_dt = end_dt + datetime.timedelta(days=1)
I then write an aggregation query to sample bid and ask quotes at one second intervals. The aggregation framework is a recent addition to MongoDB. It is pretty fast, but apparently the Mongo developers are in the process of making it faster. This query aggregates bid and ask quotes into 1 second buckets and returns the closing quotes for each second. The aggregation process occurs at database level. The alternative is to load the quotes into a Python data structure to perform the aggregation.

    docs = collection.aggregate([
        {"$match": {"dt": {"$lt" : end_dt, "$gte" : start_dt  }}},
        {"$project": {
             "year":       {"$year": "$dt"},
             "month":      {"$month": "$dt"},
             "day":        {"$dayOfMonth": "$dt"},
             "hour":       {"$hour": "$dt"},
             "minute":     {"$minute": "$dt"},
             "second":     {"$second": "$dt"},
             "dt": 1,
             "ap": 1,
             "bp": 1 }},
         {"$sort": {"dt": 1}},
         {"$group":
           {"_id" : {"year": "$year", "month": "$month", "day": "$day", "hour": "$hour","minute":"$minute","second":"$second" },
                        "bp_close":  {"$last": "$bp"},
                        "ap_close":  {"$last": "$ap"} }} ] )

The following code iterates through the result set and creates lists of closing bid and ask quotes and corresponding timestamps.
    timestamp = []
    bp_close=[]
    ap_close=[]

    for doc in docs['result']:
        year = doc['_id']['year']
        month = doc['_id']['month']
        day = doc['_id']['day']
        hour = doc['_id']['hour']
        minute = doc['_id']['minute']
        second = doc['_id']['second']
        timestamp.append(datetime.datetime(year,month,day,hour,minute,second))
        bp_close.append(doc['bp_close'])
        ap_close.append(doc['ap_close'])

I then wrap these lists in Pandas Series objects.
    timestamp_series = pd.Series(data=timestamp)
    timestamp_series.name='Timestamps'

    bp_series = pd.Series(data=bp_close)
    bp_series.name='BidPrice'

    ap_series = pd.Series(data=ap_close)
    ap_series.name='AskPrice'
I then concatenate the series to form a Pandas Dataframe object. The Dataframe is my data structure of choice for working with time series data in Python.
    df_in = pd.concat([pd.DataFrame(timestamp_series),pd.DataFrame(bp_series),pd.DataFrame(ap_series)], axis=1)
    df_in = df_in.set_index('Timestamp')
    df_in = df_in.sort()

The aggregation process omits 1 second intervals where no quote was available. However, I want a full range from 00:00:00 to the final 1 second period of the day 23:59:59.
    from = datetime.datetime(year,month,day,0,0,0)
    to = from + datetime.timedelta(days=1)
    step = datetime.timedelta(seconds=1)
    range = []
    t = from
    while t < to:
        range.append(t)
        t += step
I then create an empty Pandas Series indexed on this range followed by a Dataframe.
    range_series = pd.Series(data=range)
    range_series.name='Timestamp'
    df = pd.DataFrame(index=range_series,columns=['BidPrice','AskPrice' ])
I then paste the empirical Dataframe into the full range column by column.
    for col in df_in.columns:
        df[col] = df_in[col]
The missing intervals will contain NaN values which I fill by copying forward previous quotes.
    df = df.fillna(method='pad')
The midpoint is then calculated and a new column created in the Dataframe.
    df['m'] = pd.Series( (df['BidPrice']+df['AskPrice'])/2,index=df.index)
1 second log returns are then calculated
    df_r = np.log(df).diff()
    df_r = df_r.fillna(0)
Following this I square the returns.
    df_r2 = np.power(df_r,2)
And aggregate the squared returns into 1 minute intervals.
    df_r2_agg = df_r2.resample('1min',how='sum')
The above steps are applied for each of the dates resulting in millions of rows. I then create a pivot table on date and minute of the day.
    df_r2_agg['minute'] = t
    df_r2_agg['date'] = dates
    df_r2_agg_pivoted = df_r2_agg.pivot(index='date',columns='minute',values='m')
where t is a time only series in increments of one second, replicated for each date. I then calculate the median variance across dates.
    df_r2_agg_median = df_r2_agg_pivoted.median()
To calculate the volatilities I take the square root of the 1 minute variances, and scale to one hour.
    df_vol = np.sqrt(df_r2_agg_median)
    df_vol = df_vol*np.sqrt(60)
Finally, I plot the volatilities. I have left out some code, but you get the idea! ;)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    colors = itertools.cycle(["#00FA9A", "c", "y","r","m","g","w" ])
    for col in df_vol.columns:
        ax.plot_date(df_vol.index,df_vol[col]/.0001,color=next(colors),linestyle='-',marker='' )

3 comments:

  1. What are the import statements for the Code

    ReplyDelete
  2. Can you please attach the python source code file so that we can follow along on your tutorials. Great work by the way

    ReplyDelete
  3. Can you please attach the python source code file so that we can follow along on your tutorials. Great work by the way

    ReplyDelete