In this post I go through the Python code I developed to create this figure. The steps include querying a MongoDB database, aggregating results, computing volatility, and plotting.
We begin by connecting to MongoDB via the PyMongo package.
from pymongo import MongoClient
import datetime
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

client = MongoClient('localhost', 27017)
db = client.taq
quote_collection = db['CL_quotes']

I then query the available list of dates.
dates = quote_collection.find({}, {"d": 1}).distinct("d")
dates = sorted(dates)

I don't want to include exchange holidays or expiration dates in my analysis. The following function calls retrieve the relevant dates.
holidays = sd.holidays()
expirationdays = sd.CL_expirationdates()

I then filter the date list.
dates_filtered = [x for x in dates if x not in holidays]
dates_filtered = [x for x in dates_filtered if x not in expirationdays]

I also filter out Sundays.
dates_filtered = [x for x in dates_filtered if x.weekday() != 6]

For each date dt in dates_filtered I specify an end date to be used in the following database query.
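As a minimal, self-contained sketch of the date filtering (the `sd` module is specific to my setup, so the holiday and expiration sets below are hypothetical stand-ins):

```python
import datetime

# Hypothetical stand-ins for sd.holidays() and sd.CL_expirationdates()
holidays = {datetime.date(2013, 7, 4)}         # Independence Day
expirationdays = {datetime.date(2013, 7, 22)}  # a hypothetical CL expiration date

# One week of candidate dates, July 1-7 2013 (July 7 is a Sunday)
dates = [datetime.date(2013, 7, d) for d in range(1, 8)]

# Drop holidays, expiration dates, and Sundays, as above
dates_filtered = [x for x in dates if x not in holidays]
dates_filtered = [x for x in dates_filtered if x not in expirationdays]
dates_filtered = [x for x in dates_filtered if x.weekday() != 6]
```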
start_dt = dt
end_dt = dt + datetime.timedelta(days=1)

I then write an aggregation query to sample bid and ask quotes at one-second intervals. The aggregation framework is a relatively recent addition to MongoDB; it is already fast, and the MongoDB developers are apparently working on making it faster. This query groups bid and ask quotes into one-second buckets and returns the closing quotes for each second. The aggregation happens at the database level; the alternative is to load the raw quotes into a Python data structure and aggregate there.
docs = quote_collection.aggregate([
{"$match": {"dt": {"$lt" : end_dt, "$gte" : start_dt }}},
{"$project": {
"year": {"$year": "$dt"},
"month": {"$month": "$dt"},
"day": {"$dayOfMonth": "$dt"},
"hour": {"$hour": "$dt"},
"minute": {"$minute": "$dt"},
"second": {"$second": "$dt"},
"dt": 1,
"ap": 1,
"bp": 1 }},
{"$sort": {"dt": 1}},
{"$group":
{"_id" : {"year": "$year", "month": "$month", "day": "$day", "hour": "$hour","minute":"$minute","second":"$second" },
"bp_close": {"$last": "$bp"},
"ap_close": {"$last": "$ap"} }} ] )
The following code iterates through the result set and creates lists of closing bid and ask quotes and corresponding timestamps.
timestamp = []
bp_close = []
ap_close = []
for doc in docs['result']:
    year = doc['_id']['year']
    month = doc['_id']['month']
    day = doc['_id']['day']
    hour = doc['_id']['hour']
    minute = doc['_id']['minute']
    second = doc['_id']['second']
    timestamp.append(datetime.datetime(year, month, day, hour, minute, second))
    bp_close.append(doc['bp_close'])
    ap_close.append(doc['ap_close'])

I then wrap these lists in Pandas Series objects.
timestamp_series = pd.Series(data=timestamp)
timestamp_series.name = 'Timestamp'
bp_series = pd.Series(data=bp_close)
bp_series.name = 'BidPrice'
ap_series = pd.Series(data=ap_close)
ap_series.name = 'AskPrice'

I then concatenate the series to form a Pandas DataFrame object. The DataFrame is my data structure of choice for working with time series data in Python.
df_in = pd.concat([pd.DataFrame(timestamp_series), pd.DataFrame(bp_series), pd.DataFrame(ap_series)], axis=1)
df_in = df_in.set_index('Timestamp')
df_in = df_in.sort_index()

The aggregation omits one-second intervals in which no quote arrived. However, I want a full range from 00:00:00 to the final one-second period of the day, 23:59:59.
day_start = datetime.datetime(year, month, day, 0, 0, 0)
day_end = day_start + datetime.timedelta(days=1)
step = datetime.timedelta(seconds=1)
ts_range = []
t = day_start
while t < day_end:
    ts_range.append(t)
    t += step

I then create an empty Pandas Series indexed on this range, followed by a DataFrame.

range_series = pd.Series(data=ts_range)
range_series.name = 'Timestamp'
df = pd.DataFrame(index=range_series, columns=['BidPrice', 'AskPrice'])

I then paste the empirical DataFrame into the full range, column by column.
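As an aside, Pandas can build the same one-second index directly with pd.date_range; a minimal sketch, assuming a hypothetical trading date:

```python
import pandas as pd

# Build all 86,400 one-second timestamps for a single (hypothetical) trading day
ts_range = pd.date_range("2013-07-01", periods=24 * 60 * 60, freq="1s")

# Empty frame indexed on the full range, ready to receive the sampled quotes
df = pd.DataFrame(index=ts_range, columns=["BidPrice", "AskPrice"])
```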
for col in df_in.columns:
    df[col] = df_in[col]

The missing intervals contain NaN values, which I fill by copying forward the previous quotes.
df = df.fillna(method='pad')

The bid-ask midpoint is then calculated and stored in a new DataFrame column.
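fillna(method='pad') is equivalent to a forward fill (ffill in current Pandas); a small sketch of the behaviour on hypothetical quotes:

```python
import pandas as pd

# Seconds 1 and 3 have no quote; forward-fill copies the last known quote ahead
bids = pd.Series([95.10, None, 95.12, None],
                 index=pd.date_range("2013-07-01 09:30:00", periods=4, freq="1s"))
filled = bids.ffill()
```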
df['m'] = (df['BidPrice'] + df['AskPrice']) / 2

One-second log returns are then calculated.
df_r = np.log(df).diff()
df_r = df_r.fillna(0)

Following this I square the returns.
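A small worked example of the log-return calculation, with hypothetical midpoint prices:

```python
import numpy as np
import pandas as pd

# Midpoint prices for three consecutive seconds (hypothetical values)
m = pd.Series([100.0, 100.1, 100.0])

# One-second log returns: log(p_t) - log(p_{t-1}); the first value is NaN,
# which is then replaced with 0
r = np.log(m).diff().fillna(0)
```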
df_r2 = np.power(df_r, 2)

And aggregate the squared returns into one-minute intervals.
df_r2_agg = df_r2.resample('1min').sum()

The above steps are applied to each of the dates, resulting in millions of rows. I then create a pivot table on date and minute of the day.
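A sketch of the one-minute aggregation, using hypothetical squared returns straddling a minute boundary:

```python
import pandas as pd

# Four squared one-second returns spanning a one-minute boundary
idx = pd.to_datetime([
    "2013-07-01 09:30:58", "2013-07-01 09:30:59",
    "2013-07-01 09:31:00", "2013-07-01 09:31:01",
])
r2 = pd.Series([1e-8, 2e-8, 3e-8, 4e-8], index=idx)

# Sum the squared returns within each one-minute bucket (realized variance)
r2_agg = r2.resample("1min").sum()
```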
df_r2_agg['minute'] = t
df_r2_agg['date'] = dates
df_r2_agg_pivoted = df_r2_agg.pivot(index='date', columns='minute', values='m')

where t is a time-of-day series in one-minute increments, replicated for each date. I then calculate the median variance across dates.
df_r2_agg_median = df_r2_agg_pivoted.median()

To calculate the volatilities I take the square root of the one-minute variances and scale to one hour.
df_vol = np.sqrt(df_r2_agg_median)
df_vol = df_vol * np.sqrt(60)

Finally, I plot the volatilities. I have left out some code, but you get the idea! ;)
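A quick worked example of the scaling step, with a hypothetical median one-minute variance:

```python
import numpy as np

# Median one-minute variance of midpoint log returns (hypothetical value)
var_1min = 4e-8

# Volatility per minute, then scaled to one hour: multiply by sqrt(60)
# because variance scales linearly with time under i.i.d. returns
vol_1min = np.sqrt(var_1min)       # 2e-4, i.e. about 2 basis points per minute
vol_1hour = vol_1min * np.sqrt(60)
```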
fig = plt.figure()
ax = fig.add_subplot(111)
colors = itertools.cycle(["#00FA9A", "c", "y", "r", "m", "g", "w"])
for col in df_vol.columns:
    ax.plot_date(df_vol.index, df_vol[col] / .0001, color=next(colors), linestyle='-', marker='')