Manipulating financial data with Python
Wed 01 March 2017 by Adrian Torrie
This post is part 1 of the "Udacity - Machine Learning for Trading" series.
Summary
Notes for the sections of the mini-course, Manipulating Financial Data in Python:
- 01-01 Reading and plotting stock data
- 01-02 Working with multiple stocks
- 01-03 The power of numpy
- 01-04 Statistical analysis of time series
- 01-05 Incomplete data
- 01-06 Histograms and scatterplots
- 01-07 Sharpe ratio and other portfolio statistics
- 01-08 Optimizers: Building a parameterized model
- 01-09 Optimizers: How to optimize a portfolio
Version Control
%run ../../../code/version_check.py
Change Log
Date Created: 2017-02-28
Date of Change Change Notes
-------------- ----------------------------------------------------------------
2017-02-28 Initial draft
Setup
Import needed modules and establish control variables.
%matplotlib inline
import datetime
from matplotlib.pylab import rcParams
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import quandl
import scipy.optimize as spo
The `dates` variable will be a datetime index used for joining the time series we download.
start_date = datetime.datetime(1999, 1, 1)
end_date = datetime.datetime(2017, 2, 27)
dates = pd.date_range(start_date, end_date)
We download `spy` to use as a benchmark, and also as a reference point to determine whether the market was open: it's generally safe to say that if the market was open, then `spy` would have been traded.
Each time series is downloaded into its own dataframe, as opposed to using the `pandas_datareader` functionality to download into a panel.
spy = web.DataReader("SPY", 'yahoo', start_date, end_date)
aapl = web.DataReader("AAPL", 'yahoo', start_date, end_date)
gld = web.DataReader("GLD", 'yahoo', start_date, end_date)
ibm = web.DataReader("IBM", 'yahoo', start_date, end_date)
xom = web.DataReader("XOM", 'yahoo', start_date, end_date)
Assumptions used throughout the analysis.
n_trading_days = 252
Used for formatting of charts
ts_figsize = (10, 3)
subplot_figsize = (10, 8)
square_figsize = (5, 5)
rcParams['grid.alpha'] = 0.5
rcParams['grid.linestyle'] = 'dashed'
rcParams['axes.spines.left'] = False
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['legend.loc'] = 'best'
rcParams['legend.fancybox'] = True
rcParams['ytick.left'] = False
ts_kwargs = {'figsize': ts_figsize, 'grid': True}
subplot_kwargs = {'figsize': subplot_figsize, 'grid': True, 'subplots': True, }
legend_kwargs = {'bbox_to_anchor': (1.05, 1), 'loc': 2, 'borderaxespad': 0.}
Used by the quandl module to prevent hitting daily download limits.
quandl_api_key = os.environ['QUANDL_API_KEY']
quandl.ApiConfig.api_key = quandl_api_key
Preparation of data
We see that:
- The `spy`, `aapl`, `ibm`, and `xom` series have the same number of observations (4,566)
- The `spy`, `aapl`, `ibm`, and `xom` series start on 1999-01-04
- `gld` starts on 2004-11-18 and has 3,089 observations
- Approximately 250 KB is used for each dataset (`gld`: 170 KB)
print(spy.info())
print(spy.head())
print(aapl.info())
print(aapl.head())
print(gld.info())
print(gld.head())
print(ibm.info())
print(ibm.head())
print(xom.info())
print(xom.head())
Join the data: `prices`
# create empty dataframe with datetime index set
prices = pd.DataFrame(index=dates)
# add spy 'Adj Close' column and rename the column to 'spy'
# only keeping those rows where price data exists for spy
# to ensure we are analysing days where the market was open
prices = prices.join(spy['Adj Close'], how='inner')
prices = prices.rename(columns={'Adj Close': 'spy'})
# review changes
print(prices.info())
print(prices.head())
# add aapl, ibm, gld, xom
prices['aapl'] = aapl['Adj Close']
prices['gld'] = gld['Adj Close']
prices['ibm'] = ibm['Adj Close']
prices['xom'] = xom['Adj Close']
# visualise the price data
prices.plot(**ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Prepare and normalise the dataset for analysis: `norm`
Forward-fill first, then back-fill. This order prevents data from the future creeping into past results; the back-fill is only used so that time series which start part-way through the defined index window (such as `gld`) have the start of their data aligned with the rest of the data.
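As a toy illustration of why the order matters (hypothetical values for a series that both starts late and has a mid-series gap):
# hypothetical series: starts late (leading NaN) and has a mid-series gap
s = pd.Series([np.nan, 10.0, np.nan, 12.0])
# ffill first: the gap is filled with the last known (past) price;
# bfill then only repairs the leading NaN
print(s.ffill().bfill().tolist())  # [10.0, 10.0, 10.0, 12.0]
# bfill first would fill the gap with a future price -> lookahead bias
print(s.bfill().ffill().tolist())  # [10.0, 10.0, 12.0, 12.0]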
# fill gaps
norm = prices.copy()
norm.fillna(method='ffill', axis=0, inplace=True)
norm.fillna(method='bfill', axis=0, inplace=True)
# normalise the price data
norm = norm / norm.iloc[0]
# review changes
print(norm.head())
# generate plot
norm.plot(title='Cumulative Returns', **subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
When viewing the series all together, we can easily see the outsized gains `aapl` had compared to the market (`spy`) and `ibm`.
# generate plot to show the outsized returns aapl had relative to the others
norm.plot.line(**ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Analysis
Statistical analysis: `daily_returns`
daily_returns = prices.pct_change()
print(daily_returns.head())
daily_returns.plot(**subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
mean = daily_returns.mean()
mean
std = daily_returns.std()
std
# the mean series object can still be indexed into using the series name
mean['spy']
Reset `daily_returns`
Here we reset `daily_returns` to the percent change of the normalised dataframe, `norm`. We do this because `norm` has already been processed with forward-fill and back-fill, which allows every row of the dataframe to be used by the dataframe methods without running into issues with `NaN` values.
# reset daily_returns to use the normalised data so the NaNs don't interfere
daily_returns = norm.pct_change()
daily_returns = daily_returns.dropna()
print(daily_returns.head())
# global min and max, using '.values' converts to a numpy array
# (named ret_min/ret_max to avoid shadowing the built-ins min/max)
ret_min = daily_returns.values.min()
ret_max = daily_returns.values.max()
print('({}, {})'.format(ret_min, ret_max))
Rolling mean: `sma_20`
Smoothed curve of the price series over a 20-day window; later used to create Bollinger bands. Often used as a technical indicator and overlaid on the price series.
# sma is the "simple moving average"
sma_20 = prices.rolling(center=False, window=20).mean()
sma_20.plot(**subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
Rolling deviations: `rdev_20`
Always non-negative, calculated over a 20-day window; later used to create Bollinger bands.
# rdev is the "rolling deviations"
rdev_20 = prices.rolling(center=False, window=20).std()
rdev_20.plot(**subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
Bollinger bands
Bollinger bands are a way to view the current price relative to the current volatility. They are used as a technical indicator, and may be used in trading strategies.
When plotting, only the last 200 days of each price series are shown, for clarity.
upper_band = sma_20 + rdev_20 * 2
lower_band = sma_20 - rdev_20 * 2
n_subplot_rows = len(prices.columns)
fig = plt.figure(figsize=subplot_figsize, tight_layout=True)
for i, series in enumerate(prices.columns):
    colour = 'C' + str(i)
    label = series + ' (200 most recent trading days)'
    subplot = i + 1
    ax = fig.add_subplot(n_subplot_rows, 1, subplot)
    ax.plot(prices[series][-200:], label=label, color=colour)
    ax.plot(upper_band[series][-200:], linestyle='dashed', color='grey', label='')
    ax.plot(lower_band[series][-200:], linestyle='dashed', color='grey', label='')
    ax.grid()
    # hide the x-axis tick labels on all but the bottom subplot
    if subplot != n_subplot_rows:
        ax.xaxis.set_ticklabels([])
    ax.legend(**legend_kwargs)
Daily returns distributions
Plot a histogram of `spy` daily returns, along with its mean and 2 standard deviations (upper and lower bounds).
# plot returns distribution and mark with the mean and 2 std. dev.
series = 'spy'
daily_returns[series].hist(label=series, bins=1000, range=[ret_min, ret_max], align='mid', **ts_kwargs)
# mean
plt.axvline(mean[series], color='r', linestyle='dashed')
# upper deviation
plt.axvline(mean[series] + std[series] * 2, color='r', linestyle='dashed')
# lower deviation
plt.axvline(mean[series] - std[series] * 2, color='r', linestyle='dashed')
plt.title('SPY Returns Distribution (with 2 S.D.)')
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Overlay the histograms of `spy` and `aapl` to compare their return distributions.
daily_returns['spy'].hist(label='spy',
                          bins=100,
                          range=[ret_min, ret_max],
                          align='mid',
                          alpha=0.5,
                          **ts_kwargs)
daily_returns['aapl'].hist(label='aapl',
                           bins=100,
                           range=[ret_min, ret_max],
                           align='mid',
                           alpha=0.5,
                           **ts_kwargs)
plt.title('SPY vs. AAPL Returns Distributions')
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Use a binned histogram to view the shape of the returns distributions.
daily_returns.plot(kind='hist',
                   bins=100,
                   range=[ret_min, ret_max],
                   align='mid',
                   alpha=0.8,
                   **subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
Often a better way to see the shape is to use the Kernel Density Estimator, as this provides a better view of the skew and kurtosis (if any).
daily_returns.plot(kind='kde', **subplot_kwargs)
[ax.legend(**legend_kwargs) for ax in plt.gcf().axes]
plt.tight_layout()
plt.show()
Alpha and beta
For each asset we estimate $\beta$ (the slope) and $\alpha$ (the intercept) by fitting a line to its daily returns against the benchmark's daily returns.
# the series to use to determine beta and alpha coefficients
series = 'aapl'
# the series to benchmark returns against
benchmark = 'spy'
# create scatterplot
daily_returns.plot(kind='scatter', x=benchmark, y=series, figsize=square_figsize, label=series)
# create linear regression plot of series vs spy
beta, alpha = np.polyfit(x=daily_returns[benchmark], y=daily_returns[series], deg=1)
plt.plot(daily_returns[benchmark],
beta * daily_returns[benchmark] + alpha,
linestyle='-',
color='r',
label='$y = {0:.2f}x + {1:.2f}$'.format(beta, alpha))
print('(alpha = {})'.format(alpha))
print('(beta = {})'.format(beta))
plt.grid()
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
# the series to use to determine beta and alpha coefficients
series = 'gld'
# the series to benchmark returns against
benchmark = 'spy'
# create scatterplot
daily_returns.plot(kind='scatter', x=benchmark, y=series, figsize=square_figsize, label=series)
# create linear regression plot of series vs spy
beta, alpha = np.polyfit(x=daily_returns[benchmark], y=daily_returns[series], deg=1)
plt.plot(daily_returns[benchmark],
beta * daily_returns[benchmark] + alpha,
linestyle='-',
color='r',
label='$y = {0:.2f}x + {1:.2f}$'.format(beta, alpha))
print('(alpha = {})'.format(alpha))
print('(beta = {})'.format(beta))
plt.grid()
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
# the series to use to determine beta and alpha coefficients
series = 'ibm'
# the series to benchmark returns against
benchmark = 'spy'
# create scatterplot
daily_returns.plot(kind='scatter', x=benchmark, y=series, figsize=square_figsize, label=series)
# create linear regression plot of series vs spy
beta, alpha = np.polyfit(x=daily_returns[benchmark], y=daily_returns[series], deg=1)
plt.plot(daily_returns[benchmark],
beta * daily_returns[benchmark] + alpha,
linestyle='-',
color='r',
label='$y = {0:.2f}x + {1:.2f}$'.format(beta, alpha))
print('(alpha = {})'.format(alpha))
print('(beta = {})'.format(beta))
plt.grid()
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
# the series to use to determine beta and alpha coefficients
series = 'xom'
# the series to benchmark returns against
benchmark = 'spy'
# create scatterplot
daily_returns.plot(kind='scatter', x=benchmark, y=series, figsize=square_figsize, label=series)
# create linear regression plot of series vs spy
beta, alpha = np.polyfit(x=daily_returns[benchmark], y=daily_returns[series], deg=1)
plt.plot(daily_returns[benchmark],
beta * daily_returns[benchmark] + alpha,
linestyle='-',
color='r',
label='$y = {0:.2f}x + {1:.2f}$'.format(beta, alpha))
print('(alpha = {})'.format(alpha))
print('(beta = {})'.format(beta))
plt.grid()
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
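The four cells above repeat the same steps; a small helper along these lines (a sketch with a hypothetical name, not part of the course material) would remove the duplication:
def plot_alpha_beta(returns, series, benchmark='spy'):
    """Scatter a series' daily returns against a benchmark and overlay
    the fitted regression line (slope = beta, intercept = alpha)."""
    returns.plot(kind='scatter', x=benchmark, y=series,
                 figsize=square_figsize, label=series)
    beta, alpha = np.polyfit(x=returns[benchmark], y=returns[series], deg=1)
    plt.plot(returns[benchmark],
             beta * returns[benchmark] + alpha,
             linestyle='-', color='r',
             label='$y = {0:.2f}x + {1:.2f}$'.format(beta, alpha))
    plt.grid()
    plt.legend(**legend_kwargs)
    plt.tight_layout()
    plt.show()
    return alpha, beta

# usage: alpha, beta = plot_alpha_beta(daily_returns, 'xom')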
Correlations
Note how the GLD ETF has a very low correlation with most other tradable assets.
# review correlations of the daily returns series
corr = daily_returns.corr(method='pearson')
print(corr)
Sharpe ratio
An annualised measure of risk-adjusted returns.
\begin{align} sr & = \frac{R_p - R_f}{\sigma} \\ & = \frac{E[R_p - R_f]}{\sigma} \\ & = \frac{\mathrm{mean}(R_p - R_f)}{\mathrm{std}(R_p)} \end{align}
where:
- $R_p$ = Daily returns
- $R_f$ = Risk-free rate
- $\sigma$ = Std deviation of daily returns
- $R_p - R_f$ = Daily excess returns
Further, the risk-free rate is typically one of the following:
- LIBOR
- 30-day T-Bill
- 3-month T-Bill
- 0% (modern-day interest rates are near zero)
Sharpe ratio adjustment
The adjustment is $sr_{annualised} = sr \cdot \sqrt{k}$, where $k$ is determined by the sampling frequency.
- $k$ = 252 when using daily returns
- $k$ = 52 when using weekly returns
- $k$ = 12 when using monthly returns
Example using daily returns:
\begin{align} sr & = \sqrt{252} \cdot \frac{R_p - R_f}{\sigma} \end{align}
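As a minimal sketch (the function name is my own, not from the course), this is the same calculation in code:
def sharpe_ratio(rp, rf=0.0, k=252):
    """Annualised Sharpe ratio: sqrt(k) * mean(rp - rf) / std(rp).

    rp: per-period portfolio returns; rf: per-period risk-free rate
    (a scalar, or a series aligned with rp); k: periods per year.
    """
    return np.sqrt(k) * np.mean(rp - rf) / np.std(rp)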
Risk-free rate adjustment
Risk-free rates, i.e. Treasury yields, are often quoted as an effective annual rate (EAR). To convert the EAR into a daily return we use the following:
\begin{align} rf & = \left[ (1 + EAR) ^ \frac{1}{k} \right] - 1 \end{align}
where $k$ is determined by the sampling frequency:
- Daily quotes: $k = 252$
- Weekly quotes: $k = 52$
- Monthly quotes: $k = 12$
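A small sketch of that conversion (hypothetical function name), mirroring the formula above:
def ear_to_period_rate(ear, k=252):
    """Convert an effective annual rate to a per-period rate."""
    return (1 + ear) ** (1 / k) - 1

# e.g. a 5% EAR expressed as a daily rate
print(ear_to_period_rate(0.05))  # ~0.000194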
Portfolio returns: `rp`
Assume a buy-and-hold approach, investing $10,000 into each tradable asset individually at the beginning of the holding period.
Note: this would cause \$10,000 to be allocated to `gld`, which isn't actually traded until 2004-11-18. This leaves $1 / 5$ (20%) of our funds not generating a return for almost six years, which of course is a terrible idea.
rp = norm * 10000
# display final holdings
pd.options.display.float_format = '${:,.2f}'.format
final_holdings = rp.iloc[-1:].transpose()
print('Final values of holdings:')
print(final_holdings)
pd.options.display.float_format = None
# determine portfolio value in $ at the end of each trading day
rp['value'] = rp.sum(axis=1)
rp['value'].plot(**ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
# determine daily returns of the portfolio
rp = rp[['value']].pct_change()
rp = rp.rename(columns={'value': 'rp'})
# drops the first row, as there is no preceding data to determine returns
# for the first day in the series
rp = rp.dropna()
# review
print(rp.info())
print(rp.head())
print()
print('Mean daily portfolio return {0}'.format(rp.mean()))
# plot
rp.plot(**ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Risk-free rate: `rf`
Below, we get the yield for the 3-month treasury bill, quoted as an EAR on a daily basis. Quandl also has the 30-day yield, but the time series doesn't go back as far as the data we are using.
Quandl data tag: `USTREASURY/YIELD`
# create empty dataframe with dates to join to
# use the portfolio returns, this ensures index date alignment
# and an equal number of observations for the Sharpe ratio calc.
rf = pd.DataFrame(index=rp.index)
# get risk free rate
k = n_trading_days
# only return the 2nd column, '3 MO', and rename it to 'rf'
rf = rf.join(quandl.get("USTREASURY/YIELD.2",
                        start_date=start_date,
                        end_date=end_date), how='left')
rf = rf.rename(columns={'3 MO': 'rf'})
# fill blanks
rf = rf.fillna(method='ffill', axis=0)
rf = rf.fillna(method='bfill', axis=0)
# convert EAR to daily risk-free rate
# (note: USTREASURY/YIELD quotes yields in percent, e.g. 4.5 = 4.5%,
# so strictly the value should be divided by 100 before compounding)
rf = (1 + rf) ** (1 / k) - 1
# review
print(rf.info())
print(rf.head())
print('Mean daily risk-free return {0}'.format(rf.mean()))
# plot
rf.plot(**ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Review the excess returns over the risk-free rate. A quick review will let us know what to expect from our Sharpe ratio calculation.
excess_returns = rp['rp'] - rf['rf']
# stats
er_mean = excess_returns.mean()
er_std = excess_returns.std()
er_skew = excess_returns.skew()
er_kurt = excess_returns.kurtosis()
print('Mean\tStd Dev\tSkew\tKurtosis')
print('------\t-------\t------\t---------')
print('{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}'.format(er_mean, er_std, er_skew, er_kurt))
# plots
excess_returns.plot(label='excess_returns', **ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
excess_returns.plot(kind='kde', label='excess_returns', **ts_kwargs)
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Sharpe ratio calculation: `sr`
sr = np.sqrt(n_trading_days) * np.mean(excess_returns) / np.std(rp.values)
print('Sharpe ratio: {}'.format(sr))
The above is the same as the code below.
sr = np.sqrt(n_trading_days) * np.mean(rp.values - rf.values) / np.std(rp.values)
print('Sharpe ratio: {}'.format(sr))
This negative result means the portfolio performed worse than the risk-free rate over the holding period. We could have determined this would occur by quickly reviewing the means of the `rp` and `rf` time series, where we see that the mean daily portfolio return is less than the mean daily risk-free return.
rp.mean()
rf.mean()
Optimising functions
Read the following page to understand what is happening below: `scipy.optimize.minimize`.
def f(X, trace=False):
    """The model (objective function)."""
    Y = (X - 1.5)**2 + 0.5
    # allow for the results to be printed iteratively
    if trace:
        print('X = {0}, Y = {1}'.format(X, Y))
    return Y
# parameter into the optimisation routine
x_guess = 2.0
# perform optimisation
min_result = spo.minimize(
    fun=f, x0=x_guess, args=(True,), method='SLSQP', options={'disp': True})
# review
print('Minima found at:')
print('X = {}, Y = {}'.format(min_result.x, min_result.fun))
# create y values using our model function
x_plot = np.linspace(0.5, 2.5, 21)
y_plot = f(x_plot)
# plot the cost function
plt.plot(x_plot, y_plot, label='cost')
# plot the optimisation result
plt.plot(min_result.x, min_result.fun, 'ro', label='min result')
# plot
plt.grid()
plt.title('Minima of the objective (cost) function')
plt.legend(**legend_kwargs)
plt.tight_layout()
plt.show()
Optimisation ranges and constraints
Framing the optimisation problem
We need to know what we're optimising for:
- The Sharpe ratio is our objective.
- However, we want to maximise the Sharpe ratio, so when optimising we need to multiply it by negative 1 (minimising the negative Sharpe ratio maximises the Sharpe ratio); see the sketch after this list.
- Based on the above, the coefficients returned are the desired asset allocations in the portfolio that maximise the Sharpe ratio.
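A minimal sketch of how this could be framed with `scipy.optimize.minimize`, assuming a long-only, fully-invested portfolio and a 0% risk-free rate (the function and variable names are my own illustration, not the course's solution):
def neg_sharpe(allocs, norm_prices, k=252):
    """Objective: the negative annualised Sharpe ratio of the portfolio
    implied by the allocation weights (risk-free rate assumed to be 0%)."""
    port_val = (norm_prices * allocs).sum(axis=1)
    rets = port_val.pct_change().dropna()
    return -np.sqrt(k) * rets.mean() / rets.std()

n_assets = len(norm.columns)
x0 = np.repeat(1.0 / n_assets, n_assets)  # equal-weight initial guess
bounds = [(0.0, 1.0)] * n_assets          # long-only: 0% to 100% per asset
constraints = {'type': 'eq',
               'fun': lambda allocs: 1.0 - allocs.sum()}  # weights sum to 1

result = spo.minimize(neg_sharpe, x0, args=(norm,), method='SLSQP',
                      bounds=bounds, constraints=constraints)
print(dict(zip(norm.columns, result.x.round(3))))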