Python and Quantopian Tools

Reasons to use Python

  • Speed
    • Python's data science tools are generally much faster than their R counterparts (benchmarks below)
  • Big Data
    • Python's tools for out-of-core/parallel computing are better regarded, and older, than the R alternatives
      • Blaze/Dask
      • pyspark
      • sklearn (partial_fit; see the sketch below)
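As a rough illustration of the partial_fit idea, here is a minimal sketch (SGDRegressor and the random chunks are just stand-ins): the model is updated incrementally on chunks of data, so the full dataset never has to sit in memory at once.

from sklearn.linear_model import SGDRegressor
import numpy as np

model = SGDRegressor()

# Pretend each chunk is streamed from disk; only one chunk is in memory at a time
for _ in range(10):
    X_chunk = np.random.rand(10_000, 5)
    y_chunk = np.random.rand(10_000)
    model.partial_fit(X_chunk, y_chunk)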

Reasons to use Python

  • Packages
    • NLTK
    • Web Scraping
      • beautifulsoup, Scrapy, Selenium
    • Deep Learning
    • Quantopian (and other backtesting frameworks)
  • Implement into production (app/software)

Speed

  • Python's data science tools are better optimized for speed than their R counterparts.
  • NumPy arrays are the building blocks for most of Python's data science tools, and NumPy arrays are optimized for speed (see the sketch below).
  • Pandas provides high-level, easy-to-use data manipulation tools built on top of NumPy.
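A quick way to see the difference (a minimal sketch; the array size is arbitrary) is to time a pure-Python loop against the equivalent vectorized NumPy call:

import timeit
import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000)

# Pure-Python sum over a list vs. NumPy's vectorized sum over an array
print(timeit.timeit(lambda: sum(values), number=10))
print(timeit.timeit(lambda: arr.sum(), number=10))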

Speed Testing

  • Comparing Python and R using the "Million Song" dataset
    • tidyverse vs pandas/NumPy aggregation
    • Random forest in R (randomForest) vs sklearn

Data

  • 438 MB
  • The response is the song's release year
  • 90 features extracted from the audio
  • An id column was added to create groups for comparing pandas and dplyr performance (a plausible loading step is sketched below)
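The shape and head(10) shown below are consistent with a loading step along these lines; the file name and the groups-of-six id scheme are assumptions, not the exact code that was used:

import pandas as pd
import numpy as np

# Hypothetical load: a headerless CSV with the release year in column 0 and the 90 audio features after it
songs = pd.read_csv("YearPredictionMSD.txt", header=None)

# Assumed grouping: consecutive blocks of 6 rows share an id (this matches the preview below)
songs["id"] = np.arange(len(songs)) // 6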
In [4]:
songs.shape
Out[4]:
(515345, 92)
In [5]:
songs.head(10)
Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 82 83 84 85 86 87 88 89 90 id
0 2001 49.94357 21.47114 73.07750 8.74861 -17.40628 -13.09905 -25.01202 -12.23257 7.83089 ... -54.40548 58.99367 15.37344 1.11144 -23.08793 68.40795 -1.82223 -27.46348 2.26327 0
1 2001 48.73215 18.42930 70.32679 12.94636 -10.32437 -24.83777 8.76630 -0.92019 18.76548 ... -19.68073 33.04964 42.87836 -9.90378 -32.22788 70.49388 12.04941 58.43453 26.92061 0
2 2001 50.95714 31.85602 55.81851 13.41693 -6.57898 -18.54940 -3.27872 -2.35035 16.07017 ... 26.05866 -50.92779 10.93792 -0.07568 43.20130 -115.00698 -0.05859 39.67068 -0.66345 0
3 2001 48.24750 -1.89837 36.29772 2.58776 0.97170 -26.21683 5.05097 -10.34124 3.55005 ... -171.70734 -16.96705 -46.67617 -12.51516 82.58061 -72.08993 9.90558 199.62971 18.85382 0
4 2001 50.97020 42.20998 67.09964 8.46791 -15.85279 -16.81409 -12.48207 -9.37636 12.63699 ... -55.95724 64.92712 -17.72522 -1.49237 -7.50035 51.76631 7.88713 55.66926 28.74903 0
5 2001 50.54767 0.31568 92.35066 22.38696 -25.51870 -19.04928 20.67345 -5.19943 3.63566 ... -50.69577 26.02574 18.94430 -0.33730 6.09352 35.18381 5.00283 -11.02257 0.02263 0
6 2001 50.57546 33.17843 50.53517 11.55217 -27.24764 -8.78206 -12.04282 -9.53930 28.61811 ... 25.44182 134.62382 21.51982 8.17570 35.46251 11.57736 4.50056 -4.62739 1.40192 1
7 2001 48.26892 8.97526 75.23158 24.04945 -16.02105 -14.09491 8.11871 -1.87566 7.46701 ... -58.46192 -65.56438 46.99856 -4.09602 56.37650 -18.29975 -0.30633 3.98364 -3.72556 1
8 2001 49.75468 33.99581 56.73846 2.89581 -2.92429 -26.44413 1.71392 -0.55644 22.08594 ... 5.20391 -27.75192 17.22100 -0.85210 -15.67150 -26.36257 5.48708 -9.13495 6.08680 1
9 2007 45.17809 46.34234 -40.65357 -2.47909 1.21253 -0.65302 -6.95536 -12.20040 17.02512 ... -87.55285 -70.79677 76.57355 -7.71727 3.26926 -298.49845 11.49326 -89.21804 -15.09719 1

10 rows × 92 columns

Aggregation - Mean

songs %>%
    group_by(id) %>%
    summarise_all(funs(mean(., na.rm = TRUE))) %>%
    select(-id)

On my system, timed with a single system.time() call, this runs in 1.25 seconds.

In [6]:
%%timeit 

(songs
    .groupby("id")
    .mean()
    .reset_index(drop = True))
884 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Aggregation - Median

songs %>%
    group_by(id) %>%
    summarise_all(funs(median(., na.rm = TRUE))) %>%
    select(-id)

Execution time was 497.5 seconds (8 minutes)

In [7]:
%%timeit

(songs
    .groupby("id")
    .median()
    .reset_index(drop = True))
1.2 s ± 61.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Aggregation - Custom

songs %>%
    group_by(id) %>%
    summarise_all(funs(sum(. > 0))) %>%
    select(-id)

Execution time was 185.8 seconds (3 minutes)

In [8]:
%%timeit -r 3

(songs
    .groupby('id')
    .apply(lambda df: (df > 0).sum())
    .reset_index(drop = True))
51 s ± 1.82 s per loop (mean ± std. dev. of 3 runs, 1 loop each)

Additional gains can be made by converting to NumPy arrays and letting NumPy handle the calculations.

In [9]:
def number_positive(df):
    # Work on the underlying NumPy array instead of the DataFrame
    numpy_df = df.values
    # Boolean mask of the positive entries
    numpy_df = numpy_df > 0
    # Count positives column by column
    counts = numpy_df.sum(axis = 0)
    # Wrap the counts back up as a Series so groupby/apply reassembles a DataFrame
    pandas_series = pd.Series(counts)

    return pandas_series
In [10]:
%%timeit -r 3

(songs
    .groupby('id')
    .apply(number_positive)
    .reset_index(drop = True))
14.1 s ± 164 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Running a model

What do all good data scientists do when they first want to analyze a dataset?

Blindly run a random forest on it!

In [11]:
songs = songs.groupby("id").mean().reset_index(drop = True)
In [12]:
songs.shape
Out[12]:
(85891, 91)

Random Forest

fit <- randomForest(X, y, ntree=10)

Execution time of 525.8 seconds (9 minutes)

In [15]:
X = songs.iloc[:, range(1, 91)]
y = songs[0]

rf = RandomForestRegressor(n_estimators=10, n_jobs = 1, max_features = "sqrt", min_samples_leaf = 5)
%timeit -r 3 rf.fit(X, y)
6.31 s ± 42.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Quantopian Tools

  • Alphalens
  • Zipline
  • Pyfolio

Alphalens

Analyzes how predictive a single feature (alpha factor) is of future stock returns.

The core functionality of alphalens is as follows.

For each day:

  • Splits candidate stocks into quantile groups by the feature
  • Calculates 1-, 5-, and 10-day forward returns for each group

Then aggregates over all days

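In code, that workflow looks roughly like the sketch below; factor (a Series indexed by date and asset) and prices (a DataFrame of daily prices) are placeholders rather than data defined in this notebook.

import alphalens

# factor: pd.Series indexed by (date, asset); prices: DataFrame of daily prices (dates x assets).
# Both are assumed to already exist -- they are placeholders for this sketch.
factor_data = alphalens.utils.get_clean_factor_and_forward_returns(
    factor, prices, quantiles=5, periods=(1, 5, 10))

# Per-quantile forward-return statistics and plots, aggregated over all days
alphalens.tears.create_full_tear_sheet(factor_data)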

Zipline

  • Quantopian's first tool
  • Algorithmic trading simulator
  • Used for backtesting a trading strategy/algorithm

Zipline - Reasons to use

  • Well tested and thought out platform
  • Can automatically model slippage and transaction costs (see the sketch after this list)
  • Calculates many important metrics out of the box
    • Beta
    • Risk-adjusted return measures (Sharpe/Sortino ratios)
  • Integrates with Pyfolio
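As an example of the slippage/commission handling, the default models can be replaced with explicit ones inside initialize; the parameter values below are just illustrative:

from zipline.api import set_commission, set_slippage
from zipline.finance import commission, slippage

def initialize(context):
    # Model price impact as a function of each bar's volume, and charge a flat per-share fee
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.1))
    set_commission(commission.PerShare(cost=0.001, min_trade_cost=1.0))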

Pyfolio

Pretty tables and plots: pyfolio turns backtest results into performance tear sheets.

Simple example

In [4]:
from zipline import run_algorithm
from zipline.api import symbol, get_datetime, order_target_percent
import pyfolio
import pandas as pd

Our simple strategy is to go long Apple on even days of the month and short it on odd days.

In [2]:
def initialize(context):
    pass

def handle_data(context, data):
    # Called once per trading day; get_datetime() returns the simulation's current time
    day_of_month = get_datetime().day

    if day_of_month % 2 == 0:
        # Even day of the month: put 100% of the portfolio long AAPL
        order_target_percent(symbol("AAPL"), 1)
    else:
        # Odd day: go 100% short AAPL
        order_target_percent(symbol("AAPL"), -1)
In [5]:
result = run_algorithm(start = pd.Timestamp("2017-01-01", tz = "UTC"), 
                       end = pd.Timestamp("2017-06-30", tz = "UTC"), 
                       initialize = initialize, 
                       capital_base = 1000000, 
                       handle_data = handle_data,
                       data_frequency = "daily",
                       bundle = "quantopian-quandl")
In [6]:
result.positions.head()
Out[6]:
2017-01-03 21:00:00+00:00                                                   []
2017-01-04 21:00:00+00:00    [{'cost_basis': 116.01249807190719, 'sid': Equ...
2017-01-05 21:00:00+00:00    [{'cost_basis': 116.62499919342616, 'sid': Equ...
2017-01-06 21:00:00+00:00    [{'cost_basis': 117.89491918697846, 'sid': Equ...
2017-01-09 21:00:00+00:00    [{'cost_basis': 119.00500307449789, 'sid': Equ...
Name: positions, dtype: object
In [8]:
result.ending_cash.head(5)
Out[8]:
2017-01-03 21:00:00+00:00           1,000,000.00
2017-01-04 21:00:00+00:00           1,998,751.60
2017-01-05 21:00:00+00:00             -10,218.20
2017-01-06 21:00:00+00:00           2,011,573.81
2017-01-09 21:00:00+00:00             -18,523.59
Name: ending_cash, dtype: float64
In [9]:
pf_data = pyfolio.utils.extract_rets_pos_txn_from_zipline(result)
pyfolio.create_full_tear_sheet(*pf_data)
Entire data start date: 2017-01-03
Entire data end date: 2017-06-30
Backtest months: 5
Backtest
Annual return 18.0%
Cumulative returns 8.5%
Annual volatility 17.2%
Sharpe ratio 1.04
Calmar ratio 1.79
Stability 0.64
Max drawdown -10.0%
Omega ratio 1.23
Sortino ratio 1.89
Skew 1.70
Kurtosis 8.05
Tail ratio 0.97
Daily value at risk -2.1%
Gross leverage 1.00
Daily turnover 94.1%
Alpha 0.29
Beta -0.59
Worst drawdown periods
   Net drawdown in %   Peak date    Valley date  Recovery date  Duration
0              10.04  2017-06-09   2017-06-26           NaT         NaN
1               5.55  2017-02-06   2017-03-15    2017-03-28          37
2               4.59  2017-04-24   2017-05-05    2017-05-17          18
3               2.68  2017-05-23   2017-06-02    2017-06-09          14
4               2.56  2017-01-06   2017-01-30    2017-02-01          19
Stress Events
             mean     min    max
New Normal  0.07%  -2.87%  6.06%
Top 10 long positions of all time
          max
AAPL  105.89%

Top 10 short positions of all time
          max
AAPL -103.01%

Top 10 positions of all time
          max
AAPL  105.89%

All positions ever held
          max
AAPL  105.89%

Vardon Backtest

In [10]:
def initialize(context):
    # `picks` is a DataFrame of scheduled trades indexed by date, defined earlier in the notebook;
    # `z` and `f` are assumed to be aliases for the zipline.api and zipline.finance modules (imports not shown)
    context.transactions = picks

    # Turn off commission and slippage so fills match the model's intended prices
    z.set_commission(f.commission.PerShare(cost=0, min_trade_cost=0))
    z.set_slippage(f.slippage.FixedSlippage(spread=0.0))

    # Scheduling the function to run one minute before close results in it executing at close
    z.schedule_function(func = order_stocks, time_rule = z.time_rules.market_close(minutes = 1),
                        half_days=True)

def order_stocks(context, data):
    # Get the date that zipline is on in the backtest
    current_date = (z.get_datetime()
                        .tz_localize(None)
                        .normalize())

    # Rows of the picks DataFrame scheduled for today
    transactions = context.transactions.loc[current_date]

    # Place an order for each transaction: column 1 holds the ticker, column 2 the share count
    for row in transactions.itertuples(index=True, name='Pandas'):
        z.order(z.symbol(row[1]), row[2])
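To run this backtest, the initialize function above is passed to run_algorithm just as in the simple example; the dates, capital base, and bundle below are copied from that example and are placeholders here.

result = run_algorithm(start = pd.Timestamp("2017-01-01", tz = "UTC"),
                       end = pd.Timestamp("2017-06-30", tz = "UTC"),
                       initialize = initialize,
                       capital_base = 1000000,
                       data_frequency = "daily",
                       bundle = "quantopian-quandl")

# The result feeds into pyfolio exactly as before
pf_data = pyfolio.utils.extract_rets_pos_txn_from_zipline(result)
pyfolio.create_full_tear_sheet(*pf_data)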