---
title: "Scaling and Modeling with tidyquant"
author: "Matt Dancho"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Scaling and Modeling with tidyquant}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      fig.width = 8,
                      fig.height = 4.5,
                      fig.align = 'center',
                      out.width = '95%',
                      dpi = 150)
# devtools::load_all() # Travis CI fails on load_all()
```

> Designed for the data science workflow of the `tidyverse`

# Overview

The greatest benefit of `tidyquant` is the ability to apply the data science workflow to easily model and scale your financial analysis as described in [_R for Data Science_](https://r4ds.hadley.nz/). Scaling is the process of creating an analysis for one asset and then extending it to multiple groups. This idea of scaling is incredibly useful to financial analysts because typically one wants to compare many assets to make informed decisions. Fortunately, the `tidyquant` package integrates with the `tidyverse`, making scaling super simple!

All `tidyquant` functions return data in the `tibble` (tidy data frame) format, which allows for interaction within the `tidyverse`. This means we can:

* Seamlessly scale data retrieval and mutations
* Use the pipe (`%>%`) for chaining operations
* Use `dplyr` and `tidyr`: `select`, `filter`, `group_by`, `nest`/`unnest`, `spread`/`gather`, etc.
* Use `purrr`: mapping functions with `map()`
* Model financial analysis using the data science workflow in [_R for Data Science_](https://r4ds.hadley.nz/)

We'll go through some useful techniques for getting and manipulating groups of data.

# Prerequisites

Load the `tidyquant` package to get started.
```r
# Loads tidyquant, xts, quantmod, TTR, and PerformanceAnalytics
library(tidyverse)
library(tidyquant)
```

```{r, include=FALSE}
# Loads packages for R CMD CHECK
library(lubridate)
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
library(tidyquant)
```

# 1.0 Scaling the Getting of Financial Data

A very basic example is retrieving the stock prices for multiple stocks. There are three primary ways to do this:

## Method 1: Map a character vector with multiple stock symbols

```{r}
c("AAPL", "GOOG", "META") %>%
    tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
```

The output is a single-level tibble with all of the stock prices in one tibble. The auto-generated column name is "symbol", which can be preemptively renamed by giving the vector a name (e.g. `stocks <- c("AAPL", "GOOG", "META")`) and then piping to `tq_get`.

## Method 2: Map a tibble with stocks in first column

First, get a stock list in data frame format either by making the tibble or retrieving from `tq_index` / `tq_exchange`. The stock symbols must be in the first column.

### Method 2A: Make a tibble

```{r}
stock_list <- tibble(stocks = c("AAPL", "JPM", "CVX"),
                     industry = c("Technology", "Financial", "Energy"))
stock_list
```

Second, send the stock list to `tq_get`. Notice how the symbol and industry columns are automatically expanded to the length of the stock prices.

```{r}
stock_list %>%
    tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
```

### Method 2B: Use index or exchange

Get an index...

```{r}
tq_index("DOW")
```

...or, get an exchange.

```{r, eval=FALSE}
tq_exchange("NYSE")
```

Send the index or exchange to `tq_get`.
_Important Note: This can take several minutes depending on the size of the index or exchange, which is why only the first three stocks are evaluated in the vignette._

```{r}
tq_index("DOW") %>%
    slice(1:3) %>%
    tq_get(get = "stock.prices")
```

You can use any applicable "getter" to get data for __every stock in an index or an exchange__! This includes: "stock.prices", "dividends", "splits", and more (run `tq_get_options()` for the full list).

# 2.0 Scaling the Mutation of Financial Data

Once you get the data, you typically want to do something with it. You can easily do this at scale. Let's get the yearly returns for multiple stocks using `tq_transmute`.

First, get the prices. We'll use the `FANG` data set, but you typically will use `tq_get` to retrieve data in "tibble" format.

```{r}
FANG
```

Second, use `group_by` to group by stock symbol. Third, apply the mutation. We can do this in one easy workflow. The `periodReturn` function is applied to each group of stock prices, and a new data frame is returned with the annual returns in the correct periodicity.

```{r}
FANG_returns_yearly <- FANG %>%
    group_by(symbol) %>%
    tq_transmute(select     = adjusted,
                 mutate_fun = periodReturn,
                 period     = "yearly",
                 col_rename = "yearly.returns")
```

Last, we can visualize the returns.

```{r}
FANG_returns_yearly %>%
    ggplot(aes(x = year(date), y = yearly.returns, fill = symbol)) +
    geom_bar(position = "dodge", stat = "identity") +
    labs(title = "FANG: Annual Returns",
         subtitle = "Mutating at scale is quick and easy!",
         y = "Returns", x = "", color = "") +
    scale_y_continuous(labels = scales::percent) +
    coord_flip() +
    theme_tq() +
    scale_fill_tq()
```

# 3.0 Modeling Financial Data using purrr

Eventually you will want to begin modeling (or more generally applying functions) at scale! One of the __best__ features of the `tidyverse` is the ability to map functions to nested tibbles using `purrr`.
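As a minimal sketch of what "mapping functions to nested tibbles" means, the toy example below nests a small data frame by symbol and then maps a summary function over the nested column. The symbols, the returns, and the names `toy_returns`, `toy_nested`, and `toy_summary` are all made up for illustration; they are not part of `tidyquant`.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two made-up symbols with three yearly returns each
toy_returns <- tibble(
    symbol = rep(c("AAA", "BBB"), each = 3),
    year   = rep(2014:2016, times = 2),
    ret    = c(0.10, 0.20, 0.30, 0.05, -0.05, 0.00)
)

# Nest: one row per symbol, with that symbol's rows stored in a list column
toy_nested <- toy_returns %>%
    group_by(symbol) %>%
    nest()

# Map: apply a function to each nested tibble and store the result as a column
toy_summary <- toy_nested %>%
    mutate(mean_ret = map_dbl(data, ~ mean(.x$ret))) %>%
    ungroup() %>%
    select(symbol, mean_ret)

toy_summary
```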
From the Many Models chapter of "[R for Data Science](https://r4ds.hadley.nz/)", we can apply the same modeling workflow to financial analysis, using a two-step workflow:

1. Model a single stock
2. Scale to many stocks

Let's go through an example to illustrate.

## Example: Applying a Regression Model to Detect a Positive Trend

In this example, we'll use a simple linear model to identify the trend in annual returns to determine if the stock returns are decreasing or increasing over time.

### Analyze a Single Stock

First, let's collect stock data with `tq_get()`.

```{r}
AAPL <- tq_get("AAPL", from = "2007-01-01", to = "2016-12-31")
AAPL
```

Next, come up with a function to help us collect annual log returns. The function below mutates the stock prices to period returns using `tq_transmute()`. We add the `type = "log"` and `period = "yearly"` arguments to ensure we retrieve a tibble of annual log returns.

```{r}
get_annual_returns <- function(stock.returns) {
    stock.returns %>%
        tq_transmute(select     = adjusted,
                     mutate_fun = periodReturn,
                     type       = "log",
                     period     = "yearly")
}
```

Let's test `get_annual_returns` out. We now have the annual log returns over the past ten years.

```{r}
AAPL_annual_log_returns <- get_annual_returns(AAPL)
AAPL_annual_log_returns
```

Let's visualize to identify trends. We can see from the linear trend line that AAPL's stock returns are declining.

```{r}
AAPL_annual_log_returns %>%
    ggplot(aes(x = year(date), y = yearly.returns)) +
    geom_hline(yintercept = 0, color = palette_light()[[1]]) +
    geom_point(size = 2, color = palette_light()[[3]]) +
    geom_line(linewidth = 1, color = palette_light()[[3]]) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "AAPL: Visualizing Trends in Annual Returns",
         x = "", y = "Annual Returns", color = "") +
    theme_tq()
```

Now, we can get the linear model using the `lm()` function. However, there is one problem: the output is not "tidy".
```{r}
mod <- lm(yearly.returns ~ year(date), data = AAPL_annual_log_returns)
mod
```

We can utilize the `broom` package to get "tidy" data from the model. There are three primary functions:

1. `augment`: adds columns to the original data such as predictions, residuals and cluster assignments
2. `glance`: provides a one-row summary of model-level statistics
3. `tidy`: summarizes a model's statistical findings such as coefficients of a regression

We'll use `tidy` to retrieve the model coefficients.

```{r}
library(broom)
tidy(mod)
```

Adding to our workflow, we have the following:

```{r}
get_model <- function(stock_data) {
    annual_returns <- get_annual_returns(stock_data)
    mod <- lm(yearly.returns ~ year(date), data = annual_returns)
    tidy(mod)
}
```

Let's test it out on a single stock. We can see that the "term" that contains the direction of the trend (the slope) is "year(date)". The interpretation is that as the year increases by one unit, the annual returns decrease by 3%.

```{r}
get_model(AAPL)
```

Now that we have identified the trend direction, it looks like we are ready to scale.

### Scale to Many Stocks

Once the analysis for one stock is done, scaling to many stocks is simple. For brevity, we'll randomly sample five stocks from the S&P500 with a call to `dplyr::sample_n()`.

```{r}
set.seed(10)
stocks_tbl <- tq_index("SP500") %>%
    sample_n(5)
stocks_tbl
```

We can now apply our analysis function to the stocks using `dplyr::mutate()` and `purrr::map()`. The `mutate()` function adds a column to our tibble, and the `map()` function maps our custom `get_model` function over the nested `data` column, which holds the prices for each stock. The `tidyr::unnest()` function unrolls the nested data frame so all of the model statistics are accessible in the top data frame level. The `filter`, `arrange` and `select` steps just manipulate the data frame to isolate and arrange the data for our viewing.
```{r}
stocks_model_stats <- stocks_tbl %>%
    select(symbol, company) %>%
    tq_get(from = "2007-01-01", to = "2016-12-31") %>%

    # Nest
    group_by(symbol, company) %>%
    nest() %>%

    # Apply the get_model() function to the new "nested" data column
    mutate(model = map(data, get_model)) %>%

    # Unnest and collect slope
    unnest(model) %>%
    filter(term == "year(date)") %>%
    arrange(desc(estimate)) %>%
    select(-term)

stocks_model_stats
```

We're done! We now have the coefficient of the linear regression that tracks the direction of the trend line. We can easily extend this type of analysis to larger lists or stock indexes. For example, the entire S&P500 could be analyzed by removing the `sample_n()` call that follows `tq_index("SP500")`.

# 4.0 Error Handling when Scaling

Eventually you will run into a stock index, stock symbol, FRED data code, etc. that cannot be retrieved. Possible reasons are:

* An index becomes out of date
* A company goes private
* A stock ticker symbol changes
* Yahoo / FRED just doesn't like your stock symbol / FRED code

This becomes painful when scaling if the functions return errors. So, the `tq_get()` function is designed to handle errors _gracefully_. What this means is that an `NA` value is returned when an error is generated, along with a _gentle error warning_.

```{r}
tq_get("XYZ", "stock.prices")
```

## Pros and Cons to Built-In Error-Handling

There are pros and cons to this approach that you may not agree with, but I believe it helps in the long run. Just be aware of what happens:

* __Pros__: Long running scripts are not interrupted because of one error
* __Cons__: Errors can be inadvertently handled or flow downstream if the user does not read the warnings

## Bad Apples Fail Gracefully, tq_get

Let's see an example when using `tq_get()` to get the stock prices for a long list of stocks with one `BAD APPLE`. The argument `complete_cases` comes in handy. The default is `TRUE`, which removes "bad apples" so future analyses have complete cases to compute on.
Note the gentle warning stating that an error occurred and was dealt with by removing the rows from the results.

```{r, warning = TRUE}
c("AAPL", "GOOG", "BAD APPLE") %>%
    tq_get(get = "stock.prices", complete_cases = TRUE)
```

Now switching to `complete_cases = FALSE` will retain any errors as `NA` values in a nested data frame. Notice that the error message and output change. The error message now states that the `NA` values exist in the output and the return is a "nested" data structure.

```{r, warning = TRUE}
c("AAPL", "GOOG", "BAD APPLE") %>%
    tq_get(get = "stock.prices", complete_cases = FALSE)
```

In both cases, the prudent user will review the warnings to determine what happened and whether or not this is acceptable. In the `complete_cases = FALSE` example, if the user attempts to perform downstream computations at scale, the computations will likely fail, grinding the analysis to a halt. But the advantage is that the user can more easily filter down to the root of the problem, determine what happened, and decide whether this is acceptable or not.
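
As a rough sketch of that filtering step, the example below builds a small stand-in for a nested result in which one retrieval failed, then separates the failed symbol from the usable rows. The tibble `nested_prices`, its list-column name `data`, and the price values are hypothetical and made up for illustration; the column names in an actual `tq_get()` result may differ, so inspect the real output before adapting this.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stand-in for a nested result: the list column holds a tibble
# for each symbol that was retrieved and NA for the one that failed.
# The price values are made up.
nested_prices <- tibble(
    symbol = c("AAPL", "GOOG", "BAD APPLE"),
    data   = list(
        tibble(date = as.Date("2016-01-04"), adjusted = 100),
        tibble(date = as.Date("2016-01-04"), adjusted = 700),
        NA
    )
)

# Flag the rows whose list-column entry is an actual data frame
flagged <- nested_prices %>%
    mutate(retrieved = map_lgl(data, is.data.frame))

# Symbols to investigate
failed_symbols <- flagged %>%
    filter(!retrieved) %>%
    pull(symbol)

# Usable rows, unnested back to a single-level tibble
good_prices <- flagged %>%
    filter(retrieved) %>%
    select(symbol, data) %>%
    unnest(data)

failed_symbols
good_prices
```

Keeping the failed symbols in their own vector makes it easy to review them before deciding whether to drop them or fix the inputs.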