--- title: "tidyfst包实例分æž" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{chinese_tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` 我的Rè¯è¨€å°ä¼™ä¼´æœ€è¿‘分享了自己使用Ræ¥åšå·¥ä¸šçº§æ•°æ®æ¸…æ´—çš„[ç»éªŒ](https://mp.weixin.qq.com/s/NVlCPss32j6Ohdrc9Edx-A)ï¼Œæœ€è¿‘æˆ‘è‡ªå·±åœ¨ä¸æ–测试我的新包tidyfstï¼Œå› æ¤å°±æ‹¿è¿™ä¸ªdata.table的案例æ¥å°è¯•一下。 ## æµ‹è¯•æ•°æ®æž„é€ æœ¬æ¬¡æµ‹è¯•ï¼Œå°†ä¸ä¼šåŠ è½½data.table包,但是其实tidyfsté‡Œé¢æ— 处䏿˜¯data.tableçš„å…ƒç´ ï¼Œè€Œä¸”ä¹Ÿå¯¼å‡ºäº†å¾ˆå¤šå†…ç½®çš„data.table函数,比如as.data.tableå’Œdata.table。所以这些代ç 在tidyfstä¸å°±å¯ä»¥è‡ªå¦‚地使用。 ```{r} library(tidyfst) diamonds <- ggplot2::diamonds n = 1e5 #如果想åšå·¥ä¸šçº§æµ‹è¯•,å¯ä»¥ç»§ç»å¢žåŠ æ•°é‡ set.seed(2020) dtranges <- seq.Date(from = as.Date("2011-01-01"), to = as.Date("2020-01-01"), by = 1) n1 <- sample(nrow(diamonds), n, replace = TRUE) dat1 <- as.data.table(diamonds[n1, ]) dat1[, "dt"] <- sample(dtranges, n, replace = TRUE) # å¢žåŠ dt列 n2 <- sample(nrow(dat1), nrow(dat1)/1000) dat1[n2, "price"] <- NA # priceåˆ—æž„é€ åƒåˆ†ä¹‹ä¸€ç¼ºå¤±å€¼ dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))), price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE)) dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))), price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE)) print(dat1) ``` ## 基础 ### 尿Ѐ巧 åŽé¢çš„分æžï¼Œç»å¸¸è¦æ ¹æ®æ—¥æœŸè¿›è¡Œè®¡ç®—。所以,先对日期进行排åºï¼Œå°±èƒ½å¤Ÿæé«˜è¿è¡Œé€Ÿåº¦ã€‚在tidyfstä¸ï¼Œå¯ä»¥ä½¿ç”¨`arrange_dt`函数æ¥å¯¹æ•°æ®è¿›è¡ŒåŽŸä½çš„å„ç§æ“作,其ä¸å°±åŒ…括排åºã€‚ ```{r} dat1 = arrange_dt(dat1,dt) dat1 ``` 那么,现在dat1的数æ®å°±æŒ‰ç…§æ—¥æœŸæŽ’好åºäº†ã€‚ ### èšåˆ #### 1.求æ¯ç§åˆ‡å‰²ç±»åž‹ã€æ¯ç§é¢œè‰²é’»çŸ³çš„å¹³å‡ä»·æ ¼ã€ä¸ä½æ•°ä»·æ ¼ä¸Žæœ€é«˜ä»·æ ¼ 在tidyfstä¸ï¼Œæˆ‘设置了一个`sys_time_print`函数,å¯ä»¥æ–¹ä¾¿åœ°è¾“出`system.time()`函数返回的结果。 ```{r} sys_time_print({ r1_1 <- dat1 %>% summarise_dt( by = .(cut,color), mean_price = mean(price, na.rm = TRUE), median_price = median(price, na.rm = TRUE), max_price = max(price, na.rm = TRUE) ) }) r1_1 ``` tidyfst是永远ä¸å¯èƒ½æ¯”data.tableå¿«çš„ï¼Œä½†æ˜¯å¦‚æžœä½ è§‰å¾—ä¸Šé¢çš„ä»£ç æ›´å®¹æ˜“掌æ¡ã€æ›´å®¹æ˜“读懂,而在日常工作ä¸å¤šèŠ±é›¶ç‚¹å‡ ç§’çš„è¿è¡Œæ—¶é—´æ²¡æœ‰å¤ªå¤§é—®é¢˜ï¼ˆå®žé™…上节çœäº†å¤§å®¶çš„äº¤æµæ—¶é—´ï¼Œç”šè‡³å°±æ˜¯èŠ‚çœå°†æ¥è‡ªå·±å†æ¬¡è¯»æ‡‚自己代ç 的时间),tidyfst就值得拥有。 #### 2.求æ¯å¤©æœ€é«˜å‡ºå”®ä»·æ ¼å¯¹åº”çš„é‚£ç¬”è®¢å• ```{r} sys_time_print({ r1_2 <- dat1 %>% arrange_dt(dt,-price) %>% drop_na_dt(price) %>% group_dt( by = dt, head(1) ) }) r1_2 ``` ### join #### 1.dat1与dat2以dt列左连接 实质上,merge函数已ç»ä¼˜åŒ–得很好。tidyfst设计`*_join`ç³»åˆ—å‡½æ•°çš„æ—¶å€™ï¼Œåªæ˜¯ä¸ºäº†ä¸€ç§ä¸ä¸€æ ·çš„è¯æ³•结构æ¥å¸®åŠ©å®žçŽ°ä¸åŒçš„è¿žæŽ¥ï¼Œå› ä¸ºå®ƒç¡®å®žæ›´åŠ ç›´è§‚ä¸€äº›ã€‚ä½†æ˜¯å®žè´¨ä¸Šå®ƒè¿˜æ˜¯merge.data.table函数的包装版本。 ```{r} sys_time_print({ r2_1 <- dat1 %>% left_join_dt(dat2,by = "dt") }) r2_1 ``` #### 2.多é‡join ```{r} sys_time_print({ mymerge <- function(x, y) left_join_dt(x, y, by = "dt") r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3)) }) r2_2 ``` ### é•¿å®½è¡¨è½¬æ¢ #### 1.长表转宽表 ```{r} sys_time_print({ mean1 <- function(x) mean(x, na.rm = TRUE) max1 <- function(x) max(x, na.rm = TRUE) r3_1 <-dat1 %>% wider_dt(cut, value = c("depth", "price"), name = "color", fun = list(mean1,max1)) }) r3_1 ``` #### 2.宽表转长表 ```{r} sys_time_print({ r3_2 <-dat1 %>% select_dt(cut,color,x,y,z) %>% longer_dt(cut,color, name = "xyz", value = "xyzvalue") }) r3_2 ``` ## 高阶 ### å‘上/下填充空值 对于填充空值æ¥è¯´ï¼Œå¯ä»¥è¿™æ ·æ“作: ```{r} sys_time_print({ dat1 %>% fill_na_dt(price) -> dat1 }) dat1 ``` ### æ·»åŠ å维度èšåˆç»“果为新列 #### 1.以dat1ä¸ºä¾‹ï¼Œæ·»åŠ ä¸¤åˆ—ï¼Œä¸€åˆ—ä¸ºä»¥cutã€colorèšåˆæ±‚priceçš„å‡å€¼ï¼Œå¦ä¸€åˆ—æ˜¯æ±‚æ ‡å‡†å·® ```{r} sys_time_print({ mutate_dt(dat1, mean_price = mean(price, na.rm = TRUE), sd_price = sd(price, na.rm = TRUE), by = .(cut, color)) }) dat1 ``` #### 2.以dat1为例,以dtåˆ†ç»„æ·»åŠ ä¸€åˆ—åºå·id ```{r} sys_time_print({ dat1 %>% group_dt( by = dt, mutate_dt(id = seq(.N)) ) -> dat1 }) dat1 ``` ### 移动函数 ```{r} sys_time_print({ dat1 %>% group_dt( by = color, mutate_dt( MA10_price = frollmean(price, 10), MSD10_price = frollapply(price, 10, FUN = sd) ) ) -> dat1 }) dat1 ``` ## ç³»ç»Ÿå‚æ•° ```{r} sessionInfo() ```