[R] Pipe operations with the magrittr package: R for Data Science, part 11

半为花间酒 · 简书 · 2020-04-28 09:56

Study notes on Chapter 18, "Pipes", of R for Data Science
Reference: R for Data Science

library(magrittr)

Piping alternatives

- Intermediate steps

Saving each intermediate step as a new object does not duplicate the data: R shares columns across data frames where possible.

diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 3.46 MB
pryr::object_size(diamonds2)
#> 3.89 MB
pryr::object_size(diamonds, diamonds2)
#> 3.89 MB

# Once a column is modified, that column is no longer shared between the data frames
diamonds$carat[1] <- NA
pryr::object_size(diamonds)
#> 3.46 MB
pryr::object_size(diamonds2)
#> 3.89 MB
pryr::object_size(diamonds, diamonds2)
#> 4.32 MB

pryr::object_size() reports the memory occupied by the given objects and accepts several objects at once.
object.size() accepts only a single object.

- Function composition

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ), 
  on = head
)

The Dagwood sandwich problem:
The disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart.

- Use the pipe

foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

# Essentially equivalent to the following
my_pipe <- function(.) {
  . <- hop(., through = forest)
  . <- scoop(., up = field_mice)
  bop(., on = head)
}
my_pipe(foo_foo)
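
The foo_foo example cannot be run because its functions are never defined. As a minimal sketch, the same three styles can be compared on a plain numeric vector; the vector `x` and the choice of sqrt(), mean(), and round() are illustrative assumptions, not part of the book's example.

```r
library(magrittr)

x <- c(1, 4, 9, 16)  # illustrative input

# Function composition: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Pipe: read from left to right
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Roughly what the pipe does: rebind `.` at every step
my_pipe <- function(.) {
  . <- sqrt(.)
  . <- mean(.)
  round(., 1)
}
by_hand <- my_pipe(x)

identical(nested, piped) && identical(piped, by_hand)
```

All three forms compute the same value; only the reading order and the need for intermediate names differ.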

Two cases where the pipe does not work

(1) Functions that use the current environment, such as assign(), load(), and get()

assign("x", 10); x
# [1] 10

"x" %>% assign(100); x
# [1] 10

env <- environment()
"x" %>% assign(100, envir = env); x
# [1] 100
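
The same idea can be sketched with a dedicated environment, which keeps the example independent of whatever `x` already exists in the workspace. The names `target` and `"y"` are hypothetical and chosen only for illustration.

```r
library(magrittr)

# A separate environment to receive the assignment (hypothetical name)
target <- new.env()

# Passing envir explicitly tells assign() where to write,
# so it does not matter which environment the pipe evaluates in
"y" %>% assign(42, envir = target)

get("y", envir = target)
```

Stating the environment explicitly is the safer habit whenever an environment-sensitive function has to appear in a pipeline.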

(2) Functions that rely on lazy evaluation, such as most condition-handling functions:
tryCatch(), try(), suppressMessages(), suppressWarnings()

tryCatch(stop("!"), error = function(e) "An error")
#> [1] "An error"

stop("!") %>% 
  tryCatch(error = function(e) "An error")
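#> Error in eval(lhs, parent, parent): !
# Note: this output comes from the magrittr release current when these notes were written.
# magrittr 2.0 and later evaluate the left-hand side lazily, so the handler may catch the error there.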
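
One way around this, sketched here as an assumption rather than as the book's recommendation, is to wrap the whole pipeline inside tryCatch() instead of piping into it. The function `risky()` is a hypothetical stand-in for any step that may fail.

```r
library(magrittr)

# Hypothetical step that always fails
risky <- function(x) stop("!")

# The pipeline is an argument of tryCatch(), so it is evaluated
# inside the handler's scope and the error can be caught
result <- tryCatch(
  1:3 %>% risky(),
  error = function(e) "An error"
)

result
```

Because the pipeline sits inside the handler rather than in front of it, this form does not depend on how the pipe evaluates its left-hand side.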

When not to use the pipe

Knowing when not to use the pipe is just as important.

Pipes are most useful for rewriting a fairly short linear sequence of operations. Reach for another tool when:

- Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That makes debugging easier, because you can check the intermediate results, and it makes the code easier to understand, because the variable names help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe.
- You are starting to think about a directed graph with a complex dependency structure. Pipes are fundamentally linear, and expressing complex relationships with them will typically yield confusing code.
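
As a rough illustration of the multiple-inputs case, the sketch below combines two small data frames using named intermediate objects instead of a pipe. The data frames `scores` and `people` and their columns are made up for this example.

```r
# Two inputs of equal standing (made-up data)
scores <- data.frame(id = 1:3, score = c(90, 75, 60))
people <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Named intermediate results document intent and are easy to inspect
passed   <- subset(scores, score >= 70)
labelled <- merge(passed, people, by = "id")

labelled
```

Neither input is the primary object being transformed, so forcing one of them onto the left of a pipe would hide the other inside an argument list.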

Other tools from magrittr

When working with more complex pipes, it’s sometimes useful to call a function for its side-effects. Maybe you want to print out the current object, or plot it, or save it to disk. Many times, such functions don’t return anything, effectively terminating the pipe.

%T>%
%T>% works like %>% except that it returns the left-hand side instead of the right-hand side. It’s called “tee” because it’s like a literal T-shaped pipe.

library(magrittr)

rnorm(100) %>%
  matrix(ncol = 2) %>%
  plot() %>%
  str()
#  NULL

rnorm(100) %>%
  matrix(ncol = 2) %T>%
  plot() %>% 
  str()
# num [1:50, 1:2] -0.351 -1.751 0.666 0.516 -0.686 ...
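
The same behaviour can be checked without a graphics device. This sketch assumes cat() as the side-effect step, since it prints its input and returns NULL; the variable names are illustrative.

```r
library(magrittr)

# With %>%, sum() receives the NULL returned by cat()
no_tee <- c(1, 2, 3) %>%
  cat("\n") %>%
  sum()

# With %T>%, cat() still prints, but the original vector is passed on
with_tee <- c(1, 2, 3) %T>%
  cat("\n") %>%
  sum()

c(no_tee = no_tee, with_tee = with_tee)
```

Only the tee version carries the vector past the printing step, so its sum reflects the data rather than an empty input.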

%$%
It “explodes” out the variables in a data frame so that you can refer to them explicitly, which helps with functions such as cor() that do not take a data argument.

mtcars %$%
  cor(disp, mpg)
#> [1] -0.8475514

# with() exposes the variables in the same way
with(mtcars, cor(disp, mpg))
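
%$% can also close a pipeline that starts with %>%. The sketch below restricts the built-in mtcars data to four-cylinder cars before computing the correlation; the filter condition is an arbitrary choice for illustration.

```r
library(magrittr)

# Filter with %>%, then expose the columns to cor() with %$%
r_pipe <- datasets::mtcars %>%
  subset(cyl == 4) %$%
  cor(disp, mpg)

# The same computation written with with()
r_with <- with(subset(datasets::mtcars, cyl == 4), cor(disp, mpg))

identical(r_pipe, r_with)
```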

%<>%
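Pipes the left-hand side through the right-hand side and assigns the result back to it, so no explicit reassignment is needed.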

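# Ordinary pipe: the result must be assigned back explicitly
mtcars <- mtcars %>%
  transform(cyl = cyl * 2)

# Equivalent form with the assignment pipe
mtcars %<>% transform(cyl = cyl * 2)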
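
A smaller sketch of the same operator on a plain vector, which leaves the built-in mtcars data untouched; the vector `v` is an illustrative assumption.

```r
library(magrittr)

v <- c(4, 9, 16)

# Equivalent to: v <- v %>% sqrt()
v %<>% sqrt()

v
```

The assignment is easy to miss when reading, so the explicit `v <- v %>% ...` form may still be the clearer choice in shared code.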
