Learning tidyverse - Strings (4)

DumplingLucky · Jianshu · 2021-05-25 21:32

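Now that you know the basics of regular expressions, it is time to apply them to real problems. The common tasks are:

Determine which strings match a pattern.
Find the positions of matches.
Extract the content of matches.
Replace matches with new values.
Split a string based on a match.

The previous part covered the first three tasks; this part covers the last two.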

1. Replacing matches

str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

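Instead of replacing with a fixed string, you can use backreferences to insert components of the match. The following code swaps the second and third words of each sentence.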

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
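As a further illustration of backreferences, the sketch below reorders the parts of a date. The `dates` vector is made up for this example and does not come from the original notes; each `\\N` in the replacement refers to the N-th parenthesised group of the pattern.

```r
library(stringr)

# Hypothetical input, invented for illustration
dates <- c("2021-05-25", "1999-12-31")

# Groups: (1) year, (2) month, (3) day; the replacement emits day/month/year
flipped <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
flipped
#> [1] "25/05/2021" "31/12/1999"
```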

2. Splitting

Use str_split() to split a string into pieces. For example, we can split sentences into words:

sentences %>%
  head(5) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

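Because each element may contain a different number of pieces, str_split() returns a list. If you are working with a length-1 vector, the easiest thing is to extract the first element of the list: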

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
#> [1] "a" "b" "c" "d"

Otherwise, like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""

You can also request a maximum number of pieces:

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"
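The effect of n is easier to see when the separator occurs more than once. In the small sketch below (the `fields2` vector is a made-up example), splitting stops once n pieces exist, so the last piece keeps the remaining text, including any further separators.

```r
library(stringr)

# Hypothetical fields; the first one contains the separator ": " twice
fields2 <- c("Title: Strings: part 4", "Age: 35")

# With n = 2, only the first ": " is used as a split point
pieces <- str_split(fields2, ": ", n = 2, simplify = TRUE)
pieces[1, 2]
#> [1] "Strings: part 4"
```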

Instead of splitting strings by pattern, you can also split by character, line, sentence, and word with boundary():

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))

This is a sentence. This is another sentence.


str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." ""          "This"     
#> [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"
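The same boundary() specifications can be used for counting rather than splitting. This is a minimal sketch, assuming str_count() accepts a boundary() pattern in the same way as str_split(); the variable names are chosen only for this example.

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Count pieces instead of returning them
n_words <- str_count(x, boundary("word"))
n_sentences <- str_count(x, boundary("sentence"))
c(words = n_words, sentences = n_sentences)
```

Counting by word boundary avoids the empty string that splitting on a single space produces when two spaces follow each other.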

3. Other types of pattern

When you use a pattern that is a string, it is automatically wrapped in a call to regex():

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

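You can use the other arguments of regex() to control details of the match:

ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale.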

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")

banana
Banana
BANANA

str_view(bananas, regex("banana", ignore_case = TRUE))
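Because str_view() renders its result in a viewer, the difference is not visible in plain text. A rough equivalent that prints to the console is to use str_detect() on the same vector:

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# Case-sensitive by default: only the all-lowercase element matches
strict <- str_detect(bananas, "banana")
strict
#> [1]  TRUE FALSE FALSE

# With ignore_case = TRUE every element matches
relaxed <- str_detect(bananas, regex("banana", ignore_case = TRUE))
relaxed
#> [1] TRUE TRUE TRUE
```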

multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
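The same applies to the $ anchor. As a small companion sketch on the same string, the pattern below extracts a digit that sits at the end of a line:

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# Without multiline, $ only matches at the end of the whole string
last_only <- str_extract_all(x, "\\d$")[[1]]
last_only
#> [1] "3"

# With multiline = TRUE, $ matches at the end of every line
each_line <- str_extract_all(x, regex("\\d$", multiline = TRUE))[[1]]
each_line
#> [1] "1" "2" "3"
```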

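comments = TRUE allows you to use comments and whitespace to make complex regular expressions easier to understand. Spaces are ignored, as is everything after #. To match a literal space, you need to escape it: "\\ ".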

phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
#>      [,1]          [,2]  [,3]  [,4] 
#> [1,] "514-791-814" "514" "791" "814"
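The escaping rule for spaces is easy to overlook, so here is a minimal sketch of it. The city name is an arbitrary example; the point is that an unescaped space in a comments = TRUE pattern is dropped, while "\\ " stays a literal space.

```r
library(stringr)

# Unescaped space is ignored, so this pattern is effectively "NewYork"
loose <- regex("New York", comments = TRUE)

# Escaped space is kept as a literal space
exact <- regex("New\\ York  # the escaped space must be present", comments = TRUE)

unescaped_hit <- str_detect("New York", loose)
escaped_hit <- str_detect("New York", exact)
c(unescaped = unescaped_hit, escaped = escaped_hit)
```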

dotall = TRUE allows . to match everything, including \n.
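A short sketch of the difference, using a two-line string invented for this example:

```r
library(stringr)

two_lines <- "a\nb"

# By default . does not match a newline, so "a.b" finds nothing
default_dot <- str_detect(two_lines, "a.b")
default_dot
#> [1] FALSE

# With dotall = TRUE the newline counts as "any character"
dotall_dot <- str_detect(two_lines, regex("a.b", dotall = TRUE))
dotall_dot
#> [1] TRUE
```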

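You can use three other functions in place of regex():

fixed(): matches exactly the specified sequence of bytes.
It ignores all special regular expression syntax and operates at a very low level. This lets you avoid complex escaping and can be much faster than regular expressions. The following microbenchmark shows that it is about 3x faster for a simple example.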

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)
#> Unit: microseconds
#>   expr     min       lq     mean   median       uq     max neval
#>  fixed 100.392 101.3465 118.7986 105.9055 108.8545 367.118    20
#>  regex 346.595 349.1145 353.7308 350.2785 351.4135 403.057    20

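Beware of using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": as a single character, or as an "a" plus a combining accent: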

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE

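They render identically, but because they are defined differently, fixed() does not find a match. Instead, you can use coll(), described next, to respect human character comparison rules: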

str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE
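Another option, not covered in the original notes, is to normalise both strings to the same Unicode form before comparing them byte by byte. The sketch below assumes the stringi package is available (stringr depends on it) and uses its stri_trans_nfc() function, which composes "a" plus a combining accent into the single-character form.

```r
library(stringr)

a1 <- "\u00e1"   # single precomposed character
a2 <- "a\u0301"  # "a" followed by a combining acute accent

# Normalise to NFC so both strings use the same representation
a2_nfc <- stringi::stri_trans_nfc(a2)

same_after_nfc <- a1 == a2_nfc
fixed_after_nfc <- str_detect(a1, fixed(a2_nfc))
c(equal = same_after_nfc, fixed_match = fixed_after_nfc)
```

This keeps the speed of fixed() but only handles representation differences; it does not apply locale-specific comparison rules the way coll() does.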

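coll(): compares strings using standard collation rules. This is useful for case-insensitive matching. Note that coll() takes a locale argument that controls which rules are used for comparing characters. Unfortunately, different parts of the world use different rules!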

# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"

str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
#> [1] "İ" "i"

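Both fixed() and regex() have an ignore_case argument, but they do not let you pick the locale: they always use the default locale. You can check what that is with the following code: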

stringi::stri_locale_info()
#> $Language
#> [1] "en"
#> 
#> $Country
#> [1] "US"
#> 
#> $Variant
#> [1] ""
#> 
#> $Name
#> [1] "en_US"

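The downside of coll() is speed: because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared to regex() and fixed().

As you saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: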

x <- "This is a sentence."
str_view_all(x, boundary("word"))

This is a sentence.

str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This"     "is"       "a"        "sentence"

4. Other uses of regular expressions

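Two useful functions in base R also use regular expressions:

apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of a function.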

apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
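Since the argument is a regular expression, anchors and escapes can narrow the search. The sketch below looks for names that start with "read.csv"; the exact result depends on which packages are attached in your session, so no output is shown.

```r
# "^" anchors the match at the start of the name and "\\." matches a literal dot
hits <- apropos("^read\\.csv")
hits
```

In a default R session this should include read.csv from the utils package.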

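dir() lists all the files in a directory. The pattern argument takes a regular expression and returns only the file names that match it. For example, you can find all the R Markdown files in the current directory with: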

head(dir(pattern = "\\.Rmd$"))
#> [1] "communicate-plots.Rmd" "communicate.Rmd"       "datetimes.Rmd"        
#> [4] "EDA.Rmd"               "explore.Rmd"           "factors.Rmd"
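To see why the escaped dot and the $ anchor matter, the following sketch builds a temporary directory with a few made-up file names and lists only the ones that truly end in ".Rmd". The directory and file names are hypothetical and exist only for this example.

```r
# Create a throwaway directory with three empty files
tmp <- tempfile("rmd-demo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("report.Rmd", "script.R", "report.Rmd.bak"))))

# "\\." is a literal dot; "$" requires the name to end right after "Rmd"
rmd_files <- dir(tmp, pattern = "\\.Rmd$")
rmd_files
#> [1] "report.Rmd"

# Clean up the temporary directory
unlink(tmp, recursive = TRUE)
```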

Reference: https://r4ds.had.co.nz/strings.html
