Learning tidyverse - Strings (3)

DumplingLucky · Jianshu · 2021-05-22 08:17

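Now that you know the basics of regular expressions, it is time to apply them to real problems. The tasks are:

  • Determine which strings match a pattern.
  • Find the positions of matches.
  • Extract the content of matches.
  • Replace matches with new values.
  • Split a string based on a match.

This installment covers detecting matches, extracting matches, and grouped matches.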
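
As a rough map of the five tasks onto stringr, the sketch below runs one function per task on a small made-up vector. The vector `x` and the patterns are illustrative assumptions, not taken from the original post.

```r
library(stringr)

# Hypothetical sample data for illustration only
x <- c("apple pie", "banana split", "pear tart")

# 1. Which strings match the pattern?
str_detect(x, "an")
#> [1] FALSE  TRUE FALSE

# 2. Where is the first match? Returns a matrix with start and end columns
str_locate(x, "p")

# 3. What text matched? Returns the first match per string
str_extract(x, "[aeiou]")
#> [1] "a" "a" "e"

# 4. Replace the first match with a new value
str_replace(x, " ", "_")
#> [1] "apple_pie"    "banana_split" "pear_tart"

# 5. Split each string at the match; returns a list of character vectors
str_split(x, " ")
```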

Tools

1. Detect matches

To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

# Loads stringr (including the words and sentences data), dplyr, and tibble
library(tidyverse)

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

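Remember that when a logical vector is used in a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful for answering questions about matches across a larger vector: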

# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.2765306

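When you have complex logical conditions (e.g. match a or b, but not c unless d), it is often easier to combine multiple str_detect() calls with logical operators than to try to create a single regular expression. For example, here are two ways to find all words that contain no vowels: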

# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE

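The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.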
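
The advice about naming the pieces can be sketched as follows. The sample vector and the three conditions are made up for illustration; the point is only that each condition stays a short, readable str_detect() call.

```r
library(stringr)

# Hypothetical sample data for illustration only
x <- c("apple", "banana", "cherry", "orange", "sky")

# Name each simple condition separately
starts_with_vowel <- str_detect(x, "^[aeiou]")
ends_with_y       <- str_detect(x, "y$")
has_double_letter <- str_detect(x, "(.)\\1")  # any character repeated twice in a row

# Combine with logical operators:
# (starts with a vowel OR ends in y) AND has no doubled letter
keep <- (starts_with_vowel | ends_with_y) & !has_double_letter
x[keep]
#> [1] "orange" "sky"
```

Writing the same rule as one regular expression is possible, but it would likely need lookaheads and be much harder to read and debug.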

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame, and you will use filter() instead:

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.991837

str_count() works naturally with mutate():

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
#> # A tibble: 980 x 4
#>   word         i vowels consonants
#>   <chr>    <int>  <int>      <int>
#> 1 a            1      1          0
#> 2 able         2      2          2
#> 3 about        3      3          2
#> 4 absolute     4      4          4
#> 5 accept       5      2          4
#> 6 account      6      3          4
#> # … with 974 more rows

Note that matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:

str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

aba b aba
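
If overlapping occurrences are what you actually want to count, one possible workaround for a fixed (literal) substring is to slide a window over the string and compare each slice. The helper `count_overlapping()` below is hypothetical, written for this illustration, and is not part of stringr.

```r
library(stringr)

x <- "abababa"
pattern <- "aba"

# Non-overlapping count, as reported by str_count()
str_count(x, pattern)
#> [1] 2

# Hypothetical helper: compare every slice of the right length
# against a literal substring (no regex metacharacters involved)
count_overlapping <- function(string, literal) {
  n <- str_length(literal)
  starts <- seq_len(max(str_length(string) - n + 1, 0))
  sum(str_sub(string, starts, starts + n - 1) == literal)
}

count_overlapping(x, pattern)
#> [1] 3
```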

2. Extract matches

To extract the actual text of a match, use str_extract(). To show it off, we need a more complicated example. The example sentences are provided in stringr::sentences:

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

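Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression: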

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. We can see that most easily by first selecting all the sentences that have more than one match:

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)

It is hard to erase blue or red ink.
The green light in the brown box flickered.
The sky in the west is tinged with orange red .

str_extract(more, colour_match)
#> [1] "blue"   "green"  "orange"

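This is a common pattern for stringr functions, because working with a single match allows much simpler data structures. To get all matches, use str_extract_all(). It returns a list: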

str_extract_all(more, colour_match)
#> [[1]]
#> [1] "blue" "red" 
#> 
#> [[2]]
#> [1] "green" "red"  
#> 
#> [[3]]
#> [1] "orange" "red"
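
Note that the second sentence yields "red" only because "flickered" ends in those letters. One way to avoid such partial-word hits, sketched here under the assumption that only whole words should count, is to wrap the alternation in word boundaries (\b). The three sentences are copied from the output above so the example is self-contained; the name `colour_words` is introduced just for this sketch.

```r
library(stringr)

more <- c(
  "It is hard to erase blue or red ink.",
  "The green light in the brown box flickered.",
  "The sky in the west is tinged with orange red."
)
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

# Original pattern: "red" also matches inside "flickered"
lengths(str_extract_all(more, colour_match))
#> [1] 2 2 2

# Whole-word version: \\b anchors the alternation at word boundaries
colour_words <- str_c("\\b(", colour_match, ")\\b")
str_extract_all(more, colour_words)[[2]]
#> [1] "green"
lengths(str_extract_all(more, colour_words))
#> [1] 2 1 2
```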

If you use simplify = TRUE, str_extract_all() returns a matrix in which short matches are padded to the same length as the longest:

str_extract_all(more, colour_match, simplify = TRUE)
#>      [,1]     [,2] 
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""  
#> [2,] "a"  "b"  ""  
#> [3,] "a"  "b"  "c"

3. Grouped matches

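Imagine we want to extract nouns from the sentences. As a heuristic, we look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that is not a space.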

noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

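str_extract() gives us the complete match; str_match() gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group: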

has_noun %>% 
  str_match(noun)
#>       [,1]         [,2]  [,3]     
#>  [1,] "the smooth" "the" "smooth" 
#>  [2,] "the sheet"  "the" "sheet"  
#>  [3,] "the depth"  "the" "depth"  
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked" 
#>  [6,] "the sun"    "the" "sun"    
#>  [7,] "the huge"   "the" "huge"   
#>  [8,] "the ball"   "the" "ball"   
#>  [9,] "the woman"  "the" "woman"  
#> [10,] "a helps"    "a"   "helps"
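
Because the result is an ordinary character matrix, individual groups can be pulled out by column. The sketch below uses three of the sentences shown earlier so it runs on its own; the column names are labels chosen for this illustration and are not set by str_match() itself.

```r
library(stringr)

x <- c(
  "The birch canoe slid on the smooth planks.",
  "These days a chicken leg is a rare dish.",
  "Rice is often served in round bowls."
)
noun <- "(a|the) ([^ ]+)"

m <- str_match(x, noun)

# Label the columns for readability (names are an assumption of this sketch)
colnames(m) <- c("match", "article", "noun")

# Second capture group only; strings without a match give NA
m[, "noun"]
#> [1] "smooth"  "chicken" NA
```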

If your data is in a tibble, it is often easier to use tidyr::extract(). It works like str_match(), but requires you to name the matches, which are then placed in new columns:

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
#> # A tibble: 720 x 3
#>   sentence                                    article noun   
#>   <chr>                                       <chr>   <chr>  
#> 1 The birch canoe slid on the smooth planks.  the     smooth 
#> 2 Glue the sheet to the dark blue background. the     sheet  
#> 3 It's easy to tell the depth of a well.      the     depth  
#> 4 These days a chicken leg is a rare dish.    a       chicken
#> 5 Rice is often served in round bowls.        <NA>    <NA>   
#> 6 The juice of lemons makes fine punch.       <NA>    <NA>   
#> # … with 714 more rows

Reference: https://r4ds.had.co.nz/strings.html



