Frontier Software

grep

grep me no patterns and I’ll tell you no lines. -- fortune cooky

manual

wikibooks

As explained in The Unix Programming Environment by Brian Kernighan and Rob Pike, sed can do just about everything grep can.

sed -n '/pattern/p' files

is the same as

grep -h pattern files

Why do we have both sed and grep? After all, grep is just a simple special case of sed. Part of the reason is history: grep came well before sed. But grep survives, indeed thrives, because for the job that they both do, it is significantly easier to use than sed is.

Assuming I don’t want to use grep -h (i.e. --no-filename), grep has the advantage of showing which files contain a pattern. The -l (--files-with-matches) flag lists only the names of those files.
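A minimal sketch of the difference between the default file-name prefix and -l. The directory and the three sample files are hypothetical, created only for this illustration.

```shell
# Hypothetical sample files in a temporary directory.
dir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$dir/a.txt"
printf 'gamma\n' > "$dir/b.txt"
printf 'beta\ndelta\n' > "$dir/c.txt"

# Given several files, grep prefixes each matching line with its file name.
with_names=$(cd "$dir" && grep beta a.txt b.txt c.txt)
printf '%s\n' "$with_names"

# -l prints only the names of files that contain at least one match.
names_only=$(cd "$dir" && grep -l beta a.txt b.txt c.txt)
printf '%s\n' "$names_only"

rm -r "$dir"
```

The first command prints a.txt:beta and c.txt:beta; the second prints just a.txt and c.txt.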

grep seems to be a better choice than sed for predicates.
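What makes grep convenient as a predicate is its exit status: 0 when a line matched, 1 when none did. The -q flag suppresses output so only the status remains. The sketch below assumes a hypothetical helper named has_line and a throwaway sample file.

```shell
# has_line is a hypothetical helper: it succeeds when the file ($2)
# contains a line matching the basic regular expression ($1).
has_line() {
  grep -q -e "$1" "$2"
}

# Sample data, invented for this example.
file=$(mktemp)
printf 'root:x:0:0\nguest:x:1000:1000\n' > "$file"

if has_line '^root:' "$file"; then
  found_root=yes
else
  found_root=no
fi

if has_line '^nobody:' "$file"; then
  found_nobody=yes
else
  found_nobody=no
fi

printf 'root: %s, nobody: %s\n' "$found_root" "$found_nobody"
rm "$file"
```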

set difference: file1.txt - file2.txt

grep -Fxv -f file2.txt file1.txt

set intersection: the lines of file2.txt that are also in file1.txt

grep -Fx -f file1.txt file2.txt
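A small worked example of both commands, using invented file contents. -F treats each line of the pattern file as a fixed string, -x requires it to match a whole line, and -v inverts the selection.

```shell
# Hypothetical input files in a temporary directory.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1.txt"
printf 'banana\ncherry pie\ndate\n' > "$dir/file2.txt"

# Difference: lines of file1.txt that do not appear in file2.txt.
difference=$(grep -Fxv -f "$dir/file2.txt" "$dir/file1.txt")

# Intersection: lines of file2.txt that also appear in file1.txt.
intersection=$(grep -Fx -f "$dir/file1.txt" "$dir/file2.txt")

printf 'difference:\n%s\n' "$difference"
printf 'intersection:\n%s\n' "$intersection"
rm -r "$dir"
```

Here the difference is apple and cherry, and the intersection is banana. Without -x, "cherry pie" would count as containing "cherry" and both results would be wrong.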
[Figure: a pattern's matches and non-matches, divided into right and wrong results: hits, omissions, and false alarms]

Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the false alarms by limiting the possible matches and you try to capture the omissions by expanding the possible matches. -- sed & awk

Capturing substrings

https://www.baeldung.com/linux/grep-output-capturing-group-content

pcregrep
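Plain grep has no way to print just one capture group, so one portable workaround is to chain two grep -o calls: the first keeps the surrounding context, the second trims it away. This is only a sketch on an invented input line. Where PCRE support is available, grep -oP with \K or pcregrep -o1 can do the same in a single step.

```shell
# Hypothetical input line.
line='user=alice id=4821 shell=/bin/sh'

# Stage 1 keeps the whole "id=..." token; stage 2 keeps only its trailing digits.
id=$(printf '%s\n' "$line" | grep -o 'id=[0-9][0-9]*' | grep -o '[0-9][0-9]*$')

printf '%s\n' "$id"
```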

POSIX bracket expressions

In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.

Class Matching Characters
[:alnum:] Alphanumeric characters
[:alpha:] Alphabetic characters
[:blank:] Space and tab characters
[:cntrl:] Control characters
[:digit:] Numeric characters
[:graph:] Printable and visible (non-space) characters
[:lower:] Lowercase characters
[:print:] Printable characters (includes whitespace)
[:punct:] Punctuation characters
[:space:] Whitespace characters
[:upper:] Uppercase characters
[:xdigit:] Hexadecimal digits
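A class name is only valid inside a bracket expression, so in a pattern it appears with double brackets, for example [[:digit:]]. A brief sketch on made-up input:

```shell
# Count the lines that contain at least one digit.
digit_lines=$(printf 'abc\n123\na1b2\n' | grep -c '[[:digit:]]')

# Keep only the lines made up entirely of uppercase letters.
only_upper=$(printf 'ABC\nAbc\nabc\n' | grep -x '[[:upper:]]*')

# Classes can be combined with other characters inside one bracket expression.
hex_or_dash=$(printf 'dead-beef\nnot hex\n' | grep -x '[[:xdigit:]-]*')

printf '%s\n%s\n%s\n' "$digit_lines" "$only_upper" "$hex_or_dash"
```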
#!/bin/bash

# json_dict_words $dictionary $text
# Prints a JSON array of the dictionary's words and phrases found in $text.
function json_dict_words {
  arr=()
  while IFS= read -r line; do
    arr+=("$line")
  done <<< "$(grep -Fwio -f "${HOME}/share/dict/${1}" <<< "$2" | sort -u)"
  if [[ -n ${arr[*]} ]]; then
    jq -cn '$ARGS.positional' --args "${arr[@]}"
  fi
}

ExampleGroup 'Create a JSON array of phrases matching those in a dictionary'

  Example 'find a word in the text'
    When call json_dict_words 'actors' 'Appel @ Padstal'
    The output should equal '["Appel"]'
    The status should be success
    The error should be blank
  End

  Example 'find a phrase in the text'
    When call json_dict_words 'actors' 'n Aand saam Jakkie Louw (ten bate van SAACA) @ Die Centurion Teater'
    The output should equal '["Jakkie Louw"]'
    The status should be success
    The error should be blank
  End

  Example 'find two phrases in the text'
    When call json_dict_words 'actors' 'Emile Alexander & Dakes LIVE in Johannesburg at Gatzbys LIVE, Midrand - 10 February 2024'
    The output should equal '["Dakes","Emile Alexander"]'
    The status should be success
    The error should be blank
  End

  Example 'no matches'
    When call json_dict_words 'actors' 'Nothing in dictionary'
    The output should equal ''
    The status should be success
    The error should be blank
  End

End

As is common, the above uses IFS= read -r. Courtesy of ShellCheck's SC2162, I discovered that IFS= clears the internal field separator for that one command, which prevents read from stripping leading and trailing whitespace, even though that stripping is something I actually usually want.
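The effect of IFS= is easiest to see side by side. A minimal sketch, with an invented input string:

```shell
input='   padded value   '

# With the default IFS, read trims leading and trailing whitespace.
read -r trimmed <<EOF
$input
EOF

# With IFS empty for this one command, the line is kept exactly as written.
IFS= read -r verbatim <<EOF
$input
EOF

printf '[%s]\n[%s]\n' "$trimmed" "$verbatim"
```

The first line printed is [padded value]; the second keeps all three spaces on each side.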

Reversing this process by converting comma-separated strings back to “line-oriented” strings is covered in Indexed Arrays.

Back-references

grep -E '^(ab*)*\1$' <<< 'ababbabb'

produces ababbabb

grep -Eo '(ab)\1' <<< 'ababbab'

produces abab
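A common practical use of back-references is spotting accidentally doubled words. The sketch below assumes GNU grep, since both back-references under -E and the \b word-boundary operator are extensions rather than POSIX features.

```shell
# \1 must repeat exactly what the first group matched;
# \b stops the match from starting or ending in the middle of a word.
doubled=$(printf 'this is is a test\n' | grep -Eo '\b([[:alpha:]]+) \1\b')

printf '%s\n' "$doubled"
```

The "is" inside "this" is not reported, because \b requires the match to begin at a word boundary.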