grep

February 2, 2024

grep me no patterns and I’ll tell you no lines. — fortune cooky

As explained in The Unix Programming Environment by Brian Kernighan and Rob Pike, sed can do just about everything grep can.

sed -n '/pattern/p' files

is the same as

grep -h pattern files
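One way to check this equivalence is to run both commands over the same files and compare the output. This is a minimal sketch: the files a.txt and b.txt and their contents are made up for illustration and live in a temporary directory.

```shell
# Create two throwaway files (hypothetical names and contents).
tmp=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmp/a.txt"
printf 'beta carotene\ngamma\n' > "$tmp/b.txt"

# Print matching lines with sed and with grep.
# -h stops grep from prefixing each line with its file name.
sed_out=$(sed -n '/beta/p' "$tmp/a.txt" "$tmp/b.txt")
grep_out=$(grep -h 'beta' "$tmp/a.txt" "$tmp/b.txt")

printf '%s\n' "$sed_out"
if [ "$sed_out" = "$grep_out" ]; then
  echo 'same output'
fi

rm -rf "$tmp"
```

Both commands print beta and beta carotene, in that order.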

Why do we have both sed and grep? After all, grep is just a simple special case of sed. Part of the reason is history — grep came well before sed. But grep survives, indeed thrives, because for the job that they both do, it is significantly easier to use than sed is.

Assuming I don’t want to use grep -h (i.e. --no-filename), grep has the advantage of showing which files contain a pattern: given several files, it prefixes each matching line with its file name. The -l, --files-with-matches flag lists only the names of the matching files.
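The difference between the default output and -l is easiest to see side by side. A small sketch, assuming three made-up files in a temporary directory:

```shell
# Hypothetical sample files; only one.txt and three.txt contain "needle".
tmp=$(mktemp -d)
printf 'needle\nhay\n' > "$tmp/one.txt"
printf 'hay\n' > "$tmp/two.txt"
printf 'a needle here\n' > "$tmp/three.txt"

# Default with several files: each match is prefixed with its file name.
(cd "$tmp" && grep 'needle' one.txt two.txt three.txt)

# -l prints the name of each matching file once and nothing else.
matching_files=$(cd "$tmp" && grep -l 'needle' one.txt two.txt three.txt)
printf '%s\n' "$matching_files"

rm -rf "$tmp"
```

The first command prints one.txt:needle and three.txt:a needle here; the second prints just one.txt and three.txt.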

grep seems to be a better choice than sed for predicates, since its exit status reports whether anything matched: 0 when a line matched, 1 when none did.
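As a rough illustration of grep as a predicate, the sketch below branches on grep -q, which prints nothing and only sets the exit status. The log file name and its contents are invented for the example.

```shell
# Hypothetical log file for the demonstration.
tmp=$(mktemp -d)
printf 'ERROR disk full\nINFO ok\n' > "$tmp/app.log"

# -q suppresses output; the if statement tests the exit status alone.
if grep -q 'ERROR' "$tmp/app.log"; then
  result='errors found'
else
  result='clean'
fi
printf '%s\n' "$result"

# grep exits with 1 when nothing matches.
grep_status=0
grep -q 'FATAL' "$tmp/app.log" || grep_status=$?

# sed exits with 0 here whether or not the pattern matched.
sed -n '/FATAL/p' "$tmp/app.log"
sed_status=$?

printf 'grep: %s, sed: %s\n' "$grep_status" "$sed_status"
rm -rf "$tmp"
```

With sed, the caller would have to inspect the output to learn whether anything matched.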

Outputs set difference: those lines of file1.txt that are not in file2.txt.

grep -Fxv -f file2.txt file1.txt

Outputs set intersection: those lines of file2.txt that are also in file1.txt.

grep -Fx -f file1.txt file2.txt
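A worked example may make the two set operations concrete. The file contents below are made up; the flags are the ones used above (-F for fixed strings, -x for whole-line matches, -v to invert, -f to read patterns from a file).

```shell
# Two small sample "sets", one element per line (hypothetical contents).
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/file1.txt"
printf 'banana\ncherry\ndate\n' > "$tmp/file2.txt"

# Difference: lines of file1.txt that match no whole line of file2.txt.
difference=$(grep -Fxv -f "$tmp/file2.txt" "$tmp/file1.txt")

# Intersection: lines of file2.txt that match some whole line of file1.txt.
intersection=$(grep -Fx -f "$tmp/file1.txt" "$tmp/file2.txt")

printf 'difference:\n%s\n' "$difference"
printf 'intersection:\n%s\n' "$intersection"

rm -rf "$tmp"
```

The difference is apple and the intersection is banana and cherry. The output keeps the order and any duplicates of the file being searched, so sort -u is worth adding when true set semantics matter.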

Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the false alarms by limiting the possible matches and you try to capture the omissions by expanding the possible matches. — sed & awk

Capturing substrings

https://www.baeldung.com/linux/grep-output-capturing-group-content

pcregrep
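grep -o prints the whole match rather than a single group. One workaround that avoids PCRE, sketched here under the assumption that the grep in use supports -o (GNU and BSD grep do), is to chain two grep -o calls: the first isolates the match with its context, the second trims it down to the part of interest. The input line is invented for the example.

```shell
# Hypothetical input line.
line='user=alice id=1042 status=ok'

# First pass keeps the full match "id=1042"; second pass keeps only the digits.
id=$(printf '%s\n' "$line" | grep -Eo 'id=[0-9]+' | grep -Eo '[0-9]+')
printf '%s\n' "$id"
```

This prints 1042.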

POSIX bracket expressions

In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.

Class	Matching Characters
[:alnum:]	Alphanumeric characters
[:alpha:]	Alphabetic characters
[:blank:]	Space and tab characters
[:cntrl:]	Control characters
[:digit:]	Numeric characters
[:graph:]	Printable and visible (non-space) characters
[:lower:]	Lowercase characters
[:print:]	Printable characters (includes whitespace)
[:punct:]	Punctuation characters
[:space:]	Whitespace characters
[:upper:]	Uppercase characters
[:xdigit:]	Hexadecimal digits
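The class names in the table are only valid inside a bracket expression, so they appear with doubled brackets, as in [[:digit:]]. A short sketch with made-up input, assuming a grep that supports -o:

```shell
# Hypothetical input text.
text='Room 42, Floor 7'

# One or more digits; the inner [:digit:] sits inside an outer [ ].
digits=$(printf '%s\n' "$text" | grep -o '[[:digit:]][[:digit:]]*')
printf '%s\n' "$digits"

# Lines that start with an uppercase letter.
capitalized=$(printf 'Hello\nworld\n' | grep '^[[:upper:]]')
printf '%s\n' "$capitalized"
```

The first command prints 42 and 7 on separate lines; the second prints Hello. Which non-ASCII characters a class matches depends on the current locale.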

#!/bin/bash

# json_dict_words $dictionary $text
function json_dict_words {
  arr=()
  while IFS= read -r line; do
    arr+=("$line")
  done <<< "$(grep -Fwio -f "${HOME}/share/dict/${1}" <<< "$2" | sort -u)"
  if [[ -n ${arr[*]} ]]; then
    jq -cn '$ARGS.positional' --args "${arr[@]}"
  fi
}

ExampleGroup 'Create a JSON array of phrases matching those in a dictionary'

  Example 'find a word in the text'
    When call json_dict_words 'actors' 'Appel @ Padstal'
    The output should equal '["Appel"]'
    The status should be success
    The error should be blank
  End

  Example 'find a phrase in the text'
    When call json_dict_words 'actors' 'n Aand saam Jakkie Louw (ten bate van SAACA) @ Die Centurion Teater'
    The output should equal '["Jakkie Louw"]'
    The status should be success
    The error should be blank
  End

  Example 'find two phrases in the text'
    When call json_dict_words 'actors' 'Emile Alexander & Dakes LIVE in Johannesburg at Gatzbys LIVE, Midrand - 10 February 2024'
    The output should equal '["Dakes","Emile Alexander"]'
    The status should be success
    The error should be blank
  End

  Example 'no matches'
    When call json_dict_words 'actors' 'Nothing in dictionary'
    The output should equal ''
    The status should be success
    The error should be blank
  End

End

As is common, the above uses IFS= read -r. Courtesy of the ShellCheck SC2162 check, I discovered that clearing the internal field separator with IFS= prevents read from stripping leading and trailing whitespace, which I actually usually want.
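The effect of IFS= on read can be seen by reading the same padded line twice, once with the separator cleared and once with the default. A minimal sketch; the input string is made up, and the brackets only make the whitespace visible.

```shell
# Hypothetical input with three spaces of padding on each side.
input='   padded value   '

# With IFS cleared, read keeps leading and trailing whitespace.
kept=$(printf '%s\n' "$input" | { IFS= read -r line; printf '[%s]' "$line"; })

# With the default IFS, read strips it.
stripped=$(printf '%s\n' "$input" | { read -r line; printf '[%s]' "$line"; })

printf '%s\n' "$kept"
printf '%s\n' "$stripped"
```

This prints [   padded value   ] followed by [padded value].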

Reversing this process by converting comma-separated strings back to “line-oriented” strings is covered in Indexed Arrays.

awk(1)
cmp(1)
diff(1)
find(1)
perl(1)
sed(1)
sort(1)
xargs(1)
read(2)
pcre2(3)
pcre2syntax(3)
pcre2pattern(3)
terminfo(5)
glob(7)
regex(7)

Back-references

grep -E '^(ab*)*\1$' <<< 'ababbabb'

produces ababbabb

grep -Eo '(ab)\1' <<< 'ababbab'

produces abab
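A back-reference can also tie the end of a line to its beginning. The sketch below selects words whose first and last characters are the same; it assumes a grep that accepts back-references with -E, as GNU grep does, and the word list is made up.

```shell
# (.) captures the first character; \1 requires the same character at the end.
words=$(printf 'level\ngrep\nstats\n' | grep -E '^(.).*\1$')
printf '%s\n' "$words"
```

This prints level and stats, and skips grep.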