grep
grep me no patterns and I’ll tell you no lines. - fortune cooky
As explained in The Unix Programming Environment by Brian Kernighan and Rob Pike, sed can do just about everything grep can.
sed -n '/pattern/p' files
is the same as
grep -h pattern files
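A quick way to convince yourself of the equivalence is to run both commands over the same input and compare the results. This is only a sketch: the scratch directory, file names and file contents are made up for illustration.

```shell
# Build two small sample files in a scratch directory (hypothetical content).
dir=$(mktemp -d)
printf 'a pattern here\nnothing\n' > "$dir/one.txt"
printf 'another pattern\n' > "$dir/two.txt"

# sed -n prints only the lines selected by /pattern/p;
# grep -h prints matching lines without the file name prefix.
sed_out=$(sed -n '/pattern/p' "$dir/one.txt" "$dir/two.txt")
grep_out=$(grep -h pattern "$dir/one.txt" "$dir/two.txt")

# Report whether the two outputs are identical.
if [ "$sed_out" = "$grep_out" ]; then
    echo "same output"
fi

rm -rf "$dir"
```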
Why do we have both sed and grep? After all, grep is just a simple special case of sed. Part of the reason is history: grep came well before sed. But grep survives, indeed thrives, because for the job that they both do, it is significantly easier to use than sed is.
Assuming I don’t want to use grep -h (i.e. --no-filename), grep has the advantage of listing the files that contain a pattern. The -l, --files-with-matches flag lists these files.
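As a small illustration of -l, the sketch below searches three throwaway files for a word. The file names and contents are assumptions invented for the example.

```shell
# Three sample files; only a.txt and c.txt contain the word "needle".
dir=$(mktemp -d)
printf 'needle\n' > "$dir/a.txt"
printf 'hay\n' > "$dir/b.txt"
printf 'hay\nneedle\nneedle\n' > "$dir/c.txt"

# -l prints each matching file name once, however many lines match.
matches=$(cd "$dir" && grep -l needle a.txt b.txt c.txt)
printf '%s\n' "$matches"

rm -rf "$dir"
```

Here c.txt is listed only once even though two of its lines match.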
grep seems to be a better choice than sed for predicates.
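One way to read "predicate" here is grep -q, which prints nothing and answers through its exit status. The helper name has_pattern and the sample file are hypothetical; this is a sketch rather than a recommended wrapper.

```shell
# has_pattern PATTERN FILE: succeeds (exit status 0) if FILE has a matching line.
# -q suppresses output; -- stops a pattern starting with "-" being read as an option.
has_pattern() {
    grep -q -- "$1" "$2"
}

tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"

if has_pattern 'beta' "$tmp"; then found=yes; else found=no; fi
if has_pattern 'gamma' "$tmp"; then missing=yes; else missing=no; fi
echo "beta: $found, gamma: $missing"

rm -f "$tmp"
```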
Outputs set difference: those lines of file1.txt that are not in file2.txt.
grep -Fxv -f file2.txt file1.txt
Outputs set intersection: those lines of file1.txt that are also in file2.txt.
grep -Fx -f file1.txt file2.txt
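To see both set operations on concrete data, here is a sketch with two tiny files. The fruit names are invented, and the files live in a scratch directory so nothing real is touched.

```shell
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1.txt"
printf 'banana\ndate\n' > "$dir/file2.txt"

# -F: patterns are fixed strings, -x: a pattern must match the whole line,
# -f: read the patterns from a file, one per line.
# -v inverts the match, keeping lines of file1.txt that are absent from file2.txt.
difference=$(grep -Fxv -f "$dir/file2.txt" "$dir/file1.txt")

# Without -v, only the lines present in both files remain.
intersection=$(grep -Fx -f "$dir/file1.txt" "$dir/file2.txt")

printf 'difference:\n%s\n' "$difference"
printf 'intersection:\n%s\n' "$intersection"

rm -rf "$dir"
```

Note that the output keeps the order and any duplicate lines of the file being searched, so these are set operations only if each file has unique lines.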
Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the false alarms by limiting the possible matches and you try to capture the omissions by expanding the possible matches. - sed & awk
Capturing substrings
https://www.baeldung.com/linux/grep-output-capturing-group-content
pcregrep
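One approach that avoids pcregrep is grep -o combined with the PCRE escape \K, which discards everything matched so far from the reported match. The sketch assumes GNU grep built with PCRE support (the -P option); the key=value input is invented for the example.

```shell
# Print only the value that follows "user=" on each line.
# -o prints just the matched text, -P enables Perl-compatible regular expressions,
# and \K resets the start of the reported match, so "user=" is not printed.
users=$(printf 'user=alice id=42\nuser=bob id=7\n' | grep -oP 'user=\K\w+')
printf '%s\n' "$users"
```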
POSIX bracket expressions
In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.
Class | Matching Characters |
---|---|
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:blank:] | Space and tab characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Printable and visible (non-space) characters |
[:lower:] | Lowercase characters |
[:print:] | Printable characters (includes whitespace) |
[:punct:] | Punctuation characters |
[:space:] | Whitespace characters |
[:upper:] | Uppercase characters |
[:xdigit:] | Hexadecimal digits |
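The class names in the table are only valid inside a bracket expression, so in a pattern they appear with a second pair of brackets, as in [[:digit:]]. A minimal sketch with made-up input:

```shell
# Extract each run of digits; -o prints every match on its own line.
numbers=$(printf 'room 42, floor 7\n' | grep -Eo '[[:digit:]]+')
printf '%s\n' "$numbers"

# Keep only the lines that are one uppercase letter followed by lowercase letters.
names=$(printf 'alice\nBob\nCAROL\n' | grep -E '^[[:upper:]][[:lower:]]+$')
printf '%s\n' "$names"
```

Which characters count as members of [[:upper:]], [[:lower:]] or [[:alpha:]] depends on the current locale, which is the point of the classes.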
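#!/bin/bash
# json_dict_words $dictionary $text
# Prints a JSON array of the dictionary words and phrases found in $text.
function json_dict_words {
  local arr=()
  local line
  # -F fixed strings, -w whole words, -i ignore case, -o print only the matches
  while IFS= read -r line; do
    arr+=("$line")
  done <<< "$(grep -Fwio -f "${HOME}/share/dict/${1}" <<< "$2" | sort -u)"
  # With no matches the here-string yields one empty element, so print nothing
  if [[ -n ${arr[*]} ]]; then
    jq -cn '$ARGS.positional' --args "${arr[@]}"
  fi
}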
ExampleGroup 'Create a JSON array of phrases matching those in a dictionary'
  Example 'find a word in the text'
    When call json_dict_words 'actors' 'Appel @ Padstal'
    The output should equal '["Appel"]'
    The status should be success
    The error should be blank
  End
  Example 'find a phrase in the text'
    When call json_dict_words 'actors' 'n Aand saam Jakkie Louw (ten bate van SAACA) @ Die Centurion Teater'
    The output should equal '["Jakkie Louw"]'
    The status should be success
    The error should be blank
  End
  Example 'find two phrases in the text'
    When call json_dict_words 'actors' 'Emile Alexander & Dakes LIVE in Johannesburg at Gatzbys LIVE, Midrand - 10 February 2024'
    The output should equal '["Dakes","Emile Alexander"]'
    The status should be success
    The error should be blank
  End
  Example 'no matches'
    When call json_dict_words 'actors' 'Nothing in dictionary'
    The output should equal ''
    The status should be success
    The error should be blank
  End
End
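The grep stage of json_dict_words can be tried on its own, without jq or a dictionary under ${HOME}/share/dict. In this sketch the dictionary is a temporary file holding the names that appear in the examples; the real dictionary file is assumed to have the same one-entry-per-line layout.

```shell
# A throwaway dictionary with one word or phrase per line.
dict=$(mktemp)
printf 'Appel\nJakkie Louw\nDakes\nEmile Alexander\n' > "$dict"

text='Emile Alexander & Dakes LIVE in Johannesburg'

# -F: entries are fixed strings, so multi-word phrases match literally
# -w: matches must fall on word boundaries
# -i: ignore case
# -o: print only the matched text, one match per line
found=$(printf '%s\n' "$text" | grep -Fwio -f "$dict" | sort -u)
printf '%s\n' "$found"

rm -f "$dict"
```

The sort -u step is why "Dakes" comes before "Emile Alexander" even though it appears later in the text.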
As is common, json_dict_words uses IFS= read -r. Courtesy of SC2162 I discovered that IFS= clears the internal field separator, which prevents read from stripping leading and trailing whitespace, stripping that I actually usually want.
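The effect of the empty IFS is easy to see by reading the same padded line both ways. The variable names below are made up for the demonstration; the brackets in the output only make the whitespace visible.

```shell
# Default IFS: read strips leading and trailing whitespace from the line.
stripped=$(printf '  padded  \n' | { read -r line; printf '[%s]' "$line"; })

# Empty IFS for this one command: the line is stored exactly as it was read.
kept=$(printf '  padded  \n' | { IFS= read -r line; printf '[%s]' "$line"; })

printf '%s\n%s\n' "$stripped" "$kept"
```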
Reversing this process by converting comma-separated strings back to “line-oriented” strings is covered in Indexed Arrays.
Related man pages
- awk(1)
- cmp(1)
- diff(1)
- find(1)
- perl(1)
- sed(1)
- sort(1)
- xargs(1)
- read(2)
- pcre2(3)
- pcre2syntax(3)
- pcre2pattern(3)
- terminfo(5)
- glob(7)
- regex(7)
Back-references
grep -E '^(ab*)*\1$' <<< 'ababbabb'
produces ababbabb
grep -Eo '(ab)\1' <<< 'ababbab'
produces abab
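A back-reference can also act as a line filter. The sketch below keeps only the lines that consist of some string written twice. It assumes GNU grep, where back-references in -E patterns are an extension; the word list is invented.

```shell
# \1 must repeat exactly what the first group matched,
# so a line passes only if its second half equals its first half.
doubled=$(printf '%s\n' 'abab' 'abba' 'bonbon' 'banana' | grep -E '^(.+)\1$')
printf '%s\n' "$doubled"
```

"abba" and "banana" are rejected because no split of either line gives two identical halves.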