Regex Basics


Regex Basics starts with the symbols and basic syntax used to find and manipulate text. Use character sets and repetition expressions to create flexible matching patterns. The Regex engine parses text to find matches and uses greedy and lazy strategies. Capture text as the engine finds matches and refer back to it later using backreferences. Discover lookaround assertions to create complex matching patterns. Complex Regex can search for almost any text pattern, such as email addresses, phone numbers, URLs, prices, zip codes, etc. Build step by step and explore some of the common and useful Regular Expressions. Write Regular Expressions for pattern matching, text manipulation, and parsing data. In JavaScript, a regular expression is an object. Regex comes in different flavors depending on the programming language or the application. PCRE stands for Perl Compatible Regular Expressions. Perl is credited with popularizing the usage of RegEx. PCRE is also used in PHP.

Read more about RegEx at the following URLs:

  • <RegEx Syntax> https://support.google.com/a/answer/1371415?hl=en
  • <RegEx Match> https://support.google.com/docs/answer/3098292?hl=en

Regex Basics – Tools

RegExRX (Kem Tekinay) for Mac and RegexBuddy for Windows are paid tools. One of the free options for writing regular expressions is RegEx101.com. In RegEx101, select the PCRE (PHP) flavor. To spread a search pattern over multiple lines, press the Return key. When the pattern spans several lines, be sure to start it with the inline modifier (?x), which turns on free-spacing mode.

fig1-Regex Basics~(?x) the inline modifier in the search

Literal matches are patterns that match the text literally, character for character.

fig2-Regex Basics~Literal match

For example, type the text 55555 in the source text. Then type 55 in the search pattern. It matches the first 55. The engine searches from left to right, and characters that have already matched are consumed, so the next search starts after the first match. It finds another 55, giving two separate matches. The single 5 left over does not match. Likewise, the pattern 5555 matches the characters 5555 literally (case sensitive) and leaves one 5 unmatched.
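The left-to-right, consuming behavior can be reproduced outside RegEx101. The following is a minimal sketch that assumes Python's built-in re module rather than the PCRE flavor used in the article. The variable names are illustrative only.

```python
import re

source = "55555"

# findall scans left to right; each match consumes its characters,
# so the second search starts after the first "55".
matches_55 = re.findall(r"55", source)      # two matches, one 5 left over
matches_5555 = re.findall(r"5555", source)  # one match, one 5 left over

print(matches_55)
print(matches_5555)
```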

Writing Regular Expressions in Google Analytics

In Google Analytics, RegEx is used to find specific patterns in lists that apply to:

  • goals,
  • filter/ view,
  • audiences,
  • channel groupings,
  • segments,
  • content groups, for finding URLs that match a particular description, e.g. (1) all pages within a subdirectory, or (2) pages with a query string more than ten characters long.

For example, the /category/digital-marketing/search-engine-optimization/ subdirectory contains four category pages:

  • /local-seo/
  • /on-page-and-off-page-seo/
  • /what-is-link-building-in-seo/
  • /seo-fundamentals/

These category pages are important. They correlate unique views against traffic sources and destination pages. For more information, see 7steps-to-getting-started-with-google-analytics.
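As a rough illustration of matching all pages within a subdirectory, here is a small sketch in Python's re module. Google Analytics uses its own regex engine, so treat this only as an approximation. The full paths are assumptions: the category pages are presumed to sit directly under the subdirectory named above.

```python
import re

# Hypothetical full paths: the category pages are assumed to live
# directly under the search-engine-optimization subdirectory.
base = "/category/digital-marketing/search-engine-optimization/"

# ^ anchors the start of the path; .+ requires at least one more
# character after the subdirectory, i.e. a page inside it.
subdirectory = re.compile(
    r"^/category/digital-marketing/search-engine-optimization/.+"
)

paths = [
    base + "local-seo/",
    base + "seo-fundamentals/",
    "/category/digital-marketing/",  # parent directory, should not match
]

matched = [p for p in paths if subdirectory.search(p)]
print(matched)
```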

fig3-Regex Basics~Google Analytics~Goal>Destination

Regex Application for filters in Google Analytics:

A common filter in Google Analytics excludes traffic from your own IP address(es). For a range of IPs, you can set up the exclusion with a Google Analytics regex as shown below:

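  • e.g. ^73\.234\.191\.[1-9]$ would exclude all IP addresses from 73.234.191.1 to 73.234.191.9. Without the ^ and $ anchors, the pattern can also match longer addresses such as 73.234.191.10.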

For more information, see https://support.google.com/analytics/answer/1034324?hl=en.
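The effect of anchoring an IP pattern can be checked with a short sketch. It assumes Python's re module as a stand-in for the Google Analytics engine, and the list of test addresses is made up for illustration.

```python
import re

unanchored = re.compile(r"73\.234\.191\.[1-9]")
anchored = re.compile(r"^73\.234\.191\.[1-9]$")

# Illustrative addresses only.
ips = ["73.234.191.1", "73.234.191.9", "73.234.191.10", "73.234.191.0"]

# search() finds the pattern anywhere in the string, so the unanchored
# version also hits 73.234.191.10 (it matches the leading 73.234.191.1).
excluded_unanchored = [ip for ip in ips if unanchored.search(ip)]
excluded_anchored = [ip for ip in ips if anchored.search(ip)]

print(excluded_unanchored)
print(excluded_anchored)
```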

fig4-RegEx Characters

Character Classes

fig5-Regex Basics~Character Class

In the search pattern, type [A-Z0-9a-z]. It will match every character of the source text CBAPQRSabcdefghMNOPPONMGHIJijklmnopqrstuvwxyz287640.

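  • For example, in the above source text [a-z] will match only the lower-case letters (it matches upper case as well only when the case-insensitive modifier is on). The negation [^a-z] matches everything that is not a lower-case letter, here the upper-case letters and the numbers.
  • A ^ placed outside the brackets, as in ^[^a-z], has a different meaning. It is referred to as an anchor and matches at the start of the text. To match a dash, put it first or last in the class: [-a-z] or [a-z-]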
fig6-Regex Basics~Character Class@
  • [@] will match the @ in the Source Text #@&^!(@)%&(&@#%
  • [^@^] is the negation of [@^]: it matches any character except @ and ^
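The character-class rules above can be tried out with a few lines of code. This is a sketch assuming Python's re module, using the article's two source texts plus one made-up string ("spider-web") for the dash case.

```python
import re

source = "CBAPQRSabcdefghMNOPPONMGHIJijklmnopqrstuvwxyz287640"
symbols = "#@&^!(@)%&(&@#%"

# [a-z] is case sensitive: only runs of lower-case letters match.
lower_runs = re.findall(r"[a-z]+", source)

# [^a-z] is the negation: upper-case letters and digits match here.
non_lower_runs = re.findall(r"[^a-z]+", source)

# [@] matches each literal @; [^@^] matches everything except @ and ^.
at_signs = re.findall(r"[@]", symbols)
not_at_or_caret = re.findall(r"[^@^]", symbols)

# A dash placed first (or last) in the class is a literal dash.
with_dash = re.findall(r"[-a-z]+", "spider-web")

print(lower_runs, non_lower_runs, at_signs, with_dash)
```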

ASCII is a standard character encoding for text files on computers and the internet.

If we select the range that starts at the space character and ends at the tilde by typing [\ -~] in the search pattern, it will match every printable ASCII character, and therefore all the characters in the source text #@&^!(@)%&(&@#%abcdefABCDEF12340
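A quick way to confirm that the space-to-tilde range covers the whole source text is sketched below, assuming Python's re module. The tab example is an added illustration of a character outside the range.

```python
import re

# Space (ASCII 32) through tilde (ASCII 126): the printable ASCII range.
printable = re.compile(r"[\ -~]+")

source = "#@&^!(@)%&(&@#%abcdefABCDEF12340"

# fullmatch succeeds only if every character is inside the range.
whole_text = printable.fullmatch(source)

# A tab is a control character (ASCII 9) and falls outside the range.
with_tab = printable.fullmatch("a\tb")

print(whole_text is not None, with_tab is None)
```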

fig7-Regex Basics~Alternation

Alternations

Type apple|mango in the search pattern and apple mango banana in the source text. The | (pipe) is referred to as the alternation. Both apple and mango in the source text get matched.

The alternation first tries to match what is on the left of the pipe. If that fails, it tries to match the next alternative, which is mango. Going through the source text: at the first word, does apple match? Yes. At the second word, does apple match? No. Does mango match? Yes. At the third word, does apple or mango match banana? No. In an alternation, you can keep adding as many alternatives as you want.
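The alternation walk-through can be reproduced with a short sketch, assuming Python's re module. The second pattern (app|apple) is an added illustration of the left-first rule and is not from the article.

```python
import re

source = "apple mango banana"

# At each position the engine tries "apple" first, then "mango".
fruits = re.findall(r"apple|mango", source)

# Alternatives are tried left to right, so the first one that
# succeeds wins even when a longer alternative would also match.
left_first = re.findall(r"app|apple", "apple")

print(fruits)
print(left_first)
```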

Metacharacters

Characters in RegEx can be either

  • Regular character with a literal meaning
  • Metacharacter with a special meaning.

Literal Characters

A single literal character, such as ‘a’, matches the first occurrence of that character in the string ‘mango is a fruit’. Any letter (A to Z, a to z), digit (0 to 9), or keyboard character such as the tilde ~ or the exclamation mark ! can be used as a single-character pattern. The special characters below cannot be used this way without escaping, because they have a special meaning in RegEx.

Special Characters

Metacharacters are the characters with special meaning.

They are

  • the opening curly brace {,
  • pipe symbol |,
  • the plus sign +,
  • the asterisk *,
  • the caret ^,
  • the dollar sign $,
  • the opening parenthesis (,
  • the closing parenthesis ),
  • the question mark ?,
  • the dot asterisk .* (a combination of two metacharacters),
  • the opening square bracket [,
  • the dash - (special only inside a character class),
  • the backslash \,
  • the period or dot .

A multiple-character RegEx can be created with

  • A mixture of letters, digits, and other keyboard characters
  • All letters, all digits, all special keyboard characters.

When a RegEx is written as a string in source code, the compiler processes escape characters such as the backslash before the RegEx library sees the string. You need to know which characters get special treatment inside strings, and this depends on the programming language used.
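As one concrete case of string processing happening before the regex library runs, here is a sketch in Python (an assumption, since the article does not name a language). In Python, a raw string (r"...") leaves backslashes untouched, and re.escape() turns metacharacters into literals.

```python
import re

# In a normal Python string literal, "\b" is a backspace character,
# so the regex library never sees the word-boundary escape.
plain = "\bweb\b"
raw = r"\bweb\b"

found_plain = re.findall(plain, "spider web")
found_raw = re.findall(raw, "spider web")

# re.escape() backslash-escapes metacharacters such as $ and .
# so that they match literally.
literal_price = re.escape("$5.00")
found_price = re.findall(literal_price, "Price: $5.00")

print(found_plain, found_raw, found_price)
```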

fig8-Regex~Quantifier

Quantifier & Iteration

With a greedy quantifier, the Regex engine matches as many occurrences of the pattern as possible. A lazy quantifier, in contrast, stops as soon as it has matched the minimum number of occurrences that lets the overall pattern succeed.

+ Quantifier

+ says 1 or more. W+ matches both the W’s in the Source Text ‘WWorld19’ as a single match: if it finds 1 W it matches it, if it finds 2 W’s it matches both, if it finds 100 W’s it matches them all, and so on.

? Quantifier

? says 0 or 1. W? matches at most one W at a time, so the two W’s in the Source Text ‘WWorld19’ are not treated as one match of WW. They are matched as two separate single-character matches. Because zero W’s is also allowed, the pattern also produces empty matches at the positions where there is no W.

* Quantifier

* says 0 or more (an unlimited amount). W* matches zero W’s, WW, or all of WWWW in the Source Text ‘WWWWorld19’.
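The three quantifiers, and the greedy versus lazy distinction, can be compared side by side. This is a minimal sketch assuming Python's re module. Empty matches are filtered out where they would clutter the output.

```python
import re

source = "WWorld19"

# + : one or more, greedy, so both W's form a single match.
plus = re.findall(r"W+", source)

# ? : zero or one, so each W is its own match; the empty strings
# are the zero-width matches at positions without a W.
optional = [m for m in re.findall(r"W?", source) if m]

# * : zero or more; keep only the non-empty match.
star = [m for m in re.findall(r"W*", "WWWWorld19") if m]

# Greedy takes as many W's as possible, lazy (+?) as few as possible.
greedy = re.search(r"W+", source).group()
lazy = re.search(r"W+?", source).group()

print(plus, optional, star, greedy, lazy)
```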

fig9-+Quantifier with Ranges

Quantifier with Ranges

  • [a-z]+ matches ‘orld’ and ‘i’ in the Source Text ‘WWorld19iN’
  • [a-zA-Z]+ matches all lower and upper case letters, that is ‘WWorld’ and ‘iN’ in the Source Text ‘WWorld19iN’
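A short sketch, assuming Python's re module, shows what each range picks up. It also shows why the upper-case range should be written A-Z: in ASCII, the range A-z (lower-case z) also spans a few punctuation characters that sit between Z and a.

```python
import re

source = "WWorld19iN"

# Lower-case only: the capital letters and digits break the runs.
lower = re.findall(r"[a-z]+", source)

# Both cases: the digits are the only separators.
letters = re.findall(r"[a-zA-Z]+", source)

# A-z covers ASCII 65 to 122, which includes [ \ ] ^ _ and `.
a_to_small_z = re.findall(r"[A-z]", "_^")

print(lower, letters, a_to_small_z)
```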
fig10-Iteration

Iteration works just like a quantifier, but it matches a specific number of times. An iteration uses curly braces.

  • 5{4} matches the Source Text 5555.
  • 5{2,4}, a range of 2 to 4 times, matches the Source Text 5555.
  • 5{2,4}? is non-greedy (lazy) because of the trailing ?, so it matches in sets of two in the Source Text 555555. We get three different matches: 55, 55 and 55.
  • 5{2,4} is greedy by default. In the Source Text 555555 it matches 5 four times, and for the next match it grabs the remaining two.
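The greedy and lazy forms of the curly-brace iteration can be compared directly. This is a sketch assuming Python's re module, using the source texts from the list above.

```python
import re

source = "555555"

# Exactly four 5's.
exact = re.findall(r"5{4}", "5555")

# Greedy: take up to four 5's first, then the remaining two.
greedy = re.findall(r"5{2,4}", source)

# Lazy: stop at the minimum of two 5's each time.
lazy = re.findall(r"5{2,4}?", source)

print(exact, greedy, lazy)
```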

Iteration also works with character classes:

  • \w matches each character of the Source Text ‘World’ separately: W, o, r, l, d.
  • \w{5} matches Source Text ‘World’.
  • \w{2} will grab the first two characters ‘Wo’ and then the next two characters ‘rl’.
  • \w{6} will not match, as the Source Text is a 5 letter word.
  • \w{3} will grab the first three characters ‘Wor’.
  • [a-zA-Z0-9_]{2} will grab the 2 characters ‘Wo’ and ‘rl’ from the Source Text ‘World’.
  • [a-zA-Z0-9_]{1,2} will produce 3 matches, ‘Wo’, ‘rl’ and ‘d’, from the Source Text ‘World’.
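The list above can be checked in one go. This is a minimal sketch assuming Python's re module, where \w on ASCII text behaves like [a-zA-Z0-9_].

```python
import re

source = "World"

single = re.findall(r"\w", source)        # one character per match
pairs = re.findall(r"\w{2}", source)      # the trailing "d" is left over
triples = re.findall(r"\w{3}", source)    # "ld" is too short for a second match
too_long = re.findall(r"\w{6}", source)   # the word has only five characters
one_or_two = re.findall(r"[a-zA-Z0-9_]{1,2}", source)

print(single, pairs, triples, too_long, one_or_two)
```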
fig11-Capture 10 digit number

Capture Groups & Non Capture Groups

In a capture group, the matched character sequence is captured. Parentheses group part of the regex so that a quantifier can be applied to the whole group. The part of the string matched by the regex inside the parentheses is stored as a numbered capture group, which can be re-used later with a numbered backreference.

  • Group 0 – 950-784-7659,
  • Group 1 – 784,
  • Group 2 – 7659
  • \d matches 9,5,0,7,8,4,7,6,5,9
  • \d{3} matches 950, 784 and 765
  • \d{3}[-.)]\d matches 950-7
  • \d{3}[-.)]\d{3} matches 950-784
  • \d{3}[-.)]\d{3}[-.] matches 950-784-
  • \d{3}[-.)]\d{3}[-.]\d matches 950-784-7
  • \d{3}[-.)]\d{3}[-.]\d{4} matches 950-784-7659
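To show how the numbered groups come about, here is a sketch assuming Python's re module. The placement of the two pairs of parentheses is an assumption, chosen so that the pattern reproduces Group 1 and Group 2 as listed above. The masked output format is purely illustrative.

```python
import re

phone = "950-784-7659"

# Assumed grouping: parentheses around the middle three digits
# and the last four digits.
pattern = r"\d{3}[-.)](\d{3})[-.](\d{4})"

m = re.search(pattern, phone)
group0 = m.group(0)  # the whole match
group1 = m.group(1)  # first pair of parentheses
group2 = m.group(2)  # second pair of parentheses

# In a replacement string, \1 and \2 refer back to the captured text.
masked = re.sub(pattern, r"XXX-\1-\2", phone)

print(group0, group1, group2, masked)
```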
fig12-Non Capture Group

A non-capture group does not store its match. Use (?:...) to create a non-capture group. It can be used when we need the grouping, for example to apply a quantifier, but do not need the group to capture its match.
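The difference between the two kinds of group shows up in what gets stored. The following sketch assumes Python's re module. The phone number and patterns are illustrative choices.

```python
import re

phone = "950-784-7659"

# Two capturing groups: both parts are stored.
capturing = re.search(r"(\d{3})[-.)](\d{3})", phone).groups()

# The first group is non-capturing, so only the second part is stored.
non_capturing = re.search(r"(?:\d{3})[-.)](\d{3})", phone).groups()

# A non-capture group still groups: here {2} repeats "three digits
# plus a separator" twice, and nothing is stored.
repeated = re.fullmatch(r"(?:\d{3}[-.]){2}\d{4}", phone)

print(capturing, non_capturing, repeated.groups())
```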

fig13-Lookaround~Lookahead and Lookbehind

Look Around

There are two types of lookaround: lookahead and lookbehind. These are zero-length assertions, which means they check whether characters match at the current position but give up that match immediately. They consume no characters and only return the result ‘match’ or ‘no match’.

fig14-Lookaround~Negative Lookahead
  • Positive Lookahead: sweet(?=\ apple) in the search pattern will match ‘sweet’ but not ‘apple’ for ‘sweet apple’.
  • Negative Lookahead: sweet(?!\ apple) in the search pattern will match only ‘sweet’, and not ‘mango’, ‘watermelon’ or ‘peach’, for ‘sweet mango’, ‘sweet watermelon’ and ‘sweet peach’. It will not match ‘sweet apple’.
  • Positive Lookbehind: (?<=sweet\ )apple in the search pattern will match ‘apple’ but not ‘sweet’ for ‘sweet apple’.
  • Negative Lookbehind: (?<!sweet\ )apple in the search pattern will match only ‘apple’, and not the word before it, for ‘red apple’, ‘green apple’ and ‘custard apple’. It will not match the ‘apple’ in ‘sweet apple’.
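The four lookaround forms can be told apart by where they match. This is a sketch assuming Python's re module. The sample text is made up, and a plain space is used in place of the escaped space, which behaves the same outside free-spacing mode.

```python
import re

text = "sweet apple, sweet mango, red apple"


def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


pos_ahead = starts(r"sweet(?= apple)")    # 'sweet' followed by ' apple'
neg_ahead = starts(r"sweet(?! apple)")    # 'sweet' not followed by ' apple'
pos_behind = starts(r"(?<=sweet )apple")  # 'apple' preceded by 'sweet '
neg_behind = starts(r"(?<!sweet )apple")  # 'apple' not preceded by 'sweet '

print(pos_ahead, neg_ahead, pos_behind, neg_behind)
```

The start positions show that each assertion only tests its surroundings. The matched text itself is always just ‘sweet’ or ‘apple’.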
fig15-Word Boundary

Word Boundary

A word boundary is the position between a word character and a character that is not a word character (or the start or end of the text). The non-word character can be a dash as in ‘spider-web’, a space as in ‘spider web’, a tab, etc. Numbers in Regex are considered word characters. A word boundary is a zero-length assertion.

Type the word ‘web’ in the search pattern. It will be found in all its forms in the source text, whether it stands alone or is part of a longer word.

  • web\b matches ‘web’ where it is directly followed by a word boundary, for example at the end of ‘spiderweb’ and in the whole word ‘web’.
  • \bweb matches ‘web’ where it comes directly after a word boundary, that is in ‘webspider’, the whole word ‘web’, ‘spider-web’ and ‘cob-web’.
  • \bweb\w+ matches ‘webspider’, where ‘web’ starts a longer word. It does not match the stand-alone word ‘web’, because \w+ needs at least one more word character.
  • \w+web\b matches ‘spiderweb’, where ‘web’ is preceded by a word character and ends the word.
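The four boundary patterns can be compared on a small sample. This is a sketch assuming Python's re module. The sample text is an assumption modelled on the words mentioned in the list.

```python
import re

text = "web spiderweb webspider spider-web"

# 'web' followed by a boundary: web, spiderweb, spider-web.
ends_with = re.findall(r"web\b", text)

# 'web' preceded by a boundary: web, webspider, spider-web.
starts_with = re.findall(r"\bweb", text)

# 'web' at the start of a longer word only.
prefix_word = re.findall(r"\bweb\w+", text)

# 'web' at the end of a longer word only (the dash is not a word character).
suffix_word = re.findall(r"\w+web\b", text)

print(len(ends_with), len(starts_with), prefix_word, suffix_word)
```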

Anchor

Take two strings, ‘site’ and ‘sitemap’. To match ‘site’ only when it stands on its own, and not when it is part of a longer text, use the anchors ^ (start) and $ (end).

  • site – matches both ‘site’ and ‘sitemap’ in the source text
  • ^site$ – matches ‘site’ only if it is on its own and does not match the word ‘sitemap’
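A minimal sketch of the anchored and unanchored patterns, assuming Python's re module and treating each string as a separate source text:

```python
import re

strings = ["site", "sitemap"]

# Unanchored: 'site' is found inside both strings.
unanchored = [s for s in strings if re.search(r"site", s)]

# Anchored: ^ and $ force the match to span the whole string.
anchored = [s for s in strings if re.search(r"^site$", s)]

print(unanchored)
print(anchored)
```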
fig16-'s' modifier

Modifiers

The common regex modifiers are (?misx)

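  • (?m)\w+$ -> modifier ‘m’ (multiline) lets $ match at the end of every line, so the pattern matches the last word of each string, that is fun and thrilling
  • (?m)^\w+ -> modifier ‘m’ lets ^ match at the start of every line, so the pattern matches the first word of each string, that is Hunting and SEEING
  • (?i)[a-z]+ -> modifier ‘i’ is the case insensitive modifier that matches all upper and lower case text in the string.
  • (?s).+ -> modifier ‘s’ lets the dot match line breaks as well, so the pattern matches both the strings as one match
  • (?x)\w+ -> modifier ‘x’ is the free-spacing mode that ignores whitespace inside the pattern. The pattern still matches all the words in the two strings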
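The modifiers can be tried one at a time. This is a sketch assuming Python's re module, which supports the same inline (?m), (?i), (?s) and (?x) forms at the start of a pattern. The two-line source text is hypothetical, built around the words quoted in the list, since the original text is only shown in the figure.

```python
import re

# Hypothetical two-line source text.
text = "Hunting is fun\nSEEING is thrilling"

last_words = re.findall(r"(?m)\w+$", text)    # $ matches at each line end
first_words = re.findall(r"(?m)^\w+", text)   # ^ matches at each line start
any_case = re.findall(r"(?i)[a-z]+", text)    # case-insensitive
dot_all = re.findall(r"(?s).+", text)         # dot also matches "\n"
per_line = re.findall(r".+", text)            # without s: one match per line

# Free-spacing mode: whitespace and # comments in the pattern are ignored.
verbose_words = re.findall(r"""(?x)
    \w+   # one or more word characters
""", text)

print(last_words, first_words, len(dot_all), len(per_line))
```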

Build logical patterns using Regex. These patterns identify strings of text that fit the pattern. Most programming languages support Regex. Typical uses are identifying files on a computer that end with a given extension, validating an email address entered in an online form, and performing redirects for URLs recognized by a Regex pattern.