Advanced Regular Expressions

From ACSL Category Descriptions
Revision as of 10:22, 1 September 2020 by Mariana

Useful patterns

| Pattern | Description | REGEX | Sample match | Sample not match |
| --- | --- | --- | --- | --- |
| \d | Digit. Matches any digit. Equivalent to [0-9]. | \d\d\d | 123 | 1-3 |
| \D | Non-digit. Matches any character that is not a digit. | \d\D\d | 1-3 | 123 |
| \w | Word. Matches any alphanumeric character or underscore. Equivalent to [a-zA-Z0-9_]. | \w\w\w | a_A | a-A |
| \W | Not word. Matches any character that is not a word character (alphanumeric character or underscore). | \W\W\W | +-$ | +_@ |
| \s | Whitespace. Matches any whitespace character (space, tab, line break). | \d\s\w | 1 a | 1ab |
| \S | Not whitespace. Matches any character that is not a whitespace character (space, tab, line break). | \w\w\w\w\S\d | Test#1 | test 1 |
| \b | Word boundary. Matches the position between a word character and a non-word character, so it can be used to match a complete word. | \bis\b | is; | This, island: |
| {} | Curly braces {…}. Repeat the preceding character (or set of characters) as many times as the value inside the braces. {min,} means the preceding character is matched min times or more. {min,max} means the preceding character is repeated at least min and at most max times. | abc{2} | abcc | abc |
| .* | Matches any character (except line terminators) between zero and unlimited times. | .* | abbb, empty string | |
| .+ | Matches any character (except line terminators) between one and unlimited times. | .+ | a, abbcc | empty string |
| ^ | Anchor ^. The start of the line. Matches the position just before the first character of the string. | ^The\s\w+ | The contest | One contest |
| $ | Anchor $. The end of the line. Matches the position just after the last character of the string. | \d{4}\sACSL$ | 2020 ACSL | 2020 STAR |
| \ | Escape a special character. To use a metacharacter as a literal in a regex, escape it with a backslash, as in \. \* \+ \[ | \w\w\w\. | cat. | lion |
| () | Groups. Regular expressions can not only match text but also extract information for further processing. This is done by defining groups of characters and capturing them with parentheses (). | ^(file.+)\.docx$ | file_graphs.docx, file_lisp.docx | data.docx |
| \number | Backreference. Several symbols of a regular expression can be grouped together with parentheses to act as a single unit. \n matches again, at the current position, the text captured by the n-th group. | | | |
| \1 | Contents of Group 1. | r(\w)g\1x | regex (Group \1 is e) | regxx |
| \2 | Contents of Group 2. | (\d\d)\+(\d\d)=\2\+\1 | 20+21=21+20 (Group \1 is 20, Group \2 is 21) | 20+21=20+21 |
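
The sample columns of the table can be checked mechanically. The following is a minimal sketch using Python's standard re module; the names SAMPLES and check_samples are illustrative and not part of the original page. It uses re.search, which looks for the pattern anywhere in the string, so a "sample not match" entry is one in which the pattern cannot be found at all.

```python
import re

# (pattern, sample that should match, sample that should not match),
# copied from the rows of the table.
SAMPLES = [
    (r"\d\d\d", "123", "1-3"),
    (r"\d\D\d", "1-3", "123"),
    (r"\w\w\w", "a_A", "a-A"),
    (r"\W\W\W", "+-$", "+_@"),
    (r"\d\s\w", "1 a", "1ab"),
    (r"\w\w\w\w\S\d", "Test#1", "test 1"),
    (r"\bis\b", "is;", "This"),
    (r"abc{2}", "abcc", "abc"),
    (r".+", "abbcc", ""),
    (r"^The\s\w+", "The contest", "One contest"),
    (r"\d{4}\sACSL$", "2020 ACSL", "2020 STAR"),
    (r"\w\w\w\.", "cat.", "lion"),
    (r"^(file.+)\.docx$", "file_graphs.docx", "data.docx"),
    (r"r(\w)g\1x", "regex", "regxx"),
    (r"(\d\d)\+(\d\d)=\2\+\1", "20+21=21+20", "20+21=20+21"),
]


def check_samples(samples):
    """Return the patterns whose behaviour differs from the table."""
    failures = []
    for pattern, good, bad in samples:
        matches_good = re.search(pattern, good) is not None
        matches_bad = re.search(pattern, bad) is not None
        if not matches_good or matches_bad:
            failures.append(pattern)
    return failures


failures = check_samples(SAMPLES)
print(failures)  # an empty list means every row behaves as the table says

# Groups also let us extract the captured text, not just test for a match.
captured = re.search(r"r(\w)g\1x", "regex").group(1)
print(captured)  # e
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regex engine sees them.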

Sample Problems

Problem 1

Which of the following strings match the regular expression pattern "^w{3}\.([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])+\.)+[a-z0-9][-a-z0-9]{0,61}[a-z0-9]" ?

1. www.google.com
2. www.-petsmart.com
3. www.edu-.ro
4. www.google.co.in
5. www.examples.c.net
6. www.edu.training.computer-science.org
7. www.everglades_holidaypark.com

Solution: This regular expression matches a domain name used to access web sites.

The pattern starts with the subdomain www and continues with a sequence of domain names separated by dots (top-level domain (TLD), second-level domain (SLD), third-level domain, and so on).

A domain name contains only lowercase letters, digits and hyphens. The name can’t begin or end with a hyphen. The intended length of a name is at least 2 and at most 63 characters. (As written, the + after the inner group enforces the minimum of 2 but lets that group repeat, so the pattern itself does not cap the length at 63.)

^w{3}\. : the string starts with www, followed by a dot character;

[a-z0-9] : the first and the last character of a domain name can only be a lowercase letter or a digit;

[-a-z0-9]{0,61} : the characters in between can be lowercase letters, digits or hyphens, at most 61 of them per repetition;

The last sequence [a-z0-9][-a-z0-9]{0,61}[a-z0-9] is for the top-level domain, which is not followed by a dot.

The strings that match this pattern are 1, 4 and 6. String 2 has a name that begins with a hyphen, string 3 has a name that ends with a hyphen, string 5 contains the one-character name c, and string 7 contains an underscore.
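
The answer can be confirmed by running the pattern over the seven candidates. This is a minimal sketch in Python; PATTERN, CANDIDATES and matching_indices are illustrative names, not part of the original problem. Because the pattern has no trailing $, re.search only requires a prefix of the string to match; for these seven strings a whole-string check gives the same result.

```python
import re

# The pattern from Problem 1, split over two lines for readability.
PATTERN = re.compile(
    r"^w{3}\.([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])+\.)+"
    r"[a-z0-9][-a-z0-9]{0,61}[a-z0-9]"
)

CANDIDATES = [
    "www.google.com",
    "www.-petsmart.com",
    "www.edu-.ro",
    "www.google.co.in",
    "www.examples.c.net",
    "www.edu.training.computer-science.org",
    "www.everglades_holidaypark.com",
]


def matching_indices(pattern, candidates):
    """Return the 1-based positions of the candidates that the pattern matches."""
    return [
        index
        for index, text in enumerate(candidates, start=1)
        if pattern.search(text) is not None
    ]


matched = matching_indices(PATTERN, CANDIDATES)
print(matched)  # [1, 4, 6]
```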

Problem 2

Write a regular expression describing the set of strings formed according to the following rules:

1. Contain only lowercase letters of the English alphabet and the character '.';

2. Start and end with the same letter;

3. Contain a sequence of at least one and at most 3 vowels, separated by zero or more '.' characters from a sequence consisting of at least one consonant.

Solution:

The Regular Expression is:

([a-z])[aeiou]{1,3}\.*[b-df-hj-np-tv-z]+(\1)

([a-z]) is group number 1, which captures the first letter;

\1 is a backreference to group 1: the string must end with the same letter that group 1 captured;

[aeiou]{1,3} describes a sequence of one to three vowels (the vowels are listed without commas, because a comma inside a character class is matched literally);

\.* the character ‘.’ appears zero or more times;

[b-df-hj-np-tv-z]+ a sequence of consonants, at least one consonant.
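
The expression can be tried on a few strings. The sketch below is in Python and makes two assumptions that are not stated in the original solution: the vowel class is written [aeiou], without commas, and the whole string must match, which is what re.fullmatch checks. RULES, follows_rules and the test strings are illustrative.

```python
import re

# Assumption: vowels are written as [aeiou]; a class such as [a,e,i,o,u]
# would also accept a literal comma, which rule 1 does not allow.
RULES = re.compile(r"([a-z])[aeiou]{1,3}\.*[b-df-hj-np-tv-z]+(\1)")


def follows_rules(text):
    """Return True if the whole string has the shape described by the rules."""
    return RULES.fullmatch(text) is not None


results = {
    text: follows_rules(text)
    for text in [
        "tea..cht",  # t, vowels "ea", two dots, consonants "ch", t again
        "aei.ba",    # a, vowels "ei", one dot, consonant "b", a again
        "baacb",     # no dots at all is allowed
        "baacd",     # ends with a different letter
        "b.acb",     # the dot comes before the vowels
        "baaaacb",   # four vowels in a row
        "ba,cb",     # a comma is not a vowel
    ]
}
for text, ok in results.items():
    print(text, ok)
```

Only the first three strings are accepted. With the class [a,e,i,o,u] the last string, ba,cb, would be accepted as well.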