Regular expressions

Below is an overview of the most important elements of regular expressions (regexes), illustrated with example patterns, followed by general advice on how to build your own. It should help you understand why particular fragments look the way they do and how they work "under the hood."

Basics of regular expression syntax

Special characters (metacharacters) – e.g.:

  • ^ – start of the string or line (depending on mode),

  • $ – end of the string or line,

  • . – any character (usually except newline),

  • \b – word boundary,

  • \d – matches digits ([0-9]),

  • \s – matches whitespace characters (space, tab, newline, etc.),

  • \w – matches "word-forming" characters (letters, digits, underscore),

  • [...] – character class, e.g. [0-9] means any digit in the range 0–9,

  • (...) – grouping or so-called capturing group,

  • (?:...) – non-capturing group,

  • | – alternation (logical “or”),

  • *, +, ?, {n}, {n,}, {n,m} – quantifiers (describe how many times a fragment should repeat).
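
The behaviour of a few of these metacharacters can be sketched in Python with the standard re module. The language choice and the sample text are assumptions made for illustration only; other regex engines may differ in details.

```python
import re

text = "id: 42\nname: Anna_K"  # made-up sample with two lines

# ^ anchors to the start of the whole string by default...
first_word = re.findall(r"^\w+", text)
# ...and to the start of every line when re.MULTILINE is set.
line_words = re.findall(r"^\w+", text, re.MULTILINE)

# \d, \s and \w are shorthand character classes; (...) captures.
numbers = re.findall(r"\d+", text)
values = re.findall(r":\s(\w+)$", text, re.MULTILINE)

# | is alternation; (?:...) groups without capturing.
keys = re.findall(r"(?:id|name)", text)

print(first_word)  # ['id']
print(line_words)  # ['id', 'name']
print(numbers)     # ['42']
print(values)      # ['42', 'Anna_K']
print(keys)        # ['id', 'name']
```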

Quantifiers:

  • * – 0 or more repetitions,

  • + – 1 or more repetitions,

  • ? – 0 or 1 repetition,

  • {n} – exactly n repetitions,

  • {n,} – at least n repetitions,

  • {n,m} – from n to m repetitions.
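
To see the quantifiers side by side, the following sketch checks which strings of repeated "a" each one accepts in full. It uses Python's re.fullmatch; the helper name matching and the sample list are hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def matching(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"a*"))      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(matching(r"a+"))      # ['a', 'aa', 'aaa', 'aaaa']
print(matching(r"a?"))      # ['', 'a']
print(matching(r"a{2}"))    # ['aa']
print(matching(r"a{2,}"))   # ['aa', 'aaa', 'aaaa']
print(matching(r"a{2,3}"))  # ['aa', 'aaa']
```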

Word boundaries (\b)

\b is not a character but a zero-width assertion: it consumes nothing and matches only at a position where a word character (letter, digit, underscore) is adjacent to a non-word character (space, punctuation, etc.) or to the start/end of the string.

  • \b\d{2}-\d{3}\b means: look for exactly two digits, a hyphen, and three digits, and require a word boundary at both the start and the end of this sequence.
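
The effect of \b is easiest to see by running the same pattern with and without it. A minimal Python sketch on a made-up sample string:

```python
import re

text = "Code 01-234, order 123-4567, ref A50-001"

with_boundaries = re.findall(r"\b\d{2}-\d{3}\b", text)
without_boundaries = re.findall(r"\d{2}-\d{3}", text)

print(with_boundaries)     # ['01-234']
print(without_boundaries)  # ['01-234', '23-456', '50-001']
```

Without the boundaries, the pattern also picks fragments out of longer sequences such as 123-4567 and A50-001.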

Discussion of example regular expressions

PESEL pattern

  • \b at the start and at the end – ensures that the matching digits are extracted as a whole word (in other words – there are no further word characters before or after this sequence).

  • [0-9]{11} – we search for a sequence of exactly 11 digits (from 0 to 9).

Application: Will find a sequence like 98012345678, but will not match something like 12345678901abc (because immediately after the digits there is a letter, so there is no word boundary).
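
Putting the two fragments together gives \b[0-9]{11}\b. The sketch below assumes that this is the full pattern (it is assembled from the description above) and uses a made-up sample text. It checks only the shape of the number, not whether it is a valid PESEL.

```python
import re

# Assumed full pattern, assembled from the fragments described above.
PESEL_RE = re.compile(r"\b[0-9]{11}\b")

text = "PESEL: 98012345678, broken: 12345678901abc, too long: 123456789012"
found = PESEL_RE.findall(text)

print(found)  # ['98012345678']
```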

Email address

  • [a-zA-Z0-9._%+-]+ – one or more (+) letters (lowercase and uppercase), digits, dots, underscores, percent signs, plus signs, or hyphens. In short, the characters allowed in most email addresses in the part before “@”.

  • @ – literally the “at” sign.

  • [a-zA-Z0-9.-]+ – one or more letters (lower/upper), digits, dots or hyphens in the domain part (e.g. gmail.com, moj-serwer.pl etc.).

  • \. – the dot character (escaped with a backslash so that "." does not mean any character).

  • [a-zA-Z]{2,} – at least two letters as the domain suffix, e.g. .pl, .com, .info.

Application: Will detect email addresses in a typical format (local part, “@”, domain, suffix). However, it does not implement the detailed rules (for example, it rejects national characters) and does not verify whether the domain actually exists.
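
Joined in order, the fragments form the pattern used in the sketch below. Whether the original pattern also includes word boundaries is not stated, so none are assumed here; the addresses are invented examples.

```python
import re

# Assumed full pattern: the five fragments above concatenated in order.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

text = "Write to jan.kowalski@example.com or admin@moj-serwer.pl, not to user@localhost."
found = EMAIL_RE.findall(text)

print(found)  # ['jan.kowalski@example.com', 'admin@moj-serwer.pl']
```

The last address is skipped because nothing matching \.[a-zA-Z]{2,} follows the domain name.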

Phone number (e.g., 123-456-789)

  • \b – word boundary.

  • \d{3} – three digits.

  • [-.\s]? – hyphen, dot or space, which may occur 0 or 1 time. This allows formats such as 123 456 789, 123-456-789, or 123.456.789.

  • Again \d{3}[-.\s]? – the next three digits and an optional separator.

  • \d{3} – the final three digits.

  • \b – end of word.

Application: Will find a phone number written as three 3-digit blocks. Because each separator is optional (?), the pattern accepts a space, hyphen, or dot between blocks, or no separator at all.
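
A short Python sketch of the pattern assembled from the fragments above, run on a made-up sentence:

```python
import re

# Assumed full pattern, built from the fragments listed above.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{3}\b")

text = "Call 123-456-789, 123 456 789, 123.456.789 or 123456789; not 1234-567-890."
found = PHONE_RE.findall(text)

print(found)  # ['123-456-789', '123 456 789', '123.456.789', '123456789']
```

Each separator is matched independently, so a mixed form such as 123-456.789 is accepted as well.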

IPv4 address

You may also encounter the notation:

  • The construction 25[0-5] allows matching numbers in the range 250–255,

  • 2[0-4]\d – numbers 200–249,

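  • [0-1]?\d?\d – numbers 0–199 (simplified: forms with leading zeros, such as 007, are accepted too).

  • ( ... )\. – a group holding the three alternatives above, followed by a dot and repeated {3} times, matches the first three octets together with their dots; a fourth octet without a trailing dot follows, so the whole pattern describes four octets separated by dots.

  • \b – word boundary at the start and end, so that the pattern does not match a fragment in the middle of a longer run of digits or letters.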

Application: Such an expression checks whether we have a valid IPv4 address like 192.168.1.1 or 255.255.255.255, and not e.g. 256.300.1.999.
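
One way to assemble these fragments into a complete pattern is sketched below. The exact grouping is an assumption: non-capturing groups are used here so that findall returns whole addresses rather than the last captured octet.

```python
import re

# Hypothetical helper name; the three alternatives come from the list above.
OCTET = r"(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")

text = "ok: 192.168.1.1 and 255.255.255.255, bad: 256.300.1.999"
found = IPV4_RE.findall(text)

print(found)  # ['192.168.1.1', '255.255.255.255']
```

Because a dot is a non-word character, \b alone does not reject longer dotted sequences: in 1.2.3.4.5 this sketch still finds 1.2.3.4.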

Credit card number

  • \b – word boundary.

  • (?: ...) – non-capturing group. It does not store the matched text for later reference; it only groups the fragment so that the quantifier applies to it as a whole.

  • \d[ -]*? – one digit followed by any number of spaces or hyphens (the class [ -] contains only these two characters). The *? form is a lazy quantifier: it matches as few separators as possible. This lets the pattern skip separators inside card numbers.

  • {13,16} – the group above must repeat 13 to 16 times, so the match contains 13 to 16 digits.

  • \b – again a word boundary.

Application: Allows finding a digit sequence of length 13–16, even if written with hyphens or spaces, e.g. 1234-5678-1234-5678, 1234 5678 1234 9999 or 1234567890123456.
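
The sketch below assembles the pattern from the fragments above and strips the separators from each match. The helper card_digits and the sample list are hypothetical; no checksum validation is attempted.

```python
import re

# Assumed full pattern, built from the fragments described above.
CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

def card_digits(text):
    """Return the digits of the first card-like match, or None."""
    m = CARD_RE.search(text)
    return re.sub(r"[ -]", "", m.group()) if m else None

samples = [
    "1234-5678-1234-5678",
    "1234 5678 1234 9999",
    "1234567890123456",
    "1234-5678",  # only 8 digits, so no match
]
for s in samples:
    print(repr(s), "->", card_digits(s))
```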

Polish postal code in format (XX-XXX)

  • \b – word boundary.

  • \d{2} – two digits.

  • - – literally a hyphen.

  • \d{3} – three digits.

  • \b – end of word.

Application: Will find a Polish postal code such as 01-234 or 50-001. It checks only the format; it does not verify that the postal code actually exists.
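
As a variation, the two digit blocks can be wrapped in capturing groups to retrieve them separately. The parentheses are an addition for illustration and are not part of the pattern described above; the addresses are invented.

```python
import re

# Same pattern as above, with capturing groups added around each block.
POSTAL_RE = re.compile(r"\b(\d{2})-(\d{3})\b")

text = "ul. Prosta 1, 01-234 Warszawa; ul. Rynek 5, 50-001 Wrocław"
parts = POSTAL_RE.findall(text)

print(parts)  # [('01', '234'), ('50', '001')]
```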

How to create your own regular expressions?

  1. Define the exact pattern you want to capture. Consider what rules the character sequence must satisfy (are they all digits or letters, what is the allowed length, are there separators, etc.).

  2. Choose appropriate character classes. If you need only digits, use \d or [0-9]. When you want letters, consider whether only lowercase/uppercase are involved or also Polish diacritical characters.

  3. Apply quantifiers (how many times a given pattern repeats?).

    • Precisely specify the minimum and maximum number of repetitions using {n,m}.

    • If a separator is optional, apply ?.

    • If a character can repeat many times, consider * or +.

  4. Use boundaries (e.g. ^, $, \b). To avoid matches inside other text, check whether you need a word boundary (\b) or start/end-of-string anchors (^ ... $).

  5. Test step by step. You can use online tools (e.g., regex101.com) to see how your expression matches (or does not match) specific examples.
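
The five steps can be sketched in Python as follows. The target format (two uppercase letters, an optional hyphen, then 4 to 6 digits) is purely hypothetical and chosen only to exercise each step.

```python
import re

# Step 1: hypothetical format, e.g. "AB-1234" or "AB123456".
letters = r"[A-Z]{2}"    # step 2: character class, exact count
separator = r"-?"        # step 3: optional separator
digits = r"\d{4,6}"      # step 3: bounded repetition
pattern = re.compile(r"\b" + letters + separator + digits + r"\b")  # step 4

# Step 5: test against examples that should and should not match.
cases = {
    "AB-1234": True,
    "AB123456": True,
    "AB-123": False,      # too few digits
    "ABC-1234": False,    # no word boundary before "BC"
    "AB-1234567": False,  # too many digits
}
results = {s: bool(pattern.search(s)) for s in cases}
for s, expected in cases.items():
    status = "ok" if results[s] == expected else "MISMATCH"
    print(f"{s!r}: matched={results[s]} ({status})")
```

Keeping such a table of positive and negative cases next to the pattern makes it easy to see what a later change breaks.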

Summary

Regular expressions allow you to define, in a very precise (though sometimes somewhat complex) way, what you want to "capture" from text. Each of the above examples:

  • Uses character classes ([0-9], [a-zA-Z], \d, etc.).

  • Defines the number of repetitions via quantifiers (+, {n}, {n,m}).

  • Applies word boundaries (\b) to match whole entities (like a full PESEL or a full postal code), not fragments of a longer sequence.

Once you understand the operation of basic elements (character classes, quantifiers, groups, boundaries), creating and adapting regular expressions to your needs becomes much easier.

If you are just starting, I recommend testing each of these expressions "live" in some online regex tester by entering sample data and observing which fragments of text are "highlighted." This way you will quickly see how a small change in the regex affects matches.
