Regular Expressions for Beginners: A Practical Starting Guide
Regular expressions — regex for short — look intimidating at first glance. A pattern like ^\+?[1-9]\d{1,14}$ appears to be random punctuation. But regex follows consistent rules, and once you learn them, you can read and write these patterns confidently. This guide covers the essential building blocks with real-world examples.
What Is a Regular Expression?
A regular expression is a pattern that describes a set of strings. You use regex to search for text that matches the pattern, extract matched portions, validate that input conforms to a format, or replace matched text with something else. Virtually every programming language supports regex, and most text editors do too.
The power of regex comes from its conciseness. A single pattern like \d{3}-\d{2}-\d{4} describes what a US Social Security number looks like — three digits, a hyphen, two digits, a hyphen, four digits. Without regex, you would need dozens of lines of character-by-character parsing code to do the same thing.
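As a rough illustration of the four uses named above (search, extract, validate, replace), here is a minimal sketch using Python's built-in re module. The sample text and variable names are made up for the example.

```python
import re

# The SSN-shaped pattern from the text. A raw string (r"...") keeps the
# backslashes intact so the regex engine sees \d rather than Python
# interpreting the escape.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")

text = "Records: 123-45-6789 and 987-65-4321."  # made-up sample data

first = SSN.search(text)                              # search: first match anywhere
found = SSN.findall(text)                             # extract: every match
is_valid = SSN.fullmatch("123-45-6789") is not None   # validate: whole string must match
masked = SSN.sub("XXX-XX-XXXX", text)                 # replace: substitute each match

print(first.group())
print(found)
print(is_valid)
print(masked)
```

Note that the pattern only checks the shape of the text; it says nothing about whether a number was ever issued.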
Literal Characters
The simplest regex is a literal character sequence. The pattern hello matches the exact string "hello" wherever it appears in the input. Most letters, digits, and spaces are literal in regex — they match themselves.
However, certain characters have special meanings and must be escaped with a backslash if you want to match them literally. The special characters are: . * + ? ^ $ { } [ ] | ( ) \. To match a literal period, you write \.. To match a literal dollar sign, you write \$.
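The effect of escaping is easy to see in a short experiment. The sketch below uses Python's re module; the sample strings are invented, and re.escape is shown as one way to build a literal-matching pattern from arbitrary text.

```python
import re

price = "Total: $4.50 (approx.)"  # made-up sample text

# Unescaped, "." means "any character", so the pattern also matches "4x50".
loose = re.search(r"4.50", "4x50")

# Escaped, "\." matches only a literal period.
strict = re.search(r"4\.50", "4x50")

# "\$" matches a literal dollar sign instead of acting as an anchor.
dollar = re.search(r"\$\d+\.\d{2}", price)

# re.escape adds the backslashes for you when the text is meant literally.
literal = re.escape("(approx.)")
literal_hit = re.search(literal, price)

print(loose is not None, strict is None)
print(dollar.group())
print(literal_hit.group())
```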
Character Classes: [abc]
A character class, written inside square brackets, matches any single character from the set you specify. [aeiou] matches any vowel. [0-9] matches any digit (the dash creates a range). [a-zA-Z] matches any letter, upper or lowercase.
You can also negate a character class with a caret at the start: [^aeiou] matches any character that is NOT a vowel. [^0-9] matches any non-digit character.
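A quick way to check how a character class behaves is to list every character it matches in a sample string. This is a small sketch in Python; the sample inputs are arbitrary.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")       # each vowel, in order
non_vowels = re.findall(r"[^aeiou]", "regular")  # everything that is not a vowel
digits = re.findall(r"[0-9]", "room 42b")        # range 0 through 9
letters = re.findall(r"[a-zA-Z]", "A1 b2")       # two ranges in one class

print(vowels, non_vowels, digits, letters)
```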
Shorthand Character Classes
Common character classes have shorthand notation:
\d: Any digit, equivalent to [0-9]
\D: Any non-digit, equivalent to [^0-9]
\w: Any word character (letter, digit, or underscore), equivalent to [a-zA-Z0-9_]
\W: Any non-word character
\s: Any whitespace character (space, tab, newline)
\S: Any non-whitespace character
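The shorthands can be tried out as follows. This sketch assumes Python, where one caveat applies: for text (str) patterns the shorthands are Unicode-aware by default, so the bracketed equivalents above hold exactly only when the re.ASCII flag is set.

```python
import re

sample = "id_42\tok!"  # arbitrary sample containing a tab

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)
non_words = re.findall(r"\W", sample)

# U+0663 is the Arabic-Indic digit three. By default \d accepts it;
# with re.ASCII, \d is limited to 0-9.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None

print(digits, words, spaces, non_words)
print(unicode_digit, ascii_only)
```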
The Dot Wildcard
A period . in regex matches any single character except a newline (in most engines). The pattern c.t matches "cat", "cut", "cot", "c4t", "c t" — any character in the middle position. This is useful when you want to say "exactly one character of any kind here."
Because dot matches so broadly, it is easy to use it too liberally. A pattern like .+ matches almost anything. Be as specific as possible with character classes when you know what characters are valid.
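To compare the dot with a narrower character class, the sketch below runs both against the same word list. It assumes Python's re module; the word list is made up, and re.DOTALL is the Python flag that lets the dot match a newline as well.

```python
import re

words = ["cat", "cut", "c4t", "c t", "ct", "coat"]

# The dot accepts any single character in the middle position.
dot_hits = [w for w in words if re.fullmatch(r"c.t", w)]

# A character class accepts only what you list.
class_hits = [w for w in words if re.fullmatch(r"c[a-z]t", w)]

# By default the dot does not match a newline; re.DOTALL changes that.
newline_default = re.fullmatch(r"c.t", "c\nt") is not None
newline_dotall = re.fullmatch(r"c.t", "c\nt", re.DOTALL) is not None

print(dot_hits)
print(class_hits)
print(newline_default, newline_dotall)
```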
Anchors: ^ and $
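Anchors do not match characters; they match positions. ^ matches the start of a string (or the start of each line in multiline mode). $ matches the end of a string (or the end of each line in multiline mode). In many engines, $ also matches just before a single trailing newline at the very end of the input, which matters when you validate text read from a file or a form.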
The pattern hello would match "hello" inside "say hello world." Adding anchors, ^hello$, only matches the string "hello" and nothing else — no leading or trailing characters allowed. This is crucial for validation: a pattern for phone numbers without anchors would match any string containing the phone number pattern anywhere within it, which is rarely what you want for input validation.
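The difference between an unanchored search and an anchored match can be sketched like this in Python. The behavior of $ before a trailing newline and the re.fullmatch function are specific to Python's re module; other engines may differ.

```python
import re

# Unanchored: finds "hello" anywhere in the input.
unanchored = re.search(r"hello", "say hello world") is not None

# Anchored: the whole input must be exactly "hello".
anchored_miss = re.search(r"^hello$", "say hello world") is not None
anchored_hit = re.search(r"^hello$", "hello") is not None

# In Python, $ also matches just before a trailing newline,
# while fullmatch requires the entire string to match.
dollar_newline = re.search(r"^hello$", "hello\n") is not None
full_newline = re.fullmatch(r"hello", "hello\n") is not None

# Multiline mode: ^ matches at the start of every line.
line_numbers = re.findall(r"^\d+", "1 a\n22 b\nc 3", re.MULTILINE)

print(unanchored, anchored_miss, anchored_hit)
print(dollar_newline, full_newline)
print(line_numbers)
```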
Quantifiers: How Many to Match
Quantifiers control how many times the preceding element must appear:
*: Zero or more times. ab*c matches "ac", "abc", "abbc", "abbbc".
+: One or more times. ab+c matches "abc", "abbc" but NOT "ac".
?: Zero or one time (makes the element optional). colou?r matches both "color" and "colour".
{n}: Exactly n times. \d{4} matches exactly four digits.
{n,}: At least n times. \d{2,} matches two or more digits.
{n,m}: Between n and m times. \d{2,4} matches 2, 3, or 4 digits.
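By default, quantifiers are greedy: they match as many characters as possible. Adding ? after a quantifier makes it lazy (non-greedy), so it matches as few characters as possible. The difference shows up whenever a closing delimiter appears more than once. Against the input [a][b], the greedy pattern \[.+\] matches the entire string in one go, while the lazy pattern \[.+?\] matches [a] and then [b] as two separate matches.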
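The quantifiers above, including the greedy and lazy variants, can be checked with a few one-liners. This is a sketch in Python with invented sample strings.

```python
import re

# Basic quantifiers, checked against whole strings.
star = [s for s in ["ac", "abc", "abbbc"] if re.fullmatch(r"ab*c", s)]
plus = [s for s in ["ac", "abc", "abbc"] if re.fullmatch(r"ab+c", s)]
optional = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

# {2,4} is greedy too: it takes up to four digits from each run.
ranged = re.findall(r"\d{2,4}", "1 22 333 4444 55555")

# Greedy versus lazy on text with two quoted parts.
quoted = 'say "one" and "two"'
greedy = re.findall(r'".+"', quoted)   # runs from the first quote to the last
lazy = re.findall(r'".+?"', quoted)    # stops at the nearest closing quote

print(star, plus, optional)
print(ranged)
print(greedy)
print(lazy)
```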
Groups and Alternation
Parentheses create capturing groups, which serve two purposes: they apply quantifiers to a sequence, and they capture the matched text for later use. The pattern (ab)+ matches "ab", "abab", "ababab" — the quantifier applies to the whole group "ab", not just the "b".
The pipe character | means "or." The pattern cat|dog matches either "cat" or "dog." Combined with groups: (cat|dog)s? matches "cat", "cats", "dog", "dogs".
Groups also capture their matched content. In most languages, after a match, you can access the captured groups by index. Group 1 is the first set of parentheses, group 2 is the second, and so on. Non-capturing groups use the syntax (?:...) — they group without capturing, which is useful for applying quantifiers without the overhead of storing the match.
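Here is a short sketch of capturing groups, alternation, and non-capturing groups in Python. The "800x600" input is an invented example; group numbering follows the order of the opening parentheses.

```python
import re

# Alternation inside a group, followed by a second optional group.
m = re.fullmatch(r"(cat|dog)(s)?", "dogs")
animal, plural = m.group(1), m.group(2)

# A quantifier applied to a whole group. The group keeps the text
# from its last repetition.
repeated = re.fullmatch(r"(ab)+", "ababab")

# Two groups used to pull numbers out of surrounding text.
size = re.search(r"(\d+)x(\d+)", "size 800x600")
width, height = size.groups()

# A non-capturing group still groups, but stores nothing.
no_capture = re.fullmatch(r"(?:cat|dog)s?", "cats")

print(animal, plural)
print(repeated.group(0), repeated.group(1))
print(width, height)
print(no_capture.groups())
```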
Common Practical Patterns
Email Address Validation
A basic email validation pattern: ^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$. This matches one or more word-like characters before the @, then a domain, then a dot and a TLD of at least 2 characters. A fully RFC 5321-compliant email regex is notoriously complex — for most practical purposes, this simplified version catches the common cases.
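To see what "simplified" means in practice, the sketch below runs the pattern against a handful of made-up addresses in Python. The last sample is deliberately malformed (two consecutive dots in the domain) and still passes, which is the kind of gap a simple pattern leaves open.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")

samples = [
    "ada@example.com",                  # plain address
    "first.last+tag@mail.example.org",  # dots, plus tag, subdomain
    "no-at-sign.example.com",           # missing @
    "user@localhost",                   # no dot and TLD after the domain
    "a@b..com",                         # malformed, but accepted by this pattern
]

results = [EMAIL.match(s) is not None for s in samples]

for sample, ok in zip(samples, results):
    print(f"{sample!r}: {ok}")
```

If stricter checking matters, a common approach is to accept anything that passes a loose pattern like this and then confirm the address by sending a message to it.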
Phone Numbers
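US phone numbers: ^\+?1?\s?\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}$. This handles formats like (555) 123-4567, 555-123-4567, 555.123.4567, and +1 555 123 4567. The parentheses around the area code are written as \(? and \)? so that they match optional literal characters; left unescaped, (\d{3}) would only create a capturing group, and (555) 123-4567 would not match. Because each parenthesis is optional on its own, the pattern also accepts unbalanced input such as (555 123-4567. Phone number formats vary enormously internationally, which is why phone input validation is often better handled by dedicated libraries like libphonenumber.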
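The sketch below tests a US phone pattern in Python against the formats listed above. It writes the area-code parentheses as \(? and \)?, since unescaped parentheses form a group rather than matching the characters themselves; the second pattern shows what happens when they are left unescaped. Sample numbers use the fictional 555 prefix.

```python
import re

# Optional literal parentheses around the area code.
US_PHONE = re.compile(r"^\+?1?\s?\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}$")

# Same pattern, but with (\d{3}) as a plain capturing group.
GROUP_ONLY = re.compile(r"^\+?1?\s?(\d{3})[\s.\-]?\d{3}[\s.\-]?\d{4}$")

samples = [
    "(555) 123-4567",
    "555-123-4567",
    "555.123.4567",
    "+1 555 123 4567",
    "555-1234",        # too short
    "(555 123-4567",   # unbalanced parenthesis, still accepted
]

results = {s: US_PHONE.match(s) is not None for s in samples}
group_only_parens = GROUP_ONLY.match("(555) 123-4567") is not None

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
print(group_only_parens)
```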
Dates (YYYY-MM-DD)
ISO 8601 date format: ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$. The month group ensures 01–12, and the day group allows 01–31. This does not validate whether the specific day is valid for the month (e.g., February 30 would pass), which requires logic beyond regex.
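One way to add the "logic beyond regex" is to let the pattern check the shape and hand the numbers to a calendar-aware type. The sketch below assumes Python's datetime.date for that second step. The function name parse_iso_date is hypothetical, and the year is wrapped in an extra capturing group (not in the original pattern) so all three parts can be read back.

```python
import re
from datetime import date

# Same pattern as above, with the year wrapped in a group for extraction.
ISO_DATE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def parse_iso_date(text):
    """Return a date for a valid YYYY-MM-DD string, or None otherwise."""
    m = ISO_DATE.fullmatch(text)   # step 1: the shape must match
    if m is None:
        return None
    year, month, day = (int(part) for part in m.groups())
    try:
        return date(year, month, day)  # step 2: the calendar must agree
    except ValueError:
        return None

print(parse_iso_date("2024-02-29"))  # leap day
print(parse_iso_date("2023-02-30"))  # right shape, impossible date
print(parse_iso_date("2023-13-01"))  # rejected by the regex itself
```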
When NOT to Use Regex
Regex is not the right tool for every text-matching job. The classic example is parsing HTML or XML. These are nested, recursive structures that regular expressions (which are based on finite automata theory) fundamentally cannot handle correctly. A regex cannot reliably match "the content between opening and closing tags" because HTML tags can nest to arbitrary depth. Use a proper HTML/XML parser for these tasks.
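The nesting problem can be seen with two div elements, one inside the other. This sketch contrasts a lazy regex with Python's standard html.parser module; the DepthTracker class is a hypothetical helper written for the example and only tracks div tags.

```python
import re
from html.parser import HTMLParser

doc = "<div>outer <div>inner</div> tail</div>"

# The lazy regex stops at the first closing tag, cutting the outer div short.
regex_guess = re.search(r"<div>(.*?)</div>", doc).group(1)

class DepthTracker(HTMLParser):
    """Record each text chunk together with its div nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        self.text.append((self.depth, data))

tracker = DepthTracker()
tracker.feed(doc)
tracker.close()

print(repr(regex_guess))
print(tracker.max_depth)
print(tracker.text)
```

The parser keeps a running depth, which is exactly the bookkeeping a plain regular expression has no way to express.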
Similarly, regex should not be your first choice for highly complex structured formats like JSON or CSV with quoted fields. Purpose-built parsers handle edge cases that regex will inevitably miss.
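A quoted CSV field shows the kind of edge case meant here. The sketch below compares a naive comma split with Python's standard csv module; the sample row is invented.

```python
import csv
import io
import re

# One row with a comma inside a quoted field and doubled quotes as escapes.
row = '1,"Doe, Jane","said ""hi"""'

# Splitting on every comma breaks the quoted name in two
# and leaves the quote characters in place.
naive = re.split(r",", row)

# The csv module understands quoting and escaped quotes.
parsed = next(csv.reader(io.StringIO(row)))

print(naive)
print(parsed)
```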
Practice Tips
The best way to learn regex is to write and test patterns against real examples. Start with a simple goal — match all lines that start with a digit — and build up from there. Use a regex tester with live highlighting so you can immediately see what your pattern matches. When a pattern does not work as expected, simplify it to the smallest failing case and reason through each component.
Three habits that will accelerate your learning: always anchor validation patterns with ^ and $, prefer specific character classes over the dot wildcard, and test your patterns against both valid inputs and intentional edge cases that should not match.
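These habits can be turned into a tiny test table. The sketch below is one possible shape in Python, not a standard tool: the failures helper and the five-digit ZIP pattern are hypothetical examples. It uses re.fullmatch, which plays the role of anchoring at both ends and also rejects a trailing newline.

```python
import re

def failures(pattern, should_match, should_not_match):
    """Return the inputs whose result disagrees with the expectation."""
    compiled = re.compile(pattern)
    wrong = [s for s in should_match if compiled.fullmatch(s) is None]
    wrong += [s for s in should_not_match if compiled.fullmatch(s) is not None]
    return wrong

ZIP5 = r"[0-9]{5}"  # hypothetical example: a five-digit ZIP code
valid = ["12345", "00501"]
invalid = ["", "1234", "123456", "12a45", " 12345", "12345\n"]

zip_failures = failures(ZIP5, valid, invalid)

# Without anchoring, a six-digit input slips through.
unanchored_hit = re.search(ZIP5, "123456") is not None

# The starter exercise from above: lines that begin with a digit.
log = "1 first\nsecond\n22 third"
digit_lines = re.findall(r"^\d.*$", log, re.MULTILINE)

print(zip_failures)
print(unanchored_hit)
print(digit_lines)
```

An empty failure list means every valid sample matched and every edge case was rejected; any entry in the list is the smallest failing case to start debugging from.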