Regex to find URLs without http or www

I'm working on some code that tries to find all the website URLs within a block of text. We already have checks that work fine for URLs formatted like http://www.google.com or www.google.com, but we're trying to find a regex that can locate a URL written as just google.com.

Right now our regex searches for every top-level domain (TLD) that we could find registered, which is around 1400 in total, so it looks like this:

/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i

Except with ALL 1400 TLDs in the group (the full thing is around 8400 characters long). Naturally it's running quite slowly. We've already had the idea to simply check for the 10 or so most commonly used TLDs, but I wanted to ask here first whether there is a more efficient way to match this specific URL format than listing every single one.

2 answers

  • answered 2018-01-11 21:15 SinistraD

    You could use a double pass search.

    Search for every url-like string, e.g.:

    ((http|https):\/\/)?([\w-]+\.)+[a-zA-Z]{2,}
    

    Then run some non-regex checks on every result: is it long enough, is the text after the last dot in your TLD list, and so on.

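    function isUrl($urlMatch) {
        // TLDs to accept, in lowercase; extend this with your full list.
        $tldList = ['com', 'net'];
        $urlParts = explode(".", $urlMatch);
        // Lowercase the last label so "Google.COM" passes as well.
        $lastPart = strtolower(end($urlParts));
        return in_array($lastPart, $tldList, true);
    }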
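
    To show how the two passes described above might fit together, here is a minimal sketch. The function name findBareUrls, the sample text and the short $tlds list are made up for illustration, and the pattern is only one possible "url-like" pattern, not necessarily the one you would settle on. The TLD list is flipped into an array keyed by TLD so that each check is a hash lookup instead of a scan over 1400 entries.

    ```php
    <?php
    // Sketch only: $tlds is a short, hypothetical stand-in for the full TLD list.
    function findBareUrls(string $text, array $tlds): array
    {
        // Flip the list once so each TLD check is a hash lookup, not a scan.
        $tldSet = array_flip(array_map('strtolower', $tlds));

        // Pass 1: a cheap, permissive pattern for anything shaped like a host name.
        preg_match_all('/(?:https?:\/\/)?(?:[\w-]+\.)+[a-z]{2,}/i', $text, $matches);

        // Pass 2: keep only candidates whose last label is a known TLD.
        $urls = [];
        foreach ($matches[0] as $candidate) {
            $tld = strtolower(substr($candidate, strrpos($candidate, '.') + 1));
            if (isset($tldSet[$tld])) {
                $urls[] = $candidate;
            }
        }
        return $urls;
    }

    $text = 'Try google.com or Example.ORG, but not readme.txt or e.g. this.';
    print_r(findBareUrls($text, ['com', 'net', 'org']));
    ```

    With the sample text this keeps google.com and Example.ORG and drops readme.txt, because txt is not in the list. The regex itself stays short no matter how many TLDs you load into $tlds.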
    

  • answered 2018-01-11 21:55 Daniel O.

    Example

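    function get_host($url) {
        // parse_url() only reports a host when a scheme is present,
        // so add one for bare input such as "google.com".
        if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
            $url = 'http://' . $url;
        }
        $host = parse_url($url, PHP_URL_HOST);
        $names = explode(".", $host);
    
        // A single label, e.g. "localhost", is returned as-is.
        if(count($names) == 1) {
            return $names[0];
        }
    
        // Keep only the last two labels, e.g. "google.com".
        // Note that "www.bbc.co.uk" comes back as "co.uk".
        $names = array_reverse($names);
        return $names[1] . '.' . $names[0];
    }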
    

    Usage

    echo get_host('https://google.com'); // google.com
    echo "\n";
    echo get_host('https://www.google.com'); // google.com
    echo "\n";
    echo get_host('https://sub1.sub2.google.com'); // google.com
    echo "\n";
    echo get_host('http://localhost'); // localhost
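
    Since get_host keeps only the last two labels, a host under a two-part suffix such as co.uk loses its actual name. Here is a rough sketch of one way around that, assuming you maintain a list of such suffixes. The function name registrableDomain and the three-entry $twoLabelSuffixes list are hypothetical. A real list would have to come from a maintained source such as the Public Suffix List.

    ```php
    <?php
    // Sketch only: $twoLabelSuffixes is a tiny, hypothetical sample list.
    function registrableDomain(string $host): string
    {
        $twoLabelSuffixes = ['co.uk', 'org.uk', 'com.au'];

        $labels = explode('.', strtolower($host));
        $count = count($labels);
        if ($count <= 2) {
            // Nothing to trim, e.g. "localhost" or "google.com".
            return implode('.', $labels);
        }

        // Keep three labels when the last two form a known suffix, else two.
        $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
        $keep = in_array($lastTwo, $twoLabelSuffixes, true) ? 3 : 2;

        return implode('.', array_slice($labels, -$keep));
    }

    echo registrableDomain('www.bbc.co.uk'), "\n";
    echo registrableDomain('sub1.sub2.google.com'), "\n";
    echo registrableDomain('localhost'), "\n";
    ```

    This takes a bare host name rather than a full URL, so it would run on the result of parse_url($url, PHP_URL_HOST).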
    
