How to split a string into words and non-words

I am looking for an elegant way of splitting a string into words and non-words, where a "word" is defined by some regular expression (for instance, [a-zA-Z]+).

Input is a string, output should be a list of word and non-word substrings in order. For instance:

"A! B C, d." -> Arrays.asList("A", "! ", "B", " ", "C", ", ","d", ".")

Here's my take:

public static String WORD_PATTERN = "[a-zA-Z]+";

public static List<String> splitString(String str) {
    if (str == null) {
        return null;
    }
    Pattern wordPattern = Pattern.compile(WORD_PATTERN);
    Matcher wordMatcher = wordPattern.matcher(str);

    List<String> splitString = new ArrayList<>();

    int endOfLastWord = 0;

    while(wordMatcher.find())
    {
        int startOfNextWord = wordMatcher.start();
        int endOfNextWord = wordMatcher.end();

        if (startOfNextWord > endOfLastWord) {
            String nextNonWord = str.substring(endOfLastWord, startOfNextWord);
            splitString.add(nextNonWord);
        }

        String nextWord = str.substring(startOfNextWord, endOfNextWord);
        splitString.add(nextWord);
        endOfLastWord = endOfNextWord;
    }

    if (endOfLastWord < str.length()) {
        String lastNonWord = str.substring(endOfLastWord);
        splitString.add(lastNonWord);
    }
    return splitString;
}

This does not feel elegant; I think there should be a better way that I'm just not aware of.

I am not looking to improve the code above, so please don't refer to Codereview. I've only posted it to avoid "what have you tried so far" comments.

I am looking for a more concise and elegant way, ideally only using standard Java packages.

1 answer

  • answered 2018-05-16 05:51 AxelH

    You can use a regex that captures a word followed by a non-word, where either part may be empty:

    (\w*)(\W*)
    
    • \w : [a-zA-Z0-9_]
    • \W : [^a-zA-Z0-9_]
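Note that \w is wider than the [a-zA-Z]+ definition used in the question, since it also accepts digits and the underscore. A minimal sketch of the difference (the class and method names here are made up for illustration):

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // True if the whole string consists of \w characters (ASCII letters, digits, underscore).
    public static boolean isWordByBackslashW(String s) {
        return Pattern.matches("\\w+", s);
    }

    // True if the whole string consists of ASCII letters only, as in the question.
    public static boolean isWordByLetters(String s) {
        return Pattern.matches("[a-zA-Z]+", s);
    }
}
```

With these helpers, "foo_bar42" counts as a word under \w+ but not under [a-zA-Z]+, while "foo" counts as a word under both. If the question's definition matters, the pair ([a-zA-Z]*)([^a-zA-Z]*) can presumably be used in place of (\w*)(\W*).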

    Example with regex101

    For each match, take both capture groups, check if there is a value captured (length > 0) and add the value to the list.

    This gives a nice and simple solution, like:

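    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public List<String> splitWord(String s){
        List<String> result = new ArrayList<>();
        // Group 1: a run of word characters; group 2: a run of non-word characters.
        // Either run may be empty.
        Pattern p = Pattern.compile("(\\w*)(\\W*)");
        Matcher m = p.matcher(s);
    
        while(m.find()){
            // Add each group to the list only if it captured at least one character.
            Optional.of(m.group(1)).filter(str -> !str.isEmpty()).ifPresent(result::add);
            Optional.of(m.group(2)).filter(str -> !str.isEmpty()).ifPresent(result::add);
        }
    
        return result;
    }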
    

    Note: the Optional is ... optional, but I am trying to get better at using it. It simply checks whether the group has a non-empty value and, if so, adds it to the list.
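For comparison, here is a sketch of the same loop without Optional. It assumes the question's [a-zA-Z]+ definition of a word rather than \w+, and the class name WordSplitter and the constant name are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordSplitter {
    // Assumption: a "word" is a run of ASCII letters, as in the question.
    // Group 1 is a (possibly empty) word, group 2 a (possibly empty) non-word.
    private static final Pattern WORD_THEN_NON_WORD =
            Pattern.compile("([a-zA-Z]*)([^a-zA-Z]*)");

    public static List<String> splitWord(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD_THEN_NON_WORD.matcher(s);
        while (m.find()) {
            // Both groups take part in every match, so group() is never null here;
            // only the empty captures need to be skipped.
            if (!m.group(1).isEmpty()) {
                result.add(m.group(1));
            }
            if (!m.group(2).isEmpty()) {
                result.add(m.group(2));
            }
        }
        return result;
    }
}
```

Calling WordSplitter.splitWord("A! B C, d.") should give the list from the question: "A", "! ", "B", " ", "C", ", ", "d", ".". Compiling the pattern once as a constant is a design choice of this sketch, not something the answer requires.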

    And the result, formatted to match your example:

    "abc def" -> Arrays.asList("abc", " ", "def")
    "a.b. c" -> Arrays.asList("a", ".", "b", ". ", "c")
    "a.b." -> Arrays.asList("a", ".", "b", ".")
    ".aa" -> Arrays.asList(".", "aa")
    "." -> Arrays.asList(".")
    "a" -> Arrays.asList("a")
    ".." -> Arrays.asList("..")
    

    Here is the example, including the formatting method, on ideone.
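A different technique, not the one used in the answer above, is to call String.split with a zero-width pattern built from lookbehind and lookahead, so the string is cut at every boundary between a letter and a non-letter. The sketch below assumes the question's [a-zA-Z] definition of a word character; the class name BoundarySplit is made up for illustration:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BoundarySplit {
    // Zero-width boundary: a letter followed by a non-letter,
    // or a non-letter followed by a letter. Nothing is consumed by the split.
    private static final String BOUNDARY =
            "(?<=[a-zA-Z])(?=[^a-zA-Z])|(?<=[^a-zA-Z])(?=[a-zA-Z])";

    public static List<String> split(String s) {
        if (s.isEmpty()) {
            // String.split would return a one-element array holding "" here.
            return Collections.emptyList();
        }
        return Arrays.asList(s.split(BOUNDARY));
    }
}
```

This is shorter, but it only works when a word can be described by a single character class. For an arbitrary word regex, a Matcher loop like the ones above is likely still needed.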