Stream API Java 8 Parallel Processing

I have two sets:

phraseSet contains phrases like "eiffel tower" and "tokyo tower".

wordSet contains words like "eiffel" and "tower".

How do I use a Java 8 parallel stream to implement the following logic: for each item in phraseSet, tokenize it and check whether all of its tokens exist in wordSet; if so, add that item to a new set called resultSet. In this example, resultSet would contain "eiffel tower".

This is easy to do with a traditional for loop, but I am confused about how to do it with a parallel stream, which I hope would also be faster since the items are processed in parallel.

3 answers

  • answered 2018-05-16 05:59 Flown

    A filter and an allMatch would be sufficient:

    Set<String> phrases = new HashSet<>(Arrays.asList("eiffel tower", "tokyo tower"));
    Set<String> words = new HashSet<>(Arrays.asList("eiffel", "tower"));
    Pattern delimiter = Pattern.compile("\\s+");
    
    Set<String> resultSet = phrases.parallelStream().filter(
        phrase -> delimiter.splitAsStream(phrase).allMatch(words::contains)
    ).collect(Collectors.toSet());
    

  • answered 2018-05-16 06:06 Hadi J

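    You could use the containsAll method here. The equals method also works for the sample data, but only because the tokens of "eiffel tower" happen to be exactly the contents of wordSet; it requires an exact match rather than "all tokens exist in wordSet".

    Set<String> resultSet = phraseSet.stream()
               // wordSet.equals(...) would instead require the token set to match wordSet exactly
               .filter(s -> wordSet.containsAll(Stream.of(s.split("\\s+"))
                      .collect(Collectors.toSet())))
               .collect(Collectors.toSet());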
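
    The two methods behave differently as soon as wordSet holds more words than a single phrase uses. The following is a minimal sketch of that difference, not part of the original answer: the class name EqualsVsContainsAll, its two helper methods, and the extra word "paris" are assumptions made for the demo.

    ```java
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Hypothetical demo class; not part of the original answer.
    class EqualsVsContainsAll {

        // Keeps phrases whose tokens are all present in wordSet (subset check).
        static Set<String> byContainsAll(Set<String> phraseSet, Set<String> wordSet) {
            return phraseSet.stream()
                .filter(s -> wordSet.containsAll(Arrays.asList(s.split("\\s+"))))
                .collect(Collectors.toSet());
        }

        // Keeps only phrases whose token set is exactly equal to wordSet.
        static Set<String> byEquals(Set<String> phraseSet, Set<String> wordSet) {
            return phraseSet.stream()
                .filter(s -> wordSet.equals(new HashSet<>(Arrays.asList(s.split("\\s+")))))
                .collect(Collectors.toSet());
        }

        public static void main(String[] args) {
            Set<String> phraseSet = new HashSet<>(Arrays.asList("eiffel tower", "tokyo tower"));
            // "paris" is an extra word added for the demo (assumption).
            Set<String> wordSet = new HashSet<>(Arrays.asList("eiffel", "tower", "paris"));
            System.out.println(byContainsAll(phraseSet, wordSet)); // [eiffel tower]
            System.out.println(byEquals(phraseSet, wordSet));      // []
        }
    }
    ```

    With the extra word in wordSet, the equals variant no longer finds "eiffel tower", while the containsAll variant still does.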
    

  • answered 2018-05-16 07:07 Holger

    The simplest solution would be

    Set<String> resultSet = phraseSet.stream()
        .filter(s -> wordSet.containsAll(Arrays.asList(s.split("\\s+"))))
        .collect(Collectors.toSet());
    

    You may turn this into parallel processing by replacing stream() with parallelStream(), but you would need a rather large input set to benefit from it.
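
    As a rough sketch of that swap, the helper below selects the stream source with a flag and leaves the rest of the pipeline unchanged. The class name ParallelSwapDemo, the match method, and the synthetic "cityN tower" phrases are assumptions made for illustration; the sketch only checks that both variants return the same set and makes no claim about timing.

    ```java
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Hypothetical demo class; the generated input is an assumption for illustration.
    class ParallelSwapDemo {

        static Set<String> match(Set<String> phraseSet, Set<String> wordSet, boolean parallel) {
            // The only difference between the two variants is the stream source.
            return (parallel ? phraseSet.parallelStream() : phraseSet.stream())
                .filter(s -> wordSet.containsAll(Arrays.asList(s.split("\\s+"))))
                .collect(Collectors.toSet());
        }

        public static void main(String[] args) {
            Set<String> wordSet = new HashSet<>(Arrays.asList("eiffel", "tower"));
            // 100,000 synthetic phrases, none of which match wordSet.
            Set<String> phraseSet = IntStream.range(0, 100_000)
                .mapToObj(i -> "city" + i + " tower")
                .collect(Collectors.toCollection(HashSet::new));
            phraseSet.add("eiffel tower"); // the single matching phrase

            Set<String> sequential = match(phraseSet, wordSet, false);
            Set<String> parallel = match(phraseSet, wordSet, true);
            System.out.println(sequential.equals(parallel)); // true
        }
    }
    ```

    Sharing wordSet between threads is safe here because the pipeline only reads from it and nothing modifies it while the stream runs.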

    Note that this simple solution may do unnecessary work if you have a lot of non-matching phrases, as it creates all substrings of a phrase before checking whether they are contained in wordSet. A solution like Flown’s defers the creation of the substrings, so the remaining ones can be skipped as soon as a word not contained in wordSet is encountered (also known as short-circuiting). Another performance improvement would be to move the creation of the Pattern out of the stream processing and reuse it (a Pattern is also created behind the scenes when using a method like String.split, as in the solution above).

    Pattern whiteSpace = Pattern.compile("\\s+");
    Predicate<String> inWordSet = wordSet::contains;
    Set<String> resultSet = phraseSet.stream()
        .filter(phrase -> whiteSpace.splitAsStream(phrase).allMatch(inWordSet))
        .collect(Collectors.toSet());
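
    To make the short-circuiting visible, here is a small sketch that counts how many wordSet lookups allMatch performs for a single phrase. The class name ShortCircuitDemo, the lookups method, and the counting predicate are hypothetical additions for illustration only; the precompiled pattern mirrors the whiteSpace variable above.

    ```java
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Predicate;
    import java.util.regex.Pattern;

    // Hypothetical demo class; the lookup counter exists only for illustration.
    class ShortCircuitDemo {

        // Compiled once and reused for every phrase.
        private static final Pattern WHITE_SPACE = Pattern.compile("\\s+");

        // Returns how many wordSet lookups allMatch performs for one phrase.
        static int lookups(String phrase, Set<String> wordSet) {
            AtomicInteger counter = new AtomicInteger();
            Predicate<String> inWordSet = word -> {
                counter.incrementAndGet();
                return wordSet.contains(word);
            };
            // allMatch stops at the first token that is not in wordSet.
            WHITE_SPACE.splitAsStream(phrase).allMatch(inWordSet);
            return counter.get();
        }

        public static void main(String[] args) {
            Set<String> wordSet = new HashSet<>(Arrays.asList("eiffel", "tower"));
            System.out.println(lookups("eiffel tower", wordSet));       // 2
            System.out.println(lookups("tokyo tower eiffel", wordSet)); // 1
        }
    }
    ```

    For the non-matching phrase, only the first token is checked; the remaining tokens are never looked up.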