Merge by partial string match in R

I have a data frame as follows:

+-------+---------+-------+
| Brand |  WORD   | Count |
+-------+---------+-------+
| ABC   | cell    |     1 |
| DEF   | dock    |     2 |
| XYZ   | surface |     3 |
| LMN   | pro     |     4 |
| ABC   | mobile  |     5 |
| DEF   | game    |     6 |
| XYZ   | mouse   |     7 |
+-------+---------+-------+

and another one:

+-------+-----------------+--------+
| Brand |      Name       | profit |
+-------+-----------------+--------+
| ABC   | cell game       |     10 |
| ABC   | cellular mobile |     20 |
| DEF   | docking station |     30 |
| XYZ   | surface mouse   |     40 |
| XYZ   | mouse device    |     50 |
| LMN   | pro device      |     60 |
+-------+-----------------+--------+

I want to merge them by partial string matching between WORD and Name, word for word (meaning cell would match only cell and not cellular), within each Brand. The resulting table would be as follows:

+-------+---------------+-----------------+-------+--------+
| Brand |     WORD      |      Name       | Count | profit |
+-------+---------------+-----------------+-------+--------+
| ABC   | cell          | cell game       |     1 |     10 |
| ABC   | mobile        | cellular mobile |     5 |     20 |
| XYZ   | surface mouse | surface mouse   |     3 |     40 |
| XYZ   | mouse         | mouse device    |     7 |     50 |
| XYZ   | mouse         | mouse device    |     7 |     50 |
| LMN   | pro           | pro device      |     4 |     60 |
+-------+---------------+-----------------+-------+--------+

I tried the solution from "R partial string matching and return value (in R)",

but it also matches parts of words, so cell would be matched with cellular. Is there a way to match whole words only and get the results in the desired form?

1 answer

  • answered 2021-09-27 16:07 G. Grothendieck

    We assume here that you want to match on the Brand column and match the WORD column against the Name column, and that the output is to be ordered by profit. The output shown in the question has a duplicate row, which we assume was an error. The inputs d1 and d2 are shown reproducibly in the Note at the end.

    We pad WORD and Name with a space on either side to ensure that only word matches are used. The % used in the like pattern is a wildcard that matches any string of 0 or more characters.
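
    To see why the padding works, here is a minimal base R sketch of the same test applied to single strings. The helper has_word is a hypothetical name introduced only for illustration and is not part of sqldf or of the query below.

    ```r
    # Hypothetical helper: TRUE when `word` occurs in `name` as a whole,
    # space-delimited word.
    has_word <- function(word, name) {
      mapply(function(w, n) {
        # Pad both strings with a space so a match must begin and end
        # at a word boundary; fixed = TRUE means a literal (non-regex) match.
        grepl(paste0(" ", w, " "), paste0(" ", n, " "), fixed = TRUE)
      }, word, name, USE.NAMES = FALSE)
    }

    has_word("cell", "cell game")        # TRUE
    has_word("cell", "cellular mobile")  # FALSE
    has_word("mouse", "surface mouse")   # TRUE
    ```

    One difference to keep in mind: grepl with fixed = TRUE is case-sensitive, whereas SQLite's like is case-insensitive for ASCII letters by default, so the two can disagree on mixed-case data.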

    library(sqldf)
    
    sqldf("select d1.Brand, d2.Name, d1.WORD, d1.Count, d2.profit
      from d1
      join d2 on d1.Brand = d2.Brand and 
                 ' ' || d2.Name || ' ' like '% ' || d1.WORD || ' %'
      order by d2.profit")
    

    giving:

      Brand            Name    WORD Count profit
    1   ABC       cell game    cell     1     10
    2   ABC cellular mobile  mobile     5     20
    3   XYZ   surface mouse surface     3     40
    4   XYZ   surface mouse   mouse     7     40
    5   XYZ    mouse device   mouse     7     50
    6   LMN      pro device     pro     4     60
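
    For comparison, the same result can probably be reached without sqldf. The following is only a sketch in base R, under the same assumptions as above (match within Brand, whole words only, ordered by profit). It swaps the padded like test for splitting Name on spaces, and the names m, keep and res are illustrative.

    ```r
    # Same data as in the Note, rebuilt so the block runs on its own.
    d1 <- data.frame(
      Brand = c("ABC", "DEF", "XYZ", "LMN", "ABC", "DEF", "XYZ"),
      WORD  = c("cell", "dock", "surface", "pro", "mobile", "game", "mouse"),
      Count = c(1, 2, 3, 4, 5, 6, 7),
      stringsAsFactors = FALSE)
    d2 <- data.frame(
      Brand  = c("ABC", "ABC", "DEF", "XYZ", "XYZ", "LMN"),
      Name   = c("cell game", "cellular mobile", "docking station",
                 "surface mouse", "mouse device", "pro device"),
      profit = c(10, 20, 30, 40, 50, 60),
      stringsAsFactors = FALSE)

    # Pair every d1 row with every d2 row that has the same Brand.
    m <- merge(d1, d2, by = "Brand")

    # Keep a pair only if WORD equals one of the space-separated words of Name.
    keep <- mapply(function(w, n) w %in% strsplit(n, " ", fixed = TRUE)[[1]],
                   m$WORD, m$Name, USE.NAMES = FALSE)

    res <- m[keep, c("Brand", "Name", "WORD", "Count", "profit")]
    res <- res[order(res$profit), ]
    rownames(res) <- NULL
    res
    ```

    This builds all within-Brand pairs before filtering, which is fine for small inputs like these but may be wasteful when a Brand has many rows on both sides.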
    

    Note

    The input in reproducible form.

    d1 <-
    structure(list(Brand = c("ABC", "DEF", "XYZ", "LMN", "ABC", "DEF", 
    "XYZ"), WORD = c("cell", "dock", "surface", "pro", "mobile", 
    "game", "mouse"), Count = c(1, 2, 3, 4, 5, 6, 7)), class = "data.frame", row.names = c(NA, 
    -7L))
    
    d2 <-
    structure(list(Brand = c("ABC", "ABC", "DEF", "XYZ", "XYZ", "LMN"
    ), Name = c("cell game", "cellular mobile", "docking station", 
    "surface mouse", "mouse device", "pro device"), profit = c(10, 
    20, 30, 40, 50, 60)), class = "data.frame", row.names = c(NA, 
    -6L))
     
    
