Hive - Update records in a table with todays date IF they are not found in another table?

I currently have a Main result table (test1) that stores all of my records of issues and a second table (test2) that is run every week or so and I'm trying to find those records where not exists in the weekly update and then update the date in the main result table as that is when it got updated in the system to be corrected.

I am trying to add the records from test2 table to test1 table if they are not already in the table.

This works:

insert into table test1 (id, name, code)
select * from test2 t2 where t2.id not in (select id from test1);

I'm trying to also update table test1 'Corrected_date' column to show current_date for all records that are found in test1 but not in test2

example data below:

Table 1

ID    NAME    CODE    CORRECTED_DATE
1     TEST    3    
29    TEST2   90 

Table 2

ID    NAME    CODE  
12    TEST5   20
1     TEST    3

Expected End Result of Table 1

ID    NAME    CODE    CORRECTED_DATE
1     TEST    3       
29    TEST2   90       3/13/2019
12    TEST5   20

1 answer

  • answered 2019-03-14 08:31 leftjoin

    Overwrite the table using FULL JOIN. FULL JOIN returns joined records + not joined from the left table + not joined from the right table. You can use case statements to implement your logic like this:

    insert OVERWRITE table test1
    
    select 
          --select t1 if both or t1 only exist, t2 if only t2 exists
          case when t1.ID is null then t2.ID   else t1.ID   end as ID,
          case when t1.ID is null then t2.NAME else t1.NAME end as NAME,
          case when t1.ID is null then t2.CODE else t1.CODE end as CODE,
    
          --if found in t1 but not in t2 then current_date else leave as is
          case when (t1.ID is not null) and (t2.ID is null) then current_date else t1.CORRECTED_DATE end as CORRECTED_DATE 
      from test1 t1 
           FULL OUTER JOIN test2 t2 on t1.ID=t2.ID;
    

    See also this similar question about incremental updates, your logic is different but approach is the same: https://stackoverflow.com/a/37744071/2700344

    Testing with your data:

    with test1 as (
    select stack (2,
    1, 'TEST',    3,null,    
    29,'TEST2',   90 , null
                 ) as (ID,NAME,CODE,CORRECTED_DATE)
    ),
    
         test2 as (
    select stack (2,
                  12,'TEST5',20,
                  1,'TEST',3
                 ) as (ID, NAME, CODE)
    )
    
    select 
          --select t1 if both or t1 only exist, t2 if only t2 exists
          case when t1.ID is null then t2.ID   else t1.ID   end as ID,
          case when t1.ID is null then t2.NAME else t1.NAME end as NAME,
          case when t1.ID is null then t2.CODE else t1.CODE end as CODE,
    
          --if found in test1 but not in test2 then current_date else leave as is
          case when (t1.ID is not null) and (t2.ID is null) then current_date else t1.CORRECTED_DATE end as CORRECTED_DATE 
      from test1 t1 
           FULL OUTER JOIN test2 t2 on t1.ID=t2.ID;
    

    Result:

    OK
    id      name    code    corrected_date
    1       TEST    3       NULL
    12      TEST5   20      NULL
    29      TEST2   90      2019-03-14
    Time taken: 41.727 seconds, Fetched: 3 row(s)
    

    The result is as expected.