How do I calculate the right mean?

I have a dataset that shows bilateral exports for several countries. Because the data fluctuates, I need to calculate the mean of year groups. All the countries do not cover exactly the years. Some start later, some have gaps in between - this means, some years are missing (but without having NA entries). I already managed to cut the data into pieces whith the help of an amazing community member: year_group.

Below I am listing two further problems along with my code, the wrong output and on the bottom some sample input data for the dataset total_trade

Problem 1

I am facing the issue, that the code does not calculate the right means. When I calculate the results manually, I get different results than my code. (see below)

This is my code

# create vectors for coding 4 years average
year_group_break <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")
years <- c(1999, 2000, 2001, 2002,2003, 2004,   2005,   2006,   2007,   2008,   2009,   2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)


FourY_av <- total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts, 
  # chose right = FALSE to take cut before year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels  = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "avg_year_group_total"
  group_by(ReporterName, year_group) %>%
  mutate(total_year_group = mean(Total_Year)) %>%
  arrange(ReporterName,PartnerName, desc(Year))
View(FourY_av)

Below is the wrong output This output is wrong because total_year_group (the mean of the year group '2015-2018' for Angola) should be 34746013.5 (when calculated manually) instead of 34907582. (as in the output) Where is my mistake?

> head(FourY_av)
# A tibble: 6 x 9
# Groups:   ReporterName, year_group [1]
   Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group total_year_group
  <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>                 <dbl>
1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018         34907582.
2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018         34907582.
3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018         34907582.
4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018         34907582.
5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018         34907582.
6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018         34907582.

Problem 2

Another issue is that not all countries display data for all years. SOme start later, some have gaps. I still need the means for the same year groups in order to ensure comparability. The data set has no NA's. The data is just missing.

E.g. Angola does not cover the year 2008. The data set does not include NA's but does not include the row and value for 2008 for Angola. Other countries are displaying data for 2008. I still need a mean for the available years for Angola in the column total_year_group (by taking the mean for the years 2007, 2009 & 2010). This shouldn't be an issue with the mean function, right? Or do I need to consider something particular in this case?

Here is some sample input data of total_trade

dput(head(total_trade, n = 100))
structure(list(Year = c(2015L, 2018L, 2017L, 2016L, 2017L, 2015L, 
2018L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L, 
2015L, 2016L, 2017L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 
2016L, 2015L, 2018L, 2014L, 2015L, 2016L, 2017L, 2017L, 2018L, 
2016L, 2015L, 2016L, 2018L, 2017L, 2015L, 2010L, 2009L, 2016L, 
2013L, 2014L, 2018L, 2017L, 2015L, 2016L, 2017L, 2018L, 2017L, 
2018L, 2016L, 2016L, 2018L, 2007L, 2013L, 2009L, 2018L, 2015L, 
2016L, 2014L, 2010L, 2017L, 2012L, 2011L, 2018L, 2016L, 2015L, 
2016L, 2011L, 2018L, 2017L, 2015L, 2015L, 2016L, 2018L, 2017L, 
2015L, 2015L, 2016L, 2018L, 2017L, 2007L, 2014L, 2010L, 2013L, 
2011L, 2009L, 2012L, 2017L, 2018L, 2016L, 2015L, 2015L, 2015L, 
2017L, 2016L, 2018L, 2015L), ReporterName = c("Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola"
), PartnerName = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Albania", "Albania", "Albania", "Algeria", "Algeria", 
"Algeria", "Algeria", "American Samoa", "Andorra", "Andorra", 
"Antigua and Barbuda", "Antigua and Barbuda", "Antigua and Barbuda", 
"Antigua and Barbuda", "Argentina", "Argentina", "Argentina", 
"Argentina", "Armenia", "Armenia", "Armenia", "Armenia", "Australia", 
"Australia", "Australia", "Australia", "Australia", "Austria", 
"Austria", "Austria", "Austria", "Azerbaijan", "Azerbaijan", 
"Azerbaijan", "Azerbaijan", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahrain", "Bahrain", "Bahrain", "Bangladesh", 
"Bangladesh", "Bangladesh", "Barbados", "Belarus", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belize", "Belize", 
"Belize", "Benin", "Benin", "Benin", "Benin", "Benin", "Bhutan", 
"Bolivia", "Bolivia", "Bolivia", "Bolivia", "Botswana", "Botswana", 
"Botswana", "Botswana", "Brazil", "Brazil", "Brazil", "Brazil", 
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", 
"British Virgin Islands", "Brunei", "Bulgaria", "Bulgaria", "Bulgaria", 
"Bulgaria"), PartnerISO3 = c("AFG", "AFG", "AFG", "AFG", "ALB", 
"ALB", "ALB", "DZA", "DZA", "DZA", "DZA", "ASM", "AND", "AND", 
"ATG", "ATG", "ATG", "ATG", "ARG", "ARG", "ARG", "ARG", "ARM", 
"ARM", "ARM", "ARM", "AUS", "AUS", "AUS", "AUS", "AUS", "AUT", 
"AUT", "AUT", "AUT", "AZE", "AZE", "AZE", "AZE", "BHS", "BHS", 
"BHS", "BHS", "BHS", "BHS", "BHS", "BHS", "BHR", "BHR", "BHR", 
"BGD", "BGD", "BGD", "BRB", "BLR", "BEL", "BEL", "BEL", "BEL", 
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BLZ", "BLZ", 
"BLZ", "BEN", "BEN", "BEN", "BEN", "BEN", "BTN", "BOL", "BOL", 
"BOL", "BOL", "BWA", "BWA", "BWA", "BWA", "BRA", "BRA", "BRA", 
"BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "VGB", 
"BRN", "BGR", "BGR", "BGR", "BGR"), `TradeValue in 1000 USD` = c(39.586, 
19.353, 2.248, 0.775, 39.723, 2.259, 2.38, 2169.123, 2322.463, 
2241.599, 245.226, 12.007, 5.975, 0.326, 422.006, 155.467, 47.018, 
54.774, 483.147, 142.23, 98.7, 61.362, 60.105, 30.494, 0.99, 
0.731, 40220.092, 45435.804, 16096.404, 8546.882, 1904.301, 627.179, 
433.699, 23.118, 5.124, 985.67, 600.371, 143.356, 9.926, 140139.415, 
108214.936, 64444.203, 100210.999, 52974.059, 7322.893, 145.791, 
26.995, 4.847, 5.187, 1.958, 125.722, 55.22, 2.75, 3.366, 54.31, 
107976.895, 123610.469, 66757.2, 67763.201, 50046.64, 40199.706, 
52383.95, 45614.873, 28690.458, 52907.343, 39328.574, 452.078, 
5.82, 0.32, 970.324, 1700.981, 804.478, 332.216, 69.342, 1.632, 
1530.58, 308.752, 62.569, 19.822, 55.241, 37.029, 16.917, 0.198, 
874217.786, 1032751.313, 509259.955, 428750.075, 333280.441, 
192964.08, 315316.932, 119947.132, 141486.749, 66556.728, 1273.093, 
5.064, 22.324, 158.252, 33.583, 8.435, 0.077), Total_Year = c(33924937.48, 
42096736.31, 34904881.111, 28057499.527, 34904881.111, 33924937.48, 
42096736.31, 28057499.527, 33924937.48, 34904881.111, 42096736.31, 
42096736.31, 34904881.111, 42096736.31, 42096736.31, 33924937.48, 
28057499.527, 34904881.111, 28057499.527, 33924937.48, 34904881.111, 
42096736.31, 42096736.31, 34904881.111, 28057499.527, 33924937.48, 
42096736.31, 58672369.19, 33924937.48, 28057499.527, 34904881.111, 
34904881.111, 42096736.31, 28057499.527, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 52612114.76, 40639411.73, 
28057499.527, 67712526.544, 58672369.19, 42096736.31, 34904881.111, 
33924937.48, 28057499.527, 34904881.111, 42096736.31, 34904881.111, 
42096736.31, 28057499.527, 28057499.527, 42096736.31, 44177783.072, 
67712526.544, 40639411.73, 42096736.31, 33924937.48, 28057499.527, 
58672369.19, 52612114.76, 34904881.111, 70863076.416, 66427390.221, 
42096736.31, 28057499.527, 33924937.48, 28057499.527, 66427390.221, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 44177783.072, 58672369.19, 52612114.76, 
67712526.544, 66427390.221, 40639411.73, 70863076.416, 34904881.111, 
42096736.31, 28057499.527, 33924937.48, 33924937.48, 33924937.48, 
34904881.111, 28057499.527, 42096736.31, 33924937.48), pct_by_partner_year = c(0.000116687024179005, 
4.59726850497024e-05, 6.44035999679013e-06, 2.7621848456389e-06, 
0.000113803567683494, 6.65881846158674e-06, 5.65364493454718e-06, 
0.0077309918437765, 0.00684588733986371, 0.00642202158738646, 
0.000582529719629944, 2.8522401146684e-05, 1.71179497245645e-05, 
7.74406827169068e-07, 0.00100246726228929, 0.000458267609458834, 
0.000167577299448064, 0.000156923611416451, 0.00172198880208503, 
0.000419249114560196, 0.000282768474948037, 0.00014576426910659, 
0.000142778289407966, 8.73631395649993e-05, 3.5284683834613e-06, 
2.15475710288619e-06, 0.0955420669759755, 0.0774398658640565, 
0.0474471147057807, 0.0304620231456308, 0.00545568682484317, 
0.00179682319502973, 0.00103024376238159, 8.23950829180388e-05, 
1.51039335091503e-05, 0.00351303578942051, 0.00142616994243657, 
0.000410704736521284, 2.92587127267419e-05, 0.2663633948935, 
0.266280763902189, 0.229686194730164, 0.147994771595005, 0.0902879153020956, 
0.0173953936620509, 0.000417680838208199, 7.95727332317548e-05, 
1.72752386410474e-05, 1.48603858110989e-05, 4.65119192514428e-06, 
0.000360184581635431, 0.000131174064405754, 9.80130106517029e-06, 
1.19967925037684e-05, 0.000129012376636663, 0.244414471464133, 
0.18255184868885, 0.164267141570654, 0.160970200874938, 0.147521686751831, 
0.143276153177212, 0.0892821454514031, 0.0867003221749986, 0.0821961201035531, 
0.0746613690455784, 0.0592053577133712, 0.00107390272887404, 
2.07431171633786e-05, 9.43258923288073e-07, 0.00345834096536738, 
0.0025606620918584, 0.00191102225615742, 0.000951775194258733, 
0.000204398313308255, 4.81062050876917e-06, 0.00545515468521031, 
0.000733434529761055, 0.000179255731601051, 5.84289949294256e-05, 
0.000162833019316739, 0.000131975409869888, 4.01860131755188e-05, 
5.6725590719059e-07, 1.97886296054109, 1.76020046106476, 0.967951882039117, 
0.633191666126054, 0.50172141324715, 0.474820062066878, 0.444966473299775, 
0.34363999584631, 0.336099093188823, 0.237215465105691, 0.00375267603882995, 
1.49270724610338e-05, 6.58041006358842e-05, 0.000453380716286491, 
0.00011969348860786, 2.00371827827334e-05, 2.26971678416193e-07
)), row.names = c(NA, -100L), groups = structure(list(Year = c(2007L, 
2007L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L, 
2011L, 2012L, 2012L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 
2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L
), ReporterName = c("Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola"), PartnerName = c("Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Bahamas, The", 
"Belgium", "Brazil", "Belgium", "Benin", "Brazil", "Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Australia", "Bahamas, The", 
"Belgium", "Brazil", "Afghanistan", "Albania", "Algeria", "Antigua and Barbuda", 
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", 
"Bahamas, The", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", 
"Botswana", "Brazil", "British Virgin Islands", "Brunei", "Bulgaria", 
"Afghanistan", "Algeria", "Antigua and Barbuda", "Argentina", 
"Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas, The", 
"Bahrain", "Bangladesh", "Barbados", "Belgium", "Belize", "Benin", 
"Bolivia", "Botswana", "Brazil", "Bulgaria", "Afghanistan", "Albania", 
"Algeria", "Andorra", "Antigua and Barbuda", "Argentina", "Armenia", 
"Australia", "Austria", "Azerbaijan", "Bahamas, The", "Bahrain", 
"Bangladesh", "Belgium", "Benin", "Bolivia", "Botswana", "Brazil", 
"Bulgaria", "Afghanistan", "Albania", "Algeria", "American Samoa", 
"Andorra", "Antigua and Barbuda", "Argentina", "Armenia", "Australia", 
"Austria", "Azerbaijan", "Bahamas, The", "Bahrain", "Bangladesh", 
"Belarus", "Belgium", "Belize", "Benin", "Bolivia", "Botswana", 
"Brazil", "Bulgaria"), .rows = structure(list(56L, 84L, 41L, 
    58L, 89L, 40L, 63L, 86L, 66L, 71L, 88L, 65L, 90L, 43L, 57L, 
    87L, 28L, 44L, 62L, 85L, 1L, 6L, 9L, 16L, 20L, 26L, 29L, 
    35L, 39L, 47L, 60L, 69L, 74L, 75L, 79L, 80L, 94L, 95L, 96L, 
    100L, 4L, 8L, 17L, 19L, 25L, 30L, 34L, 36L, 42L, 48L, 53L, 
    54L, 61L, 68L, 70L, 76L, 81L, 93L, 98L, 3L, 5L, 10L, 13L, 
    18L, 21L, 24L, 31L, 32L, 38L, 46L, 49L, 51L, 64L, 73L, 78L, 
    83L, 91L, 97L, 2L, 7L, 11L, 12L, 14L, 15L, 22L, 23L, 27L, 
    33L, 37L, 45L, 50L, 52L, 55L, 59L, 67L, 72L, 77L, 82L, 92L, 
    99L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), row.names = c(NA, 100L), class = c("tbl_df", "tbl", 
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"))

1 answer

  • answered 2021-05-15 13:04 AnilGoyal

    The issue with mean is duplicated rows for any ReporterName in the data.

    Problem-1

    total_trade %>%
      # create year_group variable for average values with above predefined labels and cuts, 
      # chose right = FALSE to take cut before year_group_break
      mutate(year_group = cut(Year, breaks = year_group_break,
                              labels  = year_group_labels,
                              include.lowest = TRUE, right = FALSE)) %>%
      # add column with mean of total trade per four year period: "avg_year_group_total"
      group_by(ReporterName, year_group) %>%
      mutate(dup = !duplicated(paste0(ReporterName, year_group, Total_Year)),
             total_year_group = sum(Total_Year * dup)/sum(dup)) %>%
      arrange(ReporterName,PartnerName, desc(Year))
    
    # A tibble: 100 x 10
    # Groups:   ReporterName, year_group [3]
        Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group dup   total_year_group
       <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>      <lgl>            <dbl>
     1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018  TRUE         34746014.
     2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018  TRUE         34746014.
     3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018  TRUE         34746014.
     4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018  TRUE         34746014.
     5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018  FALSE        34746014.
     6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018  FALSE        34746014.
     7  2015 Angola       Albania     ALB                            2.26   33924937.          0.00000666 2015-2018  FALSE        34746014.
     8  2018 Angola       Algeria     DZA                          245.     42096736.          0.000583   2015-2018  FALSE        34746014.
     9  2017 Angola       Algeria     DZA                         2242.     34904881.          0.00642    2015-2018  FALSE        34746014.
    10  2016 Angola       Algeria     DZA                         2169.     28057500.          0.00773    2015-2018  FALSE        34746014.
    # ... with 90 more rows
    

    Problem-2

    Use complete from tidyr. If you can show a desired output, I may show you how.