Exploratory Data Analysis

Oct 20, 2019 Statistics

Data Wrangling and Data Exploration

Introduction

The first two datasets used in this project are both from the World Health Organization. They contain information about tuberculosis cases in different countries, including new cases, previous cases, drug-resistant cases, total population, etc. The third dataset is from The World Bank Group and lists the GDP per capita of countries for different years in US dollars. This data is interesting to me because I am currently in a research stream at UT trying to identify potential drugs to treat tuberculosis.

The first two datasets include information about the resistance of Mycobacterium tuberculosis to drugs. Tuberculosis is linked with poverty, which is why GDP per capita was also considered. I expect that as the number of TB (tuberculosis) cases increases, the number of multidrug-resistant (MDR) and extensively drug-resistant (XDR) TB cases will also increase. Furthermore, as GDP per capita increases, TB cases (and mortality) per 100,000 individuals should decrease, since higher GDP is assumed to mean more access to healthcare and resources for treatment and prevention.

Find data:

Tidying:

join1 <- left_join(data1a, data1b, by = c("country", "year"))
head(join1)

##       country region year new.pul.TB prev.treated.pul.TB prev.unk.pul.TB
## 1 Afghanistan    EMR 2017      19354                2233             125
## 2 Afghanistan    EMR 2018      20485                1712              NA
## 3     Albania    EUR 2017        195                  15               0
## 4     Albania    EUR 2018        198                  10               0
## 5     Algeria    AFR 2017       6278                 419               0
## 6     Algeria    AFR 2018       6137                 362              21
##   new.MDR prev.MDR MDR.tested XDR pop.number TB.100k TB.num TB_mort.100k
## 1      NA       NA         NA   5   36296113     189  69000        30.00
## 2      NA       10         10   8   37171921     189  70000        29.00
## 3       0        0          0   0    2884169      20    580         0.34
## 4       1        1         NA  NA    2882740      18    510         0.34
## 5      11       28         39   4   41389189      70  29000         7.70
## 6       2        5          7   4   42228408      69  29000         7.70
##   TB_mort.num
## 1       11000
## 2       11000
## 3          10
## 4          10
## 5        3200
## 6        3300

First, a left join was used to join the first two tuberculosis datasets, since dataset 1a includes information for 2017 and 2018 only while dataset 1b contains data for many years (2000-2018). A left join was chosen over a full join or right join because those would keep the dataset 1b rows for 2000-2016, which have no match in dataset 1a and would be filled with NAs.
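The difference between the join types can be seen on a small made-up example. The data frames `recent` and `history` below, and all of their values, are hypothetical stand-ins for datasets 1a and 1b, not the WHO data.

```r
library(dplyr)

# Hypothetical stand-in for dataset 1a: 2017 and 2018 only
recent <- data.frame(country = c("A", "A"),
                     year = c(2017, 2018),
                     new_cases = c(10, 12))

# Hypothetical stand-in for dataset 1b: includes an earlier year
history <- data.frame(country = c("A", "A", "A"),
                      year = c(2016, 2017, 2018),
                      incidence = c(50, 48, 47))

left <- left_join(recent, history, by = c("country", "year"))
full <- full_join(recent, history, by = c("country", "year"))

nrow(left)  # only the years present in `recent`
nrow(full)  # also keeps 2016, where new_cases is NA
```

In this sketch the full join carries along the 2016 row with an NA in `new_cases`, which is the kind of row the left join avoids.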

data2pivot <- data2 %>% pivot_longer(cols = c(3:4), names_to = "year", 
    values_to = "GDP") %>% separate(col = "year", into = c(NA, 
    "year"), sep = 1) %>% mutate(year = as.numeric(year))
head(data2pivot)

## # A tibble: 6 x 4
##   country     Country.Code  year    GDP
##   <fct>       <fct>        <dbl>  <dbl>
## 1 Aruba       ABW           2017 25630.
## 2 Aruba       ABW           2018    NA 
## 3 Afghanistan AFG           2017   556.
## 4 Afghanistan AFG           2018   521.
## 5 Angola      AGO           2017  4096.
## 6 Angola      AGO           2018  3432.

Then, to make it easier to join the first two datasets with the third, pivot_longer was used on the GDP data (dataset 2). Year is a single column in datasets 1a and 1b, but dataset 2 stores GDP in two columns, one for 2017 and one for 2018. Additionally, when the GDP data was imported, an X was inserted before the year in each column header. This was removed using the separate function. Mutate was then used to convert ‘year’ to numeric, because it was a character column and could not be joined to the numeric year column in the tuberculosis data (1a and 1b).
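As a possible alternative to the separate step, pivot_longer can strip the prefix itself through its names_prefix argument. The sketch below uses a made-up table (`gdp_wide` and its values are hypothetical); the X prefix imitates what read.csv adds to column names that start with a digit.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide GDP table with one column per year
gdp_wide <- data.frame(country = c("A", "B"),
                       X2017 = c(100, 200),
                       X2018 = c(110, NA))

gdp_long <- gdp_wide %>%
  # names_prefix removes the leading "X" from the old column names
  pivot_longer(cols = c(X2017, X2018), names_to = "year",
               names_prefix = "X", values_to = "GDP") %>%
  # the year is still character after pivoting, so convert it
  mutate(year = as.numeric(year))

gdp_long
```

The result has one row per country and year, with a numeric year column that can be used as a join key.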

Joining/Merging

nrow(join1)

## [1] 432

nrow(data2pivot)

## [1] 528

# Joining of all three datasets and deleting rows with NAs
join2 <- inner_join(join1, data2pivot, by = c("country", "year"))
nrow(join1) - nrow(join2)

## [1] 72

nrow(data2pivot) - nrow(join2)

## [1] 168

join2 <- join2 %>% na.omit()
nrow(join2)

## [1] 222

Initially, the tuberculosis dataset had 432 cases while the GDP dataset had 528 cases. An inner join was chosen so that every remaining country-year would have both tuberculosis and GDP information. This left 360 observations, with 72 cases dropped from the tuberculosis dataset and 168 cases dropped from the GDP dataset because they had no match in the other dataset. Removing rows containing NAs with na.omit then dropped a further 138 rows, leaving 222 observations. One problem is that the rows with NAs are more likely to come from smaller and poorer countries with less complete documentation, so dropping them may skew the results.
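One way to see which rows an inner join drops, rather than only how many, is anti_join. This is a minimal sketch on made-up data; `tb`, `gdp`, and their values are hypothetical.

```r
library(dplyr)

# Hypothetical tuberculosis and GDP tables
tb <- data.frame(country = c("A", "B", "C"), year = 2017,
                 TB.100k = c(5, 50, 500))
gdp <- data.frame(country = c("B", "C", "D", "E"), year = 2017,
                  GDP = c(900, NA, 3000, 4000))

both <- inner_join(tb, gdp, by = c("country", "year"))

# Rows of each table that have no partner in the other
only_tb <- anti_join(tb, gdp, by = c("country", "year"))
only_gdp <- anti_join(gdp, tb, by = c("country", "year"))

# Rows that survive the NA filter applied after the join
complete <- na.omit(both)

c(joined = nrow(both), dropped_tb = nrow(only_tb),
  dropped_gdp = nrow(only_gdp), complete = nrow(complete))
```

Inspecting `only_tb` and `only_gdp` shows whether the unmatched rows are real gaps or just differences in how country names are spelled.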

Wrangling: filter, select, arrange, group_by, mutate, summarize

# Determining the quantile of GDP and population for each
# country
ntile <- join2 %>% mutate(ntileGDP = ntile(n = 5, x = GDP)) %>% 
    mutate(ntilepop = ntile(n = 5, x = pop.number))

# Mean and standard deviation of numeric variables (excluding
# year)
join2 %>% select_if(is.numeric) %>% select(-year) %>% summarize_all(.funs = mean)

##   new.pul.TB prev.treated.pul.TB prev.unk.pul.TB  new.MDR prev.MDR MDR.tested
## 1   16571.29            2829.351        138.8468 135.6441 197.5495   290.7928
##        XDR pop.number TB.100k   TB.num TB_mort.100k TB_mort.num      GDP
## 1 38.74324   33335846 98.5732 56241.17     16.46901    8344.464 14741.78

join2 %>% select_if(is.numeric) %>% select(-year) %>% summarize_all(.funs = sd)

##   new.pul.TB prev.treated.pul.TB prev.unk.pul.TB  new.MDR prev.MDR MDR.tested
## 1    84215.1            18464.84        1363.483 667.0942   1133.7   1601.977
##        XDR pop.number TB.100k   TB.num TB_mort.100k TB_mort.num      GDP
## 1 261.0853  131603854 135.356 276737.8      32.3854    44225.92 18664.98

join2 %>% summarize_all(.funs = n_distinct)

##   country region year new.pul.TB prev.treated.pul.TB prev.unk.pul.TB new.MDR
## 1     130      6    2        205                 148              44      77
##   prev.MDR MDR.tested XDR pop.number TB.100k TB.num TB_mort.100k TB_mort.num
## 1       77         86  43        222     152    142          139         132
##   Country.Code GDP
## 1          130 222

# top 10 Observations for extensively drug resistant TB cases
# for 2017 and 2018
ntile %>% arrange(desc(XDR)) %>% head(10)

##               country region year new.pul.TB prev.treated.pul.TB
## 1  Russian Federation    EUR 2017      32978               26058
## 2             Ukraine    EUR 2017      12840                7212
## 3             Ukraine    EUR 2018      12931                6774
## 4               India    SEA 2018     825939              208197
## 5               India    SEA 2017     868769              176450
## 6             Belarus    EUR 2017       1690                 700
## 7          Tajikistan    EUR 2017       2432                 652
## 8             Belarus    EUR 2018       1529                 612
## 9            Pakistan    EMR 2017     128806               15241
## 10               Peru    AMR 2018      17387                3075
##    prev.unk.pul.TB new.MDR prev.MDR MDR.tested  XDR pop.number TB.100k  TB.num
## 1                0    8206    14611      20477 3562  145530082      59   85000
## 2                0    2594     2414       5008 1001   44487709      84   37000
## 3                0    2755     2299       5054  972   44246156      80   36000
## 4                0    3232     5182       6832  493 1352642280     199 2690000
## 5                0    2152     5357       6787  466 1338676785     204 2740000
## 6                0     629      459       1088  343    9450231      37    3500
## 7                0     413      133        508  279    8880268      85    7500
## 8                0     559      425        984  185    9452617      31    2900
## 9              116     535     2102       2600  123  207906209     267  554000
## 10               0    1198      481        758   91   31989260     123   39000
##    TB_mort.100k TB_mort.num Country.Code        GDP ntileGDP ntilepop
## 1           8.1       12000          RUS 10750.5871        4        5
## 2          14.0        6400          UKR  2640.6757        2        5
## 3          13.0        5700          UKR  3095.1736        2        5
## 4          33.0      449000          IND  2015.5905        2        5
## 5          34.0      454000          IND  1981.4990        2        5
## 6           6.0         560          BLR  5761.7471        3        3
## 7           9.2         820          TJK   806.0416        1        3
## 8           5.9         560          BLR  6289.9386        3        3
## 9          21.0       45000          PAK  1466.8431        1        5
## 10          8.3        2700          PER  6947.2566        3        5

# new variable created using mutate; proportion of XDR/MDR:
# richest countries as well as the poorest countries have the
# lowest mean percentage of MDR TB cases developing into XDR
# cases
ntile %>% mutate(perc.XDR.MDR = XDR/MDR.tested) %>% group_by(ntileGDP) %>% 
    summarize(mean(perc.XDR.MDR, na.rm = T))

## # A tibble: 5 x 2
##   ntileGDP `mean(perc.XDR.MDR, na.rm = T)`
##      <int>                           <dbl>
## 1        1                          0.0709
## 2        2                          0.169 
## 3        3                          0.150 
## 4        4                          0.129 
## 5        5                          0.0803

# Group by quantile of GDP per capita; mean TB per 100,000
# people and mortality due to TB per 100,000 people appear to
# decrease as the quantile of GDP per capita increases
ntile %>% group_by(ntileGDP) %>% summarize(mean(TB.100k), mean(TB_mort.100k))

## # A tibble: 5 x 3
##   ntileGDP `mean(TB.100k)` `mean(TB_mort.100k)`
##      <int>           <dbl>                <dbl>
## 1        1           220                 49.0  
## 2        2           163                 19.4  
## 3        3            64.0                9.93 
## 4        4            29.0                2.61 
## 5        5            14.9                0.811

# Group by region in world: AFR=Africa; AMR=Americas;
# EMR=Eastern Mediterranean; EUR=Europe; SEA=South-East
# Asia; WPR=Western Pacific
join2 %>% group_by(region) %>% select_if(is.numeric) %>% select(-year) %>% 
    summarize(mean(TB.100k), mean(TB_mort.100k))

## # A tibble: 6 x 3
##   region `mean(TB.100k)` `mean(TB_mort.100k)`
##   <fct>            <dbl>                <dbl>
## 1 AFR              210                  53.0 
## 2 AMR               30.7                 3.79
## 3 EMR               76.6                10.2 
## 4 EUR               22.8                 2.09
## 5 SEA              198.                 26.4 
## 6 WPR              153.                 11.5

# Maximum TB cases per 100,000 people
ntile %>% filter(TB.100k == max(TB.100k))

##   country region year new.pul.TB prev.treated.pul.TB prev.unk.pul.TB new.MDR
## 1 Lesotho    AFR 2018       3595                 728               0      51
##   prev.MDR MDR.tested XDR pop.number TB.100k TB.num TB_mort.100k TB_mort.num
## 1        5          5   0    2108328     611  13000          200        4200
##   Country.Code      GDP ntileGDP ntilepop
## 1          LSO 1324.283        1        2

# Minimum TB cases per 100,000 people
ntile %>% filter(TB.100k == min(TB.100k))

##      country region year new.pul.TB prev.treated.pul.TB prev.unk.pul.TB new.MDR
## 1   Barbados    AMR 2017          0                   0               0       0
## 2 San Marino    EUR 2017          0                   0               0       0
##   prev.MDR MDR.tested XDR pop.number TB.100k TB.num TB_mort.100k TB_mort.num
## 1        0          0   0     286232       0      0          0.9           3
## 2        0          0   0      33671       0      0          0.0           0
##   Country.Code      GDP ntileGDP ntilepop
## 1          BRB 16327.61        4        1
## 2          SMR 48494.55        5        1

# Group by two variables: quantile of population and year
ntile %>% group_by(ntilepop, year) %>% summarize(mean(TB.100k), 
    mean(TB_mort.100k))

## # A tibble: 10 x 4
## # Groups:   ntilepop [5]
##    ntilepop  year `mean(TB.100k)` `mean(TB_mort.100k)`
##       <int> <dbl>           <dbl>                <dbl>
##  1        1  2017            89.1                 9.21
##  2        1  2018            91.8                10.4 
##  3        2  2017           117.                 24.1 
##  4        2  2018           154.                 31.6 
##  5        3  2017            23.3                 2.31
##  6        3  2018            20.9                 1.93
##  7        4  2017           106.                 22.6 
##  8        4  2018           147.                 33.8 
##  9        5  2017           124.                 15.1 
## 10        5  2018           124.                 16.4

Since most of the variables are numeric, the quantiles (quintiles, from ntile with n = 5) of GDP and population were found and used later to group the data. The mean and standard deviation of the numeric variables were found using the summarize_all function. Using arrange, the country with the most extensively drug-resistant TB cases in either 2017 or 2018 is Russia, followed by Ukraine and India. Using mutate to create a new variable, the mean proportion XDR/MDR.tested is lowest at the extremes of GDP per capita (about 0.07 in the 1st quantile and 0.08 in the 5th, versus 0.13 to 0.17 in between). Grouping by quantile of GDP per capita, mean TB cases and mortality due to TB per 100,000 people appear to decrease as the quantile of GDP per capita increases. Grouping by region, AFR has the highest mean TB cases and mortality per 100,000 people, while EUR has the lowest.

The country with the maximum TB per 100,000 people is Lesotho, which is in AFR and in the 1st (lowest) quantile of GDP. The countries with the minimum TB per 100,000 people are Barbados and San Marino, which both have small populations (1st quantile) and relatively high GDP. Grouping by two variables (population quantile and year), countries in the 2nd and 4th quantiles of population saw the greatest change in mean TB cases per 100,000, with both reporting an increase from 2017 to 2018. Countries in the 2nd and 4th quantiles of population also had the greatest mean mortality due to TB per 100,000 (as well as the greatest change between 2017 and 2018). Interestingly, the countries in the 3rd quantile had the lowest mean TB cases and mortality per 100,000 (for both 2017 and 2018).
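How ntile assigns rows to five equal-sized groups, and how those groups feed a grouped summary, can be sketched on made-up numbers. The data frame `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical GDP and TB rates for ten countries
toy <- data.frame(
  GDP = c(500, 800, 1500, 3000, 6000, 9000, 15000, 25000, 40000, 60000),
  TB.100k = c(300, 250, 200, 150, 100, 80, 40, 20, 10, 5)
)

by_quintile <- toy %>%
  # ntile ranks the rows by GDP and splits them into 5 buckets,
  # 1 = lowest GDP, 5 = highest GDP
  mutate(gdp_quintile = ntile(GDP, 5)) %>%
  group_by(gdp_quintile) %>%
  summarize(mean_TB = mean(TB.100k), n = n())

by_quintile
```

Because ntile works on ranks, each bucket holds the same number of rows (two here), regardless of how unevenly the GDP values are spread.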

# Correlation matrix
join2 %>% select_if(is.numeric) %>% cor() %>% round(2)

##                      year new.pul.TB prev.treated.pul.TB prev.unk.pul.TB
## year                 1.00       0.00                0.01            0.06
## new.pul.TB           0.00       1.00                0.97            0.17
## prev.treated.pul.TB  0.01       0.97                1.00            0.06
## prev.unk.pul.TB      0.06       0.17                0.06            1.00
## new.MDR             -0.03       0.39                0.47            0.02
## prev.MDR            -0.06       0.47                0.53            0.08
## MDR.tested          -0.05       0.43                0.49            0.07
## XDR                 -0.06       0.18                0.26            0.01
## pop.number           0.00       0.99                0.98            0.14
## TB.100k              0.04       0.19                0.13            0.11
## TB.num               0.00       0.99                0.96            0.21
## TB_mort.100k         0.05       0.11                0.08            0.04
## TB_mort.num          0.01       0.99                0.97            0.15
## GDP                  0.01      -0.13               -0.10           -0.02
##                     new.MDR prev.MDR MDR.tested   XDR pop.number TB.100k TB.num
## year                  -0.03    -0.06      -0.05 -0.06       0.00    0.04   0.00
## new.pul.TB             0.39     0.47       0.43  0.18       0.99    0.19   0.99
## prev.treated.pul.TB    0.47     0.53       0.49  0.26       0.98    0.13   0.96
## prev.unk.pul.TB        0.02     0.08       0.07  0.01       0.14    0.11   0.21
## new.MDR                1.00     0.96       0.98  0.96       0.43    0.04   0.38
## prev.MDR               0.96     1.00       0.99  0.93       0.51    0.06   0.47
## MDR.tested             0.98     0.99       1.00  0.96       0.47    0.05   0.42
## XDR                    0.96     0.93       0.96  1.00       0.23    0.00   0.17
## pop.number             0.43     0.51       0.47  0.23       1.00    0.13   0.98
## TB.100k                0.04     0.06       0.05  0.00       0.13    1.00   0.23
## TB.num                 0.38     0.47       0.42  0.17       0.98    0.23   1.00
## TB_mort.100k           0.01     0.03       0.02 -0.01       0.07    0.86   0.13
## TB_mort.num            0.38     0.46       0.42  0.17       0.98    0.21   0.98
## GDP                   -0.09    -0.08      -0.09 -0.06      -0.10   -0.40  -0.13
##                     TB_mort.100k TB_mort.num   GDP
## year                        0.05        0.01  0.01
## new.pul.TB                  0.11        0.99 -0.13
## prev.treated.pul.TB         0.08        0.97 -0.10
## prev.unk.pul.TB             0.04        0.15 -0.02
## new.MDR                     0.01        0.38 -0.09
## prev.MDR                    0.03        0.46 -0.08
## MDR.tested                  0.02        0.42 -0.09
## XDR                        -0.01        0.17 -0.06
## pop.number                  0.07        0.98 -0.10
## TB.100k                     0.86        0.21 -0.40
## TB.num                      0.13        0.98 -0.13
## TB_mort.100k                1.00        0.16 -0.32
## TB_mort.num                 0.16        1.00 -0.13
## GDP                        -0.32       -0.13  1.00

The correlation matrix shows that the strongest positive correlations (all r = 0.99) are between new pulmonary TB cases and population; new pulmonary TB cases and total number of TB cases; new pulmonary TB cases and total mortality due to TB; and previous MDR TB cases and the number of MDR cases that were tested for additional resistance. The strongest negative correlations are between GDP and TB cases per 100,000 people (r = -0.40) and between GDP and TB mortality per 100,000 people (r = -0.32).
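Reading the strongest pairs off a 14 by 14 matrix by eye is error-prone, so one option is to reshape the matrix into a sorted list of pairs. This is a sketch in base R on made-up data; `toy` and its values are hypothetical.

```r
# Hypothetical data: y rises with x, z falls as x rises
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 8, 10),
                  z = c(9, 7, 6, 2, 1))
cm <- cor(toy)

# Keep each pair once by blanking the diagonal and lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation
pairs <- na.omit(as.data.frame(as.table(cm)))
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[order(-abs(pairs$r)), ]

pairs
```

Sorting by the absolute value puts strong negative correlations next to strong positive ones, so both ends of the scale show up at the top of the list.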

Visualizing

# GGPlot 1: Scatterplot
ntile %>% ggplot(aes(x = TB.100k, y = TB_mort.100k)) + geom_point(aes(color = ntileGDP)) + 
    ggtitle("TB Cases and Mortality") + xlab("TB Cases per 100,000 people") + 
    ylab("TB mortality per 100,000 people") + scale_fill_brewer() + 
    scale_y_continuous(breaks = c(25, 50, 75, 100, 125, 150, 
        175, 200)) + scale_x_continuous(breaks = c(100, 200, 
    300, 400, 500, 600)) + labs(color = "GDP") + theme_classic()

This graph shows that as TB cases per 100,000 people increase, TB mortality per 100,000 people also increases. In contrast, TB cases and mortality have a negative relationship with GDP per capita. This suggests that countries with lower GDP have higher occurrences of TB and of mortality due to TB, which is expected if they have less funding for preventative measures and less access to treatment resources.

# GGPlot 2: Boxplot
ntile %>% ggplot(aes(group = region, x = region, y = TB.100k)) + 
    geom_boxplot() + geom_jitter(alpha = 0.3, aes(color = ntilepop, 
    size = ntilepop)) + ggtitle("TB cases for each Region") + 
    xlab("Region") + ylab("TB Cases per 100,000 people") + labs(color = "Population", 
    size = "Population") + theme_light()

The boxplot shows that the median of TB cases per 100,000 people is highest in SEA. Previously, summary statistics showed that AFR had the highest mean. The boxplot suggests that the AFR distribution is skewed toward high values, which pulls its mean above its median. The spread is largest for AFR as well. The countries in AMR, EMR, and EUR all have relatively low median values. When grouped by region, there is not an obvious relationship between the population and TB cases per 100,000 people.
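How skew separates the mean and the median can be checked on a few made-up numbers. The vector `rates` below is hypothetical and only illustrates the effect of one very high value.

```r
# Hypothetical TB rates per 100,000 for six countries in one region;
# the last value is an extreme outlier on the high side
rates <- c(20, 30, 40, 50, 60, 600)

mean(rates)    # pulled upward by the single large value
median(rates)  # stays near the bulk of the data
```

A region can therefore rank first by mean and not by median when a few of its countries have very high rates.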

# GGPlot 3: Grouped (dodged) bar plot
ntile$year <- factor(ntile$year, levels = c("2017", "2018"))
ggplot(ntile, aes(x = ntileGDP, y = TB_mort.100k, fill = year)) + 
    geom_bar(stat = "summary", fun = "mean", position = "dodge") + 
    geom_errorbar(stat = "summary", position = "dodge") + xlab("Quantile of GDP per capita") + 
    ylab("TB Mortality per 100,000 people") + ggtitle("GDP per Capita and TB mortality per 100,000 people") + 
    theme_grey() + scale_y_continuous(breaks = c(10, 20, 30, 
    40, 50, 60))

A grouped bar plot was used to show that higher GDP per capita is associated with lower TB mortality on average. The biggest change in TB mortality between 2017 and 2018 occurs in the countries in the 1st quantile of GDP per capita (however, the error bars overlap). As the quantile of GDP increases, the spread of the data (error bars) also decreases.
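When no summary function is given, geom_errorbar(stat = "summary") falls back to ggplot2's default, mean_se(), which draws the mean plus or minus one standard error. The calculation can be reproduced by hand; the vector `mort` below is hypothetical.

```r
# Hypothetical TB mortality rates for four countries in one group
mort <- c(40, 55, 60, 45)

m <- mean(mort)
se <- sd(mort) / sqrt(length(mort))  # standard error of the mean

# Lower and upper ends of the error bar
c(ymin = m - se, y = m, ymax = m + se)
```

Since the standard error shrinks with group size and grows with spread, overlapping bars hint that a difference between two years may not be distinguishable from noise, although this is not a formal test.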

Dimensionality Reduction

# Prepare data by scaling
joinPCA <- join2 %>% select_if(is.numeric) %>% scale %>% na.omit
join_pca <- princomp(joinPCA)
summary(join_pca, loadings = T)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.5078967 1.6468586 1.4044384 1.01775753 0.96840866
## Proportion of Variance 0.4512861 0.1946011 0.1415266 0.07432267 0.06728992
## Cumulative Proportion  0.4512861 0.6458872 0.7874138 0.86173646 0.92902637
##                            Comp.6      Comp.7      Comp.8      Comp.9
## Standard deviation     0.86661846 0.371251458 0.202421301 0.145441708
## Proportion of Variance 0.05388756 0.009889379 0.002939985 0.001517786
## Cumulative Proportion  0.98291394 0.992803315 0.995743300 0.997261086
##                            Comp.10      Comp.11      Comp.12     Comp.13
## Standard deviation     0.127383707 0.1098210941 0.0756806126 0.052799805
## Proportion of Variance 0.001164288 0.0008653747 0.0004109623 0.000200031
## Cumulative Proportion  0.998425374 0.9992907485 0.9997017107 0.999901742
##                             Comp.14
## Standard deviation     3.700566e-02
## Proportion of Variance 9.825826e-05
## Cumulative Proportion  1.000000e+00
## 
## Loadings:
##                     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## year                                      0.729  0.674                     
## new.pul.TB          -0.356  0.246 -0.118                              0.147
## prev.treated.pul.TB -0.362  0.188 -0.138                             -0.465
## prev.unk.pul.TB                           0.656 -0.728  0.101              
## new.MDR             -0.297 -0.385                                    -0.616
## prev.MDR            -0.320 -0.344                                     0.524
## MDR.tested          -0.310 -0.372                                     0.177
## XDR                 -0.235 -0.474  0.124                                   
## pop.number          -0.361  0.210 -0.147                                   
## TB.100k                     0.213  0.602               -0.241  0.708       
## TB.num              -0.354  0.253 -0.100                       0.131  0.239
## TB_mort.100k                0.192  0.602               -0.358 -0.670       
## TB_mort.num         -0.354  0.255 -0.101                      -0.115       
## GDP                               -0.404  0.157        -0.890              
##                     Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## year                                                              
## new.pul.TB          -0.304  0.175                  -0.790  -0.156 
## prev.treated.pul.TB  0.688         -0.130   0.264  -0.111         
## prev.unk.pul.TB                                                   
## new.MDR             -0.402 -0.197  -0.295  -0.212           0.203 
## prev.MDR             0.382 -0.267  -0.185  -0.219  -0.151   0.416 
## MDR.tested                         -0.102           0.177  -0.821 
## XDR                         0.537   0.512   0.332           0.201 
## pop.number          -0.139 -0.605   0.616           0.164         
## TB.100k                                                           
## TB.num              -0.307  0.115  -0.437   0.509   0.358   0.199 
## TB_mort.100k                                                      
## TB_mort.num                 0.424          -0.667   0.390         
## GDP

# Choose the number of PCs to keep: convert standard deviations
# to eigenvalues
eigval <- join_pca$sdev^2
# proportion of variance explained by each PC
varprop = round(eigval/sum(eigval), 2)
ggplot() + geom_bar(aes(y = varprop, x = 1:14), stat = "identity") + 
    xlab("") + geom_path(aes(y = varprop, x = 1:14)) + geom_text(aes(x = 1:14, 
    y = varprop, label = round(varprop, 2)), vjust = 1, col = "white", 
    size = 5) + scale_y_continuous(breaks = seq(0, 0.6, 0.2), 
    labels = scales::percent) + scale_x_continuous(breaks = 1:14)

# Plot for PCA (PC1 and PC2)
join2 %>% na.omit %>% mutate(PC1 = join_pca$scores[, 1], PC2 = join_pca$scores[, 
    2]) %>% ggplot(aes(x = PC1, y = PC2, color = region, size = GDP)) + 
    geom_point() + ggtitle("PCA Plot") + theme(legend.position = "none")

# Plot for PCA (PC3 and PC4)
join2 %>% na.omit %>% mutate(PC3 = join_pca$scores[, 3], PC4 = join_pca$scores[, 
    4]) %>% ggplot(aes(x = PC3, y = PC4, color = region, size = GDP)) + 
    geom_point() + ggtitle("PCA Plot") + theme(legend.position = "none")

# Plot of loadings
join_pca$loadings[1:14, 1:2] %>% as.data.frame %>% rownames_to_column %>% 
    ggplot() + geom_hline(aes(yintercept = 0), lty = 2) + geom_vline(aes(xintercept = 0), 
    lty = 2) + ylab("PC2") + xlab("PC1") + geom_segment(aes(x = 0, 
    y = 0, xend = Comp.1, yend = Comp.2), arrow = arrow(), col = "red") + 
    geom_label(aes(x = Comp.1 * 1.1, y = Comp.2 * 1.1, label = rowname)) + 
    ggtitle("Plot of Loadings")

# biplot combining loadings plot and PC score plot
library("factoextra")
fviz_pca_biplot(join_pca)

Based on the scree plot, the cumulative proportion of variance, and Kaiser’s rule, 3 to 4 PCs should be kept (the first four explain about 86% of the variance). High scores on PC1 indicate low TB case counts (new, previously treated, MDR, XDR, deaths, etc.) and low population. For PC2, high scores indicate higher TB case counts and higher TB cases and mortality per 100,000 people but fewer resistant (MDR and XDR) TB cases; GDP per capita contributes little to PC2, since its loading is too small to be printed. For PC3, going by the printed loadings, high scores indicate higher TB cases and mortality per 100,000 people and lower GDP per capita (the sign of a principal component is arbitrary, so only the contrast between these variables matters). On the PCA plot of PC3 and PC4, there appears to be some separation based on GDP along PC3. Finally, high scores on PC4 indicate the later year (2018) and high numbers of confirmed TB cases with unknown treatment history.
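The choice of how many PCs to keep can be checked directly from the standard deviations printed in the summary above. The sketch below retypes those values by hand; the 80% cumulative-variance cutoff is an assumed rule of thumb, not something fixed by the analysis.

```r
# Standard deviations of the 14 components, copied from the
# princomp summary output
sdev <- c(2.5078967, 1.6468586, 1.4044384, 1.01775753, 0.96840866,
          0.86661846, 0.371251458, 0.202421301, 0.145441708,
          0.127383707, 0.1098210941, 0.0756806126, 0.052799805,
          0.03700566)

eigval <- sdev^2                # eigenvalues
prop <- eigval / sum(eigval)    # proportion of variance per PC
cumprop <- cumsum(prop)         # cumulative proportion

# Kaiser's rule: keep components with eigenvalue above 1
kaiser <- sum(eigval > 1)

# Assumed cutoff: first component count reaching 80% of the variance
n_80 <- which(cumprop >= 0.80)[1]

c(kaiser = kaiser, n_80 = n_80)
```

Both criteria point to four components here, while the first three already cover about 79% of the variance, which is why 3 to 4 is a reasonable range.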

The plot of loadings helps visualize which variables contribute to which of the PCs, with a smaller angle between vectors showing higher correlation. GDP is negatively correlated with the other variables and differs mainly along PC1. In contrast, other variables such as new cases of pulmonary TB, previously treated pulmonary TB cases, total TB cases, etc. are almost redundant.

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] factoextra_1.0.6 forcats_0.5.0    stringr_1.4.0    dplyr_0.8.5     
##  [5] purrr_0.3.3      readr_1.3.1      tidyr_1.0.2      tibble_2.1.3    
##  [9] ggplot2_3.3.0    tidyverse_1.3.0  knitr_1.28      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.0.0 xfun_0.12        haven_2.2.0      lattice_0.20-40 
##  [5] colorspace_1.4-1 vctrs_0.2.4      generics_0.0.2   htmltools_0.4.0 
##  [9] yaml_2.2.1       utf8_1.1.4       rlang_0.4.5      ggpubr_0.2.5    
## [13] pillar_1.4.3     withr_2.1.2      glue_1.3.2       DBI_1.1.0       
## [17] dbplyr_1.4.2     modelr_0.1.6     readxl_1.3.1     lifecycle_0.2.0 
## [21] ggsignif_0.6.0   munsell_0.5.0    blogdown_0.18    gtable_0.3.0    
## [25] cellranger_1.1.0 rvest_0.3.5      codetools_0.2-16 evaluate_0.14   
## [29] labeling_0.3     fansi_0.4.1      broom_0.5.5      Rcpp_1.0.4      
## [33] formatR_1.7      backports_1.1.5  scales_1.1.0     jsonlite_1.6.1  
## [37] farver_2.0.3     fs_1.3.2         hms_0.5.3        digest_0.6.25   
## [41] stringi_1.4.6    ggrepel_0.8.2    bookdown_0.18    grid_3.6.1      
## [45] cli_2.0.2        tools_3.6.1      magrittr_1.5     crayon_1.3.4    
## [49] pkgconfig_2.0.3  ellipsis_0.3.0   xml2_1.2.5       reprex_0.3.0    
## [53] lubridate_1.7.4  assertthat_0.2.1 rmarkdown_2.1    httr_1.4.1      
## [57] rstudioapi_0.11  R6_2.4.1         nlme_3.1-145     compiler_3.6.1

## [1] "2020-07-24 14:53:51 CDT"

Tidyverse ggplot2 Statistics R