congogbv_analysis.Rmd

---
title: Data analysis for Sexual violence, conflict, and female empowerment
author: Koen Leuveld
output: pdf_document
latex_engine: xelatex
---

# Intro

This rmarkdown file analyses data, and prepares input tables for the paper. These
input tables do not contain any formatting, just the coefficients. The paper itself
will do the formatting. This means that the input tables can go on GitHub, even if 
the raw data itself can't, and GitHub can thus build the paper. This guarantees that
anyone can see my coefficents and build my paper.

# Load packages and data

For now, I will just use the Stata data as prepared by `congogbv_dataprep.do`:

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

library(tidyverse)
library(haven)
library(here)
library(estimatr)
library(list)

set_flextable_defaults(fonts_ignore=TRUE)

cleandata <- read_dta(here("data/clean/analysis.dta"))
col_labels <- map_chr(cleandata,function(x) coalesce(attributes(x)$label,""))


sample <- 
  cleandata %>%
  filter(!is.na(ball5)) %>%
  mutate(Treatment = ifelse(ball5," Treatment","  Control"))

tibble(Variable = colnames(cleandata),
       label = col_labels) %>%
write_csv(here("tables/cleandata_collabels.csv"))

# to extract:
# col_labels <-
#   read_csv(here("tables/cleandata_collabels.csv")) %>%
#   pull(label) %>%
#   set_names(read_csv(here("tables/cleandata_collabels.csv")) %>% pull(Variable))


dhs <- read_dta(here("data/clean/dhs.dta"))

```


# Tables

## Table 1

Then I create Table 1. For now it's pivoted compared to the orignal, but that's fine.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

wgt_mean <- function(data,weights,digits=2){
   num(sum(data * weights)/sum(weights),digits=digits)
}

dhs_national <-
  dhs %>% 
  summarize(across(.cols = c(agewife,tinroof:eduwife_sec),
                         .fns  = ~wgt_mean(.x,wgt))) %>%
  mutate(label = "DHS National Mean")

dhs_skivu <-
  dhs %>% 
  filter(province == 11) %>%
  summarize(across(.cols = c(agewife,tinroof:eduwife_sec),
                         .fns  = ~wgt_mean(.x,wgt)))%>%
  mutate(label = "DHS South Kivu Mean")

sample_full <-
cleandata %>% 
summarize(across(.cols = c(agewife,tinroof,eduwife_prim,eduwife_sec),
                .fns = ~mean(.x, na.rm = TRUE) %>% num(digits = 2)))%>%
  mutate(label = "Full Sample Mean")

sample_gendermodule <-
cleandata %>% 
  filter(!is.na(ball5)) %>%
  summarize(across(.cols = c(agewife,tinroof,eduwife_prim,eduwife_sec),
                  .fns = ~mean(.x, na.rm = TRUE) %>% num(digits = 2)))%>%
  mutate(label = "Gender Module Mean")

rbind(dhs_national,dhs_skivu,sample_full,sample_gendermodule) %>%
select(label,everything()) %>%
write_csv(here("tables/table1.csv"))


```


## Table 2

This is a simple table that can be done with `tabyl()` from the `janitor` package.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

library(janitor)

cleandata %>% 
  tabyl(riskwifestatus,riskhusbandstatus) %>%
  adorn_totals(c("row", "col")) %>%
  write_csv(here("tables/table2.csv"))

```

## Table 3

This table simply contains payouts. These are defined in `table3.csv`

## Table 4

First I compute means and sds by by group, using a custom function. I create a virtual
"overall" group by simply duplicating the dataset.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

#define function to get summ stats
get_summ_stats <- function(.df, by = NULL, overall = NULL) {
  
  #this expands the data frame to include an "overall" treatment condition
  if (!missing(by)) {

    if (!is.null(overall)) {
      .df <- 
        bind_rows(.df, .df %>% mutate({{ by }} := overall ))
    }

    .df <- 
      .df %>%
      group_by( {{ by }} )
  }

  .df %>%
    summarize(across(everything(),
                       list(
                            #label = ~coalesce(attributes(.x)$label,""),
                            n =  ~sum(!is.na(.x)),
                            nmiss =  ~sum(is.na(.x)),
                            mean = ~mean(.x,na.rm=TRUE),
                            sd = ~sd(.x,na.rm=TRUE),
                            min =  ~min(.x,na.rm=TRUE),
                            max =  ~max(.x,na.rm=TRUE),
                            iqr =  ~IQR(.x,na.rm=TRUE)),
                        .names =  "{.col}-{.fn}"))   %>%
  pivot_longer(cols = - {{ by }} ,
               names_to = c("Variable",".value"),
               names_sep="-") 

}

sample %>%  
  select(Treatment, numballs, 
         victimproplost, victimfamlost, acledviolence10, 
         husbmoreland, wifemoreland, riskwife, riskhusband, barghusbandcloser, bargwifecloser, 
         agewife, agehusband, genderhead, eduwife_prim, eduwife_sec, eduhusband_prim, eduhusband_sec, 
         tinroof, livestockany, terrfe_2, terrfe_3) %>%
  get_summ_stats(by= Treatment, overall = "Overall") %>%
  write_csv(here("tables/summstats.csv"))


```

Then, I compute the difference between the groups:

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}


#this function returns a string to put in a column of differences
get_diffs <- function(.df,y,x){

  reg <-  lm(y~ x) %>% broom::tidy()

  coeff = round(reg[2,2],2)
  p <- reg[2,5]

  stars = case_when(p < 0.001 ~ "***",
                    p < 0.01 ~ "**",
                    p < 0.05 ~ "*",
                    .default = "" )

  paste0(coeff,stars)
}


sample %>%  
  select(numballs, 
         victimproplost, victimfamlost, acledviolence10, 
         husbmoreland, wifemoreland, riskwife, riskhusband, barghusbandcloser, bargwifecloser, 
         agewife, agehusband, genderhead, eduwife_prim, eduwife_sec, eduhusband_prim, eduhusband_sec, 
         tinroof, livestockany, terrfe_2, terrfe_3) %>%
  summarize(across(everything(),
                   .fns = function(x) get_diffs(.,.$x,as.factor(sample$Treatment)))) %>%
  pivot_longer(cols =everything(),
               names_to = "Variable",
               values_to="Diff")  %>%
  write_csv(here("tables/summstats_diffs.csv"))

```

## Table 5: Conflict Exposure by Territory

This is a relatively simple table:

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}


cleandata %>%
  select(victimproplost,victimfamlost,acledviolence10, territory) %>%
  mutate(Territory = case_when(territory == 1 | territory == 2 ~ " Kabare/Bagira",
                               territory == 3 ~ " Uvira",
                               territory == 4 ~ " Fizi")) %>%
  select(-territory) %>%
  get_summ_stats(by= Territory, overall = "Overall") %>%
  write_csv(here("tables/violence_by_territory.csv"))

```


## Table 6 and Table 7

The function (`mean_diff_table()`) to create this table is a bit ugly, but i consists of three sub-functions:

- `get_means()` computes the means of treatment and control for each value of a binary...
- `get_meandiffs()` comutes the differences between the treatment and control for each group using a regression; and,
- `get_diffindiff()` computes the DiD using a regression.

Each function outputs a data frame in long format: these are bound together to one large one, that can be used by `flextable::tabulator()`.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

mean_diff_table <- function(.df,var){

  # this functions gets 4 means: treatment assignment x values of var
  # it returns a long data frame with one row per mean
  get_means <- function(.df, var){
    .df %>%
      mutate( {{ var }} := as.character(as_factor( {{ var }} ))) %>%
      select(numballs,Treatment,{{ var }})   %>%
      group_by(Treatment, {{ var }} ) %>%
      summarize(across(numballs, ~mean(.x, na.rm = TRUE))) %>%
      pivot_longer({{ var }},
                   names_to = "var",
                   values_to = "group") %>%
      filter(!is.na(group)) %>%
      mutate(stat = "Mean") %>%
      select(everything(), estimate = numballs,stat_subgroup = Treatment) %>%
      ungroup()
  }

  # this function computes the differences between treatment and control
  # for the two groups created by var
  # it returns a long data frame, with 6 rows: estimate, std.error and p.value for each group
  get_meandiffs <- function(.df,var){
    #this function computes the actual means and formats the outcome
    meandiff_regress <- function(.df){
      .df %>%
        lm_robust(numballs ~ Treatment, data = .) %>%
        broom::tidy() %>%
        filter(term == "Treatment Treatment") %>%
        select(estimate,std.error,p.value)
    }

   .df <- 
      .df %>% 
      mutate( {{ var }} := as.character(as_factor( {{ var }} ))) 

    # I basically split the data frame in two and run the above 
    # function on both data sets with map.
    splitvector <- .df %>% pull( {{ var }} )
    
    .df %>%
      split(splitvector) %>%
      map(meandiff_regress) %>%
      imap(~mutate(.x, group = .y)) %>%
      reduce(bind_rows) %>% 
      #pivot_longer(-c(group,p.value),
                   # names_to = "stat_subgroup",
                   # values_to = "estimate") %>%
      mutate(stat = "Diff",
             #p.value = ifelse(stat_subgroup == "std.error",NA,p.value),
             var = rlang::quo_text(enquo(var)))
  }

  # finally: this computes the difference between the two differences.
  # it ouputs a long data frame with estimate, std.error and p.value
  get_diffindiff <- function(.df,var){
    data <- 
      .df %>% 
      select(numballs,Treatment, {{ var }})

    y <- data[[1]]
    treat <- data[[2]]
    x <- data[[3]]

    lm_robust(y ~ treat *  x) %>%
        broom::tidy() %>%
        filter(term == "treat Treatment:x") %>%
        select(estimate,std.error,p.value) %>%
        # pivot_longer(-p.value,
        #              names_to = "stat_subgroup",
        #              values_to = "estimate")  %>%
        mutate(group = "diff in diff",
               #p.value = ifelse(stat_subgroup == "std.error",NA,p.value),
               stat = "Diff",
               var = rlang::quo_text(enquo(var)))
  }

  df1 <- get_means(.df,{{ var }})
  df2 <- get_meandiffs(.df,{{ var }})
  df3 <- get_diffindiff(.df,{{ var }})

  bind_rows(df1,df2,df3) %>% 
  ungroup()
  #bind_rows(df2,df3) %>% ungroup()

}

quos(victimproplost, victimfamlost, acledviolence10d,husbmoreland, wifemoreland, barghusbandcloser, bargwifecloser) %>%
  map(~mean_diff_table(sample,!!.x)) %>%
  reduce(bind_rows) %>%
  write_csv(here("tables/meandifftab.csv"))

```


## Table 8


I first define a number of helper varaibles:

- `tidy_ict()` outputs a nice and tidy data frame with coefficients, standard errors etc. from the `ict_reg()` function.
- `format_stars()` takes a coefficient and p-value and formats it to a coefficient with stars.

I also save my `sample` as a data frame (was a tibble), because `ict_reg()` didn't work on a tibble.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

#outputs a tidy df for ict_reg
tidy_ict <- function(reg1){
    bind_cols(reg1$coef.names, reg1$par.treat, reg1$se.treat, .name_repair = ~ c("term","estimate","std.error")) %>%
    mutate(statistic = estimate / std.error,
           df =   reg1$resid.df,
           p.value = pt(statistic, df, lower.tail = FALSE) * 2)
  }

#formats stars.
format_stars <- function(b,p,digits=3){
  stars = case_when(p < 0.01 ~ "***",
                    p < 0.05 ~ "**",
                    p < 0.1 ~ "*",
                    .default = "" )

  coeff = round(b,digits)

  paste0(coeff,stars)
}


sample_df <- 
  sample %>%
  as.data.frame() 


```

Then I need to do some convenience things:

- I put all my controls in a vector: I can insert this vector into a formula using `reformulate()`
- I have five model specification, which only differ in a number of exaplanatory variables. I make a list with the explanatory variables in each specifcation, so that I can loop over this with map.


```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

controls <- c("agewife", "agehusband", "genderhead", "eduwife_sec", "eduhusband_prim", "tinroof", "livestockany", "terrfe_2", "terrfe_3", "treatment")

#list of model specification, in particular depvars to include with controls
specs <- list(reg1 = "husbmoreland", 
              reg2 = "victimfamlost",
              reg3 = "acledviolence10",
              reg4 = c("husbmoreland", "victimfamlost", "acledviolence10", "attwifetotal"))

```

Then comes the good stuff! I loop over the list using `imap()`, where each step returns a nice data frame created by `tidy_ict()`. `imap()`
can return the name of the list element in the current iteration, which I save in the `model` variable.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}
library(list)

#run all regressions in the specifications
imap(specs, ~ictreg(reformulate(c(.x, controls), response="numballs"), treat = "ball5", J = 4, method = "lm", data = sample_df) %>%
                  tidy_ict() %>% 
                  mutate(model = .y)) %>%
  reduce(bind_rows) %>%
  select(model,term,estimate,std.error,p.value) %>%
  write_csv(here("tables/regression_results.csv"))

# I can't find the N in ict_reg output; this finds the number of complete cases for each specification
map(specs,~sum(complete.cases(sample[,c(.x,controls,"numballs")])))

```


# Table A1

Here I check for systematic sample selection. I do some data wrangling to find have indicators for being selected
into the games (wife, husband and couple).

I then create function to do a logit regression which I map to the three sample selection indicators I created.
The function outputs a list with the outputs of `broom::tidy()` and `broom::glance()`. I export these to CSV.

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

attr_data <- 
  cleandata %>%
  mutate(wifeconsent = as.double(riskwifestatus == 1),
         husbandconsent = as.double(riskhusbandstatus == 1),
         coupleconsent = wifeconsent * husbandconsent) 


allcontrols <- c("agewife", "agehusband", "genderhead", "eduwife_prim", "eduwife_sec", "eduhusband_prim", "eduhusband_sec", "tinroof", "livestockany", 
                 "terrfe_2", "terrfe_3", "treatment")


logit <- function(.df,y,x,label){
  reg <- glm(reformulate(x, response=y),data = .df, family = "binomial")

  coeffs <-
    reg %>%
    broom::tidy() %>%
    mutate(model = label)

   scalars <-
    reg %>%
    broom::glance() %>%
    mutate(model = label) 

  list(coeffs = coeffs, scalars = scalars)
}


regs <-
  list(reg1 = "wifeconsent",
      reg2 = "husbandconsent",
      reg3 = "coupleconsent") %>%
  imap(~logit(attr_data,.x,allcontrols,.y)) %>%
  list_transpose(simplify = FALSE) %>%
  map(list_rbind)


regs$coeffs %>%
write_csv(here("tables/sampleselection_coeffs.csv"))


regs$scalars %>%
write_csv(here("tables/sampleselection_scalars.csv"))

```

# Table A2

Basically a copy-paste from above:

```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

run_regs <- function(.df,y,x,label){
  reg <- lm_robust(reformulate(x, response=y),data = .df)

  coeffs <-
    reg %>%
    broom::tidy() %>%
    mutate(model = label)

   scalars <-
    reg %>%
    broom::glance() %>%
    mutate(model = label) 

  list(coeffs = coeffs, scalars = scalars)
}


regs <-
  list(reg1 = list(y = "husbmoreland", x = c("victimfamlost","acledviolence10",allcontrols)),
       reg2 = list(y = "victimfamlost", x = c("husbmoreland","acledviolence10", allcontrols)),
       reg3 = list(y = "acledviolence10", x = c("husbmoreland","victimfamlost", allcontrols))) %>%
  imap(~run_regs(sample,.x$y,.x$x,.y)) %>%
  list_transpose(simplify = FALSE) %>%
  map(list_rbind)


regs$coeffs %>%
  write_csv(here("tables/determinants_coeffs.csv"))


regs$scalars %>%
write_csv(here("tables/determinants_scalars.csv"))

```

# Figures

## Figure 1


```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}


sample %>%
  select(numballs, Treatment) %>%
  group_by(Treatment) %>%
  summarize(n = n(),
            sd = sd(numballs),
            numballs = mean(numballs)) %>%
  mutate(margin = qt(0.975,df=n-1)*sd/sqrt(n),
         ci.lower = numballs - margin,
         ci.higher = numballs + margin) %>%
  write_csv(here("tables/dataforgraphs1.csv"))


read_csv(here("tables/dataforgraphs1.csv")) %>%
ggplot(aes(numballs, Treatment)) +
geom_pointrange(aes(xmin = ci.lower, xmax = ci.higher)) +
xlim(0,5) + 
xlab("Number of reported issues")


```

## Figure 2


```{r eval=TRUE, include = TRUE, echo=TRUE, warning=TRUE, error=TRUE, message=FALSE}

var = rlang::quo_text(enquo(var))


dataforgraphs <- function(df, var){
  df %>%
  mutate(group = {{ var }}) %>%
  select(numballs, Treatment,group) %>%
  filter(!is.na(group)) %>%
  group_by(Treatment,group) %>%
  mutate(group = as.character(as_factor(group))) %>%
  summarize(n = n(),
            sd = sd(numballs, na.rm = TRUE),
            numballs = mean(numballs, na.rm = TRUE)) %>%
  mutate(margin = qt(0.975,df=n-1)*sd/sqrt(n),
         ci.lower = numballs - margin,
         ci.higher = numballs + margin,
         var = rlang::quo_text(enquo(var))) 
}

quos(victimproplost,victimfamlost,acledviolence10d, statpar, bargresult) %>%
  map(~dataforgraphs(sample,!!.x)) %>%
  reduce(bind_rows) %>%
  write_csv(here("tables/dataforgraphs.csv"))


```