Merge pull request #256 from xKDR/v0.1.1
Version 0.1.1 into main
smishr authored Apr 10, 2023
2 parents f9aa828 + 31a80ef commit 0703cfc
Showing 19 changed files with 700 additions and 48 deletions.
68 changes: 56 additions & 12 deletions CONTRIBUTING.md
@@ -2,6 +2,17 @@

# Contributing to Survey.jl

* [Overview](#overview)
* [Reporting Issues](#reporting-issues)
* [Recommended workflow setup](#recommended-workflow-setup)
* [Modifying an existing docstring in `src/`](#modifying-an-existing-docstring-in--src--)
* [Adding a new docstring to `src/`](#adding-a-new-docstring-to--src--)
* [Doctests](#doctests)
* [Integration with exisiting API](#integration-with-exisiting-api)
* [Contributing](#contributing)
* [Style Guidelines](#style-guidelines)
* [Git Recommendations For Pull Requests](#git-recommendations-for-pull-requests)

## Overview
Thank you for thinking about making contributions to Survey.jl!
We aim to keep consistency in contribution guidelines to DataFrames.jl, which is the main upstream dependency for the project.
@@ -16,6 +27,46 @@ Reading through the ColPrac guide for collaborative practices is highly recommen
(`Pkg.add(name="Survey", rev="main")`) is a good gut check and can streamline the process,
along with including the first two lines of output from `versioninfo()`

## Setting up development workflow

The tutorial below uses Windows Subsystem for Linux (WSL) and VSCode. Linux, macOS, and BSD users can ignore the WSL-specific steps.

1. Install Ubuntu on WSL from the [Ubuntu website](https://ubuntu.com/wsl) or the Microsoft Store.
2. Create a fork of the [Survey.jl repository](https://github.com/xKDR/Survey.jl). You will only ever work on this fork, and submit Pull Requests to the main repo.
3. Copy the SSH link from your fork by clicking the green `<> Code` icon and then `SSH`.
- You must already have SSH set up for this to work. If you don't, look up a tutorial on how to clone a GitHub repository using SSH.
4. Open a WSL terminal, and run:
- `curl -fsSL https://install.julialang.org | sh`
- `git clone git@github.com:your_username/Survey.jl.git` (replace `your_username` with your GitHub username)
- `julia`
5. You are now in the Julia REPL. Run:
- `import Pkg; Pkg.add("Revise")`
- `import Pkg; Pkg.add("Survey")`
- `import Pkg; Pkg.add("Test")`
- `] dev .`
6. Open VSCode and install the following extensions:
- WSL
- Julia
7. Go back to your WSL terminal, navigate to the folder of your repo, and run `code .` to open VSCode in that folder.
8. Create a `dev` folder (only if you want; it is gitignored by default), and a `test.jl` file in that folder. Paste this block of code and save:

```julia
using Revise, Survey, Test

@testset "ratio.jl" begin
    apiclus1 = load_data("apiclus1")
    dclus1 = SurveyDesign(apiclus1; clusters=:dnum, strata=:stype, weights=:pw)
    @test ratio(:api00, :enroll, dclus1).ratio[1] ≈ 1.17182 atol = 1e-4
end
```

9. In the WSL terminal (not the Julia REPL), run `julia dev/test.jl`.
✅ If you get no errors, your setup is now complete!

You can keep working in the `dev` folder, which is .gitignored.
Once you have working code and tests, you can move them to the appropriate folders, commit, push, and submit a Pull Request.
Make sure to read the rest of this document so you can learn the best practices and guidelines for this project.
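
The `@test` line in `test.jl` compares a floating-point estimate against a reference value within an absolute tolerance (`atol`). As a minimal sketch of how that comparison behaves, the snippet below uses only the `Test` standard library. The `estimate` value is a made-up stand-in, not output from Survey.jl.

```julia
using Test

# Hypothetical stand-in for an estimator result; not a Survey.jl call.
estimate = 1.17182 + 3e-5

@testset "approximate comparison" begin
    # `≈` is `isapprox`; the trailing `atol = ...` is passed to it as a keyword.
    @test estimate ≈ 1.17182 atol = 1e-4
    # The same check written as an explicit function call.
    @test isapprox(estimate, 1.17182; atol = 1e-4)
    # A tighter tolerance rejects the same difference of about 3e-5.
    @test !isapprox(estimate, 1.17182; atol = 1e-6)
end
```

Choosing `atol` close to the precision of the reference value (here, five decimal places) keeps the test robust to rounding while still catching real regressions.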

## Modifying an existing docstring in `src/`

All docstrings are written inline above the methods or types they are associated with and can
@@ -94,7 +145,7 @@ This way you are modifying as little as possible of previously written code, and
* If you want to propose a new functionality it is strongly recommended to open an issue first and reach a decision on the final design.
Then a pull request serves an implementation of the agreed way how things should work.
* If you are a new contributor and would like to get a guidance on what area
you could focus your first PR please do not hesitate to ask and JuliaData members
you could focus your first PR please do not hesitate to ask community members
will help you with picking a topic matching your experience.
* Feel free to open, or comment on, an issue and solicit feedback early on,
especially if you're unsure about aligning with design goals and direction,
@@ -104,22 +155,15 @@ This way you are modifying as little as possible of previously written code, and
* Aim for atomic commits, if possible, e.g. `change 'foo' behavior like so` &
`'bar' handles such and such corner case`,
rather than `update 'foo' and 'bar'` & `fix typo` & `fix 'bar' better`.
* Pull requests are tested against release and development branches of Julia,
so using `Pkg.test("DataFrames")` as you develop can be helpful.
* Pull requests are tested against release branches of Julia,
so using `Pkg.test("Survey")` as you develop can be helpful.
* The style guidelines outlined below are not the personal style of most contributors,
but for consistency throughout the project, we've adopted them.
* It is recommended to disable GitHub Actions on your fork; check Settings > Actions.
* If a PR adds a new exported name then make sure to add a docstring for it and
add a reference to it in the documentation.
* A PR with breaking changes should have `[BREAKING]` as a first part of its name.
* If a PR changes or adds functionality please update NEWS.md file accordingly as
a part of the PR (along with the link to the PR); please do not add entries
to NEWS.md for changes that are bug fixes or are not user visible, such as
adding tests, updating documentation or improving code layout.
* If you make a PR please try to avoid pushing many small commits to GitHub in
a sequence as each such commit triggers a separate CI job, which takes over
an hour. This has a consequence of making other PRs in packages from the JuliaData
ecosystem wait for such CI jobs to finish as hey share a common pool of CI resources.
* A PR which is still draft or work in progress should have `WIP:` as a first part of its name.
* If you make a PR please try to avoid pushing many small commits to GitHub in a sequence, as each such commit triggers a separate CI job, which takes computational time and is not a good use of the small pool of CI resources.

## Style Guidelines

2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "Survey"
uuid = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
authors = ["Ayush Patnaik <[email protected]>"]
version = "0.1.0"
version = "0.2.0"

[deps]
AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67"
3 changes: 2 additions & 1 deletion README.md
@@ -99,7 +99,8 @@ cluster: none
popsize: [6190.0, 6190.0, 6190.0 … 6190.0]
sampsize: [200, 200, 200 … 200]
weights: [31.0, 31.0, 31.0 … 31.0]
probs: [0.0323, 0.0323, 0.0323 … 0.0323]
allprobs: [0.0323, 0.0323, 0.0323 … 0.0323]
type: bootstrap
replicates: 1000

julia> mean(:api00, bootsrs)
4 changes: 3 additions & 1 deletion docs/Project.toml
@@ -1,5 +1,7 @@
[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
2 changes: 2 additions & 0 deletions docs/src/api.md
@@ -14,6 +14,8 @@ SurveyDesign
ReplicateDesign
load_data
bootweights
jackknifeweights
jackknife_variance
mean
total
quantile
2 changes: 2 additions & 0 deletions src/Survey.jl
@@ -25,6 +25,7 @@ include("boxplot.jl")
include("show.jl")
include("ratio.jl")
include("by.jl")
include("jackknife.jl")

export load_data
export AbstractSurveyDesign, SurveyDesign, ReplicateDesign
@@ -35,5 +36,6 @@ export hist, sturges, freedman_diaconis
export boxplot
export bootweights
export ratio
export jackknifeweights, jackknife_variance

end
203 changes: 200 additions & 3 deletions src/SurveyDesign.jl
@@ -126,14 +126,117 @@ end
"""
ReplicateDesign <: AbstractSurveyDesign
Survey design obtained by replicating an original design using [`bootweights`](@ref).
Survey design obtained by replicating an original design using [`bootweights`](@ref). If
replicate weights are available, then they can be used to directly create a `ReplicateDesign`.
```jldoctest
# Constructors
```julia
ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::Vector{Symbol};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::UnitRange{Int};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::Regex;
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
```
# Arguments
The constructor has the same arguments as [`SurveyDesign`](@ref). The only additional argument is `replicate_weights`, which can
be of one of the following types.
- `Vector{Symbol}`: In this case, each `Symbol` in the vector should represent a column of `data` containing the replicate weights.
- `UnitRange{Int}`: For instance, this could be `UnitRange(5:10)`. This means that the replicate weights are contained in columns 5 through 10.
- `Regex`: In this case, all the columns of `data` which match this `Regex` will be treated as the columns containing the replicate weights.
All the columns containing the replicate weights will be renamed to the form `replicate_i`, where `i` ranges from 1 to the number of columns containing the replicate weights.
# Examples
Here is an example where the [`bootweights`](@ref) function is used to create a `ReplicateDesign`.
```jldoctest replicate-design; setup = :(using Survey, CSV, DataFrames)
julia> apistrat = load_data("apistrat");
julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);
julia> bootstrat = bootweights(dstrat; replicates=1000)
julia> bootstrat = bootweights(dstrat; replicates=1000) # creating a ReplicateDesign using bootweights
ReplicateDesign:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
cluster: none
popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0]
sampsize: [100, 100, 100 … 50]
weights: [44.21, 44.21, 44.21 … 15.1]
allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000
```
If the replicate weights are given to us already, then we can directly pass them to the `ReplicateDesign` constructor. For instance, in
the above example, suppose we had the `bootstrat` data as a CSV file (for this example, we also rename the columns containing the replicate weights to the form `r_i`).
```jldoctest replicate-design
julia> using CSV;
julia> DataFrames.rename!(bootstrat.data, ["replicate_"*string(index) => "r_"*string(index) for index in 1:1000]);
julia> CSV.write("apistrat_withreplicates.csv", bootstrat.data);
```
We can now pass the replicate weights directly to the `ReplicateDesign` constructor, either as a `Vector{Symbol}`, a `UnitRange` or a `Regex`.
```jldoctest replicate-design
julia> bootstrat_direct = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), [Symbol("r_"*string(replicate)) for replicate in 1:1000]; strata=:stype, weights=:pw)
ReplicateDesign:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
cluster: none
popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0]
sampsize: [100, 100, 100 … 50]
weights: [44.21, 44.21, 44.21 … 15.1]
allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000
julia> bootstrat_unitrange = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), UnitRange(45:1044);strata=:stype, weights=:pw)
ReplicateDesign:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
cluster: none
popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0]
sampsize: [100, 100, 100 … 50]
weights: [44.21, 44.21, 44.21 … 15.1]
allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000
julia> bootstrat_regex = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), r"r_\\d";strata=:stype, weights=:pw)
ReplicateDesign:
data: 200×1044 DataFrame
strata: stype
@@ -143,8 +246,11 @@ popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0]
sampsize: [100, 100, 100 … 50]
weights: [44.21, 44.21, 44.21 … 15.1]
allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000
```
"""
struct ReplicateDesign <: AbstractSurveyDesign
data::AbstractDataFrame
@@ -155,5 +261,96 @@ struct ReplicateDesign <: AbstractSurveyDesign
weights::Symbol # Effective weights in case of singlestage approx supported
allprobs::Symbol # Right now only singlestage approx supported
pps::Bool
type::String
replicates::UInt
replicate_weights::Vector{Symbol}

# default constructor
function ReplicateDesign(
data::DataFrame,
cluster::Symbol,
popsize::Symbol,
sampsize::Symbol,
strata::Symbol,
weights::Symbol,
allprobs::Symbol,
pps::Bool,
type::String,
replicates::UInt,
replicate_weights::Vector{Symbol}
)
new(data, cluster, popsize, sampsize, strata, weights, allprobs,
pps, type, replicates, replicate_weights)
end

# constructor with given replicate_weights
function ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::Vector{Symbol};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
# rename the replicate weights if needed
rename!(data, [replicate_weights[index] => "replicate_"*string(index) for index in 1:length(replicate_weights)])

# call the SurveyDesign constructor
base_design = SurveyDesign(
data;
clusters=clusters,
strata=strata,
popsize=popsize,
weights=weights
)
new(
base_design.data,
base_design.cluster,
base_design.popsize,
base_design.sampsize,
base_design.strata,
base_design.weights,
base_design.allprobs,
base_design.pps,
"bootstrap",
length(replicate_weights),
replicate_weights
)
end

# replicate weights given as a range of columns
ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::UnitRange{Int};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
) =
ReplicateDesign(
data,
Symbol.(names(data)[replicate_weights]);
clusters=clusters,
strata=strata,
popsize=popsize,
weights=weights
)

# replicate weights given as regular expression
ReplicateDesign(
data::AbstractDataFrame,
replicate_weights::Regex;
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
) =
ReplicateDesign(
data,
Symbol.(names(data)[findall(name -> occursin(replicate_weights, name), names(data))]);
clusters=clusters,
strata=strata,
popsize=popsize,
weights=weights
)
end
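
The three `replicate_weights` constructors above all reduce to a `Vector{Symbol}` of column names before delegating to the first one. The sketch below mirrors that selection logic in plain Julia, without DataFrames. The helper `resolve_replicate_columns` and the `colnames` vector are hypothetical illustrations and are not part of Survey.jl.

```julia
# Hypothetical column names standing in for `names(data)`.
colnames = ["stype", "pw", "r_1", "r_2", "r_3"]

# Hypothetical helper (not a Survey.jl function): one method per selector type.
resolve_replicate_columns(names, cols::Vector{Symbol}) = cols
resolve_replicate_columns(names, cols::UnitRange{Int}) = Symbol.(names[cols])
resolve_replicate_columns(names, cols::Regex) =
    Symbol.(filter(name -> occursin(cols, name), names))

by_symbol = resolve_replicate_columns(colnames, [:r_1, :r_2, :r_3])
by_range = resolve_replicate_columns(colnames, 3:5)
by_regex = resolve_replicate_columns(colnames, r"^r_\d+$")

# Old-name => new-name pairs, analogous to the pairs passed to `rename!`.
renames = [col => Symbol("replicate_", i) for (i, col) in enumerate(by_regex)]
```

An anchored pattern such as `r"^r_\d+$"` is used here so that unrelated columns whose names merely contain `r_` followed by a digit are not picked up. This is a choice made for the sketch, not something the constructor requires.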