
Software in research survey - 2014

Introduction

In 2014 the Software Sustainability Institute ran a survey of researchers at 15 research-intensive universities in the UK to uncover their attitudes to software. For reasons that will be explained in more detail in a forthcoming blog post, the analysis of those results was originally conducted in Excel. To improve the transparency and reproducibility of the results, the analysis has now been repeated in Python.

Important points

  • Licences for the code and data can be found in the LICENCE and LICENCE_DATA files respectively.
  • The code runs on Python 3.
  • The data derives from the 2014 software in research survey.

Summary of process

  1. Get the raw survey results from the survey software (iSurvey).
  2. Anonymise the data by manually deleting the "Email" and "Further comments" fields (a scripted sketch of this step follows the list).
  3. Make Question 11 parsable in Python.
  4. Clean the responses in OpenRefine.
  5. Analyse the results in Python.
  6. Compare the results in Python.
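
The anonymisation in step 2 was carried out by hand, but the equivalent operation is easy to script. Here is a minimal sketch using pandas; the input and output file names are placeholders, while the two column names are taken from step 2 above.

    import pandas as pd

    # Load the raw survey export (file name here is a placeholder).
    raw = pd.read_csv("raw_survey_export.csv")

    # Drop the two personally identifying fields named in step 2.
    # errors="ignore" keeps the script safe if a column is absent.
    anonymised = raw.drop(columns=["Email", "Further comments"], errors="ignore")

    anonymised.to_csv("anonymised_survey_export.csv", index=False)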

How to reproduce the results of this analysis

Set up

Get the files and data:

  1. Clone the git repository

Prepare for cleaning:

  1. Download and install OpenRefine

Prepare for running Python:

  1. If not already installed, install virtualenv:
    • pip install virtualenv
  2. Create a virtual environment (this also creates the project folder):
    • virtualenv -p <path to your Python 3 interpreter> <name of project>
  3. Activate the virtual environment:
    • source <name of project>/bin/activate
  4. Install the required libraries (a combined example session follows this list):
    • pip install -r requirements.txt
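
Putting those steps together, a typical set-up session on Linux or macOS might look like the following (the environment name survey-env and the interpreter path are placeholders; adjust both to your machine):

    pip install virtualenv
    virtualenv -p /usr/bin/python3 survey-env
    source survey-env/bin/activate
    pip install -r requirements.txt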

Clean the data

There are two ways you can investigate the data cleaning. The first option is easy, and the second is thorough.

First option: the easy one

  1. Navigate to the main directory software_in_research_survey_2014
  2. Run parse_text_column.py:
    • python parse_text_column.py
  3. This takes the original survey data and parses the user-entered (and hence very messy) answers to Question 11 ("What software do you use in your research?"). It produces software_in_research_parasable.csv (an illustrative sketch of this kind of parsing follows this list).
  4. Open OpenRefine and import Software-in-research-cleaning.openrefine.tar.gz. This takes software_in_research_parasable.csv and conducts the following cleaning steps:
    1. Removes responses from universities not included in the study
    2. Rationalises user responses (e.g. "Cambridge uni" and "Uni Cambridge" become "University of Cambridge", "MS Excel" and "Excel" become "Microsoft Excel", etc.)
  5. Export the cleaned data from OpenRefine as Software-in-research-cleaning.csv
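
parse_text_column.py contains the project's actual parsing logic; purely as an illustration of the kind of transformation involved (the column name, delimiters, and file names below are assumptions, not the script's real behaviour), splitting a free-text multi-answer field with pandas might look like this:

    import pandas as pd

    # Illustrative only: file and column names are assumptions.
    df = pd.read_csv("anonymised_survey_export.csv")

    # Respondents separated software names inconsistently, so normalise
    # the separators before splitting the free-text answers.
    answers = (
        df["What software do you use in your research?"]
        .fillna("")
        .str.replace(";", ",")
        .str.split(",")
    )

    # One row per (respondent, software) pair, with whitespace stripped.
    software_mentions = answers.explode().str.strip()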

Second option: the thorough one

  1. Navigate to the main directory software_in_research_survey_2014
  2. Run parse_text_column.py to take the original survey data and parse the user-entered (and hence very messy) answers to Question 11 ("What software do you use in your research?"). This produces software_in_research_parasable.csv.
  3. Open a first instance of OpenRefine and import Software-in-research-cleaning.openrefine.tar.gz
  4. Extract the cleaning steps from the first instance of OpenRefine as described in the documentation (see "Replaying Operations").
  5. Open a second instance of OpenRefine and import software_in_research_parasable.csv
  6. Apply the extracted cleaning steps from the first instance of OpenRefine to the data now held in the second instance of OpenRefine. This will conduct the following cleaning steps:
    1. Removes responses from universities not included in the study
    2. Rationalises user responses (e.g. "Cambridge uni" and "Uni Cambridge" become "University of Cambridge", "MS Excel" and "Excel" become "Microsoft Excel", etc.)
  7. Export the cleaned data from OpenRefine as Software-in-research-cleaning.csv

Run the analysis

  1. Run survey_2014_analysis.py:
    • python survey_2014_analysis.py
  2. This summarises the responses to the survey by grouping the answers to each question and counting how many times each answer occurs. It stores the results in a series of CSV files (one per question) in the output/summary_csvs/ directory (a sketch of this summarisation follows this list).
  3. Run comparison_new_old_results.py:
    • python comparison_new_old_results.py
  4. This takes the summary files produced by survey_2014_analysis.py and compares them against the results from the original analysis. It stores the results of that comparison in a series of CSV files (one per question) in the output/comparison_summary_csvs/ directory.
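
survey_2014_analysis.py holds the actual analysis code; as a rough sketch of the summarisation described in step 2 (iterating over every column is an assumption about how the questions are organised), counting how often each answer occurs could look like this:

    import os
    import pandas as pd

    cleaned = pd.read_csv("Software-in-research-cleaning.csv")
    os.makedirs("output/summary_csvs", exist_ok=True)

    for question in cleaned.columns:
        # Group identical answers and count how often each occurs.
        counts = cleaned[question].value_counts()
        # Question text may contain characters that are invalid in file
        # names, so build a safe file name from it first.
        safe_name = "".join(c if c.isalnum() else "_" for c in question)
        counts.to_csv(os.path.join("output", "summary_csvs", safe_name + ".csv"))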

Files and scripts

The following is a quick reference for the files and scripts, just in case you're wondering what everything does.

Data directory:

  • Software-in-research-cleaning.openrefine.tar.gz - OpenRefine export detailing the cleaning steps
  • The use of software in research (Responses) 24 Oct 14 - Form Responses 1.csv - the raw (anonymised) data from the survey
  • software_in_research_parasable.csv - data after processing to make comma separation more straightforward
  • Software-in-research-cleaning.csv - data ready for analysis

Main directory:

  • parse_text_column.py - used to create software_in_research_parasable.csv described above
  • requirements.txt - lists the libraries used by the Python scripts. See "Set up" for installation instructions.
  • chart_details_lookup.py - stores details about the charts so that presentation settings are kept separate from the analysis code
  • survey_2014_analysis.py - main script for analysing survey responses
  • comparison_new_old_results.py - script to compare results from original Excel-based analysis of survey results and the results generated by survey_2014_analysis.py

Other directories:

  • results_from_original_2014_analysis - results from original Excel-based analysis of survey results, available from Zenodo
    • This includes ResearchSoftwareSurvey2014Results.xlsx - which is the original analysis conducted in Excel
  • output - all generated charts, plus the results stored as CSVs