In 2014 the Software Sustainability Institute ran a survey of researchers at 15 research-intensive universities in the UK to uncover their attitudes to software. For reasons that will be explained in more detail in a forthcoming blog post, the analysis of these results was conducted in Excel. To improve the transparency and reproducibility of these results, this analysis has now been repeated in Python.
- Licences for the code and data can be found in the LICENCE and LICENCE_DATA files respectively.
- The code runs on Python 3.
- The data derives from the 2014 software in research survey.
- Get raw survey results from survey software (iSurvey)
- Anonymise data by manually deleting "Email" and "Further comments" fields.
- Make Question 11 parsable in Python
- Clean responses in OpenRefine
- Analyse results in Python
- Compare results in Python
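The anonymisation step above was done by hand. For anyone who prefers a scripted equivalent, the following is a minimal sketch, not part of this repository. It assumes the two fields appear as CSV column headers named exactly "Email" and "Further comments"; the function name drop_columns and the sample row are hypothetical.

```python
import csv
import io

# Column names taken from the anonymisation step; the exact headers in the
# raw export are an assumption.
SENSITIVE_COLUMNS = ("Email", "Further comments")


def drop_columns(in_file, out_file, columns=SENSITIVE_COLUMNS):
    """Copy a CSV from in_file to out_file, omitting the named columns."""
    reader = csv.DictReader(in_file)
    kept = [name for name in reader.fieldnames if name not in columns]
    # extrasaction="ignore" silently skips the columns we are dropping.
    writer = csv.DictWriter(out_file, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Hypothetical one-row example, not real survey data.
raw = io.StringIO(
    "University,Email,Further comments\n"
    "University of Example,someone@example.org,none\n"
)
cleaned = io.StringIO()
drop_columns(raw, cleaned)
print(cleaned.getvalue())
```

With real files, the same function can be called on handles opened with open(path, newline="").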
Get the files and data:
Prepare for cleaning:
Prepare for running Python:
- If not already installed, install virtualenv:
pip install virtualenv
- Create a project folder:
virtualenv -p <location of Python3 install directory> <name of project>
- Activate the virtual environment:
source <name of project>/bin/activate
- Install libraries:
pip install -r requirements.txt
There are two ways you can investigate the data cleaning. The first option is easy, and the second is thorough.
First option: the easy one
- Navigate to the main directory software_in_research_survey_2014
- Run parse_text_column.py:
python parse_text_column.py
- This will take the original survey data and parse the user-entered (and hence, very messy) answers to Question 11 ("What software do you use in your research?"). This produces software_in_research_parasable.csv.
- Open OpenRefine and import Software-in-research-cleaning.openrefine.tar.gz. This takes software_in_research_parasable.csv and conducts the following cleaning steps:
  - Removes responses from universities not included in the study
  - Rationalises user responses (e.g. "Cambridge uni" and "Uni Cambridge" become "University of Cambridge", "MS Excel" and "Excel" become "Microsoft Excel", etc.)
- Export the cleaned data from OpenRefine as Software-in-research-cleaning.csv
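To give a feel for what parsing the free-text answers to Question 11 involves, here is a minimal sketch. It is not the code in parse_text_column.py: the separator set (commas, semicolons, newlines and the word "and") and the function name split_software_answer are assumptions made for illustration.

```python
import re

# Assumed separators: commas, semicolons, newlines and the standalone word
# "and". The real script may handle a different set.
SEPARATORS = re.compile(r"\s*(?:[,;\n]|\band\b)\s*", flags=re.IGNORECASE)


def split_software_answer(answer):
    """Split one free-text answer into a list of software names."""
    parts = SEPARATORS.split(answer)
    # Drop empty fragments left by leading, trailing or doubled separators.
    return [part.strip() for part in parts if part.strip()]


print(split_software_answer("MS Excel; R and Matlab,\nPython"))
# ['MS Excel', 'R', 'Matlab', 'Python']
```

The word boundaries around "and" keep names such as "Pandas" intact. Names that legitimately contain a separator would still be split, which is one reason a manual cleaning pass in OpenRefine follows.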
Second option: the thorough one
- Navigate to the main directory software_in_research_survey_2014
- Run parse_text_column.py to take the original survey data and parse the user-entered (and hence, very messy) answers to Question 11 ("What software do you use in your research?"). This produces software_in_research_parasable.csv.
- Open a first instance of OpenRefine and import Software-in-research-cleaning.openrefine.tar.gz
- Extract the cleaning steps from the first instance of OpenRefine as described in the documentation (see "Replaying Operations").
- Open a second instance of OpenRefine and import software_in_research_parasable.csv
- Apply the extracted cleaning steps from the first instance of OpenRefine to the data now held in the second instance of OpenRefine. This will conduct the following cleaning steps:
  - Removes responses from universities not included in the study
  - Rationalises user responses (e.g. "Cambridge uni" and "Uni Cambridge" become "University of Cambridge", "MS Excel" and "Excel" become "Microsoft Excel", etc.)
- Export the cleaned data from OpenRefine as Software-in-research-cleaning.csv
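The two cleaning steps are performed in OpenRefine, but their effect can be illustrated in Python. The sketch below is an assumption-laden illustration rather than a replacement for the OpenRefine project: the lookup table contains only the examples quoted above, and the column names "University" and "Software", the function names and the sample rows are all hypothetical.

```python
# Lookup built only from the examples quoted in the cleaning steps; the real
# OpenRefine project contains many more rules.
CANONICAL_NAMES = {
    "cambridge uni": "University of Cambridge",
    "uni cambridge": "University of Cambridge",
    "ms excel": "Microsoft Excel",
    "excel": "Microsoft Excel",
}


def rationalise(value, lookup=CANONICAL_NAMES):
    """Map a user-entered value to its canonical form, if one is known."""
    key = " ".join(value.split()).lower()  # collapse whitespace, ignore case
    return lookup.get(key, value.strip())


def keep_included(rows, included, column="University"):
    """Keep only rows whose (rationalised) university is in the study."""
    return [row for row in rows if rationalise(row[column]) in included]


# Hypothetical rows and column names, not real survey data.
rows = [
    {"University": "Cambridge uni", "Software": "MS Excel"},
    {"University": "Example Institute", "Software": "Excel"},
]
included = {"University of Cambridge"}
cleaned = [
    {key: rationalise(value) for key, value in row.items()}
    for row in keep_included(rows, included)
]
print(cleaned)
```

Unknown values pass through unchanged, so anything the lookup misses remains visible for manual review.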
- Run survey_2014_analysis.py:
python survey_2014_analysis.py
- This summarises the responses to the survey by grouping the answers to each question and counting how many times each one occurs. It stores the results in a series of csv files (one per question) in the output/summary_csvs/ directory.
- Run comparison_new_old_results.py:
python comparison_new_old_results.py
- This takes the summary files produced by survey_2014_analysis.py and compares them against the results from the original analysis. It stores the results of that comparison in a series of csv files (one per question) in the output/comparison_summary_csvs/ directory.
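The summarise-then-compare logic described above can be sketched in a few lines. This is a simplified illustration using only the standard library, not the code in survey_2014_analysis.py or comparison_new_old_results.py; the function names, the question label "Q1", the CSV layout and all the numbers are hypothetical.

```python
import csv
import io
from collections import Counter


def summarise(rows, question):
    """Count how often each non-empty answer to one question occurs."""
    return Counter(row[question] for row in rows if row[question])


def compare(new_counts, old_counts):
    """Return {answer: (new, old, new - old)} for answers in either analysis."""
    answers = sorted(set(new_counts) | set(old_counts))
    return {
        answer: (
            new_counts.get(answer, 0),
            old_counts.get(answer, 0),
            new_counts.get(answer, 0) - old_counts.get(answer, 0),
        )
        for answer in answers
    }


def write_counts(counts, out_file):
    """Write one summary CSV, most frequent answer first."""
    writer = csv.writer(out_file)
    writer.writerow(["answer", "count"])
    for answer, count in counts.most_common():
        writer.writerow([answer, count])


# Hypothetical responses and "original" counts, not real survey results.
rows = [{"Q1": "Yes"}, {"Q1": "No"}, {"Q1": "Yes"}, {"Q1": ""}]
new = summarise(rows, "Q1")
old = {"Yes": 2, "No": 2}

summary_csv = io.StringIO()
write_counts(new, summary_csv)
print(summary_csv.getvalue())
print(compare(new, old))
```

A non-zero difference for any answer flags a discrepancy between the Python and Excel analyses that is worth investigating.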
The following is a quick reference for the files and scripts, just in case you're wondering what everything does.
Data directory:
- Software-in-research-cleaning.openrefine.tar.gz - OpenRefine export detailing the cleaning steps
- The use of software in research (Responses) 24 Oct 14 - Form Responses 1.csv - the raw (anonymised) data from the survey
- software_in_research_parasable.csv - data after processing to make comma separation more straightforward
- Software-in-research-cleaning.csv - data ready for analysis
Main directory:
- parse_text_column.py - used to create software_in_research_parasable.csv described above
- requirements.txt - describes libraries used by the Python scripts. See "Running the analysis" for details.
- chart_details_lookup.py - stores info about charts to make design neater
- survey_2014_analysis.py - main script for analysing survey responses
- comparison_new_old_results.py - script to compare results from original Excel-based analysis of survey results and the results generated by survey_2014_analysis.py
Other directories:
- results_from_original_2014_analysis - results from original Excel-based analysis of survey results, available from Zenodo
  - This includes ResearchSoftwareSurvey2014Results.xlsx, which is the original analysis conducted in Excel
- output - all charts and results stored as csvs