Setup for Data QA
Automating Data Quality Assurance
This is an ongoing process where we are attempting to collect data and visualize it quickly so we can see if anything looks off.
I still think the best/simplest scenario is to use psychopy for stimulus presentation and use other python utilities to generate a figure from the data.
Alas, we are stuck with eprime and I am struck with inspiration to make things much more complicated to practice using utilities that are not directly designed for this use-case.
Before I dive in, here is a list of tools/services/utilities I will be using to set up the quality assurance service. They each link to a tutorial/explanation.
Prerequisites
QA script exists
we’ve already written/borrowed code to generate an svg
file from the output of an
eprime task, so I will not be covering that.
We will assume there is a script that generates some form of figure
output (in a BIDS
organized fashion).
You can look at the end of the guide for an example script; a sketch of the expected output layout appears just after this prerequisites list.
You have a github account
You have a circleci account
sign up for circleci and connect your github account
You have a dockerhub account
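For concreteness, the example script at the end of this guide drops its figure output into a derivatives tree shaped roughly like this (the subject/session labels are placeholders, and your own script’s layout may differ):

bids/
└── derivatives/
    └── VSTMQA/
        └── sub-&lt;label&gt;/
            └── ses-&lt;label&gt;/
                └── beh/
                    └── sub-&lt;label&gt;_ses-&lt;label&gt;_task-VSTM_swarmplot.svg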
Step 1: Create a reproducible environment to run the QA code
Our QA code for this example is written in python, and currently a good way to share the environment necessary to run/reproduce the code is anaconda.
If you developed the qa code while working in a conda environment, great! Otherwise you will create a conda environment with:
conda create -n eprime_convert python=3.6
where eprime_convert is the name of the environment (you can make this anything you want) and python=3.6 selects the specific version of python (we currently use 3.6).
To activate the newly created environment:
source activate eprime_convert
Now you will look at the import statements at the top of your script
and conda install
the necessary packages.
from convert_eprime import convert
import pandas as pd
import seaborn as sns
from argparse import ArgumentParser
import os
from matplotlib import pyplot as plt
from glob import glob
import shutil
import re
From this, it appears I need to install: convert_eprime, pandas, seaborn, and matplotlib. All the other imports are from built-in packages in python, so they are available by default (you will learn to recognize which packages are built in with practice). My first pass at installing everything would be:
conda install convert_eprime pandas seaborn matplotlib
This would install everything if I didn’t include convert_eprime.
convert_eprime is not tracked by anaconda, and isn’t even tracked by PyPI.
It’s a pet project from another graduate student who was fed up with e-merge.
To install convert_eprime, I need to know how to install a package from a github repo.
Luckily, stackoverflow has an answer for everything.
So the real commands to install everything are:
conda install pandas seaborn matplotlib
pip install git+https://github.com/tsalo/convert-eprime.git
Test your script to make sure it works with these installs. If it complains that you are missing something, install it.
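A quick smoke test can be as simple as asking the script for its help text (this assumes your script is named eprime_convert.py and sits in a code/ folder; adjust the path to wherever yours actually lives):

python code/eprime_convert.py --help

Once the script runs cleanly, you can export your environment to a file so it can be reproduced.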
conda env export > environment.yml
Open up that environment.yml because we need to edit it. It may look something like this:
name: eprime_convert
channels:
- defaults
dependencies:
- blas=1.0=mkl
- ca-certificates=2018.03.07=0
- certifi=2018.10.15=py36_0
- cycler=0.10.0=py36_0
- dbus=1.13.2=h714fa37_1
- expat=2.2.6=he6710b0_0
- fontconfig=2.13.0=h9420a91_0
- freetype=2.9.1=h8a8886c_1
- glib=2.56.2=hd408876_0
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=hb453b48_1
- icu=58.2=h9c2bf20_1
- intel-openmp=2019.0=118
- jpeg=9b=h024ee3a_2
- kiwisolver=1.0.1=py36hf484d3e_0
- libedit=3.1.20170329=h6b74fdf_2
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=8.2.0=hdf63c60_1
- libgfortran-ng=7.3.0=hdf63c60_0
- libpng=1.6.35=hbc83047_0
- libstdcxx-ng=8.2.0=hdf63c60_1
- libuuid=1.0.3=h1bed415_2
- libxcb=1.13=h1bed415_1
- libxml2=2.9.8=h26e45fe_1
- matplotlib=3.0.1=py36h5429711_0
- mkl=2019.0=118
- mkl_fft=1.0.6=py36h7dd41cf_0
- mkl_random=1.0.1=py36h4414c95_1
- ncurses=6.1=hf484d3e_0
- numpy=1.15.4=py36h1d66e8a_0
- numpy-base=1.15.4=py36h81de0dd_0
- openssl=1.0.2p=h14c3975_0
- pandas=0.23.4=py36h04863e7_0
- patsy=0.5.1=py36_0
- pcre=8.42=h439df22_0
- pip=18.1=py36_0
- pyparsing=2.3.0=py36_0
- pyqt=5.9.2=py36h05f1152_2
- python=3.6.6=h6e4f718_2
- python-dateutil=2.7.5=py36_0
- pytz=2018.7=py36_0
- qt=5.9.6=h8703b6f_2
- readline=7.0=h7b6447c_5
- scipy=1.1.0=py36hfa4b5c9_1
- seaborn=0.9.0=py36_0
- setuptools=40.5.0=py36_0
- sip=4.19.8=py36hf484d3e_0
- six=1.11.0=py36_1
- sqlite=3.25.2=h7b6447c_0
- statsmodels=0.9.0=py36h035aef0_0
- tk=8.6.8=hbc83047_0
- tornado=5.1.1=py36h7b6447c_0
- wheel=0.32.2=py36_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=ha838bed_2
- pip:
- convert-eprime==0.0.1a0
- future==0.17.1
prefix: /home/james/.conda/envs/eprime_convert
If we were only going to run this environment on identical (or near identical) hardware, then this is fine, but if we want a more flexible yml, then we need to start editing.
A few things to do:
- remove the prefix
- change the convert-eprime entry to point at the github repo
- remove the dependency installs and the machine-specific build strings
After editing, the file should look something like this:
name: eprime_convert
channels:
- defaults
dependencies:
- matplotlib=3.0.1
- numpy=1.15.4
- pandas=0.23.4
- seaborn=0.9.0
- python=3.6
- pip:
- git+https://github.com/tsalo/convert-eprime.git
Much cleaner (I kept numpy as its own entry just to be explicit; I don’t believe it’s actually necessary to include).
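If you want to double-check that the trimmed file still builds, one option is to recreate the environment under a throwaway name (the name here is arbitrary; this is purely a sanity check):

conda env create -n eprime_convert_test -f environment.yml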
We have created a yml that builds essentially the same environment we use to develop and run our code.
This will be good for deploying/sharing the code in multiple contexts.
However, we are going to lock down the environment in which the code runs even further using docker.
Basically, we are going to build a docker container that has our conda environment installed on it.
We can do that by making a Dockerfile that could look like this:
# https://medium.com/@chadlagore/conda-environments-with-docker-82cdc9d25754
FROM continuumio/miniconda3:4.5.11
COPY eprime_convert.yml /env/
RUN conda env create -f /env/eprime_convert.yml &&\
conda clean --all
# Pull the environment name out of the environment.yml
RUN echo "source activate $(head -1 /env/eprime_convert.yml | cut -d' ' -f2)" > ~/.bashrc
# note: ENV does not evaluate shell commands, so the environment name is hard-coded here
ENV PATH /opt/conda/envs/eprime_convert/bin:$PATH
ENTRYPOINT [ "/bin/bash", "-c" ]
and we can build the Dockerfile with this command:
docker build -t jdkent/eprime_convert .
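Before pushing, it can be worth a quick check that the conda environment actually made it into the image. Since the entrypoint is /bin/bash -c, you can hand the container a command as a single string (purely a sanity check):

docker run --rm jdkent/eprime_convert "conda env list"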
The tag is linked to my dockerhub account so when I push the container to dockerhub it will go to the correct location. I will push the container to dockerhub with the following command:
docker push jdkent/eprime_convert
The container can be seen on dockerhub.
Excellent!
With this in place we can move on to setting up circleci.
Step 2: Use circleci to run the code after each data commit
circleci
is an online service that can run arbitrary code whenever something
happens in a github repository.
The vagueness of the description hides the power behind this service.
Essentially, your imagination is the limit for what you can do.
Follow the official circleci docs to add the repository to circleci so that circleci will begin triggering builds when commits appear in that repository.
Inside your git repository add a .circleci folder and make a config.yml inside that folder; that file is what circleci will read.
Here is a full example config.yml for circleci; I will break it down afterwards.
# Python CircleCI 2.0 configuration file
#
# Check https://circleci.com/docs/2.0/language-python/ for more details
#
version: 2
jobs:
build:
docker:
# specify the version you desire here
- image: jdkent/eprime_convert:latest
working_directory: ~/repo
steps:
- run:
name: clone github repo
command: |
            git clone https://${GITHUB_TOKEN}@github.com/HBClab/BetterTaskSwitch.git
- run:
name: check if data QA should be skipped
command: |
cd ~/repo/BetterTaskSwitch
if [[ "$( git log --format=oneline -n 1 $CIRCLE_SHA1 | grep -i -E '\[skip[ _]?ci\]' )" != "" ]]; then
echo "Skipping Data QA"
circleci step halt
fi
- run:
name: run eprime convert
command: |
source activate eprime_convert
~/repo/BetterTaskSwitch/code/eprime_convert.py \
-b ~/repo/BetterTaskSwitch/bids \
-r ~/repo/BetterTaskSwitch/task-full_resp-srbox \
-c ~/repo/BetterTaskSwitch/code/config_file/task_switch.json \
-a mri \
--sub-prefix GE120
- run:
name: add and commit files
command: |
cd ~/repo/BetterTaskSwitch
git config credential.helper 'cache --timeout=120'
git config user.email "helper@help.com"
git config user.name "QA Bot"
# Push quietly to prevent showing the token in log
git add .
git commit -m "[skip ci] $(date)"
git push -q https://${GITHUB_TOKEN}@github.com/HBClab/BetterTaskSwitch.git master
- `version: 2`: the overall version of circleci to use; they are deprecating version one, so all configs should be version 2.
- `jobs:`: the list of things I want circleci to run.
- `build:`: this provides the option to choose what machinery I want circleci to run on.
- `docker:`: I want to use docker to select the environment my jobs are run in.
- `- image: jdkent/eprime_convert:latest`: this selects the docker image stored on dockerhub that we just made in the last step.
- `working_directory: ~/repo`: where the command line interface will drop me when I’m running commands in the docker container we selected (I don’t really take advantage of this option).
- `steps:`: the steps we will take to run the job.
- `- run:`: instantiation of a step to take in the job.
- `name: clone github repo`: the name of the step we are taking.
- `command: |`: the actual command we will be running in the docker container (the `|` (pipe) allows us to type the command on a separate line so the line of code does not look crowded).
- `- run:` / `name: check if data QA should be skipped` / `command: |`: this command checks whether `[skip ci]` or `[skip_ci]` is in the most recent commit message and stops the circleci build if it is.
- `- run:` / `name: run eprime convert` / `command: |`: this command activates the conda environment and runs our data QA script with the appropriate inputs, generating the figure output.
- `- run:` / `name: add and commit files` / `command: |`: this command creates a github identity so the bot can push the new data to the github repository (importantly, the commit message contains `[skip ci]`; what would happen if that wasn’t there?).
One important detail I’ve left out is what’s up with ${GITHUB_TOKEN}.
That is a special variable I’ve defined using circleci’s environment variable settings.
Environment variables like this are great for storing values used for authentication (e.g. passwords or tokens) that you don’t want everyone to be able to see.
In this instance I’m using a github token.
You can make your own github token by going to your github profile, clicking on settings, clicking on developer settings, and then creating a new token.
see the github announcement about tokens
Warning: you will only have explicit access to your token when you create it, so make sure you copy the token somewhere safe on your computer.
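If you want to be sure the token works before handing it to circleci, a quick one-off check from your own machine is to list the repository’s refs with it (substitute your own token and repository; the lines below are just an illustration):

export GITHUB_TOKEN=&lt;paste your token here&gt;
git ls-remote https://${GITHUB_TOKEN}@github.com/HBClab/BetterTaskSwitch.git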
Once you have circleci set up and the config file inside your repository, you are ready to add the files, push the changes back up to github, and observe your first circleci build. The steps would look something like this:
git add .circleci/config.yml
git commit -m 'add circleci build configuration'
git push origin master
Note: the error I ran into when doing this was incorrect permissions of eprime_convert.py
in my repository.
I gave the file executable permissions with the following command:
git update-index --chmod=+x eprime_convert.py
Step 3: Display the figures using github-pages
We have created a reproducible environment and set up circleci to run every time we push a new commit to the repository. The next step is to easily visualize all the figures we have created. We will do this using github-pages.
Follow the github instructions to have github start hosting your repository as a static webpage (using github-pages).
I’m using the minimal theme and I suggest that you use that theme too.
Pull the changes to your repository.
You will have an _config.yml
file in your base directory.
Change the file to look something like this:
theme: jekyll-theme-minimal
plugins:
- jekyll-relative-links
title: [BetterTaskSwitch]
description: [Monitoring BetterTaskSwitch Data]
logo: https://avatars0.githubusercontent.com/u/24659915?s=400&u=12a4f626488fe0f692d77f355d9dd9f3e4e63f7a&v=4
baseurl: /BetterTaskSwitch
You will change the title, description, and baseurl to whatever is specific to the repository you are working on. The logo points to our (HBClab) github logo.
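To serve the site locally later on (and to make the jekyll-relative-links plugin available), the repository also needs a Gemfile; a minimal one built around the github-pages gem (which bundles the plugins github supports) is one reasonable option:

source 'https://rubygems.org'
gem 'github-pages', group: :jekyll_plugins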
Next we will add liquid
syntax to display all the swarmplots that are in our
repository.
You will place this code in your README.md
file located at the
base of your repository.
{% assign my_files = site.static_files | where:"extname",".svg" | sort:"modified_time" | reverse %}
{% capture sevendays %}{{'now' | date: "%s" | minus : 604800 }}{% endcapture %}
{% for taskswitch in my_files %}
{% if taskswitch.name contains "swarmplot" %}
{% capture file_mod %}{{taskswitch.modified_time | date: "%s"}}{% endcapture %}
{% if file_mod > sevendays %}
### Recent
{% else %}
### Older
{% endif %}
**{{taskswitch.name}}**
![{{taskswitch.name}}]({{ taskswitch.path | prepend:site.baseurl }})
{% endif %}
{% endfor %}
Note: This stackoverflow answer helped me figure out how to parse and compare dates.
I will explain important bits of this code:
{% assign my_files = site.static_files | where:"extname",".svg" | sort:"modified_time" | reverse %}
This line creates a variable called my_files that searches through all static files and keeps the ones whose extension is .svg.
Next, the resulting array is piped to a sort on the date each file was last modified (from oldest -> newest).
Finally, the result is reversed so that the array is sorted from newest -> oldest.
{% capture sevendays %}{{'now' | date: "%s" | minus : 604800 }}{% endcapture %}
This line creates a variable called sevendays, which captures the current time in seconds (%s) and then subtracts seven days’ worth of seconds (7 * 24 * 60 * 60 = 604800).
This will be used to tell whether an image is more than seven days old.
{% capture file_mod %}{{taskswitch.modified_time | date: "%s"}}{% endcapture %}
This line creates the variable file_mod. file_mod is the date (in seconds) when the file was last modified. This means we can directly compare file_mod and sevendays to test whether the file is older or newer than seven days.
![{{taskswitch.name}}]({{ taskswitch.path | prepend:site.baseurl }})
This is the last line I will explain since it may look confusing.
It combines both markdown syntax and liquid syntax.
Here is the markdown portion: ![name](url).
That markdown syntax displays an inline image.
The double curly brackets are liquid syntax.
These return strings that can be interpreted by markdown.
taskswitch.path is the path to the file relative to the top directory of the repository (e.g. /some/dir/file.svg).
However, because of how github parses the url, we also need to include the site’s baseurl, so we prepend it to the path.
If you look back, you can see we defined the baseurl variable in _config.yml.
This is the difference between searching for a file under https://hbclab.github.io and under https://hbclab.github.io/BetterTaskSwitch (we want the latter).
Next we want to make sure we did everything correctly. We can do this by serving the jekyll website we made locally. Please follow the github instructions to do this.
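Serving locally usually boils down to something like this (assuming Ruby and Bundler are already installed; the linked github instructions cover the details):

bundle install
bundle exec jekyll serve

The site should then be available at http://127.0.0.1:4000 under the baseurl you set in _config.yml.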
Once we are satisfied with how the website looks, we can add/commit/push the changes to github.
git add _config.yml Gemfile README.md
git commit -m 'add website functionality'
git push origin master
That’s it! Once you’ve done all that, you can reap the benefits of having an automated system that generates figures and makes them visible via a website.
Example Script
This code was written to work, not to be beautiful; be aware that it may not represent best (or even recommended) practices.
#!/usr/bin/env python
# generate pipelines that read in the eprime txt files and output a
# machine readable summary and a useful figure for quality assurance.
from convert_eprime import convert
import pandas as pd
import numpy as np
from argparse import ArgumentParser
import os
from glob import glob
import shutil
import re
from matplotlib import pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
sns.set_palette("bright")
# expressions
session_dict = {1: 'pre', 2: 'post'}
def get_parser():
"""Build parser object for cmdline processing"""
parser = ArgumentParser(description='betterVTSM.py: converts '
'eprime output to tsv in BIDS format')
parser.add_argument('-b', '--bids', action='store',
help='root folder of a BIDS valid dataset')
parser.add_argument('-r', '--raw-dir', action='store',
help='directory where edat and txt files live')
parser.add_argument('-p', '--participant-label', action='store', nargs='+',
help='participant label(s) to process')
parser.add_argument('-s', '--session-label', action='store', nargs='+',
help='session label(s) to process (either 1 or 2)')
parser.add_argument('-c', '--config', action='store', required=True,
help='config file to process the eprime txt. '
'see convert_eprime for details')
parser.add_argument('--sub-prefix', action='store',
help='add additional characters to the prefix of the participant label')
return parser
def copy_eprime_files(src, dest):
# collect edat2 and txt files
types = ('*.edat2', '*.txt')
raw_files = []
for type in types:
raw_files.extend(glob(os.path.join(src, type)))
# copy all files into sourcedata (if not already there)
copied_files = 0
for file in raw_files:
out_file = os.path.join(dest, os.path.basename(file))
if not os.path.isfile(out_file):
shutil.copy(file, dest)
copied_files += 1
return copied_files
def main():
"""Entry point"""
opts = get_parser().parse_args()
# set input/output directories
bids_dir = os.path.abspath(opts.bids)
# ensure bids directory exists
os.makedirs(bids_dir, exist_ok=True)
sourcedata = os.path.join(bids_dir, 'sourcedata', 'VSTM')
derivatives = os.path.join(bids_dir, 'derivatives')
# ensure sourcedata and derivatives exist
os.makedirs(sourcedata, exist_ok=True)
os.makedirs(derivatives, exist_ok=True)
# assume data is already copied over if raw_dir isn't specified
if opts.raw_dir:
raw_dir = os.path.abspath(opts.raw_dir)
        # the output is only the number of copied files (the filenames themselves are thrown away)
files_copied = copy_eprime_files(raw_dir, sourcedata)
print('{num} file(s) copied'.format(num=files_copied))
else:
print('-r not specified, assuming data are in the correct location: '
'{dir}'.format(dir=sourcedata))
# collect participant labels
if opts.participant_label:
participants = opts.participant_label
else:
participant_files = glob(os.path.join(sourcedata, 'VSTM_*.txt'))
sub_expr = re.compile(r'^.*VSTM_PACR-(?P<sub_id>[0-9]{3})-(?P<ses_id>[1-2]).txt')
participants = []
for participant_file in participant_files:
print(participant_file)
sub_dict = sub_expr.search(participant_file).groupdict()
participants.append(sub_dict['sub_id'])
# collect sessions
if opts.session_label:
sessions = opts.session_label
else:
sessions = [1, 2]
filename_template = 'VSTM_PACR-{sub}-{ses}.{ext}'
participant_dict = {}
for participant in participants:
participant_dict[participant] = {}
for session in sessions:
# initialize sub/ses dictionary
participant_dict[participant][session] = {'edat': None, 'txt': None}
# get the edat file (if it exists)
edat_file = filename_template.format(sub=participant,
ses=session,
ext='edat2')
if os.path.isfile(os.path.join(sourcedata, edat_file)):
participant_dict[participant][session]['edat'] = os.path.join(
sourcedata, edat_file
)
else:
print('{edat} missing!'.format(edat=edat_file))
participant_dict[participant].pop(session)
continue
# get the txt file (if it exists)
txt_file = filename_template.format(sub=participant,
ses=session,
ext='txt')
if os.path.isfile(os.path.join(sourcedata, txt_file)):
participant_dict[participant][session]['txt'] = os.path.join(
sourcedata, txt_file
)
else:
print('{txt} missing!'.format(txt=txt_file))
participant_dict[participant].pop(session)
continue
# process the data per session
for participant in participant_dict.keys():
if opts.sub_prefix:
participant_label = opts.sub_prefix + participant
else:
participant_label = participant
for session in participant_dict[participant].keys():
            # type coercion to integer
session = int(session)
session_label = session_dict[session]
edat_file = participant_dict[participant][session]['edat']
txt_file = participant_dict[participant][session]['txt']
config = os.path.abspath(opts.config)
folder = 'beh'
work_file = os.path.join(sourcedata, 'work', 'sub-' + participant_label,
'ses-' + session_label, 'beh',
'sub-{sub}_ses-{ses}_task-VSTM_raw.csv'.format(sub=participant_label, ses=session_label))
# ensure directory exists
os.makedirs(os.path.dirname(work_file), exist_ok=True)
# conversion to csv
convert.text_to_rcsv(txt_file, edat_file, config, work_file)
# create dataframe
df = pd.read_csv(work_file)
#drops practice trials
df.drop(df[(df.Running == 'ColorPractice') | (df.Running == 'ShapePractice') | (df.Running == 'PracticeBoth')].index, inplace=True)
            #drop all NaN entries, re: trials where no response was desired (at beginning of all VSTM blocks)
df.dropna(how='all', inplace=True)
# rename column headers
df.rename(index=str, columns={"Running": "trial_type",
"Probe.ACC": "correct",
"Probe.RT": "response_time",
"Probe.CRESP": "probe_novelty"}, inplace=True)
# convert response_time into seconds
df['response_time'] = df['response_time'] / 1000
# change 'correct' column from float to int
df.correct = df.correct.astype(int)
# create new column for block number
df['block'] = df['trial_type']
# replace trial_type elements with simpler description
df['trial_type'].replace({'SimColour':'color', 'SimShape':'shape',
'SimBoth':'color_and_shape'}, inplace=True)
# replace probe_novelty elements with a more sensible set
# {/} -> novel -> 1
# z -> repeat -> 0
df['probe_novelty'].replace({'{/}': 1, 'z': 0}, inplace=True)
# write processed data to file
base_file = 'sub-{sub}_ses-{ses}_task-VSTM_events.tsv'
bids_file = os.path.join(bids_dir,
'sub-' + participant_label,
'ses-' + session_label,
folder,
base_file.format(
sub=participant_label,
ses=session_label)
)
# make sure the directory exists
os.makedirs(os.path.dirname(bids_file), exist_ok=True)
df.to_csv(bids_file, sep='\t', index=False)
# Do some quality assurance
derivatives_dir = os.path.join(derivatives, 'VSTMQA')
os.makedirs(derivatives_dir, exist_ok=True)
base_json = 'sub-{sub}_ses-{ses}_task-VSTM_averages.json'
out_json = os.path.join(derivatives_dir,
'sub-' + participant_label,
'ses-' + session_label,
folder,
base_json.format(
sub=participant_label,
ses=session_label)
)
base_fig = 'sub-{sub}_ses-{ses}_task-VSTM_swarmplot.svg'
out_fig = os.path.join(derivatives_dir,
'sub-' + participant_label,
'ses-' + session_label,
folder,
base_fig.format(
sub=participant_label,
ses=session_label)
)
            # make the derivatives directory for the participant/session in VSTMQA
os.makedirs(os.path.dirname(out_json), exist_ok=True)
# get average response time and average correct
json_dict = {'response_time': None, 'correct': None}
json_dict['response_time'] = df['response_time'].where(df['correct'] == 1).mean()
json_dict['correct'] = df['correct'].mean()
ave_res = pd.Series(json_dict)
ave_res.to_json(out_json)
if not os.path.isfile(out_fig):
# make a swarmplot
myplot = sns.swarmplot(x="trial_type", y="response_time",
hue="correct", data=df, size=6)
# set the y range larger to fit the legend
myplot.set_ylim(0, 10.0)
# remove the title of the legend
myplot.legend(title=None)
# rename the xticks
myplot.set_xticklabels(['Color', 'Shape', 'Shape and Color'])
# rename xlabel
myplot.set_xlabel('trial type')
myplot.set_ylabel('response time (seconds)')
# rename the legend labels
new_labels = ['incorrect', 'correct']
for t, l in zip(myplot.legend_.texts, new_labels):
t.set_text(l)
# save the figure
myplot.figure.savefig(out_fig, dpi=72)
# remove all plot features from memory
plt.clf()
if __name__ == '__main__':
main()
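For reference, a local run of this example script might look something like the following; the paths, config file name, and prefix below are placeholders for illustration, not the real study layout:

# all paths and file names below are made up for illustration
python betterVTSM.py \
    -b bids \
    -r /path/to/raw/eprime/files \
    -c vstm_config.json \
    --sub-prefix GE120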