cd ii
dotnet build
cd .\bin\Debug\net6.0
#Generic help (lists modes)
.\ii --help
#Specific help (for a given mode e.g. 'db')
.\ii db --help
An example command (evaluating all the images in C:\MassiveImageArchive) would be as follows:
.\ii dir -d C:/MassiveImageArchive --storereport
The output of this command (on some anonymised data):
Resource,ResourcePrimaryKey,ProblemField,ProblemValue,PartWords,PartClassifications,PartOffsets
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0008,0005)",ISO_IR 100,ISO,Organization,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0018,1016)",MathWorks,MathWorks,Organization,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0018,1018)",MATLAB,MATLAB,Person,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0020,0010)",DDSM,DDSM,Person,0
C:\MassiveImageArchive\DOI\000002.dcm,1.3.6.1.4.1.9590.100.1.2.423893162212842428532864042250901777433,"(0008,0005)",ISO_IR 100,ISO,Organization,0
C:\MassiveImageArchive\DOI\000002.dcm,1.3.6.1.4.1.9590.100.1.2.423893162212842428532864042250901777433,"(0018,1018)",MATLAB,MATLAB,Person,0
C:\MassiveImageArchive\DOI\000002.dcm,1.3.6.1.4.1.9590.100.1.2.423893162212842428532864042250901777433,"(0020,0010)",DDSM,DDSM,Person,0
C:\MassiveImageArchive\DOI\000003.dcm,1.3.6.1.4.1.9590.100.1.2.84709658512632788123980174250729731712,"(0008,0005)",ISO_IR 100,ISO,Organization,0
C:\MassiveImageArchive\DOI\000003.dcm,1.3.6.1.4.1.9590.100.1.2.84709658512632788123980174250729731712,"(0018,1016)",MathWorks,MathWorks,Organization,0
C:\MassiveImageArchive\DOI\000003.dcm,1.3.6.1.4.1.9590.100.1.2.84709658512632788123980174250729731712,"(0018,1018)",MATLAB,MATLAB,Person,0
C:\MassiveImageArchive\DOI\000003.dcm,1.3.6.1.4.1.9590.100.1.2.84709658512632788123980174250729731712,"(0020,0010)",DDSM,DDSM,Person,0
C:\MassiveImageArchive\DOI\Calc-Test_P_00038_LEFT_CC\1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009\1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992\000000.dcm,1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294,"(0008,0005)",ISO_IR 100,ISO,Organization,0
C:\MassiveImageArchive\DOI\Calc-Test_P_00038_LEFT_CC\1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009\1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992\000000.dcm,1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294,"(0018,1016)",MathWorks,MathWorks,Organization,0
C:\MassiveImageArchive\DOI\Calc-Test_P_00038_LEFT_CC\1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009\1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992\000000.dcm,1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294,"(0018,1018)",MATLAB,MATLAB,Person,0
C:\MassiveImageArchive\DOI\Calc-Test_P_00038_LEFT_CC\1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009\1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992\000000.dcm,1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294,"(0020,0010)",DDSM,DDSM,Person,0
[...]
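The report is plain CSV, so it can be inspected with ordinary tooling. The following is a minimal sketch, not part of IsIdentifiable, that parses the first rows of the sample output above and counts failures by classification. The function names are hypothetical, and every column is treated as an opaque string.

```python
import csv
import io
from collections import Counter

# First rows of the sample output above, embedded so the sketch is self-contained.
SAMPLE_REPORT = r'''Resource,ResourcePrimaryKey,ProblemField,ProblemValue,PartWords,PartClassifications,PartOffsets
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0008,0005)",ISO_IR 100,ISO,Organization,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0018,1016)",MathWorks,MathWorks,Organization,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0018,1018)",MATLAB,MATLAB,Person,0
C:\MassiveImageArchive\DOI\000001.dcm,1.3.6.1.4.1.9590.100.1.2.64408251011211630124074907290278463475,"(0020,0010)",DDSM,DDSM,Person,0
'''


def load_failures(report_text):
    """Parse report text into a list of dicts keyed by the header columns."""
    return list(csv.DictReader(io.StringIO(report_text)))


def count_by_classification(rows):
    """Count how many rows carry each PartClassifications value."""
    return Counter(row["PartClassifications"] for row in rows)


rows = load_failures(SAMPLE_REPORT)
print(rows[0]["ProblemField"])        # the quoted DICOM tag is unquoted by the csv module
print(count_by_classification(rows))
```

To read a real report, pass the contents of the file written by --storereport to load_failures instead of the embedded sample.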
You can also run OCR on pixel data by passing the --tessdirectory flag:
.\ii dir -d C:\MassiveImageArchive --storereport --tessdirectory E:/SmiServices/data/tessdata/
The directory must be named tessdata and contain a file named eng.traineddata.
The db verb runs IsIdentifiable to detect identifiable data in a database. An example command would be:
./ii db -d "Server=localhost;Database=RDMP_ExampleData;Uid=SA;Password=<YourStrong@Passw0rd>;Trust Server Certificate=true" -p MicrosoftSqlServer --storereport -t Biochemistry
Primary Author: Thomas
Is Identifiable Reviewer is a cross-platform, text-based UI for managing the anonymisation processes in which PII is detected and removed. It serves as a management console for the rulesbase of IsIdentifiable and as a downstream process for validating the results and redacting the database.
The review process for potential PII
There are 3 activities that can be undertaken using the reviewer: reviewing the output of IsIdentifiable, redacting the database, and managing the rules base.
The application runs as a sub-verb of the ii CLI (see SmiRunner). You can see the application help by running:
.\ii review --help
The IsIdentifiable tool applies NLP and the rules base to identify PII data in the database. A sample output file, ExampleReport, is included.
Open the report using the -f somefile.csv command line option or File->Open Report.
Once loaded you can iterate the reports sequentially using the 'Sequential' tab or get an overview of all the issues encountered (aggregated by frequency) in the 'Tree View' tab.
The 'Sequential' view operates on one failure at a time. It shows the full string at the top, with the failures highlighted in green. At the bottom left is the classification of the failure: Person, Organisation, Date, etc. At the bottom right is the column (or DICOM tag) where the failure was found. It is important to check this column because, for example, you should Ignore a hospital name if the column is InstitutionName, but Update it if the column is StudyDescription.
The Next and Prev buttons move sequentially through the failures, i.e. Next does not skip over failures that are matched by existing rules.
'Tree View' showing PII detected aggregated by unique failing value and column where PII was found
The 'Tree View' sorts all of the failures by number of occurrences. This tree view shows all the categories of rules and then all the categories of failures. It also shows the list of Conflicting rules which is where a failure matches both an Ignore and an Update rule.
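The aggregation behind this view can be approximated outside the tool. The sketch below is an illustration only, not the Reviewer's implementation: it ranks unique (column, value) pairs by number of occurrences, using the pairs from the sample report rows for files 000001 to 000003. The function name is hypothetical.

```python
from collections import Counter

# (ProblemField, ProblemValue) pairs from the sample report rows for files 000001 to 000003.
FAILURES = [
    ("(0008,0005)", "ISO_IR 100"), ("(0018,1016)", "MathWorks"),
    ("(0018,1018)", "MATLAB"), ("(0020,0010)", "DDSM"),
    ("(0008,0005)", "ISO_IR 100"), ("(0018,1018)", "MATLAB"),
    ("(0020,0010)", "DDSM"),
    ("(0008,0005)", "ISO_IR 100"), ("(0018,1016)", "MathWorks"),
    ("(0018,1018)", "MATLAB"), ("(0020,0010)", "DDSM"),
]


def rank_failures(failures):
    """Return unique (column, value) pairs with counts, most frequent first."""
    return Counter(failures).most_common()


ranked = rank_failures(FAILURES)
for (column, value), count in ranked:
    print(f"{count:>3}  {column}  {value}")
```

Reviewing the most frequent pairs first means a single Ignore or Update decision clears the largest number of failures.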
Each instance of potential PII found by IsIdentifiable is termed a 'failure' (the existing anonymisation process has failed to strip this PII). A 'failure' can be either a false positive or a genuine case of PII. Make a decision for each failure whether to ignore it or 'report' it.
Review the reports and mark each failure as either Ignore (this is a false positive) or Update (this is PII and needs to be redacted). This will result in a new rule being added to either NewRules.yaml (Ignore) or Reportlist.yaml (Update). Once a rule is written it will be applied automatically to future reports that are loaded, eliminating the need to make duplicate decisions. After using Ignore or Update the display moves on to the next failure, skipping over those which are matched by existing rules.
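The effect of accumulated rules on later failures can be sketched as follows. This is a simplified model under stated assumptions, not the Reviewer's actual rule format or matching logic: the Rule class, its fields and the classify function are all hypothetical, and Python's re module stands in for the tool's regex engine. The example rules reuse the Manufacturer, InstitutionName and StudyDescription cases mentioned in this document.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    action: str   # "Ignore" or "Update" (hypothetical field names)
    column: str   # column or DICOM tag name the rule applies to
    pattern: str  # regex tested against the failing value


def classify(column, value, rules):
    """Decide what existing rules say about one failure."""
    actions = {
        rule.action
        for rule in rules
        if rule.column == column and re.search(rule.pattern, value)
    }
    if len(actions) > 1:
        return "conflict"        # matched by both an Ignore and an Update rule
    if actions == {"Ignore"}:
        return "ignore"          # known false positive, skipped
    if actions == {"Update"}:
        return "update"          # known PII, will be redacted
    return "needs review"        # no rule yet, a reviewer must decide


RULES = [
    Rule("Ignore", "Manufacturer", r"^AGFA$"),
    Rule("Ignore", "InstitutionName", r"John Smith Hospital"),
    Rule("Update", "StudyDescription", r"John Smith Hospital"),
]

print(classify("Manufacturer", "AGFA", RULES))
print(classify("StudyDescription", "John Smith Hospital^MRI Head", RULES))
print(classify("PatientComments", "seen by Dr Jones", RULES))
```

Scoping each rule to a column is what lets the same hospital name be ignored in InstitutionName yet redacted in StudyDescription.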
Conceptually these rules are slightly different from the IsIdentifiable rules. IsIdentifiable first uses rules to spot known PII. Then it uses an NLP (NER) tool which attempts to find more PII. Finally it uses Allowlist rules to ignore known false positives. Ideally these rules should be fine-tuned to reduce the work of the reviewer so, for example, if the reviewer shows that 90% of failures are due to Manufacturer=AGFA it would be wise to manually edit the IsIdentifiable rules. The Reviewer rules are different in that they are used to filter the IsIdentifiable output and either ignore or redact its failure reports. The syntax of the rules files looks similar but is used differently, and has no effect on future runs of IsIdentifiable, only on future Reviews.
The Options | Custom Patterns menu item, when ticked, will provide the opportunity to edit the Ignore/Update rule before it is saved. This allows you to make fine adjustments to the exact pattern which will be redacted. Note that all bracketed patterns are redacted so you can add (or remove) any as necessary. For example, if the full string is John Smith Hospital^MRI Head^(20/11/2020) but only the date has been detected you could still redact the hospital name as well by editing the pattern to be (John Smith Hospital)\^.*\^\((\d\d/\d\d/\d\d\d\d)\)$ (i.e. adding the name in brackets; the literal ^ separators are escaped as \^ because an unescaped ^ is a regex anchor).
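The pattern edit described above can be checked with a short sketch. It makes several assumptions: Python's re module stands in for the tool's regex engine, the literal ^ separators are escaped as \^ (an unescaped ^ is an anchor in standard regex syntax), and REDACTED is a hypothetical placeholder rather than the text the tool actually writes. The redact_groups helper is likewise hypothetical and assumes the bracketed groups do not overlap.

```python
import re


def redact_groups(pattern, value, placeholder="REDACTED"):
    """Replace the text matched by every capture group with a placeholder."""
    match = re.search(pattern, value)
    if match is None:
        return value  # pattern does not apply, leave the value untouched
    spans = [match.span(i) for i in range(1, match.re.groups + 1)]
    # Work from the right-most span backwards so earlier offsets stay valid.
    for start, end in sorted((s for s in spans if s != (-1, -1)), reverse=True):
        value = value[:start] + placeholder + value[end:]
    return value


FULL_STRING = "John Smith Hospital^MRI Head^(20/11/2020)"
DATE_ONLY = r"\((\d\d/\d\d/\d\d\d\d)\)$"
NAME_AND_DATE = r"(John Smith Hospital)\^.*\^\((\d\d/\d\d/\d\d\d\d)\)$"

print(redact_groups(DATE_ONLY, FULL_STRING))      # John Smith Hospital^MRI Head^(REDACTED)
print(redact_groups(NAME_AND_DATE, FULL_STRING))  # REDACTED^MRI Head^(REDACTED)
```

Only the bracketed groups are replaced: the escaped outer brackets around the date match the literal ( and ) in the value, so they survive redaction.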
The Custom Patterns window provides several options to edit the pattern:
- x - clears the currently typed pattern
- F - creates a regex pattern that matches the full input value
- G - creates a regex pattern that matches only the failing part(s)
- \d - replaces all digits with regex wildcards
- \c - replaces all characters with regex wildcards
- \d\c - replaces all digits and characters with regex wildcards
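As a rough illustration of what these options do to a pattern, the sketch below builds a full-value pattern and optionally generalises digits and letters. It is not the Reviewer's code: the generalise function is hypothetical, and the choice of \d for digits and [A-Za-z] for characters is an assumption about what "regex wildcards" means here.

```python
import re


def generalise(value, digits=False, letters=False):
    """Build an anchored pattern for value, optionally widening digits and letters."""
    parts = []
    for ch in value:
        if digits and ch in "0123456789":
            parts.append(r"\d")           # any digit instead of this exact digit
        elif letters and ch.isascii() and ch.isalpha():
            parts.append("[A-Za-z]")      # any ASCII letter instead of this exact letter
        else:
            parts.append(re.escape(ch))   # keep everything else literal
    return "^" + "".join(parts) + "$"


exact = generalise("20/11/2020")               # matches the full input value only
any_date = generalise("20/11/2020", digits=True)
any_code = generalise("AB12", digits=True, letters=True)

print(exact)
print(any_date)
print(any_code)
```

Widening digits turns a rule for one date into a rule for every value with the same layout, which is useful for an Update rule but should be applied with care to an Ignore rule, since it will also hide values that have not been reviewed.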
Once all 'failures' in a report have been processed, and each has either been ignored or had a 'report' rule generated, you can redact the database. This is done by running the application using the -u and -t flags.
Since you may have several servers / databases that are processed using this tool, it is necessary to indicate where UPDATE commands should be run. This is done by putting the connection string in a 'targets' file:
- Name: My Server
  ConnectionString: Server=localhost;Username=root;Password=zombie
  DatabaseType: MySql
Example targets file
The following flags should be combined to successfully redact the database:
Flag | Example | Purpose |
---|---|---|
-f | -f ./ExampleReport.csv | Indicates which IsIdentifiable output report to redact. You must have completed the review process for this report |
-u | -u ./misses.csv | Indicates that you want to update the database. The file value must be included and is where reports that are not covered by rules generated in the review process are output. If you have completed the review process correctly this file should be empty after execution completes |
-t | -t z:\temp\targets.yaml | Path to a file containing the connection string (and DBMS type) of the relational database server that has the table requiring redaction |
ii.exe review -f ./ExampleReport.csv -u ./misses.csv -t z:\temp\targets.yaml
Example redaction command
Over time the number of rules in IsIdentifiable and the reviewer will increase. It can be beneficial to move ignore rules upstream from the reviewer to the IsIdentifiable rulebase especially for commonly encountered reports. This will reduce the number of false positives and the size of report files.
The 'Rules Manager' tab provides visualisation and control over the rules used by the IsIdentifiable tool ('Analyser Rules') and the Reviewer ('Reviewer Rules'). You should periodically review the rules base to ensure there are no mistakes and to identify candidates for pushing upstream into the analyser.
Key | Function |
---|---|
<Delete> | Removes a rule from the rulesbase |
<Enter> | Opens menu (if any) for interacting with the highlighted rule(s) |