
Commit 7bd049c

author
Tim Band
committed
Overview changes based on Albert's feedback
1 parent 821aa6c commit 7bd049c

1 file changed

Lines changed: 73 additions & 2 deletions

File tree

docs/source/overview.rst

@@ -185,7 +185,7 @@ In such a case, the file is alterable by hand as long as the YAML structure is m
185185
Datafaker configuration phase
186186
-----------------------------
187187

188-
These commands are not really part of the Reduce phase, but allow the user to configure
188+
The following commands are not really part of the Reduce phase, but allow the user to configure
189189
what the Reduce phase will entail (and hence also what the Repopulate phase will entail).
190190

191191
- ``datafaker configure-tables`` makes a file called ``config.yaml`` that describes what needs to happen to each table.
@@ -240,11 +240,82 @@ processes applied to them as it is these files that can be extracted from the
240240
private network or Trusted Research Environment to allow the construction of
241241
the synthetic data in a less sensitive computing environment, if required.
242242

243-
The sensitive database is no longer required in Datafaker's operation.
243+
.. list-table:: Information Governance Classification of Each Datafaker Output
244+
   :widths: 10 20 20 20 10 10 20
245+
246+
   * - Artefact
247+
     - Derived from real data?
248+
     - Contains patient-level data?
249+
     - Granularity
250+
     - Privacy risk
251+
     - IG approval required?
252+
     - Can leave TRE?
253+
   * - ``orm.yaml``
254+
     - Yes
255+
     - No
256+
     - Structural only
257+
     - Low
258+
     - Yes
259+
     - Yes
260+
   * - ``config.yaml``
261+
     - User-authored
262+
     - No
263+
     - None
264+
     - Low
265+
     - No
266+
     - Yes
267+
   * - ``src-stats.yaml``
268+
     - Yes
269+
     - Occasionally
270+
     - Aggregate
271+
     - Medium
272+
     - Yes
273+
     - Conditional
274+
   * - Vocabulary tables
275+
     - Yes
276+
     - No
277+
     - Full table
278+
     - None if correctly identified
279+
     - Yes
280+
     - Conditional
281+
   * - Synthetic output (described below)
282+
     - No
283+
     - No
284+
     - Patient-level synthetic data
285+
     - Low
286+
     - No
287+
     - Yes
288+
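The classification table can also be held as data, which makes it easy to list the artefacts that need attention before release. The following is a minimal sketch, not part of Datafaker: the ``Classification`` class, the ``CLASSIFICATION`` mapping and ``needs_review_before_release`` are hypothetical names, and the rule "review anything that needs IG approval or cannot unconditionally leave the TRE" is an assumption about how the table might be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """One row of the classification table (hypothetical helper type)."""
    derived_from_real_data: str
    contains_patient_level_data: str
    granularity: str
    privacy_risk: str
    ig_approval_required: bool
    can_leave_tre: str


# Values copied from the table; the keys are the artefact names.
CLASSIFICATION = {
    "orm.yaml": Classification(
        "Yes", "No", "Structural only", "Low", True, "Yes"),
    "config.yaml": Classification(
        "User-authored", "No", "None", "Low", False, "Yes"),
    "src-stats.yaml": Classification(
        "Yes", "Occasionally", "Aggregate", "Medium", True, "Conditional"),
    "Vocabulary tables": Classification(
        "Yes", "No", "Full table", "None if correctly identified",
        True, "Conditional"),
    "Synthetic output": Classification(
        "No", "No", "Patient-level synthetic data", "Low", False, "Yes"),
}


def needs_review_before_release(artefact: str) -> bool:
    """Assumed rule: review if IG approval is required or release is conditional."""
    row = CLASSIFICATION[artefact]
    return row.ig_approval_required or row.can_leave_tre != "Yes"


ARTEFACTS_NEEDING_REVIEW = [
    name for name in CLASSIFICATION if needs_review_before_release(name)
]
```

Under this assumed rule, only ``config.yaml`` and the synthetic output are released without a review step.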
289+
It is worth elaborating further on two of these cells:
290+
Firstly, ``src-stats.yaml`` "occasionally" contains patient-level data;
291+
this is true if the table being summarized contains patient-level data
292+
*and* the summarizing function is reporting on every value in one or more columns
293+
*and* rare values are not being suppressed (leading to a value that applies
294+
to just one or two individuals being released).
295+
Search the ``src-stats.yaml`` file for comments such as:
296+
297+
All the values that appear in column *column-name* of table *table-name*
298+
299+
or
300+
301+
All the values that appear in column *column-name* of table *table-name* more than 7 times
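The search described above can be automated. The following is a minimal sketch, assuming the comments in ``src-stats.yaml`` follow exactly the two wordings shown; the function name ``find_all_values_comments`` and the regular expression are illustrative and not part of Datafaker. A hit with no threshold means rare values are not suppressed, so it deserves the closest inspection.

```python
import re

# Assumed wording, taken from the two example comments above.
ALL_VALUES = re.compile(
    r"All the values that appear in column (?P<column>\S+) "
    r"of table (?P<table>\S+)"
    r"(?: more than (?P<threshold>\d+) times)?"
)


def find_all_values_comments(text: str) -> list[dict]:
    """Return one record per line that reports every value in a column."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        match = ALL_VALUES.search(line)
        if match is None:
            continue
        threshold = match.group("threshold")
        hits.append({
            "line": line_number,
            "column": match.group("column"),
            "table": match.group("table"),
            # None means no "more than N times" suppression was stated.
            "threshold": int(threshold) if threshold is not None else None,
        })
    return hits
```

To use it on a real file, pass in the file contents, for example ``find_all_values_comments(pathlib.Path("src-stats.yaml").read_text())``, and review each reported column by hand.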
302+
303+
Secondly, Vocabulary Tables' privacy risk is "None if correctly identified".
304+
A Vocabulary Table is supposed to be a table simply providing categories for other tables to reference.
305+
Such tables are not changed during the operation of the database, so releasing them does not represent a privacy risk.
306+
However, there is some flexibility here; a list of care provider institutions is not technically a vocabulary table
307+
but it is probably safe to treat it as one.
308+
The important point is that Datafaker allows the user to specify any table as a vocabulary table;
309+
if the user incorrectly specifies sensitive data as Vocabulary, it must not be released!
244310

245311
Datafaker Repopulate phase
246312
--------------------------
247313

314+
Once we have released the summary data as described above, we can operate outside of the TRE
315+
as the sensitive data is no longer accessed by Datafaker.
316+
317+
The remaining commands are:
318+
248319
- ``datafaker create-tables`` creates the structure of the destination database to match (as much as is requested) the structure of the source database
249320
- ``datafaker create-generators`` creates Python code files that will actually generate the data (this phase might be removed in a future version of Datafaker)
250321
- ``datafaker create-data`` writes fake data into the destination database.
