For installation, set-up and basic usage refer to the package README.md file. In this tutorial, you will find more context on package inputs, how to interact with package functions and interpret package outputs.
Preparing for mapping
This bar plot is produced automatically when you run the metadata_map function with the demo metadata file, which contains metadata about the National Community Child Health Database (NCCHD).
The bar plot shows that there are 13 tables in the dataset. The height of each bar indicates the number of variables in that table:
- Tables with many variables (e.g. CHILD_TRUST) will take longer to process (caveat: see HDRUK Gateway screenshot below)
- Some tables (e.g. CHE_HEALTHYCHILDWALESPROGRAMME) have many empty descriptions. An empty description means that the variable has only a label and a data type.
Note that this plot summarises only variable-level metadata, i.e. a description of what each variable is. Some variables also require value-level metadata, i.e. what each value corresponds to (e.g. 1 = Yes, 2 = No, 3 = Unknown). This value-level metadata can sometimes be found in lookup tables if it is not provided within the variable-level description.
The bar plot can help you understand the scope of the dataset, but refer to the HDRUK Gateway page for fuller context. For instance, table descriptions are not included in these structural metadata files, but they are included on the Gateway:
For dataset NCCHD, used in the demo, the structural metadata was downloaded here:
View descriptions of each table to give you more context on their contents and how to use them:
Mapping
Use the bar plot and the HDRUK Gateway to guide your mapping choices. The main functionality of this package is to aid a researcher in mapping variables from health datasets onto their research domains (concepts/latent variables).
You will see this in the R console:
Press 'Esc' key to finish here, or press any other key to continue with mapping variables
Pressing any key will move you to the mapping phase. Demo mode (run by metadata_map()) only processes the first 20 data elements (variables) in the selected table.
Enter your initials: rs
Enter the table number you want to process:
1: BLOOD_TEST
2: BREAST_FEEDING
3: CHE_HEALTHYCHILDWALESPROGRAMME
4: CHILD
5: CHILD_BIRTHS
6: CHILD_MEASUREMENT_PROGRAM
7: CHILD_TRUST
8: EXAM
9: IMM
10: PATH_BLOOD_TESTS
11: PATH_SPCM_DETAIL
12: REFR_IMM_VAC
13: SIG_COND
Selection: 4
ℹ Processing Table 4 of 13 (CHILD)
Optional free text note about this table (or press enter to continue): tutorial testing
You will be asked to label data elements with one (or more) of the numbers shown in the Plots tab [0-7]. Here we have very simple domains [4-7] for the demo run. For a research study, your domains are likely to be much more specific e.g. ‘Prenatal, antenatal, neonatal and birth’ or ‘Health behaviours and diet’. The 4 default domains are always included [0-3], appended on to any domain list given by the user.
If a data element is skipped, it was either auto-categorised or copied from a previous table that has already been processed (more on that later).
ℹ 20 left to process in this session
✔ Processing data element 1 of 35
ℹ 19 left to process in this session
✔ Processing data element 2 of 35
ℹ 18 left to process in this session
✔ Processing data element 3 of 35
ℹ 17 left to process in this session
✔ Processing data element 4 of 35
DATA ELEMENT -----> APGAR_1
DESCRIPTION -----> APGAR 1 score. This is a measure of a baby's physical state at birth with particular reference to asphyxia - taken at 1 minute. Scores 3 and below are generally regarded as critically low; 4-6 fairly low, and 7-10 generally normal. Field can contain high amount of unknowns/non-entries.
DATA TYPE -----> CHARACTER
Categorise data element into domain(s). E.g. 3 or 3,4: 7
Categorisation note (or press enter to continue): your note here
We chose to respond with ‘7’ because that corresponds to the ‘Health info’ domain in the table. More than one domain can be chosen, separated by commas. Do remember that this demo has over-simplified domain labels, and they will likely be more specific for a research study.
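To make the response format concrete, here is a minimal base R sketch of how a typed answer such as 7 or 3,4 can be split into individual domain codes and checked against the numbers on offer. The helper name parse_domain_response and its valid_codes argument are hypothetical and are not part of the package; the package's own input handling may differ.

```r
# Hypothetical helper (not a package function): turn a typed response such as
# "3,4" into a vector of domain codes, rejecting anything outside valid_codes.
parse_domain_response <- function(response, valid_codes = 0:7) {
  # Split on commas and drop surrounding spaces, so "3, 4" works like "3,4".
  parts <- trimws(strsplit(response, ",", fixed = TRUE)[[1]])

  # Every part must be a whole number written with digits only.
  if (length(parts) == 0 || !all(grepl("^[0-9]+$", parts))) {
    stop("Response must be one or more whole numbers separated by commas.")
  }

  codes <- as.integer(parts)
  if (!all(codes %in% valid_codes)) {
    stop("Domain codes must be between ", min(valid_codes),
         " and ", max(valid_codes), ".")
  }

  unique(codes)
}

single_domain <- parse_domain_response("7")
two_domains   <- parse_domain_response("3, 4")
print(single_domain)
print(two_domains)
```

In this sketch a response outside the range shown in the Plots tab (for example 9) raises an error rather than being saved.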
You have the option to re-do the categorisation (and note) you just made, by replying ‘y’ to the question:
Response to be saved is '7'. Would you like to re-do? (y/n): n
After you have completed 20 data elements, you will be asked to review the auto-categorisations that were made.
These auto-categorisations are based on the mappings included in the default look_up.csv. Type get("look_up") in R to view it.
This look-up file can be changed by the user. ALF refers to ‘Anonymous Linking Field’ - this field is used within datasets that have been anonymised and encrypted for inclusion within SAIL Databank.
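As a rough illustration of look-up based auto-categorisation, the sketch below matches data element names against a small look-up table and flags the rest for manual mapping. The column names data_element and domain_code are assumed from the console output in this tutorial, the object names are hypothetical, and the real look_up.csv may be laid out differently; this is not the package's implementation.

```r
# Assumed look-up layout: one row per data element with a default domain code.
# The five rows reuse the auto-categorised elements from the demo run.
look_up_sketch <- data.frame(
  data_element = c("ALF_E", "ALF_MTCH_PCT", "ALF_STS_CD", "AVAIL_FROM_DT", "GNDR_CD"),
  domain_code  = c("2", "2", "2", "1", "3"),
  stringsAsFactors = FALSE
)

# A few data elements from the CHILD table in this demo.
table_elements <- c("ALF_E", "APGAR_1", "AVAIL_FROM_DT", "BIRTH_WEIGHT", "GNDR_CD")

# Position of each element in the look-up (NA when there is no match).
hit <- match(table_elements, look_up_sketch$data_element)

mappings <- data.frame(
  data_element = table_elements,
  domain_code  = look_up_sketch$domain_code[hit],
  note         = ifelse(is.na(hit), NA_character_, "AUTO CATEGORISED"),
  stringsAsFactors = FALSE
)

# Elements without a look-up match still need a manual categorisation.
needs_manual <- mappings$data_element[is.na(mappings$domain_code)]

print(mappings)
print(needs_manual)
```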
ℹ These are the auto categorised data elements:
data_element domain_code note
ALF_E 2 AUTO CATEGORISED
ALF_MTCH_PCT 2 AUTO CATEGORISED
ALF_STS_CD 2 AUTO CATEGORISED
AVAIL_FROM_DT 1 AUTO CATEGORISED
GNDR_CD 3 AUTO CATEGORISED
Select those you want to manually edit:
1: ALF_E
2: ALF_MTCH_PCT
3: ALF_STS_CD
4: AVAIL_FROM_DT
5: GNDR_CD
Enter one or more numbers separated by spaces and then ENTER, or 0 to cancel
1:
Press enter for now, as these look good, but know that you can always manually override an auto-categorisation. You will then be asked if you want to review the categorisations you made. Respond 1 to review:
Would you like to review your categorisations?
1: Yes
2: No
Selection: 1
ℹ These are the data elements you categorised:
data_element domain_code note (first 12 chars)
APGAR_1 7 note
APGAR_2 7
BIRTH_ORDER 7 10% missingness
BIRTH_TM 1,7 20% missingness
BIRTH_WEIGHT 7
BIRTH_WEIGHT_DEC 7
BREASTFEED_8_WKS_FLG 7
BREASTFEED_BIRTH_FLG 7
CHILD_ID_E 2
CURR_LHB_CD_BIRTH 5,7 Place of birth
DEL_CD 7
DOD 3,7
ETHNIC_GRP_CD 3
GEST_AGE 3,7
HEALTH_VISITOR_CD_E 2
Select those you want to edit:
1: APGAR_1
2: APGAR_2
3: BIRTH_ORDER
4: BIRTH_TM
5: BIRTH_WEIGHT
...
Enter one or more numbers separated by spaces and then ENTER, or 0 to cancel
1:
If you select a number, it will guide you through that categorisation again.
All finished! Take a look at the outputs in your project directory.
Understanding mapping outputs
See here for outputs generated from the demo run.
- BAR_360_NationalCommunityChildHealthDatabase(NCCHD)_timestamp.html
  - The bar plot that opened in your browser.
- BAR_360_NationalCommunityChildHealthDatabase(NCCHD)_timestamp.csv
  - The data that created this bar plot.
- MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
  - The mappings between variables in the CHILD table and the research domains.
- L-MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
  - The same mappings as the previous file, but saved in a longer format. See the argument long_output = TRUE in metadata_map.
- MAPPING_LOG_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
  - A log file that accompanies the MAPPING file, describing features of the session and the table processed.
- MAPPING_PLOT_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.png
  - A simple visual representation of the mappings, displaying the count of each domain code.
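The relationship between the wide mappings, a longer format, and the per-domain counts can be sketched in base R. The rows below are a subset of the categorisations reviewed earlier in this tutorial. The column names, and the reading of "longer format" as one row per data element and domain pair, are assumptions; the actual MAPPING and L-MAPPING files may contain additional columns or a different layout.

```r
# A few rows in the shape shown in the review step: one row per data element,
# with multiple domain codes separated by commas.
mapping <- data.frame(
  data_element = c("APGAR_1", "BIRTH_TM", "CHILD_ID_E", "CURR_LHB_CD_BIRTH", "ETHNIC_GRP_CD"),
  domain_code  = c("7", "1,7", "2", "5,7", "3"),
  stringsAsFactors = FALSE
)

# Split each comma-separated response into its individual codes.
code_list <- strsplit(mapping$domain_code, ",", fixed = TRUE)

# Assumed long layout: one row per (data element, domain code) pair.
long_mapping <- data.frame(
  data_element = rep(mapping$data_element, times = lengths(code_list)),
  domain_code  = as.integer(unlist(code_list)),
  stringsAsFactors = FALSE
)

# Count how often each domain code was used, as a mapping plot would display.
domain_counts <- table(long_mapping$domain_code)

print(long_mapping)
print(domain_counts)
```

A long layout like this makes it straightforward to count or filter by a single domain code without parsing comma-separated strings each time.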
Compare mapping
Running the function map_compare will allow you to compare the mappings from two sessions, for example sessions completed by two different researchers. This function compares the csv outputs from two sessions, finds their differences, and asks for a consensus, creating a new output file:
CONSENSUS_MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
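The idea of finding differences between two sessions can be sketched in base R as follows. This is only an illustration and not how map_compare is implemented: the column names are assumed from the console output in this tutorial, session_a reuses three categorisations from the demo run, and session_b is a made-up second researcher included purely so that one disagreement appears.

```r
# Session A: three categorisations taken from the demo run in this tutorial.
session_a <- data.frame(
  data_element = c("APGAR_1", "BIRTH_TM", "DOD"),
  domain_code  = c("7", "1,7", "3,7"),
  stringsAsFactors = FALSE
)

# Session B: hypothetical second researcher (illustrative values only).
session_b <- data.frame(
  data_element = c("APGAR_1", "BIRTH_TM", "DOD"),
  domain_code  = c("7", "7", "3,7"),
  stringsAsFactors = FALSE
)

# Line the two sessions up by data element.
both <- merge(session_a, session_b, by = "data_element", suffixes = c("_a", "_b"))

# Flag data elements where the two sessions chose different domain codes.
both$agree <- both$domain_code_a == both$domain_code_b
disagreements <- both[!both$agree, ]

print(disagreements)
```

Only the rows in disagreements would need a consensus decision; rows where both sessions agree can be carried over unchanged.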