Skip to contents

For installation, set-up and basic usage refer to the package README.md file. In this tutorial, you will find more context on package inputs, how to interact with package functions and interpret package outputs.

Preparing for mapping

example bar plot showing number of variables for each table alongside counts of whether variables have missing descriptions

This bar plot is produced automatically when you run the metadata_map function with the demo metadata file, which contains metadata about the National Community Child Health Database (NCCHD).

The bar plot shows us there are 13 tables in the dataset. The height of the bar indicates the number of variables in that table:

  • The ones with lots of variables (e.g. CHILD_TRUST) will take you longer to process (caveat: see HDRUK Gateway screenshot below)
  • Some tables (e.g. CHE_HEALTHYCHILDWALESPROGRAMME) have a lot of empty descriptions. An empty description means that this variable will only have a label and a data type.

It is important to note that this plot is only summarising variable level metadata i.e. a description of what the variable is. Some variables also require value level metadata i.e. what does each value correspond to, 1 = Yes, 2 = No, 3 = Unknown. This value level metadata can sometimes be found in lookup tables, if it is not provided within the variable level description.

The bar plot can help you understand the scope of the dataset, but reference the HDRUK Gateway page for the fuller context. For instance, table descriptions are not included in these structural metadata files but they are included on the gateway:

For dataset NCCHD, used in the demo, the structural metadata was downloaded here:

screenshot of HRDRUK gateway showing location to download structural metadata

View descriptions of each table to give you more context on their contents and how to use them:

screenshot of HRDRUK gateway showing table descriptions

Mapping

Use the bar plot and the HDRUK Gateway to guide your mapping choices. The main functionality of this package is to aid a researcher in mapping variables from health datasets onto their research domains (concepts/latent variables).

You will see this in the R console:

Press 'Esc' key to finish here, or press any other key to continue with mapping variables

Pressing any key will move you to the mapping phase. Demo mode (run by metadata_map()) only processes the fist 20 data elements (variables) in the selected table.

Enter your initials: rs

Enter the table number you want to process: 

 1: BLOOD_TEST    
 2: BREAST_FEEDING                   
 3: CHE_HEALTHYCHILDWALESPROGRAMME   
 4: CHILD                   
 5: CHILD_BIRTHS  
 6: CHILD_MEASUREMENT_PROGRAM        
 7: CHILD_TRUST                      
 8: EXAM                    
 9: IMM           
 10: PATH_BLOOD_TESTS                
 11: PATH_SPCM_DETAIL                
 12: REFR_IMM_VAC            
 13: SIG_COND                        

Selection: 4

ℹ Processing Table 4 of 13 (CHILD)

Optional free text note about this table (or press enter to continue): tutorial testing

You will be asked to label data elements with one (or more) of the numbers shown in the Plots tab [0-7]. Here we have very simple domains [4-7] for the demo run. For a research study, your domains are likely to be much more specific e.g. ‘Prenatal, antenatal, neonatal and birth’ or ‘Health behaviours and diet’. The 4 default domains are always included [0-3], appended on to any domain list given by the user.

description of research domains used for categorisations

If it skips over a data element it means it was auto-categorised or copied from a previous table already processed (more on that later).

ℹ 20 left to process in this session
✔ Processing data element 1 of 35

ℹ 19 left to process in this session
✔ Processing data element 2 of 35

ℹ 18 left to process in this session
✔ Processing data element 3 of 35

ℹ 17 left to process in this session
✔ Processing data element 4 of 35

DATA ELEMENT ----->  APGAR_1 

DESCRIPTION ----->  APGAR 1 score. This is a measure of a baby's physical state at birth with particular reference to asphyxia - taken at 1 minute. Scores 3 and below are generally regarded as critically low; 4-6 fairly low, and 7-10 generally normal. Field can contain high amount of unknowns/non-entries. 

DATA TYPE ----->  CHARACTER 

Categorise data element into domain(s). E.g. 3 or 3,4: 7

Categorisation note (or press enter to continue): your note here 

We chose to respond with ‘7’ because that corresponds to the ‘Health info’ domain in the table. More than one domain can be chosen, separated by commas. Do remember that this demo has over-simplified domain labels, and they will likely be more specific for a research study.

You have the option to re-do the categorisation (and note) you just made, by replying ‘y’ to the question:

Response to be saved is '7'. Would you like to re-do? (y/n): n

After completing 20, it will then ask you to review the auto-categorisations it made.

These auto-categorisations are based on the mappings included in the default look_up.csv. Type get("look_up") in R.

This look-up file can be changed by the user. ALF refers to ‘Anonymous Linking Field’ - this field is used within datasets that have been anonymised and encrypted for inclusion within SAIL Databank.

 ℹ These are the auto categorised data elements:

  data_element domain_code             note
         ALF_E           2 AUTO CATEGORISED
  ALF_MTCH_PCT           2 AUTO CATEGORISED
    ALF_STS_CD           2 AUTO CATEGORISED
 AVAIL_FROM_DT           1 AUTO CATEGORISED
       GNDR_CD           3 AUTO CATEGORISED
       
Select those you want to manually edit:

1:   ALF_E
2:   ALF_MTCH_PCT
3:   ALF_STS_CD
4:   AVAIL_FROM_DT
5:   GNDR_CD

Enter one or more numbers separated by spaces and then ENTER, or 0 to cancel
1: 

Press enter for now, as these look good, but know that you can always manually overide an auto categorisation. It will then ask you if you want to review the categorisations you made. Respond 1 to review:

Would you like to review your categorisations? 

1: Yes
2: No

Selection: 1

ℹ These are the data elements you categorised:

         data_element domain_code note (first 12 chars)
              APGAR_1           7                  note
              APGAR_2           7                      
          BIRTH_ORDER           7                  10% missingness    
             BIRTH_TM           1,7                20% missingness      
         BIRTH_WEIGHT           7                      
     BIRTH_WEIGHT_DEC           7                      
 BREASTFEED_8_WKS_FLG           7                      
 BREASTFEED_BIRTH_FLG           7                      
           CHILD_ID_E           2                      
    CURR_LHB_CD_BIRTH           5,7                Place of birth       
               DEL_CD           7                      
                  DOD           3,7             
        ETHNIC_GRP_CD           3                    
             GEST_AGE           3,7                     
  HEALTH_VISITOR_CD_E           2                     

Select those you want to edit:

 1:   APGAR_1                
 2:   APGAR_2                
 3:   BIRTH_ORDER            
 4:   BIRTH_TM               
 5:   BIRTH_WEIGHT        
 ...
 
Enter one or more numbers separated by spaces and then ENTER, or 0 to cancel
1: 

If you select a number, it will guide you through that categorisation again.

All finished! Take a look at the outputs in your project directory.

Understanding mapping outputs

See here for outputs generated from the demo run.

  • BAR_360_NationalCommunityChildHealthDatabase(NCCHD)_timestamp.html
    • The bar plot that opened in your browser.
  • BAR_360_NationalCommunityChildHealthDatabase(NCCHD)_timestamp.csv
    • The data that created this bar plot.
  • MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
    • The mappings between variables in the CHILD table and the research domains.
  • L-MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
    • The same mappings as the previous file, but saved in a longer format. See the argument long_output = TRUE in metadata_map.
  • MAPPING_LOG_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv
    • A log file that accompanies the MAPPING file, describing features of the session and the table processed.
  • MAPPING_PLOT_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.png
    • A simple visual representation of the mappings, displaying the count of each domain code.

Compare mapping

Running the function map_compare will allow you to compare the mappings from two sessions, perhaps two different researchers. This function compares csv outputs from two sessions, finds their differences, and asks for a consensus, creating a new output file:

CONSENSUS_MAPPING_360_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_timestamp.csv