Data analysis plan

1. Why a data analysis plan?
A data analysis plan helps you think through the data you will collect, what you will use it for, and how you will analyse it. “Analysis planning can be an invaluable investment of time” (Centers for Disease Control and Prevention, 2013).
The method for creating a data analysis plan in the context of a NIPN is not much different from the method used in a research context.
In the context of NIPN, the process should be simpler because:

  • A data analysis framework has already been produced (step 3 of the question formulation process) and forms the basis for the more detailed data analysis plan (after step 4 of the question formulation process).
  • Section 3.4, pages 7 to 9, describes data analysis methodologies.
  • NIPN is about the use of existing data; it is not about designing a protocol for collecting new data.

The next sections briefly describe the content of a data analysis plan, focusing on what is specific to NIPN.

As general recommendations:

  • Don’t panic!
  • Use the advice and experiences from colleagues and experts
  • Quickly contact an expert when necessary

Recommended sources to read:
Centers for Disease Control and Prevention (2013) Creating an analysis plan. Atlanta.
Simpson, S.H. (2015) Creating a data analysis plan: what to consider when choosing statistics for a study.

2. What is a data analysis plan?
Main sections of a data analysis plan (based on CDC module):

  • Main question and sub-questions
  • Dataset(s) to be used
  • Inclusion/exclusion criteria
  • Variables to be used in the main analysis
  • Statistical methods and software to be used
  • Table shells
    => Estimation of time and resources needed

3. Main question and sub-questions
At this stage, the policy-relevant question (and, in some cases, its sub-questions) is already well defined (section 3.4, page 11).
Answering all the sub-questions will provide a full answer to the main question.

4. Dataset(s) to be used
This section lists the dataset(s) needed. In the context of NIPN, data management may need particular attention: because the dataset(s) may come from different sources and may not have been designed to answer the main question, considerable work may be needed to harmonise, append and clean the raw dataset(s).

  • Are the datasets comparable?
  • Are the indicators harmonised?
  • Is there a need to transform the data for the analysis?
    To answer these questions, access to the datasets concerned is required.
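
As an illustration of the harmonise / append step, here is a minimal Python sketch. It assumes two hypothetical survey extracts that record the same measurements under different variable names and units; the dataset names, variable names and values are invented for the example and do not come from any NIPN dataset.

```python
# Hypothetical raw extracts from two surveys; all names and values are illustrative.
survey_a = [
    {"hh_id": 1, "weight_kg": 62.0, "height_cm": 158.0},
    {"hh_id": 2, "weight_kg": 80.5, "height_cm": 171.0},
]
survey_b = [
    {"id": 10, "wt": 70.0, "ht_m": 1.65},
]

# Map each source's variable names onto one common naming scheme (an assumption).
RENAME = {
    "survey_a": {"hh_id": "record_id", "weight_kg": "weight_kg", "height_cm": "height_cm"},
    "survey_b": {"id": "record_id", "wt": "weight_kg", "ht_m": "height_m"},
}


def harmonise(rows, source):
    """Rename variables, convert height to centimetres, and tag the source."""
    out = []
    for row in rows:
        new = {RENAME[source][key]: value for key, value in row.items()}
        if "height_m" in new:  # transform metres to centimetres
            new["height_cm"] = round(new.pop("height_m") * 100, 1)
        new["source"] = source
        out.append(new)
    return out


# Append the harmonised datasets into one analysis file.
combined = harmonise(survey_a, "survey_a") + harmonise(survey_b, "survey_b")

# Comparable datasets should now expose exactly the same set of variables.
same_variables = all(set(row) == set(combined[0]) for row in combined)
print(len(combined), same_variables)
```

Keeping the source tag on every record makes it possible to check later whether a result is driven by one dataset only.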

5. Inclusion/exclusion criteria
This section defines very precisely the population subgroups, geographic scope, timeframe and any other criteria.
The data quality level required for the analysis must also be defined.
Depending on the analysis, a more or less strict data quality level may be required.
This point is detailed in the Data Quality training module (section 3.3).
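
The sketch below shows one possible way to write inclusion/exclusion criteria down explicitly and to count why records are excluded. It is only an illustration: the age range, regions, years and the `quality_ok` flag are hypothetical, and the `require_quality` switch simply stands in for a more or less strict data quality level.

```python
from collections import Counter

# Hypothetical criteria; every threshold and name here is an assumption.
CRITERIA = {
    "min_age_months": 6,
    "max_age_months": 59,
    "regions": {"North", "East"},
    "years": range(2015, 2021),  # 2015 to 2020 inclusive
}

records = [
    {"id": 1, "age_months": 24, "region": "North", "year": 2016, "quality_ok": True},
    {"id": 2, "age_months": 70, "region": "North", "year": 2016, "quality_ok": True},
    {"id": 3, "age_months": 12, "region": "South", "year": 2018, "quality_ok": True},
    {"id": 4, "age_months": 36, "region": "East", "year": 2019, "quality_ok": False},
    {"id": 5, "age_months": 48, "region": "East", "year": 2020, "quality_ok": True},
]


def exclusion_reason(rec, require_quality=True):
    """Return None if the record is included, else the first reason for exclusion."""
    if not CRITERIA["min_age_months"] <= rec["age_months"] <= CRITERIA["max_age_months"]:
        return "age out of range"
    if rec["region"] not in CRITERIA["regions"]:
        return "region out of scope"
    if rec["year"] not in CRITERIA["years"]:
        return "year out of timeframe"
    if require_quality and not rec["quality_ok"]:
        return "data quality flag"
    return None


def apply_criteria(rows, require_quality=True):
    """Split rows into included records and a count of exclusion reasons."""
    included, reasons = [], Counter()
    for rec in rows:
        reason = exclusion_reason(rec, require_quality)
        if reason is None:
            included.append(rec)
        else:
            reasons[reason] += 1
    return included, reasons


strict, strict_reasons = apply_criteria(records)
lenient, _ = apply_criteria(records, require_quality=False)
print([rec["id"] for rec in strict], dict(strict_reasons))
print([rec["id"] for rec in lenient])
```

Reporting the number of records excluded for each reason helps readers judge how much the criteria shape the final sample.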

6. Variables to be used in the main analysis
In this section the exact variables/indicators to be used in the analysis must be defined.
For example, to analyse “obesity”, the plan must state whether the indicator is Body Mass Index (BMI) and, if so, whether BMI categories, mean BMI, or both will be used.
In the context of NIPN, the harmonisation of the definition of indicators across datasets will be important.
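
To make the BMI example concrete, the following Python sketch computes BMI as weight in kilograms divided by height in metres squared, then derives both candidate indicators: mean BMI and the share of adults in each category. The cut-offs used are the standard WHO adult categories (below 18.5, 18.5 to below 25, 25 to below 30, 30 and above); the four measurements are invented for illustration, and a real plan would also state the cut-offs chosen for children or other groups.

```python
from collections import Counter
from statistics import mean


def bmi(weight_kg, height_cm):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Classify an adult BMI value using the WHO adult cut-offs."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


# Hypothetical (weight_kg, height_cm) pairs, for illustration only.
adults = [(50.0, 170.0), (65.0, 170.0), (80.0, 170.0), (95.0, 170.0)]

values = [bmi(weight, height) for weight, height in adults]
categories = [bmi_category(value) for value in values]

# Indicator option 1: mean BMI.
mean_bmi = mean(values)

# Indicator option 2: share of adults in each BMI category.
counts = Counter(categories)
prevalence = {name: counts[name] / len(categories) for name in counts}

print(round(mean_bmi, 2), prevalence)
```

The two indicators answer different questions: the mean describes the whole distribution, while the category shares describe how many people fall above or below a threshold.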

7. Statistical methods and software to be used
Ensure coherence with section 4 of the guidance notes on data analysis.
Also, to provide only indisputable analysis (principle 3, section 3.4, page 4), make sure that the statistical method used is coherent with the datasets available and with their data quality. The choice of statistical method is key to avoiding over-interpretation of the data, which could lead to misleading conclusions.
Does the NIPN team have the technical capacity to handle the statistical method and the software identified?

8. Table shells
Nothing specific to NIPN.
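
For readers unfamiliar with the term, a table shell is an empty table whose title, rows and columns are fixed before the analysis is run. The small Python sketch below prints one as plain text; the function name `table_shell`, the table title and the row and column labels are all hypothetical and only illustrate the idea.

```python
def table_shell(title, row_labels, column_labels, placeholder="--"):
    """Return a plain-text table with headers filled in and cells left empty."""
    first_width = max(len(label) for label in row_labels)
    widths = [max(len(col), len(placeholder)) for col in column_labels]
    header = " | ".join(
        [" " * first_width] + [col.ljust(w) for col, w in zip(column_labels, widths)]
    )
    lines = [title, header, "-" * len(header)]
    for label in row_labels:
        cells = [placeholder.ljust(w) for w in widths]
        lines.append(" | ".join([label.ljust(first_width)] + cells))
    return "\n".join(lines)


# Hypothetical layout: BMI category by sex, cells to be filled after the analysis.
shell = table_shell(
    "Table 1. BMI category by sex (hypothetical layout)",
    ["Underweight", "Normal", "Overweight", "Obese"],
    ["Women n (%)", "Men n (%)", "Total n (%)"],
)
print(shell)
```

Agreeing on the shells in advance shows which variables and breakdowns the analysis must actually produce.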

9. Estimation of time and resources
At this stage, a precise estimation of the time and resources needed to conduct the analysis should be made.
If this estimate exceeds the initial estimate made for the data analysis framework, you may adjust the question(s) to be addressed first.