Validation in epidemiological studies


Raimund Storb

ABSTRACT

Providing objective evidence that a program fulfils its requirements can become a challenge when the program runs on huge datasets of mixed quality. For example, the quality of the data may lead to inappropriate effort for double programming, caused by unforeseen exceptions in the data. Drawing a sample beforehand may help to reduce run time during programming and may also help to limit exceptions. To address all issues of a certain relevance, we have to define a proper sample size. An alternative approach is to define artificial test data; the correct result is then defined in advance, and double programming (of the analysis datasets) becomes obsolete. However, some data issues may then not be addressed, regardless of their relevance. Keep in mind that a proper Data Definition Table is the first step to ensure that what was intended in your analysis plan is actually done.

INTRODUCTION

At present there is a lot of experience in how to handle the need to validate statistical programming in clinical trials and submissions, in abstracts based on clinical trial data, in signal detection based on clinical trial data, and so on. The well controlled quality of the data and its description in annotated case report forms and study protocols, together with a defined analysis described in protocols and analysis plans, lead to a clear understanding of what has to be programmed and, hopefully, to a Data Definition Table defining all the derivations to be done. There are several possible ways to verify the correctness and validity of programming, and in clinical trial reporting there is also the obligation to do so.

There might be no obligation from other parties to perform validation/verification of statistical programming on claims data, but the need is still there, because the results may feed into decision processes or publications.

Analysing claims databases is different because:

  • The data is as it is delivered; no queries are possible. The data has to be used as delivered.
  • The data is collected and used as in real life, to serve the needs of real life, not the needs of analyses.
  • You may have to work with "links by meaning" instead of links established by the design of the database. For example, you may link an observed claim from a pharmacy with an observed claim from a practitioner (see the sketch after this list).
  • The amount of data you have to deal with at the beginning of your program/analysis depends on the organisation/purpose the data is collected for, not on any kind of statistical power estimate.
    • It may happen that in the end too few usable observations are left.
    • Depending on the database and table you are working with, the number of observations may run into the millions. This has an impact on run time and on the computational power needed.
    • If a certain inconsistency/error in the way the data is collected is possible, it will be present.
  • The data collection is not done by a few trained sites and a central laboratory but by thousands of doctors, hospitals, laboratories and pharmacies, all entering the data to the best of their knowledge.
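
As a minimal sketch of such a "link by meaning", the following SAS step joins pharmacy claims to practitioner claims of the same patient whenever the dispensing date falls within 30 days after the visit. All dataset and variable names (pharm, pract, pat_id, disp_dt, visit_dt, atc_code, diag_code) and the 30-day window are assumptions for illustration only:

  /* hypothetical link "by meaning": same patient, dispensing */
  /* within 30 days after the practitioner visit              */
  proc sql;
     create table linked as
     select a.pat_id, a.disp_dt, a.atc_code,
            b.visit_dt, b.diag_code
     from pharm as a
          inner join pract as b
          on  a.pat_id = b.pat_id
          and a.disp_dt between b.visit_dt and b.visit_dt + 30;
  quit;

Note that such a join may return several candidate visits per dispensed claim, so a deterministic rule (e.g. keeping the closest visit) is usually needed on top of it.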

What are the consequences for the verification and validation of programming?

  • You may have to distinguish between (primary) data selection, derivation and tabulation of data.
  • Set your focus on the logic you have to implement.
  • Consider which data exceptions are worth your attention.
    • A single data issue in observations that will most likely not contribute to your results is not worth your attention.
  • Consider the run time when you decide how to verify and validate your programming.
  • Consider reusing proven program code. You may decide to validate frequently used program code and macros.
  • Consider developing and verifying only on a subset of the data. Given a proper sample, the most important data issues will be present (see the sketch after this list).
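
As a minimal sketch of such a development subset, the following SAS steps draw a 1% simple random sample of patients and then keep all claims of the sampled patients, so that within-patient data issues survive the sampling. The dataset and variable names (claims, pat_id, devsample), the sampling rate and the seed are assumptions for illustration only:

  /* one record per patient */
  proc sort data=claims(keep=pat_id) out=patients nodupkey;
     by pat_id;
  run;

  /* 1% simple random sample of patients; fixed seed for reproducibility */
  proc surveyselect data=patients out=sample_ids
                    method=srs samprate=0.01 seed=20120924;
  run;

  /* keep all claims of the sampled patients */
  proc sql;
     create table devsample as
     select c.*
     from claims as c
          inner join sample_ids as s
          on c.pat_id = s.pat_id;
  quit;

Sampling whole patients rather than single claims is the design choice here: it keeps the within-patient claim histories intact, which is where many data issues show up.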

The following chapters will briefly discuss verification and validation, and ways to verify and validate programming in epidemiology.

FIRST DATA SELECTION

When programming based on claims data you have, in addition to the usual considerations, to take the run time on the large tables into account. Find a simple first step to reduce your data: use a piece of programming which simply passes once over the entire data and subsets it as much as possible. This will gain you the following advantages:

  • A simple, fast and reviewable reduction of the data.
  • An easily traceable (time-critical) first step. Even if you want to double program on the whole set, you have a good chance of matching the first program, and you may compare this data with the double programming before further processing (see the comparison sketch after this list).
    • A better chance to obtain reusable program code.
    • A better chance to end up with some standard access types.
  • A simple first step limits the influence of dirty data: the more complex the first steps are, the more complex the influence of dirty data becomes.
  • Apply a standard if one is available in your company.
    • The result will easily be matched if double programming is required.
    • The result is easily understood by any other staff familiar with that standard.
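
As a minimal sketch of such a comparison, the following SAS steps check the production output of the first selection against an independently programmed version with PROC COMPARE. The dataset names (first, qc_first) and the key variable pat_id are assumptions for illustration only:

  proc sort data=first;    by pat_id; run;
  proc sort data=qc_first; by pat_id; run;

  /* report all differences between production and QC version */
  proc compare base=first compare=qc_first listall;
     id pat_id;
  run;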

Let's take a second look at these topics.

Simple and fast reviewable reduction of data

Most likely you are only interested in patients with certain disease(s), within a certain time frame or at a certain time point, within a certain age range, and who took a certain drug. Even if you apply a control group afterwards, a first selection like this will most likely still be done. In most cases it will reduce your data to a more convenient size. You will avoid issues in this first step, and if issues do occur here, you will be able to fix them quickly.
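
A minimal sketch of such a first selection in a single pass over the raw table; the dataset and variable names (rawclaims, first, icd_code, claim_dt, birth_dt) and the concrete disease codes, time frame and age range are assumptions for illustration only:

  data first;
     set rawclaims;
     where substr(icd_code, 1, 3) in ('E10', 'E11')              /* disease(s) */
       and '01JAN2008'd <= claim_dt <= '31DEC2010'd              /* time frame */
       and 18 <= floor(yrdif(birth_dt, claim_dt, 'age')) <= 79;  /* age range  */
  run;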

Easily traceable (time-critical) first step

Here we clearly have the advantage that it is easy to understand what this piece of program does. Based on this, you (and any reviewer) can review the log file information to judge whether the result is reasonable. Moreover, a program is simpler to reuse if it is easy to understand. (There are enough "highly sophisticated" programs that nobody wants to adopt, reuse or change; no need to add one.) If these programs are easy to adopt and understand, and you find that some of them are repeated (because some requests are similar, or there is a task that is conditioned by your database), you may decide to standardize this. Then please consider paying the effort to develop this as a validated/verified program/macro (see the sketch below).
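
As a minimal sketch, such a standardized first selection could be wrapped in a macro like the one below. The macro name, its parameters and all dataset and variable names are assumptions for illustration only; the SAS log of the data step reports how many observations were read and selected, which supports the log review described above:

  %macro first_select(in=, out=, icd3=, from=, to=);
     /* single pass over the raw table; subset by leading ICD code */
     /* and observation window                                     */
     data &out;
        set &in;
        where substr(icd_code, 1, 3) in (&icd3)
          and &from <= claim_dt <= &to;
     run;
  %mend first_select;

  /* example call; %str() protects the comma-separated code list */
  %first_select(in=rawclaims, out=first,
                icd3=%str('E10','E11'),
                from='01JAN2008'd, to='31DEC2010'd);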


