What should I do before analysis? 3 basic steps you really must take

Ooh, it’s exciting when you FINALLY get your data isn’t it!

All that time spent trying to make this happen feels (almost!) worth it. Now the end is so close, your instinct is to launch right into your analysis and find out the answer to your question.


Let me stop you right there!



Before you analyse your data, there are three basic steps you need to take:

  1. Get familiar with your data
  2. Check for errors
  3. Get ready for analysis


Preparing your data isn’t rocket science and it doesn’t take very long, but it will help you to feel more comfortable and confident with what you’ve got. It also allows you to pick up basic errors that will cause you problems later on. This process makes even complex analysis much easier, because you’ve already done the basics first.




In this blog, I’ll take you through these steps. It’s the same process I work through for every single dataset I get.


Step 1 – Get familiar with your data

Open the dataset

Now, this first bit is going to sound obvious but it’s so important that I’m writing it anyway!

Open up your dataset and actually have a look at it.

This step is particularly important if you’re doing secondary data analysis; i.e. your data wasn’t collected by you.

Check to see:




Check your variables

Next, look back to your data collection tool (CRF, questionnaire, etc) and mentally match up variables with questions. This means you can be very clear on where those variables have come from.

You might find that extra variables have been created along the way. If they have, make sure you know how they’ve been calculated, even if it seems obvious. For example, ‘height’ could actually be ‘sitting height’ and not ‘standing height’.


Match the dataset to your expectations

Finally, check that the number of rows in your dataset is as expected. For example, if you’ve got 100 people in your study, you’d expect 100 rows of data. If there are more or fewer rows than expected, check with the trial manager or data manager that nothing has been missed.
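Here is a minimal sketch of this row-count check in Python, using only the standard library. The column name `participant_id`, the tiny inline dataset and the expected count are all invented for illustration. The duplicate-ID check is an extra I have assumed would be useful, not something the steps above require.

```python
import csv
import io
from collections import Counter

# Hypothetical extract; in practice you would open your own file instead.
raw = io.StringIO(
    "participant_id,age\n"
    "P001,34\n"
    "P002,51\n"
    "P002,51\n"
    "P004,47\n"
)
rows = list(csv.DictReader(raw))

expected_n = 4  # assumed number of participants in the study
id_counts = Counter(row["participant_id"] for row in rows)
duplicate_ids = sorted(pid for pid, n in id_counts.items() if n > 1)

print(f"rows: {len(rows)} (expected {expected_n})")
print(f"unique participants: {len(id_counts)}")
print(f"duplicated IDs: {duplicate_ids}")
```

In this made-up example the row count matches, but one participant appears twice and another is missing, which is exactly the kind of thing to raise with the data manager.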


Step 2 – Clean up the data

Take it one column at a time

I find a systematic method most effective for this step, moving through the dataset one variable at a time from left to right. Take your first variable and either plot it or create some summary statistics (preferably both) just to get a feel for what that variable looks like.

The exact steps will depend on what kind of variable it is. If it is a continuous variable, something like a histogram might be most appropriate. If it’s categorical, try using a table instead.
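As a rough sketch of what this might look like in code, the Python below summarises one continuous and one categorical variable. The values are invented, and the text histogram is only a stand-in for whatever plotting your own statistical package offers.

```python
import statistics
from collections import Counter

# Hypothetical values for illustration only.
age = [34, 51, 47, 29, 62, 45, 38]
smoking = ["never", "ex", "current", "never", "never", "ex", "current"]

# Continuous variable: summary statistics plus a crude text histogram.
print("n =", len(age))
print("min =", min(age), "max =", max(age))
print("mean =", round(statistics.mean(age), 1))
print("median =", statistics.median(age))

bins = Counter((value // 10) * 10 for value in age)  # 10-year bands
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")

# Categorical variable: a frequency table.
table = Counter(smoking)
for level, count in table.most_common():
    print(f"{level}: {count}")
```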


Expect the unexpected

If you’re using a table, think about what categories you should have in there. Does the table reflect this? Has it been coded up in some way, i.e. has it got levels or is it free text? If it’s free text then you’ll need to code it up if you want to include that variable in a quantitative analysis.

Finally, look out for any data points that don’t fit into the pattern that they should. So for example, if you’re looking at a smoking variable, you might expect to see current smokers, never smokers and ex-smokers. If you have a non-drinker in there, this is a very obvious error that needs following up.
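One way to make this check systematic is to compare every value against the categories you expect. The sketch below assumes a three-level coding scheme for smoking and uses invented responses.

```python
from collections import Counter

# Assumed coding scheme for the smoking question.
allowed = {"never smoker", "ex-smoker", "current smoker"}

# Hypothetical responses, including one that belongs to a different question.
smoking = ["never smoker", "ex-smoker", "non-drinker", "current smoker", "never smoker"]

unexpected = Counter(value for value in smoking if value not in allowed)
# Record row positions so each query can be followed up against the source data.
rows_to_query = [i for i, value in enumerate(smoking) if value not in allowed]

print(unexpected)
print(rows_to_query)
```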

Likewise when compiling a histogram, look out for very high or very low values, or ones that are implausible. For example, for a blood test, there will be a lab range indicating what is physically possible. Any data point outside of this range must be an error.
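A range check like this can be sketched in a few lines. The limits below are placeholders, not a real lab range, and the participant IDs and results are made up.

```python
# Placeholder limits; replace with the range your lab or protocol specifies.
LOWER, UPPER = 2.0, 30.0

# Hypothetical results keyed by participant ID.
results = {"P001": 5.4, "P002": 54.0, "P003": 6.1, "P004": -1.0}

# Keep the ID alongside each flagged value so it can be queried.
out_of_range = {
    pid: value for pid, value in results.items() if not LOWER <= value <= UPPER
}
print(out_of_range)
```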




Correcting for errors

When you find errors, you can use the source data to amend them. If a potential error turns out to be correct, note that down too.

Of course, the source data is not always available because datasets can be taken from weird and wonderful places. If you can’t go back to the source data, you’ll need to make an objective decision as to what you will do with values that might be errors. There is no right or wrong way to do this. It will depend on how plausible it is that it could be a true value.

The most important thing to remember is to be consistent in your decision-making. Set a threshold and treat every value beyond it in the same way. If you are cherry picking certain values to exclude, you might start introducing bias into your data – something very important to avoid.



One rule that is often used is to remove any value that is more than 4 standard deviations above or below the mean as being implausible. This is a rule you could use, as long as you apply it to every value beyond that threshold.
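The 4 standard deviation rule could be sketched as below. The function name `flag_beyond_sd` and the measurements are hypothetical, and I have assumed the sample standard deviation is the one intended.

```python
import statistics

def flag_beyond_sd(values, n_sd=4.0):
    """Return (index, value) pairs lying more than n_sd sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > n_sd * sd]

# Hypothetical measurements: 30 plausible values and one suspicious one.
values = [4.8, 5.0, 5.2] * 10 + [25.0]

flagged = flag_beyond_sd(values)
flagged_positions = {i for i, _ in flagged}
cleaned = [v for i, v in enumerate(values) if i not in flagged_positions]

# Printing the rule and what it removed gives you a record for the final report.
print(f"rule: remove values more than 4 SD from the mean; removed {flagged}")
```

Because the same function is applied to every value, the rule is consistent by construction. Bear in mind that in a very small dataset no value can sit 4 standard deviations from the mean, so this rule will never flag anything there.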

We’re not talking about manually editing data in Excel here, because it’s really hard to keep an audit trail (if you’re analysing clinical trials then you shouldn’t be doing that anyway).

If you’re making decisions like this, it’s important to keep a record of them because you will need to say in your final report what you did and why you did it. You could do this using pen and paper or a Word document, but what I recommend is using a statistical package (more on this below!).


Step 3 – Get ready for analysis

The third and final step of preparing your data is to start to get it ready for analysis. Just like in Step 2, I recommend you do this one variable at a time, moving from left to right.


Redundant variables

A redundant variable is anything that you don’t need for analysis. For example, this might be columns that have been entered as part of the study process but aren’t needed for analysis. Or it might be that you’ve produced a clean version of the variable so you no longer need the original version.

I recommend that you remove any redundant variables, ideally within a statistical package using code that can be repeated each time. There is no harm in keeping these, but it just makes it a bit harder to see what’s what.

Likewise if you’ve got a dataset that has lots of variables, you might only be looking at a subset of these for your analysis. Rather than keeping all of them in there, I would recommend only keeping the ones that you need.
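Dropping redundant variables in code, rather than by hand, might look something like this sketch. All the column names are invented. The check for missing columns is an assumption of mine, so that a typo in the list fails loudly.

```python
# Hypothetical rows as read by csv.DictReader; column names are invented.
rows = [
    {"participant_id": "P001", "age": "34", "smoking_raw": "Never ",
     "smoking": "never smoker", "entered_by": "AB"},
    {"participant_id": "P002", "age": "51", "smoking_raw": "EX",
     "smoking": "ex-smoker", "entered_by": "CD"},
]

keep = ["participant_id", "age", "smoking"]  # variables needed for this analysis

# Fail early if a variable we want to keep is not in the dataset.
missing = [name for name in keep if name not in rows[0]]
if missing:
    raise KeyError(f"columns not found: {missing}")

analysis_rows = [{name: row[name] for name in keep} for row in rows]
print(analysis_rows[0])
```

Listing the variables to keep, rather than the ones to drop, means the code still does the right thing if extra columns appear in a later data extract.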



Translate data into a readable format

It’s a good idea to create dummy codes and labels for your variables. A dummy code allows you to input text variables (such as our smoking variable) into statistical packages that might not recognise text values. For example, never smoker = 0, ex-smoker = 1 and current smoker = 2. To help you remember the definitions of 0, 1 and 2, you would also create labels that attach the names to these dummy codes.
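In code, the dummy codes and their labels can be kept as a pair of lookups. This is only a sketch. The codes are the ones from the smoking example, and the data values are made up.

```python
# Codes from the smoking example; the label lookup is simply the reverse mapping.
codes = {"never smoker": 0, "ex-smoker": 1, "current smoker": 2}
labels = {code: text for text, code in codes.items()}

smoking = ["never smoker", "current smoker", "ex-smoker", "never smoker"]  # hypothetical data

# A KeyError here means a value that the coding scheme does not cover.
smoking_coded = [codes[value] for value in smoking]

print(smoking_coded)
print([labels[code] for code in smoking_coded])
```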

At this point you should also be scoring any validated questionnaires. For example, if you’ve used something like the PAID questionnaire or the HADS (Hospital Anxiety and Depression Scale), these will tend to have validated scoring methods. This means they will have a way of attaching a score to each answer and totalling them all up to get an overall score. Doing this now will save you time later on when you might not have the questionnaires to hand.
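To show the general shape of a scoring step, here is a sketch for a made-up 4-item questionnaire. It is not the scoring method for PAID, HADS or any real instrument, so always follow the published scoring instructions for the one you have used. The function name, the reverse-scored item and the rule for missing answers are all assumptions.

```python
def total_score(item_scores, reverse_items=(), max_item=3):
    """Sum item scores, reversing the listed item positions.

    Returns None if any item is missing, so incomplete questionnaires
    are handled explicitly rather than silently scored low.
    """
    if any(score is None for score in item_scores):
        return None
    return sum(
        max_item - score if i in reverse_items else score
        for i, score in enumerate(item_scores)
    )

# Hypothetical 4-item instrument scored 0-3, with the second item reverse-scored.
print(total_score([2, 0, 1, 3], reverse_items={1}))
print(total_score([2, None, 1, 3], reverse_items={1}))
```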


Inclusion and exclusion criteria

Finally, check your inclusion and exclusion criteria. This might be a set of formal criteria you have in a protocol upfront, especially for observational studies or those that have only a subset of people that you are interested in for your analysis. This is less important for clinical trials, since anyone excluded would not have taken part in the study at all. Consider that some people may have withdrawn consent over time and you’ll need to take these people out of the dataset.
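Applying criteria like these in code could look like the sketch below. The age criterion is an invented example rather than something from a real protocol, and the participants are made up.

```python
# Hypothetical participants; the age criterion is an assumed example.
participants = [
    {"participant_id": "P001", "age": 34, "withdrew_consent": False},
    {"participant_id": "P002", "age": 16, "withdrew_consent": False},
    {"participant_id": "P003", "age": 58, "withdrew_consent": True},
]

def exclusion_reason(person):
    """Return why a participant is excluded, or None if they stay in."""
    if person["withdrew_consent"]:
        return "withdrew consent"
    if person["age"] < 18:
        return "under 18"
    return None

excluded = {
    p["participant_id"]: exclusion_reason(p)
    for p in participants
    if exclusion_reason(p)
}
included = [p for p in participants if exclusion_reason(p) is None]

print(excluded)
print([p["participant_id"] for p in included])
```

Keeping the reason next to each excluded ID gives you the numbers you need when you report how many people were excluded and why.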


A note on statistical packages

Personally, I tend to do all of my data preparation within one programme, completing all three steps in one go. I feed my data into the package, and do all of my checking, processing and cleaning in the same place.

On the last line of my programme I save the dataset. Then in a separate programme, I feed in the nice clean version of my dataset and use that for analysis.
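The two-programme workflow can be sketched in Python as two functions, one that prepares and saves the data and one that loads the clean version for analysis. The file names, column names and cleaning step are all hypothetical, and a temporary folder is used so the example runs anywhere.

```python
import csv
import tempfile
from pathlib import Path

def prepare(raw_path, clean_path):
    """Preparation programme: read raw data, clean it, save the result last."""
    with open(raw_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Example cleaning step (assumed): tidy stray whitespace and capitalisation.
    for row in rows:
        row["smoking"] = row["smoking"].strip().lower()
    with open(clean_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

def load_clean(clean_path):
    """Analysis programme: start from the saved clean dataset only."""
    with open(clean_path, newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical file names in a temporary folder.
folder = Path(tempfile.mkdtemp())
raw_file, clean_file = folder / "raw.csv", folder / "clean.csv"
raw_file.write_text("participant_id,smoking\nP001, Never \nP002,EX\n")

prepare(raw_file, clean_file)
clean_rows = load_clean(clean_file)
print(clean_rows)
```

The raw file is never overwritten, so the preparation step can be rerun from scratch whenever the data or the cleaning rules change.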

The benefits of using a statistical package for this process are many, including:

  • it will keep the records for you,
  • it’s repeatable: you can remember what you’ve done without bits of paper everywhere to remind you,
  • if either you or someone else picks the work up later, you’ll both know exactly where to find everything.



Are you ready to start preparing your data?

And with that, you’re done! You can save your copy of your dataset and get ready to analyse it.

If you’re raring to go but would like a little helping hand, I have a checklist that is available for free. You can print it out and work through the processes step by step. Soon enough, they’ll be second nature.



