Randomly Generating a Truncated Normal Distribution
A brief foray into the actual nitty-gritty aspects of research, especially the part that seems to dominate the time of a graduate student in Epidemiology: coding.
Particularly, the code for generating random numbers from a truncated normal distribution in SAS 9.2. Generating numbers from a regular normal distribution is easy as pie:
x=rannor(seed), where x is the variable you want to be random, and seed is a number. -1 and 0 will generate pseudorandom numbers from – I believe – your processor’s clock, whereas some other number, say 123 will use a set seed for the algorithm, which means that while the numbers will be “random”, you’ll get the same list of random numbers each time, handy for repeatability while you’re developing.
But lets say you want a variable that only goes between -1 and 1, or in this example -2 and 2? The normal distribution doesn’t actually *have* boundaries, so you’re going to get numbers outside that range. Below is annotated code for producing a stream of numbers like that.
do i = 1 to 10000;
do while (random<min or random>max);
The variable names here have been designed to be self-explanatory. mean and sd are the mean and standard deviation of the normal distribution you want to draw from. max and min the maximum and minimum values you want generated. random is the randomly generated variable – in this case, we generate 10,000 values so we get a large number of random values, and we can reliably take a look at the results and make sure we’re getting what we want to get. The little bit about random = max +1 (min-1 also works) keeps the loop going – without it, the first random number that satisfies the boundary conditions shuts down the loop, and you’ll get 10,000 samples of a single number. When generating, rather than manipulating data, it’s best to get a firm handle on how the SAS system handles DATA steps, loops etc. They are, on occasion, nightmarish.
How to check? A simple bit of PROC UNIVARIATE code:
proc univariate data=work.trunc_norm;
histogram/normal( color=blue mu=0 sigma=1);
This generates some statistical moments about the random variable, and then a histogram of the random variable, with a blue overlay of the normal distribution we’re looking for – one with a mean of 0 and a standard deviation of 1. Did it work?
N = 10,000 – hooray, we got all the numbers we wanted.
Mean= 0.0042 – pretty close to zero for the number’s we’re looking for
Standard Deviation= 0.8841- note per the comments (and ignoring a ‘herp derp’ moment on my part) that this will not be one because we’ve changed the distribution.
The minimum and maximum values? 1.9980 and -1.9970 respectively – right at where we wanted the boundaries to be. The histogram:
Generally, it looks like the code above generates what we want it to – which is good, as its being used in my work.
Acknowledgements: Much of the code above is adapted from a post by Bernard Tremblay on the SAS-L archives from 1997.
Filed under: Epidemiology, Grad School Life, SAS | 6 Comments
Tags: RNG, SAS