### Randomly Generating a Truncated Normal Distribution

A brief foray into the actual nitty-gritty aspects of research, especially the part that seems to dominate the time of a graduate student in Epidemiology: coding.

Particularly, the code for generating random numbers from a truncated normal distribution in SAS 9.2. Generating numbers from a *regular *normal distribution is easy as pie:

x=rannor(seed), where x is the variable you want to be random, and seed is a number. -1 and 0 will generate pseudorandom numbers from – I believe – your processor’s clock, whereas some other number, say 123 will use a set seed for the algorithm, which means that while the numbers will be “random”, you’ll get the same list of random numbers each time, handy for repeatability while you’re developing.

But lets say you want a variable that only goes between -1 and 1, or in this example -2 and 2? The normal distribution doesn’t actually *have* boundaries, so you’re going to get numbers outside that range. Below is annotated code for producing a stream of numbers like that.

**data work.trunc_norm;**

** mean=0;**

** sd=1;**

** max=2;**

** min=-2;**

** do i = 1 to 10000;**

** random=max+1;**

** do while (random<min or random>max);**

** random=rannor(-1)*sd+mean;**

** end;**

** output;**

** end;**

**run;**

The variable names here have been designed to be self-explanatory. **mean** and **sd** are the mean and standard deviation of the normal distribution you want to draw from. **max** and **min** the maximum and minimum values you want generated. **random** is the randomly generated variable – in this case, we generate 10,000 values so we get a large number of random values, and we can reliably take a look at the results and make sure we’re getting what we want to get. The little bit about **random = max +1 ** (min-1 also works) keeps the loop going – without it, the first random number that satisfies the boundary conditions shuts down the loop, and you’ll get 10,000 samples of a single number. When generating, rather than manipulating data, it’s best to get a firm handle on how the SAS system handles DATA steps, loops etc. They are, on occasion, nightmarish.

How to check? A simple bit of PROC UNIVARIATE code:

**proc univariate data=work.trunc_norm;**

** var random;**

** histogram/normal( color=blue mu=0 sigma=1);**

**run;**

This generates some statistical moments about the random variable, and then a histogram of the random variable, with a blue overlay of the normal distribution we’re looking for – one with a mean of 0 and a standard deviation of 1. Did it work?

N = 10,000 – hooray, we got all the numbers we wanted.

Mean= 0.0042 – pretty close to zero for the number’s we’re looking for

Standard Deviation= 0.8841- note per the comments (and ignoring a ‘herp derp’ moment on my part) that this will *not* be one because we’ve changed the distribution.

The minimum and maximum values? 1.9980 and -1.9970 respectively – right at where we wanted the boundaries to be. The histogram:

Generally, it looks like the code above generates what we want it to – which is good, as its being used in my work.

Acknowledgements: Much of the code above is adapted from a post by Bernard Tremblay on the SAS-L archives from 1997.

Filed under: Epidemiology, Grad School Life, SAS | 6 Comments

Tags: RNG, SAS

Thanks. I’m using this idea in a project I’m working on. As my statistician friend put it, “It will make pseudorandom normal-like data, but I wouldn’t use it if anyone’s life is on the line.”

Now, how can I create a truncated skewed normal distribution? 🙂

Nor would I use it when someone’s life is on the line. To be frank, I wouldn’t use anything I found in a blog when someone’s life is on the line 🙂

As for a truncated skewed normal distribution, I imagine there are ways, but at the moment, I’m away from a machine with SAS.

The moments of the truncated distribution are going to be different than the moments of the underlying normal. The only reason that the sample mean of the simulated data is close to zero is because you used a symmetric truncation region. The standard devitation of the truncated sample is less than 1 because you’ve chopped off the tails, thus reducing the variance.

Rick – indeed they are going to be different. The sample mean is indeed close to zero because the truncation was symmetric. Never the less, it’s a useful diagnostic check for something one knows should be true. I’ve had enough supposedly straightforward code go haywire for whatever reason, as as you mentioned, you can’t check the other moments – I’ve edited the original post somewhat.

Sure, bu the point is also doing it efficiently. If for instance one is interested in a region say (0-X), with X<< , this will be VERY inefficient; but then you probably would not uses SAS for that

Hans,

The post was *a* way to generate a truncated normal distribution, not necessarily the best way, or strictly the most efficient. But I’ve found it sufficient for my purposes, and more importantly for me, approachable to read and understand. A different algorithm, if it takes someone ten or fifteen minutes to wrap their head around, needs to be *much* more efficient if its only being used once in a blue moon.

But yes, there’s soom to optimize.