Efficiently Generating Gaussian Noise in an FPGA

Intro

Gaussian noise is required for noise modulation in the Harmon Instruments signal generator, at a rate of 1 billion samples per second of Gaussian white noise. The commonly used Ziggurat and Box-Muller methods of generating a Gaussian distribution from a uniform distribution would consume a large amount of FPGA resources, given that 4 parallel implementations would be required. Uniformly distributed random numbers, on the other hand, are inexpensive to generate in terms of FPGA resources and power, so a central limit theorem based approach is used. Some error in the tails of the distribution is acceptable.
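As a quick illustration of the approach (plain Python, not the FPGA design): summing even a modest number of uniform variables already looks quite Gaussian near the mean, with the tails being the weak point.

```python
import random
import statistics

# Illustrative only: sum N independent uniforms on [0, 1) and centre the
# result. By the central limit theorem the sum is approximately Gaussian
# with mean 0 and variance N/12.
random.seed(0)
N = 12
samples = [sum(random.random() for _ in range(N)) - N / 2
           for _ in range(100_000)]

mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
# mean is close to 0, variance close to N/12 = 1.0
```

The hardware below uses the same idea, but built from 1-bit uniforms so that the summation reduces to small LUTs and adders.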

This is implemented in nMigen rather than the Verilog I have used previously.

Uniform Random Number Generator

A combined Tausworthe generator is used, with the parameters from line 1 of Table 2 in the reference. The implementation in the Boost C++ library served both as a starting point and as a comparison in test coverage. The `PseudoRandomBitSequence` module accepts requests for streams of arbitrary numbers of bits and instantiates as many `PseudoRandomBitSequence64` modules as required to cover all of the requests. This is the kind of parametrization that nMigen enables.

```python
from nmigen import *
import random

class LFSR(Elaboratable):
    def __init__(self, w, k, q, s, resetval=None):
        # w: word width, k: effective LFSR length,
        # q and s: shift parameters of the Tausworthe recurrence
        if resetval is None:
            resetval = random.randint(999, 2**w - 1)
        self.o = Signal(w, reset=resetval)
        self.nextval = Cat(
            self.o[k-(s+q):w-q] ^ self.o[k-s:], self.o[w-k:w-s])

    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.o.eq(self.nextval)
        return m

class PseudoRandomBitSequence64(Elaboratable):
    def __init__(self):
        self.o = Signal(64)

    def elaborate(self, platform):
        m = Module()
        # three Tausworthe components, combined by XOR
        g1 = m.submodules.gen1 = LFSR(64, 63,  5, 24)
        g2 = m.submodules.gen2 = LFSR(64, 58, 19, 13)
        g3 = m.submodules.gen3 = LFSR(64, 55, 24,  7)
        m.d.sync += self.o.eq(g1.o ^ g2.o ^ g3.o)
        return m

class PseudoRandomBitSequence(Elaboratable):
    def __init__(self):
        self.nbits = 0
        self.requests = []
        self.elaborated = False

    def get(self, bits):
        # reserve a slice of the random bit stream;
        # must be called before elaboration
        assert not self.elaborated
        self.nbits += bits
        v = Signal(bits)
        self.requests.append(v)
        return v

    def elaborate(self, platform):
        m = Module()
        print("PRBS: {} bits".format(self.nbits))
        # round the total bit count up to a multiple of 64
        while (self.nbits & 63) != 0:
            self.nbits += 1
        urnd = Signal(self.nbits)
        for i in range(self.nbits // 64):
            sm = m.submodules["prbs_" + str(i)] = PseudoRandomBitSequence64()
            m.d.comb += urnd[64*i:64*(i+1)].eq(sm.o)
        bit = 0
        for r in self.requests:
            m.d.comb += r.eq(urnd[bit:bit+len(r)])
            bit += len(r)
        self.elaborated = True
        return m
```
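For verification it helps to have a software model of one Tausworthe step. The sketch below is an assumption on my part (not code from the actual test bench): it implements the conventional shift-and-mask form of the update alongside an explicit bit-slice version written exactly like the `Cat()` expression in `LFSR.nextval`, and cross-checks that the two agree.

```python
import random

MASK64 = (1 << 64) - 1
PARAMS = [(63, 5, 24), (58, 19, 13), (55, 24, 7)]  # (k, q, s), w = 64

def taus_step(z, k, q, s, w=64):
    """One step of a w-bit Tausworthe component, shift/mask form."""
    b = (((z << q) ^ z) & MASK64) >> (k - s)
    mask = MASK64 ^ ((1 << (w - k)) - 1)  # keep only the top k bits
    return (((z & mask) << s) & MASK64) ^ b

def taus_step_slices(z, k, q, s, w=64):
    """Same step written as the concatenation of bit slices in LFSR."""
    bit = lambda n: (z >> n) & 1
    low = [bit(k-s-q+i) ^ bit(k-s+i) for i in range(w - k + s)]
    high = [bit(w-k+i) for i in range(k - s)]
    return sum(v << i for i, v in enumerate(low + high))

random.seed(1)
for k, q, s in PARAMS:
    for _ in range(100):
        z = random.getrandbits(64)
        assert taus_step(z, k, q, s) == taus_step_slices(z, k, q, s)
```

The low slice of the next state is the feedback term `b`; the high slice is the retained top bits shifted up by `s`. Since the two parts never overlap, the XOR in the shift/mask form reduces to the concatenation the hardware performs.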

Central Limit Theorem Gaussian Generation

The sum of uncorrelated random number generator outputs has a probability density function (PDF) equal to the convolution of all of the individual generator PDFs.

Subtracting two unsigned uniform random variables gives a triangular PDF centered about zero. The base value of our Gaussian is the sum of two of these triangular PDF random variables.
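The triangular shape is easy to see exactly with small numbers. A minimal sketch, using 3-bit uniforms purely for illustration:

```python
from collections import Counter

# Difference of two independent 3-bit uniform variables: the count of
# (a, b) pairs producing each difference forms a triangle centred on 0.
N = 8
pmf = Counter(a - b for a in range(N) for b in range(N))
# peak of 8 pairs at 0, falling linearly to 1 pair at each extreme (+-7)
```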

Getting a reasonable approximation of the Gaussian tails requires many more than 4 uniform inputs, which would require significant FPGA resources. Using just three 6-input FPGA LUTs, we can combine 6 binary uniform inputs into a distribution ranging from -3 to 3. Summing 16 of these 3-bit LUT outputs gives the sum of 96 1-bit uniform random variables, with a range of ±48. This sum is shifted left to match the width of the sum of the two triangular distributions and added to them. After the left shift there are gaps in the distribution, but these are filled in by the two triangular PDFs.
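The ±48 range and the shape of this 96-bit sum can be checked exactly by convolving the LUT's probability mass function with itself 16 times. This is a sketch, not the simulation used for the design; the weights 1, 6, 15, 20, 15, 6, 1 out of 64 are the binomial coefficients for 6 bits, matching the case ranges in `GaussLUT` below.

```python
from collections import Counter

# PMF of one LUT output: sum of six random bits, recentred to -3..3
lut_pmf = {v - 3: c / 64 for v, c in enumerate([1, 6, 15, 20, 15, 6, 1])}

def convolve(p, q):
    """PMF of the sum of two independent discrete variables."""
    out = Counter()
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)

pmf = {0: 1.0}
for _ in range(16):
    pmf = convolve(pmf, lut_pmf)

# range is exactly -48..+48; variance is 16 * 1.5 = 24,
# since each LUT output has binomial variance 6 * 1/4
var = sum(v * v * p for v, p in pmf.items())
```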

Probability mass function of the sum of two shifted LUT outputs

```python
class GaussLUT(Elaboratable):
    def __init__(self, i, o):
        self.i = i  # 6 bits, uniform random
        self.o = o  # signed, 3 bits, range is -3 to 3

    def elaborate(self, platform):
        m = Module()
        # map the 64 equally likely input values to the binomial
        # distribution of the sum of 6 random bits: the magnitude is set
        # by the case range, the sign by the low input bit
        with m.Switch(self.i):
            for i in range(64):
                with m.Case(i):
                    sign = -1 if (i & 1) else 1
                    if i < 20:
                        m.d.sync += self.o.eq(0)
                    elif i < 50:
                        m.d.sync += self.o.eq(1*sign)
                    elif i < 62:
                        m.d.sync += self.o.eq(2*sign)
                    else:
                        m.d.sync += self.o.eq(3*sign)
        return m

class GaussGen(Elaboratable):
    def __init__(self, prbs, enable, nbits=18):
        self.i = prbs.get(96 + 4*(nbits-7))
        self.o = Signal(signed(nbits))

    def elaborate(self, platform):
        m = Module()

        gluto = Array(Signal(signed(3), name='gluto_{}'.format(a))
                      for a in range(16))
        for i in range(16):
            m.submodules["glut{}".format(i)] = GaussLUT(
                i=self.i[6*i:6*i+6], o=gluto[i])

        # pipelined adder tree summing the 16 LUT outputs: 16 -> 8 -> 4
        # -> 2 -> 1; the extra latency relative to the triangular terms
        # only skews independent signals, so the distribution is unaffected
        add1 = Array(Signal(signed(4), name='add1_{}'.format(a))
                     for a in range(8))
        for i in range(8):
            m.d.sync += add1[i].eq(gluto[2*i] + gluto[2*i+1])

        add2 = Array(Signal(signed(5), name='add2_{}'.format(a))
                     for a in range(4))
        for i in range(4):
            m.d.sync += add2[i].eq(add1[2*i] + add1[2*i+1])

        add3 = Array(Signal(signed(6), name='add3_{}'.format(a))
                     for a in range(2))
        for i in range(2):
            m.d.sync += add3[i].eq(add2[2*i] + add2[2*i+1])

        add4 = Signal(signed(7))  # +- 48 range
        m.d.sync += add4.eq(add3[0] + add3[1])

        # two triangular PDF signals:
        tpdf = Array(Signal(signed(len(self.o)-6), name='tpdf_{}'.format(a))
                     for a in range(2))
        i_tri = self.i[96:]
        bits_tripart = len(self.o) - 7
        for i in range(2):
            i_tri2 = i_tri[2*bits_tripart*i:2*bits_tripart*i+2*bits_tripart]
            m.d.sync += tpdf[i].eq(i_tri2[:bits_tripart] - i_tri2[bits_tripart:])

        m.d.sync += self.o.eq((add4 << bits_tripart) + tpdf[0] + tpdf[1])

        return m
```
Histogram from simulation

The orange region in the middle is the triangular PDF created from two uniform distributions.
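For reference, the theoretical standard deviation of the output can be worked out from the structure above. This is a back-of-the-envelope sketch assuming the default nbits = 18; the constants mirror the widths used in `GaussGen`.

```python
# Variance of the generator output for nbits = 18: a +-48 binomial sum
# (variance 16 * 1.5 = 24) shifted left by bits_tripart, plus two
# triangular variables, each the difference of two 11-bit uniforms.
nbits = 18
bits_tripart = nbits - 7          # 11
N = 1 << bits_tripart             # 2048 values per uniform input

var_add4 = 16 * 1.5               # 16 LUTs, each with variance 6 * 1/4
var_uniform = (N * N - 1) / 12    # discrete uniform on 0..N-1
var_tri = 2 * var_uniform         # difference of two independent uniforms

sigma2 = var_add4 * (1 << bits_tripart) ** 2 + 2 * var_tri
sigma = sigma2 ** 0.5             # about 1.0e4 counts RMS
```

Almost all of the variance comes from the shifted binomial sum; the triangular terms mainly serve to fill in the gaps left by the left shift.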