# Efficiently Generating Gaussian Noise in an FPGA

### Intro

Gaussian noise is required for noise modulation in the Harmon Instruments signal generator, at a rate of 1 billion samples per second of Gaussian white noise. The commonly used Ziggurat and Box-Muller methods of generating a Gaussian distribution from a uniform distribution would require a large amount of FPGA resources, given that 4 parallel implementations would be needed. Uniformly distributed random numbers, on the other hand, are inexpensive to generate in terms of FPGA resources and power. Given this, a central limit theorem based approach is used. Some error in the tails of the distribution is acceptable.

This is implemented in nMigen rather than the Verilog I have used previously.

### Uniform Random Number Generator

A combined Tausworthe generator is used, with the parameters from line 1 of Table 2 in the reference. The implementation in the Boost C++ library was used both as a starting point and for comparison in test coverage. The `PseudoRandomBitSequence` module accepts requests for streams of arbitrary numbers of bits and instantiates as many `PseudoRandomBitSequence64` modules as required to cover all of the requests. This is something enabled by nMigen.

```python
from nmigen import *
import random


class LFSR(Elaboratable):
    def __init__(self, w, k, q, s, resetval=None):
        if resetval is None:
            resetval = random.randint(999, 2**w - 1)
        self.o = Signal(w, reset=resetval)
        self.nextval = Cat(
            self.o[k-(s+q):w-q] ^ self.o[k-s:],
            self.o[w-k:w-s])

    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.o.eq(self.nextval)
        return m


class PseudoRandomBitSequence64(Elaboratable):
    def __init__(self):
        self.o = Signal(64)

    def elaborate(self, platform):
        m = Module()
        g1 = m.submodules.gen1 = LFSR(64, 63, 5, 24)
        g2 = m.submodules.gen2 = LFSR(64, 58, 19, 13)
        g3 = m.submodules.gen3 = LFSR(64, 55, 24, 7)
        # Combine the three component generators by XOR.
        m.d.sync += self.o.eq(g1.o ^ g2.o ^ g3.o)
        return m


class PseudoRandomBitSequence(Elaboratable):
    def __init__(self):
        self.nbits = 0
        self.requests = []
        self.elaborated = False

    def get(self, bits):
        """Request a stream of `bits` uniform random bits."""
        assert not self.elaborated
        self.nbits += bits
        v = Signal(bits)
        self.requests.append(v)
        return v

    def elaborate(self, platform):
        m = Module()
        print("PRBS: {} bits".format(self.nbits))
        # Round the total up to a multiple of 64 bits.
        while (self.nbits & 63) != 0:
            self.nbits += 1
        urnd = Signal(self.nbits)
        for i in range(self.nbits // 64):
            sm = m.submodules["prbs_" + str(i)] = PseudoRandomBitSequence64()
            m.d.comb += urnd[64*i:64*(i+1)].eq(sm.o)
        # Hand each requester its own slice of the combined random word.
        bit = 0
        for r in self.requests:
            m.d.comb += r.eq(urnd[bit:bit + len(r)])
            bit += len(r)
        self.elaborated = True
        return m
```
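As a sanity check, the LFSR update above can be modeled in plain Python by mirroring the nMigen slice and `Cat` semantics (slices are LSB-first, and `Cat` places its first argument in the low bits). This is an illustrative sketch, not part of the design; the seed values are arbitrary:

```python
def lfsr_next(state, w=64, k=63, q=5, s=24):
    # Expand the state into an LSB-first bit list, matching nMigen slicing.
    o = [(state >> i) & 1 for i in range(w)]
    # o[k-(s+q):w-q] ^ o[k-s:] forms the low w-k+s bits of the next state.
    lo = [o[k - (s + q) + i] ^ o[k - s + i] for i in range(w - k + s)]
    # o[w-k:w-s] forms the high k-s bits.
    hi = [o[w - k + i] for i in range(k - s)]
    return sum(b << i for i, b in enumerate(lo + hi))

# Three component generators with independent seeds, XORed as in hardware.
states = [12345 << 40, 67890 << 40, 24680 << 40]
params = [(64, 63, 5, 24), (64, 58, 19, 13), (64, 55, 24, 7)]
for _ in range(4):
    states = [lfsr_next(st, *p) for st, p in zip(states, params)]
combined = states[0] ^ states[1] ^ states[2]
```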

### Central Limit Theorem Gaussian Generation

The sum of uncorrelated random number generator outputs has a probability density function (PDF) equal to the convolution of all of the individual generator PDFs.
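For the discrete generators used here the same statement holds for probability mass functions. A tiny illustrative sketch (not part of the design), summing two single uniform bits:

```python
def convolve(p, q):
    """PMF of X + Y for independent X and Y, given as {value: count} dicts."""
    out = {}
    for a, ca in p.items():
        for b, cb in q.items():
            out[a + b] = out.get(a + b, 0) + ca * cb
    return out

coin = {0: 1, 1: 1}              # a single uniform bit
two_bits = convolve(coin, coin)  # {0: 1, 1: 2, 2: 1}: already triangular
```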

Subtracting two unsigned uniform random variables gives a triangular PDF centered about zero. The base value of our Gaussian is the sum of two of these triangular PDF random variables.
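A quick check of that claim in Python (illustrative only; the count of each difference d out of n² pairs is n - |d|, which is the triangular shape):

```python
from collections import Counter

def diff_pmf(nbits):
    """PMF of a - b for independent a, b uniform on [0, 2**nbits)."""
    n = 2 ** nbits
    return Counter(a - b for a in range(n) for b in range(n))

pmf = diff_pmf(3)
# Symmetric about zero with a peak at 0: count of difference d is 8 - |d|.
```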

Getting a reasonable approximation of the Gaussian tails requires many more than 4 uniform inputs, which would require significant FPGA resources. Using just three 6-input FPGA LUTs, we can combine 6 binary uniform inputs into a distribution ranging from -3 to 3. Using 16 of these 3-bit LUTs gives the sum of 96 1-bit uniform random variables, with a range of ±48. This sum is shifted left to match the width of the sum of the two triangular distributions and added to those. After the left shift operation there are gaps in the distribution, but these are filled in by the two triangular PDFs.
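The thresholds used in the `GaussLUT` table below (20, 30, 12, and 2 of the 64 input codes, with the sign taken from the low bit) reproduce exactly the binomial PMF of the sum of six ±1/2 coin flips. This can be verified in a few lines of Python:

```python
from collections import Counter
from itertools import product

# Exact PMF of the sum of six fair +-1/2 coin flips, values -3..3.
exact = Counter(sum(bits) - 3 for bits in product((0, 1), repeat=6))

# PMF implemented by the 64-entry lookup table.
def lut(i):
    sign = -1 if (i & 1) else 1
    if i < 20:
        return 0
    elif i < 50:
        return 1 * sign
    elif i < 62:
        return 2 * sign
    return 3 * sign

lut_pmf = Counter(lut(i) for i in range(64))
# lut_pmf == exact: {0: 20, +-1: 15, +-2: 6, +-3: 1}
```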

##### Sum of two shifted LUT output probability mass function

```python
class GaussLUT(Elaboratable):
    def __init__(self, i, o):
        self.i = i  # 6 bits, uniform random
        self.o = o  # signed, 3 bits, range is -3 to 3

    def elaborate(self, platform):
        m = Module()
        # Map the 64 input codes to a binomial PMF: 20 codes give 0,
        # 30 give +-1, 12 give +-2, 2 give +-3; the sign comes from bit 0.
        with m.Switch(self.i):
            for i in range(64):
                with m.Case(i):
                    sign = -1 if (i & 1) else 1
                    if i < 20:
                        m.d.sync += self.o.eq(0)
                    elif i < 50:
                        m.d.sync += self.o.eq(1 * sign)
                    elif i < 62:
                        m.d.sync += self.o.eq(2 * sign)
                    else:
                        m.d.sync += self.o.eq(3 * sign)
        return m


class GaussGen(Elaboratable):
    def __init__(self, prbs, enable, nbits=18):
        self.i = prbs.get(96 + 4*(nbits - 7))
        self.o = Signal(signed(nbits))

    def elaborate(self, platform):
        m = Module()
        gluto = Array(Signal(signed(3), name='gluto_{}'.format(a))
                      for a in range(16))
        for i in range(16):
            m.submodules["glut{}".format(i)] = GaussLUT(
                i=self.i[6*i:6*i + 6], o=gluto[i])
        # Pipelined adder tree summing the 16 LUT outputs.
        add1 = Array(Signal(signed(4), name='add1_{}'.format(a))
                     for a in range(8))
        for i in range(8):
            m.d.sync += add1[i].eq(gluto[2*i] + gluto[2*i+1])
        add2 = Array(Signal(signed(5), name='add2_{}'.format(a))
                     for a in range(4))
        for i in range(4):
            m.d.sync += add2[i].eq(add1[2*i] + add1[2*i+1])
        add3 = Array(Signal(signed(6), name='add3_{}'.format(a))
                     for a in range(2))
        for i in range(2):
            m.d.sync += add3[i].eq(add2[2*i] + add2[2*i+1])
        add4 = Signal(signed(7))  # +-48 range
        m.d.sync += add4.eq(add3[0] + add3[1])
        # Two triangular PDF signals, each the difference of two uniforms.
        tpdf = Array(Signal(signed(len(self.o) - 6), name='tpdf_{}'.format(a))
                     for a in range(2))
        i_tri = self.i[96:]
        bits_tripart = len(self.o) - 7
        for i in range(2):
            i_tri2 = i_tri[2*bits_tripart*i:2*bits_tripart*(i+1)]
            m.d.sync += tpdf[i].eq(i_tri2[:bits_tripart] - i_tri2[bits_tripart:])
        # Shift the LUT sum up and fill the gaps with the triangular parts.
        m.d.sync += self.o.eq((add4 << bits_tripart) + tpdf[0] + tpdf[1])
        return m
```
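The standard deviation of the output can be computed analytically, which may be useful when scaling the result downstream. A sketch assuming the default `nbits=18`; the constants follow from Var = 1/4 per ±1/2 coin flip and Var = (n² - 1)/12 for a discrete uniform on [0, n):

```python
import math

nbits = 18
bits_tripart = nbits - 7        # 11: width of each uniform in the triangular parts
n = 2 ** bits_tripart

var_sum96 = 96 * 0.25           # sum of 96 fair +-1/2 flips, range +-48
var_tri = 2 * (n * n - 1) / 12  # one triangular PDF = difference of two uniforms
# The +-48 sum is shifted left by bits_tripart, scaling its variance by 4**bits_tripart.
var_total = var_sum96 * 4 ** bits_tripart + 2 * var_tri
sigma = math.sqrt(var_total)    # about 10102
```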

##### Histogram from simulation

The orange in the middle is the triangular PDF created from two uniform distributions.

##### Ideal and actual PDF