Efficiently Generating Gaussian Noise in an FPGA

Intro

Gaussian noise is required for noise modulation in the Harmon Instruments signal generator. The requirement is 1 billion samples per second of Gaussian white noise. The commonly used Ziggurat and Box-Muller methods of generating a Gaussian distribution from a uniform distribution would require a large amount of FPGA resources, given that 4 parallel implementations would be needed. Uniformly distributed random numbers, in contrast, are inexpensive to generate in terms of FPGA resources and power, so a central limit theorem based approach will be used. Some error in the tails of the distribution is acceptable.

This is implemented in nMigen rather than the Verilog I have used previously.

Uniform Random Number Generator

A combined Tausworthe generator is used, with the parameters from line 1 of Table 2 in the reference. The implementation in the Boost C++ library served both as a starting point and as a comparison in test coverage. The PseudoRandomBitSequence module accepts requests for streams of arbitrary numbers of bits and instantiates as many PseudoRandomBitSequence64 modules as required to cover all of the requests. This is something enabled by nMigen.

from nmigen import *
import random

class LFSR(Elaboratable):
    # one component of a combined Tausworthe generator; w is the word
    # width and k, q, s are the parameters from the reference
    def __init__(self, w, k, q, s, resetval=None):
        if resetval is None:
            # random nonzero seed; very small values are avoided
            resetval = random.randint(999, 2**w-1)
        self.o = Signal(w, reset=resetval)
        self.nextval = Cat(
            self.o[k-(s+q):w-q] ^ self.o[k-s:], self.o[w-k:w-s])
    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.o.eq(self.nextval)
        return m

class PseudoRandomBitSequence64(Elaboratable):
    def __init__(self):
        self.o = Signal(64)
    def elaborate(self, platform):
        m = Module()
        g1 = m.submodules.gen1 = LFSR(64, 63,  5, 24)
        g2 = m.submodules.gen2 = LFSR(64, 58, 19, 13)
        g3 = m.submodules.gen3 = LFSR(64, 55, 24,  7)
        m.d.sync += self.o.eq(g1.o ^ g2.o ^ g3.o)
        return m

class PseudoRandomBitSequence(Elaboratable):
    def __init__(self):
        self.nbits = 0
        self.requests = []
        self.elaborated = False
    def get(self, bits):
        # requests are only valid before elaboration
        assert not self.elaborated
        self.nbits += bits
        v = Signal(bits)
        self.requests.append(v)
        return v
    def elaborate(self, platform):
        m = Module()
        print("PRBS: {} bits".format(self.nbits))
        while (self.nbits & 63) != 0:
            self.nbits += 1
        urnd = Signal(self.nbits)
        for i in range(self.nbits//64):
            sm = m.submodules["prbs_"+str(i)] = PseudoRandomBitSequence64()
            m.d.comb += urnd[64*i:64*(i+1)].eq(sm.o)
        bit = 0
        for r in self.requests:
            m.d.comb += r.eq(urnd[bit:bit+len(r)])
            bit += len(r)
        self.elaborated = True
        return m
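
A usage sketch (the widths here are hypothetical): each consumer calls get() before elaboration, and the total is rounded up to a multiple of 64 at elaboration time.

prbs = PseudoRandomBitSequence()
a = prbs.get(18)   # 18 uniform random bits per clock for one consumer
b = prbs.get(140)  # 140 more for another
# 158 bits round up to 192, so 3 PseudoRandomBitSequence64
# instances will be created at elaboration
m = Module()
m.submodules.prbs = prbs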

Central Limit Theorem Gaussian Generation

The sum of uncorrelated random number generator outputs has a probability density function (PDF) equal to the convolution of all of the individual generator PDFs.

Subtracting one unsigned uniform random variable from another gives a triangular PDF centered about zero. The base value of our Gaussian is the sum of two of these triangular PDF random variables.

Getting a reasonable approximation of the Gaussian tails requires many more than 4 uniform inputs, which would require significant FPGA resources. Using just three 6-input FPGA LUTs, we can combine 6 binary uniform inputs to give a distribution ranging from -3 to 3. Using 16 of these 3-bit LUTs gives the sum of 96 1-bit uniform random variables, giving a range of ±48. This sum is shifted left to match the width of the sum of the two triangular distributions and added to them. After the left shift operation, there are gaps in the distribution, but these are filled in by the two triangular PDFs.
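
This construction is easy to sanity check offline. A quick NumPy sketch of the same combination (the variable names are mine; nbits matches the generator below):

import numpy as np

rng = np.random.default_rng()
n = 1_000_000
nbits = 18
bits_tripart = nbits - 7  # width of each uniform in the triangular pairs

# sum of 96 random bits, centered: binomial PMF spanning -48 to +48
clt = rng.integers(0, 2, size=(n, 96)).sum(axis=1) - 48
# two triangular PDF variables, each the difference of two uniforms
u = rng.integers(0, 2**bits_tripart, size=(n, 4))
tri = (u[:, 0] - u[:, 1]) + (u[:, 2] - u[:, 3])
samples = (clt << bits_tripart) + tri
# sigma is about 2**bits_tripart * sqrt(24.33); the histogram closely
# matches a Gaussian out to several sigma
print(samples.mean(), samples.std())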

Probability mass function of the sum of two shifted LUT outputs

LUT output

class GaussLUT(Elaboratable):
    def __init__(self, i, o):
        self.i = i # 6 bits uniform random
        self.o = o # signed, 3 bits range is -3 to 3
    def elaborate(self, platform):
        m = Module()
        # the case thresholds (20, 50, 62) realize the PMF of the sum of
        # 6 random bits, centered: 20/64 for 0, 15/64 for each of ±1,
        # 6/64 for each of ±2, 1/64 for each of ±3
        with m.Switch(self.i):
            for i in range(64):
                with m.Case(i):
                    sign = -1 if (i&1) else 1
                    if i < 20:
                        m.d.sync += self.o.eq(0)
                    elif i < 50:
                        m.d.sync += self.o.eq(1*sign)
                    elif i < 62:
                        m.d.sync += self.o.eq(2*sign)
                    else:
                        m.d.sync += self.o.eq(3*sign)
        return m
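
The case thresholds above give exactly the distribution of the sum of 6 coin flips, centered at zero. A short standalone check (plain Python, not part of the design):

from math import comb

counts = {v: 0 for v in range(-3, 4)}
for i in range(64):
    sign = -1 if (i & 1) else 1
    if i < 20:
        counts[0] += 1
    elif i < 50:
        counts[1 * sign] += 1
    elif i < 62:
        counts[2 * sign] += 1
    else:
        counts[3 * sign] += 1
# P(v) = C(6, v + 3) / 64, the PMF of (sum of 6 bits) - 3
for v in range(-3, 4):
    assert counts[v] == comb(6, v + 3)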

class GaussGen(Elaboratable):
    def __init__(self, prbs, enable, nbits=18):
        self.i = prbs.get(96 + 4*(nbits-7))
        self.o = Signal(signed(nbits))
    def elaborate(self, platform):
        m = Module()

        gluto = Array(Signal(signed(3), name='gluto_{}'.format(a)) for a in range(16))
        for i in range(16):
            m.submodules["glut{}".format(i)] = GaussLUT(i=self.i[6*i:6*i+6], o=gluto[i], clk=self.clk)

        add1 = Array(Signal(signed(4), name='add1_{}'.format(a)) for a in range(8))
        for i in range(8):
            m.d.sync += add1[i].eq(gluto[2*i] + gluto[2*i+1])

        add2 = Array(Signal(signed(5), name='add2_{}'.format(a)) for a in range(4))
        for i in range(4):
            m.d.sync += add2[i].eq(add1[2*i] + add1[2*i+1])

        add3 = Array(Signal(signed(6), name='add3_{}'.format(a)) for a in range(2))
        for i in range(2):
            m.d.sync += add3[i].eq(add2[2*i] + add2[2*i+1])

        add4 = Signal(signed(7)) # range ±48
        m.d.sync += add4.eq(add3[0] + add3[1])

        # two triangular PDF signals:
        tpdf = Array(Signal(signed(len(self.o)-6), name='tpdf_{}'.format(a)) for a in range(2))
        i_tri = self.i[96:]
        bits_tripart = len(self.o) - 7
        for i in range(2):
            i_tri2 = i_tri[2*bits_tripart*i:2*bits_tripart*i+2*bits_tripart]
            m.d.sync += tpdf[i].eq(i_tri2[:bits_tripart] - i_tri2[bits_tripart:])

        m.d.sync += self.o.eq((add4 << bits_tripart) + tpdf[0] + tpdf[1])

        return m
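
The histogram below was produced this way. A minimal simulation harness sketch, assuming the nmigen.back.pysim API of the nMigen version used here (the clock period and sample count are arbitrary):

from nmigen.back.pysim import Simulator

m = Module()
prbs = PseudoRandomBitSequence()
gauss = m.submodules.gauss = GaussGen(prbs, enable=Signal())
m.submodules.prbs = prbs

sim = Simulator(m)
sim.add_clock(1e-9)
samples = []
def proc():
    for _ in range(100000):
        yield
        samples.append((yield gauss.o))
sim.add_sync_process(proc)
sim.run()
# histogram `samples` to produce the plot below
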
Histogram from simulation

The orange trace in the middle is the triangular PDF created from two uniform distributions.

Distribution

Ideal and actual PDF


Implementing a True Random Number Generator on Xilinx 7 Series FPGAs

Intro

The Harmon Instruments platform uses a Xilinx Zynq (ARM Cortex A9 + FPGA) running Linux. Booting from a RAM disk with a very deterministic boot process results in little entropy being available when network services start and generate encryption keys. There were numerous warnings like "uninitialized urandom read" in the boot logs, and SSH logins were slow.

I wanted to use existing hardware to supply the Linux RNG. Possibilities included the Zynq's built-in ADC with its noisy temperature sensor, the embedded controller's built-in ADC which measures power supply voltages, or doing something in the FPGA fabric.

Doing it in the FPGA seemed the best and cleanest option for the software architecture, as it could be accessed via a memory mapped register.

There are several FPGA logic based TRNG papers. This one seemed the most promising.

Random number generation is not a field in which I am an expert, and my assumptions may be flawed. Still, I expect this is a big improvement over what entropy Linux had available in the 3 seconds between beginning to boot and starting network services.

Approach

TRNG block diagram

Implementing the programmable delay lines from the paper mentioned above resulted in minimal delay variation; perhaps the LUTs on 7 series parts have more balanced delays than those of the older parts. I opted to use the MMCME2 variable fine phase shift available in the Xilinx 7 Series FPGAs. Presumably, the fine phase shift is implemented similarly to the delay in the IDELAY, which uses a chain of inverters with a variable supply voltage; the supply voltage is adjusted by a PLL to achieve the required propagation delay. The IDELAY is explained in XAPP707.

The MMCME2 fine phase shift resolution is about 17.8 picoseconds. There is perhaps 50 ps of phase variation due to the switching power supply on VCCINT in addition to the random jitter of the MMCME2.

The carry chain of an adder is used to get a series of closely spaced taps in time. One adder input is all 1's and the other is all 0's except the least significant bit (LSB), which is connected to a 100 MHz clock via the MMCME2. The capture flip flops are clocked with the same 100 MHz clock as is provided to the MMCM input. The MMCM phase is adjusted so that the capture flip flops catch the carry propagation at approximately the halfway point. The average incremental carry delay through the adder is about 25 ps per bit.
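
Expressed behaviorally in nMigen, the sampler is just an adder and a register. A minimal sketch (the signal names are mine; the constraints needed to keep the carry chain and place the capture flops are not shown, and a simulator only sees settled values since the entropy comes from physical propagation timing):

class CarrySampler(Elaboratable):
    def __init__(self, n=12):
        self.clk_delayed = Signal()  # 100 MHz clock via the MMCME2 phase shift
        self.o = Signal(n)           # captured carry chain taps
        self.n = n
    def elaborate(self, platform):
        m = Module()
        total = Signal(self.n + 1)
        # all ones plus the delayed clock on the LSB: while the input is 1,
        # a carry ripples up the chain, flipping each sum bit from 1 to 0
        m.d.comb += total.eq(Const(2**self.n - 1, self.n + 1) + self.clk_delayed)
        # the capture flops run on the undelayed 100 MHz clock and catch
        # the carry roughly halfway along the chain
        m.d.sync += self.o.eq(total[:self.n])
        return m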

Raw adder output looks like this:

000111111111
000001011111
000101011111
000001011111
000101011111
000000011111
000000011111
000001011111
000001011111
000001011111
000001011111
000101011111
000111111111
000101011111
000001011111
000111111111
000000011111
000001011111
000101111111
000001011111
000101111111
000001011111
000000011111
000000011111
000001011111
000001011111
000111111111
000001011111
000001011111
000101011111
000001011111
000101011111
000001011111
000001011111
000001011111
000000011111
000000011111
000111111111
000000011111
000001011111
000001011111
000001011111
000000011111
000111111111
000101011111
000000011111
000101011111
000000011111
000000011111
000101011111
000101011111
000001011111
000101111111
000111111111
000111111111
000111111111
000111111111
000111111111
000001011111
000001011111
000000011111
000000011111
000000011111
000000011111

Noise source

The noise being measured derives from many sources, some random and some deterministic. The biggest random source is clock jitter added by the MMCME2, which is mostly due to thermal noise. The biggest deterministic source is power supply ripple on VCCINT. Surprisingly, there appears to be little to no metastability in the capture flip flops.

Post Processing

The addition results are reduced to a single bit after the capture flip flops by XORing all of the bits together.

A second stage of post processing XORs together the single bit results of 32 addition operations. This gives a final data rate of 3.125 Mb/s (100 MHz / 32).
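
As a software model (plain Python, using two rows of the raw data above as example input):

def xor_reduce(bits):
    # parity of the captured word: XOR of all bits together
    out = 0
    for b in bits:
        out ^= b
    return out

captured = [[int(c) for c in row]
            for row in ("000111111111", "000001011111")]
# stage 1: one bit per 100 MHz adder sample
stage1 = [xor_reduce(sample) for sample in captured]
# stage 2: XOR 32 stage-1 bits per output bit -> 100e6 / 32 = 3.125 Mb/s
stage2 = [xor_reduce(stage1[i:i+32]) for i in range(0, len(stage1), 32)]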

Analysis: Dieharder

This one requires a lot of data. I captured 8 GiB, which took over 6 hours and over 2 trillion operations of the timing skewed adder. All tests pass. The -Y 1 option tells it to rerun any test with a result labeled 'weak' until it clearly passes or fails; it did that in one case. The full output is too long to list.

$ dieharder -a -g 201 -f ~/vna/software/zynqsw/dump -Y 1
#=============================================================================#
#            dieharder version 3.31.1 Copyright 2003 Robert G. Brown          #
#=============================================================================#
   rng_name    |           filename             |rands/second|
 file_input_raw|/home/dlharmon/vna/software/zynqsw/dump|  3.97e+07  |
#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
   diehard_birthdays|   0|       100|     100|0.66926960|  PASSED
      diehard_operm5|   0|   1000000|     100|0.75113475|  PASSED
  diehard_rank_32x32|   0|     40000|     100|0.92027413|  PASSED

A second run showed that test passing, but flagged a few others as 'weak'. It seems to report at least one test as weak even with the internal PRNGs or /dev/urandom.

Calling this one is tricky: $ dieharder -a -g 201 -f $DUMPFILE. The -g 201 tells it to use raw binary data from the file specified with -f; with just -f, it silently uses an internal PRNG.

Analysis: ent

$ ent $DUMPFILE -b
Entropy = 1.000000 bits per bit.

Optimum compression would reduce the size
of this 68719476736 bit file by 0 percent.

Chi square distribution for 68719476736 samples is 3.29, and randomly
would exceed this value 6.99 percent of the times.

Arithmetic mean value of data bits is 0.5000 (0.5 = random).
Monte Carlo value for Pi is 3.141582468 (error 0.00 percent).
Serial correlation coefficient is 0.000001 (totally uncorrelated = 0.0).

And without the second stage of post processing:

Entropy = 0.994544 bits per bit.

Optimum compression would reduce the size
of this 1048576 bit file by 0 percent.

Chi square distribution for 1048576 samples is 7920.65, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bits is 0.4565 (0.5 = random).
Monte Carlo value for Pi is 3.474296178 (error 10.59 percent).
Serial correlation coefficient is -0.001865 (totally uncorrelated = 0.0).

Analysis: rngtest

Some failures are expected with completely random data.

$ rngtest <$DUMPFILE
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

rngtest: starting FIPS tests...
rngtest: entropy source exhausted!
rngtest: bits received from input: 68719476736
rngtest: FIPS 140-2 successes: 3433210
rngtest: FIPS 140-2 failures: 2763
rngtest: FIPS 140-2(2001-10-10) Monobit: 342
rngtest: FIPS 140-2(2001-10-10) Poker: 338
rngtest: FIPS 140-2(2001-10-10) Runs: 1064
rngtest: FIPS 140-2(2001-10-10) Long run: 1038
rngtest: FIPS 140-2(2001-10-10) Continuous run: 1
rngtest: input channel speed: (min=1.488; avg=8354.665; max=19073.486)Mibits/s
rngtest: FIPS tests speed: (min=1.364; avg=79.565; max=93.497)Mibits/s
rngtest: Program run time: 832147763 microseconds

Feeding /dev/random

Simply writing to /dev/random does not increase the system's entropy bit count; that requires the RNDADDENTROPY ioctl. Details are on this man page. To be conservative, I set the bits of entropy to the bit count of the data written divided by 8.

// Copyright 2019 Harmon Instruments, LLC
// MIT license

#include <fcntl.h>
#include <linux/random.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

class DevRandom {
private:
        int _fd;

public:
        DevRandom() {
                _fd = open("/dev/random", O_RDWR);
                if (_fd < 0) {
                        throw std::runtime_error("failed to open /dev/random");
                }
        }
        ~DevRandom() { close(_fd); }
        // returns the Linux entropy pool count in bits
        int get_entropy_count() {
                int rv = 0;
                int iocrv = ioctl(_fd, RNDGETENTCNT, &rv);
                if (iocrv != 0) {
                        perror("get_entropy_count failed");
                        throw std::runtime_error("get_entropy_count failed");
                }
                return rv;
        }
        // bits is the number of bits of entropy e is assumed to contain
        void add_entropy(std::vector<uint32_t> &e, uint32_t bits) {
                std::vector<uint32_t> d;
                d.resize(e.size() + 2);
                d[0] = bits;                    // number of bits of entropy
                d[1] = e.size() * sizeof(e[0]); // number of bytes in buffer
                for (size_t i = 0; i < e.size(); i++) // buffer
                        d[2 + i] = e[i];
                int iocrv = ioctl(_fd, RNDADDENTROPY, &d[0]);
                if (iocrv != 0) {
                        perror("add_entropy failed");
                        throw std::runtime_error("add_entropy failed");
                }
        }
};

Lightweight FPGA bitstream compression

Intro

The Harmon Instruments products have FPGAs on expansion boards which need to be configured at power on. These bitstreams are stored in files on an SD card on the main board and are sent over serial links to the expansion boards. It's desirable to minimize the size of the data transferred and hence the time required. The compression provided by Xilinx and Lattice in their tools is helpful, but only removes duplicated configuration frames. There are still long runs of zeros in the compressed bitstreams.

Algorithm

This needs to be kept simple for fast decompression by a low end microcontroller (STM32F030, ARM Cortex M0).

A byte value from 1 to 127 encodes a run of that many zero bytes. A byte value of 128 + n (129 to 255) encodes the start of a verbatim block: the next n bytes (1 to 127) are copied to the output, and processing resumes with the byte after the block. Byte values of 0 and 128 are no-ops.
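
For reference, this is the decompression loop in Python; the C version running on the STM32 follows the same structure but is not shown here:

def decompress(data):
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 0 or b == 128:
            continue  # no-op / padding
        if b < 128:
            out += bytes(b)  # a run of b zero bytes
        else:
            n = b - 128  # n verbatim bytes follow
            out += data[i:i+n]
            i += n
    return out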

Compression ratio

As expected, this simple algorithm isn't the most efficient, but it's good enough and fast to extract. Icecompr, intended for use with Lattice iCE40 parts, provides a better compression ratio and is a good option if small size is more important than decompression speed.

Both the Xilinx and Lattice FPGAs have a fairly low logic utilization. The Xilinx XC7A15 has the same die as the XC7A35T and XC7A50T and is limited by Vivado in how much logic can be used, making the bitstream even more sparse. The Lattice Diamond bitstream compression is more effective than that of Xilinx.

For the sake of minimizing configuration time, the FPGA vendor compression is used first.

Xilinx XC7A15T bitstream:

Algorithm           bytes
raw                 2192012
Vivado              959472
Vivado + this       219236
Vivado + icecompr   149316
Vivado + gzip       103820
Vivado + xz -e      73072

Lattice LFE5U-12 bitstream:

Algorithm           bytes
raw                 582675
Diamond             109207
Diamond + this      60060
Diamond + icecompr  57474
Diamond + gzip      25194
Diamond + xz -e     22800

Speed

The STM32F030 configures the XC7A15T by JTAG from the compressed bitstream in less than 360 ms. It's mostly limited by the 24 Mb/s SPI, achieving a throughput of 21 Mb/s. The bitstream is transferred by serial wire debug into a ring buffer in the STM32. This core is used to do those writes with the SWD clock running at 31.25 MHz.

Code

#!/usr/bin/env python3

# FPGA bitstream compression with an emphasis on decompression speed and simplicity

import sys

with open(sys.argv[1], "rb") as f:
    bs = bytearray(f.read())

print("input size = {} bytes".format(len(bs)))

# return the number of zero bytes starting at position i of bs (up to 127)
def get_zeros(i):
    rv = 0
    maxbytes = min(len(bs)-i, 127)
    while rv < maxbytes:
        if bs[rv+i] != 0:
            break
        rv += 1
    return rv

# return the length of a verbatim block starting at position i of bs (up to
# 127 bytes); isolated runs of 1 or 2 zeros are cheaper left verbatim
def get_data(i):
    rv = 0
    maxbytes = min(len(bs)-i, 127)
    while rv < maxbytes:
        if bs[i+rv] == 0:
            if rv+2 < maxbytes:
                if (bs[i+rv+1] != 0) or (bs[i+rv+2] != 0):
                    rv += 2
                    continue
            return rv
        rv += 1
    return rv

odata = bytearray()

i = 0

while i < len(bs):
    # emit tokens for runs of zeros (1 to 127 zeros each)
    while True:
        z = get_zeros(i)
        if z == 0:
            break
        i += z
        odata.append(z)

    # emit verbatim blocks: a length token (129 to 255) followed by the data
    while True:
        dc = min(127, get_data(i))
        if dc == 0:
            break
        odata.append(dc+128)
        odata += bs[i:i+dc]
        i += dc

# pad the output to a multiple of 4 bytes with no-op zero tokens
while len(odata) & 3 != 0:
    odata.append(0)

print("compressed length = {}".format(len(odata)))

with open(sys.argv[2], "wb") as f:
    f.write(odata)

# now extract it and make sure it matches
rc = bytearray()

i = 0
while i < len(odata):
    if odata[i] < 128:
        # a run of odata[i] zero bytes; bytearray(n) is n zero bytes,
        # and a token of 0 is a no-op
        rc += bytearray(odata[i])
        i += 1
    else:
        # verbatim block of odata[i] - 128 bytes
        count = odata[i] - 128
        rc += odata[i+1:i+1+count]
        i += count + 1

print("extracted length = {}, match {}".format(len(rc), rc==bs))

Linux Userspace IO Interrupts on Xilinx Zynq

Intro

It's now possible to use a complex peripheral with interrupts and DMA under Linux using UIO and the generic-uio driver rather than having to write a kernel module.

There was a kernel module for the Harmon Instruments VNA which was working well, but using UIO means it will not have to be maintained.

This example uses 3 system calls per interrupt due to limitations in generic-uio; with a kernel module, only 1 system call per interrupt would be required for this hardware. It might be desirable to add an ioctl() call to the UIO driver which simply blocks until the interrupt count is at least 1 more than at the last call, leaves the interrupt enabled, and has a timeout. In the Harmon Instruments VNA case, only about 50 interrupts per second are processed and the time per interrupt is about 20 microseconds, so less than 0.1% of a CPU is wasted.

Linux 4.9 is used in the example.

FPGA

The logic in the Zynq PL (FPGA side) presents a rising edge to the IRQ_F2P[0] port on the Zynq PS (CPU side) on completion of DMA operations. One IRQ_F2P interrupt is enabled of a possible 16. Vivado confusingly states "The MSB is assigned the highest interrupt ID of 91" and "[91:84], [68:61]" for the interrupt numbers. There is a Xilinx Answer Record that implies it should be interrupt 91. After a bit of grepping the generated wrapper, I found there's a parameter, IRQ_F2P_MODE, which is set to "DIRECT" by default in my case, but can be set to "REVERSE", in which case it would be interrupt 91. Following the logic in the generated wrapper and consulting the documentation, it is evident it is interrupt 61. Xilinx should have just brought out all 16 interrupts, avoiding the interrupt mapping confusion of the wrapper.

Kernel Config

The UIO driver needs to be enabled in the kernel configuration. It is under Device Drivers -> Userspace I/O drivers in menuconfig.

CONFIG_UIO=y
CONFIG_UIO_PDRV_GENIRQ=y

There is also CONFIG_UIO_DMEM_GENIRQ if dynamically allocated DMA buffers are required. This example uses a static DMA buffer allocation at boot time.

Device Tree

chosen {
        bootargs = "console=ttyPS0,115200 earlyprintk root=/dev/ram mem=384M uio_pdrv_genirq.of_id=generic-uio";
        linux,stdout-path = "/amba/serial@e0000000";
        stdout-path = "serial0:115200n8";
};

&amba {
      hififo: hififo@40000000 {
                      compatible = "generic-uio";
                      interrupt-parent = <&intc>;
                      interrupts = <0 29 1>;
                      reg = <0x40000000 0x1000 0x18000000 0x8000000>;
              };
};

The "mem=384M" in bootargs tells the system to only use the lower 384 MiB of the 512 MiB of available RAM. A large contiguous DMA buffer is required and this is a convenient way to get it on an embedded system. The "uio_pdrv_genirq.of_id=generic-uio" is required to enable the generic-uio driver as explained in this Github discussion

Hififo is our device using the UIO driver. It has 4 kiB of registers at address 0x40000000 (mmap index 0) and 128 MiB of DMA space at 0x18000000 (mmap index 1). Above, I stated our interrupt is 61, and here it is 29; there's an offset of 32 for a shared peripheral interrupt in the device tree. The 1 after the 29 indicates that it is a rising edge interrupt, triggering only once on the completion signal.

Userspace

This blog post was very helpful with the user space code. I chose to wrap the UIO syscalls in a C++ class to avoid "goto fail" like error handling.

#include <cstdint>
#include <cstddef>
class UIO {
private:
        int _fd;

public:
        explicit UIO(const char *fn);
        ~UIO();
        void unmask_interrupt();
        void wait_interrupt(int timeout_ms);
        friend class UIO_mmap;
};

class UIO_mmap {
private:
        size_t _size;
        void *_ptr;

public:
        UIO_mmap(const UIO &u, int index, size_t size);
        ~UIO_mmap();
        size_t size() const { return _size; }
        void *get_ptr() const { return _ptr; }
};
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <cstdlib>
#include <cstdint>
#include <poll.h>
#include <errno.h>
#include <cstdio>
#include <stdexcept>

UIO::UIO(const char *fn) {
        _fd = open(fn, O_RDWR);
        if (_fd < 0)
               throw std::runtime_error("failed to open UIO device");
}

UIO::~UIO() { close(_fd); }

void UIO::unmask_interrupt() {
        uint32_t unmask = 1;
        ssize_t rv = write(_fd, &unmask, sizeof(unmask));
        if (rv != (ssize_t)sizeof(unmask)) {
                perror("UIO::unmask_interrupt()");
        }
}

void UIO::wait_interrupt(int timeout_ms) {
        // wait for the interrupt
        struct pollfd pfd = {.fd = _fd, .events = POLLIN};
        int rv = poll(&pfd, 1, timeout_ms);
        // clear the interrupt
        if (rv >= 1) {
               // read the interrupt count to clear the pending condition
               uint32_t info;
               if (read(_fd, &info, sizeof(info)) != (ssize_t)sizeof(info))
                       perror("UIO::wait_interrupt() read");
        } else if (rv == 0) {
               // this indicates a timeout, will be caught by device busy flag
        } else {
               perror("UIO::wait_interrupt()");
        }
}

UIO_mmap::UIO_mmap(const UIO &u, int index, size_t size) : _size(size) {
        _ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                u._fd, index * getpagesize());
        if (_ptr == MAP_FAILED) {
                perror("UIO_mmap");
                std::runtime_error("UIO_mmap construction failed");
        }
}

UIO_mmap::~UIO_mmap() { munmap(_ptr, _size); }

A struct with std::atomic members is a convenient way to represent the device registers. This provides memory barriers to ensure proper ordering of reads and writes.

#include <cstdint>
#include <atomic>
struct FPGAregs {
        std::atomic<uint32_t> interrupt_status;
        std::atomic<uint32_t> busy;
        std::atomic<uint32_t> trigger_dma_ops;
        std::atomic<uint32_t> led;
};

This is a simplified example of usage:

UIO uio("/dev/uio0");
UIO_mmap mmap_regs(uio, 0, 4096);
UIO_mmap mmap_dma(uio, 1, 128*1024*1024);

// access the hardware registers
FPGAregs *regs = reinterpret_cast<FPGAregs *>(mmap_regs.get_ptr());
regs->led = 1;

// DMA operations with interrupt
// not shown: write outgoing data to the DMA buffer
uio.unmask_interrupt(); // we need to unmask first so we don't miss the interrupt
regs->trigger_dma_ops = 1; // trigger our hardware, creating an interrupt on completion
uio.wait_interrupt(1000);
// check the registers to verify operation completed and access result in the DMA buffers
if(regs->busy)
       ...

OSHPark 4 Layer PCB Measurements

Intro

OSHPark PCBs are convenient for various RF test boards at a low cost, given the relatively low loss Isola FR408 material. Trying an idea is much more appealing when the cost is $50 for the board and parts versus $1000+ for a custom RF material stackup prototype. As you will see, there is some variation in properties over the board. These are built with prepreg for the outer layers, whereas RF and microwave PCBs typically have cores for the outer layers for consistent dielectric thickness. These have been used successfully at Harmon Instruments to beyond 50 GHz; loss is extremely high at those frequencies, so lines must be kept short. There is a bypassable frequency doubler in the VNA that measured this data on an OSHPark PCB, providing stimulus from 26.5 GHz to 50 GHz.

Thanks to OSHPark for providing this board and the previous revisions at no charge.

The test board

the test PCB

The first revision used line widths calculated from the values on the OSHPark site. Impedances were consistently high (56 ohms typical) over two orders of the board. This one uses the assumption of 0.19 mm thick prepreg with an Er of 3.3 (versus the 0.17 mm and Er of 3.66 given in the OSHPark docs). See the previous blog post for the Er of just the material; the missing corner of this board was the sample used there. The prepreg in the cross section measures approximately 0.177 mm, but likely varies throughout the board. The total thickness varies about 25 µm over a board.

Microstrip

Time Domain

Time domain of 53 mm unmasked microstrip

0.42 mm wide lines are used for 50 ohm microstrip on this board. As can be seen in the plot above, the impedance is, on average, a bit low; perhaps 0.41 mm would be a good design value. PCB #1 had the most variation, and the impedance correlates with PCB thickness variation along the line: there is a short thick area around the high impedance peak. The other two show a thickness taper along the length. I repeated the measurement of PCB #1 after #2 and #3 and got data nearly indistinguishable from the original measurement.

Time domain of 53 mm masked microstrip

Using the same 0.42 mm width, but adding solder mask, gives a lower impedance of around 47 ohms. A good design value to achieve 50 ohms might be 0.39 mm. Covering microstrip in solder mask does save some insertion loss, but the difference is minimal, so unmasked may still be preferred.

Insertion loss (no solder mask)

Microstrip insertion loss plot

Insertion loss (with solder mask)

Microstrip insertion loss plot

Effective Relative Permittivity (Er)

Effective Er is simply the square of the ratio of the speed of light to the propagation velocity in the line: \(\epsilon_{r,eff} = (c/v)^2\).

Microstrip Er plot

The unmasked microstrip should have an effective Er of around 2.5 given the assumptions about Er and the line geometry. It measures much higher. I presume this is due to metal loss and surface texture, but am not certain of it; comments are welcome. I measure an effective Er of 1.002 for 3.5 mm air lines at 20 GHz using the same differential phase calculation. EM simulation shows that a textured conductor has slower propagation than a smooth one, and this is rather rough copper. The plot below shows time domain step responses of both 53 mm lines, with the masked line reaching 50% at 309.4 ps and the unmasked at 296.4 ps. Those correspond to effective Er values of 3.07 and 2.81. The effect of the mask slowing propagation is clearly visible. The stimulus rise time (10-90%, set by the window function) is about 14 ps; the masked line shows 25 ps and the unmasked 36 ps.
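
Those effective Er values follow directly from the 50% crossing times; a quick check:

from scipy.constants import c

length = 0.053  # 53 mm lines
for name, t50 in (("masked", 309.4e-12), ("unmasked", 296.4e-12)):
    v = length / t50  # propagation velocity
    # prints about 3.06 and 2.81 (3.07 above if c is rounded to 3e8)
    print(name, (c / v) ** 2)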

time domain

Stripline

The striplines on this board are 0.3 mm wide, on inner2, with copper on inner1 and the bottom as ground references. It's very close to 50 ohms. The vias need a bit more clearance (0.2 mm used on this board) as can be seen by the low impedance spikes in the time domain plot.

Time Domain

Stripline time domain plot

The 2.4/1.85 dip is the discontinuity resulting from the connection of a 2.4 mm and 1.85 mm connector. A similar discontinuity results from connecting a 2.92 to a 3.5 mm connector. The Harmon Instruments VNA supports removal of these discontinuities, but it was not enabled here. The via dips are the vias transitioning from microstrip to stripline.

Insertion Loss

Stripline insertion loss plot

The spikes at ~ 2 GHz intervals are artifacts of the LRL on PCB calibration (frequencies where the line phase difference is 180 degrees) and should be ignored. More lines could have been added to avoid these.

Effective Er

Stripline Er

The Er of the core is about 4.0 and the prepreg 3.3. Electromagnetic simulation without metal loss gives an effective Er of 3.55 for this geometry. Like the microstrip case, these values are higher than expected. The values are higher at low frequencies primarily due to metal losses. Stripline is often stated to be free of dispersion, but that assumes lossless materials.

Components

50 ohm series resistor time domain

This one is simply a 49.9 ohm 0402 resistor in series with 1 mm of 50 ohm line on either side and was done as a check on the impedance calibration. The result is as expected.

Murata BLM15GG471 ferrite bead s11 vs current

Murata BLM15HD601 ferrite bead s11 vs current

I'd previously measured these two ferrite beads with no DC bias, but had questioned how much that would change the inductance. The effect is significant at low frequency and small at higher frequencies. It's interesting that the order of the traces reverses above the self resonant frequency as the decreasing shunt inductance reduces the effective capacitance. These ferrite beads are mostly capacitive above the parallel resonance. The VNA bias tees contribute only a negligible amount (0.002 dB insertion loss increase at 200 mA) to this data.

Murata BLM15GG471

Murata BLM15HD601

These parts are both capacitive at higher frequencies and the above performance could be improved by adding short high impedance sections to either side of the ferrite bead. About 120 pH on either side should give the BLM15GG471 20 dB return loss to 30 GHz.

Measuring Material Properties in Waveguide

I saw "Dielectric constant measurement for thin material at microwave frequencies" and wanted to give it a try. This was done in WR-42 waveguide from 18 to 26.5 GHz. Samples were cut to 0.17 x 0.42 inches to fit the waveguide.

Relative Permittivity (Er)

Er of samples

Material                  Thickness  Er (measured)  Er (ref)
Polypropylene (pp)        736 μm     2.14           2.26 at 9.4 GHz
Polycarbonate (pc)        583 μm     2.7            2.77 at 11 GHz
Rogers RO4350B            509 μm     3.59           3.48/3.66 at 10 GHz
OSHPark 4 layer prepreg   205 μm     3.3            3.26 - 4.01
OSHPark 4 layer core      531 μm     4.02           3.26 - 4.01
GE Silicone II (white)    1.875 mm   2.64           unspecified

OSHPark PCB

The OSHPark 4 layer PCBs are made from Isola FR408. As shown in this document, the thicker cores and prepregs have higher glass content and thus higher Er. The 47 mil thick core is not listed in the above PDF, but would presumably be the heaviest glass and have an Er of around 4.0. The prepreg used for the outer layers is a fine glass weave with high resin content, so the measured value of 3.3 seems reasonable. The prepreg sample is very thin and has some taper, so that value may be the least reliable of the group. The core and prepreg samples were obtained by mechanically delaminating a finished PCB.

RO4350

The RO4350B has a specified 10 GHz dielectric constant of 3.48 ± 0.05, measured using a clamped stripline method that tends to underestimate Er on hard materials like RO4350B. The recommended value for design is 3.66, so this measured value of 3.59 seems reasonable.

Silicone II

This is white caulk from a hardware store which I had purchased for my bathtub. The data on this one is questionable as it was hard to cut a precise block of this soft material. A silicone might be interesting as a directional coupler overlay.

Loss Tangent

These values are less reliable and noisy, as the dissipation loss of the samples is on the order of 0.015 dB. The polypropylene has such low loss that errors cause it to appear to have a negative loss tangent. The waveguide shim used as a sample holder was bare brass, which certainly didn't help repeatability. The spikes in the silicone and osh_core traces are likely due to resonance within the sample, as the section of guide containing the sample can support higher order modes.

loss tangent of samples

Approximation not valid

Plot of differing results with approximate solution and exact solution

The plot above shows the amount of error introduced by the paper's assumption that the samples are thin. The data with _est uses the small-value approximation \(\sin(k_{z2}\tau) \approx k_{z2}\tau\). With the smaller guide and higher frequencies used here versus the original paper, the assumption is not valid for samples of reasonable thickness.
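
A quick check of how large that approximation error gets for the samples measured here (thicknesses and measured Er values from the table above):

import numpy as np
from scipy.constants import c, inch

a = 0.42 * inch   # wide WR-42 dimension
lam = c / 26.5e9  # free space wavelength at the top of the band
for name, t, er in (("osh_prepreg", 205e-6, 3.3),
                    ("osh_core", 531e-6, 4.02),
                    ("silicone", 1875e-6, 2.64)):
    kz2 = np.sqrt(er * (2 * np.pi / lam)**2 - (np.pi / a)**2)
    z = kz2 * t
    # relative error of sin(z) ~ z: about 0.7%, 6% and 60% respectively
    print(name, abs(z - np.sin(z)) / np.sin(z))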

The value of Er is solved for using SciPy's optimize.minimize. It's a bit ugly, as minimize doesn't directly handle complex numbers, but it gets the job done.

Future work

Try to do something similar in 7 mm coax.

Code

The SParameter library (used for opening .s2p files, wgsection, S parameter concatenation) has not yet been released, so this is only useful for reference. The library is in C++ with SWIG wrappers and will be released as open source (GPLv3) when it is ready.

#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append("../../")
import SParameter as scpp
from scipy.constants import c, inch
import scipy.optimize

a = 0.42 * inch # wide waveguide dimension in meters

samples = [ # file name, thickness in meters
    ['pc.s2p', 583.0e-6],
    ['pp.s2p', 736.0e-6],
    ['ro4350b.s2p', 509e-6],
    ['osh_prepreg.s2p', 205e-6],
    ['osh_core.s2p', 531e-6],
    ['silicone.s2p', 1875e-6],
]

for sample in samples:
    fn = sample[0]
    t = sample[1] # thickness
    sp = scpp.SParameter(fn)
    sp.reducefreqrange(50e6, 18e9, 26.5e9)
    # add an air filled section of guide the length of the sample
    te = scpp.wgsection(sp, a, t)
    sp = scpp.SParameter(te, sp)
    s = np.array(sp.s)
    s21 = 0.5 * (s[1::4] + s[2::4]) # average of s21, s12
    f = np.array(sp.f)
    e = np.zeros(len(f), dtype=complex)

    for i in range(len(f)):
        wavelength = c / f[i] # free space wavelength
        # kz for the TE10 mode: sqrt((2*pi/wavelength)**2 - (pi/a)**2)
        kz = (np.pi/(wavelength*a)) * np.sqrt( 4.0*np.square(a)-np.square(wavelength))

        def func(z):
            # residual between measured s21 and the exact model
            err = s21[i] - 1.0 / (np.cos(z) + 0.5j*kz*t*(1+np.square(z/(kz*t))) * np.sin(z)/(z))
            return err.real*err.real + err.imag*err.imag

        b0 = [kz*t, 0.0] # initial estimate of z = kz2*t, assuming Er = 1.0
        res = scipy.optimize.minimize(lambda z: func(complex(z[0],z[1])), b0, tol=1e-6)
        kz2 = complex(res.x[0], res.x[1])/t
        e[i] = (np.square(kz2 * wavelength * a / np.pi) + np.square(wavelength))/(4*a*a)

    if len(sys.argv) < 2 or sys.argv[1] != 'loss':
        plt.plot(f/1e9,np.real(e), label=fn)
        plt.ylabel('Er')
    else:
        losstan = -np.imag(e)/np.real(e)
        plt.plot(f/1e9,losstan, label=fn)
        plt.ylabel('loss tangent')

plt.xlabel('Frequency (GHz)')
plt.legend(loc=4)
plt.grid(True)
plt.show()

Unit Testing VNA Calibration Code

The Problem

There is an alphabet soup of VNA calibration types (SOLT, TRL, LRL, TRM, SOLR, QSOLT, and many more), some of which will be supported by the Harmon Instruments VNA and run by the end user. Testing all of the supported calibration types through the normal user interface and measuring verification components for each could easily take an entire day. Using physical calibration standards, uncertainties would be limited to those of the standards. An additional concern with testing on hardware is the wear and tear on the fragile and expensive calibration standards.

Photo of VNA Calibration Standards

Breaking it Down

The data path from ADC values to corrected S-parameters for display is split into layers with independent test coverage. The lowest level, implemented in an FPGA and taking signals from the ADCs, has test coverage using the excellent CocoTB package. The next layer, written in C++, takes integer vector voltage data from the FPGA and produces raw S-parameters. These first two layers are relatively simple mathematical operations. The final layer, correcting the raw S-parameters, is where the complexity lies and is what will be discussed here.

Modeling the VNA and Calibration Standards

For simplicity, we will focus on 2 port VNA calibrations using an 8 term error model (TRL, SOLR, and similar). This model places a fictitious 2 port "error adapter" on either side of the device under test (DUT). These represent internal leakages, losses, gain imbalances, etc., as well as the test port cables. The raw S-parameters are simply those of the DUT in cascade with the two error adapters. In a calibration, sufficient standards are measured to solve for the error adapters.

Knowing the contents of the error adapters, it is possible to run the calibration backwards and calculate the raw S-parameters from the true S-parameters of the devices being measured. Realistic models of the calibration standards are used, with airlines, opens and shorts having loss. Simulated raw data is generated for the required set of standards as well as a few challenging DUTs, and the data is sent through the calibration acquisition and correction software. The corrected data is compared with the originals and verified to be within reasonable rounding error. The tests on the calibration algorithms complete in less than 1 second and run every time the software is built, causing the build to fail if any test fails.
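
The cascade itself is little code. A minimal NumPy sketch using the usual S to T transfer parameter conversion (the function names are mine, not those of the VNA codebase):

import numpy as np

def s_to_t(s):
    # 2x2 S-parameters to transfer (T) parameters
    dets = s[0, 0] * s[1, 1] - s[0, 1] * s[1, 0]
    return np.array([[-dets, s[0, 0]], [-s[1, 1], 1.0]]) / s[1, 0]

def t_to_s(t):
    dett = t[0, 0] * t[1, 1] - t[0, 1] * t[1, 0]
    return np.array([[t[0, 1], dett], [1.0, -t[1, 0]]]) / t[1, 1]

def raw_from_true(err_a, dut, err_b):
    # raw measurement = error adapter A, DUT, error adapter B in cascade
    return t_to_s(s_to_t(err_a) @ s_to_t(dut) @ s_to_t(err_b))

Running the correction on the output of raw_from_true should recover the DUT S-parameters to within rounding error.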

For the purposes of testing the calibration software, the VNA is presumed to be linear and time invariant. In the future, for uncertainty analysis, it may be desirable to add drift and nonlinearity based on characterization of real hardware.

Offline Processing of Acquired Data

By saving raw data for many types of calibration and verification standards, it is possible to generate corrected S-parameters using any calibration type covered by those standards. QSOLT had not yet been coded when this data set was acquired, but it is possible to generate QSOLT corrected data as all of the required standards are present in the data dumps. This is mostly useful to see how the algorithms compare with imperfect standards.

Plot of Maury Microwave 60 mm airline S22 with QSOLT, TRL, TRM and data based QSOLT calibrations

The above plot shows significant differences between calibrations assuming the load standards are ideal (QSOLT, TRM), TRL which uses airlines rather than loads and a data based QSOLT calibration which uses more accurate models of the short, open and load. These differences are not due to software, but the physical standards and their definitions.

Implementation

The S-parameter handling code (C++) on the Harmon Instruments VNA has a SWIG generated Python wrapper and the Python unit testing framework is used to run the tests. Google Test would have been another option. Python works well here due to libraries like NumPy, SciPy and Matplotlib providing mathematical functions and plotting. For instance, to test a C++ implementation of the Kaiser window, I added a test to compare it with one generated by SciPy.