Lightweight FPGA bitstream compression

Intro

The Harmon Instruments products have FPGAs on expansion boards which need to be configured at power-on. The bitstreams are stored in files on an SD card on the main board and are sent over serial links to the expansion boards. It's desirable to minimize the size of the data transferred and hence the time required. The compression provided by Xilinx and Lattice in their tools is helpful, but it only removes duplicated configuration frames. Long runs of zeros still remain in the compressed bitstreams.

Algorithm

The algorithm needs to be kept simple for fast decompression by a low-end microcontroller (STM32F030, ARM Cortex-M0).

A byte between 1 and 127 encodes a run of 1 to 127 zeros. A byte between 129 and 255 marks the start of a verbatim data block of 1 to 127 bytes, where the block length is the byte value minus 128. Those bytes are copied to the output and processing resumes with the byte following the block. Byte values 0 and 128 are no-ops; the compressor below appends 0 bytes to pad its output to a multiple of 4 bytes.
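As an illustration of the byte rules above, here is a minimal decoder sketch with a small hand-built input. The function name `decode` and the example bytes are made up for this illustration; the decoder running on the microcontroller is not shown in this post and may be structured differently.

```python
def decode(data: bytes) -> bytes:
    """Expand a compressed stream following the byte rules described above."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 128:
            # 1..127: run of b zeros; 0 appends nothing (no-op)
            out += bytes(b)
        else:
            # 129..255: copy the next (b - 128) bytes verbatim; 128 copies nothing
            count = b - 128
            out += data[i:i + count]
            i += count
    return bytes(out)


# 3 zeros, then a 2-byte verbatim block (0xAA 0x55), then a no-op pad byte
example = bytes([0x03, 0x82, 0xAA, 0x55, 0x00])
print(decode(example).hex())  # 000000aa55
```

Five input bytes expand to five output bytes here, so nothing is saved; the gain comes from the long zero runs in a real bitstream, where one byte can stand for up to 127 zeros.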

Compression ratio

As expected, this simple algorithm isn't the most efficient, but it's good enough and fast to extract. Icecompr, intended for use with Lattice iCE40 parts, provides a better compression ratio and is a good option if small size is more important than decompression speed.

Both the Xilinx and Lattice FPGAs have fairly low logic utilization. The Xilinx XC7A15T uses the same die as the XC7A35T and XC7A50T, and Vivado limits how much of its logic can be used, which makes the bitstream even more sparse. The Lattice Diamond bitstream compression is more effective than that of Xilinx.

For the sake of minimizing configuration time, the FPGA vendor compression is used first.

Xilinx XC7A15T bitstream

Algorithm             bytes
raw                 2192012
Vivado               959472
Vivado + this        219236
Vivado + icecompr    149316
Vivado + gzip        103820
Vivado + xz -e        73072

Lattice LFE5U-12 bitstream

Algorithm             bytes
raw                  582675
Diamond              109207
Diamond + this        60060
Diamond + icecompr    57474
Diamond + gzip        25194
Diamond + xz -e       22800
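To make the sizes above easier to compare, the short sketch below expresses each one as a percentage of the raw bitstream. The byte counts are copied from the tables; the helper name `percent_of` and the dictionary layout are only illustrative.

```python
def percent_of(size: int, reference: int) -> float:
    """Return size as a percentage of reference."""
    return 100.0 * size / reference


# Byte counts copied from the tables above; the first entry is the raw size.
results = {
    "Xilinx XC7A15T": {
        "raw": 2192012, "Vivado": 959472, "Vivado + this": 219236,
        "Vivado + icecompr": 149316, "Vivado + gzip": 103820,
        "Vivado + xz -e": 73072,
    },
    "Lattice LFE5U-12": {
        "raw": 582675, "Diamond": 109207, "Diamond + this": 60060,
        "Diamond + icecompr": 57474, "Diamond + gzip": 25194,
        "Diamond + xz -e": 22800,
    },
}

for part, sizes in results.items():
    print(part)
    for name, size in sizes.items():
        print(f"  {name:<20} {size:>8} {percent_of(size, sizes['raw']):6.1f}%")
```

For both parts, vendor compression followed by this algorithm comes out at about 10% of the raw size.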

Speed

The STM32F030 configures the XC7A15T by JTAG from the compressed bitstream in less than 360 ms. Configuration time is mostly limited by the 24 Mb/s SPI, which achieves a throughput of 21 Mb/s. The bitstream is transferred by serial wire debug (SWD) into a ring buffer in the STM32. This core is used to do those writes with the SWD clock running at 31.25 MHz.
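As a rough sanity check on these figures, the sketch below works out the average bit rate implied by the quoted configuration time. It rests on an assumption that is not stated in the text: that the data shifted into the FPGA is the expanded stream, i.e. the 959472-byte Vivado output from the table above. The names used are hypothetical.

```python
# ASSUMPTION (not stated in the text): the stream shifted into the FPGA is the
# expanded Vivado-compressed bitstream, 959472 bytes per the size table.
VIVADO_BYTES = 959472
CONFIG_TIME_S = 0.360  # upper bound on configuration time quoted above


def implied_mbps(n_bytes: int, seconds: float) -> float:
    """Average bit rate in Mb/s needed to move n_bytes in the given time."""
    return n_bytes * 8 / seconds / 1e6


print(f"{implied_mbps(VIVADO_BYTES, CONFIG_TIME_S):.1f} Mb/s")  # 21.3 Mb/s
```

Under that assumption, the result lines up with the quoted 21 Mb/s throughput.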

Code

#!/usr/bin/env python3

# FPGA bitstream compression with an emphasis on decompression speed and simplicity

import sys

with open(sys.argv[1], "rb") as f:
    bs = bytearray(f.read())

print("input size = {} bytes".format(len(bs)))

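# return the number of consecutive zero bytes starting at position i of bs,
# capped at 127 (the longest run a single byte can encode)
def get_zeros(i):
    rv = 0
    maxlen = min(len(bs)-i, 127)
    while rv < maxlen:
        if bs[rv+i] != 0:
            break
        rv += 1
    return rv

# return the length of the verbatim data block starting at position i of bs,
# capped at 127. One or two zeros followed by nonzero data are kept inside
# the block, since that is never larger than ending the block, emitting a
# zero run and starting a new block.
def get_data(i):
    rv = 0
    maxlen = min(len(bs)-i, 127)
    while rv < maxlen:
        if bs[i+rv] == 0:
            if rv+2 < maxlen:
                if (bs[i+rv+1] != 0) or (bs[i+rv+2] != 0):
                    rv += 2
                    continue
            return rv
        rv += 1
    return rv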

odata = bytearray()

i=0

while i < len(bs):
    while True:
        z = get_zeros(i)
        if z == 0:
            break
        i += z
        odata.append(z)

    while True:
        dc = min(127,get_data(i))
        if dc == 0:
            break
        odata.append(dc+128)
        odata += bs[i:i+dc]
        i += dc
# pad the output to a multiple of 4 bytes with zeros, which the decoder ignores
while (len(odata) & 3) != 0:
    odata.append(0)

print("compressed length = {}".format(len(odata)))

with open(sys.argv[2], "wb") as f:
    f.write(odata)

# now extract it and make sure it matches
rc = bytearray()

i = 0
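while i < len(odata):
    if odata[i] < 128:
        # run of zeros: bytearray(n) is n zero bytes, so a 0 byte is a no-op
        rc += bytearray(odata[i])
        i += 1
    else:
        # verbatim block of (byte - 128) bytes; 128 copies nothing
        count = odata[i] - 128
        rc += odata[i+1:i+1+count]
        i += count + 1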

print("extracted length = {}, match {}".format(len(rc), rc==bs))