LBNL LLRF in Verilog
Larry Doolittle, December 2002 - January 2003

Overview:

Target devices are the XC2S150 or XC2S200, low cost members of Xilinx's
original Virtex chip family.  The primary input data stream is four
12-bit 40 MS/s ADCs; the primary output data stream is a single 12-bit
80 MS/s DAC.  The signal processing datapath between inputs and output
runs on the 40 MHz ADC clock, with a final interpolation stage for the
output DAC.  Both input and output are narrow band signals based on a
50 MHz carrier; in the datapath they are isomorphic to a 10 MHz signal,
also known as quadrature sampling: I/Q/-I/-Q.  At its heart, the signal
processing implements a Proportional-Integral control loop, operating
on complex numbers.

A host interface is provided to control the chip; that portion of the
logic runs at 25 MHz.  Clock-domain crossing between the host (25 MHz)
and dataflow (40 MHz) sections of the chip is intended to happen at
60 Hz.  The dataflow is active during the pulse; all host activity is
triggered by an interrupt at the end of the pulse.  That interrupt
routine should complete before the start of the next pulse.  Pulses are
triggered by an externally supplied logic level input, which also gates
the RF output (hardwired TTL logic, external to the FPGA).

I attempt to keep this code within the boundaries of IEEE1364-1995,
also known as Verilog-1995.  Of course, the various Xilinx hardware
primitives are not synthesizable.

I have tested the code in simulation using Icarus Verilog
(http://icarus.com/, also see the README file in the test directory).
I have synthesized using Xilinx ISE 4.2 Foundation (see ISE4.2-setup);
it easily makes the required static timing.  I have never observed
discrepancies between simulated and in-chip behavior.

Code layout:

test/ directory has its own documentation.

xilinx/ directory has Verilog simulations of the relevant Xilinx
primitives.  Some are written from scratch, others are the official
Xilinx versions.

adctest.v is the top level Verilog module, which has the device pins as
its I/O.  It also includes the host interface, and a few tidbits that
were too small to deserve their own module.

Verilog modules that "plug in" to adctest.v (the top-level Verilog file
mentioned above):
  dds.v           setpoint generator
  fdbk_loop.v     main PI feedback computations
  history.v       history buffer
  afterburner.v   interpolator to convert 40 MS/s to 80 MS/s data stream
  feedforward.v   feedforward table and sequencing
  rf_timer.v      programmable sequencer triggered from external input
  flasher.v       driver for the heartbeat LED
  ds2401.v        driver for the board's serial number chip
  altsport.v      drives four low speed peripherals: ADC, DAC, PLL,
                  thermometer
  kcm.v           constant coefficient multiplier, configurable from
                  host; the multipliers themselves are used in the
                  feedforward and afterburner modules.

Additional support Verilog modules:
  srl16x16e.v
  srl16x24e.v
  ramdp1024x12.v
  ramdp1024x32.v
  error3.v
  cordic.v

Constraints file (pins and timing):
  adctest.ucf

Other high-level notes:

The Verilog code here totals about 1200 lines of non-comments, plus
another 300 lines or so of comments (see the "loc" shell script).
A previous version written in VHDL consumed twice as much code, and
did less.

The register map visible from the host interface is documented in,
duh, the file called "register-map".

Three of these Verilog files (adctest.v, history.v, and dds.v) include
preprocessor directives to provide compile-time configurability (see
additional note below on what can be configured this way).  While the
Xilinx XST synthesizer in ISE 4.2 appears able to grok these
directives, the automated hierarchy generator cannot.  So to complete
this project in XST, you should use the preprocessed versions
(adctest_p.v, history_p.v, and dds_p.v) that I have run through
iverilog -E (see Makefile).

The dds.v module converts set_i and set_q to the I/Q/-I/-Q waveform.
If compiled with DDS support, the output frequency is adjustable.
Set freq=0 to get the normal behavior.  See cordic.v for the guts of
the polar/rectangular conversion.  Unfortunately, a quirk of that
conversion function means that when compiled with DDS support, set_i
and set_q are scaled by 1.64676.

The fdbk_loop.v module uses error3.v as a submodule.  Those two should
either go in the same file, or the code should just be merged.

If I knew how to properly pull the host interface into a module, or
otherwise clean up the interface between the host and the various
virtual peripherals, I'd do it.  Brainless attempts would create an
interface definition that was just as large and error-prone as the
code itself.  I'm curious to see just what has to change to support
the famous Stettler/LANL/VME local bus.

The list of modules above includes a set that are machine generated:
srl16x16e.v and srl16x24e.v are generated by srle.pl; ramdp1024x12.v,
ramdp1024x32.v, and ramdp4096x12.v are generated by bram.pl.  These .v
files lack any intellectual content.  Their only function is to
aggregate Xilinx SRL16E and Block RAM primitives, respectively, into
convenient packages.

There is one exception to the "cross clock domains at 60 Hz" rule, and
that is to set and read back the timing configuration registers.  See
more commentary in rf_timer.v.

Craig Swanson showed me how to write arithmetic saturation using a
function, see check-funct.v.  It looks nice and simulates properly,
but I haven't yet tested this with synthesis and hardware.  Plus, I'd
like to not have to write a saturation function for each combination
of input and output bit widths.

If I get squeezed for logic cells, div4080.v could drop into
flasher.v, although I would include conditional compilation logic to
revert to the extant implementation (avoiding any perception that you
need Xilinx patent 5,343,406 or 6,262,597 to flash a lamp).

sportx.v is an alternative implementation of altsport.v.
It is leaner code, more obviously synthesizable, and Xilinx XST
synthesizes it to about a third the size of altsport.v.  It is also
totally untested against real hardware, and has a different
register-level interface.

altsport.v (and sportx.v) are nearly useless without a tested software
layer that provides abstraction from the register-level programming.
For altsport.v, that layer is given in altsport.c, which also includes
code for a shell-level utility to access the serial port functions.
Likewise, and even more so, ds2401.v needs a host-level driver.  That
is given in ds2401.c.

Both ds2401.c and altsport.c require some low-level help connecting to
the hardware.  Specifically, they use inw() and outw() calls to read
and write registers, given an offset into the FPGA's address space.
Implementing these functions is operating-system specific, while the
ds2401.c and altsport.c routines themselves are extremely generic
(i.e., will probably work on everything but Microsoft Windows).  An
interface for embedded StrongARM Linux, also used by the final EPICS
driver, involves llrf.h, getram.h, and getram_big.c.  Look for those
files on the SNS EPICS CVS server.

When the processor actually goes to use the serial devices, the step
that consumes the most CPU cycles is reading the temperature (not
counting reading the serial number, which only happens once on
bootup).  That takes about 75 us (300 us in altsport.v until I install
and test the latest update from Ming Choy), limited by the TCN75
thermometer chip itself.  The current code requires CPU intervention
at the 25 and 50 us marks, making it impractical to give the CPU other
work to do while the serial transaction completes.  The most valuable
upgrade to the whole serial system would be to read the temperature in
a single step, so the CPU can do something else useful in that 75 us.

The design_top.v file is a historical artifact.  In VHDL form, it was
the debugged top of the design for the first generation hardware.
The Verilog form you see here has never been debugged, and is
superseded by adctest.v.

I have "issues" with the typical ASIC-oriented Verilog convention of
providing a global reset for every flip-flop:

    always @(posedge clk or posedge reset) begin
        if (reset) begin
            reg1 <= 0;
            reg2 <= 0;
        end else begin
            reg1 <= expression1;
            if (condition) reg2 <= expression2;
        end
    end

In my mind, this nearly doubles the amount of language overhead to
perform real work.  A Xilinx FPGA doesn't really need this; all the
flip-flops turn on to a state specified in the configuration
bitstream, typically zero (and this state _should_ be specified by
"initial" statements, as implemented in Icarus Verilog's synth1
engine).  So, the code in front of you does not follow this
convention, although this is one windmill I don't have energy to tilt
at.  So re-writes to the stupid conventional style will happen, sooner
if someone else (who actually thinks it's proper) donates patches.

"Compile-time" configuration is provided for

  PRODUCT_REV (adctest.v)
      16 bit serial number for the firmware revision, visible from
      the host interface.
  TEST_BUFFERS (history.v)
      one 4096 long buffer for system test, instead of the multiple
      512 long decimated and single 1024 long leading edge buffers.
  GEN2_BOARD (adctest.v)
      support the second generation LBNL board.  The older board has
      a different analog output implementation and a different
      complement of low-speed peripherals.  Support for the first
      generation board is obviously broken at this time.
  FEEDBACK (adctest.v)
      otherwise a simple test pattern generator (dac_testpattern.v)
      drives the output.
  FEEDFORWARD (adctest.v)
      otherwise zeros are substituted; you can't enable this option
      and TEST_BUFFERS simultaneously on the XC2S150, or you run out
      of Block RAM.
  UNADJUSTABLE (dds.v)
      run the DDS at 10 MHz only.

The binary configuration options are rendered in the status register,
so host software can determine which configuration of firmware it has
to work with.
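
As an illustration of how binary compile-time options could be
rendered into status bits, here is a minimal sketch in Verilog-1995.
The macro names are the ones listed above; the module name
config_flags, the port name, and the bit ordering are assumptions made
for this example only.  The real bit assignments are in adctest.v and
the "register-map" file.

```verilog
// Hypothetical sketch, not taken from adctest.v.
// Normally these macros would come from the build (see Makefile);
// they are defined here only to keep the example self-contained.
`define GEN2_BOARD
`define FEEDBACK

module config_flags(flags);
output [4:0] flags;

// Each option becomes a constant wire: 1 if the macro is defined.
`ifdef GEN2_BOARD
wire gen2_board = 1'b1;
`else
wire gen2_board = 1'b0;
`endif

`ifdef FEEDBACK
wire feedback = 1'b1;
`else
wire feedback = 1'b0;
`endif

`ifdef FEEDFORWARD
wire feedforward = 1'b1;
`else
wire feedforward = 1'b0;
`endif

`ifdef TEST_BUFFERS
wire test_buffers = 1'b1;
`else
wire test_buffers = 1'b0;
`endif

`ifdef UNADJUSTABLE
wire unadjustable = 1'b1;
`else
wire unadjustable = 1'b0;
`endif

// Assumed bit order; bit 0 is gen2_board.
assign flags = {unadjustable, test_buffers, feedforward,
                feedback, gen2_board};
endmodule

// Trivial testbench that prints the resulting flag bits.
module config_flags_tb;
wire [4:0] flags;
config_flags dut(flags);
initial begin
    #1 $display("flags = %b", flags);
    $finish;
end
endmodule
```

With only GEN2_BOARD and FEEDBACK defined as shown, the testbench
prints "flags = 00011".  Because the flags are constants, a
synthesizer should reduce them to nothing more than tied-off bits in
the status register readback.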

Porting notes to target Virtex-II:

Virtex series chips have their RAM in the form of 4096 bit dual-port
RAM blocks, which I instantiate as, e.g., RAMB4_S4_S4.  The Virtex-II
has 18432 bit dual-port RAM blocks called, e.g., RAMB16_S9_S9.
Fortunately, the number of such blocks has not gone down, and Xilinx
provides compatibility library entries (that waste 75% of the RAM), so
initially no changes are needed.  Later on, some tedious but
conceptually simple rearrangement of RAM primitives is needed,
presumably by expanding the capabilities of bram.pl.

The kcm multipliers ought to work as given, but are stupid in the
context of the Virtex-II multiplier blocks.  I think kcm.v can be
rewritten to take advantage of those blocks, without disturbing the
interface or the rest of the code.  That approach would make it easy
to keep the code base running on both chip families.

Not yet included features; urgent or easy features are listed first.

 * averaged reference channel readout.

 * swap out the arcane multiplexed address lines in history.v for a
   real programmable boxcar averager.  Has the side effect of doubling
   the number of useful data points stored, since I-(-I) baseline
   subtraction is done on the acquisition side of the memory.

 * feedback from the klystron channel

 * filtering to suppress nearby cavity modes

 * average dissipated cavity power (actually field squared) to support
   efficient SRF heater control.  Tricky to do in a small number of
   cells.

 * wide 40 MHz cycle counter, can be used by the host as a time base
   that has zero long term drift with respect to the rest of the
   accelerator site.  Only useful if we patch the operating system
   kernel to use it instead of a more conventional time source.
   Divide by 40 prescaler in the 40 MHz clock domain, cross over to
   25 MHz host clock, 20 bit counter for microseconds, 4-bit latch to
   keep read sequences atomic, host keeps track of the seconds by
   updating state at the normal, nominal 100 Hz.
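
The cycle counter item above is concrete enough to sketch.  What
follows is one possible reading of it in Verilog-1995, not an
implemented or tested design.  The module name usec_counter, all port
names, the toggle-plus-synchronizer crossing, and the assumption that
the host reads a 16-bit low word first (which latches the upper 4
bits) are mine, not from the existing code.  Registers are set up with
"initial" statements rather than a reset, in keeping with the style
discussed earlier.

```verilog
`timescale 100ps / 100ps

// Hypothetical sketch of the proposed microsecond time base.
module usec_counter(clk40, clk25, read_lo, data_lo, data_hi);
input clk40;            // 40 MHz dataflow clock
input clk25;            // 25 MHz host clock
input read_lo;          // assumed host strobe: low word is being read
output [15:0] data_lo;  // live low 16 bits of the count
output [3:0] data_hi;   // upper 4 bits, frozen at the low-word read

// Divide-by-40 prescaler: tick toggles once per microsecond.
reg [5:0] pre;
reg tick;
initial begin pre = 0; tick = 0; end
always @(posedge clk40) begin
    if (pre == 6'd39) begin
        pre <= 0;
        tick <= ~tick;
    end else
        pre <= pre + 1;
end

// Host domain: two flops synchronize tick, a third detects its edges.
reg [2:0] sync;
reg [19:0] usec;
reg [3:0] hi_latch;
initial begin sync = 0; usec = 0; hi_latch = 0; end
wire step = sync[2] ^ sync[1];
always @(posedge clk25) begin
    sync <= {sync[1:0], tick};
    if (step) usec <= usec + 1;
    // Capture the high bits on the same edge that the low word is
    // read, so the two halves always belong to the same count.
    if (read_lo) hi_latch <= usec[19:16];
end

assign data_lo = usec[15:0];
assign data_hi = hi_latch;
endmodule

// Minimal testbench: run both clocks for about 100 us and print.
module usec_counter_tb;
reg clk40, clk25, read_lo;
wire [15:0] lo;
wire [3:0] hi;
usec_counter dut(clk40, clk25, read_lo, lo, hi);
initial begin clk40 = 0; clk25 = 0; read_lo = 0; end
always #125 clk40 = ~clk40;   // 25 ns period
always #200 clk25 = ~clk25;   // 40 ns period
initial begin
    #1000000;                 // 100 us of simulated time
    $display("low word after about 100 us: %d", lo);
    $finish;
end
endmodule
```

Only a single bit (tick) crosses between clock domains, and it changes
once per 25 host clocks, so a plain two-flop synchronizer is adequate
here.  A 20-bit microsecond count wraps after about 1.05 seconds,
which is why the host still has to keep track of the seconds itself.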