\documentclass{JAC2001}

%%
%%  Use \documentclass[boxit]{JAC2001}
%%  to draw a frame with the correct margins on the output.
%%
%%  Use \documentclass[acus]{JAC2001}
%%  for US letter paper layout
%%

\usepackage{graphicx}

% We make @ signs act like letters, temporarily, to avoid conflict
% between user names and internal control sequences of plain format.
\catcode`@=11

\def\ialign{\everycr{}\tabskip\z@skip\halign} % initialized \halign
\def\eqalign#1{\null\,\vcenter{\openup1\jot \m@th
  \ialign{\strut\hfil$\displaystyle{##}$&$\displaystyle{{}##}$\hfil
     \crcr#1\crcr}}\,}

\catcode`@=12 % at signs are no longer letters

\def\trademark{$^{\rm TM}$}
%%
%%   VARIABLE HEIGHT FOR THE TITLE BOX (default 35mm)
%%

\setlength{\titleblockheight}{25mm}

\begin{document}
\title{EMBEDDED NETWORKED FRONT ENDS - BEYOND THE CRATE\thanks{%
This work is supported by the Director, Office of Science, Office of
Basic Energy Sciences, of the U.S. Department of Energy under
Contract No. DE-AC03-76SF00098. The SNS project is being carried
out by a collaboration of six US Laboratories: Argonne National
Laboratory (ANL), Brookhaven National Laboratory (BNL), Thomas
Jefferson National Accelerator Facility (TJNAF), Los Alamos National
Laboratory (LANL), E. O. Lawrence Berkeley National Laboratory
(LBNL), and Oak Ridge National Laboratory (ORNL). SNS is
managed by UT-Battelle, LLC, under contract DE-AC05-00OR22725
for the U.S. Department of Energy.}}

\author{Lawrence R. Doolittle, LBNL, Berkeley, CA 94720, USA}

\maketitle

\begin{abstract}
The inexorable march of Moore's Law lets engineers
produce front end equipment with capabilities and complexity
unimaginable only a few years ago.  The traditional standardized
crate, populated with off-the-shelf general-purpose cards, is ill
suited to the next level of integration and miniaturization.  We
have reached the stage where the network protocol engine and digital
signal processing can, and should, directly adjoin the analog/digital
converters and the hardware that they monitor and control.

The current generation of Field Programmable Gate Arrays (FPGAs)
is an enabling technology, providing flexible and customizable
hard-real-time interfacing at the downloadable firmware level,
instead of the connector level.  By moving in the direction of a
system-on-a-chip, improvements are seen in parts counts, reliability,
power dissipation, and latency.

This paper will discuss the current state-of-the-art in embedded,
networked front end controllers, and gauge the direction of and
prospects for future development.
\end{abstract}

\section{THE CRATE AGE}

For decades, CAMAC\cite{camac} and VME\cite{vme} crates
have formed the basis for new
designs of accelerator front end equipment.  These designs still
make a certain amount of sense when 100\% of the desired
functionality can be assembled using off-the-shelf boards.

Crates have their origin in the times when no single board had,
or could have, enough interface gear to run a piece of equipment.
It was reasonable to line cards up
in a crate to get enough digital inputs, digital outputs, analog
inputs, and analog outputs to meet the needs of a system.

As a natural consequence of Moore's Law\cite{moores-law},
the amount of functionality available on a board has risen progressively.
Last year's system fits in today's crate, last year's crate fits on
today's circuit board, and last year's circuit board fits on today's chip.

A side effect of this progression is that, for the fixed form factor of
a crate,
the number of connection points to a board is larger, so more
wires are involved.  To justify having the large cost overhead
of a crate, people are motivated to ``fill it up.''  This in turn
leads to unmaintainably large cable bundles leading between each crate
and patch panels that act as antennas for crosstalk.  The cables and
patch panels are inevitably hand-wired and not self-checkable.
Worse, from a software perspective, is that unrelated systems are
often grouped in one control computer, aggravating problems of coordination.

\section{NETWORKED FRONT END CONCEPT}

Developing circuit boards
(including assembly, debugging, and calibration) is
always more expensive than buying a ready-made board.
Hand-wired transition assemblies between the connectors
provided on ready-made boards and those on the equipment
that needs interfacing, however, are even more expensive and
notoriously unreliable.  People therefore put such transition
and signal conditioning equipment on circuit boards.  From there,
a slippery slope begins: the extra effort to add Analog/Digital
conversion chips to the board is fairly small, and places complete
control over the analog system performance in the hands of the
engineer.  The resulting large number of wires
between the conversion chips and the control system logic, which would
otherwise be difficult to test, can be
managed by connecting them directly to an FPGA.  Finally,
computer gear is sufficiently small and well understood that
it makes more sense to think of the computer as an add-on to
the custom hardware, than {\it vice versa}.
Each integration step reduces the number of connectors,
a perennial weak link in accelerator reliability.
It also reduces the number of unrelated clock domains.

\begin{figure}[htb]
\centering
\includegraphics*[width=0.9\hsize]{familiar.eps}
\caption{Familiar block diagram.}
\label{familiar}
\end{figure}

The familiar block diagram of figure~\ref{familiar} represents in most general
form the resulting structure of modern control hardware.  The FPGA provides
a consistent (and small) latency digital feedback path between
the ADCs (analog to digital converters) and DACs (digital to analog converters).
Different applications have varying requirements
for the speed, resolution, and channel count of ADC and DAC hardware.

While analog electronics has not shrunk as dramatically as digital,
it has proved possible in many cases to simplify the analog signal
path by pushing functionality into the digital domain\cite{dsp-instrumentation}.
This is an important step in bringing down the total hardware complexity,
since the digital processing involves no additional chips.

Low speed housekeeping hardware normally involves at least a
multi-channel ADC for power supply monitoring (including the
current drawn by the FPGA core), plus temperature
and electronic serial number.  Communication between the FPGA
and such housekeeping hardware normally takes place over bit-serial
interfaces such as SPI\trademark\cite{spi},
I$^2$C\trademark\cite{i2c}, or 1-Wire\trademark\cite{1-wire}.
While some might consider such housekeeping a frill, it has great value
in operating, remotely troubleshooting, and maintaining these devices.
The parts cost and board area required are minimal.

It is worth remembering why designs normally include both a
CPU (central processing unit) and an FPGA.  While both offer
programmable functionality, each has strengths and weaknesses:

\smallskip
\noindent{\bf Computers}
\begin{Itemize}
\item   multiple sources, very competitive market
\item   usually needs additional glue logic
\item   usually good throughput, but unpredictable latency
\item   widely understood, everybody thinks they know how to program one
\end{Itemize}

\noindent{\bf FPGA}
\begin{Itemize}
\item   chip design strongly protected by patents, two vendors dominate
\item   handles glue logic very well
\item   guaranteed latency normally designed-in
\item   reputation for being difficult to program
\end{Itemize}

\section{NETWORKED FRONT END EXAMPLES}

The following examples show single-purpose devices that place information
from, or control of, a device onto the Internet.
They each represent a variation on the theme shown in
figure~\ref{familiar}.

\subsection{Accelerator LLRF control}

The SNS Interim Low Level RF system\cite{sns-llrf} is made up of
connectorized RF plumbing, a custom circuit board with 4 $\times$ 40\thinspace MS/s
12-bit ADCs, a 12-bit 80\thinspace MS/s DAC, a Xilinx XC2S150 FPGA,
and a plug-on 200 MIPS single board computer.
The assembly is 2U high in a 19-inch rack, and
runs an EPICS server to provide network control over 100\thinspace Mb/s Ethernet.
All the power supplies are linear, to minimize electrical noise in the
chassis.
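
As a rough illustration of the data reduction such a board performs,
the raw digitizer output can be compared with the network link.
The estimate below uses only the figures quoted above, and assumes
(this is not stated in the text) that each sample occupies exactly
12 bits with no framing overhead:
\[
R_{\rm ADC} = 4 \times 40\,{\rm MS/s} \times 12\,{\rm bit}
            = 1.92\,{\rm Gb/s}.
\]
Under that assumption the ADCs alone produce about 19 times what a
100\thinspace Mb/s (Fast Ethernet) link can carry, so the FPGA has to
filter, decimate, or otherwise condense the data before it reaches the
network.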

\begin{figure}[htb]
\centering
\includegraphics*[width=0.9\hsize]{chassis_top_square.eps}
\caption{SNS LLRF Chassis top view.}
\label{llrf-picture}
\end{figure}

\subsection{Accelerator BPM readout}

The SNS Beam Position Monitor system\cite{sns-bpm}
is made up of 4 $\times$ 40\thinspace MS/s 14-bit ADCs, 256K-deep FIFOs,
and a direct connection to a PCI bus.  The chassis is a commercial
1U rack-mounted PC.  A Quicklogic CPLD provides the PCI interface
and custom interface logic.
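
For a sense of the record length this buffering allows, assume
(the text does not say) that ``256K'' means $2^{18}$ samples per
channel and that each FIFO fills at the full ADC rate:
\[
T = {2^{18}\,{\rm samples} \over 40\,{\rm MS/s}} \approx 6.6\,{\rm ms}.
\]
A slower fill rate, for example after decimation, would lengthen the
record proportionally.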

% http://warrior.lbl.gov:7778/pacfiles/papers/WEDNESDAY/PM_POSTER
\begin{figure}[htb]
\centering
\includegraphics*[width=0.9\hsize]{bpm-pci.eps}
\caption{SNS BPM PCI card. Digitizers are in the center, downconversion
mixers and filters are to the right.}
\label{bpm-picture}
\end{figure}


\subsection{Network Camera}

The Elphel Model 303 High Speed Gated Intensified Camera\cite{network-camera}
is a nice example of compact electronics connecting all the way from
sensor array to Ethernet.  A Xilinx XC2S300E FPGA performs (among other things)
image compression from raw pixels to JPEG format.
An ETRAX 32-bit CPU bridges the data to Ethernet,
at a sustained rate of 15 frames (1.3 megapixels each) per second.
This camera is powered by 48VDC through the Ethernet cable, compliant
with the IEEE 802.3af standard. 
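
The need for on-board compression can be seen from a simple budget.
The pixel rate follows from the figures above; the link speed of
100\thinspace Mb/s is an assumption made here for illustration, since
it is not given in the text:
\[\eqalign{
R_{\rm pixel} &= 15\,{\rm s^{-1}} \times 1.3\,{\rm Mpixel}
               = 19.5\,{\rm Mpixel/s},\cr
b_{\rm max} &\approx {100\,{\rm Mb/s} \over 19.5\,{\rm Mpixel/s}}
             \approx 5.1\,{\rm bit/pixel}.\cr}
\]
Under that assumption, and before any protocol overhead, fewer than
about 5 bits per pixel are available on the wire, which raw sensor data
would normally exceed.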

\begin{figure}[htb]
\centering
\includegraphics*[width=1.0\hsize]{cam_mcp_color.eps}
\caption{Camera assembly including CCD, acquisition, and networked computer}
\label{camera-picture}
\end{figure}

\subsection{Home Audio}

The LANPipe
\cite{network-audio} is a simple network to audio bridge device.
It includes a 16-bit CPU implemented on the XC2S30 FPGA, which
in turn runs a simple UDP/IP network stack.  The Ethernet
chip includes both the PHY (physical layer) and MAC (media access control) functions.
% MAC == Media Access Control

\begin{figure}[htb]
\centering
\includegraphics*[width=0.8\hsize]{lanpipe.eps}
\caption{Annotated photo of LANPipe circuit board}
\label{lanpipe-picture}
\end{figure}


\section{BACK END IMPLICATIONS}

As the front end functionality of a large installation is
subdivided more finely, more network cables run back to the
Global Control System.  There is legitimate concern about whether
network architectures are available to support this.

An example of a modern large-scale networked computing
cluster is the University of Kentucky KASY0\cite{kasy0}, which
consists of 128 2.0\thinspace GHz Athlons.  Its network
has been demonstrated adequate to run software that
requires a high degree of coordination between processors.
With some additional network gear (64 $\times$ 24-port switches),
the KASY0 could provide 1408 network plugs (100BaseT) for under
US\$33/plug.  The hypothesized control system front ends
would then be backed up with 400 GFLOPs and 1.5 GByte/second
data throughput.
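
The arithmetic behind these figures can be sketched as follows.
The port count comes directly from the switch numbers quoted above:
\[\eqalign{
N_{\rm ports} &= 64 \times 24 = 1536,\cr
N_{\rm ports} - 1408 &= 128.\cr}
\]
The 128 ports left over match the number of cluster nodes, which
suggests (as an assumption, not something stated in the text) that one
switch port per node is reserved to connect the cluster itself.
Assuming further that cost, computing power, and throughput are shared
evenly among the plugs, each front end sees roughly
\[\eqalign{
\hbox{cost} &< 1408 \times \hbox{US\$}33 = \hbox{US\$}46\,464
   \hbox{ in total},\cr
\hbox{compute} &\approx 400\,{\rm GFLOPS}/1408
   \approx 0.28\,{\rm GFLOPS},\cr
\hbox{throughput} &\approx 1.5\,{\rm GB/s}/1408
   \approx 1.1\,{\rm MB/s}.\cr}
\]
Real traffic is unlikely to be this uniform, so these are averages
rather than guarantees.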

\section{SYSTEM ON A CHIP}

System on a chip (SOC) is normally a code word for combining the application specific
logic with a host processor on a single chip.  For technology reasons,
DRAM and analog circuitry are almost never included in this chip.
Both ASICs (Application
Specific Integrated Circuits) and FPGAs can be used in SOC mode.
Of course, the ASIC approach is only considered relevant when the
production volume is large, greater than 100,000.

Besides potential system cost and space savings, a prime
motivation for SOC architecture is to improve the interconnect
between the high speed application logic and the host CPU.
They are normally on different clock domains, and the
bus interfaces to the CPU core are normally very much
slower than the processor core.  Typical numbers are
200-500 ns to transfer 16-32 bits.  Compared to the speed
of both the processor core (200-2000 MHz) and the FPGA plane
(40-200 MHz), this is a serious bottleneck.
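
The size of the bottleneck follows from the numbers just quoted.
The bus transfer rate spans
\[\eqalign{
R_{\rm min} &= 16\,{\rm bit}/500\,{\rm ns} = 32\,{\rm Mb/s},\cr
R_{\rm max} &= 32\,{\rm bit}/200\,{\rm ns} = 160\,{\rm Mb/s}.\cr}
\]
For comparison, a hypothetical 16-bit data path in the FPGA fabric
that moves one word per clock (the width is an assumption chosen for
illustration) carries
\[\eqalign{
40\,{\rm MHz} \times 16\,{\rm bit} &= 640\,{\rm Mb/s},\cr
200\,{\rm MHz} \times 16\,{\rm bit} &= 3.2\,{\rm Gb/s}.\cr}
\]
Seen from the processor side, a single bus transfer lasts between
$200\,{\rm ns} \times 200\,{\rm MHz} = 40$ and
$500\,{\rm ns} \times 2000\,{\rm MHz} = 1000$ core clock cycles.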

\begin{figure}[htb]
\centering  
\includegraphics*[width=1.0\hsize]{familiar-mod.eps}
\caption{System-on-chip adaptations of the familiar block diagram.} 
\label{familiar-mod}
\end{figure}

Figure~\ref{familiar-mod} shows two contrasting ways to adapt the familiar block
diagram of figure~\ref{familiar} to SOC.

Type A keeps the processor in the
datapath between the application hardware and the network.
This approach is normally imagined as a hardware change only,
where the existing software is moved onto the FPGA chip.
That software is an existing body of high level code, including network protocols
and a POSIX-capable operating system (like Linux).
This concept makes large demands on the CPU and memory (at least 8\thinspace MiB,
necessarily off-chip) by today's FPGA standards.

In the configuration of type B, the processor is by design kept
out of the data path between the application hardware and the network.
It is optional and (when used) reserved for local data manipulation,
where the algorithm is more complex than can plausibly be programmed
in a dedicated HDL (hardware description language) data path.

The high speed network connection is assumed to carry simple, low latency
transfers of raw data.  The conversion to high level protocols can
take place on commodity workstation-class hardware on the other
end of the network cable.

Note that there is community expertise in running network
protocols (especially Ethernet UDP/IP) in FPGA hardware without
a traditional CPU.  As wire speeds climb, having a CPU in the
data path is likely to create more problems than it solves: at Gb/s rates, even
workstation-class CPUs get overloaded, so modern high speed NICs (network
interface cards) are evolving into dedicated network co-processors.
``The cheapest, fastest and most reliable components of a computer system are
those that aren't there.''\cite{gordon-bell}

The flexibility and end-to-end integration of an FPGA-based SOC make
it plausible to use Ethernet with a hard real time mind-set that
is inconceivable using a CPU and a conventional MAC.
Frame preamble and header information can be sent down the
wire while results are still being acquired from the hardware.
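
The time this buys can be estimated from the frame layout.
The sketch below assumes 100\thinspace Mb/s Ethernet (80\thinspace ns
per byte) and an untagged IPv4/UDP frame without IP options; these
choices are illustrative and are not taken from a specific design
described here:
\[\eqalign{
t_{\rm pre} &= 8 \times 80\,{\rm ns} = 0.64\,\mu{\rm s},\cr
t_{\rm MAC} &= 14 \times 80\,{\rm ns} = 1.12\,\mu{\rm s},\cr
t_{\rm IP+UDP} &= (20+8) \times 80\,{\rm ns} = 2.24\,\mu{\rm s}.\cr}
\]
The preamble and start delimiter, the MAC header, and the IP and UDP
headers together occupy 50 bytes, or $4.0\,\mu{\rm s}$ on the wire,
before the first payload byte is needed.  The IPv4 total length and
header checksum and the UDP length sit in these headers, so the payload
size has to be fixed before transmission starts.  The UDP checksum,
which covers the payload, may be sent as zero (meaning unused) over
IPv4, and the Ethernet frame check sequence comes last, so it can be
accumulated while the data streams out.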

There is not necessarily any hardware difference between
approaches A and B.  Approach B is more likely to work without
the external memory chip, in part because of its simpler scope.

\subsection{Soft CPU Cores}

When the CPU is built with the FPGA fabric, just like the
rest of the chip's functionality, it is called a soft core.
Many such designs are published and/or sold, some of which
are listed here.
\smallskip
\halign{\quad#\hfil\quad&#\hfil\quad&#\hfil\quad&#\hfil\quad&#\hfil\quad&#\hfil\quad\cr
   name                         &   source    &   bits   & 4-LUTs & MHz\cr
   PicoBlaze\cite{picoblaze}    &   VHDL      &    8     &  152   & 40\cr
   SLC1657\cite{slc1657}        &   VHDL      &    8     &        &   \cr
   gr0040\cite{gr0040}          &   Verilog   &   16     &  257   & 50\cr
   xr16                         &   schem     &   16     &  392   & 65\cr
   MicroBlaze\cite{microblaze}  &   N/A       &   32     & 1050   & 75\cr
   NIOS-16\cite{nios}           &   N/A       &   16     & 1100   & 50\cr
   NIOS-32\cite{nios}           &   N/A       &   32     & 1700   & 50\cr
   LEON SPARC\cite{leon-sparc}  &   VHDL      &   32     & 4800   & 65\cr
   Aquarius\cite{aquarius}      &   Verilog   &   32     & 5506   & 21\cr
   or1k\cite{or1k}              &   VHDL      &   32     & 6000   & 33\cr}

\smallskip\noindent
None of these cores have an MMU (memory management unit).
The speeds (and, to a lesser extent,
the 4-LUT count) are only approximate since they depend on the
speed and capability of the underlying FPGA.  A `N/A' in the `source'
column indicates that the source is not published, limiting the
core's utility in a research context.

The advantages of a soft core are more competition, variety,
and adaptability to the actual problem at hand.

\subsection{Hard CPU Cores}

When the CPU is built by the chip manufacturer on the same
die as the FPGA fabric, it is called a hard core.
\smallskip
\halign{\quad#\hfil\quad&#\hfil\quad&#\hfil\quad&#\hfil\quad\cr
   CPU core &  chip                & bits & MHz \cr
   PowerPC  &  Xilinx Virtex-IIpro &  32  & 250 \cr
   ARM9     &  Altera Excalibur    &  32  & 200 \cr
   80C51    &  Triscend            &   8  &     \cr}
\smallskip
\noindent
The first two of these designs include an MMU.
Although less customizable, these cores have theoretically 
better cost/performance and speed-power product than a soft core.
In today's FPGA generation, hard cores with external SDRAM are
probably required for type A SOC implementations.


\section{NETWORK CHOICES}

At some point in the chain from hardware to operator,
standards (as published by sanctioned standards bodies)
are essential for communication between hardware
built by different people at different times.

There are many historical standards for parallel bus
attachments of peripherals to computers:
  CAMAC (IEEE-583),
  VME (IEEE-1014),
  VXI,
  GPIB (IEEE-488),
  SBus (IEEE-1496),
  ISA/AT,
  ATAPI (ANSI NCITS 317-1998 and later),
  PCI/cPCI.
At the time of this writing, all are considered obsolete or dying,
in many cases explicitly replaced with a serial equivalent.
PCI sees extremely wide use, but is also very political, and
many commercial interests appear eager to upgrade or replace it soon.
Modern serial buses include
  Ethernet (IEEE-802.3),
  Firewire (IEEE-1394),
  Fibre Channel,
  USB,
  CAN (ISO 11898),
  SATA, and
  ATM.

Ethernet is both the oldest and most vibrant.  It is in the heart
of the wireless storm.  Power Over Ethernet\cite{power-over-ethernet},
which provides up to 13\thinspace W for peripherals over the same
CAT5 cable as the network, is just taking off.  Fiber and twisted
pair transmission speeds are set for another jump in speed and/or
availability.  It's very hard to imagine any difficulty connecting
Ethernet-based gear to the Internet anytime in the next two decades.
The same cannot be said about {\it any} of the other listed protocols.

With ubiquitous CAT5 cable, 100BaseTX and 1000BaseT Ethernet will
reach 100\thinspace m.  On a fiber physical layer, 100BaseFX in
full-duplex mode will reach 2000\thinspace m, and 1000Base-LX on
a single mode fiber will reach 3000\thinspace m\cite{ethernet-lengths}.

While not normally thought of as a hard-real-time link, point-to-point
Ethernet does have deterministic latency.  Direct links between networked
front ends could take advantage of that to implement wide-area feedback
and interlocks.
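
A rough latency budget for such a direct link can be sketched as
follows.  It assumes 100\thinspace Mb/s Ethernet, a minimum-length
64-byte frame preceded by 8 bytes of preamble and start delimiter, a
receiver that acts only after the complete frame has arrived and its
check sequence has been verified, and a propagation delay of about
5\thinspace ns/m, which is typical of twisted pair cable.  None of
these values come from a measured system:
\[\eqalign{
t_{\rm frame} &= (8+64) \times 80\,{\rm ns} = 5.76\,\mu{\rm s},\cr
t_{\rm cable} &\approx 100\,{\rm m} \times 5\,{\rm ns/m}
               = 0.5\,\mu{\rm s}.\cr}
\]
The sum, about $6.3\,\mu{\rm s}$ over a full-length 100\thinspace m
cable, excludes the fixed delays of the PHY and MAC at each end, which
are not estimated here.  Every term is constant for a given link,
which is what makes the latency deterministic.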

\section{FIELD PROGRAMMING}

FPGAs are an enabling technology.  Their reconfigurability is
an essential feature, allowing bugs to be fixed and features
to be added to the hardware at a later date.
This flexibility comes with a hardware price: some means
of ``booting'' or ``configuring'' the FPGA must be included,
and (to avoid losing the very feature that is so attractive)
a mechanism must be included to make that configuration
remotely updatable.  When a conventional networked computer
is part of the equation, the solution can be relatively easy:
connect four JTAG leads to the computer's general purpose
port, and have the FPGA activate only after the computer
goes on line.  This avoids dedicated Flash memory chips and
all other hardware and software complexities.  Normal
software configuration control can place new FPGA configurations
on a network server, where they will take effect on the next chassis
reset or power cycle.

When interlocks are implemented with an FPGA, the equation changes:
it has to be treated as a non-programmable device, and changes
in functionality have to be accomplished by returning a unit
to the bench, reprogramming it with specialized hardware, and
re-testing it before returning it to the field.

When the host computer is inside the FPGA, the configuration step
involves a chicken-and-egg
problem: the very computer (or computer-free network stack) that is
needed to download FPGA configuration is implemented in the very
hardware that needs configuration!
FPGA vendors provide small and expensive Flash chips that
can self-configure an FPGA at power up, but the infrastructure
to reprogram these and restart the board (while leaving fail-safe
options in place) is not easily understood.
Since the chip count of an FPGA-based control board is normally
very low to begin with, it seems imbalanced to add complex and
fragile boot hardware.

\section{CONCLUSIONS}

Networked front end hardware has tremendous opportunity to
make accelerator electronics simpler, cheaper, more featureful,
better understood, and more reliable.  By distributing the
hardware closer to the gear it controls, field wiring becomes
quieter and more maintainable.
Standardized high speed network communication between front end
modules and the global control system maximizes short and long
term flexibility, and minimizes installation costs.

Since so much of the intellectual content of the
devices will reside in their programming, it is appropriate to suggest
widespread Internet-based collaboration within the community, as
exists now in the SNS project and the EPICS collaboration.
This ``many eyeballs'' approach can drive up quality and drive
down costs compared with the ``lock it in the desk drawer'' approach.

Many more changes are on the horizon.  Even with the demise
of Moore's Law looming in the next few years, imaginative
applications of programmable digital circuitry will continue to
enhance the performance and capabilities of front end hardware.

\section{ACKNOWLEDGEMENTS}

The author would like to thank the entire SNS team, especially the
controls, instrumentation, and LLRF groups, for their help
as we all feel our way into the future.

%\begin{thebibliography}{9}   % Use for  1-9  references
\begin{thebibliography}{99} % Use for 10-99 references

\bibitem{camac}
IEEE-583-1982,
Standard Modular Instrumentation and Digital Interface System.

\bibitem{vme}
IEEE-1014, Standard for A Versatile Backplane Bus: VMEbus
% http://www.vita.com/

\bibitem{moores-law}
Gordon E. Moore, Cramming more components onto integrated circuits,
Electronics, Volume 38, Number 8, April 19, 1965.

\bibitem{dsp-instrumentation}
Digital Signal Processing in Beam Instrumentation,
M. E. Angoletta, DIPAC 2003, GSI, Mainz, Germany
% http://bel.gsi.de/dipac2003/papers/IT07.pdf

\bibitem{spi}
SPI is a trademark of Motorola, Inc.
M68HC11 Reference Manual
% http://www.epanorama.net/links/serialbus.html

\bibitem{i2c}
I$^2$C is a trademark of Philips Electronics N.V.
http://www.semiconductors.philips.com/buses/i2c/

\bibitem{1-wire}
1-Wire is a trademark of Maxim.\goodbreak
http://www.maxim-ic.com/1-Wire.cfm

\bibitem{kasy0}
http://aggregate.org/KASY0/

\bibitem{sns-llrf}
http://recycle.lbl.gov/~ldoolitt/llrf/

\bibitem{sns-bpm}
John Power, Beam Position Monitor Systems for the SNS LINAC, PAC2003

\bibitem{network-camera}
http://www.linuxdevices.com/articles/AT2441343146.html

\bibitem{network-audio}
http://www.thepowleys.com/lanpipe/index.php

\bibitem{gordon-bell}
Gordon Bell, DEC laboratories

\bibitem{picoblaze}
http://www.xilinx.com/ipcenter/-processor\_central/picoblaze/index.htm

\bibitem{slc1657}
http://www.silicore.net/pr090903.htm

\bibitem{gr0040}
http://www.fpgacpu.org/gr/index.html

\bibitem{microblaze}
http://www.xilinx.com/xlnx/-xil\_prodcat\_product.jsp?title=microblaze

\bibitem{nios}
http://www.altera.com/products/devices/nios/

\bibitem{leon-sparc}
http://www.gaisler.com/

\bibitem{aquarius}
http://www.opencores.org/projects/aquarius/

\bibitem{or1k}
http://www.opencores.org/projects/or1k/

\bibitem{power-over-ethernet}
http://www.poweroverethernet.com/

\bibitem{ethernet-lengths}
http://www.dslreports.com/faq/7800

\end{thebibliography}

\end{document}