Introduction: Portable Game Console (GPU Team)


For our engineering project, our tutors wanted us to face the challenges of designing a real-time system with relatively high performance on limited resources (memory, bandwidth).

The specifications require a gaming platform using the following hardware:

  • a Digilent Nexys 3 board (for implementing a GPU on the FPGA),
  • a Keil MCBSTM32F400 board (for hosting the OS of the platform and storing the game data),
  • a DisplayTech DT035TFT LCD with a Novatek NT39016 driver (portable true colour display).

Two teams of two students are working on this project: one team focuses on the ARM MCU and the other on the GPU.


The platform has to match the performance of a 16-bit commercial gaming platform such as the SNES or the Sega Mega Drive, with multilayer frames and scrolling. The platform consists of two main components: the MCU of the motherboard and the GPU connected to the video output.

  • The MCU-specific requirements are a graphics API for the GPU, an audio API for the onboard audio codec, user I/O, the MCU/GPU interface, an SD card interface and the programming of the video game. A module for configuring the LCD screen (brightness, contrast, etc.), which sits inside the GPU, is also part of this work.
  • The GPU-specific requirements are a multilayer display, blending of different layers using transparency, 16-bit RGBA colours, multilayer scrolling, basic 2D operations (bit-blit (copy), colour fill, transparency modification and their combinations (clear, move, etc.)), primitive generation (lines, circles, text), LCD and VGA video outputs, and a graphics-oriented memory controller with DMA access.

Implementation Plan

The two teams will need to collaborate regularly to develop the two main components previously mentioned. We have designed the overall architecture of the platform to ensure this (see the first step).

The graphics team will start by implementing the video display HDL modules and testing the video outputs using static video data synthesised on the FPGA. This will be followed by the implementation and integration of the memory controller along with the frame buffer, in order to display data stored in the video RAM.

Meanwhile, the instruction bus and the register fetch and decode unit will be implemented in order to provide access to the display registers.

The block processing unit will then be implemented to provide basic 2D operations on the video data in the RAM. After the preliminary integration stage with the motherboard team, these functionalities will be tested and debugged.

The team will then proceed to the implementation of the DMA controller, in order to allow the MCU to transfer video data from its SD card (instead of using the Digilent Adept tool). Finally, the remaining HDL modules, such as the primitive generator, will be implemented.

Step 1: A Flexible Overall Architecture

Our team designed the overall architecture of the GPU.

We started with a literature review of commercial and academic designs used for 2D graphics acceleration.

The points of focus were:

  • how should the video data be transferred to the display output?
  • how can different image processing modules operate in parallel without corrupting each other's data, and how should access to the video data be managed?
  • how should the MCU configure the different modules of the GPU?
  • which pixel format should be selected in order to meet the requirements?

We have chosen an architecture that is both generic and flexible, leaving room for further improvements to the project and allowing us to easily add or remove modules. The architecture shown in the image is inspired by a few existing ones, from which we kept the aspects that seemed useful for our specifications.

In this architecture, the shared video memory bus and the unifying register bus are the main components that make the GPU easy to change.

To summarize the roles of the different modules: the MCU Interface (MCU team) allows the STM32 to write into the registers of several modules through a single-master, multiple-slave Instruction Bus. The written data can configure different aspects of the GPU or launch an image processing operation. Among those modules is the Video Display Controller (VDC), which provides the right synchronisation signals for either the VGA or the LCD output; the Frame Buffer is also synchronised to this module.

The Frame Buffer (FB) is responsible for fetching the lines to be displayed from memory, for applying blending and scrolling and, most importantly, for providing the correct RGB data at the right moment (synchronised with the VDC). The line fetching is done through a graphics-optimised memory bus provided by the RAM Controller. The pixel size and display resolution are chosen to allow the FB to fetch four lines in a single horizontal blanking period (detailed later), leaving the longer vertical blanking period and the display periods for image processing operations.

This controller provides a priority-based shared memory bus that is used by all modules that require access to the RAM. Among those are the Block Processing Unit, which operates on rectangular image portions, the Primitive Generator Unit, which generates geometric figures at a specified destination, and the DMA Controller, which provides a way to quickly transfer image data to the on-board RAM.

Finally, the LCD Configuration Unit (MCU team) is used for making SPI data transfers into the LCD Controller’s internal registers; these registers can be altered to set the brightness, contrast and many other features of the LCD display.

With this structure, we managed to provide the following main functionalities:

  • 16-bit RGBA (Red Green Blue Alpha) pixels displayed on a 24-bit LCD or on an 8-bit VGA output,
  • Frame Buffer supporting up to 4 independent, scrollable display layers, configurable in size,
  • Fully customisable 16 MB RAM: display layer and sprite addresses are user-defined,
  • SRAM-like interface with the STM32 MCU thanks to an FSMC address translator,
  • Register Bus for easily configuring different modules from an MCU,
  • BPU: bit-blit, fill, clear, alpha component modification,
  • PGU: primitive generator providing pixels, lines, circles and text characters,
  • DMA: quick data transfer towards the RAM in order to load textures, sprites, etc.,
  • LCD: configurable brightness, contrast and colour filtering through an SPI bus.

Step 2: Video Display Controller

Video display requires very precisely timed synchronisation signals along with the RGB data. As our specifications require the GPU to display primarily on the LCD and, if possible, on a VGA port, we need two different signal generators for these sinks. Their synchronisation signals are multiplexed towards the frame buffer, and the frame buffer’s RGB output is then demultiplexed towards the selected sink.

The frame buffer module has to be synchronised with these video signals and provide the RGB data produced by the different display planes of the RAM.

The architecture above fulfills these specifications. The VideoClockGenerator provides the required 6.4 MHz clock for the LCD Display Controller and the 25 MHz clock for the VGA Controller. Both controllers provide the row and column numbers needed to synchronise the frame buffer, and both produce a display with a refresh rate of 60 Hz at 320x240 resolution (QVGA).
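As a rough sanity check of these clock figures, the refresh rate can be recomputed from the pixel clock and the total frame size (visible area plus blanking). The sketch below is only an illustration: the 800x525 total is the usual 640x480 VGA timing, while the 408x262 total for the LCD is an assumed value used here for the example and should be checked against the NT39016 datasheet.

```python
def refresh_rate_hz(pixel_clock_hz, total_cols, total_rows):
    """Frames per second for a pixel clock and a total frame size (visible + blanking)."""
    return pixel_clock_hz / (total_cols * total_rows)


# Standard 640x480 VGA timing uses 800 clocks per line and 525 lines per frame.
vga = refresh_rate_hz(25_000_000, 800, 525)

# Assumed LCD totals (hypothetical, for illustration only).
lcd = refresh_rate_hz(6_400_000, 408, 262)

print(f"VGA refresh rate: {vga:.2f} Hz")
print(f"LCD refresh rate: {lcd:.2f} Hz")
```

Under these assumptions both outputs land just below 60 Hz, which is consistent with the figure quoted above.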

Feasibility of line fetching

For hardware simplicity we decided to use a single buffer per display plane in the frame buffer, which requires us to load pixels during the horizontal blanking time. By analysing the timing requirements and the bandwidth of the memory, we can find the theoretical limit on the number of horizontal lines (from different display planes) we can fetch.

For the LCD screen we find that this limit is 4 planes, provided that fetching of the next line starts when the VDC outputs the 297th pixel of the current line.

For the VGA we find that only 3 planes are feasible if the VDC issues a reload at the 285th pixel of each odd line (QVGA is implemented by using 640x480 resolution timings with each row and column being doubled).
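The row and column doubling used for VGA can be sketched as a simple coordinate mapping. This is a Python illustration, not the actual HDL; the function names are hypothetical, and counting rows from 0 (so that the odd row is the second copy of a doubled line) is an assumption.

```python
def vga_to_qvga(vga_col, vga_row):
    """Map a 640x480 VGA position to the 320x240 source pixel it displays."""
    # Dropping the least significant bit doubles every column and every row.
    return vga_col >> 1, vga_row >> 1


def is_reload_row(vga_row):
    """True on the second copy of a doubled line, when the next line can be fetched.

    Assumes rows are counted from 0, so the second copy is the odd row.
    """
    return vga_row % 2 == 1


print(vga_to_qvga(639, 479))
print(is_reload_row(0), is_reload_row(1))
```

Because each source line is shown twice, the frame buffer only needs a new line every other VGA row, which is why the reload is tied to odd lines.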

In a future version, we will switch to double-buffering to ease these specific constraints on line fetching.


This demo shows the basic operation of the VDC. A fixed RGB image is generated in the Frame Buffer, and the VDC synchronises the Frame Buffer's RGB output with its video signals depending on whether the LCD or the VGA is used.

Note that the LCD has a 24-bit RGB interface, but because of our chosen colour format we display 16-bit colours. The VGA connector's DAC on the Nexys 3 allows only 8-bit colours, so there is a noticeable loss of quality on VGA.
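To illustrate the colour depth difference, here is a small Python sketch of how one 16-bit pixel could be expanded for the 24-bit LCD and reduced for the 8-bit VGA output. The bit layout is an assumption made for the example (4 bits per channel, red in the top nibble and alpha in the bottom one), as is the 3-3-2 split of the VGA bits; the text above does not specify either.

```python
def unpack_rgba4444(pixel):
    """Split a 16-bit pixel into 4-bit (r, g, b, a), assuming an RGBA 4-4-4-4 layout."""
    return (pixel >> 12) & 0xF, (pixel >> 8) & 0xF, (pixel >> 4) & 0xF, pixel & 0xF


def to_rgb888(pixel):
    """Expand each 4-bit channel to 8 bits for a 24-bit display (0xF becomes 0xFF)."""
    r, g, b, _ = unpack_rgba4444(pixel)
    return r * 17, g * 17, b * 17


def to_rgb332(pixel):
    """Reduce to 8 bits, assuming 3 red, 3 green and 2 blue bits on the VGA side."""
    r, g, b, _ = unpack_rgba4444(pixel)
    return ((r >> 1) << 5) | ((g >> 1) << 2) | (b >> 2)


orange = 0xF80F  # r=15, g=8, b=0, a=15 under the assumed layout
print(to_rgb888(orange))
print(hex(to_rgb332(orange)))
```

The VGA path discards one or two bits per channel, which is where the visible quality loss comes from.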

Step 3: Memory Controller

The RAM contains all the graphic bitmaps required by the different modules of the GPU, as well as some dummy memory zones used by some algorithms. It is crucial that all modules can access it when required without disrupting each other too much. In particular, the frame buffer needs to access the RAM periodically to load a line from each plane for display, and this timing is critical.

Specialised graphics hardware usually has video RAM (VRAM) with multiple read/write ports to ease these issues, but the 16 MB Micron Cellular RAM has only a single-port PSRAM interface and is quite troublesome because of its data refresh cycle collisions. There is therefore a clear need for a controller that eases the use of this device and makes it better suited to a graphics accelerator.

Core Functionalities

The memory controller provides the following access types:

  • asynchronous read access,
  • asynchronous write access,
  • burst read access,
  • burst write access.

These instructions are carried out by generating signals as specified in the datasheet of the Cellular RAM (see the timing diagrams).

Memory Bus and Arbitration

We have implemented a pre-emptive bus arbitration logic for the controller’s memory bus. This gives time-critical modules easy access to the data they need, but it requires all lower-priority modules to be ready to be interrupted during an access. Up to 5 external modules can be connected to this arbitration logic.

The RAM Controller’s control and data bus is shared by the following components (in order of decreasing priority):

  1. FrameBuffer,
  2. DMAController,
  3. BlockProcessingUnit,
  4. PrimitiveGeneratorUnit.
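The behaviour of a fixed-priority, pre-emptive arbiter with this ordering can be sketched in a few lines of Python. This is a behavioural illustration only, not the HDL used in the project; the function names and the per-cycle request sets are hypothetical.

```python
# Priority order from the list above, highest first.
PRIORITY = ["FrameBuffer", "DMAController", "BlockProcessingUnit", "PrimitiveGeneratorUnit"]


def grant(requests):
    """Return the highest-priority module among the current requesters, or None."""
    for name in PRIORITY:
        if name in requests:
            return name
    return None


def simulate(request_schedule):
    """For each cycle, return (bus owner, module preempted in this cycle or None)."""
    owner = None
    log = []
    for requests in request_schedule:
        new_owner = grant(requests)
        # A module is preempted when it still requests the bus but loses it.
        preempted = owner is not None and owner in requests and new_owner != owner
        log.append((new_owner, owner if preempted else None))
        owner = new_owner
    return log


schedule = [
    {"BlockProcessingUnit"},
    {"BlockProcessingUnit", "FrameBuffer"},  # frame buffer needs a line: BPU is interrupted
    {"BlockProcessingUnit"},                 # BPU gets the bus back and resumes
]
for cycle, entry in enumerate(simulate(schedule)):
    print(cycle, entry)
```

In the second cycle the frame buffer takes the bus from the BlockProcessingUnit, which has to remember where it stopped in order to resume in the third cycle.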

Each of these modules is connected to the RAM Controller’s Memory Bus and the multiplexing of the bus is done by the arbitration logic around the controller. The memory bus is made of the following signals:

  • INSTR[3:0] : memory instruction
  • DATA_I[15:0] : data from module to the RAM
  • DATA_O[15:0] : data from RAM to the module
  • ADDR[22:0] : base address of the image
  • HCNT[15:0] : horizontal offset counter
  • VCNT[15:0] : vertical offset counter
  • PICLEN[15:0] : horizontal size of the image
  • HLEN[15:0] : horizontal size of the accessed block (portion of the image to be read/written)

If INSTR is an asynchronous instruction (read or write), the HCNT, VCNT, PICLEN and HLEN signals are not used. These signals are used with burst instructions to read from or write to a line or a rectangular part of an image.

More precisely, during a correctly configured burst access, the accessed address sequence is given by the following formula:

ADDR + HCNT + VCNT × PICLEN, with HCNT: 0 to HLEN − 1, VCNT: 0 to VLEN − 1

(See the image above).
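The address sequence of a burst access can be written out as a short Python generator that follows the formula directly. It is a sketch of the addressing only, not of the controller: the function name is hypothetical and addresses are assumed to count 16-bit words. The optional starting counters show how an access could resume from the (HCNT, VCNT) position at which it was interrupted.

```python
def burst_addresses(addr, piclen, hlen, vlen, hcnt=0, vcnt=0):
    """Yield ADDR + HCNT + VCNT * PICLEN for a block, optionally resuming at (hcnt, vcnt)."""
    for v in range(vcnt, vlen):
        # Only the first (resumed) line starts at a non-zero horizontal offset.
        start = hcnt if v == vcnt else 0
        for h in range(start, hlen):
            yield addr + h + v * piclen


# A 3x2 block inside an image that is 10 pixels wide, starting at address 100.
full = list(burst_addresses(100, piclen=10, hlen=3, vlen=2))
print(full)

# The same access resumed after being interrupted at HCNT=2, VCNT=0.
resumed = list(burst_addresses(100, piclen=10, hlen=3, vlen=2, hcnt=2, vcnt=0))
print(resumed)
```

The resumed sequence is the tail of the full one, so no pixel is accessed twice or skipped.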

This admittedly complex memory bus is necessary to support the preemption of an operation carried out by the accessing module: the HCNT and VCNT counters allow modules to resume their operations if they get preempted or if a CRAM refresh is required. It also makes clipping of bitmaps quite easy, which is used extensively by the Block Processing Unit (this will be revisited in the BPU section).

However, because of this complexity, every component connected to the Memory Bus has to implement a fairly complex algorithm (a few counters and one multiplication for checking refresh collisions). We are currently working on a new solution based on buffer synchronisation to simplify the memory interface.

Step 4: Frame Buffer

This component is critical for allowing the GPU to perform accelerated operations on the image data stored in the RAM. The VDC requires image pixels to be provided at a fixed data rate, and providing those by accessing the RAM directly would overuse the RAM controller and not leave enough time for other modules to work on the image data. Therefore, we use a buffer to store the pixels of the upcoming line to display; this data is updated during the horizontal blanking period of the video display controller, after which the memory bus is freed.

Apart from this main buffering function, this module is also used for creating multiple display planes and blending them with an appropriate transparency strategy. In addition, as the frame buffer has to fetch a line as wide as the display format, the fetching address can be adjusted to accelerate functions such as scrolling. The following component is the result of our design.

Our frame buffer provides up to four independent display layers, each fully customisable in size and memory location. They are independently scrollable and can be blended to the video output with either binary transparency blending or alpha blending.
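Scrolling by adjusting the fetch address can be illustrated with a small Python sketch. The function and parameter names are hypothetical, addresses are assumed to count pixels, and the vertical wrap-around is an assumption made for the example; horizontal wrap-around is not handled, so the scrolled window is assumed to stay inside the layer.

```python
def line_fetch_address(base, layer_width, layer_height, scroll_x, scroll_y, display_row):
    """Address of the first pixel to fetch for one display row of a scrolled layer."""
    # The scroll offset selects which layer row is shown on this display row.
    row = (display_row + scroll_y) % layer_height  # vertical wrap is an assumption
    return base + row * layer_width + scroll_x


# A 640x480 layer stored at 0x1000, scrolled by 10 pixels right and 5 pixels down.
print(line_fetch_address(0x1000, 640, 480, scroll_x=10, scroll_y=5, display_row=0))
```

No pixel data has to be moved to scroll a layer: only the start address of each fetched line changes.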

Core Functionalities

The frame buffer provides the following operations:

  • fully customisable display planes (size, memory location),
  • scrolling,
  • display layer blending (binary transparency or alpha transparency).
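The two blending modes can be compared with a minimal Python sketch. It assumes 4-bit channels (values 0 to 15) and that layer 0 is the bottom layer; both are assumptions made for the illustration, and the function names are hypothetical.

```python
def blend_binary(top, bottom):
    """Binary transparency: the top pixel either hides the bottom one or is invisible."""
    r, g, b, a = top
    return bottom if a == 0 else (r, g, b)


def blend_alpha(top, bottom):
    """Alpha blending: mix top and bottom in proportion to the 4-bit alpha value."""
    r, g, b, a = top

    def mix(t, u):
        return (t * a + u * (15 - a)) // 15

    return mix(r, bottom[0]), mix(g, bottom[1]), mix(b, bottom[2])


def compose(layer_pixels, blend=blend_alpha, background=(0, 0, 0)):
    """Blend the pixels of all layers at one screen position, bottom layer first."""
    out = background
    for pixel in layer_pixels:
        out = blend(pixel, out)
    return out


# Opaque blue on layer 0, fully transparent white on layer 1.
print(compose([(0, 0, 15, 15), (15, 15, 15, 0)]))
```

Binary transparency only needs a comparison per pixel, while alpha blending needs multiplications, which is a reason to offer both modes.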


Basic operation of the Frame Buffer is shown in the video. Four layers are created from the images attached to this step. Layer 0 is 640x480; layers 1 to 3 are 320x240 and have a transparent background, and layer 3 also has an alpha channel of 75%.

Scrolling, blending and alpha transparency are shown in the video.

Step 5: Communicating With the MCU


The MCU has to be able to write to and read from the GPU's registers in order to configure the different internal modules. The MCU team has chosen to use the LCD controller interface of the Keil board to communicate with the Nexys 3 board. More specifically, the LCD controller interface is an SRAM interface with a single address bit and 16 data bits.


Because there are not enough data and address bits for our 32-bit registers and 16-bit DMA data interface, we have used a protocol based on multiple address and data cycles. This protocol worked well, but it occasionally caused hardware faults due to noisy signals or mishandled write/read cycles. We have recently developed a simpler and more robust version, which should be documented here soon.

Instruction Decoder and Register Bus

The other team's MCU Interface recovers the instruction address and stores it in a register; this address is needed to write to or read from the correct register.

The Register Bus Master does this in a very generic and scalable fashion: on a single-master, multiple-slave bus, it uses slave address masks to route data to the right module.
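Address decoding with a slave address mask can be sketched as follows. The address map below is entirely hypothetical (the real register addresses are not given here); the sketch only shows the masking principle, where a slave is selected when the masked address equals its base.

```python
# Hypothetical address map: name -> (base, mask). Not the project's real register map.
SLAVES = {
    "VideoDisplayController": (0x00, 0xF0),
    "FrameBuffer": (0x10, 0xF0),
    "BlockProcessingUnit": (0x20, 0xF0),
    "PrimitiveGeneratorUnit": (0x30, 0xF0),
}


def select_slave(address):
    """Return the slave whose base matches the masked address, or None."""
    for name, (base, mask) in SLAVES.items():
        if (address & mask) == base:
            return name
    return None


def register_offset(address):
    """Low bits left over after masking select a register inside the slave."""
    return address & 0x0F


print(select_slave(0x12), register_offset(0x12))
```

Adding a module then only means adding one base and mask pair, which is what makes this kind of bus easy to extend.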

The Register Bus has been updated recently; the new documentation will be here soon.

Step 6: MCU Emulator

This module allows us to debug and test different modules of the GPU without needing to connect the STM32 to issue instructions. It was particularly useful when we had neither the API nor the STM32, and it can be used to make small demonstrations of the GPU.

The emulator is really simple to use: it has a pre-loaded example instruction set for each of the main functionalities of the GPU, and you can use the buttons to select an operation and apply it.

Depending on the selected buttons, the emulator generates the SRAM signals that are fed into the MCU Interface for decoding.

Many videos demonstrating different modules with the MCU Emulator are published in this instructable. In these demonstrations, the operations are hard-coded onto the Nexys 3 board and we use sprites and backgrounds from the SNES game "Zelda - A Link to the Past".

Note: This game is still under license but it is commercially unavailable, therefore using its content in practice is tolerated.

Step 7: Block Processing Unit

The Block Processing Unit is an image processing module designed to operate on rectangular blocks of images. It takes advantage of the burst memory access provided by the RAM Controller, which is well suited to working on horizontal lines because of its high speed. It is therefore also usable on rectangular zones: the memory bus allows easy selection of a zone using HLEN, VLEN and PICLEN, which makes accelerated operations on such zones an advantageous feature. Hence the name “block” processing unit.

Core Functionalities

The Block Processing Unit provides the following operations:

  • BitBlit: the most common 2D graphics operation. It consists of copying an image, or a rectangular zone of an image, and pasting it at its destination in memory. In the context of a game, it can be used to build a map from tiles, to place objects in an environment or to select different sprites.

  • Move: similar to BitBlit, but the original image is removed and replaced by a fully transparent black colour. Combined with BitBlit, this can be used to animate a moving character in a video game.

  • Fill: a rectangular zone in memory is filled with a uniform colour. This can be used to build a GUI, for instance, by setting up different rectangular zones to write data or display icons over.

  • Clear: essentially the same as Fill, but with a transparent black colour. It is needed a lot when managing multiple display layers.
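The four operations can be modelled in software on a flat memory array, using the same block addressing as the memory bus (base address, PICLEN, HLEN, VLEN). The Python sketch below is a behavioural model under assumptions, not the HDL: the function names are hypothetical and the encoding of transparent black as 0x0000 is assumed.

```python
TRANSPARENT_BLACK = 0x0000  # assumed encoding: all colour bits and alpha bits at zero


def read_block(mem, addr, piclen, hlen, vlen):
    """Copy an hlen x vlen block out of an image that is piclen pixels wide."""
    return [[mem[addr + h + v * piclen] for h in range(hlen)] for v in range(vlen)]


def write_block(mem, addr, piclen, block):
    """Write a block (list of rows) into an image that is piclen pixels wide."""
    for v, row in enumerate(block):
        for h, pixel in enumerate(row):
            mem[addr + h + v * piclen] = pixel


def fill(mem, addr, piclen, hlen, vlen, colour):
    write_block(mem, addr, piclen, [[colour] * hlen for _ in range(vlen)])


def clear(mem, addr, piclen, hlen, vlen):
    fill(mem, addr, piclen, hlen, vlen, TRANSPARENT_BLACK)


def bitblit(mem, src, src_piclen, dst, dst_piclen, hlen, vlen):
    write_block(mem, dst, dst_piclen, read_block(mem, src, src_piclen, hlen, vlen))


def move(mem, src, src_piclen, dst, dst_piclen, hlen, vlen):
    # Read first and clear the source before writing, so overlapping zones still work.
    block = read_block(mem, src, src_piclen, hlen, vlen)
    clear(mem, src, src_piclen, hlen, vlen)
    write_block(mem, dst, dst_piclen, block)


image = list(range(16))            # a 4x4 image holding the values 0..15
bitblit(image, 0, 4, 10, 4, 2, 2)  # copy the top-left 2x2 block to the bottom-right
print(image)
```

Move and Clear are built from the same read, write and fill steps as BitBlit and Fill, which matches the idea of providing a few basic operations and their combinations.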


In the video you can see basic examples of operations such as fill, bit-blit, move and alpha transparency modification.

Step 8: Direct Memory Access

The DMA Controller gives access to the GPU’s RAM (the Micron chip on the Nexys 3 board) directly from the SRAM interface. This module is mostly needed to transfer bitmap image data such as backgrounds, tiles and sprites to the CRAM. In order to minimise the wait on the loading screens of games, we have to maximise the data transfer rate through this controller.

We designed this module considering that the input data rate might go up to 28 MHz (in the case where a write cycle to the MCU Interface takes 6 AHB clocks on the STM32), but in practice, due to the noisy signals on the MCU Interface, the data rate never goes that high (it is more likely to be between 3 MHz and 10 MHz).
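To put these rates in perspective, the sketch below estimates how long it would take to load one full 320x240 image of 16-bit pixels at each rate. It assumes one 16-bit word per write cycle and an AHB clock of 168 MHz; the latter is an assumption that happens to reproduce the 28 MHz figure (168 / 6) and is not stated in the text.

```python
AHB_CLOCK_HZ = 168_000_000  # assumed STM32 AHB clock, not given in the text
WORDS_PER_IMAGE = 320 * 240  # one 16-bit word per pixel


def word_rate_hz(ahb_clock_hz, clocks_per_write):
    """Words written per second when each write cycle takes a number of AHB clocks."""
    return ahb_clock_hz / clocks_per_write


def transfer_time_ms(words, rate_hz):
    """Time in milliseconds to transfer a number of words at a given word rate."""
    return 1000 * words / rate_hz


peak = word_rate_hz(AHB_CLOCK_HZ, 6)
for rate in (peak, 10_000_000, 3_000_000):
    print(f"{rate / 1e6:5.1f} MHz: {transfer_time_ms(WORDS_PER_IMAGE, rate):6.2f} ms per image")
```

Even at the lower end of the practical range, a full-screen image would load in a few tens of milliseconds under these assumptions.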

Aside from the register management module, this controller can be separated into two submodules. The first is the MCU Interface Receiver, which is tasked with monitoring the transfer’s status and buffering incoming data while a DMA transfer is ongoing. The buffering uses a total of six 16kB BlockRAMs, and these buffers are grouped behind a single address bus on the write-port side. The second submodule is the RAM Controller Emitter, which is tasked with “emptying” these buffers by writing to the RAM once they hold enough data.

(Detailed operating modes and the theoretical bandwidth limit will be described here soon.)


In the video, the MCU Emulator feeds a 16-bit counter's output as the incoming data from the MCU, and this data is transferred to a 300x200 zone of Layer 0. You can find images of the data before and after the DMA operation.

Step 9: Primitive Generator Unit

The Primitive Generator Unit (PGU) is a block we designed to give our graphics card the ability to generate a set of pixels forming a line, a circle or a letter of the alphabet and display it on the LCD. It can also set the colour of a single pixel, allowing the MCU to do more complex (non-accelerated) primitive generation.
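The text does not say which algorithms the PGU uses. As an illustration of integer-only primitive generation, which is the kind of approach that suits hardware, the Python sketch below uses Bresenham's line algorithm and the midpoint circle algorithm; these are common choices swapped in for the example, not a description of the actual PGU.

```python
def line_pixels(x0, y0, x1, y1):
    """Pixels of a line from (x0, y0) to (x1, y1) using Bresenham's algorithm."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return pixels
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy


def circle_pixels(cx, cy, radius):
    """Pixels of a circle outline using the midpoint circle algorithm."""
    x, y, err = radius, 0, 1 - radius
    pixels = set()
    while x >= y:
        # One computed point gives eight pixels by symmetry.
        for px, py in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            pixels.add((cx + px, cy + py))
        y += 1
        if err < 0:
            err += 2 * y + 1
        else:
            x -= 1
            err += 2 * (y - x) + 1
    return pixels


print(line_pixels(0, 0, 3, 1))
print(sorted(circle_pixels(0, 0, 1)))
```

Both routines use only additions, comparisons and shifts by one, so each generated pixel could be turned into a single write on the memory bus.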


The video demonstrates operations such as pixel, line, circle and character generation.

Step 10: First Demonstrations

Here's a progress video with two demonstrations of the combined MCU and GPU system.

First demonstration: Animation

In this demo there are two display layers in the frame buffer: the background is a 320x240 image of stars, and the foreground is a 3200x240 image with a fixed background colour that is set to a transparent colour while converting the BMP file into our format.

The MCU periodically scrolls the foreground image to create the animated movement.

Second demonstration: A short gameplay

In this demo we provide a short gameplay sequence using sprites and background images from Streets of Rage (abandonware).

In this case, animations are created using bit-blits on the foreground, and the movement of the character is created using scrolling. At the end you can also see primitive generation being used to display a message.