For our engineering project, our tutors wanted us to face the challenges of designing a real-time system with relatively high performance on limited resources (memory, bandwidth).
The specifications require a gaming platform using the following hardware:
- a Digilent Nexys 3 board (for implementing a GPU on the FPGA).
- a Keil MCBSTM32F400 board (for hosting the OS of the platform and storing the game data).
- a Display Tech DT035TFT LCD with a Novatek NT39016 driver (portable true-colour display).
There are two teams of two students working on this project: one team focuses on the ARM MCU and the other on the GPU.
The platform has to match the performance of 16-bit commercial gaming platforms such as the SNES or the Sega Mega Drive, including multilayer frames and scrolling. The platform consists of two main components: the MCU on the motherboard and the GPU connected to the video output.
- The MCU-specific requirements are a graphics API for the GPU, an audio API for the on-board audio codec, user I/O, the MCU/GPU interface, the SD card interface, and the programming of the video game itself. A module for configuring the LCD screen (brightness, contrast, etc.) is also included in the GPU.
- The GPU-specific requirements are multilayer display, blending of the layers using transparency, 16-bit RGBA colours, multilayer scrolling, basic 2D operations (bitblit (copy), color fill, transparency modification, and combinations of these such as clear and move), primitive generation (lines, circles, text), LCD and VGA video outputs, and a graphics-oriented memory controller with DMA access.
The two teams will need to collaborate regularly to develop the two main components previously mentioned. We have designed the architecture of the platform to ensure this.
Our team will start by implementing all the interfaces surrounding the GPU, such as the LCD support and the connection to the MCU board. This will be developed in parallel with the design of the HDL modules associated with these interfaces. A preliminary integration with the GPU will then take place to ensure the consistency and interoperability of both modules. This will be followed by the software design on the MCU: the required peripheral drivers, the audio and video APIs, and finally the RTOS. After the final integration with the graphics team, involving all the GPU modules, the planned game will be implemented and tested.
Before going into the details, you can watch our YouTube video, which briefly summarises the project and shows what we have managed to do so far. The project is not yet complete, but we will update this page whenever a new feature is added.
First demonstration: Animation
In this demo there are two display layers in the frame buffer. The background is a 320x240 image of stars. The foreground is a 3200x240 image whose fixed background colour is set as the transparent colour when the BMP file is converted into our format. The MCU periodically scrolls the foreground image to create the animated movement.
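As an illustration of the scrolling in this demo, the sketch below computes the horizontal scroll offset that the MCU could write at each period. It assumes that the animation loops back to the start and that the step is a parameter. `next_scroll_x`, `SCREEN_WIDTH` and `FOREGROUND_WIDTH` are hypothetical names, and the call that actually sends the offset to the GPU is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Dimensions taken from the first demo. */
#define SCREEN_WIDTH      320u
#define FOREGROUND_WIDTH 3200u

/* Returns the next horizontal scroll offset of the foreground plane.
 * Assumption: the animation loops, so once the last visible window has
 * been reached the offset wraps back to 0. */
static uint32_t next_scroll_x(uint32_t scroll_x, uint32_t step)
{
    /* Largest offset that still keeps a full screen inside the image. */
    const uint32_t max_offset = FOREGROUND_WIDTH - SCREEN_WIDTH; /* 2880 */

    assert(step > 0u && step <= max_offset);
    scroll_x += step;
    return (scroll_x > max_offset) ? 0u : scroll_x;
}
```

With a step of 320 pixels (one screen width), the offset visits the ten 320-pixel windows of the 3200-pixel image in turn. Reading the foreground as ten animation frames is our assumption; a smaller step gives a smooth pan instead.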
Second demonstration: A short gameplay
In this demo we show a short gameplay sequence using sprites and background images from Streets of Rage (abandonware). Here, animations are created using bitblits on the foreground, and the character's movement is created using scrolling. At the end you can also see primitive generation used to display a message.
Step 1: Materials
In order to work on this project, you will need the following materials:
MCBSTM32F400 - ARM CORTEX M4
This MCU board hosts our real-time operating system, the high-level graphics API and the high-level audio API.
Key features regarding our project:
- Audio CODEC with Line-In/Out and Speaker/Microphone: the on-board audio codec will be used for in-game audio.
- 2.4 inch Color QVGA TFT LCD with resistive touchscreen: this LCD screen will be removed from the MCU board. This reveals a 34-pin connector that we will use to connect the Nexys 3 board to the MCU.
- Flexible Static Memory Controller (FSMC): the FSMC is embedded in the STM32 microcontroller of the MCU board. It has four Chip Select outputs supporting the following modes: PC Card/CompactFlash, SRAM, PSRAM, NOR Flash and NAND Flash. We will use the SRAM mode to transfer data between the FPGA board and the MCU board.
- DMA Controller: the STM32 features two general-purpose dual-port DMA controllers with 8 streams each. They can manage memory-to-memory, peripheral-to-memory and memory-to-peripheral transfers. We will use them for fast, direct transfers of sprites and background images to the FPGA memory (video RAM).
- MicroSD Card Interface: the SD card slot on the MCBSTM32F400 board will be used to load any game to be run on our portable console.
- Push Buttons and 5-position Joystick: the MCBSTM32F400 board has two push buttons and a 5-position joystick that we can use to play games on our console.
FPGA - Xilinx Spartan-6
Our GPU will be implemented on the Nexys 3 board.
Key features regarding our project:
- 16 MB Micron Cellular RAM: the Cellular RAM supports asynchronous operations with a 70 ns access time, and burst operations at up to 80 MHz.
- 8-bit VGA: the VGA port will be used for debugging purposes. The actual application will be displayed on the Display Tech DT035TFT LCD.
- Four double-wide Pmod™ connectors: These connectors will be used to connect the MCU board with the Nexys 3 board.
- VHDC connector: This connector will be used to connect the LCD with the FPGA board.
Display Tech DT035TFT LCD:
This LCD replaces the one integrated on the MCU board. It is a higher-quality 24-bit RGB LCD with a Novatek NT39016 driver.
LCD - Nexys 3 PCB:
The main purpose of this PCB is to connect the LCD to the FPGA through the FPGA's VHDC connector.
The first step is to route the data signals coming from the Novatek chip of the LCD to the connector into which the VHDC connector will be plugged. The ground is connected directly to the supply ground. The LCD backlight is powered by 18 V. To generate it, we used a variable voltage regulator that converts the 24 V from the power supply into 18 V, which is connected directly to the LCD.
To generate the 3.3 V power supply of the LCD, we used a second, fixed voltage regulator. This regulator produces 3.3 V from a 15 V input, so we used a voltage divider to derive 15 V from the 24 V delivered by the power supply.
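The required divider ratio follows from the unloaded voltage divider equation. The resistor values below are an illustrative example, not the values fitted on our PCB.

```latex
V_{15} = V_{24}\,\frac{R_2}{R_1 + R_2}
\quad\Rightarrow\quad
\frac{R_2}{R_1 + R_2} = \frac{15\ \mathrm{V}}{24\ \mathrm{V}} = 0.625
\quad\Rightarrow\quad
\frac{R_1}{R_2} = \frac{1 - 0.625}{0.625} = 0.6
% Example: R_1 = 3\ \mathrm{k\Omega},\ R_2 = 5\ \mathrm{k\Omega}
% gives 24\ \mathrm{V} \times 5/8 = 15\ \mathrm{V}.
```

This ratio only holds while the current drawn by the regulator is small compared with the current flowing through the divider, because the regulator input is in parallel with R_2.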
Nexys 3 - MCU PCB:
In order to connect the STM32 microcontroller to the FPGA, we designed a very simple PCB containing only two connectors. The first one is connected to the LCD connector pins of the STM32 board, which are directly connected to the FSMC peripheral. The second one is connected to the Pmod connectors of the FPGA.
In short, this PCB simply routes the signals coming from the FSMC to the FPGA, more precisely to the MCU interface implemented on the FPGA.
Step 2: Bitmap Conversion Software
Since the beginning of the project, we knew that we could not support every colour format for the images we want to display. To meet the specifications and respect the technical constraints, we needed to choose a fixed format and stick with it. The required colour format is 16-bit RGBA: 12 bits of colour (4 bits per channel) and a 4-bit transparency component. Since this is not a standard format, we had to develop conversion software to create images in it. Another advantage of such software is the ability to modify some characteristics of the image, such as its transparency.
We chose C++ as the programming language and used Qt Creator with the Qt library for the graphical interface. C++ is a language we are used to, and we knew that file streams would be easy to handle; we were able to read the images we wanted to modify without any problem. Thanks to Qt Creator and the Qt library, we created a very simple graphical interface, which makes the software very intuitive to use.
The software we designed converts images into a format that can be used on the FPGA board. The images to convert must be in the 24-bit BITMAP format for the following reasons:
- This format does not compress the images. The images stored in the FPGA will not be compressed, so an uncompressed source image is needed.
- FPGA image compression/decompression (JPEG/MPEG) already exists as open-core IPs, but it is very hard to implement. It is therefore easier to process images that are already decompressed. The large size of this type of image is not a problem given the available memory and the speed of data transfer via the DMA.
- Its quality is higher than that of 16-bit formats.
- It is available everywhere (many programs, such as Paint, can convert any type of image to the 24-bit BITMAP format).
- It has no transparency, which gives us more flexibility to handle transparency in our own way.
We mentioned transparency management earlier. Our software was also created to set the transparency levels of the colours of a given image. Our graphics card can handle up to 4 independent display layers. It is therefore crucial to be able to change the transparency of an image or to set a transparent colour on it; otherwise, multilayer display brings no benefit.
We have two different options for transparency:
- The first one makes a set of colours (5 at most) fully transparent. Example: make the background of a sprite transparent.
- The second one sets the transparency of all the colours not covered by the first option. Example: make the image of a fire 50% transparent to refine its animation.
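The two transparency options can be expressed as a per-pixel rule. This is a minimal sketch, assuming that alpha 0x0 means fully transparent and 0xF fully opaque. The names `transparency_cfg` and `pixel_alpha` are hypothetical and not taken from our converter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEY_COLOURS 5

/* Transparency settings chosen by the user before conversion. */
typedef struct {
    uint32_t key_colours[MAX_KEY_COLOURS]; /* 0xRRGGBB colours made fully transparent */
    size_t   key_count;                    /* number of entries used in key_colours */
    uint8_t  default_alpha;                /* 4-bit alpha (0..15) for every other colour */
} transparency_cfg;

/* Returns the 4-bit alpha of one 24-bit pixel.
 * Assumption: 0x0 is fully transparent and 0xF fully opaque. */
static uint8_t pixel_alpha(const transparency_cfg *cfg, uint32_t rgb)
{
    assert(cfg != NULL && cfg->key_count <= MAX_KEY_COLOURS && cfg->default_alpha <= 0xFu);
    for (size_t i = 0; i < cfg->key_count; ++i) {
        if (cfg->key_colours[i] == rgb)
            return 0x0;             /* option 1: key colour becomes transparent */
    }
    return cfg->default_alpha;      /* option 2: common level for all other colours */
}
```

For the fire example, a `default_alpha` of about 0x8 would give roughly 50% transparency.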
First step: Loading the image and choosing the right configuration parameters.
We begin by choosing the image to convert and the path where we want to save it. Then we set the transparency parameters explained in the previous section. Once the image is loaded, the software starts by reading the first bytes of the file. These bytes contain the dimensions of the image. They are not copied into the output file, because the FPGA board does not use this information, only the data corresponding to the pixels. After reading these first bytes, the software can start the conversion.
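For reference, the header of a 24-bit BMP file is 54 bytes long: a 14-byte file header followed by a 40-byte BITMAPINFOHEADER. All fields are little-endian. The pixel-data offset is at byte 10, the width at byte 18, the height at byte 22 and the bit depth at byte 28.

The sketch below reads these fields. It is written in C rather than the C++/Qt of our tool, and `bmp_info`, `bmp_parse_header` and `bmp_row_stride` are hypothetical names. It also computes the row stride, because BMP pads every pixel row to a multiple of 4 bytes and those padding bytes must be skipped during conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the BMP header that the converter needs. */
typedef struct {
    uint32_t pixel_offset;   /* byte offset of the pixel array in the file */
    int32_t  width;          /* in pixels */
    int32_t  height;         /* positive: rows stored bottom-up */
    uint16_t bits_per_pixel; /* must be 24 for our converter */
} bmp_info;

/* Little-endian readers (casts avoid shifting into the sign bit). */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Parses the 54-byte header. Returns 0 for a 24-bit BMP, -1 otherwise. */
static int bmp_parse_header(const uint8_t *hdr, size_t len, bmp_info *out)
{
    assert(hdr != NULL && out != NULL);
    if (len < 54u || hdr[0] != 'B' || hdr[1] != 'M')
        return -1;
    out->pixel_offset   = le32(hdr + 10);
    out->width          = (int32_t)le32(hdr + 18);
    out->height         = (int32_t)le32(hdr + 22);
    out->bits_per_pixel = le16(hdr + 28);
    return (out->bits_per_pixel == 24u) ? 0 : -1;
}

/* Bytes per stored row: 3 bytes per pixel, padded to a multiple of 4. */
static uint32_t bmp_row_stride(int32_t width)
{
    assert(width > 0);
    return ((uint32_t)width * 3u + 3u) & ~3u;
}
```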
Second step: Converting the image.
In this second phase, the software reads every byte defining the colours of a pixel and converts the pixel to the 16-bit format. This is a simple process: a 4-bit right shift reduces each colour component from 8 bits to 4 bits. We then add 4 bits of transparency to these 12 bits of colour, using the parameters defined at the beginning of the procedure.
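The conversion of one pixel can be sketched as follows. BMP stores each 24-bit pixel as blue, green, red (in that byte order), so the bytes are reordered before packing into 0xRGBA. The function names are hypothetical. The 4-bit alpha is assumed to come from the transparency settings chosen in the first step.

```c
#include <assert.h>
#include <stdint.h>

/* Packs 8-bit R, G, B and a 4-bit alpha into one 16-bit 0xRGBA pixel.
 * Each colour keeps its 4 most significant bits (right shift by 4). */
static uint16_t to_rgba4444(uint8_t r, uint8_t g, uint8_t b, uint8_t alpha4)
{
    assert(alpha4 <= 0xFu);
    return (uint16_t)(((r >> 4) << 12) | ((g >> 4) << 8) |
                      ((b >> 4) << 4) | alpha4);
}

/* Converts one pixel as stored in a 24-bit BMP file (blue, green, red). */
static uint16_t bmp_pixel_to_rgba4444(const uint8_t bgr[3], uint8_t alpha4)
{
    return to_rgba4444(bgr[2], bgr[1], bgr[0], alpha4);
}
```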
Third step: Setting the right format of the image.
This final step consists of rearranging the data. The BITMAP matrix stores the lines in decreasing order (bottom line first). Our image format requires the lines in increasing order (top line first), so we have to rearrange the pixels. We encountered the same problem with the bit order, which we corrected by changing 0xRGBA to 0xGRAB.
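Both reorderings can be sketched together. Swapping 0xRGBA into 0xGRAB amounts to exchanging the two 4-bit halves of each byte. A positive BMP height means the rows are stored bottom-up, so the output is filled starting from the last stored row. This is a hypothetical C sketch that works on already-converted 16-bit pixels without row padding; it is not the code of our converter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 0xRGBA -> 0xGRAB: swap the two nibbles inside each byte. */
static uint16_t rgba_to_grab(uint16_t v)
{
    return (uint16_t)(((v & 0xF0F0u) >> 4) | ((v & 0x0F0Fu) << 4));
}

/* Copies a bottom-up image (BMP row order) into a top-down buffer,
 * reordering every pixel to 0xGRAB on the way. */
static void bmp_rows_to_topdown(const uint16_t *src, uint16_t *dst,
                                size_t width, size_t height)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < height; ++y) {
        /* Output row y is the (height - 1 - y)-th row stored in the file. */
        const uint16_t *src_row = src + (height - 1u - y) * width;
        for (size_t x = 0; x < width; ++x)
            dst[y * width + x] = rgba_to_grab(src_row[x]);
    }
}
```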
Once the image is in the desired format, the only thing left is to transfer it to the RAM of the FPGA using the STM32 microcontroller.
Step 3: Architecture
We chose an architecture that is both generic and flexible. It leaves room for further improvements to the project and lets us easily add or remove modules. The architecture presented in the image is inspired by a few existing ones, from which we kept the aspects that seemed useful for our specifications.
In this architecture the use of a shared-memory bus and the use of module-specific register maps provides huge flexibility for changes in the GPU.
To summarise the roles of the different modules: the MCU Interface allows the STM32 to write into the registers of several modules, grouped into Register Maps. The written data can configure different aspects of the GPU or launch an image processing operation.
Among these modules is the Video Display Controller, which provides the right synchronisation signals for either the VGA or the LCD output. The Frame Buffer is also synchronised to this module.
The Frame Buffer fetches the lines to be displayed from memory and applies blending and scrolling. Most importantly, it provides the correct RGB data at the right moment. Line fetching is done through a graphics-optimised memory bus provided by the RAM Controller.
This controller provides a priority-oriented shared memory bus used by all the modules that require access to the RAM. Among them are:
- the Block Processing Unit, which operates on rectangular image portions;
- the Primitive Generator Unit, which generates geometric figures at a specified destination;
- the DMA Controller, which provides a fast way to transfer image data to the on-board RAM.
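A fixed-priority arbiter is one simple way to implement such a priority-oriented bus. The C model below only illustrates the idea; the real controller is written in HDL. The priority order is our hypothetical example, not the actual ordering. It puts the Frame Buffer first because the display must never be starved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical priority order: a lower value wins the bus. */
enum bus_master {
    MASTER_FRAME_BUFFER,
    MASTER_DMA,
    MASTER_BLOCK_PROC,
    MASTER_PRIM_GEN,
    MASTER_COUNT
};

#define NO_GRANT (-1)

/* Returns the highest-priority master whose request bit is set,
 * or NO_GRANT when nobody requests the bus. */
static int arbitrate(uint8_t request_mask)
{
    assert(request_mask < (1u << MASTER_COUNT));
    for (int m = 0; m < MASTER_COUNT; ++m) {
        if (request_mask & (1u << m))
            return m;
    }
    return NO_GRANT;
}
```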
Finally, the LCD Configuration Unit makes SPI data transfers into the internal registers of the LCD controller. These registers set the brightness, contrast and many other features of the LCD display.
Concerning the modules integrated in the MCU board:
The Real-Time Operating System is responsible for managing the timing constraints of the video games.
The High-level Graphics API helps the user easily control the graphics card by providing primitives, structures and macros.
The High-level Audio API helps the user play any music previously created on a PC.
In the following sections, the modules developed by the MCU team are explained in detail.
Step 4: MCU Interface - FPGA Xilinx Spartan 6
We designed the MCU interface to share data between the STM32 microcontroller and the Nexys 3 FPGA. To ensure a fully functional graphics card, data received from the STM32 must be directed to the correct register or to the DMA controller without any discontinuity or data loss. The STM32 must also be able to read data from the registers without jeopardising the write process.
The MCU Interface Protocol
The LCD connector used to connect the STM32 microcontroller to the Nexys 3 FPGA is a 17-by-2 board-to-board connector.
In order to use the FSMC asynchronous SRAM protocol, we need:
- A 16-bit data bus, available directly on the LCD connector (D0 to D15).
- The NOE, NWE and NE4 signals, also available on the LCD connector (RD, WR and CS).
- A 26-bit address bus, which is not available on the LCD connector: only a single address bit is present. This is why we had to create our own protocol to use the FSMC to transfer data from the STM32 microcontroller.
To compensate for the missing 26-bit address bus, we decided to split each transaction (read or write) into three successive transactions. The first one is a write transaction containing the address of the register we want to read or write. The following ones are read or write transactions, depending on the operation. In the case of a write, the data bus contains the data to write into the register whose address was given in the first transaction. Since the GPU registers are 32-bit registers, two 16-bit write (or read) transactions are needed.
In summary, writing a register takes three write transactions: the first one holds the address and the other two contain the 32-bit data (LSB then MSB). To read a register, the first transaction is a write containing the address, and the other two return the register data (LSB then MSB).
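On the STM32 side, the protocol boils down to the two functions sketched below. `fsmc_bus`, `gpu_reg_write` and `gpu_reg_read` are hypothetical names. On the board, `write16` and `read16` would be volatile 16-bit accesses to the FSMC bank. Here, a small recording mock stands in for the hardware so that the transaction order can be checked on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessors: each call is one 16-bit FSMC transaction. */
typedef struct {
    void     (*write16)(uint16_t value);
    uint16_t (*read16)(void);
} fsmc_bus;

/* Register write: address, then data LSB, then data MSB. */
static void gpu_reg_write(const fsmc_bus *bus, uint16_t reg_addr, uint32_t value)
{
    assert(bus != NULL);
    bus->write16(reg_addr);
    bus->write16((uint16_t)(value & 0xFFFFu));
    bus->write16((uint16_t)(value >> 16));
}

/* Register read: address write, then read LSB, then read MSB. */
static uint32_t gpu_reg_read(const fsmc_bus *bus, uint16_t reg_addr)
{
    assert(bus != NULL);
    bus->write16(reg_addr);
    uint32_t lsb = bus->read16();
    uint32_t msb = bus->read16();
    return (msb << 16) | lsb;
}

/* PC-side mock: logs every write and returns queued values on reads. */
static uint16_t mock_log[16];
static unsigned mock_log_len;
static uint16_t mock_read_queue[2];
static unsigned mock_read_pos;

static void mock_write16(uint16_t v)
{
    if (mock_log_len < 16u)
        mock_log[mock_log_len++] = v;
}

static uint16_t mock_read16(void)
{
    return mock_read_queue[mock_read_pos++ % 2u];
}

static const fsmc_bus mock_bus = { mock_write16, mock_read16 };
```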
Data and address bus management
The data and address bus management block updates the 32-bit data buffer and the 16-bit address buffer. On the first transaction from the STM32, the 16-bit data is transferred to the address buffer. The following two transactions are transferred to the 32-bit data buffer.
A data transfer to the DMA controller does not need three write transactions, because there is no address and the data bus is only 16 bits wide. In that case, the STM32 16-bit data bus is transferred directly to the DMA data bus. We use the LSB of the FSMC 26-bit address bus, named RS, to identify the destination of the transfer (DMA controller or Register Map).
Data is transferred from the buffers to the appropriate bus according to the type of transfer. The type is detected using the NOE and NWE signals from the FSMC, as described in the previous section. Since these signals are asynchronous, we added a synchronous signal generator to the MCU interface. It can be used to synchronise the other blocks, at the cost of a 10 to 20 ns delay.
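The behaviour of the data and address bus management block can be modelled in C as a small state machine that counts the transactions. This is a behavioural sketch only; the block itself is written in HDL. The names are hypothetical, and we assume here that RS = 1 selects the DMA path and RS = 0 the Register Map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mcu_if_result { MCU_IF_PENDING, MCU_IF_REG_WRITE, MCU_IF_DMA_WORD };

typedef struct {
    uint16_t addr_buf; /* 16-bit address buffer */
    uint32_t data_buf; /* 32-bit data buffer */
    unsigned phase;    /* 0: address expected, 1: LSB expected, 2: MSB expected */
} mcu_if_state;

/* Called on every write strobe (NWE) from the FSMC.
 * rs selects the destination; RS = 1 for DMA is an assumption. */
static enum mcu_if_result mcu_if_on_write(mcu_if_state *s, int rs,
                                          uint16_t bus_value, uint16_t *dma_word)
{
    assert(s != NULL && dma_word != NULL);
    if (rs) {                 /* DMA path: forwarded as-is, no address phase */
        *dma_word = bus_value;
        return MCU_IF_DMA_WORD;
    }
    switch (s->phase) {
    case 0:                   /* first transaction carries the register address */
        s->addr_buf = bus_value;
        s->phase = 1;
        return MCU_IF_PENDING;
    case 1:                   /* second transaction: data LSB */
        s->data_buf = bus_value;
        s->phase = 2;
        return MCU_IF_PENDING;
    default:                  /* third transaction: data MSB, register write ready */
        s->data_buf |= (uint32_t)bus_value << 16;
        s->phase = 0;
        return MCU_IF_REG_WRITE;
    }
}
```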
Bus Request Management
For a data transfer to or from the Register Map block, the bus must be granted for the transaction to be processed successfully. If the bus is unavailable, a request must be sent to the Register Map and the STM32 must stay idle until the bus becomes available.
That is why one of the general-purpose input/output pins (GPIO) available on the LCD connector is configured to carry a signal named BUSY. This signal tells the STM32 that the FPGA is busy and cannot continue the transaction until BUSY returns to '0'. This simple procedure guarantees that every transfer to or from the Register Map is processed successfully without any data loss. When the bus is available, an output enable is sent to the registers in the case of a read transaction. In the case of a write transaction, a load signal is sent to the registers at the same time as the address bus.
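On the STM32 side, the BUSY handshake amounts to polling the GPIO before each Register Map transaction. In the sketch below, the pin read is passed as a function pointer; on the board it would read the GPIO input register. `wait_while_busy` and its timeout guard are hypothetical additions. A mock pin is included so the loop can be run on a PC.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the BUSY level: non-zero while the FPGA is busy. */
typedef int (*busy_pin_reader)(void);

/* Polls BUSY until it reads 0. Returns 0 when the FPGA is free, or -1
 * after max_polls reads (hypothetical guard against a stuck line). */
static int wait_while_busy(busy_pin_reader read_busy, unsigned max_polls)
{
    assert(read_busy != NULL);
    for (unsigned i = 0; i < max_polls; ++i) {
        if (!read_busy())
            return 0;
    }
    return -1;
}

/* PC-side mock: reports busy for busy_countdown reads, then free. */
static int busy_countdown;

static int mock_busy(void)
{
    if (busy_countdown > 0) {
        --busy_countdown;
        return 1;
    }
    return 0;
}
```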
In this demo you will see how we managed to turn LEDs on the Nexys 3 board on and off from the MCU board.
Step 5: LCD Configuration Unit - FPGA Xilinx Spartan 6
This block is designed to configure the LCD the way we want, for example to change its contrast or brightness.
The purpose of this block is to update the registers contained in the LCD configuration module. When data is written into the LCD registers of the Register Map, the “Set Data” signal is sent to this module to start the update process. For each register in the LCD Configuration module, the corresponding address is sent to the Register Map and the register is updated. The bus must of course be granted, and the output enable signal must be sent with the address bus. If the bus is not granted, the bus request signal is set and the module stays idle until the bus is granted.
Every time the update process is done, the RegMap communication block compares the newly received data with the old data stored in a buffer. If a change has been made, the LCD SPI Bus Management block is informed. The address of the changed register, as specified in the Novatek datasheet, is stored in a buffer along with the new data. The LCD SPI Bus Management block reads this buffer later.
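The comparison between the new and old register values can be modelled as follows. This C sketch is only a behavioural model of the RegMap communication block, and the names are hypothetical. The NT39016 register addresses are passed in as a table rather than reproduced from the datasheet, and 8-bit register data is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pending SPI update: LCD driver register address and its new value. */
typedef struct {
    uint8_t addr;
    uint8_t data;
} lcd_update;

/* Compares the new Register Map values with the buffered old ones.
 * Every changed register is queued (with its LCD address taken from
 * spi_addr[]) and the buffer is refreshed. Returns the number queued. */
static size_t lcd_collect_changes(uint8_t old_vals[], const uint8_t new_vals[],
                                  const uint8_t spi_addr[], size_t n,
                                  lcd_update queue[])
{
    assert(old_vals != NULL && new_vals != NULL && spi_addr != NULL && queue != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (new_vals[i] != old_vals[i]) {
            queue[count].addr = spi_addr[i];
            queue[count].data = new_vals[i];
            ++count;
            old_vals[i] = new_vals[i]; /* remember it for the next comparison */
        }
    }
    return count;
}
```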
LCD SPI Bus Management
This block sends the configuration data to the LCD. The LCD is driven by the Novatek NT39016 chip, which uses a 3-wire Serial Peripheral Interface (SPI) for all internal parameter configuration.
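A 3-wire SPI transfer of one configuration frame can be modelled as below:
- chip select goes low;
- each bit is placed on the data line while the clock is low and sampled on the rising edge;
- chip select goes high again.

The active-low chip select, the rising-edge sampling, the MSB-first order and the 16-bit frame length are assumptions to check against the NT39016 datasheet. The names and the mock pins are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pin drivers for the 3-wire interface (chip select, clock, data). */
typedef struct {
    void (*set_cs)(int level);
    void (*set_scl)(int level);
    void (*set_sda)(int level);
} spi3_pins;

/* Sends the nbits low-order bits of frame, MSB first. */
static void spi3_send(const spi3_pins *p, uint16_t frame, unsigned nbits)
{
    assert(p != NULL && nbits >= 1u && nbits <= 16u);
    p->set_cs(0);                       /* select the driver (active low assumed) */
    for (int i = (int)nbits - 1; i >= 0; --i) {
        p->set_scl(0);
        p->set_sda((frame >> i) & 1);   /* change data while the clock is low */
        p->set_scl(1);                  /* driver samples on the rising edge (assumed) */
    }
    p->set_scl(0);
    p->set_cs(1);                       /* end of frame */
}

/* PC-side mock of the driver: shifts in SDA on each rising SCL edge while selected. */
static int cs_level = 1, scl_level, sda_level;
static uint16_t sampled;
static unsigned sampled_n;

static void mock_cs(int level)  { cs_level = level; }
static void mock_sda(int level) { sda_level = level; }
static void mock_scl(int level)
{
    if (level && !scl_level && cs_level == 0) {
        sampled = (uint16_t)((sampled << 1) | sda_level);
        ++sampled_n;
    }
    scl_level = level;
}
```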
Step 6: Kernel and Middleware - MCU ARM CORTEX M4
High level Graphics API
Our goal is to implement a graphics card using the Nexys 3 FPGA. The STM32 microcontroller runs the operating system and sends commands to the FPGA. The commands travel over a bus that connects the two boards through the FSMC peripheral. To make this process easier, we created primitives, structures and macros. These three utilities, together with Doxygen documentation, help the user easily control the graphics card. It is the same idea as for the STM libraries.
Configuring a system requires a lot of research through the available documentation and source files. The user can of course modify the functions in the source files directly, but this should be avoided. Any modification can break a previous program that uses these source files and was written by another user. To avoid these risks, we gathered into a macro file all the information that needs to be modified to make the program work.
There are two main categories for the macros:
- Configuration Macros:
These macros let the user modify the different peripherals used for the communication, without changing anything in the source files. For example, these macros can fully configure the duration of the FSMC cycles.
- Macros defining the technical specifications:
These macros describe the specifications of our system, such as the screen size, the initial addresses of the planes and the addresses of the FSMC. They guarantee an API that can be used anywhere: for example, we can switch to a bigger screen just by modifying the right constant.
The Doxygen documentation includes all the explanations needed to use the macros.
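A macro file in this spirit could look like the sketch below. All names are hypothetical. The sources of the values are:
- the screen size comes from our demos;
- the 16 MB RAM size comes from the Nexys 3;
- 0x6C000000 is the start of FSMC Bank 1 region 4 (NE4) on the STM32F4;
- the plane addresses and FSMC timings are placeholder values.

```c
#include <assert.h>
#include <stdint.h>

/* ---- Technical specification macros ---- */
#define GPU_SCREEN_WIDTH     320u                      /* pixels */
#define GPU_SCREEN_HEIGHT    240u                      /* pixels */
#define GPU_BYTES_PER_PIXEL  2u                        /* 16-bit RGBA */
#define GPU_PLANE_COUNT      4u
#define GPU_VRAM_SIZE        (16u * 1024u * 1024u)     /* 16 MB Cellular RAM */

/* Size of one screen-sized plane, derived from the constants above. */
#define GPU_PLANE_SIZE  (GPU_SCREEN_WIDTH * GPU_SCREEN_HEIGHT * GPU_BYTES_PER_PIXEL)

/* Initial plane addresses in video RAM (placeholder: a 2 MB slot per plane,
 * enough for wider planes such as a 3200x240 foreground). */
#define GPU_PLANE_ADDR(n)    ((uint32_t)(n) * 0x200000u)

/* FSMC Bank 1, NOR/SRAM region 4 (NE4) on the STM32F4. */
#define GPU_FSMC_BASE        0x6C000000u

/* ---- Configuration macros (placeholder values, in HCLK cycles) ---- */
#define GPU_FSMC_ADDRESS_SETUP  2u
#define GPU_FSMC_DATA_SETUP     5u

/* Compile-time check that every plane slot fits in the video RAM. */
static_assert(GPU_PLANE_ADDR(GPU_PLANE_COUNT) <= GPU_VRAM_SIZE,
              "display planes must fit in the Cellular RAM");
```

Changing `GPU_SCREEN_WIDTH` and `GPU_SCREEN_HEIGHT` updates `GPU_PLANE_SIZE` automatically, which is the property the specification macros are meant to provide.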
Following the same idea as the drivers created for the STM32, all the structures of this API are designed to make the primitives and data storage much easier to use. For instance, it is important to keep track of the RAM address where an image is stored. The address of an image in the FPGA can change after an operation such as a bitblit, and if the original address is not stored somewhere, the user will lose the image.
- Image Structure:
It is used for every image created. The user specifies its name, size and address. After that, we just pass this structure as a parameter to a move-type primitive for the operation to take place on the image. The update of these parameters is included in the operations.
- Color Structure:
It is very useful for the functions that use colours. It avoids passing the 3 primary colours and the alpha transparency level at each call.
- Display Plane Structure:
Our GPU can handle up to 4 planes. That is why we created 4 plane-type structures, one for each display plane, whose references are declared as global variables. There is no need to create a new display plane on this FPGA, because 4 is the maximum. This structure contains all the data required to configure a layer: width, length, RAM address, and scrolling over X and over Y.
- ConfPlane Structure:
This is the configuration structure for the 4 layers. Only one instance of this structure is declared in the program, with a reference as a global variable. It lets the user choose which layer to activate, enable or disable transparency, launch a test procedure for the communication, and activate the 4 display planes or
These 4 structures are not meant to be modified directly. It is strongly advised to use the functions associated with each structure to initialise or modify them. The data written in these structures is only a reflection of what is inside the FPGA, so changing the parameters only on the STM32 is useless. For example, if we manually modify the size of a layer, it will not change on the GPU, because no command is sent. The associated functions also handle the communication between the two boards.
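The structures and one associated function could be sketched as follows. The field names follow the descriptions above. The type names, the register layout (`GPU_REG_PLANE_SCROLL`) and the packing of X and Y into one 32-bit register are hypothetical.

The sketch illustrates the rule stated above: the function first sends the command to the GPU and only then updates the copy kept on the STM32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, alpha; } gpu_color;   /* 4-bit components */

typedef struct {
    const char *name;
    uint16_t    width, height;
    uint32_t    address;          /* current location in video RAM */
} gpu_image;

typedef struct {
    uint16_t width, height;
    uint32_t address;             /* RAM address of the layer */
    int16_t  scroll_x, scroll_y;
    uint8_t  index;               /* 0..3 */
} gpu_plane;

typedef struct {
    uint8_t enabled_mask;         /* one bit per layer */
    uint8_t transparency_on;
    uint8_t test_mode;
} gpu_conf_plane;

/* Low-level register write, e.g. the three-transaction FSMC protocol. */
typedef void (*gpu_reg_writer)(uint16_t reg, uint32_t value);

/* Hypothetical register map: 4 registers per plane, scroll in the last one. */
#define GPU_REG_PLANE_BASE       0x0020u
#define GPU_REG_PLANE_SCROLL(i)  (GPU_REG_PLANE_BASE + (i) * 4u + 3u)

/* Sends the new scroll values to the GPU, then updates the STM32 copy. */
static void gpu_plane_set_scroll(gpu_reg_writer write_reg, gpu_plane *plane,
                                 int16_t sx, int16_t sy)
{
    assert(write_reg != NULL && plane != NULL && plane->index < 4u);
    /* Assumed packing: X in the low half-word, Y in the high half-word. */
    uint32_t packed = (uint32_t)(uint16_t)sx | ((uint32_t)(uint16_t)sy << 16);
    write_reg((uint16_t)GPU_REG_PLANE_SCROLL(plane->index), packed);
    plane->scroll_x = sx;
    plane->scroll_y = sy;
}

/* PC-side mock that remembers the last register write. */
static uint16_t last_reg;
static uint32_t last_val;
static void mock_write_reg(uint16_t reg, uint32_t value) { last_reg = reg; last_val = value; }
```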
This layer is used to configure the different peripherals used for the communication between the two boards. For example, it initialises the FSMC so that it behaves as described in Section 2.b. A GPIO is also initialised to act as the BUSY signal used in the MCU interface.
This layer also contains the basic functions used to read and write at the address associated with the pins connected to the FPGA. Their parameters are the address of the register to modify in the FPGA and the data to write into that register.
These driver functions are not meant to be used directly, since they are already called by the service layer. Nevertheless, in some cases the user can access them to change the configuration of the software.
The service functions are the main functions of our graphics card. This layer contains all the functions needed to execute the different GPU operations.
The details of these functions are available in the Doxygen documentation section. Note that it is possible to add an over-layer that performs more complex operations using the original functions of the service layer. For example, one can create a function that handles an animation using the move operation.
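As an example of such an over-layer, the sketch below steps through the frames of a sprite animation by calling a move primitive. `gpu_move_fn`, `gpu_animation` and `gpu_animation_step` are hypothetical names. The frames are assumed to be stored one after another in video RAM, at 2 bytes per pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical move (bitblit) primitive from the service layer. */
typedef void (*gpu_move_fn)(uint32_t src, uint32_t dst, uint16_t w, uint16_t h);

typedef struct {
    uint32_t sheet_addr;   /* address of frame 0 in video RAM */
    uint16_t frame_w, frame_h;
    uint8_t  frame_count;
    uint8_t  current;      /* index of the frame currently displayed */
} gpu_animation;

/* Advances to the next frame (looping) and copies it to dst. */
static void gpu_animation_step(gpu_animation *a, uint32_t dst, gpu_move_fn move)
{
    assert(a != NULL && move != NULL && a->frame_count > 0u);
    a->current = (uint8_t)((a->current + 1u) % a->frame_count);
    /* Frames stored back to back, 2 bytes per pixel. */
    uint32_t frame_bytes = (uint32_t)a->frame_w * a->frame_h * 2u;
    move(a->sheet_addr + a->current * frame_bytes, dst, a->frame_w, a->frame_h);
}

/* PC-side mock that records the source address of the last move. */
static uint32_t last_src;
static void mock_move(uint32_t src, uint32_t dst, uint16_t w, uint16_t h)
{
    (void)dst; (void)w; (void)h;
    last_src = src;
}
```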
The high-level audio API and the real-time operating system are not yet implemented.