What you need to start programming in RISC-V assembler

  • Environment: Jupiter to practice ASM, a riscv32-none-elf toolchain to compile, and GHDL+GtkWave to simulate.
  • Master loops, conditionals, and functions following the RISC-V ABI, with proper stack handling.
  • ecall depends on the environment: Jupiter (simple service codes in a0) vs. Linux (arguments in a0..a2, syscall number in a7).
  • Take the leap: compile C/C++ to a binary, generate a ROM image, and run it on an RV32I CPU in an FPGA.


If you're curious about low-level programming and want to learn assembly on a modern architecture, RISC-V is one of the best entry points. This open ISA, with strong traction in industry and academia, lets you practice at every level: simple simulators, complete toolchains that compile C/C++ so you can examine the generated ASM, and your own code running on an FPGA.

In this practical guide I walk you through, step by step and with a very down-to-earth approach, what you need to start programming in RISC-V assembler: the tools, the workflow, key examples (conditionals, loops, functions, system calls), typical lab exercises, and, if you're up for it, a look at how an RV32I CPU is implemented and how to run your own binary on an FPGA-synthesized core.

What is RISC-V assembler and how does it relate to machine language?

RISC-V defines an open instruction set architecture (ISA). The RV32I base set comprises 40 instructions, is highly orthogonal, and is easy to implement. Assembler (ASM) is a low-level language that uses mnemonics such as add, sub, lw, sw, jal, etc., which map directly onto that ISA. Machine code is the bit patterns the CPU actually decodes; assembler is its human-readable representation, closer to the hardware than any high-level language.
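To make the link between a mnemonic and its bits concrete, here is a minimal sketch in C that packs the fields of two RV32I instructions into 32-bit words. The function names are illustrative, not part of any toolchain; the field layouts and opcodes (0x33 for register-register operations, 0x13 for immediate operations) follow the RV32I base encoding.

```c
#include <assert.h>
#include <stdint.h>

/* R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode */
uint32_t encode_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    assert(rs1 < 32 && rs2 < 32 && rd < 32);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode */
uint32_t encode_i(int32_t imm, uint32_t rs1, uint32_t funct3,
                  uint32_t rd, uint32_t opcode)
{
    assert(imm >= -2048 && imm <= 2047);   /* 12-bit signed immediate */
    assert(rs1 < 32 && rd < 32);
    return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* add rd, rs1, rs2 */
uint32_t encode_add(uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    return encode_r(0x00, rs2, rs1, 0x0, rd, 0x33);
}

/* addi rd, rs1, imm */
uint32_t encode_addi(uint32_t rd, uint32_t rs1, int32_t imm)
{
    return encode_i(imm, rs1, 0x0, rd, 0x13);
}
```

For instance, encode_add(28, 6, 7) corresponds to "add t3, t1, t2" (t3 is x28, t1 is x6, t2 is x7), and encode_addi(2, 2, -4) to "addi sp, sp, -4". Comparing the results with an objdump listing is a good way to check your understanding.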

If you're coming from C, you'll notice that ASM doesn't run as is: it must be assembled and linked to produce a binary. In return, it allows you to control registers, addressing modes, and system calls with surgical precision. And if you work with a teaching simulator, you'll see "ecall" as an input/output and termination mechanism, with specific conventions depending on the environment (e.g., Jupiter vs. Linux).

Tools and environment: simulators, toolchain and FPGA

For a quick start, the Jupiter graphical simulator is ideal. It is an assembler/simulator designed for teaching, inspired by SPIM/MARS/VENUS and used in university courses. With it, you can write, assemble, and run RV32I programs without configuring an entire toolchain from scratch.

If you want to go a step further, you will need a bare-metal toolchain: riscv32-none-elf (GCC/LLVM) to compile C/C++ to RISC-V binaries, plus utilities like objdump for disassembly. For hardware simulation, GHDL lets you compile VHDL, execute it, and dump signals into a .ghw file for inspection with GtkWave. And, if you're up for real hardware, you can synthesize an RV32I CPU into an FPGA with vendor environments (e.g., Intel Quartus) or free toolchains.

Getting Started with Jupiter: Basic Flow and Assembler Rules

Jupiter flattens the learning curve. You create and edit files in the Editor tab, and every program starts at the global __start label. Make sure you declare it with a .globl directive (yes, it's .globl, not .global). Labels end with a colon, and comments can begin with # or ;.

A couple of useful rules of the environment: write a single instruction per line and, when you're ready, save the file and press F3 to assemble and run it. Programs must end with an exit ecall; in Jupiter, loading 10 into a0 before ecall signals the end of the program, similar to an "exit".

A minimal ASM skeleton for Jupiter might look like this, with a clear entry point and termination by ecall. It is the basis for the rest of the exercises.

.text
.globl __start
__start:
  li a0, 10     # code 10: exit
  ecall         # end the program

Calling conventions (ABI) and stack management

Writing functions in assembler means respecting the calling convention: arguments arrive in a0..a7 and the result is returned in a0. A function that makes further calls must first save its return address (ra), and any function that uses the saved registers (s0..s11) must restore them before returning. The stack (sp) is your friend here: reserve space on entry and release it on exit.

Some instructions you'll use all the time: li and la to load immediates and addresses, add/addi for addition, lw/sw for memory access, unconditional jumps j/jal and returns jr ra, as well as conditionals like beq/bne/bge. Here's a quick reminder with typical examples:

# load an immediate and an address
li t1, 5
la t1, foo

# arithmetic and stack pointer update
add t3, t1, t2
addi sp, sp, -8   # reserve 8 bytes on the stack
sw ra, 4(sp)      # save ra
sw s0, 0(sp)      # save s0

# memory access with base+offset
lw t1, 8(sp)
sw a0, 8(sp)

# jumps and comparisons
beq t1, t2, etiqueta
j etiqueta
jal funcion
jr ra

A classic loop in RISC-V can be structured clearly by separating condition, body, and step. In Jupiter, you can also print values with ecall, depending on the code you load into a0:

.text
.globl __start
__start:
  li t0, 0      # i
  li t1, 10     # max
cond:
  bge t0, t1, endLoop
body:
  mv a1, t0     # pass i in a1
  li a0, 1      # ecall code to print an integer
  ecall
step:
  addi t0, t0, 1
  j cond
endLoop:
  li a0, 10     # ecall code to exit
  ecall

For recursive functions, take care to save and restore ra and any saved registers you use. Factorial is the canonical example: it forces you to think about the stack frame and about returning control to the correct address.

.text
.globl __start
__start:
  li a0, 5          # factorial(5)
  jal factorial
  # ... here you could print a0 ...
  li a0, 10
  ecall

factorial:
  # a0 holds n; ra holds the return address; sp points to the top of the stack
  bne a0, x0, notZero
  li a0, 1          # factorial(0) = 1
  jr ra
notZero:
  addi sp, sp, -8
  sw s0, 0(sp)
  sw ra, 4(sp)
  mv s0, a0
  addi a0, a0, -1
  jal factorial
  mul a0, a0, s0    # mul belongs to the M extension (RV32IM), not base RV32I
  lw s0, 0(sp)
  lw ra, 4(sp)
  addi sp, sp, 8
  jr ra

Input/Output with ecall: Differences between Jupiter and Linux

The ecall instruction is used to invoke services from the environment. In Jupiter, simple codes in a0 (e.g., 1 print integer, 4 print string, 10 exit) control the available operations. In Linux, however, a0..a2 typically contain parameters, a7 the syscall number, and the semantics correspond to kernel calls (write, exit, etc.).

This “Hello World” for Linux illustrates the pattern: you prepare the registers a0..a2 and a7 and you run ecall. Note the .global directive and the _start entry point:

# a0-a2: arguments; a7: syscall number
.global _start
_start:
  addi a0, x0, 1     # 1 = stdout
  la a1, holamundo   # puntero al mensaje
  addi a2, x0, 11    # longitud
  addi a7, x0, 64    # write
  ecall
  addi a0, x0, 0     # return code 0
  addi a7, x0, 93    # exit
  ecall
.data
holamundo: .ascii "Hola mundo\n"

If your goal is to practice control logic, memory, and functions, Jupiter gives you instant feedback, and many labs include an autograder to validate your solution. If you want to practice interacting with a real system, compile for Linux and use kernel syscalls.
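If you come from C, it may help to see the Linux register convention as an ordinary function call. The following is a host-side sketch, not something the kernel or the toolchain provides: ecall_linux_model is a hypothetical name, and it handles only the two services used in the "Hello World" example (64 for write, 93 for exit).

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Syscall numbers used in the Linux example (RISC-V Linux ABI). */
enum { SYS_RV_WRITE = 64, SYS_RV_EXIT = 93 };

/* Hypothetical model of "ecall" under Linux: a7 selects the service,
   a0..a2 carry the arguments, and the result comes back as in a0. */
long ecall_linux_model(long a7, long a0, long a1, long a2)
{
    switch (a7) {
    case SYS_RV_WRITE:
        assert(a2 >= 0);               /* the length must not be negative */
        /* write(fd = a0, buf = a1, count = a2) */
        return (long)write((int)a0, (const void *)(intptr_t)a1, (size_t)a2);
    case SYS_RV_EXIT:
        _exit((int)a0);                /* never returns */
    default:
        return -1;                     /* service not modelled in this sketch */
    }
}
```

Calling ecall_linux_model(64, 1, (long)(intptr_t)"Hola mundo\n", 11) mirrors the first ecall of the assembly listing: file descriptor 1 in a0, the buffer address in a1, and the length in a2.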

Getting started exercises: conditionals, loops, and functions

A classic set of exercises to get started in RISC-V ASM covers three pillars: conditionals, loops, and function calls, with a focus on proper register and stack management:

  • Negative: a function that returns 0 if the number is positive and 1 if it is negative. It receives the argument in a0 and returns the result in a0, without clobbering callee-saved registers.
  • Factor: loop through the divisors of a number, printing them as you go and returning how many there are. You will practice loops, division/remainder, and ecall for printing.
  • Upper: given a pointer to a string, traverse it and convert lowercase letters to uppercase in place. Return the same address; if you advance the pointer during the loop, restore it before returning.

For all three, respect the convention for passing parameters and returning results, and end the program with the exit ecall when you test it on Jupiter. These exercises cover control flow, memory, and stateful functions.
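Before writing the assembler, it can help to pin down the expected behavior in C and use it as a reference when checking your output. This is a sketch under stated assumptions: the function names are illustrative, zero is treated as non-negative (the exercise statement only mentions positive and negative numbers), and Factor is assumed to receive a positive integer.

```c
#include <assert.h>
#include <stdio.h>

/* Negative: 1 if n < 0, otherwise 0 (treating zero as non-negative
   is an assumption of this sketch). */
int negative(int n)
{
    return n < 0 ? 1 : 0;
}

/* Factor: print each divisor of n separated by spaces and
   return how many divisors were found. */
int factor(int n)
{
    assert(n > 0);
    int count = 0;
    for (int d = 1; d <= n; d++) {
        if (n % d == 0) {
            printf("%d ", d);
            count++;
        }
    }
    printf("\n");
    return count;
}

/* Upper: convert a-z to A-Z in place, leave every other character
   untouched, and return the original pointer. */
char *upper(char *s)
{
    for (char *p = s; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'z')
            *p = (char)(*p - ('a' - 'A'));
    }
    return s;
}
```

Note how upper walks the string with a separate pointer p, which is the C counterpart of copying a0 into a temporary register so that the original address is still available for the return value.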

Digging deeper: from the RV32I ISA to a synthesizable CPU

RISC-V stands out for its openness: anyone can implement an RV32I core. There are educational designs that show step by step how to build a base CPU that runs real programs compiled with GCC/LLVM for riscv32-none-elf. The experience teaches you a lot about what happens "under the hood" when you run your assembler.

The typical implementation includes a memory controller that abstracts ROM and RAM, interconnected with the core. The interface of that controller usually has:

  • AddressIn (32 bits): address to access, used for both instruction fetches and data accesses.
  • DataIn (32 bits): data to be written. For halfwords, only the 16 least significant bits are used; for bytes, only the 8 least significant bits. Ignored on reads.
  • WidthIn: access size. 0=byte, 1=halfword (16 bits), 2 or 3=word (32 bits).
  • ExtendSignIn: whether to sign-extend DataOut when reading 8 or 16 bits. Ignored on writes.
  • WEIn: access direction. 0=read, 1=write.
  • StartIn: start strobe; setting it to 1 starts the transaction, synchronized to the clock.

When ReadyOut=1, the operation is complete: on a read, DataOut contains the data (sign-extended if requested); on a write, the data is already in memory. This layer lets you swap internal FPGA RAM, SDRAM, or external PSRAM without touching the core.
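The effect of WidthIn and ExtendSignIn on a read can be modelled in a few lines of C. This is only a behavioral sketch, not the VHDL of any particular design: mc_read and mc_write are hypothetical names, the memory is a plain byte array, and little-endian byte order is assumed, as is standard for RISC-V.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* width follows WidthIn: 0 = byte, 1 = halfword, 2 or 3 = word. */
uint32_t mc_read(const uint8_t *mem, uint32_t addr,
                 unsigned width, bool extend_sign)
{
    assert(width <= 3);
    if (width == 0) {
        uint32_t v = mem[addr];
        if (extend_sign && (v & 0x80u))
            v |= 0xFFFFFF00u;              /* replicate bit 7 upwards */
        return v;
    }
    if (width == 1) {
        uint32_t v = (uint32_t)mem[addr] | ((uint32_t)mem[addr + 1] << 8);
        if (extend_sign && (v & 0x8000u))
            v |= 0xFFFF0000u;              /* replicate bit 15 upwards */
        return v;
    }
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* On writes only the low 8 or 16 bits of data are used for
   byte and halfword accesses, as described for DataIn. */
void mc_write(uint8_t *mem, uint32_t addr, unsigned width, uint32_t data)
{
    assert(width <= 3);
    unsigned nbytes = (width == 0) ? 1u : (width == 1) ? 2u : 4u;
    for (unsigned i = 0; i < nbytes; i++)
        mem[addr + i] = (uint8_t)(data >> (8u * i));
}
```

This is the difference you see between lb and lbu (or lh and lhu) in assembler: the same byte 0x80 reads back as 0xFFFFFF80 with sign extension and as 0x00000080 without it.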

A simple teaching organization defines three VHDL sources: ROM.vhd (4 KB), RAM.vhd (4 KB), and Memory.vhd (8 KB), which integrates both in a contiguous address space (ROM at 0x0000..0x0FFF, RAM at 0x1001..0x1FFF) with a GPIO mapped at 0x1000 (bit 0 wired to a pin). The MemoryController.vhd controller instantiates "Memory" and provides the interface to the core.
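The address map described above amounts to a small decoder. A minimal sketch in C, with the ranges taken from the text and with hypothetical type and function names:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    REGION_ROM,       /* 0x0000..0x0FFF */
    REGION_GPIO,      /* 0x1000, bit 0 wired to a pin */
    REGION_RAM,       /* 0x1001..0x1FFF */
    REGION_UNMAPPED   /* anything above 0x1FFF */
} Region;

/* Classify an address according to the teaching memory map. */
Region decode_region(uint32_t addr)
{
    if (addr <= 0x0FFFu) return REGION_ROM;
    if (addr == 0x1000u) return REGION_GPIO;
    if (addr <= 0x1FFFu) return REGION_RAM;
    return REGION_UNMAPPED;
}
```

In hardware the same decision is typically made by looking at a few address bits; writing it out this way is mainly useful for checking that a linker script and the VHDL agree on where each region starts and ends.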

About the core: the CPU contains 32 registers of 32 bits each (x0..x31), with x0 tied to zero and not writable. In VHDL it is common to model them with arrays and generate blocks, to avoid replicating logic by hand, plus a 5-to-32 decoder that selects which register receives the ALU output.

The ALU is implemented combinationally with a selector (ALUSel) for operations such as addition, subtraction, XOR, OR, AND, shifts (SLL, SRL, SRA), and comparisons (LT, LTU, EQ, GE, GEU, NE). To save LUTs in FPGAs, a popular technique is to implement 1-bit shifts and repeat them for N cycles using the state machine; this increases latency but reduces resource consumption.
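The multi-cycle shift trick is easy to model in software. The sketch below, in C with hypothetical names, shifts one bit per loop iteration the way an FSM with a down-counter would, and reports how many iterations (clock cycles in the hardware analogy) were needed. It assumes that only the low 5 bits of the shift amount are used, as RV32I specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SHIFT_SLL, SHIFT_SRL, SHIFT_SRA } ShiftOp;

/* Shift value by amount positions, one bit per step.
   If cycles is not NULL it receives the number of steps taken. */
uint32_t shift_multicycle(ShiftOp op, uint32_t value,
                          unsigned amount, unsigned *cycles)
{
    unsigned counter = amount & 31u;   /* RV32I uses the low 5 bits */
    unsigned steps = 0;

    while (counter != 0) {
        switch (op) {
        case SHIFT_SLL:
            value <<= 1;
            break;
        case SHIFT_SRL:
            value >>= 1;
            break;
        case SHIFT_SRA:
            /* keep the sign bit while shifting right */
            value = (value >> 1) | (value & 0x80000000u);
            break;
        }
        counter--;
        steps++;
    }
    if (cycles != NULL)
        *cycles = steps;
    return value;
}
```

The cost is visible in the step count: a shift by 31 takes 31 iterations, whereas a barrel shifter would finish in one cycle at the price of many more LUTs.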

Control is organized around multiplexers for the ALU inputs (ALUIn1/2 and ALUSel), destination register selection (RegSelForALUOut), signals to the memory controller (MCWidthIn, MCAddressIn, MCStartIn, MCWEIn, MCExtendSignIn, MCDataIn), and the special registers PC, IR, and a Counter for counting shifts. All of this is driven by a state machine with ~23 states.

A key concept in that FSM is "delayed loading": the effect of selecting a MUX input materializes at the next clock edge. For example, when loading IR with the instruction arriving from memory, the sequence goes through the fetch states (launching a read at address PC), waits for ReadyOut, moves DataOut to IR, and, in the next cycle, decodes and executes.

The typical fetch path: on reset you force PC=RESET_VECTOR (0x00000000), then you configure the controller to read 4 bytes at address PC, wait for ReadyOut, and load IR. From there, different states handle single-cycle ALU operations, multi-cycle shifts, loads/stores, branches, jumps, and "specials" (a teaching implementation can make ebreak halt the processor on purpose).
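The fetch sequence can be sketched as a tiny state machine in C. This is a deliberately reduced model with four hypothetical states, far fewer than the ~23 of the real FSM, and the names Core, FetchState, and core_clock are assumptions made for illustration. Each call to core_clock stands for one clock edge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RESET_VECTOR 0x00000000u

typedef enum { ST_RESET, ST_FETCH_START, ST_FETCH_WAIT, ST_DECODE } FetchState;

typedef struct {
    FetchState state;
    uint32_t pc;
    uint32_t ir;
    bool mc_start;        /* models MCStartIn */
    uint32_t mc_address;  /* models MCAddressIn */
} Core;

/* Advance the model by one clock edge. ready_out and data_out are the
   memory controller outputs as sampled at this edge. */
void core_clock(Core *c, bool ready_out, uint32_t data_out)
{
    assert(c != 0);
    switch (c->state) {
    case ST_RESET:
        c->pc = RESET_VECTOR;
        c->mc_start = false;
        c->state = ST_FETCH_START;
        break;
    case ST_FETCH_START:
        c->mc_address = c->pc;    /* request a 4-byte read at PC */
        c->mc_start = true;
        c->state = ST_FETCH_WAIT;
        break;
    case ST_FETCH_WAIT:
        c->mc_start = false;
        if (ready_out) {
            c->ir = data_out;     /* IR captures the instruction */
            c->state = ST_DECODE;
        }
        break;
    case ST_DECODE:
        /* decode/execute would go here; this sketch just moves on */
        c->pc += 4;
        c->state = ST_FETCH_START;
        break;
    }
}
```

The model stays in ST_FETCH_WAIT for as many edges as the memory needs, which is the same behavior you observe in GtkWave when ReadyOut takes several cycles to rise.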

Compile real code and run it on your RISC-V

A very educational "proof of concept" route is to compile a C/C++ program with the riscv32-none-elf cross compiler, generate the binary and dump it to a VHDL ROM. Then you simulate in GHDL and analyze signals in GtkWave; if everything goes well, you synthesize in an FPGA and see the system running in silicon.

First, a linker script adapted to your map: ROM from 0x00000000 to 0x00000FFF, GPIO at 0x00001000 and RAM from 0x00001001 to 0x00001FFF. For simplicity, you can put .text (including a .startup section) in ROM and .data in RAM, leaving out the data initialization if you want to keep the first version shorter.

With that map, a minimal bootstrap routine sets the stack pointer to the end of RAM and calls main. It is marked "naked" and placed in the .startup section so that it lands at RESET_VECTOR. After compiling, objdump lets you see the actual ASM your CPU will execute (lui/addi to build sp, jal to main, etc.).

A classic blinker example toggles bit 0 of the mapped GPIO: use a short wait while debugging in the simulator (GHDL+GtkWave) and, on real hardware, increase the count so that the blinking is visible. The Makefile can produce a .bin, and a script converts that binary into a VHDL ROM initialization; once it is integrated, you compile all the VHDL, simulate, and then synthesize.

This teaching approach works even on older FPGAs (e.g., an Intel Cyclone II), where the internal RAM is inferred using the recommended template and the design can take up around 66% of the device's resources. The pedagogical benefit is enormous: you see how PC advances, how reads are triggered (MCStartIn), how ReadyOut validates data, how IR captures instructions, and how each branch or jump propagates through the FSM.
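The lui/addi pair that builds sp follows a fixed arithmetic rule, because addi sign-extends its 12-bit immediate. A sketch of that split in C, with hypothetical names (HiLo, split_hi_lo):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t hi20;   /* immediate for lui (upper 20 bits) */
    int32_t  lo12;   /* immediate for addi, in -2048..2047 */
} HiLo;

/* Split a 32-bit constant into the lui/addi immediates that rebuild it. */
HiLo split_hi_lo(uint32_t value)
{
    HiLo r;
    /* Adding 0x800 rounds the upper part up when bit 11 is set, which
       compensates for addi sign-extending its 12-bit immediate. */
    r.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    r.lo12 = (int32_t)(value - (r.hi20 << 12));
    assert(r.lo12 >= -2048 && r.lo12 <= 2047);
    return r;
}
```

For a stack pointer of 0x1FFC this yields hi20 = 0x2 and lo12 = -4, that is, "lui x2, 0x2" followed by "addi x2, x2, -4".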
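A blinker along these lines can be written so that the same function runs on the target and on your desktop. The sketch below is an illustration, not the code of any specific project: blink and busy_wait are hypothetical names, and the GPIO is accessed as a single byte because the memory map places it at one address (0x1000).

```c
#include <assert.h>
#include <stdint.h>

/* GPIO address from the memory map; only meaningful on the target. */
#define GPIO_ADDR 0x00001000u

/* Crude delay: the volatile counter keeps the loop from being optimized away. */
static void busy_wait(uint32_t count)
{
    for (volatile uint32_t i = 0; i < count; i++) {
        /* nothing to do */
    }
}

/* Toggle bit 0 of the GPIO register a given number of times. */
void blink(volatile uint8_t *gpio, uint32_t toggles, uint32_t delay)
{
    assert(gpio != 0);
    for (uint32_t n = 0; n < toggles; n++) {
        *gpio = (uint8_t)(*gpio ^ 1u);   /* flip the bit wired to the pin */
        busy_wait(delay);
    }
}
```

On the target you would pass (volatile uint8_t *)GPIO_ADDR and loop forever; on a desktop you can pass the address of a local variable to check the toggling logic, using a small delay for simulation and a much larger one for real hardware.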

Readings, Practices, and Autograder: A Roadmap

In academic settings, it is common to have clear objectives: practice conditionals and loops, write functions that respect the calling convention, and manage memory. The guides usually provide templates, a simulator (Jupiter), installation instructions, and an autograder for grading.

To prepare your environment, accept the assignment in GitHub Classroom if prompted, clone the repository, and open Jupiter. Remember that __start must be global, that comments can start with # or ;, that there's one instruction per line, and that you must end with ecall (code 10 in a0). Assemble with F3 and run the tests. If Jupiter doesn't start, the classic remedy is to restart the machine.

Regarding the expected format of each exercise, many guides include screenshots and spell out the details. For example, Factor prints divisors separated by spaces and returns the count; Upper should loop through the string and transform only lowercase letters to uppercase, without touching spaces, digits, or punctuation marks, and return the original pointer.

The evaluation usually distributes points per series (10/40/50), and you can run a check to see the autograder score. When you're satisfied, do add/commit/push and submit the repo URL wherever indicated. This lifecycle discipline gets you used to rigorous validation and delivery.

More exercises to strengthen: Fibonacci, Hanoi, and keyboard reading

Once you've got the basics down, work on three additional classics: fibonacci.s, hanoi.s, and syscall.s (or another variant that reads from the keyboard and repeats a string).

  • Fibonacci: you can make it recursive or iterative. If you make it recursive, be careful with the cost and with preserving ra/s0; the iterative version exercises loops and additions.
  • Hanoi: translate the recursive function to ASM. Preserve context and arguments between calls with a disciplined stack frame. Print the "origin → destination" moves with ecall.
  • Read and Repeat: read an integer and a string, and print the string N times. On Jupiter, use the appropriate ecall codes available in your practice; on Linux, prepare a7 and a0..a2 for read/write.

These exercises consolidate parameter passing, loops, and I/O. They force you to think about the interface with the environment (Jupiter vs Linux), and structure the ASM to be readable and maintainable.
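As with the first set of exercises, a C reference makes it easier to check your assembler. The following sketch uses illustrative names and assumes pegs are labelled with single characters; the output format of the moves is also an assumption and should be adapted to whatever your lab guide specifies.

```c
#include <assert.h>
#include <stdio.h>

/* Iterative Fibonacci: fib(0) = 0, fib(1) = 1. */
unsigned fib_iter(unsigned n)
{
    assert(n <= 47);              /* fib(48) no longer fits in 32 bits */
    unsigned a = 0, b = 1;
    for (unsigned i = 0; i < n; i++) {
        unsigned next = a + b;
        a = b;
        b = next;
    }
    return a;
}

/* Move n discs from peg `from` to peg `to` using `via` as the spare.
   Prints every move and returns the total number of moves. */
unsigned hanoi(unsigned n, char from, char to, char via)
{
    if (n == 0)
        return 0;
    unsigned moves = hanoi(n - 1, from, via, to);
    printf("%c -> %c\n", from, to);
    moves += 1;
    moves += hanoi(n - 1, via, to, from);
    return moves;
}
```

In the assembler version of hanoi, n and the three pegs must survive the first recursive call because they are needed for the print and for the second call, which is exactly why they end up in saved registers or on the stack.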

Fine implementation details: registers, ALUs, and states

Returning to the RV32I educational core, it's worth reviewing several fine details that align what you see when programming with how the hardware executes: the ALU operation table selected by ALUSel (ADD, SUB, XOR, OR, AND, SLL, SRL, SRA, signed and unsigned comparisons), the “identity” as the default case, and the “trick” of using a counter to accumulate multi-cycle shifts.
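The ALU operation table can be sketched as a single switch in C. The enumeration names and their order are assumptions made for this illustration (the text does not give the ALUSel encodings), and the identity case is assumed to pass the first operand through. Shifts are left out because this design handles them one bit at a time with the counter.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    ALU_ADD, ALU_SUB, ALU_XOR, ALU_OR, ALU_AND,
    ALU_LT, ALU_LTU, ALU_EQ, ALU_NE, ALU_GE, ALU_GEU,
    ALU_PASS            /* "identity": default case */
} AluSel;

/* Combinational ALU model: comparisons return 1 or 0. */
uint32_t alu(AluSel sel, uint32_t a, uint32_t b)
{
    switch (sel) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_XOR: return a ^ b;
    case ALU_OR:  return a | b;
    case ALU_AND: return a & b;
    case ALU_LT:  return (int32_t)a <  (int32_t)b;   /* signed */
    case ALU_LTU: return a <  b;                     /* unsigned */
    case ALU_EQ:  return a == b;
    case ALU_NE:  return a != b;
    case ALU_GE:  return (int32_t)a >= (int32_t)b;   /* signed */
    case ALU_GEU: return a >= b;                     /* unsigned */
    default:      return a;                          /* identity */
    }
}
```

The signed and unsigned variants differ only in how the same bits are interpreted: 0xFFFFFFFF is -1 for ALU_LT but the largest possible value for ALU_LTU, which is the distinction between blt and bltu in assembler.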

Register logic with generate produces a 5→32 decoder, and the case RegSelForALUOut=00000 does nothing (x0 is not writable, it always equals zero). The PC, IR and Counter have their own MUXs, orchestrated by the FSM: from reset, fetch, decode/execute (one-cycle ALUs or shift loops), loads/stores, conditional branches, jal/jalr, and specials like ebreak.
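The behavior of the register file, including the special treatment of x0, fits in a few lines of C. RegFile, reg_write, and reg_read are hypothetical names for this sketch; the selector plays the role of RegSelForALUOut.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t x[32];   /* x[0] is never written */
} RegFile;

/* Write value into register sel; selecting x0 does nothing. */
void reg_write(RegFile *rf, unsigned sel, uint32_t value)
{
    assert(sel < 32);
    if (sel != 0)
        rf->x[sel] = value;
}

/* Read register sel; x0 always reads as zero. */
uint32_t reg_read(const RegFile *rf, unsigned sel)
{
    assert(sel < 32);
    return (sel == 0) ? 0u : rf->x[sel];
}
```

This is why idioms like "addi a0, x0, 1" work as a load-immediate and why writing to x0 is a convenient way to discard a result.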

In data memory access, MUX→Controller coordination is essential: MCWidthIn (8/16/32 bits), MCWEIn (R/W), MCAddressIn (from registers or PC), MCExtendSignIn (for signed LB/LH) and MCStartIn. Only when ReadyOut=1 should you capture DataOut and advance state. This aligns your ASM programmer mindset with the temporal hardware reality.

All of this connects directly with what you observe in the simulation: every time PC advances, an instruction read is triggered, MCReadyOut tells you that you can load IR, and from there the instruction takes effect (e.g., «lui x2,0x2» followed by «addi x2,x2,-4» to prepare sp, «jal x1, …» to call main). Seeing this in GtkWave is very addictive.

Resources, dependencies and final tips

To reproduce this experience you need a few dependencies: GHDL for compiling VHDL and GtkWave for analyzing signals. For the cross-compiler, any GCC riscv32-none-elf will do (you can build your own or install a pre-built one). To port the core to an FPGA, use your vendor's environment (e.g., Quartus on Intel/Altera) or free toolchains compatible with your device.

In addition, it is worth reading RISC-V guides and notes (e.g., how-tos and green cards), consulting programming books, and practicing with labs that include Jupiter and an autograder. Maintain a routine: plan, implement, test with edge cases, and then integrate into larger projects (like the blinker on FPGA).

With all this information, you already have the essentials to start: why assembler is used versus machine code, how to set up an environment with Jupiter or Linux, the loop patterns, conditionals, and functions with correct stack handling, and a window into the hardware implementation to better understand what happens when you execute each instruction.

If learning by doing is your thing, start with Negative, Factor, and Upper, then move on to Fibonacci/Hanoi and a keyboard-reading program. When you're comfortable, compile a simple C++ program, dump the ROM into VHDL, simulate in GHDL, and then jump to FPGA. It is a journey from simple to complex in which each piece fits with the next, and the satisfaction of seeing your own code driving a GPIO or blinking an LED is priceless.
