A modern-day ZX Spectrum OS rewritten from scratch in ARM assembly (aarch64) to run natively on RPi 3B.
If you are not familiar with the ZX Spectrum computer from the 1980s, there are some excellent emulators around, such as Retro Virtual Machine, and even some very good online emulators that you can run directly in your browser.
This repo just contains notes about the development of the project. The actual code lives at https://github.com/spectrum4/spectrum4.
Do you ever miss single-tasking computing, when you could write and run programs on your computer, and the computer didn't do anything else at the same time? You could simply and easily control all the peripherals of the computer, such as the screen and the sound chip, and you didn't have to worry if this might conflict with another program. The whole computer belonged to your program, and your program only! There were far fewer layers in the software stack, and it was much easier to get started as a programmer.
The Spectrum +2A/+3 had four 16K ROMs, which contained the Operating System, written in Z80 assembly. This project aims to adapt the original Spectrum +2A/+3 ROMs to ARM assembly, to run natively on the Raspberry Pi 3B. Plenty of ZX Spectrum emulators exist, but they cannot take advantage of the superior hardware that is available today. So rather than writing yet another emulator, I wanted to rewrite the ROM as if Sinclair or Amstrad were releasing the latest and greatest Spectrum onto the market, running on a Raspberry Pi. In other words, a Spectrum whose Operating System is in keeping with the original, but that takes advantage of today's improved hardware, such as higher screen resolution, improved sound capabilities, and more memory.
I would like the ZX Spectrum +4 to support saving to and loading from cassette tape, like the original versions, mostly for nostalgia. I anticipate adding support for saving to and loading from SD card / USB storage too, but I'm keen to recreate the original tape loading/saving experience with the stripy screen borders.
This is the sequence of steps I intend to follow:
- RPi 3B kernel bootloading over home network
- Document compiled assembly of Zoltan Baldaszti's RPi 3B framebuffer tutorial
- Render updated ZX Spectrum +2/+3 main menu as a simple bitmap
- Understand how interrupts are configured and work on the ZX Spectrum
- Get a USB keyboard example working on bare metal
- Get USB keyboard input working from assembly program
- Decide on a toolchain to use (GNU binutils/FASMARM/...)
- Understand how interrupts are configured and work on the RPi
- Port some initial routines that don't generate or interpret sound
- Get sound output working via headphone socket
- Get sound output working via HDMI
- Get sound output working via USB
- Write assembly to generate sound
- Get sound input working via USB
- Write assembly that can sample/interpret audio from real spectrum tape
- Get JTAG working
- Port ZX Spectrum 48K ROM
- Port remaining ZX Spectrum +2/+3 ROMs
- Render graphics using GPU rather than writing directly to framebuffer with CPU
- Try to rewrite the code to use all four cores
- Rewrite the USB driver in assembly
- Write custom gpu firmware in VC4 assembly
My first goal was to implement RPi kernel bootloading over my home network, so that I could develop on my Mac, restart my RPi 3B, and have my changes reloaded, without needing to physically remove and reinsert the SD card. That is now done; see RPi 3B Bootloading below.
Zoltan Baldaszti has kindly created an RPi 3B bare metal tutorial which contains an exercise that paints Homer Simpson to the display. It uses the aarch64 instruction set, so it should offer a good introduction to rendering pixels to a display, which will be key for the screen drawing activities we will be faced with later.
I have built his example kernel, disassembled it, and am going through the disassembly line by line, to understand how it works. I have begun documenting the disassembled code below. I have been referring to the following useful documents to guide me:
- ARM Cortex-A53 MPCore Processor Technical Reference Manual
- The ARMv8 Instruction Set Overview
- The A64 Instruction Set
- ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile
In addition, chapters 16 to 20 of The armasm User Guide have provided a more detailed and complete reference of available aarch64 instructions. The armasm assembly syntax appears to be mostly compatible with the disassembly generated by the GNU binutils objdump utility, which is the tool I used for generating the disassembled code below.
In particular, instructions are provided in alphabetical order:
- General aarch64 instructions
- Data transfer aarch64 instructions
- Floating-point aarch64 instructions
- SIMD scalar aarch64 instructions
- SIMD vector aarch64 instructions
Alternatively, The A64 Instruction Set Reference seems to provide similar information.
Here is the commented disassembly so far:
kernel8.elf: file format elf64-littleaarch64
Disassembly of section .text:
0000000000080000 <_start>:
; This section will trigger cores 1,2,3 to sleep, and allow core 0 to continue running
; Read MPIDR_EL1 register into X1.
; This is the Multiprocessor Affinity Register which tells us which core is
; executing this instruction. All cores start executing code at 0x80000
; when the kernel is booted, but we are going to deactivate three of them
; so that only one of them renders Homer Simpson.
;
; See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500j/BABHBJCI.html
80000: d53800a1 mrs x1, mpidr_el1
; We can ignore 62 of the 64 bits, and just look at bits 0 and 1 to get a
; core number between 0 and 3. So logical AND with #3 and get the result back
; in x1 register.
80004: 92400421 and x1, x1, #0x3
; Jump to 0x80014 if current core is core 0
80008: b4000061 cbz x1, 80014 <_start+0x14>
; The next two instructions put the remaining cores (1, 2, 3) in an infinite
; loop of waiting for an event, and if they are woken up, going back to
; sleep again.
; Wait for an event (wfe = Wait For Event)
8000c: d503205f wfe
; Return to 0x8000c
80010: 17ffffff b 8000c <_start+0xc>
; This locates stack pointer at 0x80000, which is where code starts, so presumably
; the stack grows downwards, with the first entry's last byte located at 0x7ffff
; since executable code starts at 0x80000.
; Load x1 with contents of 0x80040, which is 0x0000000000080000 = 0x80000
80014: 58000161 ldr x1, 80040 <_start+0x40>
; Set the stack pointer to 0x80000. This is the start address of this assembly,
; but that does not matter since the stack grows downwards, immediately before
; this memory location, so memory location 0x80000 is not used by the stack.
80018: 9100003f mov sp, x1
; Zero out 176 bytes of space (22 double words) for statically allocated
; variables, at 0x866f0, i.e. 0x866f0 - 0x8679f
;
; 0x866f0-0x866f7: (8 bytes): framebuffer address
; 0x866f8-0x866fb: (4 bytes): framebuffer pitch (bytes per line)
; 0x866fc-0x866ff: (4 bytes): screen height
; 0x86700-0x86703: (4 bytes): screen width
; 0x86704-0x8670f: (12 bytes): < --- purpose unknown --- >
; 0x86710-0x8679b: (140 bytes): framebuffer mailbox request
; 0x8679c-0x8679f: (4 bytes): < --- purpose unknown --- >
; Load x1 with contents of 0x80048, which is 0x00000000000866f0 = 0x866f0
; This address maps to the bss section where statically allocated variables
; will be stored.
8001c: 58000161 ldr x1, 80048 <_start+0x48>
; Load w2 (32 bit register) with contents of 0x8003c, which is 0x00000016 = 0x16 (=22)
80020: 180000e2 ldr w2, 8003c <_start+0x3c>
; Jump to 0x80034 if w2 == 0
80024: 34000082 cbz w2, 80034 <_start+0x34>
; Store the 64 bit zero register in (address stored in x1 register)
; and then add 8 to x1
80028: f800843f str xzr, [x1], #8
; w2--
8002c: 51000442 sub w2, w2, #0x1
; Jump to 0x80024 if w2 != 0
80030: 35ffffa2 cbnz w2, 80024 <_start+0x24>
; Call routine at 0x803d0 (i.e. 'main')
80034: 940000e7 bl 803d0 <main>
; Send core 0 to sleep too - all done!
80038: 17fffff5 b 8000c <_start+0xc>
; Data block here, used by above code
8003c: 00000016 .word 0x00000016
80040: 00080000 .word 0x00080000
80044: 00000000 .word 0x00000000
80048: 000866f0 .word 0x000866f0
8004c: 00000000 .word 0x00000000
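To check my reading of `_start`, here is a minimal C sketch of the two things it does before calling `main`: reduce the affinity register to a core number, and zero a run of double words. The names `core_number` and `zero_words` are made up for illustration, and the MPIDR_EL1 value is passed in as a parameter, since plain C cannot read the real register.

```c
#include <stdint.h>

/* Hypothetical helper: keep only bits 0-1 of an MPIDR_EL1 value,
 * mirroring `and x1, x1, #0x3` at 0x80004. */
static unsigned core_number(uint64_t mpidr)
{
    return (unsigned)(mpidr & 0x3);
}

/* Hypothetical helper: store `count` zero double words starting at `p`,
 * mirroring the loop at 0x80024 - 0x80030. */
static void zero_words(uint64_t *p, uint32_t count)
{
    while (count != 0) {   /* cbz w2 / cbnz w2 */
        *p++ = 0;          /* str xzr, [x1], #8 */
        count--;           /* sub w2, w2, #0x1 */
    }
}
```

With the values from the data block, the real code is effectively doing `zero_words((uint64_t *)0x866f0, 22)`.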
; This function calls `nop` for the number of times specified in w0. Note,
; the compiler has performed a strange "optimisation" - instead of checking if
; w0 == 0 it essentially checks if w0 - 1 + 1 == 0.
0000000000080050 <wait_cycles>:
; Jump to 0x80068 if w0 == 0, i.e. return from function if w0 == 0
80050: 340000c0 cbz w0, 80068 <wait_cycles+0x18>
; w0--
80054: 51000400 sub w0, w0, #0x1
; Do nothing (this nop is the loop body)
80058: d503201f nop
; w0--
8005c: 51000400 sub w0, w0, #0x1
; Compare w0 + 1 to zero
80060: 3100041f cmn w0, #0x1
; If zero flag is not set, jump to 80058
80064: 54ffffa1 b.ne 80058 <wait_cycles+0x8> ; b.any
; Return from function
80068: d65f03c0 ret
; Do nothing - although this is never executed, since no execution
; paths ever reach this address
8006c: d503201f nop
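The slightly odd loop condition in `wait_cycles` is easier to trust after tracing it. This is a small C sketch that follows the instruction sequence step by step and counts how often the `nop` at 0x80058 would run. The function name `count_nops` is my own, and nothing here is timing accurate; it only models the control flow.

```c
#include <stdint.h>

/* Follows <wait_cycles> instruction by instruction and returns how many
 * times the `nop` at 0x80058 would execute for a given w0. */
static uint32_t count_nops(uint32_t w0)
{
    uint32_t nops = 0;
    if (w0 == 0)                          /* cbz w0, 80068 */
        return nops;
    w0 -= 1;                              /* sub w0, w0, #0x1 */
    do {
        nops++;                           /* nop */
        w0 -= 1;                          /* sub w0, w0, #0x1 */
    } while ((uint32_t)(w0 + 1u) != 0);   /* cmn w0, #0x1 ; b.ne 80058 */
    return nops;
}
```

So despite the extra decrement before the loop, the `nop` runs once per requested cycle.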
; This function waits the number of microseconds (10^-6 seconds)
; specified in w0 register. The disassembly is kind of elaborate
; in that division by 1,000 is implemented as multiplication by
; 2^71 / 1,000 (rounded up to an integer) followed by division by 2^71.
;
; The following example parameters are used in the instruction
; comments, to illustrate how the calculations are performed.
;
; Example 1)
; wait time = 50,000 (50 milliseconds)
; clock freq = 250,000,000 (250 MHz),
; start count = 21,600,000,000,000 (uptime 1 day)
; end count = 21,600,012,500,000 (1 day + 50 milliseconds)
0000000000080070 <wait_msec>:
; Store w0 (number of microseconds to wait) in w2, since we are going to use
; x0 to store the physical count of the system counter. Only support 32 bit
; unsigned values, so discard upper 32 bits. Register w0 *is* the lower 32
; bits of x0, similarly w2 is the lower 32 bits of x2.
;
; Example 1)
; w0 = 50,000
; w2 = 50,000
; x2 = 50,000 (since this `mov` sets upper 32 bits of x2 to 0)
80070: 2a0003e2 mov w2, w0
; Move register cntfrq_el0 into x1. This tells us the system counter frequency in Hz.
;
; See https://static.docs.arm.com/ddi0487/da/DDI0487D_a_armv8_arm.pdf?_ga=2.11893491.637195177.1541534329-2041545432.1541106125#E29.AArch64cntfrqel0
;
; Example 1)
; x1 = 250,000,000
80074: d53be001 mrs x1, cntfrq_el0
; Move register cntpct_el0 into x0. This tells us the physical count of the system counter.
;
; See https://static.docs.arm.com/ddi0487/da/DDI0487D_a_armv8_arm.pdf?_ga=2.11893491.637195177.1541534329-2041545432.1541106125#G29.5472014
;
; Example 1)
; x0 = 21,600,000,000,000
80078: d53be020 mrs x0, cntpct_el0
; In order to divide by 1,000 this code first multiplies by 2^71/1,000
; (rounded up to an integer) and then divides by 2^71 by shifting bits 71
; places to the right. In order to multiply by this constant
; (0x20c49ba5e353f7cf), we need to get 0x20c49ba5e353f7cf into a 64 bit
; register (register x3) 16 bits at a time. Here we set bits 0-15.
;
; Note, aarch64 has the `udiv` unsigned divide instruction, but the compiler
; has chosen not to use it. Compilers commonly replace division by a constant
; with a multiply and shift, since integer division is typically slower than
; multiplication, but I haven't taken any benchmarks to confirm this here.
;
; x3 = 0x000000000000f7cf
8007c: d29ef9e3 mov x3, #0xf7cf ; #63439
; Set bits 16-31 of x3 to 0xe353.
;
; x3 = 0x00000000e353f7cf
80080: f2bc6a63 movk x3, #0xe353, lsl #16
; x1 = x1 / 2^3
; = clock freq / 2^3
;
; Example 1)
; x1 = 31,250,000
80084: d343fc21 lsr x1, x1, #3
; Set bits 32-47 of x3 to 0x9ba5.
;
; x3 = 0x00009ba5e353f7cf
80088: f2d374a3 movk x3, #0x9ba5, lsl #32
; Set bits 48-63 of x3 to 0x20c4
;
; x3 = 0x20c49ba5e353f7cf
; = 2361183241434822607
; = 2^71/1,000 rounded up to the next integer
8008c: f2e41883 movk x3, #0x20c4, lsl #48
; x1 = bits 64-127 of (x1 * x3)
; = x1 * x3 / 2^64
; = clock freq * 2^4 / 1,000
;
; Example 1)
; x1 = bits 64-127 of 0x3d09000000000000487ab0
; = 0x3d0900
; = 4,000,000
80090: 9bc37c21 umulh x1, x1, x3
; x1 = x1 / 2^4
; = clock freq / 1,000
;
; Example 1)
; x1 = 250,000
80094: d344fc21 lsr x1, x1, #4
; x1 = x1 * x2
; = x1 * w2
; = clock freq * wait time / 1,000
;
; Example 1)
; x1 = 250,000 * 50,000
; = 12,500,000,000
80098: 9b027c21 mul x1, x1, x2
; x1 = x1 / 2^3
; = clock freq * wait time / (2^3 * 1,000)
;
; Example 1)
; x1 = 1,562,500,000
8009c: d343fc21 lsr x1, x1, #3
; x1 = bits 64-127 of (x1 * x3)
; = x1 * x3 / 2^64
; = clock freq * wait time * 2^71 / (1,000 * 1,000 * 2^(64 + 3))
; = clock freq * wait time * 2^4 / 1,000,000
;
; Example 1)
; x1 = bits 64-127 of 0xbebc200000000000e27f660
; = 0xbebc200
; = 200,000,000
800a0: 9bc37c21 umulh x1, x1, x3
; x0 = x0 + x1 / 2^4
; = start count + (clock freq * wait time / 1,000,000)
;
; Example 1)
; x0 = 21,600,000,000,000 + 12,500,000
; = 21,600,012,500,000
800a4: 8b411000 add x0, x0, x1, lsr #4
; Move register cntpct_el0 into x1 to get the current physical count of the
; system counter.
800a8: d53be021 mrs x1, cntpct_el0
; Compare x0 and x1
800ac: eb01001f cmp x0, x1
; If x0 > x1 jump back to 0x800a8
800b0: 54ffffc8 b.hi 800a8 <wait_msec+0x38> ; b.pmore
; Return from function
800b4: d65f03c0 ret
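; 8 bytes padding so that the next function starts on a 16 byte boundary.
; This looks like a compiler alignment choice rather than an architectural
; requirement (A64 instructions only need 4 byte alignment). The 16 byte rule
; that the architecture does enforce is for the stack pointer. See page 14 of
; https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf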
; "Although not available as a general purpose register, the Stack Pointer must
; be 16-byte aligned at any public interface. It must also be 16-byte aligned at
; any point where it is used to access memory. This is enforced in hardware. Note
; that the alignment check is on the stack pointer and not on the address which
; is actually accessed."
800b8: d503201f nop
800bc: d503201f nop
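To convince myself the multiply-and-shift arithmetic in `wait_msec` really is a division by 1,000, I find it helps to replay it in C. This is only a sketch: `umulh`, `div1000` and `ticks_for_wait` are names I picked, and `umulh` is written with 32 bit halves so that it does not rely on a 128 bit integer type.

```c
#include <stdint.h>

/* Upper 64 bits of the 128 bit product a * b, i.e. what the `umulh`
 * instruction computes. */
static uint64_t umulh(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Sum of the middle terms; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xffffffffu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* The constant assembled in x3 by the mov/movk sequence. */
#define MAGIC_DIV_1000 0x20c49ba5e353f7cfULL

/* Same steps as the disassembly: lsr #3, umulh, lsr #4. */
static uint64_t div1000(uint64_t x)
{
    return umulh(x >> 3, MAGIC_DIV_1000) >> 4;
}

/* Counter ticks to wait for `usec` microseconds at `freq_hz`,
 * following 0x80084 - 0x800a4. */
static uint64_t ticks_for_wait(uint64_t freq_hz, uint32_t usec)
{
    return div1000(div1000(freq_hz) * usec);
}
```

For the values of Example 1, `ticks_for_wait(250000000, 50000)` gives 12,500,000, which matches the difference between the start count and the end count.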
; This function puts the system timer counter in x0. This counter increments
; every microsecond (1,000,000 increments per second, i.e. 1MHz). I'm not sure
; how much error margin there is, and whether the clock has a variable speed or
; a constant speed.
00000000000800c0 <get_system_timer>:
; Put 0x3f003008 in x2
;
; This is the MMIO address of the 'CHI: System Timer Counter Higher 32 bits'
; register of the System Timer Registers on page 172 of
; https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf
; Note, the peripheral base address in BCM2835 is located at 0x7e000000 but is
; located at 0x3f000000 in BCM2837, hence why this is 0x3f003008 and not
; 0x7e003008.
800c0: d2860102 mov x2, #0x3008 ; #12296
800c4: f2a7e002 movk x2, #0x3f00, lsl #16
; Put 0x3f003004 in x3
;
; This is the MMIO address of the 'CLO: System Timer Counter Lower 32 bits'
; register to complement CHI in x2 above.
800c8: d2860083 mov x3, #0x3004 ; #12292
800cc: f2a7e003 movk x3, #0x3f00, lsl #16
; Load w0 (32 bit register) with CHI System Timer register
800d0: b9400040 ldr w0, [x2]
; Load w1 (32 bit register) with CLO System Timer register
800d4: b9400061 ldr w1, [x3]
; Load w4 (32 bit register) with CHI System Timer register, to check if it
; changed in between reading CLO register
800d8: b9400044 ldr w4, [x2]
; Check the two reads of CHI gave consistent results, and if so, jump to
; 0x800ec
800dc: 6b00009f cmp w4, w0
800e0: 54000060 b.eq 800ec <get_system_timer+0x2c> ; b.none
; The consecutive reads of CHI gave different results, so read CHI, CLO again.
; This time we can be confident that CHI didn't change, because that should only
; happen every 2^32 microseconds, i.e. about every 71.6 minutes, and we know it
; only just changed a couple of instructions ago.
800e4: b9400040 ldr w0, [x2]
800e8: b9400061 ldr w1, [x3]
; This sets the upper 32 bits of x1 to zero, which is needed for subsequent
; ORR statement. See
; http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf#I7.5.1043011
;
; "Reads from W registers disregard the higher 32 bits of the corresponding
; X register and leave them unchanged. Writes to W registers set the higher 32
; bits of the X register to zero. That is, writing 0xFFFFFFFF into W0 sets X0 to
; 0x00000000FFFFFFFF."
800ec: 2a0103e1 mov w1, w1
; This composes the CHI and CLO 32 bit values into the single 64 bit
; register x0.
800f0: aa008020 orr x0, x1, x0, lsl #32
; Return from function
800f4: d65f03c0 ret
; Padding to reach 16 byte boundary for function.
800f8: d503201f nop
800fc: d503201f nop
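The read-CHI, read-CLO, re-read-CHI pattern in `get_system_timer` can be sketched in C as follows. The `timer_regs` struct and the function name `read_system_timer` are stand-ins I invented so the logic can be run off-target; on the RPi 3B the two fields would correspond to the registers at 0x3f003004 (CLO) and 0x3f003008 (CHI).

```c
#include <stdint.h>

/* Illustrative stand-in for the two memory-mapped timer registers. */
struct timer_regs {
    volatile uint32_t clo;   /* lower 32 bits of the counter */
    volatile uint32_t chi;   /* upper 32 bits of the counter */
};

static uint64_t read_system_timer(const struct timer_regs *t)
{
    uint32_t hi = t->chi;    /* ldr w0, [x2] */
    uint32_t lo = t->clo;    /* ldr w1, [x3] */
    if (t->chi != hi) {      /* CHI changed between the two reads */
        hi = t->chi;         /* so read both halves again */
        lo = t->clo;
    }
    return ((uint64_t)hi << 32) | lo;   /* orr x0, x1, x0, lsl #32 */
}
```

Without the second read of CHI, a rollover of CLO between the two loads could pair an old upper half with a new lower half.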
; This function waits the number of microseconds (10^-6 seconds)
; specified in w0 register.
0000000000080100 <wait_msec_st>:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; The following block is inlined from function <get_system_timer> above
; (0x800c0 - 0x800f3) but puts the system clock counter in x1 instead of x0.
; See 0x800c0 - 0x800f3 for a commentary of these instructions.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
80100: d2860103 mov x3, #0x3008 ; #12296
80104: f2a7e003 movk x3, #0x3f00, lsl #16
80108: d2860084 mov x4, #0x3004 ; #12292
8010c: f2a7e004 movk x4, #0x3f00, lsl #16
80110: b9400065 ldr w5, [x3]
80114: b9400082 ldr w2, [x4]
80118: b9400061 ldr w1, [x3]
8011c: 6b0100bf cmp w5, w1
80120: 54000060 b.eq 8012c <wait_msec_st+0x2c> ; b.none
80124: b9400061 ldr w1, [x3]
80128: b9400082 ldr w2, [x4]
8012c: 2a0203e2 mov w2, w2
80130: aa018041 orr x1, x2, x1, lsl #32
;;;;;;; end of inlined (duplicated) code from 0x800c0 - 0x800f3 ;;;;;;;
; Jump to 0x8017c if x1 == 0, i.e. return from function if x1 == 0
; This is here in case this code is run in QEMU, where the system
; counter is not available. On a real functioning RPi 3B this
; should never be zero.
80134: b4000241 cbz x1, 8017c <wait_msec_st+0x7c>
; x0 = x1 + w0
; = start microsecond clock counter + microseconds to wait
; = target end counter
;
; For an explanation of 'uxtw' see
; https://static.docs.arm.com/100898/0100/the_a64_Instruction_set_100898_0100.pdf#%5B%7B%22num%22%3A35%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C54%2C629%2C0%5D
80138: 8b204020 add x0, x1, w0, uxtw
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; The following block is inlined again from function <get_system_timer> above
; (0x800c0 - 0x800f3) but puts the system clock counter in x1 instead of x0.
; See 0x800c0 - 0x800f3 for a commentary of these instructions.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
8013c: d2860103 mov x3, #0x3008 ; #12296
80140: f2a7e003 movk x3, #0x3f00, lsl #16
80144: d2860085 mov x5, #0x3004 ; #12292
80148: f2a7e005 movk x5, #0x3f00, lsl #16
8014c: d503201f nop
80150: b9400064 ldr w4, [x3]
80154: b94000a2 ldr w2, [x5]
80158: b9400061 ldr w1, [x3]
8015c: 6b01009f cmp w4, w1
80160: 54000060 b.eq 8016c <wait_msec_st+0x6c> ; b.none
80164: b9400061 ldr w1, [x3]
80168: b94000a2 ldr w2, [x5]
8016c: 2a0203e2 mov w2, w2
80170: aa018041 orr x1, x2, x1, lsl #32
;;;;;;; end of inlined (duplicated) code from 0x800c0 - 0x800f3 ;;;;;;;
; If x1 < x0 (the counter has not yet reached the target), repeat above block,
; but skip initialising x3 and x5 again since they are unchanged, so jump to
; 0x80150 instead of 0x8013c.
80174: eb00003f cmp x1, x0
80178: 54fffec3 b.cc 80150 <wait_msec_st+0x50> ; b.lo, b.ul, b.last
; Return from function, since the clock counter has reached the target
; value.
8017c: d65f03c0 ret
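Stripped of the inlined timer reads, `wait_msec_st` is a compare-against-deadline loop. Here is a hedged C sketch of that shape. The clock is passed in as a callback so it can run without the hardware; `clock_fn`, `wait_usec`, `fake_ticks` and `fake_clock` are all hypothetical names, and the fake clock simply advances by one on every read.

```c
#include <stdint.h>

/* Hypothetical clock callback, standing in for the inlined timer read. */
typedef uint64_t (*clock_fn)(void);

static void wait_usec(uint32_t usec, clock_fn now)
{
    uint64_t start = now();
    if (start == 0)                     /* cbz x1, 8017c */
        return;
    uint64_t deadline = start + usec;   /* add x0, x1, w0, uxtw */
    while (now() < deadline)            /* cmp x1, x0 ; b.cc 80150 */
        ;
}

/* Test double only: a counter that ticks once per read. */
static uint64_t fake_ticks = 100;

static uint64_t fake_clock(void)
{
    return fake_ticks++;
}
```

The loop keeps polling while the current count is below the target, and the early return covers the case where the counter reads as zero.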
; This function initialises the frame buffer
0000000000080180 <lfb_init>:
; Transfer x29 (Frame Pointer) and x30 (Procedure Link Register) to 32-17
; bytes before the stack pointer. Update stack pointer to new location. Note
; this is a downward-growing stack.
80180: a9be7bfd stp x29, x30, [sp, #-32]!
; w1 = 140
80184: 52801181 mov w1, #0x8c ; #140
; w2 = 0x48003
80188: 52900062 mov w2, #0x8003 ; #32771
8018c: 72a00082 movk w2, #0x4, lsl #16
; Put stack pointer (sp) in frame pointer (x29)
80190: 910003fd mov x29, sp
; Store x19 at sp + 16, i.e. in the upper half of the 32 byte frame just
; allocated (x29 and x30 occupy the lower half). The stack pointer itself
; is not changed by this instruction.
80194: f9000bf3 str x19, [sp, #16]
; From https://quequero.org/2014/04/introduction-to-arm-architecture:
; "The ADRP instruction permits the calculation of the address at a 4KB
; aligned memory region. In conjunction with an ADD(immediate) instruction,
; or a Load/Store instruction with a 12-bit immediate offset, this allows
; for the calculation of, or access to, any address within ±4GB of the
; current PC."
;
; At first I didn't quite understand the instruction, so I looked in the
; "ARM® Architecture Reference Manual ARMv8, for ARMv8-A architecture profile"
; to find out what the instruction actually does at a bit level.
;
; See https://yurichev.com/mirrors/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf#E14.A64instructionsADRP
;
; From the disassembly we can see that the raw instruction is 0xd0000020. From
; this, following the manual, we can extract the parameters of the instruction:
;
; machine code instruction
; = 0xd-->0-->0-->0-->0-->0-->2-->0-->
; = 0b11010000000000000000000000100000
; |<><---><-----------------><--->
; o i 1 i R
; p m 0 m d
; m 0 m
; l 0 h
; o 0 i
; =>
; op = 1
; page = 1
; immlo = 0b10
; immhi = 0b0000000000000000001
; d = 0b 0 0000
; imm = SignExtend(immhi:immlo:Zeros(12), 64);
; = 0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0110 0000 0000 0000
;
; The manual then explains what this operation does at a bit level:
;
; bits(64) base = PC[];
; =>
; base = 0x80198 (the program counter is just the address of this instruction)
; = 0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1000 0000 0001 1001 1000
;
; if page then
; base<11:0> = Zeros(12)
; =>
; base = 0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1000 0000 0000 0000 0000
;
; X[d] = base + imm;
; =>
; x0 = 0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1000 0110 0000 0000 0000
; = 0x86000
;
; So in the end this instruction is equivalent to loading the constant 0x86000 into x0, but the
; major difference is if you moved this code to a different memory location, the value of x0
; would change. That isn't the case if you hardcode 0x86000 into the instruction, which is maybe
; why the compiler opted to use `adrp`, so the code is easier to relocate to a different memory
; location, if someone chose to do so. Note, not all the assembly is relocatable, e.g. there are
; hardcoded addresses in other memory locations, such as in 0x80048.
;
; The immediate (imm) is 24K (i.e. six 4K pages), so really this instruction is just saying
; "Put the page address of the 4K page in x0, which is six 4K pages higher than the page of
; the current instruction". Presumably the C compiler worked out that six 4K pages later is
; far enough away from the compiled code, that it is available for use.
80198: d0000020 adrp x0, 86000 <uart_hex+0x59a0>
; x19 = 0x86710
;
; Address range 0x86710-0x8679b is going to be where the data is to be stored for initialising
; the framebuffer. Store the base address in x19.
8019c: 911c4013 add x19, x0, #0x710
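The bit-level walk through of `adrp` above can be double checked with a few lines of C. This is a sketch under my own naming (`adrp_target` is not from any library). It assumes the word passed in really is an `adrp` and does not validate the opcode bits.

```c
#include <stdint.h>

/* Decode an A64 `adrp` instruction word and return the value it would
 * write to Xd when executed at address `pc`. */
static uint64_t adrp_target(uint32_t insn, uint64_t pc)
{
    uint64_t immlo = (insn >> 29) & 0x3;       /* bits 30:29 */
    uint64_t immhi = (insn >> 5) & 0x7ffff;    /* bits 23:5  */
    uint64_t imm21 = (immhi << 2) | immlo;     /* immhi:immlo */

    /* SignExtend: if bit 20 is set, fill the upper bits with ones. */
    if (imm21 & (1ull << 20))
        imm21 |= ~((1ull << 21) - 1);

    /* base<11:0> = Zeros(12); X[d] = base + (imm21 : Zeros(12)) */
    return (pc & ~0xfffull) + (imm21 << 12);
}
```

For the instruction at 0x80198, `adrp_target(0xd0000020, 0x80198)` gives 0x86000, and adding the 0x710 from the following `add` gives 0x86710.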
; The *only thing* the following section does is initialise the following memory block, which
; is a property channel (mailbox channel 8) message to the GPU to initialise a framebuffer.
;
; See
; * https://jsandler18.github.io/extra/prop-channel.html
; * https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
; * https://github.com/BrianSidebotham/arm-tutorial-rpi/tree/master/part-5#mailbox-property-interface
;
; x19: 0x86710: 140 Buffer size
; x19+4: 0x86714: 0 Request/response code
; x19+8: 0x86718: 0x48003 Tag 0 - Set Screen Size
; x19+12: 0x8671c: 8 value buffer size
; x19+16; 0x86720: 8 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+20; 0x86724: 1024 request: width response: width
; x19+24; 0x86728: 768 request: height response: height
; x19+28; 0x8672c: 0x48004 Tag 1 - Set Virtual Screen Size
; x19+32; 0x86730: 8 value buffer size
; x19+36; 0x86734: 8 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+40; 0x86738: 1024 request: width response: width
; x19+44; 0x8673c: 768 request: height response: height
; x19+48; 0x86740: 0x48009 Tag 2 - Set Virtual Offset
; x19+52; 0x86744: 8 value buffer size
; x19+56; 0x86748: 8 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+60; 0x8674c: 0 request: x offset response: x offset
; x19+64; 0x86750: 0 request: y offset response: y offset
; x19+68; 0x86754: 0x48005 Tag 3 - Set Colour Depth
; x19+72; 0x86758: 4 value buffer size
; x19+76; 0x8675c: 4 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; 32 bits per pixel => 8 red, 8 green, 8 blue, 8 alpha
; See https://en.wikipedia.org/wiki/RGBA_color_space
; x19+80; 0x86760: 32 request: bits per pixel response: bits per pixel
; x19+84; 0x86764: 0x48006 Tag 4 - Set Pixel Order (really is "Colour Order", not "Pixel Order")
; x19+88; 0x86768: 4 value buffer size
; x19+92; 0x8676c: 4 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+96; 0x86770: 1 request: 0 => BGR, 1 => RGB response: 0 => BGR, 1 => RGB
; x19+100; 0x86774: 0x40001 Tag 5 - Get (Allocate) Buffer
; x19+104; 0x86778: 8 value buffer size (response > request, so use response size)
; x19+108; 0x8677c: 8 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+112; 0x86780: 4096 request: alignment in bytes response: frame buffer base address
; x19+116; 0x86784: 0 request: padding response: frame buffer size in bytes
; x19+120; 0x86788: 0x40008 Tag 6 - Get Pitch (bytes per line)
; x19+124; 0x8678c: 4 value buffer size
; x19+128; 0x86790: 4 request: should be 0 response: 0x80000000 (success) / 0x80000001 (failure)
; x19+132; 0x86794: 0 request: padding response: bytes per line
; x19+136; 0x86798: 0 End Tags
; x19: 0x86710: 140
;
; Note, x0 + 1808 = x0 + 0x710 = 0x86710 = x19. Strangely the compiler used
; `str w1, [x0, #1808]` but could have also used `str w1, [x19]`.
801a0: b9071001 str w1, [x0, #1808]
; w1 = 8
801a4: 52800101 mov w1, #0x8 ; #8
; w3 = 1024
801a8: 52808003 mov w3, #0x400 ; #1024
; w0 = 768
801ac: 52806000 mov w0, #0x300 ; #768
; x19+4: 0x86714: 0
801b0: b900067f str wzr, [x19, #4]
; w4 = 0x48004
801b4: 52900084 mov w4, #0x8004 ; #32772
801b8: 72a00084 movk w4, #0x4, lsl #16
; x19+8: 0x86718: 0x48003
801bc: b9000a62 str w2, [x19, #8]
; w10 = 0x48009
801c0: 5290012a mov w10, #0x8009 ; #32777
801c4: 72a0008a movk w10, #0x4, lsl #16
; x19+12: 0x8671c: 8
801c8: b9000e61 str w1, [x19, #12]
; w9 = 0x48005
801cc: 529000a9 mov w9, #0x8005 ; #32773
801d0: 72a00089 movk w9, #0x4, lsl #16
; x19+16; 0x86720: 8
801d4: b9001261 str w1, [x19, #16]
; w2 = 4
801d8: 52800082 mov w2, #0x4 ; #4
; x19+20; 0x86724: 1024
801dc: b9001663 str w3, [x19, #20]
; w8 = 32
801e0: 52800408 mov w8, #0x20 ; #32
; x19+24; 0x86728: 768
801e4: b9001a60 str w0, [x19, #24]
; w7 = 0x48006
801e8: 529000c7 mov w7, #0x8006 ; #32774
801ec: 72a00087 movk w7, #0x4, lsl #16
; x19+28; 0x8672c: 0x48004
801f0: b9001e64 str w4, [x19, #28]
; w6 = 1
801f4: 52800026 mov w6, #0x1 ; #1
; x19+32; 0x86730: 8
801f8: b9002261 str w1, [x19, #32]
; w5 = 0x40001
801fc: 52800025 mov w5, #0x1 ; #1
80200: 72a00085 movk w5, #0x4, lsl #16
; x19+36; 0x86734: 8
80204: b9002661 str w1, [x19, #36]
; w4 = 4096
80208: 52820004 mov w4, #0x1000 ; #4096
; x19+40; 0x86738: 1024
8020c: b9002a63 str w3, [x19, #40]
; w3 = 0x40008
80210: 52800103 mov w3, #0x8 ; #8
80214: 72a00083 movk w3, #0x4, lsl #16
; x19+44; 0x8673c: 768
80218: b9002e60 str w0, [x19, #44]
; w0 = w1 = 8 - this is MBOX_CH_PROP
8021c: 2a0103e0 mov w0, w1
; x19+48; 0x86740: 0x48009
80220: b900326a str w10, [x19, #48]
; x19+52; 0x86744: 8
80224: b9003661 str w1, [x19, #52]
; x19+56; 0x86748: 8
80228: b9003a61 str w1, [x19, #56]
; x19+60; 0x8674c: 0
8022c: b9003e7f str wzr, [x19, #60]
; x19+64; 0x86750: 0
80230: b900427f str wzr, [x19, #64]
; x19+68; 0x86754: 0x48005
80234: b9004669 str w9, [x19, #68]
; x19+72; 0x86758: 4
80238: b9004a62 str w2, [x19, #72]
; x19+76; 0x8675c: 4
8023c: b9004e62 str w2, [x19, #76]
; x19+80; 0x86760: 32
80240: b9005268 str w8, [x19, #80]
; x19+84; 0x86764: 0x48006
80244: b9005667 str w7, [x19, #84]
; x19+88; 0x86768: 4
80248: b9005a62 str w2, [x19, #88]
; x19+92; 0x8676c: 4
8024c: b9005e62 str w2, [x19, #92]
; x19+96; 0x86770: 1
80250: b9006266 str w6, [x19, #96]
; x19+100; 0x86774: 0x40001
80254: b9006665 str w5, [x19, #100]
; x19+104; 0x86778: 8
80258: b9006a61 str w1, [x19, #104]
; x19+108; 0x8677c: 8
8025c: b9006e61 str w1, [x19, #108]
; x19+112; 0x86780: 4096
80260: b9007264 str w4, [x19, #112]
; x19+116; 0x86784: 0
80264: b900767f str wzr, [x19, #116]
; x19+120; 0x86788: 0x40008
80268: b9007a63 str w3, [x19, #120]
; x19+124; 0x8678c: 4
8026c: b9007e62 str w2, [x19, #124]
; x19+128; 0x86790: 4
80270: b9008262 str w2, [x19, #128]
; x19+132; 0x86794: 0
80274: b900867f str wzr, [x19, #132]
; x19+136; 0x86798: 0
80278: b9008a7f str wzr, [x19, #136]
; ------------------------------------------------------------------
; --------- Memory block (0x86710-0x8679b) now initialised ---------
; ------------------------------------------------------------------
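All of those stores just lay out a fixed table of 32 bit words, so the same message is much shorter when written as a C array. This is a sketch reconstructed from the memory map above; the array name `fb_request` is my own, and the sketch only builds the message. It does not send it, and on real hardware the buffer would also need 16 byte alignment, which is not enforced here.

```c
#include <stdint.h>

/* The 140 byte property channel message built by <lfb_init>,
 * one row per tag. Offsets match the x19+N comments above. */
static const uint32_t fb_request[35] = {
    140, 0,                          /* buffer size, request code     */
    0x48003, 8, 8, 1024, 768,        /* set screen size               */
    0x48004, 8, 8, 1024, 768,        /* set virtual screen size       */
    0x48009, 8, 8, 0, 0,             /* set virtual offset            */
    0x48005, 4, 4, 32,               /* set colour depth              */
    0x48006, 4, 4, 1,                /* set pixel order (1 => RGB)    */
    0x40001, 8, 8, 4096, 0,          /* allocate buffer, 4K alignment */
    0x40008, 4, 4, 0,                /* get pitch                     */
    0                                /* end tag                       */
};
```

Indexing by byte offset divided by four gives the same fields the code reads back later, for example `fb_request[80 / 4]` for bits per pixel and `fb_request[112 / 4]` for the framebuffer address slot.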
; call mbox_call
; Note w0 = x0 = 8
8027c: 94000061 bl 80400 <mbox_call>
; Test call to mbox_call was successful (non-zero value in w0)
; if mbox_call was unsuccessful (w0 == 0), skip to failure message block below
80280: 34000080 cbz w0, 80290 <lfb_init+0x110>
; Test 32 bit colour depth accepted
; Read from address (x19 + 80) into w0. From above, this is the mailbox
; response for 'bits per pixel':
; x19+80; 0x86760: 32 request: bits per pixel response: bits per pixel
80284: b9405260 ldr w0, [x19, #80]
; Compare if it is 32 (i.e. that we successfully set colour depth to 32)
80288: 7100801f cmp w0, #0x20
; If colour depth was successfully set, skip to next test
; below to test framebuffer base address has been set
8028c: 540000e0 b.eq 802a8 <lfb_init+0x128> ; b.none
; Write "Unable to set screen resolution to 1024x768x32" to UART0 (serial
; connection)
; Reinstate original x19 from the stack (that was stored in instruction at
; 0x80194). We must do this since x19-x28 are callee saved registers.
80290: f9400bf3 ldr x19, [sp, #16]
; x0 = 0x806b8 = address of message "Unable to set screen resolution ...."
80294: 90000000 adrp x0, 80000 <_start>
80298: 911ae000 add x0, x0, #0x6b8
; Reinstate x29 (frame pointer) and x30 (procedure link register) from the
; stack (they were stored in instruction at 0x80180).
8029c: a8c27bfd ldp x29, x30, [sp], #32
; Jump directly (no stack update) to <uart_puts>, i.e. a tail call. This
; works since there is nothing left to do in this function, so when
; uart_puts returns, it returns straight to the function that called this one.
802a0: 140000d4 b 805f0 <uart_puts>
802a4: d503201f nop
; Test framebuffer base address returned in mailbox
; read returned base address of framebuffer from mailbox into w0
802a8: b9407260 ldr w0, [x19, #112]
; if it isn't set (=0) go to section above to write failure message to UART0
802ac: 34ffff20 cbz w0, 80290 <lfb_init+0x110>
; If we get this far, then mailbox call was successful
; This line seems pretty redundant, as we had it two instructions before
802b0: b9407260 ldr w0, [x19, #112]
; Set x2, x4 and x6 to address 0x86000, which will be used as a base address
; for calculating addresses inside the bss section later.
802b4: d0000026 adrp x6, 86000 <uart_hex+0x59a0>
802b8: d0000024 adrp x4, 86000 <uart_hex+0x59a0>
802bc: d0000022 adrp x2, 86000 <uart_hex+0x59a0>
; Unset bits 30, 31 of the framebuffer base address. The mailbox returns a
; VideoCore (GPU) bus address, in which the top two bits select a cache
; alias, so masking them off gives the ARM physical address (the RPi 3B
; only has 1GB RAM, so 30 bits are enough).
802c0: 12007400 and w0, w0, #0x3fffffff
; This writes back the address to memory, where it was read from.
802c4: b9007260 str w0, [x19, #112]
; Set x1 also to address 0x86000 for same reason as x2, x4, x6 above.
802c8: d0000021 adrp x1, 86000 <uart_hex+0x59a0>
; w7 = screen width
802cc: b9401667 ldr w7, [x19, #20]
; w5 = screen height
802d0: b9401a65 ldr w5, [x19, #24]
; w3 = pitch (bytes per line)
802d4: b9408663 ldr w3, [x19, #132]
; This seems like a wasted instruction, we just stored w0 here, and now are
; reading it back again, even though it didn't change in between.
802d8: b9407260 ldr w0, [x19, #112]
; Store screen width at 0x86700.
802dc: b90700c7 str w7, [x6, #1792]
; Reinstate original x19 from the stack (that was stored in instruction at
; 0x80194). We must do this since x19-x28 are callee saved registers.
802e0: f9400bf3 ldr x19, [sp, #16]
; Ensure upper 32 bits of x0 are zero, but I'm pretty sure they are
; guaranteed to be already, due to the earlier w0 instructions, such as
; the mov instruction at 0x801ac.
802e4: 2a0003e0 mov w0, w0
; Store screen height at 0x866fc
802e8: b906fc85 str w5, [x4, #1788]
; Store pitch at 0x866f8
802ec: b906f843 str w3, [x2, #1784]
; Store 64 bit address of framebuffer in 0x866f0
802f0: f9037820 str x0, [x1, #1776]
; Reinstate x29 (frame pointer) and x30 (procedure link register) from the
; stack (they were stored in instruction at 0x80180).
802f4: a8c27bfd ldp x29, x30, [sp], #32
; Return from function
802f8: d65f03c0 ret
802fc: d503201f nop
0000000000080300 <lfb_showpicture>:
; x1 = 0x86000
80300: d0000021 adrp x1, 86000 <uart_hex+0x59a0>
; w6 = screen width (stored in 0x86700)
80304: b9470026 ldr w6, [x1, #1792]
; x0 = 0x86000
80308: d0000020 adrp x0, 86000 <uart_hex+0x59a0>
; w0 = screen height
8030c: b946fc00 ldr w0, [x0, #1788]
; x10 = 0x86000
80310: d000002a adrp x10, 86000 <uart_hex+0x59a0>
; w2 = pitch (bytes per line)
80314: b946f942 ldr w2, [x10, #1784]
; w6 = screen width - homer width (96)
80318: 510180c6 sub w6, w6, #0x60
; w0 = screen height - homer height (64)
8031c: 51010000 sub w0, w0, #0x40
; w1 = 2 * (screen width - homer width)
; = 4 * offset pixels from left of screen
; = x byte offset (4 bytes per pixel)
80320: 0b0600c1 add w1, w6, w6
; w0 = (screen height - homer height)/2 = number of screen rows above image
80324: 53017c00 lsr w0, w0, #1
; x3 = 0x86000
80328: d0000023 adrp x3, 86000 <uart_hex+0x59a0>
; x6 = frame buffer address in 0x866f0
8032c: f9437866 ldr x6, [x3, #1776]
; x9 = 0x806e8 = base address of GIMP header image file format (RGB) of homer image
80330: 90000009 adrp x9, 80000 <_start>
80334: 911ba129 add x9, x9, #0x6e8
; w0 = w0 * w2 + w1
; = pitch * number of screen rows above image + x byte offset
; = total framebuffer offset for start of image inside framebuffer
80338: 1b020400 madd w0, w0, w2, w1
; x11 = 86000
8033c: d000002b adrp x11, 86000 <uart_hex+0x59a0>
; x10 = 866f8 = address of framebuffer pitch
80340: 911be14a add x10, x10, #0x6f8
; x11 = 0x866e8 = address immediately after end of encoded image
80344: 911ba16b add x11, x11, #0x6e8
; x6 = framebuffer base address + offset address
; = address of first homer pixel
80348: 8b0000c6 add x6, x6, x0
8034c: d503201f nop
;;;; outer loop (y axis) starts here
; x8 is the indirect result register
; See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch09s01s01.html
;
; x8 = address immediately after last pixel in row
80350: 910600c8 add x8, x6, #0x180
; x2 = base address of encoded image
80354: aa0903e2 mov x2, x9
;;;; inner loop (x axis) starts here
; w1 = byte at (x2+1) ... that is data[1] from
; https://github.com/bztsrc/raspi3-tutorial/blob/7ace64ba9ff98011d37c74bba20890ccbd663ccb/09_framebuffer/homer.h#L9
; Note, 'ldrb immediate unsigned offset' accepts an immediate from 0 to 4095
; bytes.
80358: 39400441 ldrb w1, [x2, #1]
; x2 += 4
; Strangely the compiler has prematurely bumped x2 here, so subsequent byte loads relating
; to the current iteration have negative offsets. If it had bumped x2 after reading the
; image bytes, they would have all had positive offsets.
8035c: 91001042 add x2, x2, #0x4
; w3 = byte at (x2-4) ... data[0]
; Note, ldurb only has an 'unscaled immediate offset' variant, and accepts a
; signed immediate from -256 to 255 bytes.
80360: 385fc043 ldurb w3, [x2, #-4]
; w0 = byte at (x2-2) ... data[2]
80364: 385fe040 ldurb w0, [x2, #-2]
; w1 = w1 - 33
; = data[1] - 33
80368: 51008421 sub w1, w1, #0x21
; w7 = w1 >> 4
; = (data[1] - 33) >> 4
8036c: 13047c27 asr w7, w1, #4
; w3 = w3 - 33
; = data[0] - 33
80370: 51008463 sub w3, w3, #0x21
; w0 = w0 - 33
; = data[2] - 33
80374: 51008400 sub w0, w0, #0x21
; w3 = w7 | (w3 << 2)
; = ((data[1] - 33) >> 4) | ((data[0] - 33) << 2)
; = pixel[0]
80378: 2a0308e3 orr w3, w7, w3, lsl #2
; w5 = byte at (x2-1) ... data[3]
8037c: 385ff045 ldurb w5, [x2, #-1]
; w7 = w0 >> 2
; = (data[2] - 33) >> 2
80380: 13027c07 asr w7, w0, #2
; w1 = w7 | (w1 << 4)
; = ((data[2] - 33) >> 2) | ((data[1] - 33) << 4)
; = pixel[1] in bits 0-7 + possibly set bits 8-9
80384: 2a0110e1 orr w1, w7, w1, lsl #4
; copy bits 0-7 of w3 into bits 0-7 of w4
; w4 = pixel[0] + random bits 8-31
80388: 33001c64 bfxil w4, w3, #0, #8
; w5 = w5 - 33
; = data[3] - 33
8038c: 510084a5 sub w5, w5, #0x21
; w0 = w5 | (w0 << 6)
; = (data[3] - 33) | ((data[2] - 33) << 6)
; = pixel[2] in bits 0-7 + possibly set bits 8-11
80390: 2a0018a0 orr w0, w5, w0, lsl #6
; copy bits 0-7 of w1 into bits 8-15 of w4
; w4 = pixel[0] in bits 0-7
; pixel[1] in bits 8-15
; random bits 16-31
80394: 33181c24 bfi w4, w1, #8, #8
; copy bits 0-7 of w0 into bits 16-23 of w4
; w4 = pixel[0] in bits 0-7
; pixel[1] in bits 8-15
; pixel[2] in bits 16-23
; random bits 24-31
80398: 33101c04 bfi w4, w0, #16, #8
; store w4 (pixel data) in x6 (pixel address in framebuffer) and increase x6 by 4
8039c: b80044c4 str w4, [x6], #4
; Compare pixel address with x8. x8 was calculated as x6 + 0x180 (384) earlier, so
; this happens after 96 loop iterations (x6 increased by 4 each iteration) - which
; is the width in pixels of the image.
803a0: eb0800df cmp x6, x8
; if x6 != x8, repeat from 0x80358
; this will keep repeating until pixel row is completed
803a4: 54fffda1 b.ne 80358 <lfb_showpicture+0x58> ; b.any
; pixel row completed
; load framebuffer pitch in w0
803a8: b9400140 ldr w0, [x10]
; x9 now is address of first pixel in next row of encoded image
803ac: 91060129 add x9, x9, #0x180
; check whether we've reached end of encoded image
803b0: eb0b013f cmp x9, x11
; subtract image width in bytes from framebuffer width in bytes
803b4: 51060000 sub w0, w0, #0x180
; add this difference to x6, which tracks the location in the framebuffer we
; are updating, so that we wrap from the end of one row, to the start of the
; next row
803b8: 8b0000c6 add x6, x6, x0
; if we haven't reached the end of the image, repeat the procedure for the
; next row.
803bc: 54fffca1 b.ne 80350 <lfb_showpicture+0x50> ; b.any
; Return from function
803c0: d65f03c0 ret
...
00000000000803d0 <main>:
; Push frame pointer (x29) and procedure link register (x30) onto the
; (downward-growing) stack, and update the stack pointer.
;
; See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJAIFJII.html
803d0: a9bf7bfd stp x29, x30, [sp, #-16]!
; Move stack pointer address into frame pointer
803d4: 910003fd mov x29, sp
; Call uart_init function
803d8: 94000026 bl 80470 <uart_init>
; Call lfb_init function
803dc: 97ffff69 bl 80180 <lfb_init>
; Call lfb_showpicture function
803e0: 97ffffc8 bl 80300 <lfb_showpicture>
; padding, so that the loop starting at 0x803e8 sits on a 64 bit boundary
803e4: d503201f nop
; Call uart_getc function
803e8: 94000072 bl 805b0 <uart_getc>
; Unset 24 most significant bits in w0, leaving 8 least significant bits.
; Presumably uart_getc set w0 to something interesting for us.
803ec: 12001c00 and w0, w0, #0xff
; Call uart_send function
803f0: 94000064 bl 80580 <uart_send>
; Infinite loop - return to 0x803e8 above (uart_getc)
803f4: 17fffffd b 803e8 <main+0x18>
...
; See https://jsandler18.github.io/extra/mailbox.html for an overview of
; the mailbox peripheral interface.
0000000000080400 <mbox_call>:
; Retain only bits 0-3 of the channel stored in w0.
; The channel should probably only be 4 bits anyway.
80400: 12000c00 and w0, w0, #0xf
; x4=0x86000
80404: d0000024 adrp x4, 86000 <uart_hex+0x59a0>
; x2=0x86710
; Set x2 to address where framebuffer initialisation block is stored.
80408: 911c4082 add x2, x4, #0x710
; Bitwise OR w0 (channel) and w2 (address of mailbox request) and put the result in w2
; Presumably, the request address has to be 16 byte aligned (which it is in this
; case) so that the address and the channel can be stored in 32 bits.
8040c: 2a020002 orr w2, w0, w2
; x1 = 0x3f00b898
; Set x1 to Mailbox peripheral base address + 24 (0x18) which is the mailbox
; status register.
80410: d2971301 mov x1, #0xb898 ; #47256
80414: f2a7e001 movk x1, #0x3f00, lsl #16
; Padding
80418: d503201f nop
; Read mailbox status register into w0
8041c: b9400020 ldr w0, [x1]
; If bit 31 of mailbox status register (mailbox full flag) isn't zero, repeat
; last step.
; See https://jsandler18.github.io/extra/mailbox.html#writing-to-the-mailbox
80420: 37ffffc0 tbnz w0, #31, 80418 <mbox_call+0x18>
; x0 = 0x3f00b8a0
; Set x0 to Mailbox peripheral base address + 32 (0x20) which is the mailbox
; write register.
80424: d2971400 mov x0, #0xb8a0 ; #47264
80428: f2a7e000 movk x0, #0x3f00, lsl #16
; x1 = 0x3f00b898
; Set x1 to Mailbox peripheral base address + 24 (0x18) which is the mailbox
; status register.
8042c: d2971301 mov x1, #0xb898 ; #47256
80430: f2a7e001 movk x1, #0x3f00, lsl #16
; x3 = 0x3f00b880
; Set x3 to Mailbox peripheral base address (read register)
80434: d2971003 mov x3, #0xb880 ; #47232
80438: f2a7e003 movk x3, #0x3f00, lsl #16
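; Write the request address combined with the channel to the mailbox write
; register
8043c: b9000002 str w2, [x0]
; Padding
80440: d503201f nop
; Read mailbox status register into w0
80444: b9400020 ldr w0, [x1]
; If bit 30 (mailbox empty flag) is set, there is no response yet, so repeat
80448: 37f7ffc0 tbnz w0, #30, 80440 <mbox_call+0x40>
; Read the mailbox read register; if the value isn't the reply to our request
; (same address and channel), keep waiting
8044c: b9400060 ldr w0, [x3]
80450: 6b02001f cmp w0, w2
80454: 54ffff61 b.ne 80440 <mbox_call+0x40> ; b.any
; x4 = 0x86710 (mailbox request); w1 = second word of the request buffer
; Return 1 in w0 if that word is 0x80000000 (success response), otherwise 0
80458: 911c4084 add x4, x4, #0x710
8045c: 52b00000 mov w0, #0x80000000 ; #-2147483648
80460: b9400481 ldr w1, [x4, #4]
80464: 6b00003f cmp w1, w0
80468: 1a9f17e0 cset w0, eq ; eq = none
8046c: d65f03c0 ret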
0000000000080470 <uart_init>:
; Push frame pointer (x29) and procedure link register (x30) onto the
; (downward-growing) stack, after leaving a gap of 32 bytes on the stack,
; and update the stack pointer.
;
; See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJAIFJII.html
80470: a9bd7bfd stp x29, x30, [sp, #-48]!
; x0=0x86000
80474: d0000020 adrp x0, 86000 <uart_hex+0x59a0>
; x1 = 0x86710 (address of framebuffer mailbox request)
80478: 911c4001 add x1, x0, #0x710
; Move stack pointer address into frame pointer
8047c: 910003fd mov x29, sp
; Store x19, x20 on the stack immediately before frame pointer and link register
80480: a90153f3 stp x19, x20, [sp, #16]
; x19 = 0x3f201030 (address of UART0_CR)
80484: d2820613 mov x19, #0x1030 ; #4144
80488: f2a7e413 movk x19, #0x3f20, lsl #16
; store x21 on the stack before x20
8048c: f90013f5 str x21, [sp, #32]
; w2 = 32 (0x20)
80490: 52800402 mov w2, #0x20 ; #32
; [UART0_CR] = 0 (32 bits)
80494: b900027f str wzr, [x19]
; w20 = 2
80498: 52800054 mov w20, #0x2 ; #2
; [0x86710] (mbox request + 0-3) = 32 (0x00000020)
8049c: b9071002 str w2, [x0, #1808]
; w0 = 0x00038002
804a0: 52900040 mov w0, #0x8002 ; #32770
804a4: 72a00060 movk w0, #0x3, lsl #16
; [0x86714] (mbox request + 4-7) = 0 (0x00000000)
804a8: b900043f str wzr, [x1, #4]
; w2 = 12 (0x0000000c)
804ac: 52800182 mov w2, #0xc ; #12
; [0x86718] (mbox request + 8-11) = 229378 (0x00038002)
804b0: b9000820 str w0, [x1, #8]
; w0 = 8
804b4: 52800100 mov w0, #0x8 ; #8
; [0x8671c] (mbox request + 12-15) = 12 (0x0000000c)
804b8: b9000c22 str w2, [x1, #12]
; w2 = 4000000 (0x003d0900)
804bc: 52812002 mov w2, #0x900 ; #2304
804c0: 72a007a2 movk w2, #0x3d, lsl #16
; [0x86720] (mbox request + 16-19) = 8 (0x00000008)
804c4: b9001020 str w0, [x1, #16]
; x21 = 0x3f200098
804c8: d2801315 mov x21, #0x98 ; #152
804cc: f2a7e415 movk x21, #0x3f20, lsl #16
; [0x86724] (mbox request + 20-23) = 2 (0x00000002)
804d0: b9001434 str w20, [x1, #20]
; [0x86728] (mbox request + 24-27) = 4000000 (0x003d0900)
804d4: b9001822 str w2, [x1, #24]
; [0x8672c] (mbox request + 28-31) = 0 (0x00000000)
804d8: b9001c3f str wzr, [x1, #28]
; Call mbox_call function
804dc: 97ffffc9 bl 80400 <mbox_call>
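; x2 = 0x3f200004 (GPFSEL1), x3 = 0x3f200094 (GPPUD)
804e0: d2800082 mov x2, #0x4 ; #4
804e4: f2a7e402 movk x2, #0x3f20, lsl #16
804e8: d2801283 mov x3, #0x94 ; #148
804ec: f2a7e403 movk x3, #0x3f20, lsl #16
; Read GPFSEL1, clear bits 12-17 (function select for GPIO 14 and 15) and set
; both pins to alternate function 0 (0x24000 = 4 << 12 | 4 << 15).
; w0 = 150 is the argument for the wait_cycles call below.
804f0: b9400041 ldr w1, [x2]
804f4: 52880004 mov w4, #0x4000 ; #16384
804f8: 72a00044 movk w4, #0x2, lsl #16
804fc: 120e6421 and w1, w1, #0xfffc0fff
80500: 528012c0 mov w0, #0x96 ; #150
80504: 2a040021 orr w1, w1, w4
80508: b9000041 str w1, [x2]
; [GPPUD] = 0, then wait 150 cycles
8050c: b900007f str wzr, [x3]
80510: 97fffed0 bl 80050 <wait_cycles>
; [GPPUDCLK0] = 0xc000 (bits 14 and 15, for GPIO 14 and 15), then wait 150
; cycles
80514: 52980000 mov w0, #0xc000 ; #49152
80518: b90002a0 str w0, [x21]
8051c: 528012c0 mov w0, #0x96 ; #150
80520: 97fffecc bl 80050 <wait_cycles>
; [GPPUDCLK0] = 0
80524: b90002bf str wzr, [x21]
; x0 = 0x3f201044 (UART0_ICR), x2 = 0x3f201024 (UART0_IBRD)
80528: d2820880 mov x0, #0x1044 ; #4164
8052c: f2a7e400 movk x0, #0x3f20, lsl #16
80530: d2820482 mov x2, #0x1024 ; #4132
80534: f2a7e402 movk x2, #0x3f20, lsl #16
; Reinstate original x21 from the stack
80538: f94013f5 ldr x21, [sp, #32]
; [UART0_ICR] = 0x7ff (clear interrupts); x1 = 0x3f201028 (UART0_FBRD)
8053c: 5280ffe3 mov w3, #0x7ff ; #2047
80540: d2820501 mov x1, #0x1028 ; #4136
80544: f2a7e401 movk x1, #0x3f20, lsl #16
80548: b9000003 str w3, [x0]
; x0 = 0x3f20102c (UART0_LCRH)
8054c: d2820580 mov x0, #0x102c ; #4140
80550: f2a7e400 movk x0, #0x3f20, lsl #16
; [UART0_IBRD] = 2 and [UART0_FBRD] = 11. With the 4000000 Hz UART clock
; requested above, 4000000 / (16 * 115200) = 2.17, which is roughly 2 + 11/64,
; so this selects 115200 baud.
80554: b9000054 str w20, [x2]
80558: 52800162 mov w2, #0xb ; #11
8055c: b9000022 str w2, [x1]
; [UART0_LCRH] = 0x60 (8 bit word length)
80560: 52800c01 mov w1, #0x60 ; #96
80564: b9000001 str w1, [x0]
; [UART0_CR] = 0x301 (enable UART, transmit and receive)
80568: 52806020 mov w0, #0x301 ; #769
8056c: b9000260 str w0, [x19]
; Reinstate x19, x20, x29 and x30 from the stack, and return from function
80570: a94153f3 ldp x19, x20, [sp, #16]
80574: a8c37bfd ldp x29, x30, [sp], #48
80578: d65f03c0 ret
8057c: d503201f nop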
0000000000080580 <uart_send>:
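; x2 = 0x3f201018 (UART0_FR, the flag register)
80580: d2820302 mov x2, #0x1018 ; #4120
80584: f2a7e402 movk x2, #0x3f20, lsl #16
; Padding
80588: d503201f nop
; Read UART0_FR; if bit 5 (transmit FIFO full) is set, repeat
8058c: b9400041 ldr w1, [x2]
80590: 372fffc1 tbnz w1, #5, 80588 <uart_send+0x8>
; x1 = 0x3f201000 (UART0_DR, the data register); write the character in w0
; to it
80594: d2820001 mov x1, #0x1000 ; #4096
80598: f2a7e401 movk x1, #0x3f20, lsl #16
8059c: b9000020 str w0, [x1]
; Return from function
805a0: d65f03c0 ret
805a4: d503201f nop
805a8: d503201f nop
805ac: d503201f nop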
00000000000805b0 <uart_getc>:
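; x1 = 0x3f201018 (UART0_FR, the flag register)
805b0: d2820301 mov x1, #0x1018 ; #4120
805b4: f2a7e401 movk x1, #0x3f20, lsl #16
; Padding
805b8: d503201f nop
; Read UART0_FR; if bit 4 (receive FIFO empty) is set, repeat
805bc: b9400020 ldr w0, [x1]
805c0: 3727ffc0 tbnz w0, #4, 805b8 <uart_getc+0x8>
; x0 = 0x3f201000 (UART0_DR, the data register); w1 = 10 ('\n')
805c4: d2820000 mov x0, #0x1000 ; #4096
805c8: f2a7e400 movk x0, #0x3f20, lsl #16
805cc: 52800141 mov w1, #0xa ; #10
; Read the received character into w0, keeping only the 8 least significant
; bits
805d0: b9400000 ldr w0, [x0]
805d4: 12001c00 and w0, w0, #0xff
; If the character is 13 ('\r') return 10 ('\n') instead, otherwise return
; the character unchanged
805d8: 7100341f cmp w0, #0xd
805dc: 1a811000 csel w0, w0, w1, ne ; ne = any
805e0: d65f03c0 ret
805e4: d503201f nop
805e8: d503201f nop
805ec: d503201f nop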
00000000000805f0 <uart_puts>:
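; w1 = first character of the string at x0; return immediately if it is 0
805f0: 39400001 ldrb w1, [x0]
805f4: 34000221 cbz w1, 80638 <uart_puts+0x48>
; x2 = 0x3f201018 (UART0_FR), x4 = 0x3f201000 (UART0_DR), w5 = 13 ('\r')
805f8: d2820302 mov x2, #0x1018 ; #4120
805fc: f2a7e402 movk x2, #0x3f20, lsl #16
80600: d2820004 mov x4, #0x1000 ; #4096
80604: f2a7e404 movk x4, #0x3f20, lsl #16
80608: 528001a5 mov w5, #0xd ; #13
8060c: d503201f nop
;;;; loop (one iteration per character) starts here
; If the character is 10 ('\n'), first send '\r' (see 0x80640 below)
80610: 7100283f cmp w1, #0xa
80614: 54000160 b.eq 80640 <uart_puts+0x50> ; b.none
; w3 = current character, and increase x0 by 1
80618: 38401403 ldrb w3, [x0], #1
8061c: d503201f nop
80620: d503201f nop
; Wait until bit 5 (transmit FIFO full) of UART0_FR is unset, then write the
; character to UART0_DR
80624: b9400041 ldr w1, [x2]
80628: 372fffc1 tbnz w1, #5, 80620 <uart_puts+0x30>
8062c: b9000083 str w3, [x4]
; Load the next character, and repeat unless it is the terminating 0
80630: 39400001 ldrb w1, [x0]
80634: 35fffee1 cbnz w1, 80610 <uart_puts+0x20>
; Return from function
80638: d65f03c0 ret
8063c: d503201f nop
; Wait until the transmit FIFO is not full, write '\r' to UART0_DR, then jump
; back to 0x80618 to send the '\n' itself
80640: d503201f nop
80644: b9400041 ldr w1, [x2]
80648: 372fffc1 tbnz w1, #5, 80640 <uart_puts+0x50>
8064c: b9000085 str w5, [x4]
80650: 17fffff2 b 80618 <uart_puts+0x28>
80654: d503201f nop
80658: d503201f nop
8065c: d503201f nop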
0000000000080660 <uart_hex>:
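; x2 = 0x3f201018 (UART0_FR), x5 = 0x3f201000 (UART0_DR)
80660: d2820302 mov x2, #0x1018 ; #4120
80664: f2a7e402 movk x2, #0x3f20, lsl #16
80668: d2820005 mov x5, #0x1000 ; #4096
8066c: f2a7e405 movk x5, #0x3f20, lsl #16
; w3 = 28 (shift amount for the most significant nibble of w0)
; w7 = 0x37 ('A' - 10), w6 = 0x30 ('0')
80670: 52800383 mov w3, #0x1c ; #28
80674: 528006e7 mov w7, #0x37 ; #55
80678: 52800606 mov w6, #0x30 ; #48
8067c: d503201f nop
;;;; loop (one iteration per nibble) starts here
; w1 = next nibble of w0 (most significant nibble first)
80680: 1ac32401 lsr w1, w0, w3
80684: 12000c21 and w1, w1, #0xf
; w4 = ASCII character for the nibble: '0'-'9' for 0-9, 'A'-'F' for 10-15
80688: 7100243f cmp w1, #0x9
8068c: 1a8680e4 csel w4, w7, w6, hi ; hi = pmore
80690: 0b010084 add w4, w4, w1
80694: d503201f nop
80698: d503201f nop
; Wait until bit 5 (transmit FIFO full) of UART0_FR is unset, then write the
; character to UART0_DR
8069c: b9400041 ldr w1, [x2]
806a0: 372fffc1 tbnz w1, #5, 80698 <uart_hex+0x38>
806a4: b90000a4 str w4, [x5]
; Reduce the shift amount by 4, and repeat until it reaches -4 (that is, after
; all 8 nibbles have been sent)
806a8: 51001063 sub w3, w3, #0x4
806ac: 3100107f cmn w3, #0x4
806b0: 54fffe81 b.ne 80680 <uart_hex+0x20> ; b.any
; Return from function
806b4: d65f03c0 ret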
;;; 0x806b8-0x806e7: (48 bytes): "Unable to set screen resolution to 1024x768x32"
;;; 0x806e8-0x866e7: (24 KB): homer picture in GIMP header image file format (RGB)
;;; 0x866e8-0x866ef: (8 bytes): < --- purpose unknown --- >
;;; 0x866f0-0x866f7: (8 bytes): framebuffer address
;;; 0x866f8-0x866fb: (4 bytes): framebuffer pitch (bytes per line)
;;; 0x866fc-0x866ff: (4 bytes): screen height
;;; 0x86700-0x86703: (4 bytes): screen width
;;; 0x86704-0x8670f: (12 bytes): < --- purpose unknown --- >
;;; 0x86710-0x8679b: (140 bytes): framebuffer mailbox request
;;; 0x8679c-0x8679f: (4 bytes): < --- purpose unknown --- >
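The arithmetic in lfb_showpicture can be easier to follow outside of assembly. The following Python sketch mirrors the two calculations the listing performs: the byte offset of the top-left image pixel inside the framebuffer, and the decoding of four encoded image bytes into one pixel. The function names are hypothetical. The pitch of 4096 bytes used in the example is an assumption (1024 pixels of 4 bytes each); the real value is whatever the mailbox response reports. The top byte of each pixel word, which the listing leaves undefined, is simply set to zero here.

```python
IMAGE_WIDTH = 96   # homer width in pixels, as used in the listing
IMAGE_HEIGHT = 64  # homer height in pixels, as used in the listing


def image_offset(screen_width, screen_height, pitch):
    """Byte offset in the framebuffer of the top-left pixel of the centred image."""
    # 2 * (width - 96) equals ((width - 96) / 2) pixels * 4 bytes per pixel
    x_bytes = (screen_width - IMAGE_WIDTH) * 2
    # number of screen rows above the image
    rows_above = (screen_height - IMAGE_HEIGHT) >> 1
    return rows_above * pitch + x_bytes


def decode_pixel(data):
    """Decode 4 encoded bytes (data[0]..data[3]) into three 8-bit values."""
    d0, d1, d2, d3 = (byte - 33 for byte in data[:4])
    # The masks keep only bits 0-7, as the bfxil/bfi instructions do.
    pixel0 = ((d0 << 2) | (d1 >> 4)) & 0xFF
    pixel1 = ((d1 << 4) | (d2 >> 2)) & 0xFF
    pixel2 = ((d2 << 6) | d3) & 0xFF
    return pixel0, pixel1, pixel2


def pack_pixel(pixel0, pixel1, pixel2):
    """Build the 32-bit word written to the framebuffer (top byte zeroed)."""
    return pixel0 | (pixel1 << 8) | (pixel2 << 16)


if __name__ == "__main__":
    print(image_offset(1024, 768, 4096))           # 1443648
    print(decode_pixel(b"%D27"))                   # (18, 52, 86)
    print(hex(pack_pixel(*decode_pixel(b"%D27")))) # 0x563412
```

Each encoded pixel occupies 4 bytes and each framebuffer pixel occupies 4 bytes, which is why the listing can use the same row length of 0x180 (96 * 4) bytes for both the encoded image and the framebuffer.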
After that is done, my next goal will be to render an updated version of the classic main menu screen:
Logically it isn't the most important step to achieve, but the reason to do this first is motivational. It will feel like I've made a lot of progress if the menu is displayed, even if it isn't functional.
I might add some menu options, and adjust the relative size of the menu in relation to the screen size (or even provide keyboard shortcuts to change the size dynamically).
After this, I'd like to get JTAG debugging working, so that I can debug the kernel builds running on the RPi directly from my Mac. I have not yet purchased any JTAG equipment.
After this I'll begin the task of porting the ZX Spectrum +2/+3 ROMs, starting with the 16K ROM which is shared with the ZX Spectrum 48K.
This online copy of The Complete Spectrum ROM Disassembly should aid the porting process.
Once this is done, I intend to port the remaining ROMs, using this disassembly guide to help me.
I have purchased the following items:
- Two Raspberry Pi 3Bs
- Apple iMac (approximate version)
- ZX Spectrum ULA book
- Raspberry Pi Assembly Language RISC OS Beginners (Hands On Guide)
- The Complete Spectrum ROM Disassembly
- 2 Aukru Gear Cable Micro USB Charging cable with switches
- DSD Tech 4 Pin Dupont Cable USB to TTL Serial Converter CP2102
- iLEPO Smart USB Charger
- FTDI C232HM-DDHSL-0 JTAG device
Bare metal development and ARM Assembly are new for me. This project will be a vehicle for me, and hopefully others too, to learn about kernel development in ARM Assembly. If you would like to get involved, please do join the #spectrum4 IRC channel on chat.freenode.net.
I will document my progress as I go along, so if you have knowledge that might help me overcome problems, I'd be very happy to hear from you.
To start with, I'm providing some links to help me track the web content which should help me work on this project (see below).
In order to avoid needing to physically remove the SD card from the RPi 3B every time you make changes to the spectrum4 kernel during development, it is advisable to serve the operating system over your local network from another computer. In my case, this meant either serving it from my Mac, or serving it from a second Raspberry Pi.
I have a second RPi 3B which I have set up as a TFTP server, loosely following this guide. At the time of writing, this guide is a little out of date (see e.g. this GitHub issue). I will see if I can make a PR (rather than an issue) to address the problems I found with it.
For example, the guide advises that on the RPi 3B, one can skip the sections Client configuration and Program USB boot mode. This was not my experience, and I indeed needed to set the USB boot mode in order for TFTP booting to work (note, we actually boot over Ethernet, not over USB, so the name of this configuration setting is a little misleading!).
The guide also covers a more complex use case than ours, which involves serving a Linux distribution. Therefore some stages of this guide are not required. The reduced set of steps I followed in order to achieve TFTP booting is as follows:
- Install Raspbian on an SD card. I personally install Raspbian using my Mac by following the Command line steps of this page.
- Mount the SD card and add the following line to the end of the config.txt file:
program_usb_boot_mode=1
- Start the Raspberry Pi, and check that the boot mode flag was set in the One-Time-Programmable memory:
$ vcgencmd otp_dump | grep 17:
17:3020000a
If it wasn't successful, you will likely get the result 17:1020000a. That is no good; perhaps try another reboot, and double check that /boot/config.txt contains your change.
- Remove the line you added to /boot/config.txt and save.
- Shut down the RPi.
- Remove the SD card.
- Place the Raspberry Pi somewhere safe. It will be the Raspberry Pi that will run spectrum4! The other Raspberry Pi is the one that will serve the spectrum4 images to it.
- Put the same SD card in the second Raspberry Pi, and start it up.
- Update system packages:
sudo apt-get update && sudo apt-get -y upgrade
Configure a static IP:
echo -e "\\n#### Set static IP for TFTP booting\\ninterface eth0\\nstatic ip_address=$(ip -4 addr show dev eth0 | grep inet | awk '{print $2}')\\nstatic routers=$(ip route | grep default | awk '{print $3}')\\nstatic domain_name_servers=$(ip route | grep default | awk '{print $3}')" | sudo tee -a /etc/dhcpcd.conf
- Install dnsmasq and tcpdump:
sudo apt-get install -y dnsmasq tcpdump
- Stop dnsmasq breaking DNS resolving:
sudo rm /etc/resolvconf/update.d/dnsmasq
- Let's run an ssh daemon so that we can scp files to the Raspberry Pi:
sudo touch /boot/ssh
sudo reboot
- Start tcpdump so you can search for DHCP packets from the client Raspberry Pi:
sudo tcpdump -i eth0 port bootpc
- Connect the other Raspberry Pi to your network (without an SD card inserted!) and power it on. Check that the LEDs illuminate on the client after around 10 seconds, then you should get a packet from the client, "DHCP/BOOTP, Request from ...":
IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb...
- Now we need to modify the dnsmasq configuration to enable DHCP to reply to the device. Press Ctrl+C on the keyboard to exit the tcpdump program, then run the following to replace the contents of /etc/dnsmasq.conf:
echo -e "port=0\\ndhcp-range=$(ip -4 addr show dev eth0 | grep inet | awk '{print $4}'),proxy\\nlog-dhcp\\nenable-tftp\\ntftp-root=/tftpboot\\npxe-service=0,\"Raspberry Pi Boot\"" | sudo tee /etc/dnsmasq.conf
- Now create a /tftpboot directory, and enable and restart dnsmasq:
sudo mkdir /tftpboot
sudo chmod 777 /tftpboot
sudo systemctl enable dnsmasq.service
sudo systemctl restart dnsmasq.service
- Now monitor the dnsmasq log:
tail -F /var/log/daemon.log
- If your other Raspberry Pi isn't already running, then turn it on (but keep your display connected to the current Raspberry Pi). You should see something like this appear in the dnsmasq logs:
raspberrypi dnsmasq-tftp[1903]: file /tftpboot/bootcode.bin not found
Use Ctrl+C to exit the monitoring state.
- Now we just need to copy some files into the /tftpboot directory. A simple and small example is the Peter Lemon Julia set animation.
curl -#L 'https://github.com/raspberrypi/firmware/raw/abfb4be3e1b5836e1ffd96de4ce499406ec9dbb8/boot/bootcode.bin' > /tftpboot/bootcode.bin
curl -#L 'https://github.com/raspberrypi/firmware/raw/abfb4be3e1b5836e1ffd96de4ce499406ec9dbb8/boot/start.elf' > /tftpboot/start.elf
curl -#L 'https://github.com/PeterLemon/RaspberryPi/raw/7130e72637d08b1976512bd60a372acb9b458310/boot/config.txt' > /tftpboot/config.txt
curl -#L 'https://github.com/PeterLemon/RaspberryPi/raw/7130e72637d08b1976512bd60a372acb9b458310/NEON/Fractal/Julia/kernel7.img' > /tftpboot/kernel7.img
- Turn the SD-card-less RPi off, connect it to a display with an HDMI cable, and then turn it back on again. You should see a fractal animation based on the Julia Set.
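The OTP check in the steps above compares two hexadecimal values that differ in a single bit: 0x3020000a has bit 29 (0x20000000) set, while 0x1020000a does not. As a minimal sketch, assuming that this bit is the flag written by program_usb_boot_mode=1, a line of output from vcgencmd otp_dump | grep 17: could be checked as follows. The function name is hypothetical.

```python
USB_BOOT_MODE_BIT = 1 << 29  # the only bit that differs between the two values


def usb_boot_mode_set(otp_line):
    """Return True if an OTP dump line such as '17:3020000a' has bit 29 set."""
    register, _, value = otp_line.strip().partition(":")
    if register != "17":
        raise ValueError("expected a line for OTP register 17")
    return bool(int(value, 16) & USB_BOOT_MODE_BIT)


if __name__ == "__main__":
    print(usb_boot_mode_set("17:3020000a"))  # True
    print(usb_boot_mode_set("17:1020000a"))  # False
```

Checking the single bit rather than comparing the whole string means the test would still work if other bits of the register differ on another board, although that situation is not covered by the steps above.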
I haven't got this working yet.
I haven't tried this yet; it is pretty adventurous.
I haven't tried this yet either, as I don't currently have a Windows installation to try this out with.
- David Welch - Bare metal guide
- CS107e - Stanford's introductory course to bare metal programming on Raspberry Pi.
- Peter Lemon - RPi bare metal tutorials
- Cambridge University - Baking Pi
- Mauri de Souza Nunes - Baking Pi for RPi 3B
- eLinux RPi hardware page
- BCM2837 help page
- OSDev.org - Raspberry Pi Bare Bones
- The Raspberry Pi UARTs
- Raspberry Pi bare metal/assembly forum
- Adam Ransom - Hello World RPi 3B
- Zoltan Baldaszti - Bare Metal Programming on Raspberry Pi 3
- Sergey Matyukevich - Learning operating system development using Linux kernel and Raspberry Pi
- Leon de Boer - Baremetal Raspberry Pi
- ICTeam 28 - PiFox: 3D rail shooter written in ARM assembly
- Tetris-Duel-Team - Multiplayer Tetris for Raspberry Pi (in bare metal assembly)
- Jake Sandler - Building an Operating System for the Raspberry Pi
- Brian Sidebotham - Raspberry-Pi Bare Metal Tutorial
- Miro Samek - Building Bare-Metal ARM Systems with GNU
- Andre Richter - Bare Metal Rust Programming on Raspberry Pi 3
- Linuxhit - Raspberry Pi PXE Boot – Network booting a Pi 4 without an SD card
- William Lam - Two methods to network boot Raspberry Pi 4
- Adam Greenwood-Byrne - Writing a “bare metal” operating system for Raspberry Pi 4
- ARM Cortex-A53 MPCore Processor Technical Reference Manual
- The ARMv8 Instruction Set Overview
- The A64 instruction set
- The armasm User Guide
- The A64 Instruction Set Reference
- ARM Cortex-A Series Programmer's Guide for ARMv8-A
- A Guide to ARM64 / AArch64 Assembly on Linux with Shellcodes and Cryptography
- How to handle stripped binaries with GDB
- ARM Processor Cortex A53 MPCore Product Revision r0 Software Developers Errata Notice
- Bare-metal Boot Code for ARMv8-A Processors
- ARM Trusted Firmware
- Herman Hermitage - Tools and information for the Broadcom VideoCore IV (RaspberryPi)
- VideoCore® IV 3D - Architecture Reference Guide
- Remote Debugging of Raspberry Pi with JTAG interface
- JTag for Pi 3 - David Welch advice
- SUSE Blog - Debugging Raspberry Pi 3 with JTAG
- dwelch67 armjtag folder
- Daniel Krebs - JTAG and bare metal on Raspberry Pi 3
- Mete Balci - Bare Metal Raspberry Pi 3B+: JTAG
See USB subfolder of this project.
- Sinclair ZX Spectrum - BASIC Programming
- World of Spectrum - Documentation
- Interrupts
- ZXBaremulator
- SkoolKit disassemblies
- Sergey Kiselev - Building ZX Spectrum Clone - Harlequin
- Matt Westcott (Gasman) - Channels and streams
- Steve Ciarcia - Build Your Own Z80 Computer
- Z80 Info
- Ben Eater - Build an 8-bit computer from scratch
- Robin Mitchell - How to Build a Z80 Computer
- Z80 CPU User Manual UM008011-0816
- Channels and streams
- ZX Art (not really useful, but pretty cool!)
- DSD Tech 4 Pin Dupont Cable USB to TTL Serial Converter CP2102 Drivers
- Viewing SSID from macOS
- Enabling WiFi on rpi over serial connection
Note, run the above as root. You may also need to run:
sudo rfkill unblock all
diskutil list
Assuming disk2 is the SD card:
diskutil unmountDisk /dev/disk2
sudo dd bs=1m if=<......>.img of=/dev/rdisk2
diskutil eject /dev/rdisk2