Friday, January 31, 2014

Raster IRQs now work

Today I implemented a small but important feature that has been in the pipeline for a while: raster interrupts triggered by the VIC-IV.

I already had much of the machinery in place for raster IRQs; I just hadn't finished tying it all together.

So a few lines of VHDL later, I set the FPGA building.  Unfortunately, just adding a few ties for the IRQ pushed the timing out by about 2ns, from the roughly 5.2ns required for the 192MHz pixel clock to around 7ns.  As a result it didn't work.

I scratched my head for a while wondering how about 7 logic gates could ruin the timing so badly, and eventually realised that I needed to pipeline the IRQ line by adding a drive stage, so that the IRQ had time to propagate across the FPGA.  Without it, ISE was rearranging everything else (badly) to make the IRQ line get to the CPU in one cycle.  Net result, the IRQ triggers one pixel clock cycle late on the CPU, which isn't really an issue, since the CPU runs at half that clock speed, so it should still trigger the CPU interrupt on the correct cycle.

So then I set about writing a little raster interrupt routine to test it.  This was an important step, not only to make sure that the raster interrupt line worked, but also that clearing VIC-IV interrupts worked in a C64 compatible way, unlike the C65 where the usual ASL $D019 or INC $D019 doesn't clear VIC interrupts.  This is a big source of incompatibility on the C65.

Interestingly, this incompatibility on the C65 is not the VIC-III's fault, but rather the 4510 CPU's.  This is because the 4510 uses the CMOS 65CE02 core that changed the behaviour of ASL, INC and other read-modify-write instructions.  On the 6510, the instructions read the original value, write the original value and then write the modified value.  This is why ASL $D019 or INC $D019 works to clear interrupts on the C64, because writing the value read from $D019 will clear all triggered interrupts.

But on the 4510, the original value is read, and the modified value is written, saving a cycle, and in the process really breaking compatibility.  The SuperCPU has a similar problem because it uses the 65816 that includes the same "optimisation".  Aware of this, I resolved that the problem would not exist in the C65GS, and today it was time to test it.
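To make the difference concrete, here is a small Python model of the two INC behaviours acting on a write-1-to-clear status register like $D019. It is only a sketch: the class and function names are made up for illustration, and the register model is simplified (four interrupt latches, unused bits reading back as 1).

```python
class IrqStatusRegister:
    """Toy model of a $D019-style register: writing a 1 to a bit clears that latch."""

    def __init__(self, latched=0):
        self.latched = latched & 0x0F

    def read(self):
        # Bits 4-6 read back as 1; bit 7 is set while any latch is set.
        any_irq = 0x80 if self.latched else 0x00
        return any_irq | 0x70 | self.latched

    def write(self, value):
        # Every 1 bit in the written value clears the matching latch.
        self.latched &= ~value & 0x0F


def inc_6510(reg):
    """INC as on the 6510: read, write the original value back, write the modified value."""
    value = reg.read()
    reg.write(value)
    reg.write((value + 1) & 0xFF)


def inc_4510(reg):
    """INC as described for the 4510: read, then write only the modified value."""
    value = reg.read()
    reg.write((value + 1) & 0xFF)


nmos = IrqStatusRegister(latched=0x01)  # raster IRQ pending
inc_6510(nmos)
print(nmos.latched)  # 0: the write of the original value cleared the latch

cmos = IrqStatusRegister(latched=0x01)
inc_4510(cmos)
print(cmos.latched)  # 1: only the incremented value was written, and its bit 0 is 0
```

The same reasoning applies to ASL: the shifted value has bit 0 clear, so without the intermediate write of the original value the raster latch stays set.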

The interrupt routine I wrote is:
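 ; mask IRQs while we set things up
 sei
 ; CIA IRQ disable
 lda #$7f
 sta $DC0D
 ; clear bit 8 of raster compare
 and $D011
 sta $D011
 ; set raster for split
 lda #$80
 sta $D012
 ; enable raster IRQ
 lda #$01
 sta $D01A
 ; set IRQ vector (low byte in $0314, high byte in $0315)
 lda #<irq
 sta $0314
 lda #>irq
 sta $0315
 ; setup done, allow IRQs again and return
 cli
 rts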

irq: ; border yellow
 lda #$07
 sta $D020
 ; wait for a bit
 ldx #$ff
l1: dex
 bne l1
 ; border back to light blue
 lda #$0e
 sta $D020
 ; acknowledge IRQ
 inc $D019
 ; return from interrupt, via keyboard scan etc.
 jmp $EA31

As can be seen, I am using my usual INC $D019 to clear the raster IRQ.

Now to see if it worked.  Bingo! A nice little raster bar.  There are a few cycles of jitter, as you would expect, but of course with a 96MHz CPU the jitter is only about one character wide.  Since each character is 20 cycles wide, that means a jitter of less than 10 cycles.  That's more than on a real C64 because I still have some wait states in the C65GS CPU memory access that I have yet to work around, and so some instructions can take a dozen or so cycles; in particular, things like INC, which include six memory accesses, can take 12 or 13 cycles.

For comparison, here is the same routine running on VICE.

The keen observer will notice that not only is the raster bar much narrower on the C65GS, but it is also not in quite the same position.  This is because the 1920x1200 frame has 1248 physical rasters (including flyback), which is 4x PAL's 312 lines.  However, to keep the vertical borders small, in C64 mode the C65GS makes each logical raster equal to five physical rasters.
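The size of the mismatch can be sketched with a little arithmetic. The snippet below assumes, purely for illustration, that logical raster counting starts at physical raster 0; the function name is made up.

```python
PHYSICAL_RASTERS = 1248  # 1920x1200 frame including flyback
PAL_RASTERS = 312


def physical_line(logical_raster, physical_per_logical):
    """First physical raster on which a given logical raster begins."""
    return logical_raster * physical_per_logical


split = 0x80  # raster compare value used in the test routine

pal_exact = physical_line(split, PHYSICAL_RASTERS // PAL_RASTERS)  # 4 per logical line
c64_mode = physical_line(split, 5)                                 # 5 per logical line

print(pal_exact)             # 512
print(c64_mode)              # 640
print(c64_mode - pal_exact)  # 128
```

So with five physical rasters per logical raster, a split at $80 lands 128 physical rasters further down the frame than it would with an exact 4:1 mapping.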

I'll have to do something about this so that raster splits occur on the correct logical line.  This will most likely consist of having logical rasters spaced 3 lines apart before the display, 5 lines apart during the display, and 3 lines apart after it.  It all gets a bit fun, because 1920x1200@60Hz has only one invisible raster at the beginning of the frame, while PAL has more, so the logical raster counter will have to start during flyback in the previous frame.  Entirely possible, just a bit fiddly.

256-colour modes now mostly working

The past few days I have been fixing some bugs in the VIC-IV character generator that were causing various glitches.  Today I got the last of the important ones fixed, and so now when the C65GS boots, it looks just like a real C64:

I also had time to finish implementing the palette.  The palette is mostly like on the Commodore 65, with $D1xx holding the red component, $D2xx the green and $D3xx the blue.  These are 8-bit registers, in principle allowing for 24-bit colour, but for now the simple VGA output I am using is limited to 12-bit colour.  I have plans to make an HDMI output that would carry 24-bit colour digitally.  Here is what happens when you run LDA $D012; STA $D020; JMP *-6: you get nice solid raster bars with the default palette (which is 128 colours repeated twice).
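As a rough illustration of the register layout, the sketch below maps a palette index to its three register addresses and reduces a 24-bit entry to 12 bits. The helper names are invented, and keeping the top four bits of each component is an assumption about how the VGA output truncates, not something confirmed above.

```python
RED_BASE, GREEN_BASE, BLUE_BASE = 0xD100, 0xD200, 0xD300


def palette_registers(index):
    """Addresses of the red, green and blue registers for one palette entry."""
    if not 0 <= index <= 0xFF:
        raise ValueError("palette index must be 0-255")
    return RED_BASE + index, GREEN_BASE + index, BLUE_BASE + index


def to_12bit(red, green, blue):
    """Reduce 8-bit components to 4 bits each by keeping the top nibble (assumed)."""
    return red >> 4, green >> 4, blue >> 4


print([hex(a) for a in palette_registers(0x07)])  # ['0xd107', '0xd207', '0xd307']
print(to_12bit(0xFF, 0x80, 0x12))                 # (15, 8, 1)
```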

$D012 is the VIC-II compatibility raster counter, so it is counting every 5th raster.  If we use $D052 instead, we get the physical raster counter, which means the raster bars get much, much finer:

You really need to see a zoomed detail to see just how fine they are:

With these various other bugs and problems fixed, I was also able to test fullcolour character mode.  This tells the VIC-IV to use 64 bytes for each character instead of 8, and each byte is the colour value for a single pixel of the character.  Just flip bit 1 in $D054 on and it is engaged for all characters.   The display then changed to the following, showing the 1970s carpet stripe pattern from the default contents of RAM, as well as some horizontal gradient I POKEd in, just because I could.
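For anyone wanting to poke pixels into a fullcolour character, the address arithmetic might look like the sketch below. It assumes the 64 bytes are stored row by row, left to right, which is not stated above; `charset_base` and the function names are hypothetical.

```python
BYTES_PER_CHAR = 64  # 8x8 pixels, one colour byte per pixel


def pixel_address(charset_base, char_number, x, y):
    """Address of the colour byte for pixel (x, y) of a fullcolour character.

    Assumes row-major storage: 8 bytes for the top row, then the next row, etc.
    """
    if not (0 <= x < 8 and 0 <= y < 8):
        raise ValueError("x and y must be in the range 0-7")
    return charset_base + char_number * BYTES_PER_CHAR + y * 8 + x


def enable_fullcolour(d054_value):
    """Return the $D054 value with bit 1 switched on."""
    return d054_value | 0x02


print(hex(pixel_address(0x2000, 1, 0, 0)))  # 0x2040
print(hex(pixel_address(0x2000, 1, 7, 7)))  # 0x207f
print(enable_fullcolour(0x00))              # 2
```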

Here is the same thing with the resolution increased from 320x200 to 800x500 using the hardware scale registers ($D042, $D043). It is really 960x600 resolution if you use the border area, which is easy to do.

What is not apparent in the above is that I have also implemented four palette banks, so there is really a 1,024 colour palette available.  I will most likely make it possible to use a different palette bank for background and sprites at the same time.

Finally, here is the hardware and the monitor together, before the FPGA board gets put inside a C64 case and uses a real C64 keyboard for interaction.  There will still be plenty more to do before the FPGA design is complete, including of course sound and sprites.

Thursday, January 30, 2014

Motivations and goals for the project

The previous posts have mostly consisted of screen shots of progress on the C65GS, but until now I have not gotten around to actually describing what I intend to achieve, and what brought me to this point.

I owned a Commodore 65 prototype from 1994 until 2010, when I sold it to a collector because, among various reasons, I didn't feel that I had the resources to care for what was rapidly becoming a valuable museum piece.  During the time that I owned the C65, I did make regular use of it, wrote a few simple demos and utilities, and modified some existing C64 software to take advantage of the faster CPU.

I have also owned a C128D through that time, which I also enjoyed using.  However, I always found the C128 architecture to be rather strange and unappealing.  It really does, to me at least, feel like a hacked on C64, rather than the feeling of a new and enhanced machine that the C65 provides.

During the 1990s and 2000s I had also repeatedly thought about making a C64 accelerator using a trick I devised and tested that avoids the synchronisation with VIC-II RAM problem faced by accelerators like the SuperCPU, which either limited compatibility or the speed of acceleration possible.

During my PhD studies I learned to program in VHDL, and started thinking about implementing an accelerator in an FPGA.  However, FPGAs at the time were too slow to provide the degree of acceleration that I considered necessary to make the project worth pursuing.

My goals were to make the most powerful 8-bit computer to date by various measures:

  • Better graphics than the Apple IIgs, Atari 800 or Plus/4: 1920x1200 @ 60Hz, 256 colour palette from 4,096 colours (later from 24-bit colour palette once I create an HDMI output) via my VIC-IV video controller.
  • Better sprites than the C64.  Plan is for the 8 compatibility sprites, plus perhaps 32 256-colour Enhanced Sprites with hardware scaling and practically unlimited size.  Maximum number of displayable sprites will depend on the resolution of the display and the sprites on a given raster line.
  • Faster CPU than the SuperCPU or any available 65C816 CPU (20MHz), and ideally with enough headroom to beat a 20MHz 65C816 running in 16-bit mode.  Currently the 65GS10 runs at 96MHz, but with an effective speed more like 48MHz until I work on some planned IPC improvements, like a 16-bit cache of zero-page to make zero-page indirect instructions take as little as 3 cycles.
  • More RAM than a fully expanded Apple IIgs or C65 (~8.125MB).  It will initially have 128KB of chipram like the C65, plus 16MB of slowram, plus "some" ROM.
  • Comparable or better sound capability than the Apple IIgs.  Multiple SIDs plus digital audio channels.  Design to be finalised.

I also wanted to make the machine more backward compatible than the C65 or any 65C816 based machine.  The main issue here is actually quite easy to fix, consisting of restoring the 6502 read-modify-write behaviour of instructions like INC and ASL.  I would also like to make the machine sufficiently C65 compatible to be able to run a stock C65 ROM.

However, perfect C65 compatibility is not high on my list, given the relative lack of software available for it anyway.  In particular, I have no real intention at this stage of implementing the bit-planar graphics modes, as they were never really a good idea for an 8-bit computer, requiring way too many cycles to edit even a single pixel.

Instead, all new graphics modes (and Enhanced Sprites) are planned to really be character mode, but allowing 16-bit character sets and making characters 8x8 fully addressable pixels, i.e., requiring 64 bytes per character.  This also saves lots of RAM and CPU cycles when most of the screen is blank or repetitive.  Enhanced Sprites will be mapped in the same way, allowing reuse of graphic characters to help save chipram.

This graphics architecture helps to keep the fun in programming the system by making it non-trivial to have a full 1920x1200 image, as there is only about 10% of the chipram required to support such an image, and the slowram is too slow to supply the data, even if using the DMAgic DMA controller I intend to implement.

In short, I hope to preserve most of the fun elements of an 8-bit computer, while providing some 21st century improvements that will make the machine fun to program and use, and who knows, maybe help foster new life in the demo scene.

From a hardware perspective, I am purposely implementing it using an off-the-shelf FPGA development board designed for university students, as the boards are relatively cheap for their performance and have many built-in peripherals, like ethernet, VGA output and USB keyboard input.  This also has the significant benefit that availability will not be based on small production runs by myself or anyone else.

The design is intended to be able to be installed in a real C64 case with keyboard using either a Keyrah v2, or a custom interface PCB that I have worked on.  The custom interface PCB will likely offer datasette and IEC serial ports, and later may also provide a userport and/or expansion port, depending on some unresolved factors.

Tuesday, January 28, 2014

Debugging character display generation

Working on the character generator for the VIC-IV, I have a bug where the left edge of the character display gets two characters of junk, before beginning the real stuff.  Annoyingly, the bug only shows up on the FPGA, and not in simulation.

I don't have fancy gear to probe the internals of the FPGA here at home, nor the knowledge of how to use it, anyway. Also, as I use a Mac, it is a pain to get those tools to work in the first place.

To work around this I have added video generator debug registers that let you specify the exact pixel position on screen at which you want to sample several internal video generator state registers.  These values then get latched at that position of the frame, so that they can be read out from some other registers at one's leisure.

Of course, I want an automated means of setting the position, reading the results (one frame later of course), and then advancing the position along a raster, so that I can capture the entire sequence of events over the entire raster.

So I set about creating this.  When the debug registers are set, the VIC-IV draws a red cross-hair showing the pixel being interrogated, as can be seen below:

I then wrote a C program that talks to the board over my serial monitor interface, setting the registers, waiting a couple of frames (just to be sure), and then reading the results out from the debug registers, and rendering them in a useful way.
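The actual tool is written in C and talks to the real board, but the overall loop is simple enough to sketch in Python. Everything named here is hypothetical: the register addresses are placeholders rather than the real debug registers, and `FakeBoard` stands in for the serial monitor link so that the sketch runs on its own.

```python
import time

# Placeholder addresses only; these are NOT the real VIC-IV debug registers.
DEBUG_X_LO, DEBUG_X_HI = 0xFFF0, 0xFFF1
DEBUG_Y_LO, DEBUG_Y_HI = 0xFFF2, 0xFFF3
DEBUG_RESULT = 0xFFF4

FRAME_SECONDS = 1 / 60


class FakeBoard:
    """Stand-in for the serial monitor link: just a dict of register values."""

    def __init__(self):
        self.regs = {}

    def poke(self, address, value):
        self.regs[address] = value & 0xFF

    def peek(self, address):
        if address == DEBUG_RESULT:
            # Pretend the latched value is the low byte of the X position.
            return self.regs.get(DEBUG_X_LO, 0)
        return self.regs.get(address, 0)


def sample(board, x, y, frames_to_wait=2, sleep=time.sleep):
    """Point the cross-hair at (x, y), wait for the latch, then read the result."""
    board.poke(DEBUG_X_LO, x & 0xFF)
    board.poke(DEBUG_X_HI, x >> 8)
    board.poke(DEBUG_Y_LO, y & 0xFF)
    board.poke(DEBUG_Y_HI, y >> 8)
    sleep(frames_to_wait * FRAME_SECONDS)
    return board.peek(DEBUG_RESULT)


def scan_raster(board, y, x_start, x_end, **kwargs):
    """Collect one sample per pixel position along a single raster."""
    return [(x, sample(board, x, y, **kwargs)) for x in range(x_start, x_end)]


# Skip the real delay when talking to the fake board.
trace = scan_raster(FakeBoard(), y=100, x_start=145, x_end=150,
                    sleep=lambda seconds: None)
print(trace)  # [(145, 145), (146, 146), (147, 147), (148, 148), (149, 149)]
```

Waiting a couple of frames per sample makes a full raster scan slow, but it keeps the tool trivially simple.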

This displays information like the log at the end of this post.  It looks mostly like gobbledygook, and there is a bug that causes some wrong columns of output in this example, but the cycles_to_next_card and chargen_active and chargen_active_soon signals tell me that something is indeed going wrong at the left of each raster.  cycles_to_next_card should drop down to 1 eight cycles before chargen_active goes to 1.

In the process I realised that I was missing one other rather important signal.  So back to spending an hour or so rebuilding the FPGA program to get more information.  Then hopefully I will have what I need to fix this bug that has been hanging around for a while and spoiling the otherwise very nice looking display.

display_y=100, display_x=145, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=146, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=147, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=148, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=149, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=150, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=151, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=152, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=153, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=154, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=0
display_y=100, display_x=155, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=156, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=254, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=157, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=253, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=158, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=252, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=159, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=251, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=160, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=250, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=161, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=249, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=162, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=248, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=163, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=255, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=164, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=254, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=165, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=253, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=166, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=252, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=167, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=251, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=168, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=250, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=169, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=1, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=170, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=40, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=171, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=39, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=172, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=38, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=173, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=37, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=174, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=36, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=175, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=35, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=176, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=34, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=177, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=33, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=178, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=32, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=179, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=31, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=180, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=30, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=181, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=29, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=182, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=28, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=183, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=27, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=184, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=26, chargen_active=2, chargen_active_soon=0
display_y=100, display_x=185, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=25, chargen_active=2, chargen_active_soon=0
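A log in this format is easy to check mechanically. The sketch below parses three of the lines above and measures how many samples separate cycles_to_next_card reaching 1 from chargen_active becoming non-zero. Treating any non-zero chargen_active as "active" is my assumption (the log shows 2 rather than 1), and the function names are made up.

```python
def parse_line(line):
    """Turn 'name=value, name=value, ...' into a dict of ints."""
    fields = (item.split("=") for item in line.split(","))
    return {name.strip(): int(value) for name, value in fields}


def first_x_where(samples, predicate):
    """display_x of the first sample satisfying predicate, or None."""
    for s in samples:
        if predicate(s):
            return s["display_x"]
    return None


log = """\
display_y=100, display_x=168, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=250, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=169, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=1, chargen_active=0, chargen_active_soon=1
display_y=100, display_x=170, x_chargen_start_minus16=198890353, next_card_number=154, cycles_to_next_card=40, chargen_active=2, chargen_active_soon=0
"""

samples = [parse_line(line) for line in log.splitlines()]
reaches_one = first_x_where(samples, lambda s: s["cycles_to_next_card"] == 1)
goes_active = first_x_where(samples, lambda s: s["chargen_active"] != 0)
gap = goes_active - reaches_one
print(gap)  # 1, where the rule described above expects 8
```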

Monday, January 27, 2014

CPU now passes complete suite of 6502 official opcode tests

The title says it all really.  The CPU now passes all official opcode stress tests from Marko Makela's suite.  It currently runs at 96MHz, but with a wait state on most memory accesses the effective speed is more like 48x the original.  A previous quick and dirty test with a FOR loop in BASIC suggests 42x, assuming that the CIA timers are running at the correct rate.

As a result, the C64 ROMs start up correctly as can be seen in the image below.  The display is shifted two characters to the right, due to a bug in the VIC-IV that I am trying to fix at the moment.

Of course the above image is rather boring, being just a 40x25 display like on a normal C64.  So the following two images show the same with the horizontal and vertical scalers set to 1x instead of 5x, yielding a 200x125 character display.  As mentioned in a previous post, the repetition of the C64 character rows is because the virtual row length is still set to 40 columns, so while it reads 200 bytes to display, on the next row, it goes back to the previous row + 40.  In other words, screen lines start at $0400, $0428 etc, but with overlapping spans.
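The overlap is easy to see by working out which screen RAM addresses each row fetches. This is only an illustrative sketch with made-up names, using the 40-byte virtual row length and 200 displayed columns described above.

```python
SCREEN_BASE = 0x0400


def row_span(row, virtual_row_length=40, displayed_columns=200):
    """First and last screen RAM address fetched for one character row."""
    start = SCREEN_BASE + row * virtual_row_length
    return start, start + displayed_columns - 1


for row in range(3):
    start, end = row_span(row)
    print(f"row {row}: ${start:04X}-${end:04X}")
# row 0: $0400-$04C7
# row 1: $0428-$04EF
# row 2: $0450-$0517
```

Each row starts 40 bytes after the previous one but reads 200 bytes, so every row re-displays 160 bytes that the row above it already showed.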

Here is a detail of the top left corner of the screen, showing that all the issues with not displaying characters properly and repeating character data for multiple rows have been resolved.

Saturday, January 25, 2014

BASIC now works

Again, a quick few screen shots to show how things are moving along.

I spent the day fixing ADC and SBC bugs, as well as a few other miscellaneous CPU bugs.  Also implemented $01 CPU port for memory banking.

Things are now working well enough that BASIC works fairly well, as the following screen shots show.

First, raster bars in BASIC, which show that the CPU is MUCH faster than a real C64, even before I do a pile of optimisation work that I know needs to happen, and will give something like 2x to 3x the current figures.

Okay, so the CPU is clearly much faster than on a C64, but just how fast?  Well, let's do a quick comparison using BASIC:

52/60 of a second to count to 25,000 in BASIC.  Quite nice.  Let's see the same on a C64 and work out a back-of-envelope acceleration factor as things currently stand:

Okay, so we are 42 times faster.  That might be the answer to life, the universe and everything, but not for this CPU design.  As mentioned above there are some waitstates that I know I can hide, and also some parallel instruction fetching when running code from chip RAM and other little tricks that will push this to be 2x to 3x the current figure.  Basically I am aiming for 100x C64 speed, and see no real hurdles to achieving it.

Friday, January 24, 2014

Some screen shots

This is a hastily prepared post with a few screen shots of the C65GS display just to give you an idea of what I am working on.

The C65GS drives 1920x1200 @ 60Hz natively, with 1248 physical rasters for PAL compatibility.

By default hardware scaling of the character generator is set at 5x to render a normal 320x200 display within the borders ($D042 = $04, $D043 = $04).  Of course, the rasters are still physical, so an INC $D020, JMP *-3 loop produces very fine rasters, as you can see in the following image.
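The relationship between the scale registers and the logical resolution can be sketched as follows. Two things here are inferred rather than stated: that the scale factor is the register value plus one (consistent with $04 giving 5x), and that the area inside the borders is 1600x1000 physical pixels (320x200 times five).

```python
def scale_factor(register_value):
    """Assumed mapping: $00 means 1x, $04 means 5x."""
    return register_value + 1


def resolution(d042, d043, width=1600, height=1000):
    """Logical resolution for the given horizontal and vertical scale registers."""
    return width // scale_factor(d042), height // scale_factor(d043)


print(resolution(0x04, 0x04))  # (320, 200)
print(resolution(0x00, 0x00))  # (1600, 1000)
```

At physical resolution that works out to 200x125 characters of 8x8 pixels each.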

Sorry for the blur, my phone camera is not the best for this sort of thing, and I don't have a better way to capture the display yet.

You can also see that I still have at least one CPU bug that prevents 38911 from being printed correctly.  Looking into that.

The next image shows the detail near the ready prompt to give an idea of just how fine the rasters are.  You can also see that $D020 can be incremented quite fast in relation to the pixel clock, each colour band being only about 3 characters wide.  The pixel to CPU clock ratio is currently 2:1, instead of 8:1 on the C64.  In fact, it is possible to make them even narrower in future when I improve the IPC of the CPU (removing dead cycles from INC etc when not writing to locations that really matter, like $D019, and using the 64-bit wide chipram bus to fetch entire instructions in a single cycle).

The next two images show the character generator set to physical resolution ($D042 = $00, $D043 = $00). There is a bug that is apparent in these images, where the character generator draws the characters at physical resolution, but doesn't fetch the character number from screen RAM properly, resulting in the same character being repeated three times each.  This is fairly high on my list of things to fix.

Also in this mode you can see that the character generator doesn't naively increment the pointer to screen memory when moving to the next line.  Instead there is a virtual screen width register that decides how much to increment each line.  In this example it is still set to 40 for 40-column mode, hence the repeating.

In the lower part of the screen you can see some odd things, like underlined characters.  These are characters with C65-compatibility VIC-III extended attributes of underline, reverse, bold and blink.

Again, a zoomed in view of the corner, where you get an idea of how teeny tiny these characters are.  Even at this physical resolution the rasters aren't too wide.

Anyway, that's it for this sneak peek.  I'll explain more about the machine in a future post.