Thursday, April 21, 2016

On cycle count predictability and related things

Some folks have expressed their concern that this CPU redesign takes away from the genuine 8-bit computer feel of the MEGA65. My feeling is that it doesn't, which I will explain below, but at the same time I don't want to be dismissive of anyone's concerns. Our goal remains to make something that is authentic and enjoyable for a wide range of people to use and program. So please poke me, either in the comments or elsewhere, if you wish.

But for now, I will take a little time to explain how the CPU looks from the user's perspective, to hopefully provide some assurance that it is not really a great departure from what we already had. Indeed, from what I understand, what we are doing here is not greatly different from how the Chameleon's CPU operates, i.e., some more modern CPU construction techniques are used behind the scenes, to provide what is very much (in their case) a 6502.

The main difference is that we are being transparent about how we are making the CPU behind the scenes, so that it gives the end result of being a 6502 and 4502 compatible CPU. We're sorry if that spoils the "magic trick" for some, but we strongly believe that transparency is always best in the long run.

The out-of-order instruction retirement is just a fancy way of saying that the CPU takes and executes the instructions in order, but some can take longer to complete, for example if they need to read or write from memory.

What doesn't change is that if an instruction requires a value read from memory, it can't be completed until the thing it depends on is complete. That is, it still behaves exactly as one expects a 6502 to behave, for any given program. This is quite similar in many ways to the way that the SuperCPU has a 1-byte write-through "cache." We are just using a different mechanism (register renaming, or reservation slots, depending on how you want to look at it), but to achieve much the same goal.

So if we look at a simple loop:

l1: lda $1000,x
sta $2000,x 
inx              
bne l1         

The simulation of this loop for the new CPU (in its current unfinished form, so there might be some changes) below shows how a couple of loop iterations go through. Note that register contents are BEFORE the instruction is executed, just because of how the simulation outputs stuff.  i.e., it shows the CPU state just before it executes the instruction, instead of just after.

-- LDA / STA / INX / BNE instructions all execute on consecutive cycles, taking
-- a total of only 20ns
@450ns: PC $8104 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@455ns: PC $8107 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@460ns: PC $810A A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@465ns: PC $810B A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@525ns: PC $8104 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@530ns: PC $8107 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@535ns: PC $810A A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@540ns: PC $810B A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@600ns: PC $8104 A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10

What can basically be seen above is that the non-branching instructions all take one cycle to run, whether or not they need a memory access, because all the out-of-order retirement and register renaming hides that. The result is that the timing is actually somewhat simpler and easier to predict for the most part than on a real 6502. Note that we will still have ~1MHz, ~2MHz and ~3.5MHz speed settings, where we will emulate the normal 6502 and 4502 timing of all instructions. When we get time to do it, we will also make the memory access cycles match those of a 6502 exactly, and naturally do the same for the 3.5MHz 4502 mode. (One of the key reasons for reimplementing the CPU this time is actually to make sure it has two "personalities", where in 6502 mode, all illegal opcodes work properly, and when in 4502 mode, all 4502 opcodes work properly, and can match the timing exactly -- so that we can have a real C64 mode and a real C65 mode, both of which are as compatible as possible.)

The other obvious thing is that the branch instruction suffers a pretty big penalty, which is because the pipeline takes a bit of time to start feeding the new instructions.  However, because the clock speed is 4x, and the main pipeline is 4-stage, the end result is that the branch actually takes exactly the same amount of time as on our previous 48MHz CPU design.
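To make that arithmetic concrete, here is a rough sketch in Python. The 12-cycle figure comes from the simulation above. The 3-cycle figure for a taken branch on the 48MHz design is an assumption (it is the usual 6502 count for a taken branch without a page crossing), and the names are purely illustrative. The simulation steps in 5ns increments, which is why it shows 60ns rather than 62.5ns.

```python
# Compare the wall-clock cost of a taken branch on the old and new designs.
OLD_CLOCK_HZ = 48_000_000
NEW_CLOCK_HZ = 192_000_000

OLD_BRANCH_CYCLES = 3    # ASSUMPTION: stock 6502 taken-branch cycle count
NEW_BRANCH_CYCLES = 12   # from the simulation trace in this post


def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count at a given clock rate into nanoseconds."""
    return cycles * 1e9 / clock_hz


old_ns = cycles_to_ns(OLD_BRANCH_CYCLES, OLD_CLOCK_HZ)
new_ns = cycles_to_ns(NEW_BRANCH_CYCLES, NEW_CLOCK_HZ)

print(f"48MHz design:  {old_ns} ns per taken branch")
print(f"192MHz design: {new_ns} ns per taken branch")
```

Under those assumptions both come out the same: four times the cycles at four times the clock.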

It's also worth mentioning that most of the sources of timing uncertainty in modern PC processors etc. don't actually come from the pipeline and other features that we are talking about here. (In fact, the 6502 already pipelines between instructions a little.) They come from the cache, from virtual memory, and from the operating system that is hiding behind and pre-empting your process all the time and filling the cache with rubbish as a result. We are not having any of that stuff in the MEGA65: What you get is a 6502 or 4502 processor that behaves how you expect. We have just implemented it using some lessons learned over the past few decades of CPU implementation.

Otherwise, I think that this work has got some folks thinking about what makes a machine have character, instead of just being an 8-bit version of an otherwise soulless kind of PC, or FPGA-centric thing that people build. For us, there are some key things, of which the following are a few. Of course, we are thinking about many other things, such as C64 compatibility, but we take these simply for granted.

First, the video generation MUST be rasterised, without a frame-buffer, just like on a real C64 or C65. That is, the video chip needs to be deciding, cycle by cycle, what colour the next pixel will be, and allow the programmer to do horrible things to it that were never intended. It is already possible, for example, on the VIC-IV in the MEGA65 to cause a monitor to totally lose sync, because you can trick it into moving the HSYNC pulses on a raster line. I get back to this again below, but it is really a very important point. In fact, I would say that what really makes the C64 interesting is the VIC-II and the SID. The CPU, while still important, is really secondary in many ways. It is the custom chips and overall combination that really define the "character" and "personality" of the C64. The MEGA65 will of course have its own personality, but we still feel that it will indeed have a personality, and that it will be a very strong one.

Second, it has to still be a simple bare-metal machine, where you have effectively full access to all the hardware when you are running on it.  The only piece we have outside that is the Hypervisor, which is best understood as an integrated freeze cartridge, so that you can easily load, save and switch what you are doing.

Third, the machine must still have fundamental limitations that provide opportunity for programmers to try to stretch what the machine can do. This is why we have the combination of CPU and resolution improvements together, for example, so that CPU performance and the number of bits on screen at a time remain in reasonable relation. The C64 has 64000 pixels from not more than 64KB = 512kbit on screen at a time, and 1x10^6 cycles or 3x10^5 instructions per second, so that there is approximately one instruction per bit of displayed graphics per second. The MEGA65 has about 2x10^6 pixels, and is expected to have some multiple of 10^7 instructions per second. Thus the instructions per pixel-bit is increased by an order of magnitude over the C64, so that it offers a nice bit of extra freedom, but without removing the limitation completely. (Compare that with a modern PC, which instead has about 10^10 instructions per second, not counting the 10^12 or more GPU instructions per second.) Moreover, the number of bits per pixel available from RAM is still in proportion: A C64 has about 8 bits per pixel available (64KB / 64000 pixels). The C65 actually has less, because while it has 128KB of RAM, it can do, for example, 640x400 or 1280x400 resolutions. The MEGA65 goes further: it has the same RAM as the C65 but many more pixels, so much more creativity will be required to find solutions to having full screen full-colour displays -- just as this presents special challenges and opportunities for ingenuity on the C64 (and C65). My point here is really that while the boundaries of what is "possible" on the MEGA65 are naturally different to those of the C64, we have retained this sense of a limited computer, so that it still has character, and will still require years of careful thought and experimentation to find its limits.
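For anyone who wants to check the bits-per-pixel comparison, here is a small Python sketch using only the figures quoted above. The function and variable names are just for illustration, and the MEGA65 pixel count is the rounded "about 2x10^6" figure rather than an exact video mode.

```python
KB = 1024  # bytes


def bits_per_pixel(ram_bytes, pixels):
    """RAM bits available per displayed pixel if all RAM were used for video."""
    return ram_bytes * 8 / pixels


c64 = bits_per_pixel(64 * KB, 64_000)
c65_640 = bits_per_pixel(128 * KB, 640 * 400)
c65_1280 = bits_per_pixel(128 * KB, 1280 * 400)
mega65 = bits_per_pixel(128 * KB, 2_000_000)  # rounded pixel count from the text

for name, value in [
    ("C64", c64),
    ("C65 640x400", c65_640),
    ("C65 1280x400", c65_1280),
    ("MEGA65 (approx.)", mega65),
]:
    print(f"{name:18s} {value:.2f} bits per pixel")
```

The trend is the point rather than the exact decimals: each step up in resolution leaves fewer RAM bits per pixel to play with.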

Finally, the specification has to be fixed for the long-term, like the C64's, so that people can program it with confidence, knowing that their code will "just work" on MEGA65s for years and decades to come, because otherwise the limits of the machine are not real. This is actually why we want to get this CPU matter sorted out sooner rather than later, so that we can say with authority, "This is the CPU of the MEGA65. It shall be no faster." Similarly, we want to pin down the last few points on the VIC-IV.

Anyway, as I have said, we want this machine to be fun for the community, and something with a stable and fixed specification once we release it, so that it can have a long life, including so that stuff that you write with cycle-by-cycle timing will keep on working.  So please don't hesitate to let us know if you have concerns about our approach, or suggestions how we can do it better. This is one of the great things about an open-source project, that people can look and provide feedback and help to make sure that the end result is as good as possible. We can't guarantee that we can take everyone's requests and include them (partly because some of you ask for opposite things ;), but we do listen and think carefully about them all.

Tuesday, April 19, 2016

Overview of the new CPU design (as of today)

This is totally subject to change without notice, but the following gives an overview of the design of the new CPU:

The gs4502b will, at this stage, be a pipelined, triple-core, out-of-order instruction retirement, register-renaming processor with parallel instruction pre-fetch buffer and self-modifying code hazard avoidance.  

We also plan for it to run at 192MHz.

Okay, that's all fairly technical gobbledygook, so let's break it down:

First, for the most part, the features of the processor will increase the rate at which instructions can be processed, i.e., decrease the number of cycles that many instructions will take. The main exception is that, because it is pipelined, some instructions may take more cycles in some cases, in particular branches or self-modifying code, because the pipeline will have to flush. However, the increase in clock speed means that it will be extremely hard to craft a set of instructions that run slower on this processor than on our existing 48MHz one.

Now moving to the specific features:

A pipelined processor is one that breaks instructions down into separate little bits, like reading the instruction, decoding the instruction, actually doing the instruction, and writing back any results to memory, and so on.  One instruction can be doing each of those things at any point in time, so while an individual instruction takes longer, the number of instructions per cycle doesn't have to drop.  The big advantage is that a pipeline usually allows the clock speed to be increased -- this is exactly why we are employing one on the new 4502b.

The down side to a pipeline is that if the pipeline has to be flushed, it takes a while for it to start executing instructions again.  This was the problem with the Pentium 4 processors that used crazy pipelines to push the clock-speed way high, but didn't have enough cache memory to sustain the pipeline, meaning that actual performance was often quite poor. However, on the MEGA65, the CPU is effectively operating from cache the whole time, as the BRAM we use for the main memory is internal to the FPGA, and can be accessed as fast as the cache on a typical processor.
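The trade-off can be put into a textbook-style toy model. This is only an idealised sketch, not a description of how the gs4502b actually schedules things: it assumes a 4-stage pipeline that accepts one instruction per cycle, and the flush count and flush penalty are hypothetical parameters you supply yourself.

```python
def unpipelined_cycles(n_instructions, stages=4):
    """Every instruction passes through all stages before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=4, flushes=0, flush_penalty=0):
    """Idealised pipeline: fill once, then complete one instruction per cycle.

    Each flush (e.g. a taken branch) adds flush_penalty extra cycles.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + flushes * flush_penalty


n = 1000
print("unpipelined:         ", unpipelined_cycles(n))
print("pipelined, no flush: ", pipelined_cycles(n))
print("pipelined, 10 flushes:", pipelined_cycles(n, flushes=10, flush_penalty=12))
```

Even with a handful of expensive flushes, the pipelined total stays far below the unpipelined one, which is why the flush penalty is usually a price worth paying.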

The 4502b will also be triple core. The first core will be the "CPU", and the 2nd and 3rd cores will be primarily for floppy drive emulation.  However, when you don't want or need to emulate a floppy drive, they will be available for use by the programmer.  Also, at this stage, the cores will be able to be set in two different performance modes: In the one mode, the primary core gets priority, so that it can run as fast as possible.  In the other mode, all cores will share the memory bus more fairly, and so while the first core will likely still run fast, it won't be as fast as in the first mode, but this will be offset in most cases by the increased performance of the 2nd and 3rd cores. Of course, this will require software that is designed to take advantage of the extra cores, of which none currently exists -- although I did write some dual-processor 6502 code back in the 1990s, but that's a story for another day.

The CPU will also support out-of-order instruction retirement. This means that while instructions will start executing in the correct order, quick instructions will be allowed to finish while slower ones continue in the background. This will allow more instructions to be processed per unit time, by reducing the amount of time the CPU sits blocked waiting for memory accesses to complete. In particular, memory reads and writes will continue in the background, without blocking the CPU from executing new instructions, unless the new instructions depend on the results of the old instructions. For example, if we have LDA $1234 followed by ADC $3456, the ADC would normally need to wait for the LDA to finish so that we can have the result ready to use as input to the ADC instruction. However, even then, it will sometimes be possible to continue processing, where we can easily predict where the result will come from, as in this example, by using register renaming.

Register renaming is a fancy trick, where we can have multiple versions of a register at the same time. Using the example from above, we can say that one version of the accumulator register will get its value from location $1234. Then when we want to use ADC to calculate based on that renamed register, we can tell the appropriate part of the CPU that the input to ADC is in fact the output from the previous instruction, by giving it the name that the result will have. If that all sounds crazily complicated, don't worry too much. Just understand that it helps the CPU to go a lot faster, especially when there are a lot of memory accesses. For those interested, Wikipedia has a good page on this.
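If it helps, the naming idea can be sketched in a few lines of Python. This is a toy model only, not how the gs4502b does it in hardware: the `rename` function, the tuple format and the "A1", "A2" style names are all made up for illustration, and it ignores the processor flags (a real ADC also reads the carry flag).

```python
def rename(program, registers=("A", "X", "Y", "Z")):
    """Give every register write a fresh versioned name.

    program is a list of (text, source_registers, destination_register).
    Returns a list of (text, names_read, name_written).
    """
    version = {reg: 0 for reg in registers}
    renamed = []
    for text, sources, dest in program:
        # Inputs refer to whichever version of each register is current.
        reads = tuple(f"{reg}{version[reg]}" for reg in sources)
        writes = None
        if dest is not None:
            # Each write creates a new version rather than overwriting.
            version[dest] += 1
            writes = f"{dest}{version[dest]}"
        renamed.append((text, reads, writes))
    return renamed


program = [
    ("LDA $1234", (), "A"),      # writes a new version of A
    ("ADC $3456", ("A",), "A"),  # reads that version, writes another
]

for text, reads, writes in rename(program):
    print(f"{text:10s} reads {reads} writes {writes}")
```

The ADC is told that its input is "A1", i.e. whatever the LDA eventually produces, so it can be queued up before the value has actually arrived from memory.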

The CPU will also have a parallel instruction pre-fetch buffer. This is really a simple little thing that holds the upcoming 16 instruction bytes, and allows an entire instruction to be dispatched every single cycle, unless the buffer is empty, or the CPU pipeline stalls. This means that instructions that used to take up to 7 cycles on a regular 6502 can sometimes be executed in just one cycle*!

Of course the asterisk is there, because there can be a lot of reasons why this might not happen in practice. But in theory, the new CPU will be able to execute 192 million instructions per second. This compares with the approximately 10 - 20 million instructions per second that the existing 48MHz CPU can achieve, and of course looks quite absurd next to the ~250,000 - 300,000 instructions that a real C64 could execute per second. And that is using just one core on the 4502b. The theoretical peak performance will be 576 million instructions per second, although, as anyone who knows CPU benchmarks will know, the reality might be only 10% - 50% of that figure. Nonetheless, that is still very, very fast for an 8-bit CPU.
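As a back-of-the-envelope check of those figures, here is the arithmetic in Python. It assumes the best case of one instruction per cycle per core, and simply applies the 10% - 50% range quoted above. The variable names are illustrative only.

```python
CLOCK_HZ = 192_000_000
CORES = 3

# Best case: every core dispatches one instruction on every cycle.
peak_one_core = CLOCK_HZ
peak_all_cores = CLOCK_HZ * CORES

# The 10% - 50% "reality" range from the text, applied to the 3-core peak.
realistic_low = peak_all_cores * 10 // 100
realistic_high = peak_all_cores * 50 // 100

print(f"peak, one core:   {peak_one_core:,} instructions/s")
print(f"peak, all cores:  {peak_all_cores:,} instructions/s")
print(f"realistic range:  {realistic_low:,} to {realistic_high:,} instructions/s")
```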

Finally, to make sure that all existing software can run on it, the CPU will include self-modifying code hazard avoidance. This is just a fancy way of saying that the CPU will realise when a program modifies itself, and flush the pipeline whenever it needs to. The only trade-off is that code that modifies itself might suffer a penalty of about 10 - 20 cycles each time it modifies itself. Of course, at 192MHz, that is still less than 105 nanoseconds. That is, a worst-case pipeline stall on this processor will stall the CPU for only about 1/10th of the duration of a single cycle on a 1MHz 6502.
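The numbers in that last claim can be checked with a short Python sketch. It takes the upper end of the 10 - 20 cycle estimate as the worst case and treats a 1MHz 6502 cycle as exactly 1000ns, which is a slight simplification since a real C64 clock is only approximately 1MHz.

```python
CLOCK_HZ = 192_000_000
WORST_STALL_CYCLES = 20   # upper end of the 10 - 20 cycle estimate
C64_CYCLE_NS = 1_000.0    # SIMPLIFICATION: exactly 1MHz

stall_ns = WORST_STALL_CYCLES * 1e9 / CLOCK_HZ
fraction_of_c64_cycle = stall_ns / C64_CYCLE_NS

print(f"worst-case stall: {stall_ns:.1f} ns")
print(f"as a fraction of one 1MHz cycle: {fraction_of_c64_cycle:.3f}")
```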

So anyway, that's the current thinking on this processor, and when I get the chance, I will provide an update on how far along the implementation currently is, and give some tentative simulation results to give an idea of how fast the processor might end up in practice.

Planning for stability, and the case for re-working the CPU

One of the things that we are determined to do with the MEGA65, is to have the core functions of the machine ready and stable from the outset.

In particular, we want people to be able to have dependable cycle timing for the CPU and VIC-IV, so far as is possible, so that people can safely write games and demos for it, without worrying about future updates breaking things.

The problem at the moment is that our existing 48MHz CPU doesn't meet timing closure, i.e., it is too fast for what the FPGA can guarantee will work, and it's also much too big: It takes up somewhere between 15% and 30% of the very large FPGA we are using. This is a Bad Thing. Especially since we don't yet have a CPU core for an emulated floppy drive, and we would really like to be able to emulate two floppy drives at the same time, so we would need 3 CPUs. Of course the floppy drive CPUs can be 6502s instead of 4502s, which simplifies things a bit, but it was still running the risk of being much too big.

Also, the current CPU has a couple of weird bugs that are proving hard to track down, because the existing CPU has been built by accretion, as I have realised things that need to be in it.

So, while on the one hand, it feels like we are going backwards in the short-term, I have started implementing an all new CPU, that will be much smaller, will meet timing closure, and will generally be simpler and easier to understand, and therefore to debug.

This will in fact be the 3rd or 4th CPU design for the MEGA65, depending on how you count things, and will also incorporate what I have learnt through that process, and also some other modern CPU features that I have been reading up on.  The net result is that the new CPU should be quite a lot faster than the current design, but you will have to wait for future blog posts to find out how fast, because even I don't yet know how fast it will end up being.

So expect a few blog posts over the coming days and weeks as I go through the design of the CPU, and document the process of getting it to work.

Sunday, April 17, 2016

We've entered the MEGA65 into this year's Hackaday contest

Hop over to the link below and take a look, and if you like, help us spread the word:

https://hackaday.io/project/11096-mega65-open-8-bit-computer

Meanwhile, we have not been idle in the background, but have some fun progress to report as soon as I get the chance to write about it.