Thursday 31 July 2014

Ethernet transmission

In the last post I described how receiving ethernet frames was working.

Transmitting ethernet frames looked like it should follow a fairly similar process, so I spent a little time working on this.

I have structured it so that full-duplex operation is possible, with separate state machines controlling transmission and reception.

The first step was to see if I could cause anything to come out of the ethernet port at all.  Here I hit a bit of a problem, because all modern ethernet cards automatically check the ethernet Frame Check Sequence (FCS), and discard frames that are bad.  This meant that I would need to implement the FCS, and form completely valid ethernet frames first-up.

I much prefer to be able to make incremental advances, knowing that I have addressed particular steps as I go along, so I really wanted something that would let me receive invalid ethernet frames.  Then it dawned on me that the solution was to connect two C65GS's together, since the ethernet receive side doesn't (yet) check the FCS or pretty much anything else about the frames when receiving them.


This also means that I can check things without having to worry about one end feeding crazy bonjour packets all the time.

This let me quickly confirm that in fact nothing was coming out of the ethernet port with my first attempt.

I took a guess that the transmit side might not start transmitting if you don't immediately start with the ethernet preamble code.  After fixing that, and with back-to-back Nexys4 boards running as C65GSs, I was able to cause simple frames to be sent from one to the other.  I just stuffed a sequence of bytes into the transmit buffer at $FFDE800, then set the frame length in $FFDE043 - $FFDE044, and then wrote $01 to $FFDE045, and voila, the frame was sent to the other side with a reassuring blink on the ethernet LED:

.sffde043 ff 00 01
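For anyone wanting to repeat this with different frame lengths, the monitor command can be generated rather than typed by hand.  This is only a sketch in Python: the helper name is made up for illustration, and the low-byte-first order of the length is inferred from the command shown above rather than from any documentation.

```python
# Hypothetical helper: builds the serial monitor "s" (set memory) command
# that writes the frame length and then the transmit trigger.
TX_LENGTH_REG = 0xFFDE043   # length occupies $FFDE043 - $FFDE044
TX_TRIGGER = 0x01           # value written to $FFDE045 to send the frame

def tx_command(frame_length):
    """Return the monitor command that sets the length and starts sending."""
    if not 0 <= frame_length <= 0xFFFF:
        raise ValueError("frame length must fit in 16 bits")
    low = frame_length & 0xFF          # assumed: low byte goes first
    high = frame_length >> 8
    return "s%x %02x %02x %02x" % (TX_LENGTH_REG, low, high, TX_TRIGGER)

print(tx_command(255))   # the 255 byte frame used in this test
```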

You can see the momentous frame as received at the other end here.  Sadly no "Watson, come here" or "One small step" here.

.Mffde800                                                       
 :FFDE800 FE 00 5C C2 76 86 0A 0B 0D 0D 0E 0F 07 08 09 01
 :FFDE810 02 03 04 05 07 07 08 09 02 02 03 04 05 06 00 00
 :FFDE820 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE830 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE850 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE880 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE900 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The first two bytes are the received frame length (I can see that there is an out-by-one error here somewhere, my guess is on the transmit side), then the next four are the FCS as I calculate it as the packet is received, then the packet data starts after, from 0A onwards.  I think I am calculating the FCS incorrectly, as it never matches that of any real received frame sent by an actual computer, so I will need to look into that.
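As a reference point for chasing the FCS mismatch, here is a bit-at-a-time sketch of the standard ethernet CRC-32 in Python.  It is not the VHDL in the C65GS, just a way to produce known-good values to compare against.  The usual places to go wrong are the all-ones initial value, the final inversion, and the bit order.  The function names are my own.

```python
import zlib

def ethernet_fcs(frame):
    """CRC-32 over the frame, from destination MAC to the end of the payload."""
    crc = 0xFFFFFFFF                          # initial value is all ones
    for byte in frame:
        crc ^= byte
        for _ in range(8):                    # bits are processed LSB first
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320 # bit-reversed form of 0x04C11DB7
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                   # final inversion

def fcs_bytes(frame):
    """The four FCS bytes in the order they follow the payload on the wire."""
    return ethernet_fcs(frame).to_bytes(4, "little")

# Sanity checks: the standard CRC-32 check value, and agreement with zlib.
assert ethernet_fcs(b"123456789") == 0xCBF43926
assert ethernet_fcs(b"123456789") == zlib.crc32(b"123456789")
```

The preamble and start-of-frame delimiter are not included in the calculation, which is another thing worth checking on the receive side.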

But anyway, this is good progress.

Wednesday 30 July 2014

Starting work on the ethernet adapter

As I mentioned in an earlier post, we have a student working on the ethernet controller for the C65GS using the on-board 10/100Mbit ethernet adapter on the Nexys4 board.

We spent a bit of time yesterday understanding how it works, and it is very pleasing that less than 24 hours later I was able to receive this ethernet frame with the C65GS connected by ethernet to my Mac:

 :FFDE800 FF FF FF FF FF FF C8 2A 14 08 DA E2 08 00 45 00 ................
 :FFDE810 01 4E 9C 70 00 00 FF 11 FC DC A9 FE CD 54 A9 FE ................
 :FFDE820 FF FF EB 5D 13 8A 01 3A 33 D6 44 52 49 4E 45 54 ..........DRINET
 :FFDE830 54 4D A9 FE CD 54 C0 0E 00 00 00 3E 39 63 30 31 TM..............
 :FFDE840 61 38 63 30 2D 31 36 37 36 30 39 38 38 39 31 00 ................
 :FFDE850 00 00 00 00 00 00 00 00 00 00 00 00 69 71 6E 2E ............iqn.
 :FFDE860 31 39 39 35 2D 31 32 2E 63 6F 6D 2E 61 74 74 6F 1995.12.com.atto
 :FFDE870 74 65 63 68 3A 78 74 65 6E 64 73 61 6E 3A 73 65 tech:xtendsan:se
 :FFDE880 72 2E 63 30 32 66 38 72 6D 78 64 68 32 68 0A 00 r.c02f8rmxdh2h..
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 

Clearly this is not a very interesting ethernet frame for our purposes, but what is clear is that it is locking onto the frame preamble and receiving the bits and putting them together all correctly.
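One way to check that the bits really are being assembled correctly is to decode the start of the dump above and verify the IPv4 header checksum.  The following Python sketch does that for the first 34 bytes (ethernet header plus IPv4 header); the helper names are illustrative only.

```python
import struct

# First 34 bytes of the received frame shown above.
FRAME = bytes.fromhex(
    "FFFFFFFFFFFF" "C82A1408DAE2" "0800"
    "4500014E9C700000FF11FCDCA9FECD54A9FEFFFF"
)

def mac(raw):
    return ":".join("%02X" % b for b in raw)

def ipv4(raw):
    return ".".join(str(b) for b in raw)

def parse(frame):
    """Decode the ethernet header and a 20 byte IPv4 header."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    header = frame[14:34]
    # One's complement sum of the ten 16-bit header words, checksum included.
    total = sum(struct.unpack("!10H", header))
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)
    return {
        "dst": mac(dst),
        "src": mac(src),
        "ethertype": ethertype,
        "total_length": struct.unpack("!H", header[2:4])[0],
        "protocol": header[9],
        "ip_src": ipv4(header[12:16]),
        "ip_dst": ipv4(header[16:20]),
        "checksum_ok": total == 0xFFFF,   # a valid header sums to all ones
    }

for key, value in parse(FRAME).items():
    print(key, value)
```

This decodes as a broadcast IPv4 (ethertype $0800) UDP (protocol 17) packet between link-local 169.254.x.x addresses, and the header checksum verifies, which is reassuring given that the receiver does no checking of its own yet.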

There is no ethernet checksum being performed yet, and it isn't possible to send frames, either.  These are things I will likely work on when I get a chance.

In the meantime the student has been joined by another student, who will work on writing some test software for the ethernet controller.

Once it is all working, then we will look at adding RR.net emulation registers so that existing software can use the ethernet interface.

Why hardware is hard

I have finally found and fixed the problem that was stopping most (but not all) CIA IRQs in the redesigned CPU.  The result was the cursor would blink very slowly.

I had bashed away for several days on the CIA, to try to figure out what was going wrong, with no luck.  I wasn't able to reproduce the problem in simulation, so I was pulling my hair out.

So I progressively added more and more instrumentation that revealed that the CIA was resetting the interrupt status register, and that it seemed to be doing so at the request of the CPU.  This should only happen if the CPU is reading from $DC0D.

It was about then that the realisation dawned on me that because the memory controller on my CPU has separate channels for RAM, IO and other types of memory, it is possible for the IO bus to still be presenting the instruction to read from the last accessed IO address.  Indeed $DC0D is the last IO address touched in the C64's IRQ routine, and thus the problem.

This also explains why the problem would go away if I accessed any IO-mapped memory, even from the serial monitor.  In retrospect, I should probably have reflected on this a little more deeply, especially the fact that the CIA design hadn't changed, but my CPU design HAD changed since it was last all working.
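The failure mode is easier to see in a toy model.  The Python sketch below is not the VHDL, and the class and function names are invented for illustration; it only assumes what is described above: the CIA clears its interrupt status whenever it sees a read of $DC0D, and the buggy IO bus keeps presenting the last read until something else replaces it.

```python
ICR = 0xDC0D   # CIA interrupt status register, cleared by reading it

class Cia:
    def __init__(self):
        self.icr = 0

    def timer_underflow(self):
        self.icr |= 0x81                 # flag a pending interrupt

    def clock(self, io_address, io_read):
        """Called every cycle with whatever the IO bus is presenting."""
        if io_read and io_address == ICR:
            self.icr = 0                 # a read acknowledges the interrupt

def interrupt_survives(hold_read_strobe, touch_other_io=False):
    """True if the interrupt is still pending after the bus has sat idle."""
    cia = Cia()
    cia.clock(ICR, True)                 # end of IRQ routine: a genuine read
    io_address, io_read = ICR, hold_read_strobe
    if touch_other_io:
        io_address = 0xD020              # any other IO access moves the bus on
    for cycle in range(100):             # CPU is now running from RAM
        if cycle == 50:
            cia.timer_underflow()
        cia.clock(io_address, io_read)
    return cia.icr != 0

print(interrupt_survives(hold_read_strobe=True))    # buggy bus
print(interrupt_survives(hold_read_strobe=False))   # read lasts one cycle only
```

With the strobe held, the interrupt is wiped out in the same cycle it is raised; touching any other IO address has the same curative effect as fixing the strobe.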

Anyway, this was a reminder of why hardware design is hard: it does exactly what you ask it, and keeps on doing it until you ask it to stop.  Getting that right all of the time takes a lot of careful attention and testing.

Sunday 27 July 2014

First speed test of 48MHz CPU

I am continuing to fight with getting the reimplemented CPU and VIC-IV all settled down, however things are getting much closer.

As the following image implies, it can run the C64 ROM (not yet C65 ROM -- it gets stuck in the DOS somewhere).  It should be noted that the CPU performance here is not final, and some instructions might end up faster or slower than depicted here.  That said, the CPU is certainly quite a bit faster than the old 32MHz one.


44.36x is almost exactly 48MHz/32MHz = 150% the speed of the old CPU at 28.93x.  Pleasingly this is before I do anything to optimise the performance.  Also, whereas the old CPU filled the FPGA to capacity, with the new CPU about two-thirds of the FPGA remains free -- space for implementing sprites, a 1541 and other goodies as I get the chance.
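The comparison is simple enough to check directly.  A quick Python calculation using only the figures quoted above:

```python
old_clock_mhz, new_clock_mhz = 32, 48
old_score, new_score = 28.93, 44.36       # speed relative to a stock C64

clock_ratio = new_clock_mhz / old_clock_mhz
score_ratio = new_score / old_score

print("clock ratio %.3f" % clock_ratio)
print("score ratio %.3f" % score_ratio)
```

The score ratio comes out a couple of percent above the clock ratio, so per clock the new CPU is doing very slightly better than the old one on this benchmark.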

Speaking of optimisations, one that I may attack in the not too distant future is a stack cache that allows RTS to execute in just 1 cycle.  While it will make some impact on the SynthMark64 score, it is more interesting for real-life workloads where JSR/RTS are very common instructions.  But otherwise there is no caching anywhere in this -- it is all raw, predictable cycle times, which helps make it feel like a simple 8-bit computer, albeit a very fast one.

What isn't entirely obvious here is that keyboard input has broken for some reason, with my PS/2 keyboard reader failing to detect key-release events.  Also, for some reason the CIA interrupts are not always happening as often as they should.  Combined with not being able to use the C65 ROM, this means I had to side-load SynthMark64 via the serial monitor, start it directly from the serial monitor, and then use the serial-monitor to stuff the ENTER key-press into the keyboard buffer.  So there is still a bit to go, but at least it feels like I am getting somewhere.

Saturday 26 July 2014

Debugging the new CPU

Debugging of the re-implemented CPU continues, and is hopefully close to complete.  I thought I would describe some of the process I have followed.

The real secret to debugging anything is discoverability, that is having the means to work out what is happening so that you can tell not only whether the result is correct, but how the result is being calculated.

With a hardware design this can be rather annoying, because the time it takes to make a trivial change, resynthesise and test the design can be of the order of an hour.  Assuming that you correctly expose the thing you are trying to debug, that means you can examine (and hopefully fix) at most a dozen or so defects per full day of effort. Not Good.

Fortunately there are simulation tools for VHDL that let you debug without having to go through the whole synthesis process, and thus reduce the time to examine a defect from hours to minutes.  While this has limitations, for example, to debug the SD card interface in simulation I would need to write an SD card simulator, it is extremely useful, and I have made extensive use of the free and open-source simulation tool, ghdl.

The processor redesign basically consisted of gutting out the first implementation of the CPU and leaving just the shell that accesses the memory and interfaces with the serial monitor, which I have described in previous posts.  The serial monitor is extremely useful, because it allows reading and writing of all memory, as well as examining the processor state, and single-stepping the processor.

The first part was to re-do the serial monitor interface, because this needed an overhaul for the new processor architecture.  This was rather tricky, because simulating a serial connection feeding various commands in would take a fair bit of work, and the time scales of serial input mean that simulation would be rather slow anyway.  So as a result I used some of the LEDs on the FPGA board to provide some useful debugging output, and worked as carefully as I could to make sure that the code was likely to work.

The second and related step was getting the memory access stuff working again, and accessible via the reworked serial monitor interface.

These two steps took much longer than I had hoped, and were really frustrating.  In retrospect, it might well have been easier to make a simulator for serial input and use ghdl simulation to shorten the process a bit.

After this, I set about implementing a few simple instructions so that I could get single-stepping of the CPU through the serial monitor working.  This also turned out to take way longer than I would have liked, partly because the new CPU architecture uses 6502-style end-of-instruction pipelining which really complicated single-stepping.  I did get it working in the end.

Then it was on to implementing LDA, STA, JMP and a few other instructions to allow the writing of simple little test programs to confirm that the CPU was generally working.  At this point ghdl was useful to allow quick testing of the instructions and their interactions.

In the process of doing this, I realised that the debug output I was producing in ghdl was not as good as it could be.  Basically I was looking at hexadecimal instruction bytes and trying to decide if it was right or not.

It would be much easier to debug if I could get ghdl to show full instruction disassemblies as well, so instead of just seeing 8D 0D DC, it would also show STA $DC0D.  Also, it would help enormously to know what memory access was happening each cycle, so that I could get an idea of exactly where an instruction was going astray.
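The disassembly part is essentially a table lookup on the opcode byte.  The real thing lives in the VHDL, but the idea can be sketched in a few lines of Python.  Only a handful of opcodes are included here, and the table layout and function name are my own for illustration.

```python
# opcode -> (mnemonic, operand format, instruction length in bytes)
OPCODES = {
    0xA9: ("lda", "#$%02X", 2),    # LDA immediate
    0x85: ("sta", "$%02X", 2),     # STA zero page
    0x8D: ("sta", "$%04X", 3),     # STA absolute
    0x4C: ("jmp", "$%04X", 3),     # JMP absolute
}

def disassemble(memory, pc):
    """Return (text, length) for the instruction at memory[pc]."""
    opcode = memory[pc]
    if opcode not in OPCODES:
        return "$%04X %02X        ???" % (pc, opcode), 1
    mnemonic, operand_format, length = OPCODES[opcode]
    raw = memory[pc:pc + length]
    operand = int.from_bytes(raw[1:], "little")   # operands are low byte first
    hex_bytes = " ".join("%02X" % b for b in raw)
    text = "$%04X %-8s  %-4s %s" % (pc, hex_bytes, mnemonic,
                                    operand_format % operand)
    return text, length

program = bytes([0xA9, 0x00, 0x8D, 0x0D, 0xDC])
pc = 0
while pc < len(program):
    text, length = disassemble(program, pc)
    print(text)
    pc += length
```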

I finally had time to implement this during the week, and now I can easily get output like:

MEMORY reading $FFFF654 = $A9
MEMORY reading $FFFF655 = $00
MEMORY reading $FFFF656 = $85
$F654 A9 00     lda  #$00          A:00 X:22 Y:33 Z:00 SP:01FF P:26 $01=3F  ..E-.IZ.  
MEMORY reading $FFFF657 = $20
MEMORY reading $FFFF658 = $A9
MEMORY reading $FFFF658 = $A9
MEMORY writing $0000020 <= $00
$F656 85 20     sta  $20           A:00 X:22 Y:33 Z:00 SP:01FF P:26 $01=3F  ..E-.IZ.  
MEMORY reading $FFFF659 = $91
MEMORY reading $FFFF65A = $91
MEMORY reading $FFFF65A = $85
$F658 A9 91     lda  #$91          A:91 X:22 Y:33 Z:00 SP:01FF P:A4 $01=3F  N.E-.I..

Actually the output has a little more information in it, but the above gives you an idea.

We can see a few things from this output.

First, the instructions seem to work, as we see the right values end up in the accumulator, and the correct value being written to the write address.

Second, we can see that there is a dummy read in STA, which is part of the design that allows 48MHz operation.  So for some instructions at least, we don't expect 48x performance.  Some of these might get improved down the track, but some penalty cycles will have to remain.

Third, we can see the 6502-style pre-fetching of the next instruction while the previous instruction is finishing off.

Armed with the ability to produce this kind of trace, I used the TTL6502 test program for 6502 processors, and by examining the simulation output was able to quickly find and fix quite a number of bugs.

The TTL6502 program only tests the original 6502 instructions, not any of the 4502 extensions.  So I have followed a bit of an ad-hoc process of writing little programs that use each of the new instructions, and verifying from the memory trace, register and flag values that all is well.  This has also turned up a great many bugs.

This is more or less where I am at now, fixing bugs with PHW (push word, immediate or absolute) and a few other remaining instructions.  Once that is done, we should hopefully be back to being able to boot the C65 ROM into C64 mode, and then soon after running SynthMark64 to get an idea of the speed of the new CPU.

Tuesday 22 July 2014

Improved hardware scaler

Previously the pixel scaler for the VIC-IV allowed logical pixels to be any integer number of physical pixels in both X and Y directions.  This struck me as a little inadequate.  For example, for 80 column mode it meant either no side borders or enormous side borders.

So I have replaced these simple integer counters with fixed point counters that are used as divisors for the width of pixels. A value of 1 means that logical pixels will be 128 pixels wide, while a value of 255 means that logical pixels will be 1/2 a physical pixel wide.  This allows zoom factors from 0.5x to 128x, with very fine granularity at the smaller end of the range.  The following frame shows the smaller end of the range and just how fine the graduation is.  You really need to click on the image and zoom it in to see what is going on.  Memory is not uniformly initialised, hence the different textures that can be seen.



There is no sub-pixel sampling, so there will be aliasing effects.  Nonetheless, the result is much more flexible than it was previously.  When I get a moment I will adjust the 80-column display modes to use this with 2.5 physical pixels per logical pixel, so that the borders don't move when switching to 80 column mode -- unless of course the result looks too silly with the mix of fat and skinny pixels.
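To get a feel for the numbers, here is a small Python sketch of the relationship between the scaler value and the logical pixel width.  The formula (width = 128 / value) is inferred from the two end points quoted above, so treat it as an assumption about the hardware rather than a description of it; the function names are illustrative.

```python
def logical_pixel_width(scale):
    """Physical pixels per logical pixel for a scaler value of 1..255."""
    if not 1 <= scale <= 255:
        raise ValueError("scaler value must be between 1 and 255")
    return 128 / scale

def scale_for_width(width):
    """Nearest scaler value for a desired logical pixel width."""
    return max(1, min(255, round(128 / width)))

print(logical_pixel_width(1))      # widest setting
print(logical_pixel_width(255))    # narrowest setting, about half a pixel
value = scale_for_width(2.5)       # the 80-column case mentioned above
print(value, logical_pixel_width(value))
```

Under this assumption the closest available setting to 2.5 physical pixels is a value of 51, which gives pixels very slightly wider than 2.5.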

Thursday 17 July 2014

More work on new CPU, and some very skinny raster stripes

It still isn't very exciting to look at right now, but the CPU is getting closer to working properly.

I have found and fixed a bug in the TRB (Test Reset Bit) instruction. This is a handy little instruction for clearing bits in a byte.  For example, LDA #$01 / TRB $D030 will clear bit 0 in $D030, which on a C65 will bank out the second kilo-byte of colour RAM from $DC00 - $DFFF so that you can see the CIAs again.  The correct calculation for the result is (memory and (not A)), but I had (memory and A), which has the effect of resetting all of the bits except the one(s) you wanted reset.  Needless to say that wasn't working too well.
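The difference between the two calculations is easy to demonstrate.  A Python sketch of just the value written back to memory (flags are ignored here, and the function names are mine):

```python
def trb_correct(memory, a):
    """Clear in memory every bit that is set in A."""
    return memory & ~a & 0xFF

def trb_buggy(memory, a):
    """What the CPU was doing: keeps only the bits that should be cleared."""
    return memory & a

memory, a = 0b01100101, 0b00000001     # example value with bit 0 set
print("%02X" % trb_correct(memory, a)) # bit 0 cleared, everything else kept
print("%02X" % trb_buggy(memory, a))   # only bit 0 survives
```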

I also fixed some bugs with IO mapping.  In particular, the SD card controller is visible to the CPU again, and Kickstart even gets as far as loading the master boot record from the SD card.  There does seem to be an out-by-one error with the buffer addresses, such that the whole sector is rotated by one.

Here is Kickstart finding the SD card at 48MHz:


That looked a bit boring, so I wrote a little loop to do some raster effects:

This is the little loop:

loop     LDA $D052   ; VIC-IV physical raster line low bits (range 0 -  1199)
         CMP $D052
         BEQ *-3
         INC $D020
         DEC $D020
         JMP loop

The loop should increment and decrement $D020 just once at the start of each raster line. However it looks like the compare instructions are using a fixed value, instead of the operand, which is why there are a few rasters on which there are no bars, while the rest fail to properly compare the raster number with itself.

This is due to a bug in the compare instructions, which I have yet to get to the bottom of. My gut feeling is that it is some sort of timing bug, where the wrong value is read from the bus.  I have seen it in simulation once or twice, which suggests that I should be able to analyse it fairly easily to find and fix the cause.

Meanwhile, it is interesting to look at the pattern and how narrow the stripes are.  They are actually almost half the width that they seem at first when you look closer, because of the adjacency of the bars on successive rasters. The following image makes this a bit clearer:



The VIC-IV runs at 4x the CPU clock, so every four physical pixels corresponds to one CPU clock tick.  The logical pixels of the character generator are five physical pixels wide here, so one and a quarter CPU clocks wide.

INC and DEC take seven cycles on my CPU at the moment, due to the need to include wait-states to avoid back-to-back memory accesses at 48MHz.  This should equate to 7x4 = 28 pixels, or about five and a half logical pixels, just over half a character wide, which is pretty much what we are seeing.

On a real C64 the same bars would be almost 10x wider, at six characters or 48 pixels wide.  So even allowing for the massively higher pixel clock on the C65GS (192MHz versus 8MHz on the C64), there is certainly scope to do some pretty interesting tricks.  Vertical raster bars and split screens should both be quite possible, although there are probably easier ways to get the same effects.
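The arithmetic behind these widths, as a short Python calculation.  The C65GS figures are the ones given above; the C64 figures assume the usual 6 cycles for INC $D020 on a 6502 and 8 pixels per CPU cycle.

```python
def bar_width(cycles, pixels_per_cycle):
    """Width of a bar that lasts for the given number of CPU cycles."""
    return cycles * pixels_per_cycle

# C65GS: 7 cycle INC/DEC, 4 physical pixels per CPU cycle,
# and 5 physical pixels per logical pixel in this mode.
c65gs_physical = bar_width(7, 4)
c65gs_logical = c65gs_physical / 5
c65gs_chars = c65gs_logical / 8

# C64: 6 cycle INC, 8 pixels per CPU cycle.
c64_pixels = bar_width(6, 8)
c64_chars = c64_pixels / 8

print(c65gs_physical, c65gs_logical, c65gs_chars)
print(c64_pixels, c64_chars)
print("ratio in logical pixels: %.1f" % (c64_pixels / c65gs_logical))
```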


Wednesday 16 July 2014

New CPU is working (sort of)

It's taken much longer than I would have liked, but I have the redesigned CPU mostly working now.

The CPU is running at 48MHz, and should be about 40x C64 speed, although the exact figure is likely to change.

The reason it isn't 48x is that I have had to put some wait-states in a few places to make the timing work.

Reading from anything other than fast RAM incurs one extra cycle, which means reading from IO currently has a two cycle penalty.  Fastram, as the name suggests, has no wait states.  Writing to IO also has no wait states.

Also, anywhere where the CPU makes a memory access for which the address or data is dependent on whatever has just been read from memory, this has had to be split into two cycles.  This mostly affects the Read-Modify-Write (RMW) class of instructions, like INC, DEC, ASL and ROR.  This basically means that we have a dummy cycle similar to what the real 6502 has.

Unfortunately, it isn't very practical to perform the dummy write that the 6502 does, so I will need to add an extra cycle for $D019 so that DEC $D019 and variations work for clearing interrupts.
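For those who haven't met this trick: on a real 6502, a read-modify-write instruction writes the unmodified value back before writing the modified one, and since writing a 1 bit to $D019 acknowledges that interrupt, the first write is what actually does the work.  A deliberately simplified Python model (the class is a toy, not the VIC-IV logic, and only models the low four interrupt bits):

```python
class IrqLatch:
    """Toy $D019: writing a 1 to a bit acknowledges that interrupt."""
    def __init__(self, pending):
        self.pending = pending & 0x0F

    def read(self):
        # Bit 7 is shown as set whenever anything is pending (simplified).
        return self.pending | (0x80 if self.pending else 0)

    def write(self, value):
        self.pending &= ~value & 0x0F

def dec_d019(latch, dummy_write):
    """Model DEC $D019 with or without the 6502's dummy write."""
    value = latch.read()
    if dummy_write:
        latch.write(value)               # unmodified value goes out first
    latch.write((value - 1) & 0xFF)      # then the decremented value
    return latch.pending

print(dec_d019(IrqLatch(0x01), dummy_write=True))    # raster IRQ acknowledged
print(dec_d019(IrqLatch(0x01), dummy_write=False))   # raster IRQ still pending
```

Without the dummy write, the only value written is the decremented one, which has bit 0 clear and so leaves the raster interrupt pending.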

I have some ideas for caching the top of the stack so that RTS can execute in a single cycle, which will provide a solid boost for many programs, but that's some way down the track, because I need to get the CPU working properly first.

The screen shot from simulation below shows that it can run the kickstart ROM and get as far as trying to find the SD card:


The astute observer will notice that the top line of the display is showing the wrong contents.  This is because the bad-line for that row of characters had already occurred.  If I leave the simulation long enough that it can draw a 2nd frame, then it should show the kickstart banner.  As it happens, the simulation managed to draw another frame while I was writing this post, so you can see the real version below:


I need to shake down the remaining bugs like this in the VIC-IV that have crept in with the substantial rework that it has suffered while I have been doing the CPU.  Both efforts, CPU and VIC-IV rework, are really targeted at making the whole thing use much less of the FPGA so that I have enough space to implement sprites and the other missing functionality.

In any case, the fact that it can set the video mode, clear the screen, and decide that it is looking for the SD card shows that an awful lot of the CPU is actually working.  There are some bugs, however.

First, I haven't finished implementing BRK or interrupts of any sort.

Second, I haven't finished implementing the PHW (push word, either immediate or absolute) instruction.  It won't be hard, but it just hasn't hit the priority queue yet, and it is a little weird, since in the CPU the two addressing modes will likely have very different implementations.

Third, there are some weird bugs with accessing IO.

The SD card controller and other IO functions provided by that module aren't mapping in the address space properly when run on the FPGA, even though they simulate fine.

Also, running the following little routine to draw a rough vertical raster bar locks up as soon as the accumulator has the value $F0.  Once that happens, the Z flag stays perpetually set, and so nothing more gets drawn.

loop    LDA $D012
        CMP $D012
        BEQ *-3
        INC $D020
        INC $D020
        JMP loop

It works fine, however, if I put a NOP between the CMP and the BEQ.  So there is something timing dependent going on.  What is weird is that without the NOP the bug manifests, even if the CPU is in single-stepping mode.

This reworking of the VIC and CPU at the same time hasn't been the most fun, because it has gone backwards from working to a seething mess.  But it is now finally starting to draw back together, and should hopefully soon catch up with where the old excessively large design got to.  Then comes the fun part of adding sprites and other goodies, but that will still be a little while off.

Wednesday 9 July 2014

Improving simulation of the C65GS to help with debugging

I am continuing to slog through reimplementing much of the CPU and VIC-IV to reduce the size, and hopefully improve the CPU speed.

One of the great pains with VHDL is the time it takes to synthesise (that's VHDL/FPGA speak for "compile") a design.  In the case of the C65GS it varies between 10 minutes and an hour.

Even then, when you spot something wrong, it can be quite hard to work out what it is so that you can start fixing it.

This is why there are simulation tools that let you run VHDL code, albeit really slowly.

However, it isn't quite that simple in practice, especially if you are using the open-source GHDL VHDL simulation tool.

A lot of the problem comes from the Xilinx core generator that is the recommended way to create large memories and other useful stuff in an FPGA design.  Those cores are specified in a very hardware-specific manner.  They do include simulatable versions, but none of them run in GHDL.

A while back I had already started making my own simple memory modules in VHDL.  The only kind that was too tricky to get right, so that the Xilinx tools knew what I meant, was true dual-port memory, where two different things can read and write at the same time.  This meant that I couldn't simulate chipram, which really scuttled any sensible simulation.

However now I have changed the design so that chipram is only ever written to by the CPU.  Reads from chipram are serviced from a separate shadow ram.  This means that chipram is fastram on the C65GS, but more importantly, it removed the barrier to simulation.

So over the last week or so I have been hacking away at improving the simulability of the C65GS, with the direct intention of tracking down bugs much more quickly.  In particular, I have been wanting to get the character generator working again after recently redesigning it to work off a single 8-bit data bus, in preparation for implementing sprites.

It is easy to get VHDL to output text messages at various points, but that can be extremely tedious to wade through as you try to work out if the character generator is fetching one too many or too few bytes during a bad line, or fetching the data from one byte to the left or the right of where it should be.

To make this part easier I wrote a little program that reads the output of the VHDL for me, and then generates a BMP file with all the rendered pixels.  Then I can just look at the static image, see what is wrong, fix it, and repeat the cycle -- in a process that takes just a minute or two, provided you only need to look at the top few rasters of the frame, which is usually enough.
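The general shape of such a tool is simple.  The Python sketch below is not the program I actually use; in particular the log line format ("PIXEL x y rr gg bb") is invented for the example, and the function names are illustrative.  It collects pixel reports from the simulation output and packs them into a 24-bit uncompressed BMP.

```python
import struct

def parse_pixels(log_lines):
    """Collect {(x, y): (r, g, b)} from lines of the assumed form
    'PIXEL x y rr gg bb' (hex colour); all other lines are ignored."""
    pixels = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) == 6 and fields[0] == "PIXEL":
            x, y = int(fields[1]), int(fields[2])
            pixels[(x, y)] = tuple(int(f, 16) for f in fields[3:6])
    return pixels

def make_bmp(pixels, width, height):
    """Return the bytes of a 24-bit uncompressed BMP image."""
    row_padding = (4 - (width * 3) % 4) % 4       # rows are 4-byte aligned
    image_size = (width * 3 + row_padding) * height
    file_header = struct.pack("<2sIHHI", b"BM", 54 + image_size, 0, 0, 54)
    info_header = struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24, 0,
                              image_size, 2835, 2835, 0, 0)
    rows = []
    for y in range(height - 1, -1, -1):           # BMP stores bottom row first
        row = bytearray()
        for x in range(width):
            r, g, b = pixels.get((x, y), (0, 0, 0))   # unseen pixels are black
            row += bytes((b, g, r))               # BMP byte order is B, G, R
        rows.append(bytes(row) + b"\x00" * row_padding)
    return file_header + info_header + b"".join(rows)

log = ["some other message", "PIXEL 0 0 ff 00 00", "PIXEL 1 0 00 ff 00"]
bmp = make_bmp(parse_pixels(log), width=2, height=1)
print(len(bmp), "bytes")
```

Writing the result out is then just a matter of opening a file in binary mode and writing the bytes.  Because pixels that were never reported default to black, a partially simulated frame still produces a viewable image.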

With this in place, I have been able to make much more rapid progress, and already have the character generator mostly working again, as the partially rendered frame below shows.  Excuse the bizarre colours and choice of data.  The main thing is I can see 40 columns being drawn without any significant glitches, and both colour RAM and screen RAM are being addressed correctly:


The purple is the right border, and the text area is currently placed hard-left in the frame, with the left border turned off.  The funny brown stripe is because the right border is still in its normal place, so the space that should have been taken by the left border instead shows up as background colour ($D021).

Now I am just tracking down some simulation bugs that cause the simulation to die before it can draw a complete frame.  Once that is working, I probably have enough of the character generator working again that I can turn my attention back to finishing the re-implementation of the CPU.

Thursday 3 July 2014

Stunt Car Racer at 29x Speed

When I still owned my C65 I made a YouTube video of it running Stunt Car Racer to show how much faster the C65 is than a stock C64.  You can see it below:




Since the C65GS is now able to run some software, I thought I would try Stunt Car Racer out on it.

In the still image below we can see that it is working fairly well.  The vertical graphics glitches are a known problem with bitmap mode on the C65GS, and I will fix them in due course.


But the real question was how fast it would run.  I knew it ought to run quite fast, and that it lacks any vertical blank interlock, so it would run as fast as it possibly could.

Here are a few seconds of Stunt Car Racer running at Ludicrous Speed.  It goes without saying that it was entirely unplayable at this speed.



It looks like the CPU redesign I am doing should end up at 48MHz and somewhere between 30x and 50x C64 speed.  I'll have to post an update when I get to that point.

Tuesday 1 July 2014

A frustrating fortnight of work reimplementing the CPU

I realised recently that I needed to reimplement the CPU so that it wasn't taking 70% - 80% of what is really a very large FPGA, so that I could fit the extra bits and pieces in that I have been planning.

Ideally, I would include a 1541 and a complete C64 in the FPGA for compatibility.  This would help to free me from having to make the C65/C65GS mode too compatible with the C64, instead, I would just transfer control of a running program from one to the other when cycle-exact operation or illegal opcodes were needed.

But in the shorter term, I wanted to be able to put a couple of SIDs in, and perhaps also improve the synthesis time a bit.

So I set about redoing the CPU, and the process has been one frustration after another.

My gut feeling was, and remains, that 96MHz should be fairly possible on this FPGA, although the complexity of the memory system of a C65 makes that more difficult than it first seems.

I was keeping a careful eye on the maximum clock speed in the synthesis reports, and realised that for the time being at least, 64MHz would be easier. I could then optimise my way back up.

After bashing my head against a lot of bugs and silly design decisions on my part, I started making some progress and got some instructions running.

But then testing LDA/LDX/LDY/LDZ I found that the value loaded into the register was often wrong, or from the previous instruction.

After a lot of poking around, I finally realised that ISE was not constraining the cpu clock in the design, and the late data was being used because the design had a real maximum clock speed of only about 28MHz, but was being run at 64MHz.

I was a bit cranky with ISE for not realising that a clock created by dividing another clock is in fact a clock.

I eventually found a way to constrain the clock, discovered how horrible the timing situation was, and have been trying to fix it since.

At this stage it looks like 64MHz should be possible, although with the odd drive stage when executing instructions to get values or addresses ready to read or write in the following cycle.  Exactly how much impact this will make is hard to estimate at this stage, but the CPU might be in drive states perhaps 20% - 25% of the time.

That means that the speed up compared with a stock C64 might end up around 64MHz * 0.75 / 1MHz = 48x.  The result will be helped a little by the fact that many of the single byte instructions will execute in a single cycle. But it is really too early to say whether I will be able to make the CPU work at 64MHz, and exactly what the speed comparison will be.  It will all depend on whether I can make actual forwards progress or whether I stay bogged down chasing my tail some more.
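The same back-of-envelope estimate as a Python function, so the effect of the drive-state fraction is easy to play with.  It ignores the single-cycle instructions mentioned above, so it is a rough first approximation at best.

```python
def estimated_speedup(clock_mhz, drive_fraction, c64_mhz=1.0):
    """Speed relative to a stock C64, losing drive_fraction of all cycles."""
    return clock_mhz * (1.0 - drive_fraction) / c64_mhz

for fraction in (0.20, 0.25):
    print("%.0f%% drive states: about %.1fx" %
          (fraction * 100, estimated_speedup(64, fraction)))
```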