November 29th 2021 – Raspberry Pi Zero 2

This is a write-up of my latest experiences using a multi core setup to speed up PiTrex. I write it down here again, so there is a spot where I may find it for future reference. I have been writing about this stuff in the Raspberry Pi bare metal forum and on the PiTrex mailing list.

The intro I posted in the raspi forum sums it up rather well, so here is a copy/paste:

Question!
Does anyone know what may cause an “interrupt like” behaviour on a multicore system, something which I might have overlooked?
Some sort of contention with a resource unknown to me?

Everything works fine now – except… the last step.

To understand this I need to describe the project again.

PiTrex
In general this is a Pi connected via a special circuit to a Vectrex (an old video game console able to display vector graphics). Via the cartridge port of that machine you can halt the internal 6809 processor, and via the bus you can access the address space of the Vectrex.

All we need to address is the VIA 6522 chip, since it handles all “internal” communication with and setup of the analog components, e.g. the PSG sound chip, the vector generation hardware, the joysticks etc.

To draw vectors using the VIA you essentially tell the hardware:
– switch the beam on
– move in that direction
– switch the beam off
and so on…
For this to work everything must be VERY exact – otherwise the vector output jitters and does not look clean.

By “very exact” I mean, for “normal” vectors, something around 1/1500000 of a second (one cycle of the Vectrex’s 1.5 MHz clock). For raster text display it must be even more exact.
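To make that concrete, here is a rough sketch of what a single vector “costs” in VIA accesses, in the spirit of the Vectrex BIOS draw routine. The SET/GET macros mirror the style of the VIA_int_flags line quoted further down, but treat the register names and exact sequence as assumptions, not the actual PiTrex code:

  SET(VIA_port_a, dy);          /* Y strength to the DAC                 */
  SET(VIA_port_b, 0x00);        /* enable mux: DAC -> Y integrator       */
  SET(VIA_port_b, 0x01);        /* disable mux, DAC now drives X         */
  SET(VIA_port_a, dx);          /* X strength                            */
  SET(VIA_t1_cnt_lo, len);      /* vector length into timer T1 (low)     */
  SET(VIA_shift_reg, 0xff);     /* shift register pattern: beam ON       */
  SET(VIA_t1_cnt_hi, 0);        /* writing T1 high starts the move       */
  while ((GET(VIA_int_flags) & 0x40) == 0)
      ;                         /* busy wait for the T1 timeout          */
  SET(VIA_shift_reg, 0x00);     /* beam OFF - a few microseconds late
                                   and the vector is visibly too long    */

If any of these accesses happens late – especially the final beam OFF – the line is drawn too long, which is exactly the “jitter” described below.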


All of this is working, and has been working for quite some time now.

We have further implemented several (MAME-like) emulators on the bare metal Pi, which output their games on the Vectrex (Star Wars, Tempest etc.).
This also works well.

Now comes the Pi Zero 2 – and for the first time we have the opportunity to use a multi core environment.
Since the Vectrex can only display a certain number of vectors at 50 Hz (let’s say 500), my goal for the multi core environment is to use one core just for the display of the vectors and one (or more) other cores for the emulation.
That “vector” core would then be an actual “DVG” or “AVG” that is fed from the other core(s) via a vector pipeline.

Everything said above is implemented and working!

BUT

When using two cores (one for emulation, one for output) – the output has “jitters”.
I have seen similar behaviour before, when I had interrupts enabled during output of vectors using a single core.
The above explained sequence:
– switch beam on
– move in that direction
– switch beam off

was interrupted… and the needed “switch beam off” (for example) was not given at the right time, so the vector was drawn longer than wanted.

I am able to switch “modes” within my program on the fly:
– I can output using a single core (core 0) and the output is CLEAN.
– I can output using a single core (core 1) and the output is CLEAN.
– If I use two cores (core 0 emulation, core 1 output), it seems that core 1 is interrupted. Interrupted for… I dare say milliseconds – some vectors are really FAR off.

Further:
In the multi core setup:
– core 0 does not access ANY GPIO, mailbox, timer or other sensitive region
– core 1 is the only core allowed to communicate with the outside world

The only “overlap” the two cores have is the pipeline which core 0 fills and core 1 displays.
There are actually 3 such pipelines, to ensure both cores access them separately:
– read pipeline
– write pipeline
– next read pipeline

These are switched as needed. Access to the switching is protected by semaphores, and core 1 has priority (see the sketch below).
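A minimal sketch of that triple-buffer switching, with hypothetical names – the real code uses an array _VPL[] and indices, as visible in the test snippets further down, and spin_lock/spin_unlock are sketched a bit later:

  static VectorPipeline *readPL, *writePL, *nextReadPL;
  static volatile unsigned int plLock;

  /* core 0: a frame of vectors is complete - hand it over */
  void pipelineFrameDone(void)
  {
      spin_lock(&plLock);
      VectorPipeline *t = nextReadPL;
      nextReadPL = writePL;         /* finished frame becomes "next read" */
      writePL = t;                  /* recycle the old buffer for writing */
      spin_unlock(&plLock);
  }

  /* core 1: switch to the newest complete frame */
  void pipelineNextFrame(void)
  {
      spin_lock(&plLock);
      VectorPipeline *t = readPL;
      readPL = nextReadPL;          /* newest frame becomes "read"        */
      nextReadPL = t;               /* old read buffer can be reused      */
      spin_unlock(&plLock);
  }

This way core 0 only ever writes into the write buffer, core 1 only reads from the read buffer, and the quick pointer swap is the single protected operation.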
Not only have I not enabled any interrupts, I have DISABLED all interrupts.

Still the jitters remain.


Now the question again:

Does anyone know what may cause this “interrupt like” behaviour on a multicore system, something which I might have overlooked?
Some sort of contention with a resource unknown to me?

To answer some specific questions:

How did I implement the semaphore? -> Using a normal spin lock, actually a copy/paste from Circle (implementation see bare metal forum).
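For reference, a minimal sketch of such a spin lock using the ARM exclusive load/store instructions – this is the general pattern Circle uses, not a verbatim copy:

  static inline void spin_lock(volatile unsigned int *lock)
  {
      unsigned int tmp, failed;
      asm volatile(
          "1: ldrex   %0, [%2]     \n"  /* read the lock word exclusively */
          "   cmp     %0, #0       \n"
          "   bne     1b           \n"  /* already held? retry            */
          "   mov     %0, #1       \n"
          "   strex   %1, %0, [%2] \n"  /* try to claim it                */
          "   cmp     %1, #0       \n"
          "   bne     1b           \n"  /* lost the race? retry           */
          "   dmb                  \n"  /* barrier before critical section */
          : "=&r"(tmp), "=&r"(failed)
          : "r"(lock)
          : "cc", "memory");
  }

  static inline void spin_unlock(volatile unsigned int *lock)
  {
      asm volatile("dmb" ::: "memory"); /* drain critical section writes  */
      *lock = 0;
  }

Note that ldrex/strex require the lock word to live in memory marked shareable and cacheable – which ties in with the MMU setup below.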

How is the MMU set up? -> Each core has separate stack areas. For core 1 I have reserved a special pipeline buffer that is a couple of MB away from any RAM used by core 0.
Caches and MMU are enabled.
The MMU “S” (shareable) flag is set (but unsetting it does not change anything).
For the core 1 memory I additionally flagged the RAM with “no execute” (but unsetting that does not change anything either).
Otherwise the memory type is “normal”.
For the GPIO area the memory type is “device”.
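For illustration, this is roughly what those attributes look like as AArch32 short-descriptor 1 MB section entries (bit positions from the ARM ARM – an assumed reconstruction, not a dump of my actual page tables):

  #define SECTION     (2u << 0)
  #define SECT_B      (1u << 2)
  #define SECT_C      (1u << 3)
  #define SECT_XN     (1u << 4)            /* execute never               */
  #define SECT_AP_RW  (3u << 10)           /* read/write access           */
  #define SECT_TEX(x) ((unsigned)(x) << 12)
  #define SECT_S      (1u << 16)           /* the "S" (shareable) flag    */

  /* normal RAM: write-back, write-allocate, shareable */
  #define MT_NORMAL   (SECTION | SECT_AP_RW | SECT_TEX(1) | SECT_C | SECT_B | SECT_S)

  /* core 1 pipeline buffer: normal, but marked "no execute" */
  #define MT_PIPELINE (MT_NORMAL | SECT_XN)

  /* GPIO window: device memory */
  #define MT_DEVICE   (SECTION | SECT_AP_RW | SECT_B | SECT_XN)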


Next I implemented a very simple example – so that I do not have to run a whole emulation to see the “jitter”. What I ended up with is THIS most basic example (semaphores are not needed anymore, since I do not change anything relevant):

  while (1)
  {
    v_WaitRecal();
    /* as long as exactly button 4 is held down... */
    while ((*(volatile unsigned char *)&currentButtonState & 0x0f) == 0x08)
    {
      extern VectorPipeline *_VPL[];
      extern int vectorPipeLineWriting;
      int a = 0;

      /* ...sum up the complete write buffer, byte by byte,
         just to generate memory traffic on this core */
      unsigned char *p = (unsigned char *)_VPL[vectorPipeLineWriting];
      for (int i = 0; i < MAX_PIPELINE * sizeof(VectorPipeline); i++)
      {
        a += *(p + i);
      }
      (void)a; /* the sum itself is never used */
    }
  }

This is the most BASIC display loop: an endless loop that displays some vector lists (a Major Havoc like title screen), which were set up before the while(1) loop.
The pipeline is implemented such that when no new vectors are written, the old list is redisplayed (roughly as sketched below).
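Continuing the pipeline sketch from above, core 1’s display side then looks roughly like this – newFrameReady and drawPipeline() are hypothetical names for illustration:

  /* core 1: display loop */
  while (1)
  {
      if (newFrameReady)       /* set by core 0 when a frame is complete */
      {
          pipelineNextFrame(); /* swap the fresh frame in                */
          newFrameReady = 0;
      }
      drawPipeline(readPL);    /* no new frame -> the old list is redrawn */
  }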

What this then boils down to, in summary:

  • core 0 and core 1 run independently of each other
  • the above pointer “p” points to a memory area that has nothing to do with core 1
    a) core 1 uses the “read” buffer (not the above shown “write” buffer)
    b) the read buffer additionally gets copied to a special core 1 memory area
  • thus just accessing a (from core 1’s point of view) seemingly random area of data memory is enough to produce jitter

To me this showed that the normal “synching” (semaphores, monitors or the like) of the two cores has nothing to do with my problem.


The next thing was that I suspected the PiTrex. Quote from the mailing list:

@Kevin:
I am “blindly” looking at things from different directions. One of these directions is the PiTrex – not the Pi Zero.
Apart from the facts described in the bare metal forum, there are two different occurrences which I did not mention yet… mainly because they are fairly seldom and I wanted to investigate them after the problem mentioned above.
But thinking about it – there is one possible way that all three problems are actually the same.

All in SMP mode:
a) Sometimes – not more often than once every half hour or so – when I am running a program, the program just exits and I am taken back to the menu… without me touching any button or doing anything at all.
b) “Often” (after 1-2 minutes at most, but never at precisely the same point), the PiTrex appears to freeze and the screen goes black. I found something to circumvent that – but I never REALLY knew what caused it.

Now my current “bedtime” theory:
I actually don’t believe anymore that it is “BUS” contention in the Pi Zero 2 architecture. The whole chip is designed with four cores in mind, and I am sure that
a) the bus is fast enough to support the memory traffic of 4 cores
b) the memory locations I write to on core 0 are totally different from those used by core 1, so there should be no memory/cache contention going on
c) this already happens when I use only 2 of the 4 cores; the architecture and the chip design are nearly 10 years old – if they were that rubbish, we would have heard of it…
d) even IF there is contention, it would be in the low nanosecond range, nothing that would cause the tremendous effects that are visible in the video (linked in the forum message)

The faults…
a) can (if everything is programmed correctly) only happen when all 4 buttons are pressed, that is, when the GPIO input from the VIA reads 0x?f at the right point;
b) I debugged this, and the line that is “stuck” is the following:
  while ((GET(VIA_int_flags) & 0x40) == 0);
meaning the T1 counter of the VIA NEVER! sets its interrupt flag, so this is an endless loop and the output stands still.
This is verified, and core 0 is still running while core 1 sits in the loop.

My current theory therefore goes in the last possible direction where there might be trouble – the PiTrex.

Question!
Can the PiTrex in any way be influenced by BUS activity not related to the GPIO?
To me as a software-only guy, viewing things purely from the “outside”, it looks like this:

If the Pi Zero does a memory access on core 0 and at the same time accesses the GPIO pins with core 1, reading (writing?) values from the PiTrex… then there is a slight chance that the value gotten back from (or written to) the GPIO port is not the correct value as seen by the Vectrex.
This hypothesis would explain all three above mentioned behaviours.

a) the button state is read only once per round (thus seldom)… and the event (“reset to the menu”) happens rather seldom – a falsely returned 0x?f would explain it though.
b) not reading the interrupt flag at the correct time (or not writing the correct value) would result in an endless loop; this is done with each drawn vector that has no direct follow up -> can happen more often
c) the garbled screen – reading and writing MANY times per single vector -> happens very often! Thus much garbling…

I know this is a strange theory, and it is a little bit of clutching at straws.
But it would explain everything that is happening.

Any thoughts?


Basically – the answer Kevin gave was a negation :-). But meanwhile I also got first responses from the bare metal forum, and a new suspect enters the scene.

Caches

For more technical information on the subject – please visit the forum. But basically there were voices suspecting that cache hits/misses might be the reason. To test that, I altered my test program slightly:

  while (1)
  {
    v_WaitRecal();
    /* as long as exactly button 4 is held down... */
    while ((*(volatile unsigned char *)&currentButtonState & 0x0f) == 0x08)
    {
      extern VectorPipeline *_VPL[];
      extern int vectorPipeLineWriting;
      register unsigned int *p = (unsigned int *)_VPL[vectorPipeLineWriting];
      volatile register int a = 0;
      while (1) /* touch memory forever; core 1 keeps redisplaying the old list */
      {
        /* full buffer (larger than the L2 cache) -> jitter */
        for (register int i = 0; i < MAX_PIPELINE * sizeof(VectorPipeline) / 4; i++)
      //for (register int i = 0; i < 2; i++) /* tiny footprint -> no jitter */
        {
          a += *(p + i);
        }
      }
    }
  }

If you execute the program “as is”, there is jitter.
If you exchange the comments on the two “for()” loops, there is no jitter.

The program runs “jitter free” up to a size of about 500-520 kB (loop size).

Increasing the loop size further steadily increases the “jitter”.

My conjecture from this is:

It really seems to be the L2 cache – at least size-wise.

The L2 cache size is 512 kB (and my 500-520 kB above is not an EXACT measurement).

To me that looks like L2 cache activity “blocking” the bus for an exceedingly long time – so long that communication via the GPIO pins cannot happen within my timing limits.

During “normal” program execution I have no control over when L2 cache activity hits me, so it seems that my current core 1 implementation cannot work.

I am still not sure why there is such a HUGE amount of jitter.
Cache misses should not cause micro- or even milliseconds of “pauses”… the ARM literature I have found says a cache miss should take at most about ~20 cycles.
(Which would be about 20 nanoseconds at 1 GHz – ONE Vectrex cycle is about 666 nanoseconds, and a “correct” VIA access double that: 1332 nanoseconds – FAR away from any jitter???)

Further thoughts:

  • it seems the only way to ensure no jitter would be to run all “emulation” within a maximum of XXX bytes (possibly 512 kB – maybe less)
  • this would be a huge undertaking, because as of now I use RAM freely, and there are BSS and non-BSS data and stack (different memory regions), which would all have to be unified
  • there are libraries I use which are not “above” using “statics” – which again possibly live in yet another memory region

-> All this means to me – unless someone (other than me) finds a solution to the described jitter problem (or discovers that it is something else I have not thought about, and that “something else” is something I can fix):

PiTrex will not use a Multi Core setup!


3 thoughts on “November 29th 2021 – Raspberry Pi Zero 2”

  1. gauze

    sad it’s not working as you hoped, I picked up a zero2 last month hoping for support soon but I can always use it for something else if it doesn’t pan out.

    1. Malban Post author

      The “old system” (IRQ) is still working.
      Using only one core, the speed increase is limited to about 40%.

      That is still not bad and I think worthwhile. But it is somehow “sad” that 3/4 of the CPU cannot be used.
      (unless we find some other trick)

  2. Malban Post author

    Follow up thoughts:

    Two things I *might* try:

    1) Block core 0 while core 1 uses the GPIO.
    It is possible to use a “mailbox IRQ” to halt core 0 from core 1.
    You have to use the per-core mailbox system for that: core 0 can be configured to raise an interrupt once a “message” appears in its mailbox.
    The thought would be:
    Core 1: before doing GPIO -> send a message to core 0.
    Core 0: on receiving the message, the interrupt handler does a WFE (wait for event) – thus “shutting down” the core.
    Core 1: once finished with the GPIO, use a SEV (send event) to wake core 0 up.
    Core 0: exits the interrupt immediately after it gets woken up.
    (This could also be a FIQ – see the sketch below.)
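    A minimal sketch of that handshake, assuming the BCM2836-style per-core mailbox registers (from the “QA7” documentation) also apply to the Zero 2 – addresses and names are assumptions, not tested code:

      #define CORE0_MBOX0_SET ((volatile unsigned int *)0x40000080) /* write 1 bits to set     */
      #define CORE0_MBOX0_CLR ((volatile unsigned int *)0x400000C0) /* read / write 1 to clear */

      /* core 1: wrap every time critical GPIO transaction like this */
      void gpioTransaction(void)
      {
          *CORE0_MBOX0_SET = 1;     /* raises the mailbox IRQ on core 0   */
          /* ... the exactly timed VIA/GPIO accesses go here ... */
          asm volatile("sev");      /* wake core 0 from its WFE           */
      }

      /* core 0: mailbox interrupt (or FIQ) handler */
      void core0MailboxHandler(void)
      {
          *CORE0_MBOX0_CLR = 1;     /* acknowledge / clear the message    */
          asm volatile("wfe");      /* park the core until core 1's SEV   */
      }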

    The thing is… one full write/read cycle of core 1 takes at least about 6 Vectrex cycles.
    That is 6*666 ≈ 4000 ns, i.e. roughly 4000 ARM cycles at 1 GHz.
    And all core 1 does all day is pump data to the VIA, so it would basically be blocking core 0 nearly COMPLETELY. It might even be more performant to use the old IRQ setup.

    But to be sure – one would have to try it out.

    2) It is possible (even for the “normal” Pi Zero) to access the GPIO with the GPU (NOT the ARM cores).
    This would give parallel execution even on the Pi 0.

    However there is nearly no documentation about this (the GPU is closed source), and the cache contention could very well turn out to be exactly the same as we are experiencing now – since the GPU, for all I know, still uses the same BUS.

    … these are current thoughts …

    However I deem the chances very slim that any of the above points will actually help very much, and it would require much knowledge gathering, programming and trying out. I am not sure whether I actually want to invest the energy in this.
