[Schematic - Scheme on P2] LISP MACHINE
SYSTEM STATUS: ONLINE

Perplexity attempts to double the LUT for adjacent Cogs - doubling the R5RS procedures that can be reached from the jump table - and throws in a Hub RAM ring buffer for bytecode. A couple of good strategies that I hope to incorporate once the basics are sorted.

Comprehensive Multi-Cog Bytecode Interpreter Architecture for Propeller 2

Executive Summary

Based on extensive research into P2 architecture and your requirements for multiple cogs running continuously with shared LUT tables and efficient work distribution, I recommend a hybrid approach combining LUT RAM sharing with hub-based circular queue management. This provides optimal memory efficiency while maintaining high performance for your bytecode interpreter farm.

Inter-Cog Communication Methods Analysis

1. LUT RAM Sharing (Recommended Primary Method)

Advantages:

  • Maximum memory efficiency: Adjacent cogs (0-1, 2-3, 4-5, 6-7) can mirror LUT writes into each other's 2KB LUTs, letting a pair keep one logical table in sync [1][2]
  • Fastest access: Direct memory access without hub latency
  • Hardware-accelerated: Built into P2 silicon via the SETLUTS instruction [3]
  • Perfect for shared lookup tables: Ideal for your bytecode operation tables

Implementation:

PUB start_interpreter_pair(cog_pair_id) : success
    ' Start even cog first
    success := coginit(cog_pair_id, @interpreter_even, @work_queue)
    if success >= 0
        ' Start odd cog that shares LUT with the even cog
        success := coginit(cog_pair_id + 1, @interpreter_odd, @work_queue)
' In even cog (cog 0, 2, 4, 6)
interpreter_even
    setluts #1          ' Allow the odd companion cog to write into our LUT
    setq2   #255        ' Load 256 longs
    rdlong  0, ##@lut_table_src  ' Load the shared dispatch table
    jmp     #main_loop

' In odd cog (cog 1, 3, 5, 7)  
interpreter_odd
    setluts #1          ' Allow the even companion cog to write into our LUT
    setq2   #255        ' SETQ2 block writes reportedly bypass LUT sharing,
    rdlong  0, ##@lut_table_src  ' so load our own copy of the table
    jmp     #main_loop

2. Hub-Based Circular Queue System (Recommended Secondary Method)

Advantages:

  • Scalable: Supports multiple producers and consumers [4][5]
  • Well-understood: Standard producer-consumer pattern [6][7]
  • Efficient: Circular buffers minimize memory allocation overhead [8]
  • Atomic operations: P2 hub RAM provides atomic long access [9]

Implementation:

CON
    QUEUE_SIZE = 256        ' Power of 2 for efficient masking
    QUEUE_ENTRY_SIZE = 8    ' Bytes per entry: bytecode pointer + length

VAR
    ' Circular queue in hub RAM
    long queue_buffer[QUEUE_SIZE * QUEUE_ENTRY_SIZE / 4]    ' QUEUE_SIZE entries of 2 longs
    long queue_head         ' Producer index
    long queue_tail         ' Consumer index
    long queue_count        ' Current entries
    long queue_lock         ' Simple spinlock
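
Since QUEUE_SIZE is a power of two, wrap-around needs only a single AND instead of a divide. A minimal Spin2 sketch of the index advance used throughout this design (the method name is illustrative):

```spin2
PUB next_index(idx) : result
    ' Equivalent to (idx + 1) // QUEUE_SIZE, but as a bit mask -
    ' works because QUEUE_SIZE is a power of 2, so QUEUE_SIZE - 1
    ' is an all-ones mask covering every valid index
    result := (idx + 1) & (QUEUE_SIZE - 1)
```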

3. Hardware FIFO (Built-in)

Advantages:

  • Zero setup: Built into P2 hardware [10]
  • Automatic flow control: Hardware handles synchronization
  • High throughput: One word per clock cycle [10]

Limitations:

  • Single producer-consumer: Cannot handle multiple cogs efficiently
  • Fixed size: Limited buffer capacity
  • Not suitable for work distribution: Designed for streaming data
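
Although it is not suited to work distribution, the cog FIFO is exactly what the interpreter loop below uses for sequential bytecode fetch. A minimal sketch of the streaming pattern (hub_src, data, and count are illustrative cog registers):

```pasm2
' RDFAST arms the cog FIFO at a hub address; RFBYTE then pulls
' sequential bytes without a per-access hub-window wait
                rdfast  #0, hub_src         ' start hub-to-cog FIFO at hub_src
stream_loop
                rfbyte  data                ' fetch next sequential byte
                ' ... consume data here ...
                djnz    count, #stream_loop ' repeat for count bytes
```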

4. Cog Attention Mechanism

Advantages:

  • Hardware-accelerated: Uses COGATN instruction for signaling [11]
  • Low latency: Direct cog-to-cog signaling
  • Event-driven: Efficient for notifications

Implementation:

' Signal work available to cogs 1-7
cogatn  #%11111110      ' Attention to cogs 1-7

' In worker cog
wait_for_work
    waitatn             ' Wait for attention signal
    ' Check work queue
    jmp     #process_work

5. Event/Polling System

Advantages:

  • Simple: Easy to implement and debug
  • Flexible: Can monitor multiple conditions

Limitations:

  • CPU overhead: Continuous polling wastes cycles
  • Latency: Depends on polling frequency
  • Not suitable for high-performance requirements
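
For completeness, a non-blocking variant of the attention wait can be sketched with POLLATN, which checks the latched ATN event flag without stalling (the called labels are illustrative):

```pasm2
' Polling sketch: POLLATN returns C = 1 if an ATN strobe has arrived
' since the last check, so the cog can interleave background work
poll_loop
                pollatn wc                  ' C = 1 -> attention was signaled
    if_c        call    #process_work       ' drain the work queue
                call    #background_task    ' otherwise keep busy
                jmp     #poll_loop
```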

6. Mailbox System (Hub-based)

Advantages:

  • Structured: Well-defined message format
  • Reliable: Can include acknowledgments and error handling

Limitations:

  • Overhead: Additional metadata per message
  • Complexity: Requires message parsing
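
A minimal hub mailbox can be sketched in Spin2 as a command long plus a parameter long, with the command written last so the consumer never observes a half-written message (names are illustrative; single producer and single consumer assumed):

```spin2
VAR
    long mbox_cmd                   ' 0 = slot empty, nonzero = command id
    long mbox_param

PUB mbox_send(cmd, param)
    repeat while mbox_cmd           ' wait for consumer to free the slot
    mbox_param := param             ' payload first...
    mbox_cmd := cmd                 ' ...command last publishes the message

PUB mbox_receive() : cmd, param
    repeat until mbox_cmd           ' wait for a message
    param := mbox_param
    cmd := mbox_cmd
    mbox_cmd := 0                   ' acknowledge by freeing the slot
```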

Recommended Architecture

Hybrid LUT Sharing + Hub Queue Architecture

' P2 Multi-Cog Bytecode Interpreter System
' Optimized for continuous operation with shared LUT tables

CON
    _CLKFREQ = 300_000_000
    MAX_COGS = 8
    QUEUE_SIZE = 512
    WORK_ENTRY_SIZE = 2     ' bytecode_ptr + length

    ' Bytecode opcodes (same as your existing)
    OP_HALT = 0
    OP_CONS = 1
    ' ... etc

VAR
    ' Global work distribution system
    long work_queue[QUEUE_SIZE * WORK_ENTRY_SIZE]   ' QUEUE_SIZE entries, 2 longs each
    long queue_head, queue_tail, queue_count
    long active_cogs
    long cog_status[MAX_COGS]

PUB main() | cog_id, pair_id

    ' Initialize work queue
    queue_head := 0
    queue_tail := 0
    queue_count := 0
    
    ' Start cog pairs (0-1, 2-3, 4-5, 6-7)
    repeat pair_id from 0 to 3
        start_cog_pair(pair_id * 2)
    
    ' Main work distributor loop
    repeat
        if queue_count < QUEUE_SIZE - 10
            generate_work()
        waitms(1)

PUB start_cog_pair(base_cog_id) | even_cog, odd_cog

    ' Start the even cog of a free even/odd pair - LUT sharing requires
    ' adjacency, which plain COGEXEC_NEW does not guarantee
    even_cog := coginit(COGEXEC_NEW_PAIR, @interpreter_even, @work_queue)
    if even_cog >= 0
        cog_status[base_cog_id] := even_cog
        
        ' Start the odd companion explicitly so it pairs with the even cog
        odd_cog := coginit(even_cog + 1, @interpreter_odd, @work_queue)
        if odd_cog >= 0
            cog_status[base_cog_id + 1] := odd_cog
            active_cogs += 2

PUB add_work(bytecode_ptr, length) | head_next

    ' Single-producer work addition; note that queue_count is also
    ' decremented by consumer cogs, so guard it with a hardware lock
    ' if more producers are ever added
    repeat while queue_count >= QUEUE_SIZE
        waitms(1)
    
    ' Add work item
    head_next := (queue_head + 1) & (QUEUE_SIZE - 1)
    work_queue[queue_head << 1] := bytecode_ptr
    work_queue[(queue_head << 1) + 1] := length
    queue_head := head_next
    queue_count++
    
    ' Signal available work to cogs 1-7
    cogatn(%1111_1110)

DAT
' ============= SHARED COG CODE =============
                org
                asmclk

' Even cog entry point (cogs 0, 2, 4, 6)
interpreter_even
                ' Allow the odd companion cog to write into our LUT
                setluts #1
                
                ' Load the shared dispatch table
                setq2   #255
                rdlong  0, ##@lut_table_src
                
                ' Set up work queue access
                mov     work_queue_ptr, ptra
                
                ' Fall through to main loop
                jmp     #main_interpreter_loop

' Odd cog entry point (cogs 1, 3, 5, 7)
interpreter_odd
                ' Allow the even companion cog to write into our LUT
                setluts #1
                
                ' SETQ2 block writes reportedly bypass LUT sharing, so
                ' load our own copy of the dispatch table
                setq2   #255
                rdlong  0, ##@lut_table_src
                
                mov     work_queue_ptr, ptra
                
                ' Fall through to main loop
                jmp     #main_interpreter_loop

main_interpreter_loop
                ' Wait for work signal
                waitatn
                
                ' Try to get work from queue
                call    #get_work_item
                tjz     work_available, #main_interpreter_loop
                
                ' Process bytecode
                call    #process_bytecode
                
                ' Loop for more work
                jmp     #main_interpreter_loop

get_work_item
                ' NOTE: this read-modify-write of queue_tail/queue_count is
                ' not atomic across cogs - wrap it in LOCKTRY/LOCKREL when
                ' several consumer cogs share the queue
                mov     temp, work_queue_ptr
                add     temp, ##(QUEUE_SIZE * WORK_ENTRY_SIZE * 4 + 4)  ' offset of queue_tail
                rdlong  tail_current, temp
                
                add     temp, #4            ' offset of queue_count  
                rdlong  count_current, temp
                tjz     count_current, #no_work_available
                
                ' Calculate work item address
                mov     work_addr, work_queue_ptr
                mov     temp, tail_current
                shl     temp, #3            ' * 8 (2 longs per entry)
                add     work_addr, temp
                
                ' Read work item
                rdlong  bytecode_ptr, work_addr
                add     work_addr, #4
                rdlong  bytecode_len, work_addr
                
                ' Update tail
                add     tail_current, #1
                and     tail_current, #(QUEUE_SIZE - 1)
                mov     temp, work_queue_ptr
                add     temp, ##(QUEUE_SIZE * WORK_ENTRY_SIZE * 4 + 4)
                wrlong  tail_current, temp
                
                ' Decrement count
                add     temp, #4
                sub     count_current, #1
                wrlong  count_current, temp
                
                mov     work_available, #1
                ret

no_work_available
                mov     work_available, #0
                ret
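
Because the tail/count update spans several hub accesses, a multi-consumer deployment needs one of the P2's 16 hardware locks around it. A hedged sketch (queue_lock_id is an illustrative cog register holding a lock id previously allocated with LOCKNEW):

```pasm2
' Lock-guarded critical section sketch for multi-consumer queue access
grab_lock
                locktry queue_lock_id wc    ' C = 1 when the lock is acquired
    if_nc       jmp     #grab_lock          ' spin until we own it
                ' ... read and update queue_tail / queue_count here ...
                lockrel queue_lock_id       ' let the next consumer in
```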

process_bytecode
                ' Set up FIFO for bytecode reading
                rdfast  #0, bytecode_ptr
                mov     instruction_count, bytecode_len
                
bytecode_loop
                tjz     instruction_count, #bytecode_done
                
                ' Read next opcode
                rfbyte  opcode
                
                ' Execute using shared LUT table
                mov     lut_addr, opcode
                rdlut   handler_addr, lut_addr
                call    handler_addr
                
                sub     instruction_count, #1
                jmp     #bytecode_loop
                
bytecode_done
                ret

' --- Bytecode Handlers (shared via LUT) ---
op_halt_impl
                cogid   temp                ' COGID is an instruction, so
                debug("HALT from cog ", udec(temp))  ' read it into a register first
                ret

op_cons_impl
                cogid   temp
                debug("CONS from cog ", udec(temp))
                ' Your existing CONS implementation
                ret

' ... other handlers ...

' --- Cog Variables ---
work_queue_ptr      long    0
bytecode_ptr        long    0
bytecode_len        long    0
work_available      long    0
tail_current        long    0
count_current       long    0
work_addr           long    0
temp                long    0
opcode              long    0
lut_addr            long    0
handler_addr        long    0
instruction_count   long    0

DAT
' ============= SHARED LUT TABLE =============
                orgh $400

lut_table_src
                long    op_halt_impl
                long    op_cons_impl
                long    op_car_impl
                long    op_cdr_impl
                long    op_atom_impl
                long    op_quote_impl
                long    op_eq_impl
                long    op_add_impl
                long    op_sub_impl
                long    op_mul_impl
                long    op_div_impl
                long    op_halt_impl[245]    ' Fill remaining 245 of 256 entries

Key Architectural Decisions

1. LUT RAM Sharing Strategy

  • Cog pairs (0-1, 2-3, 4-5, 6-7) can mirror LUT writes into each other's 2KB LUTs
  • Each cog loads the 1KB dispatch table (256 longs) at startup; sharing stays enabled for runtime cross-writes
  • Shared bytecode handlers eliminate code duplication
  • Memory efficient: a single hub copy of the table and handlers serves all cogs

2. Work Distribution System

  • Circular queue in hub RAM for work items
  • Lock-free design using atomic operations where possible
  • Cog attention mechanism for efficient work notification
  • Scalable to all 8 cogs with minimal overhead

3. Continuous Operation

  • No stopping: Cogs run continuously waiting for work
  • Dynamic work loading: Main thread feeds work queue
  • Load balancing: Any cog can take any work item
  • Fault tolerance: Individual cog failure doesn't stop system

4. Performance Optimizations

  • FIFO bytecode reading: Hardware-accelerated sequential access
  • LUT-based dispatch: Single instruction handler lookup
  • Minimal synchronization: Reduced contention points
  • Efficient memory layout: Aligned for optimal P2 access patterns

Testing and Validation

PUB test_system() | i, start_time, end_time, throughput

    debug("Starting multi-cog bytecode interpreter test")
    
    ' Generate test workload
    repeat i from 0 to 100
        add_work(@test_bytecode, 10)
    
    start_time := getct()
    
    ' Wait for completion
    repeat while queue_count > 0
        waitms(10)
    
    end_time := getct()
    throughput := muldiv64(100, _CLKFREQ, end_time - start_time)  ' 64-bit intermediate avoids overflow
    
    debug("Processed 100 bytecode sequences")
    debug("Throughput: ", udec(throughput), " sequences/second")
    debug("Active cogs: ", udec(active_cogs))

DAT
test_bytecode   byte    OP_ADD, OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_CONS, OP_CAR, OP_CDR, OP_EQ, OP_HALT

Summary

This hybrid architecture provides:

  1. Maximum memory efficiency through LUT RAM sharing
  2. High throughput via continuous cog operation
  3. Excellent scalability supporting up to 8 cogs
  4. Robust work distribution using proven circular queue patterns
  5. Hardware optimization leveraging P2's unique features

The system achieves your goals of multiple cogs running continuously with shared LUT tables while providing efficient work distribution through a combination of the best inter-cog communication methods available on the Propeller 2.


  1. https://forums.parallax.com/uploads/editor/bz/sexnusfpmtjf.pdf
  2. https://p2docs.github.io/lutmem.html
  3. https://forums.parallax.com/discussion/168399/catalina-and-the-p2
  4. https://openclassrooms.com/en/courses/5684021-scale-up-your-code-with-java-concurrency/6667996-implement-a-producer-consumer-pattern-using-a-blockingqueue
  5. https://www.kernel.org/doc/html/v6.14-rc1/core-api/circular-buffers.html
  6. https://jenkov.com/tutorials/java-concurrency/producer-consumer.html
  7. https://java-design-patterns.com/patterns/producer-consumer/
  8. https://www.boost.org/doc/libs/release/libs/circular_buffer/
  9. https://forums.parallax.com/discussion/172806/question-on-p2-hub-ram-access
  10. https://forums.parallax.com/uploads/editor/iv/wa47dttfu0ac.pdf
  11. https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20221101-3006917.pdf