Perplexity attempts to double the LUT for adjacent Cogs - doubling the R5RS procedures that can be used from the jump table - and throws in a Hub RAM ring-buffer for bytecode. a couple of good strategies that I hope to incorporate once the basics are sorted.
Comprehensive Multi-Cog Bytecode Interpreter Architecture for Propeller 2
Executive Summary
Based on extensive research into P2 architecture and your requirements for multiple cogs running continuously with shared LUT tables and efficient work distribution, I recommend a hybrid approach combining LUT RAM sharing with hub-based circular queue management. This provides optimal memory efficiency while maintaining high performance for your bytecode interpreter farm.
Inter-Cog Communication Methods Analysis
1. LUT RAM Sharing (Recommended Primary Method)
Advantages:
- Maximum memory efficiency: Adjacent cogs (0-1, 2-3, 4-5, 6-7) share 1KB LUT effectively giving 2KB access12
- Fastest access: Direct memory access without hub latency
- Hardware-accelerated: Built into P2 silicon with
SETLUTS
instruction3 - Perfect for shared lookup tables: Ideal for your bytecode operation tables
Implementation:
PUB start_interpreter_pair(cog_pair_id) : success
' Start even cog first
success := coginit(cog_pair_id, @interpreter_even, @work_queue)
if success >= 0
' Start odd cog that shares LUT with even cog
success := coginit(cog_pair_id + 1, @interpreter_odd, @work_queue)
' In even cog (cog 0, 2, 4, 6)
interpreter_even
setluts #1 ' Enable LUT sharing to receive from odd cog
setq2 #255 ' Load 256 longs
rdlong 0, ##@lut_table_src ' Load shared LUT table
jmp #main_loop
' In odd cog (cog 1, 3, 5, 7)
interpreter_odd
setluts #1 ' Enable LUT sharing to write to even cog
' Even cog already loaded LUT - odd cog can access it immediately
jmp #main_loop
2. Hub-Based Circular Queue System (Recommended Secondary Method)
Advantages:
- Scalable: Supports multiple producers and consumers45
- Well-understood: Standard producer-consumer pattern67
- Efficient: Circular buffers minimize memory allocation overhead8
- Atomic operations: P2 hub RAM provides atomic long access9
Implementation:
CON
QUEUE_SIZE = 256 ' Power of 2 for efficient masking
QUEUE_ENTRY_SIZE = 8 ' Header + bytecode pointer
VAR
' Circular queue in hub RAM
long queue_buffer[QUEUE_SIZE]
long queue_head ' Producer index
long queue_tail ' Consumer index
long queue_count ' Current entries
long queue_lock ' Simple spinlock
3. Hardware FIFO (Built-in)
Advantages:
- Zero setup: Built into P2 hardware10
- Automatic flow control: Hardware handles synchronization
- High throughput: One word per clock cycle10
Limitations:
- Single producer-consumer: Cannot handle multiple cogs efficiently
- Fixed size: Limited buffer capacity
- Not suitable for work distribution: Designed for streaming data
4. Cog Attention Mechanism
Advantages:
- Hardware-accelerated: Uses
COGATN
instruction for signaling11 - Low latency: Direct cog-to-cog signaling
- Event-driven: Efficient for notifications
Implementation:
' Signal work available to cogs 1-7
cogatn #%11111110 ' Attention to cogs 1-7
' In worker cog
wait_for_work
waitatn ' Wait for attention signal
' Check work queue
jmp #process_work
5. Event/Polling System
Advantages:
- Simple: Easy to implement and debug
- Flexible: Can monitor multiple conditions
Limitations:
- CPU overhead: Continuous polling wastes cycles
- Latency: Depends on polling frequency
- Not suitable for high-performance requirements
6. Mailbox System (Hub-based)
Advantages:
- Structured: Well-defined message format
- Reliable: Can include acknowledgments and error handling
Limitations:
- Overhead: Additional metadata per message
- Complexity: Requires message parsing
Recommended Architecture
Hybrid LUT Sharing + Hub Queue Architecture
' P2 Multi-Cog Bytecode Interpreter System
' Optimized for continuous operation with shared LUT tables
CON
_CLKFREQ = 300_000_000
MAX_COGS = 8
QUEUE_SIZE = 512
WORK_ENTRY_SIZE = 2 ' bytecode_ptr + length
' Bytecode opcodes (same as your existing)
OP_HALT = 0
OP_CONS = 1
' ... etc
VAR
' Global work distribution system
long work_queue[QUEUE_SIZE]
long queue_head, queue_tail, queue_count
long active_cogs
long cog_status[MAX_COGS]
PUB main() | cog_id, pair_id
' Initialize work queue
queue_head := 0
queue_tail := 0
queue_count := 0
' Start cog pairs (0-1, 2-3, 4-5, 6-7)
repeat pair_id from 0 to 3
start_cog_pair(pair_id * 2)
' Main work distributor loop
repeat
if queue_count < QUEUE_SIZE - 10
generate_work()
waitms(1)
PUB start_cog_pair(base_cog_id) | even_cog, odd_cog
' Start even cog (primary)
even_cog := coginit(COGEXEC_NEW, @interpreter_even, @work_queue)
if even_cog >= 0
cog_status[base_cog_id] := even_cog
' Start odd cog (shares LUT with even)
odd_cog := coginit(COGEXEC_NEW, @interpreter_odd, @work_queue)
if odd_cog >= 0
cog_status[base_cog_id + 1] := odd_cog
active_cogs += 2
PUB add_work(bytecode_ptr, length) | head_next
' Thread-safe work addition
repeat while queue_count >= QUEUE_SIZE
waitms(1)
' Add work atomically
head_next := (queue_head + 1) & (QUEUE_SIZE - 1)
work_queue[queue_head << 1] := bytecode_ptr
work_queue[(queue_head << 1) + 1] := length
queue_head := head_next
queue_count++
' Signal available work to all cogs
cogatn(#%11111110)
DAT
' ============= SHARED COG CODE =============
org
asmclk
' Even cog entry point (cogs 0, 2, 4, 6)
interpreter_even
' Enable LUT sharing (receive from odd cog)
setluts #1
' Load shared LUT table
setq2 #255
rdlong 0, ##@lut_table_src
' Set up work queue access
mov work_queue_ptr, ptra
' Fall through to main loop
jmp #main_interpreter_loop
' Odd cog entry point (cogs 1, 3, 5, 7)
interpreter_odd
' Enable LUT sharing (write to even cog)
setluts #1
' Even cog already loaded LUT - we can access it
mov work_queue_ptr, ptra
' Fall through to main loop
jmp #main_interpreter_loop
main_interpreter_loop
' Wait for work signal
waitatn
' Try to get work from queue
call #get_work_item
tjz work_available, #main_interpreter_loop
' Process bytecode
call #process_bytecode
' Loop for more work
jmp #main_interpreter_loop
get_work_item
' Atomic work queue access
mov temp, work_queue_ptr
add temp, #8 ' offset to queue_tail
rdlong tail_current, temp
add temp, #4 ' offset to queue_count
rdlong count_current, temp
tjz count_current, #no_work_available
' Calculate work item address
mov work_addr, work_queue_ptr
mov temp, tail_current
shl temp, #3 ' * 8 (2 longs per entry)
add work_addr, temp
' Read work item
rdlong bytecode_ptr, work_addr
add work_addr, #4
rdlong bytecode_len, work_addr
' Update tail atomically
add tail_current, #1
and tail_current, #(QUEUE_SIZE - 1)
mov temp, work_queue_ptr
add temp, #8
wrlong tail_current, temp
' Decrement count
add temp, #4
sub count_current, #1
wrlong count_current, temp
mov work_available, #1
ret
no_work_available
mov work_available, #0
ret
process_bytecode
' Set up FIFO for bytecode reading
rdfast #0, bytecode_ptr
mov instruction_count, bytecode_len
bytecode_loop
tjz instruction_count, #bytecode_done
' Read next opcode
rfbyte opcode
' Execute using shared LUT table
mov lut_addr, opcode
rdlut handler_addr, lut_addr
call handler_addr
sub instruction_count, #1
jmp #bytecode_loop
bytecode_done
ret
' --- Bytecode Handlers (shared via LUT) ---
op_halt_impl
debug("HALT from cog ", udec(cogid))
ret
op_cons_impl
debug("CONS from cog ", udec(cogid))
' Your existing CONS implementation
ret
' ... other handlers ...
' --- Cog Variables ---
work_queue_ptr long 0
bytecode_ptr long 0
bytecode_len long 0
work_available long 0
tail_current long 0
count_current long 0
work_addr long 0
temp long 0
opcode long 0
lut_addr long 0
handler_addr long 0
instruction_count long 0
DAT
' ============= SHARED LUT TABLE =============
orgh $400
lut_table_src
long op_halt_impl
long op_cons_impl
long op_car_impl
long op_cdr_impl
long op_atom_impl
long op_quote_impl
long op_eq_impl
long op_add_impl
long op_sub_impl
long op_mul_impl
long op_div_impl
long op_halt_impl[^245] ' Fill remaining entries
Key Architectural Decisions
1. LUT RAM Sharing Strategy
- Cog pairs (0-1, 2-3, 4-5, 6-7) share 1KB LUT effectively accessing 2KB
- Even cog loads the lookup table, odd cog inherits access
- Shared bytecode handlers eliminate code duplication
- Memory efficient: 4 cog pairs = 4KB total LUT usage vs 8KB without sharing
2. Work Distribution System
- Circular queue in hub RAM for work items
- Lock-free design using atomic operations where possible
- Cog attention mechanism for efficient work notification
- Scalable to all 8 cogs with minimal overhead
3. Continuous Operation
- No stopping: Cogs run continuously waiting for work
- Dynamic work loading: Main thread feeds work queue
- Load balancing: Any cog can take any work item
- Fault tolerance: Individual cog failure doesn't stop system
4. Performance Optimizations
- FIFO bytecode reading: Hardware-accelerated sequential access
- LUT-based dispatch: Single instruction handler lookup
- Minimal synchronization: Reduced contention points
- Efficient memory layout: Aligned for optimal P2 access patterns
Testing and Validation
PUB test_system() | i, start_time, end_time, throughput
debug("Starting multi-cog bytecode interpreter test")
' Generate test workload
repeat i from 0 to 100
add_work(@test_bytecode, 10)
start_time := getct()
' Wait for completion
repeat while queue_count > 0
waitms(10)
end_time := getct()
throughput := (100 * _CLKFREQ) / (end_time - start_time)
debug("Processed 100 bytecode sequences")
debug("Throughput: ", udec(throughput), " sequences/second")
debug("Active cogs: ", udec(active_cogs))
DAT
test_bytecode byte OP_ADD, OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_CONS, OP_CAR, OP_CDR, OP_EQ, OP_HALT
Summary
This hybrid architecture provides:
- Maximum memory efficiency through LUT RAM sharing
- High throughput via continuous cog operation
- Excellent scalability supporting up to 8 cogs
- Robust work distribution using proven circular queue patterns
- Hardware optimization leveraging P2's unique features
The system achieves your goals of multiple cogs running continuously with shared LUT tables while providing efficient work distribution through a combination of the best inter-cog communication methods available on the Propeller 2.
- ^ https://forums.parallax.com/uploads/editor/bz/sexnusfpmtjf.pdf
- ^ https://p2docs.github.io/lutmem.html
- ^ https://forums.parallax.com/discussion/168399/catalina-and-the-p2
- ^ https://openclassrooms.com/en/courses/5684021-scale-up-your-code-with-java-concurrency/6667996-implement-a-producer-consumer-pattern-using-a-blockingqueue
- ^ https://www.kernel.org/doc/html/v6.14-rc1/core-api/circular-buffers.html
- ^ https://jenkov.com/tutorials/java-concurrency/producer-consumer.html
- ^ https://java-design-patterns.com/patterns/producer-consumer/
- ^ https://www.boost.org/doc/libs/release/libs/circular_buffer/
- ^ https://forums.parallax.com/discussion/172806/question-on-p2-hub-ram-access
- ^ a b https://forums.parallax.com/uploads/editor/iv/wa47dttfu0ac.pdf
- ^ https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20221101-3006917.pdf?srsltid=AfmBOoqhTxkXBcYNzk2IyoCpgPVpzb8_OkD16GKW2uJ0GBwTL_sLXrX-
-
[^
12
]https://forums.parallax.com/discussion/164188/is-lut-sharing-between-adjacent-cogs-very-important/p15
-
[^
13
]https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20210709-3006917.pdf?srsltid=AfmBOoq8MC4gbQ-VoxzqDh5vz9GgJCt6eu2MtPiTn1Aety2K3ALRPmI5
-
[^
14
]https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20210709-3006917.pdf
-
[^
15
]https://my.eng.utah.edu/~cs3710/xilinx-docs/xapp464.pdf
-
[^
16
]https://www.electronicdesign.com/technologies/industrial/boards/article/21754713/eied-online-parallax-propeller
-
[^
17
]https://forums.parallax.com/discussion/169539/cog-2-cog-communication
-
[^
18
]https://forums.parallax.com/discussion/164188/is-lut-sharing-between-adjacent-cogs-very-important/p14
-
[^
19
]https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20221101-3006917.pdf?srsltid=AfmBOoqBQMlQrLBjH_N-aOGAf3OfOdqK2M7u94o0caoR9z44OnfPz2zy
-
[^
20
]https://propeller373.rssing.com/chan-63988404/all_p2.html
-
[^
21
]https://microarch.org/micro53/papers/738300a088.pdf
-
[^
22
]https://github.com/parallaxinc/propgcc-docs/blob/master/doc/Memory.md
-
[^
23
]https://www.scribd.com/document/381642317/P2-DataSheet
-
[^
24
]https://forums.parallax.com/discussion/98581/propeller-memory-management
-
[^
25
]https://www.parallax.com/propeller/qna-mobile/Advanced/Content/QnaTopics/QnaCogs.htm
-
[^
26
]https://dl.acm.org/doi/pdf/10.1145/3508352.3549469
-
[^
27
]https://www.nutsvolts.com/magazine/article/an-introduction-to-the-parallax-propeller-2
-
[^
28
]https://www.mouser.com/datasheet/2/321/Propeller2_P2X8C4M64P_Datasheet_20221101-3006917.pdf?srsltid=AfmBOooCCeLdPxX-0oZloBfRZj4rLc54O_T7JwrlCOI2PZ3Gx1-O9rEn
-
[^
29
]https://mm.digikey.com/Volume0/opasdata/d220001/medias/docus/5789/P2-EC32MB-Edge-Module-Rev-B-Guide-v2.0.pdf
-
[^
30
]https://p2docs.github.io
-
[^
31
]https://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
-
[^
32
]http://cdn.propellerheads.se/ReasonEssentials2/Manuals/Reason_Essentials_2_Operation_Manual.pdf
-
[^
33
]https://www.suse.com/support/kb/doc/?id=000016427
-
[^
34
]https://knowledge.broadcom.com/external/article/15322/default-hub-queue-configuration-on-secon.html
-
[^
35
]https://www.parallax.com/package/propeller-2-assembly-language-pasm2-manual/
-
[^
36
]https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/unified-infrastructure-management/23-4/installing/install-secondary-hubs/configure-queues-and-tunnels.html
-
[^
37
]https://www.parallax.com/propeller-2/documentation/
-
[^
38
]https://forums.parallax.com/discussion/175917/are-there-current-examples-of-pasm2-in-lut-memory
-
[^
39
]https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v14_0.pdf
-
[^
40
]https://www.youtube.com/watch?v=VUA_9i4l-zI
-
[^
41
]https://www.linkedin.com/posts/robert-anderson-bi-architect_implementation-details-31-message-queues-activity-7278263655857569792-hzB3
-
[^
42
]https://wwwapps.grassvalley.com/manuals/k2_summit_v10.1.3/core/x-cc/content/topic/k2/inout/t_CaptureServiceP2Config.html
-
[^
43
]https://www.cs.toronto.edu/~matei/docs/D-Queue_Implementations-DISC04.pdf
-
[^
44
]https://forums.parallax.com/discussion/144683/propeller-ii-programing-questions-to-chip
-
[^
45
]https://researchut.com/post/99/
-
[^
46
]https://drops.dagstuhl.de/storage/00lipics/lipics-vol253-opodis2022/LIPIcs.OPODIS.2022.4/LIPIcs.OPODIS.2022.4.pdf
-
[^
47
]https://www.youtube.com/watch?v=7jbm0LXulCA
-
[^
48
]https://stackoverflow.com/questions/58844583/how-to-implement-fifo-queue-with-reactjsredux
-
[^
49
]https://forums.parallax.com/discussion/comment/1002349/
-
[^
50
]https://learn.microsoft.com/ka-ge/cpp/parallel/concrt/how-to-implement-various-producer-consumer-patterns?view=msvc-160
-
[^
51
]https://pymotw.com/2/Queue/
-
[^
52
]https://groups.google.com/g/beagleboard/c/F9JI8_vQ-mE
-
[^
53
]https://www.reddit.com/r/learnprogramming/comments/1gyawvn/are_there_real_situations_where_a/
-
[^
54
]https://www.cse.iitd.ac.in/~neeraj/TA/cs316/2005-2006/project/final/anubha.pdf
-
[^
55
]https://stackoverflow.com/questions/1669245/circular-buffer-in-flash
-
[^
56
]https://stackoverflow.com/questions/30589977/for-producer-consumer-pattern-which-implementation-is-better-arrayblockingqueu
-
[^
57
]https://gist.github.com/wendyyuchensun/b8ac0cebae3acf53bdbb561f05d2e738
-
[^
58
]https://forums.parallax.com/discussion/171176/memory-drivers-for-p2-psram-sram-hyperram-was-hyperram-driver-for-p2/p7
-
[^
59
]https://www.youtube.com/watch?v=Vnn5ezleq54
-
[^
60
]https://arxiv.org/html/2411.07176
-
[^
61
]https://www.polleverywhere.com
-
[^
62
]https://forums.parallax.com/discussion/169714/p2-mailbox-and-parameters-where-to-place-and-what-is-needed
-
[^
63
]https://arxiv.org/abs/2411.07176
-
[^
64
]https://www.parallax.com/go/PBASICHelp/Content/LanguageTopics/Commands/POLLMODE.htm
-
[^
65
]https://www.elektormagazine.com/articles/parallax-propeller-2-part-2-spin2
-
[^
66
]https://huggingface.co/papers/2411.07176
-
[^
67
]https://www.npmjs.com/package/polling-to-event
-
[^
68
]https://github.com/parallaxinc/propgcc-docs/blob/master/doc/InDepth.md
-
[^
69
]https://discuss.huggingface.co/t/more-expressive-attention-with-negative-weights/119667
-
[^
70
]https://techcommunity.microsoft.com/discussions/97357baa-b675-431d-b1e3-ef2daac79067/event-polling-system/4423994
-
[^
71
]https://proto.actor/docs/mailboxes/
-
[^
72
]https://github.com/trestad/cogattn
-
[^
73
]https://softwareengineering.stackexchange.com/questions/102771/when-would-polling-for-events-be-better-than-using-observer-pattern/102781
-
[^
74
]https://www.instructables.com/Smart-Mail-Box/
-
[^
75
]https://pmc.ncbi.nlm.nih.gov/articles/PMC2879667/
-
[^
76
]https://docs.zephyrproject.org/latest/kernel/services/polling.html
-
[^
77
]https://pkg.go.dev/github.com/AsynkronIT/protoactor-go/mailbox
-
[^
78
]https://web.uvic.ca/~dbub/Cognition_Action/Cog_Psych_Readings_files/Attention.pdf