AbstractPhila PRO
Omega Tokens: Finding The Self Solving Frame
First note: there is no degeneracy in this cell now. Per hundreds of bulk tests with many readouts, the degeneracy is swept up in the SVD kernel, the fl_gram eigh SVD, the FLEigh structure, or one of the subsequent catches that PyTorch handles.
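For readers unfamiliar with the degeneracy issue: when a spectrum has repeated values, the individual eigenvectors are not unique; only the subspace they span is well defined. Here is a minimal generic PyTorch illustration of that (this is stock `torch.linalg.eigh`, not the fl_gram/FLEigh kernel itself):

```python
import torch

# A matrix with a repeated (degenerate) eigenvalue: any orthonormal basis of
# the degenerate subspace is a valid eigenbasis, so eigenvectors are not unique.
A = torch.eye(3, dtype=torch.float64)
A[2, 2] = 2.0  # eigenvalues: 1, 1, 2 -- the eigenvalue 1 is doubly degenerate

# Perturb A infinitesimally; the recovered eigenvectors of the degenerate
# pair may rotate arbitrarily within their shared subspace.
eps = 1e-12
P = torch.randn(3, 3, dtype=torch.float64)
A_pert = A + eps * (P + P.T)

w1, V1 = torch.linalg.eigh(A)
w2, V2 = torch.linalg.eigh(A_pert)

# Eigenvalues are stable under the perturbation...
print(torch.allclose(w1, w2, atol=1e-9))        # True
# ...but only the projector onto the degenerate subspace is stable, not the
# individual eigenvectors inside it.
proj1 = V1[:, :2] @ V1[:, :2].T
proj2 = V2[:, :2] @ V2[:, :2].T
print(torch.allclose(proj1, proj2, atol=1e-6))  # True
```

This is why a degenerate spectrum has to be "swept up" somewhere: any downstream consumer of raw eigenvectors sees an arbitrary rotation inside the degenerate block.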
The degeneracy problem is solved, and solving it introduced a massive number of new problems. I have built prototypes to address them; each core problem has been narrowed down to three core components as solutions for information movement.
S^N sequential
Scattered S^N * D for orthogonal clustering
S * D + D * D for structural cohesive memory annealing
This comes down to three important utilities that many core structures (sequence, distance, cosine similarity, QKV support, rotary support, and more) depend on:
- Sequential structural cohesion; LLMs, tokens, next-token prediction, Spearman, and so on.
- Behavioral attenuated implicit; ViT, ResNets, diffusers, etc.
- Geometric alignment structure; distillation, transfer learning, teacher/student, genetic inheritance, generational learning, SVAE, geolip prototypes, and constellations.
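As a concrete anchor for the distance and cosine-similarity utilities named above, here is a minimal sketch over [b, S, D] token batches. The function names are mine for illustration, not from the cell repo:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of tokens in a [b, S, D] batch."""
    x = F.normalize(x, dim=-1)        # unit vectors along D
    return x @ x.transpose(-2, -1)    # [b, S, S]

def pairwise_distance(x: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between every pair of tokens, [b, S, S]."""
    return torch.cdist(x, x)

b, S, D = 2, 5, 16
tokens = torch.randn(b, S, D)
cos = pairwise_cosine(tokens)
dist = pairwise_distance(tokens)
# Sanity: self-similarity is 1 on the diagonal; self-distance is 0.
print(torch.allclose(cos.diagonal(dim1=-2, dim2=-1), torch.ones(b, S), atol=1e-4))   # True
print(torch.allclose(dist.diagonal(dim1=-2, dim2=-1), torch.zeros(b, S), atol=1e-4)) # True
```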
The third is the least useful and largely out of scope; the first two are very useful, so they are my predominant focus here.
I have 14 potential prototypes and will be forming a notebook for each, testing robustness, the positives, the negatives, storage and recall capacity, magnitude standardization vs. normalization accuracy, flow-matched directional EMA vs. non-EMA, the structurally supported ensemble approach vs. the residual approach, and a few other elemental substructures.
The biggest tradeoffs will be between normalization clipping and standardization's unit-structured tokens. These encode inherently different expectations and produce entirely different opinions.
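To make that tradeoff concrete, here is a generic contrast of the two regimes in plain PyTorch (a sketch of the general techniques, not the cell's actual normalization path):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 16) * 3.0 + 1.0   # tokens with arbitrary scale and offset

# Normalization: clip each token onto the unit sphere. Direction survives,
# magnitude is discarded entirely.
x_norm = F.normalize(x, dim=-1)
print(x_norm.norm(dim=-1))           # all ones -- magnitudes gone

# Standardization: zero-mean / unit-std per token. The relative feature
# layout survives, re-expressed in units of the token's own spread.
x_std = (x - x.mean(dim=-1, keepdim=True)) / x.std(dim=-1, keepdim=True)
print(x_std.mean(dim=-1))            # ~0 per token
print(x_std.std(dim=-1))             # ~1 per token
```

Normalization kills magnitude outright; standardization keeps a magnitude-like shape but re-centers it, so the two downstream "opinions" really are incompatible.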
Each of these experiments will be fully documented, the subsequent models included in the notebook sections, and the notebooks represented in the cell repo.
The Cell is a fickle beast, but I believe I have tamed the monster. The battery will be substantially stronger with the new cell upgrades, as it includes multiple constellation elements: FILM solidification, normalization at curative points rather than destructive ones, and a few other elements to assist with producing tokenizations, such as direct Conv support and Hugging Face transformer capacity for the MOE substructures.
As it stands, the transformer tokens here are represented simply as [b, S, D, V], also [b, S, U, Vt], and they have direct embedding tokenization potential on many structures, but not all. There are multiple deviant structures that suffer from certain rules and require additional solutions before they work.
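One plausible reading of the [b, S, U, Vt] form, assuming U/Vt name per-token SVD factors (my assumption; the repo's actual shapes may differ), is a batched per-token factorization like this:

```python
import torch

b, S, n = 2, 8, 16
# Treat each token as a small n x n matrix and factor it per token.
tok = torch.randn(b, S, n, n)
U, Svals, Vt = torch.linalg.svd(tok)   # batched over the [b, S] leading dims
# U: [b, S, n, n], Svals: [b, S, n], Vt: [b, S, n, n]
recon = U @ torch.diag_embed(Svals) @ Vt
print(torch.allclose(recon, tok, atol=1e-4))  # True: lossless factorization
```

Under this reading, the U/Vt axes carry rotation while Svals carries magnitude, which matches the magnitude-vs-rotation discussion later in the post.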
The prototypes may not exactly reflect this shape, and the shape may change for packaging and reuse purposes, so bear with it for now. I'm only one person, and I'm heavily relying on Claude to handle many of the logistics. I can code all of this; it just takes a lot longer for me to do manually, so I'm basically on NO GELU HERE - NO NORMS HERE - NO PROJECTION HERE duty. I'm babysitting Claude so the code is correct and making sure the tests come out as they are supposed to.
Original Question:
- Can the cells utilize positional encoding patches from the triton d=2 decompositions? I'm thinking maybe, and it's worth a shot.
Update:
- They can, and without degenerates when curated correctly. Their usefulness is limited unless applied to a regression cascade that projects upward to the largest structure and compares rotation. The rotations for the projections are a perfect Procrustes fit if handled correctly, as per the experimental documentation on the SVD Triton kernel.
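The "perfect Procrustes" here refers to the classic SVD solution of the orthogonal Procrustes problem. A generic sketch of that solution (stock PyTorch, not the Triton kernel):

```python
import torch

def procrustes_rotation(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Orthogonal matrix R minimizing ||A @ R - B||_F (orthogonal Procrustes)."""
    U, _, Vt = torch.linalg.svd(A.T @ B)
    return U @ Vt

# Rotate a point cloud by a known orthogonal matrix, then recover it exactly.
torch.manual_seed(0)
A = torch.randn(100, 8, dtype=torch.float64)
Q, _ = torch.linalg.qr(torch.randn(8, 8, dtype=torch.float64))  # random orthogonal
B = A @ Q
R = procrustes_rotation(A, B)
print(torch.allclose(R, Q, atol=1e-8))  # True: exact recovery up to fp error
```

When B really is a rotation of A (as in a clean upward projection), the recovered R is exact; noise in B degrades it gracefully rather than catastrophically.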
Original Question Assessment:
- If the degenerates can converge in-model implicitly, this changes the game entirely. The SVD cell can ensure the preservation, and the reconstruction of the cells has no rival. Now... let's see if this holds up over huge dimensional spaces, or if the models simply... shatter with the thin Triton 500x speed.
Getting access to that 500x speed is invaluable; there is no comparison.
Update:
- The model can benefit from the 5000x speed (yes, 5000, not 500), and it can be useful; however, there are stipulations that require multiple uses, so the gains are not as large as I'd hoped. You need to run many more of them to get a useful informational cluster; otherwise you just end up norming everything to death down the line without magnitudes, as shown by the cascade tests.
- There NEEDS to be an fp64 Triton kernel, and I'll work that out today, if it's even possible to run fp64 through Triton in this fashion. The fp32 version is showing serious rounding faults, and it needs to be addressed with fp64.
- The D=2 models operate around 8x faster, at a much lower accuracy overall. Without high-fidelity access to the FLEigh or gram eigh SVD structure, the model simply does not have the necessary matmul accuracy to represent the outcome.
- The D=2 magnitudes are useful only for D=2; if you project the rotation upward to the higher D, you lose the magnitude directional accuracy, as per the tests and documentation. This means D=2 run 32 times is still less accurate in magnitude terms than a single D=8, even when cross-correlation is used to determine the most likely magnitude.
- This behavior is ideal when distilling one model's signals into another, but not as useful when forming a proper embedding encoder utility chain. The encoder needs to be fairly stable, so you need to make sure the model is capably learning the encoding spectrum, and that each subsequent encoder down the chain sees the same structural system and the residual opinions of the last; otherwise the encodings are simply lost. I've mentioned it before: residuals are lossless in this regime, and that lossless behavior essentially manifests as rigidity and difficult-to-differentiate strictness. Correctly aligned, this is a powerful implicit structural controller; explicitly, it is a nightmare to tune into something that isn't just ON-OR-OFF gating.
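On the fp32 rounding faults mentioned above: an ill-conditioned Gram matrix makes the fp32/fp64 gap easy to see even with stock torch.linalg.eigh (a generic illustration of the precision issue, not the Triton kernel):

```python
import torch

torch.manual_seed(0)
# An ill-conditioned Gram matrix: near-collinear columns push some
# eigenvalues toward zero, which is where fp32 rounding bites hardest.
X = torch.randn(512, 64, dtype=torch.float64)
X[:, 1] = X[:, 0] + 1e-5 * torch.randn(512, dtype=torch.float64)
G = X.T @ X

w64, V64 = torch.linalg.eigh(G)
w32, V32 = torch.linalg.eigh(G.float())

# Max reconstruction error of G from its eigendecomposition, per precision.
err64 = (V64 @ torch.diag(w64) @ V64.T - G).abs().max().item()
err32 = (V32 @ torch.diag(w32) @ V32.T - G.float()).abs().max().item()
print(err64, err32)  # fp64 error sits orders of magnitude below fp32
```

If the cascade multiplies many such factorizations together, the fp32 error compounds, which is consistent with the rounding faults showing up only down the line.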
There are no more paved roads here... It's time to chart some jungle.
I've baked CM config controllers into the head of the spectral cell. This will allow the CM to be crutched heavily, letting the model legitimately diverge and drift into impossible terrain while still maintaining order, catching everything invalid as it cuts through.
======================================================================
COMPLETE
======================================================================
Best val acc: 93.8%
Time: 979s (8.2s/epoch)
Conv: 4,251,200 Cells: 366,176 Head: 167,946 Total: 4,785,322
Comparison:
SpectralCell standalone (D=16 V=16 h=256 +conv +aug): 79.1% 926K 1.2s/ep
ConduitBattery backbone (GPT trainer, ep55/120): 88.7% ~2M ?s/ep
Conv + SpectralCell inline: 93.8% 4,785,322 8.2s/ep

