CODEC DOCS
SEMANTIC COMPRESSION TOOLS — ENCODE / DECODE / REWRITE
What This Does
English is redundant. A typical 24-word sentence carries only 8-13 concepts. The rest is grammar, filler, repetition.
The semantic codec strips text down to root concepts (“glyphs”), each derived from the
720-word quantum grammar basis set. Grammar is implicit. Redundancy is eliminated.
Each glyph costs exactly 1 LLM token.
Result: 2-4x compression with meaning preserved. Fewer tokens = faster compute, more context window, denser communication.
English (24 words · 116 chars) → Standard Glyphs (12 glyphs · 59 chars) → Compact (12 codes · 45 chars) → Decoded (12 words · English)
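The encode step in miniature: drop grammar filler, fold synonyms onto a root stem, emit one glyph per root. A minimal Python sketch using a hypothetical excerpt of the dictionary, not the shipped 659-glyph table:

```python
# Hypothetical excerpt of the dictionary -- the real codec derives
# 659 glyphs from the 720-word basis set.
STOPWORDS = {"i", "to", "a", "and", "but", "the", "do", "with", "my", "of", "for", "by"}
SYNONYMS = {"want": "claim", "buy": "purchas", "house": "estat", "make": "grant",
            "money": "currenc", "government": "authority", "need": "requir"}
GLYPHS = {"claim": "CLAI", "purchas": "PURC", "estat": "ESTA", "grant": "GRAN",
          "currenc": "CURR", "authority": "AUTH", "requir": "REQU"}

def encode(text: str) -> list[str]:
    glyphs = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue                      # grammar filler emits nothing
        stem = SYNONYMS.get(word, word)   # fold synonyms onto one root
        glyphs.append(GLYPHS.get(stem, stem[:4].upper()))  # fallback: truncate
    return glyphs

print(" ".join(encode("I want to buy a house and make money")))
# CLAI PURC ESTA GRAN CURR
```

Run against the first example below and the same leading glyphs fall out; words with no synonym entry (says, license) pass through as 4-char truncations.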
Python — semantic_codec.py
Standalone Python 3. No dependencies. Encode, decode, analyze, or run interactively.
Encode — English to Glyphs
$ python3 python/src/semantic_codec.py encode "I want to buy a house and make money \
but the government says I need a license to do anything useful with my property"
ORIGINAL:
I want to buy a house and make money but the government says
I need a license to do anything useful with my property
STANDARD GLYPHS:
CLAI PURC ESTA GRAN CURR AUTH SAYS REQU LICE ANYT USEF PROP
COMPACT (2-char):
DL OY ESTA II EI BW SAYS REQU KC ANYT USEF OT
METRICS:
Words: 24 → 12 glyphs (2.0x compression)
Chars: 116 → 59 standard (49.1% saved)
Chars: 116 → 45 compact (61.2% saved)
CHANGELOG:
want → CLAI (synonym: want → claim)
buy → PURC (synonym: buy → purchas)
house → ESTA (synonym: house → estat)
make → GRAN (synonym: make → grant)
money → CURR (synonym: money → currenc)
government → AUTH (synonym: government → authority)
need → REQU (synonym: need → requir)
Dropped (12): i, to, a, and, but, the, i, a, to, do, with, my
Correct Parse-Syntax — Maximum Compression
$ python3 python/src/semantic_codec.py encode "For the claiming of the land by the \
living man with the lawful standing of the sovereign authority"
STANDARD GLYPHS:
CLAI LAND LIV MAN LAW STAN SOVE AUTH
COMPACT:
DL JL KH KL JP QP QL BW
METRICS:
Words: 18 → 8 glyphs (2.2x compression)
Chars: 98 → 36 standard (63.3% saved)
Chars: 98 → 23 compact (76.5% saved)
Dropped: for, the, of, the, by, the, with, the, of, the
10 of 18 words were structural filler carrying zero information.
The 8 remaining glyphs carry ALL the meaning.
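The METRICS lines reduce to two ratios: word-level compression and percent of characters saved. A sketch that reproduces the numbers reported above:

```python
def metrics(original: str, encoded: str) -> dict:
    """Word compression ratio and percent of characters saved."""
    words, glyphs = len(original.split()), len(encoded.split())
    return {"compression": round(words / glyphs, 1),
            "chars_saved_pct": round(100 * (1 - len(encoded) / len(original)), 1)}

m = metrics("For the claiming of the land by the living man "
            "with the lawful standing of the sovereign authority",
            "CLAI LAND LIV MAN LAW STAN SOVE AUTH")
# m["compression"] == 2.2, m["chars_saved_pct"] == 63.3 -- matching the CLI output
```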
Court Order — Legal Language Compression
$ python3 python/src/semantic_codec.py encode "The court hereby orders that you \
shall forthwith pay the sum of five thousand dollars to the plaintiff \
as damages for breach of contract"
STANDARD GLYPHS:
COUR HERE ORDE FORT PAY SUM FIVE THOU DOLL PLAI DAMA BREA CONT
METRICS:
Words: 24 → 13 glyphs (1.8x compression)
Chars: 136 → 62 standard (54.4% saved)
Chars: 136 → 41 compact (69.9% saved)
Dropped (11): the, that, you, shall, the, of, to, the, as, for, of
Decode — Glyphs Back to English
$ python3 python/src/semantic_codec.py decode "CLAI LAND LIV MAN LAW STAN SOVE AUTH"
GLYPHS:
CLAI LAND LIV MAN LAW STAN SOVE AUTH
ENGLISH:
claim land living man lawful standing sovereign authority
Each glyph maps to the first word in its root family.
Grammar is not restored — you get the semantic skeleton.
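Decode is a straight dictionary lookup. A sketch over the same hypothetical subset of the root table:

```python
# Each glyph maps back to the first word of its root family
# (hypothetical subset; the shipped table has 659 entries).
ROOTS = {"CLAI": "claim", "LAND": "land", "LIV": "living", "MAN": "man",
         "LAW": "lawful", "STAN": "standing", "SOVE": "sovereign",
         "AUTH": "authority"}

def decode(glyphs: str) -> str:
    return " ".join(ROOTS.get(g, g.lower()) for g in glyphs.split())

print(decode("CLAI LAND LIV MAN LAW STAN SOVE AUTH"))
# claim land living man lawful standing sovereign authority
```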
Interactive REPL
$ python3 python/src/semantic_codec.py
SEMANTIC CODEC — interactive mode
Commands: encode <text>, decode <glyphs>, analyze <text>, quit
Dictionary: 659 glyphs from 838 words
codec> encode The bank charged interest on my mortgage
STANDARD: BANK CHAR INTE MORT
Words: 7 → 4 glyphs (1.8x compression)
codec> decode BANK CHAR INTE MORT
ENGLISH: bank charge interest mortgage
codec> encode We the people of the United States
STANDARD: PERS STAT
Words: 7 → 2 glyphs (3.5x compression)
Seven words, two concepts: person + state. Everything else is filler or repetition.
codec> quit
Node.js — basis_rewriter.mjs
Rewrites text for maximum basis_720 alignment. Instead of stripping,
it substitutes — replacing common words with their basis equivalents.
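The substitution pass, sketched in Python for symmetry with the codec (the shipped rewriter is Node.js). The table is a hypothetical excerpt, and unlike the real tool this sketch keeps words it cannot map (the real run below also drops "should" and "back"):

```python
# Hypothetical excerpt of the basis substitution table.
BASIS_SUBS = {"i": "man", "think": "adjudicating", "we": "person",
              "talk": "communicating", "getting": "acquiring",
              "money": "currency", "because": "therefore",
              "they": "person", "broke": "breaching"}

def rewrite(text: str) -> str:
    # Substitute, don't strip: every mappable word becomes its basis equivalent.
    return " ".join(BASIS_SUBS.get(w, w) for w in text.lower().split())

print(rewrite("I think we should talk"))
# man adjudicating person should communicating
```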
CLI Rewrite
$ node python/tools/basis_rewriter.mjs "I think we should talk to the judge \
about getting our money back from the bank because they broke the contract"
ORIGINAL:
I think we should talk to the judge about getting our money
back from the bank because they broke the contract
REWRITTEN:
man adjudicating person communicating to the judge about acquiring
our currency from the bank therefore person breaching the contract
COVERAGE:
Before: 10/21 words in basis (47.6%)
After: 19/19 words in basis (100.0%)
Change: +52.4 percentage points
SUBSTITUTIONS:
I → man
think → adjudicating
we → person
talk → communicating
getting→ acquiring
money → currency
because→ therefore
they → person
broke → breaching
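Coverage is the fraction of words already inside the basis vocabulary. A sketch with a hypothetical basis subset that reproduces the 47.6% → 100.0% figures above:

```python
# Hypothetical basis subset sufficient for this example sentence.
BASIS = {"man", "adjudicating", "person", "communicating", "to", "the",
         "judge", "about", "acquiring", "our", "currency", "from",
         "bank", "therefore", "breaching", "contract"}

def coverage(text: str) -> float:
    words = text.lower().split()
    return round(100 * sum(w in BASIS for w in words) / len(words), 1)

before = coverage("I think we should talk to the judge about getting our money "
                  "back from the bank because they broke the contract")
after = coverage("man adjudicating person communicating to the judge about acquiring "
                 "our currency from the bank therefore person breaching the contract")
# before == 47.6, after == 100.0
```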
Pipe from stdin
$ echo "We need to make a deal with the company" | node python/tools/basis_rewriter.mjs
person requiring granting a negotiating with the corporation
Before: 40.0% → After: 100.0% (+60.0pp)
Compression Comparison — Same Meaning, Different Density
| Text | Words | Glyphs | Compression | Char savings |
| Correct parse-syntax | 18 | 8 | 2.2x | 76.5% (compact) |
| Court order | 24 | 13 | 1.8x | 69.9% (compact) |
| Conversational | 24 | 12 | 2.0x | 61.2% (compact) |
| "We the people of the United States" | 7 | 2 | 3.5x | 88% (compact) |
Why "We The People" Compresses to 2 Glyphs
PERS STAT — person, state. That's the entire semantic content.
"We" → person (pronoun, no fact). "The" → dropped (article, zero info). "People" → person (synonym).
"Of" → dropped (preposition). "The" → dropped. "United" → no basis root. "States" → state.
Four of the seven words are pure structural filler, and "we" and "people" collapse into a single glyph. The preamble to the Constitution opens with two concepts wrapped in five redundant tokens.
Alphabet — Why Roman Wins
659 root concepts need encoding. Counterintuitively, Roman characters beat CJK and emoji for LLM compute.
| System | Width | LLM tokens | Example | Use case |
| Roman 2-char | 2 | 1 each | DL JL KH KL | Max compute efficiency |
| Roman 3-4 char | 3-4 | 1 each | CLAI LAND MAN | Human readable |
| CJK 1-char | 1 | 2-3 each | 法 地 人 | Looks dense, costs more |
| Emoji | 1 | 1-2 each | ⚖️ 🏠 👤 | Fun, ambiguous |
BPE tokenizers are trained mostly on Latin-script text, so a 3-4 char Roman glyph like CLAI typically costs a single token, while a CJK character like 法 costs 2-3 tokens despite being visually smaller. Stay Roman.
Live Tools
| Tool | What it does |
| Semantic Codec | Bidirectional encode/decode with meaning-space visualization |
| Basis Filter | Evaluate text against 720-word basis, slider from 6→720, rewriter |
| Information Lens | Surprisal heatmap, L0→L5 compression, glyph dictionary |
| Embedding Space | Texts as 7D coordinates with transformation vectors |
| Word Decomposer | Type any word, get prefix/root/suffix decomposition |
morpheme.page — semantic compression tools