CODEC DOCS

SEMANTIC COMPRESSION TOOLS — ENCODE / DECODE / REWRITE
What This Does
English is redundant. A 24-word sentence carries ~8 concepts. The rest is grammar, filler, repetition.

The semantic codec strips text down to root concepts (“glyphs”), each derived from the 720-word quantum grammar basis set. Grammar is implicit. Redundancy is eliminated. Each glyph is meant to cost 1 LLM token, though the exact count depends on the tokenizer.

Result: roughly 2-4x fewer words in the examples on this page (1.8x to 4.0x), with the core concepts preserved. Fewer tokens = faster compute, more context window, denser communication.
English: 24 words · 116 chars
Standard Glyphs: 12 glyphs · 59 chars
Compact: 12 codes · 45 chars
Decoded: 12 words · English

Python — semantic_codec.py

Standalone Python 3. No dependencies. Encode, decode, analyze, or run interactively.

Encode — English to Glyphs

python/src/semantic_codec.py

    $ python3 python/src/semantic_codec.py encode "I want to buy a house and make money \
    but the government says I need a license to do anything useful with my property"

    ORIGINAL:
      I want to buy a house and make money but the government says I need a license to do anything useful with my property

    STANDARD GLYPHS:
      CLAI PURC ESTA GRAN CURR AUTH SAYS REQU LICE ANYT USEF PROP

    COMPACT (2-char):
      DL OY ESTA II EI BW SAYS REQU KC ANYT USEF OT

    METRICS:
      Words: 24 → 12 glyphs (2.0x compression)
      Chars: 116 → 59 standard (49.1% saved)
      Chars: 116 → 45 compact (61.2% saved)

    CHANGELOG:
      want   CLAI  (synonym: want → claim)
      buy    PURC  (synonym: buy → purchas)
      house  ESTA  (synonym: house → estat)
      make   GRAN  (synonym: make → grant)
      money  CURR  (synonym: money → currenc)
      gov't  AUTH  (synonym: government → authority)
      need   REQU  (synonym: need → requir)
      Dropped (12): i, to, a, and, but, the, i, a, to, do, with, my
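The encode output above suggests three steps: drop function words, map synonyms to a basis root, and keep the first four letters of the root as the glyph. The following is a minimal sketch of that flow, not the code in semantic_codec.py. The names `STOPWORDS`, `SYNONYM_ROOTS`, and `encode` are hypothetical, and the tables contain only the entries visible in the sample output.

```python
import string

# Hypothetical miniature tables, filled in from the sample output only.
# The real dictionary ships with semantic_codec.py and is much larger.
STOPWORDS = {"i", "to", "a", "and", "but", "the", "do", "with", "my"}
SYNONYM_ROOTS = {
    "want": "claim",
    "buy": "purchas",
    "house": "estat",
    "make": "grant",
    "money": "currenc",
    "government": "authority",
    "need": "requir",
}


def encode(text):
    """Return (glyphs, dropped) for one sentence."""
    glyphs, dropped = [], []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            continue
        if word in STOPWORDS:
            dropped.append(word)  # structural filler: recorded, not encoded
            continue
        # Assumption: a word with no synonym entry is treated as its own root.
        root = SYNONYM_ROOTS.get(word, word)
        # Assumption: the glyph is the first four letters of the root.
        glyphs.append(root[:4].upper())
    return glyphs, dropped


sentence = (
    "I want to buy a house and make money but the government says "
    "I need a license to do anything useful with my property"
)
glyphs, dropped = encode(sentence)
print(" ".join(glyphs))
print(f"Dropped ({len(dropped)}): {', '.join(dropped)}")
```

Three-letter glyphs such as LIV and LAW in later examples imply that some roots are shorter than the surface word, so a real encoder needs a root lookup for every basis word, not only for synonyms.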

Correct Parse-Syntax — Maximum Compression

Perfect basis alignment = pure signal

    $ python3 python/src/semantic_codec.py encode "For the claiming of the land by the \
    living man with the lawful standing of the sovereign authority"

    STANDARD GLYPHS:
      CLAI LAND LIV MAN LAW STAN SOVE AUTH

    COMPACT:
      DL JL KH KL JP QP QL BW

    METRICS:
      Words: 18 → 8 glyphs (2.2x compression)
      Chars: 98 → 36 standard (63.3% saved)
      Chars: 98 → 23 compact (76.5% saved)

    Dropped: for, the, of, the, by, the, with, the, of, the

10 of 18 words were structural filler carrying zero information. The 8 remaining glyphs carry ALL the meaning.

Court Order — Legal Language Compression

Legal text compresses heavily: nearly half the words are structural

    $ python3 python/src/semantic_codec.py encode "The court hereby orders that you \
    shall forthwith pay the sum of five thousand dollars to the plaintiff \
    as damages for breach of contract"

    STANDARD GLYPHS:
      COUR HERE ORDE FORT PAY SUM FIVE THOU DOLL PLAI DAMA BREA CONT

    METRICS:
      Words: 24 → 13 glyphs (1.8x compression)
      Chars: 136 → 62 standard (54.4% saved)
      Chars: 136 → 41 compact (69.9% saved)

    Dropped (11): the, that, you, shall, the, of, to, the, as, for, of
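The METRICS lines in these examples are plain arithmetic: words divided by glyphs for the compression ratio, and one minus the character ratio for the savings. The sketch below reproduces the court-order figures. The function name `compression_metrics` is hypothetical, and it assumes that characters are counted including the spaces between words and glyphs.

```python
def compression_metrics(original, glyph_text):
    """Return (words, glyphs, ratio, percent_saved) for an encoded sentence."""
    words = len(original.split())
    glyphs = len(glyph_text.split())
    ratio = words / glyphs
    # Assumption: character counts include the single spaces between tokens.
    saved = (1 - len(glyph_text) / len(original)) * 100
    return words, glyphs, round(ratio, 1), round(saved, 1)


original = (
    "The court hereby orders that you shall forthwith pay the sum of "
    "five thousand dollars to the plaintiff as damages for breach of contract"
)
standard = "COUR HERE ORDE FORT PAY SUM FIVE THOU DOLL PLAI DAMA BREA CONT"

words, glyphs, ratio, saved = compression_metrics(original, standard)
print(f"Words: {words} → {glyphs} glyphs ({ratio}x compression)")
print(f"Chars: {len(original)} → {len(standard)} standard ({saved}% saved)")
```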

Decode — Glyphs Back to English

Bidirectional: glyphs decode to root English words

    $ python3 python/src/semantic_codec.py decode "CLAI LAND LIV MAN LAW STAN SOVE AUTH"

    GLYPHS:  CLAI LAND LIV MAN LAW STAN SOVE AUTH
    ENGLISH: claim land living man lawful standing sovereign authority

Each glyph maps to the first word in its root family. Grammar is not restored; you get the semantic skeleton.
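Decoding as described here is a lookup: each glyph names a root family, and the first word of that family is emitted. A minimal sketch follows, assuming a hypothetical `ROOT_FAMILIES` table. Only the first word of each family comes from the sample output. The other family members, and the fallback for unknown glyphs, are illustrative guesses.

```python
# Hypothetical root families. The first entry is the canonical decode word;
# the remaining entries are illustrative and not taken from the real dictionary.
ROOT_FAMILIES = {
    "CLAI": ["claim", "claiming", "claimant"],
    "LAND": ["land"],
    "LIV": ["living", "live"],
    "MAN": ["man"],
    "LAW": ["lawful", "law"],
    "STAN": ["standing", "stand"],
    "SOVE": ["sovereign", "sovereignty"],
    "AUTH": ["authority", "authorize"],
}


def decode(glyph_text):
    """Map each glyph to the first word in its root family."""
    words = []
    for glyph in glyph_text.split():
        family = ROOT_FAMILIES.get(glyph.upper())
        # Assumption: an unknown glyph is passed through in lowercase.
        words.append(family[0] if family else glyph.lower())
    return " ".join(words)


print(decode("CLAI LAND LIV MAN LAW STAN SOVE AUTH"))
```

Because dropped words are never stored, the round trip is lossy by design: decoding recovers concepts, not the original sentence.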

Interactive REPL

Explore interactively

    $ python3 python/src/semantic_codec.py
    SEMANTIC CODEC — interactive mode
    Commands: encode <text>, decode <glyphs>, analyze <text>, quit
    Dictionary: 659 glyphs from 838 words

    codec> encode The bank charged interest on my mortgage
    STANDARD: BANK CHAR INTE MORT
    Words: 9 → 4 glyphs (2.2x compression)

    codec> decode BANK CHAR INTE MORT
    ENGLISH: bank charge interest mortgage

    codec> encode We the people of the United States
    STANDARD: PERS STAT
    Words: 8 → 2 glyphs (4.0x compression)
    "We the people of the United" → all filler. Two concepts: person + state.

    codec> quit

Node.js — basis_rewriter.mjs

Rewrites text for maximum basis_720 alignment. Instead of stripping, it substitutes — replacing common words with their basis equivalents.

CLI Rewrite

python/tools/basis_rewriter.mjs

    $ node python/tools/basis_rewriter.mjs "I think we should talk to the judge \
    about getting our money back from the bank because they broke the contract"

    ORIGINAL:
      I think we should talk to the judge about getting our money back from the bank because they broke the contract

    REWRITTEN:
      man adjudicating person communicating to the judge about acquiring our currency from the bank therefore person breaching the contract

    COVERAGE:
      Before: 10/21 words in basis (47.6%)
      After:  19/19 words in basis (100.0%)
      Change: +52.4 percentage points

    SUBSTITUTIONS:
      I        man
      think    adjudicating
      we       person
      talk     communicating
      getting  acquiring
      money    currency
      because  therefore
      they     person
      broke    breaching
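The rewrite and its coverage figures can be sketched in a few lines. This is a Python illustration of the idea, not the Node.js tool itself. `BASIS`, `SUBSTITUTIONS`, `coverage`, and `rewrite` are hypothetical names, and the tables hold only the words visible in the sample run. Dropping words that are neither in the basis nor substitutable ("should", "back") is inferred from the sample output.

```python
# Hypothetical fragments of the basis and substitution tables,
# reconstructed from the sample run only.
BASIS = {
    "to", "the", "judge", "about", "our", "from", "bank", "contract",
    "man", "adjudicating", "person", "communicating",
    "acquiring", "currency", "therefore", "breaching",
}
SUBSTITUTIONS = {
    "i": "man", "think": "adjudicating", "we": "person",
    "talk": "communicating", "getting": "acquiring", "money": "currency",
    "because": "therefore", "they": "person", "broke": "breaching",
}


def coverage(words):
    """Percentage of words that are in the basis."""
    if not words:
        return 0.0
    return 100 * sum(w in BASIS for w in words) / len(words)


def rewrite(text):
    """Return (rewritten_text, coverage_before, coverage_after)."""
    before = text.lower().split()
    after = []
    for word in before:
        if word in BASIS:
            after.append(word)
        elif word in SUBSTITUTIONS:
            after.append(SUBSTITUTIONS[word])
        # Assumption: words with no basis form and no substitute are dropped.
    return " ".join(after), round(coverage(before), 1), round(coverage(after), 1)


text = ("I think we should talk to the judge about getting our money back "
        "from the bank because they broke the contract")
rewritten, before_pct, after_pct = rewrite(text)
print(rewritten)
print(f"Before: {before_pct}% → After: {after_pct}% ({after_pct - before_pct:+.1f}pp)")
```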

Pipe from stdin

Works with pipes for scripting

    $ echo "We need to make a deal with the company" | node python/tools/basis_rewriter.mjs
    person requiring granting a negotiating with the corporation
    Before: 40.0% → After: 100.0% (+60.0pp)

Compression Comparison — Same Meaning, Different Density

Text                                  Words  Glyphs  Compression  Char savings
Correct parse-syntax                  18     8       2.2x         76.5% (compact)
Court order                           24     13      1.8x         69.9% (compact)
Conversational                        24     12      2.0x         61.2% (compact)
"We the people of the United States"  8      2       4.0x         88% (compact)

Why "We The People" Compresses to 2 Glyphs
PERS STAT — person, state. That's the entire semantic content.

"We" → person (pronoun, no fact). "The" → dropped (article, zero info). "People" → person (synonym). "Of" → dropped (preposition). "The" → dropped. "United" → no basis root. "States" → state.

Everything apart from person and state is structural filler, a duplicate, or a word with no basis root. The preamble to the Constitution opens with two concepts.

Alphabet — Why Roman Wins

659 root concepts need encoding. Counterintuitively, Roman characters beat CJK and emoji for LLM compute.

System          Width  LLM tokens  Example        Use case
Roman 2-char    2      1 each      DL JL KH KL    Max compute efficiency
Roman 3-4 char  3-4    1 each      CLAI LAND MAN  Human readable
CJK 1-char      1      2-3 each    法 地 人       Looks dense, costs more
Emoji           1      1-2 each    ⚖️ 🏠 👤       Fun, ambiguous
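Two uppercase Roman letters give 26 × 26 = 676 distinct codes, just enough for 659 root concepts. The sketch below shows one way to map a dictionary index to a 2-char code and back. The function names are hypothetical, and the ordering is an assumption: the codec's actual assignment (for example CLAI → DL) is not documented here, and the sample output keeps some glyphs such as ESTA and SAYS in 4-char form even in compact mode.

```python
import string

LETTERS = string.ascii_uppercase   # 26 letters
CAPACITY = len(LETTERS) ** 2       # 676 two-letter codes


def compact_code(index):
    """Map a dictionary index (0..675) to a two-letter code, AA..ZZ."""
    if not 0 <= index < CAPACITY:
        raise ValueError(f"index {index} does not fit in two letters")
    high, low = divmod(index, len(LETTERS))
    return LETTERS[high] + LETTERS[low]


def compact_index(code):
    """Inverse of compact_code."""
    return LETTERS.index(code[0]) * len(LETTERS) + LETTERS.index(code[1])


print(CAPACITY)            # total codes available
print(compact_code(0))     # first code
print(compact_code(658))   # code for the 659th concept
```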

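Most widely used BPE tokenizers were trained on corpora dominated by Latin-script text, so short uppercase Roman strings usually map to a single token. A 3-char Roman glyph like MAN typically costs 1 token. A CJK character like 法 can cost 2-3 tokens in vocabularies with sparse CJK coverage, despite being visually smaller. Exact counts depend on the tokenizer. Stay Roman.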
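One reason a visually small character can cost more: byte-level BPE tokenizers build tokens by merging UTF-8 bytes, so a string's byte count is an upper bound on its token count, and a character with few learned merges can cost close to that bound. The sketch below only counts bytes using the standard library. It does not measure real token counts, which require the specific tokenizer in use. The helper name `utf8_bytes` is hypothetical.

```python
def utf8_bytes(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Samples taken from the alphabet comparison: two Roman glyphs,
# one CJK character, one emoji.
for sample in ["DL", "CLAI", "法", "🏠"]:
    print(f"{sample!r}: {len(sample)} code point(s), {utf8_bytes(sample)} UTF-8 byte(s)")
```

A single CJK character is three bytes and the house emoji is four, while each Roman letter is one byte, so Roman glyphs start from a smaller worst case before any merges are applied.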

Live Tools

Tool              What it does
Semantic Codec    Bidirectional encode/decode with meaning-space visualization
Basis Filter      Evaluate text against 720-word basis, slider from 6→720, rewriter
Information Lens  Surprisal heatmap, L0→L5 compression, glyph dictionary
Embedding Space   Texts as 7D coordinates with transformation vectors
Word Decomposer   Type any word, get prefix/root/suffix decomposition
morpheme.page — semantic compression tools