CODEC DOCS

SEMANTIC COMPRESSION TOOLS — ENCODE / DECODE / REWRITE


What This Does

English is redundant. In the 24-word example below, only 12 words carry root concepts. The rest is grammar, filler, and repetition.

The semantic codec strips text down to root concepts (“glyphs”), each derived from the 720-word quantum grammar basis set. Grammar is left implicit and redundant words are dropped. Each glyph is designed to cost 1 LLM token, though the actual count depends on the tokenizer.

Result: about 2x compression on the examples below, and more on filler-heavy text, with the root concepts preserved. Fewer tokens means faster compute, more room in the context window, and denser communication.

English (24 words · 116 chars) → Standard Glyphs (12 glyphs · 59 chars) → Compact (12 codes · 45 chars) → Decoded (12 words · English)

Python — semantic_codec.py

Standalone Python 3. No dependencies. Encode, decode, analyze, or run interactively.

Encode — English to Glyphs

python/src/semantic_codec.py
$ python3 python/src/semantic_codec.py encode "I want to buy a house and make money \
    but the government says I need a license to do anything useful with my property"

ORIGINAL:
  I want to buy a house and make money but the government says
  I need a license to do anything useful with my property

STANDARD GLYPHS:
  CLAI PURC ESTA GRAN CURR AUTH SAYS REQU LICE ANYT USEF PROP

COMPACT (2-char):
  DL OY ESTA II EI BW SAYS REQU KC ANYT USEF OT

METRICS:
  Words: 24 → 12 glyphs (2.0x compression)
  Chars: 116 → 59 standard (49.1% saved)
  Chars: 116 → 45 compact (61.2% saved)

CHANGELOG:
  want    → CLAI     (synonym: want → claim)
  buy     → PURC     (synonym: buy → purchas)
  house   → ESTA     (synonym: house → estat)
  make    → GRAN     (synonym: make → grant)
  money   → CURR     (synonym: money → currenc)
  gov't   → AUTH     (synonym: government → authority)
  need    → REQU     (synonym: need → requir)
  Dropped (12): i, to, a, and, but, the, i, a, to, do, with, my
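
The changelog above suggests a three-step pipeline: drop function words, map each remaining word to a basis synonym, then shorten the root to a glyph. The sketch below is a minimal illustration of that idea, not the code in semantic_codec.py. The `STOPWORDS` and `SYNONYMS` tables are toy assumptions built from this one example, and plain 4-character truncation is also an assumption (the real tool emits some 3-character glyphs such as LIV and LAW, so it evidently works from roots rather than prefixes).

```python
# Toy tables (assumptions): the real basis set and synonym map are not shown here.
STOPWORDS = {"i", "to", "a", "and", "but", "the", "do", "with", "my"}
SYNONYMS = {
    "want": "claim", "buy": "purchase", "house": "estate", "make": "grant",
    "money": "currency", "government": "authority", "need": "require",
}


def encode(text: str, width: int = 4) -> tuple[list[str], list[str]]:
    """Return (glyphs, dropped) for a sentence using the toy tables above."""
    glyphs, dropped = [], []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'")
        if not word:
            continue
        if word in STOPWORDS:
            dropped.append(word)  # structural filler: no glyph emitted
            continue
        root = SYNONYMS.get(word, word)  # fall back to the word itself
        glyphs.append(root[:width].upper())  # assumed: glyph = root prefix
    return glyphs, dropped


glyphs, dropped = encode(
    "I want to buy a house and make money but the government says "
    "I need a license to do anything useful with my property"
)
print(" ".join(glyphs))
print(f"Dropped ({len(dropped)}): {', '.join(dropped)}")
```

With these toy tables the sketch yields the same 12 standard glyphs and the same 12 dropped words as the transcript above. It does not produce the compact 2-character codes, whose assignment is not described on this page.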

Correct Parse-Syntax — Maximum Compression

Perfect basis alignment = pure signal
$ python3 python/src/semantic_codec.py encode "For the claiming of the land by the \
    living man with the lawful standing of the sovereign authority"

STANDARD GLYPHS:
  CLAI LAND LIV MAN LAW STAN SOVE AUTH

COMPACT:
  DL JL KH KL JP QP QL BW

METRICS:
  Words: 18 → 8 glyphs (2.2x compression)
  Chars: 98 → 36 standard (63.3% saved)
  Chars: 98 → 23 compact (76.5% saved)

  Dropped: for, the, of, the, by, the, with, the, of, the
  10 of 18 words were structural filler carrying zero information.
  The 8 remaining glyphs carry ALL the meaning.
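
The METRICS lines can be checked by hand. The sketch below assumes the tool counts whitespace-separated words and measures the glyph string with single spaces between glyphs; the `metrics` helper is a hypothetical name, not part of semantic_codec.py.

```python
def metrics(original: str, glyphs: list[str]) -> dict:
    """Word-level compression ratio and character savings for one encoding."""
    words = original.split()
    encoded = " ".join(glyphs)  # glyphs joined by single spaces
    return {
        "words": len(words),
        "glyphs": len(glyphs),
        "ratio": len(words) / len(glyphs),
        "chars_before": len(original),
        "chars_after": len(encoded),
        "saved_pct": 100 * (1 - len(encoded) / len(original)),
    }


text = ("For the claiming of the land by the living man "
        "with the lawful standing of the sovereign authority")
m = metrics(text, "CLAI LAND LIV MAN LAW STAN SOVE AUTH".split())
print(f"Words: {m['words']} → {m['glyphs']} glyphs ({m['ratio']:.1f}x compression)")
print(f"Chars: {m['chars_before']} → {m['chars_after']} standard ({m['saved_pct']:.1f}% saved)")
```

Under those assumptions the numbers agree with the transcript: 18 words to 8 glyphs, and 98 characters to 36.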

Court Order — Legal Language Compression

Legal text: nearly half the words (11 of 24) are structural
$ python3 python/src/semantic_codec.py encode "The court hereby orders that you \
    shall forthwith pay the sum of five thousand dollars to the plaintiff \
    as damages for breach of contract"

STANDARD GLYPHS:
  COUR HERE ORDE FORT PAY SUM FIVE THOU DOLL PLAI DAMA BREA CONT

METRICS:
  Words: 24 → 13 glyphs (1.8x compression)
  Chars: 136 → 62 standard (54.4% saved)
  Chars: 136 → 41 compact (69.9% saved)

  Dropped (11): the, that, you, shall, the, of, to, the, as, for, of

Decode — Glyphs Back to English

Bidirectional: glyphs decode to root English words
$ python3 python/src/semantic_codec.py decode "CLAI LAND LIV MAN LAW STAN SOVE AUTH"

GLYPHS:
  CLAI LAND LIV MAN LAW STAN SOVE AUTH

ENGLISH:
  claim land living man lawful standing sovereign authority

Each glyph maps to the first word in its root family.
Grammar is not restored — you get the semantic skeleton.
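
Decoding as described is a table lookup that returns the first word of each root family. The sketch below illustrates that rule only. The `ROOT_FAMILIES` table is hypothetical: the first entry of each family is taken from the decode output above, the other members are invented for illustration, and passing unknown glyphs through in lowercase is an assumption.

```python
# Hypothetical root families. Only the first word of each list is confirmed
# by the decode transcript; the remaining members are illustrative.
ROOT_FAMILIES = {
    "CLAI": ["claim", "claiming", "claimant"],
    "LAND": ["land"],
    "LIV": ["living", "live"],
    "MAN": ["man"],
    "LAW": ["lawful", "law"],
    "STAN": ["standing", "stand"],
    "SOVE": ["sovereign", "sovereignty"],
    "AUTH": ["authority", "authorize"],
}


def decode(glyphs: str) -> str:
    """Map each glyph to the first word in its root family."""
    words = []
    for glyph in glyphs.split():
        family = ROOT_FAMILIES.get(glyph.upper())
        # Assumption: unknown glyphs pass through in lowercase.
        words.append(family[0] if family else glyph.lower())
    return " ".join(words)


print(decode("CLAI LAND LIV MAN LAW STAN SOVE AUTH"))
```

Because every member of a family encodes to the same glyph, the round trip is lossy: "claiming" encodes to CLAI and decodes to "claim".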

Interactive REPL

Explore interactively
$ python3 python/src/semantic_codec.py

  SEMANTIC CODEC — interactive mode
  Commands: encode <text>, decode <glyphs>, analyze <text>, quit
  Dictionary: 659 glyphs from 838 words

  codec> encode The bank charged interest on my mortgage
  STANDARD: BANK CHAR INTE MORT
  Words: 7 → 4 glyphs (1.8x compression)

  codec> decode BANK CHAR INTE MORT
  ENGLISH: bank charge interest mortgage

  codec> encode We the people of the United States
  STANDARD: PERS STAT
  Words: 7 → 2 glyphs (3.5x compression)
  "We" and "people" both map to PERS; "the", "of", "the", and "United" are dropped. Two concepts: person + state.

  codec> quit
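
A command loop like the one in this transcript can be sketched in a few lines. This illustrates the `<command> <argument>` pattern only; `dispatch`, `repl`, and the handler table are hypothetical names, and the real interactive mode may be structured differently.

```python
from typing import Callable, Dict, Optional


def dispatch(line: str, handlers: Dict[str, Callable[[str], str]]) -> Optional[str]:
    """Run one REPL line. Returns the reply text, or None when the user quits."""
    command, _, argument = line.strip().partition(" ")
    if command == "quit":
        return None
    handler = handlers.get(command)
    if handler is None:
        return f"unknown command: {command}"
    return handler(argument.strip())


def repl(handlers: Dict[str, Callable[[str], str]]) -> None:
    """Read lines until 'quit' or end of input, printing each reply."""
    while True:
        try:
            line = input("  codec> ")
        except EOFError:
            break
        reply = dispatch(line, handlers)
        if reply is None:
            break
        print(f"  {reply}")
```

To try it, pass a table of handlers, for example `repl({"encode": my_encode, "decode": my_decode})`, where both functions take a string and return a string.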

Node.js — basis_rewriter.mjs

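Rewrites text for maximum basis_720 alignment. Instead of stripping words down to glyphs, it substitutes: common words are replaced with their basis equivalents, and words with no equivalent ("should" and "back" in the example below) are dropped.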

CLI Rewrite

python/tools/basis_rewriter.mjs
$ node python/tools/basis_rewriter.mjs "I think we should talk to the judge \
    about getting our money back from the bank because they broke the contract"

ORIGINAL:
  I think we should talk to the judge about getting our money
  back from the bank because they broke the contract

REWRITTEN:
  man adjudicating person communicating to the judge about acquiring
  our currency from the bank therefore person breaching the contract

COVERAGE:
  Before:  10/21 words in basis (47.6%)
  After:   19/19 words in basis (100.0%)
  Change:  +52.4 percentage points

SUBSTITUTIONS:
  I      → man
  think  → adjudicating
  we     → person
  talk   → communicating
  getting→ acquiring
  money  → currency
  because→ therefore
  they   → person
  broke  → breaching
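
The COVERAGE figures are the share of words found in the basis. The sketch below recomputes them in Python for illustration (the rewriter itself is a Node.js tool). It assumes whitespace tokenization and uses the words of the rewritten sentence as a stand-in basis subset, which is justified only by the tool's own report that all 19 of them are in the basis. The full basis_720 list is not reproduced here.

```python
ORIGINAL = ("I think we should talk to the judge about getting our money "
            "back from the bank because they broke the contract")
REWRITTEN = ("man adjudicating person communicating to the judge about acquiring "
             "our currency from the bank therefore person breaching the contract")

# Assumption: a stand-in for the real basis, built from the rewritten text
# (the tool reports 19/19 of these words as in-basis).
BASIS_SUBSET = set(REWRITTEN.split())


def coverage(text: str, basis: set) -> tuple:
    """Return (hits, total, percent) of whitespace-separated words in the basis."""
    words = text.lower().split()
    if not words:
        return 0, 0, 0.0
    hits = sum(word in basis for word in words)
    return hits, len(words), 100 * hits / len(words)


before = coverage(ORIGINAL, BASIS_SUBSET)
after = coverage(REWRITTEN, BASIS_SUBSET)
print(f"Before:  {before[0]}/{before[1]} words in basis ({before[2]:.1f}%)")
print(f"After:   {after[0]}/{after[1]} words in basis ({after[2]:.1f}%)")
print(f"Change:  {after[2] - before[2]:+.1f} percentage points")
```

The "After" figure is 100% by construction here, so only the "Before" line (10/21) is an independent check against the transcript.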

Pipe from stdin

Works with pipes for scripting
$ echo "We need to make a deal with the company" | node python/tools/basis_rewriter.mjs

  person requiring granting a negotiating with the corporation
  Before: 40.0%  →  After: 100.0%  (+60.0pp)

Compression Comparison — Same Meaning, Different Density

Text	Words	Glyphs	Compression	Char savings
Correct parse-syntax	18	8	2.2x	76.5% (compact)
Court order	24	13	1.8x	69.9% (compact)
Conversational	24	12	2.0x	61.2% (compact)
"We the people of the United States"	7	2	3.5x	88% (compact)

Why "We The People" Compresses to 2 Glyphs

PERS STAT — person, state. That's the entire semantic content.

"We" → person (pronoun, no fact). "The" → dropped (article, zero info). "People" → person (synonym). "Of" → dropped (preposition). "The" → dropped. "United" → no basis root. "States" → state.

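5 of 7 words add no new concept: "the", "of", and the second "the" are dropped as filler, "United" has no basis root, and "We" collapses into the same glyph as "People". The preamble to the Constitution opens with two concepts in seven words.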
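
The word-by-word breakdown above amounts to: drop function words, skip words with no basis root, and emit each root only once. The sketch below shows that order-preserving de-duplication. The `ROOTS` and `DROPPED` tables are toy assumptions covering this phrase only, and `concepts` is a hypothetical helper name.

```python
# Toy tables (assumptions) covering only this phrase.
ROOTS = {"we": "person", "people": "person", "states": "state"}
DROPPED = {"the", "of"}


def concepts(text: str) -> list:
    """Return the distinct root concepts in order of first appearance."""
    seen = []
    for word in text.lower().split():
        if word in DROPPED:
            continue  # article or preposition
        root = ROOTS.get(word)
        if root is None:
            continue  # no basis root, e.g. "united"
        if root not in seen:
            seen.append(root)  # "we" and "people" collapse into one entry
    return seen


print(concepts("We the people of the United States"))
```

The result is `['person', 'state']`, which corresponds to the two glyphs PERS STAT.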

Alphabet — Why Roman Wins

659 root concepts need encoding. Counterintuitively, Roman characters beat CJK and emoji for LLM compute.

System	Width	LLM tokens	Example	Use case
Roman 2-char	2	1 each	DL JL KH KL	Max compute efficiency
Roman 3-4 char	3-4	1 each	CLAI LAND MAN	Human readable
CJK 1-char	1	2-3 each	法地人	Looks dense, costs more
Emoji	1	1-2 each	⚖️ 🏠 👤	Fun, ambiguous

BPE tokenizers are trained mostly on Latin-script text, so a short Roman glyph like MAN usually costs 1 token. A CJK character like 法 can cost 2-3 tokens despite being visually smaller. Exact counts depend on the tokenizer. Stay Roman.
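
One way to see why this happens without loading a tokenizer: byte-level BPE tokenizers (GPT-2's, for example) start from UTF-8 bytes and merge upward, so a string can never cost more tokens than it has bytes, and rarely seen characters stay close to that ceiling. The sketch below prints that upper bound only. It is not a token count, which depends on the learned vocabulary of the specific tokenizer.

```python
def utf8_bytes(text: str) -> int:
    """UTF-8 byte length: an upper bound on byte-level BPE token cost."""
    return len(text.encode("utf-8"))


for glyph in ["DL", "MAN", "CLAI", "法", "👤"]:
    print(f"{glyph!r}: {len(glyph)} char(s), {utf8_bytes(glyph)} UTF-8 byte(s)")
```

A single CJK character occupies 3 bytes and this emoji 4, while each Roman letter occupies 1. A 3-letter glyph and one CJK character therefore start from the same byte budget, but the Roman string is far more likely to have been merged into a single token during training on mostly Latin-script text.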

Live Tools

Tool	What it does
Semantic Codec	Bidirectional encode/decode with meaning-space visualization
Basis Filter	Evaluate text against 720-word basis, slider from 6→720, rewriter
Information Lens	Surprisal heatmap, L0→L5 compression, glyph dictionary
Embedding Space	Texts as 7D coordinates with transformation vectors
Word Decomposer	Type any word, get prefix/root/suffix decomposition

morpheme.page — semantic compression tools