The Shape of Unicode Corruption Is a Clue

If you are debugging broken accented characters, do not stop at “Unicode is broken.” The shape of the corruption often tells you which part of the pipeline to inspect next.

When text with accented characters breaks on a page, mojibake is often the first word people reach for.

Sometimes that is right. Mojibake is what you get when bytes are decoded with the wrong character encoding: é turning into Ã©, or a curly apostrophe (’) turning into â€™.

But not every Unicode mangling problem is mojibake, and the difference is useful when you are trying to work out where the bug lives.

The example that prompted this for me was a travel page where only one section looked wrong. The rest of the page handled Unicode fine, but that one section contained strings like:

R3mai-part
l1ngos
Gell3rt Hill
F3ny Street Market
Sz3ll K1lm1n Square

That does not look like a normal encoding failure. It looks more like accidental leetspeak-style substitution.

Here is the compact diagnostic version I now find useful:

Mojibake = broken character encoding/decoding.

Transliteration / ASCII folding = intentional lossy conversion to ASCII.

Bad substitution = characters replaced by something unrelated, such as digits or symbols.

Here are three tiny examples that show both the bad transformation and the correct handling.

Mojibake:

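s = "Gellért"
# Wrong: UTF-8 bytes decoded as Latin-1 produce mojibake.
bad = s.encode("utf-8").decode("latin-1")
# Reversing the same mistake recovers the original text.
fixed = bad.encode("latin-1").decode("utf-8")
print("bad  :", bad)
print("fixed:", fixed)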

Output:

bad  : GellÃ©rt
fixed: Gellért

ASCII folding:

ASCII folding or transliteration is not always wrong. It becomes a bug when the folded form is shown as display text instead of being kept as a search key, slug, or fallback field.

import unicodedata

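s = "Római-part | lángos | Fény | Széll Kálmán"
# Decompose each accented letter into base letter + combining mark (NFKD),
# then drop the combining marks, leaving only the base letters.
search_key = "".join(
    c for c in unicodedata.normalize("NFKD", s)
    if not unicodedata.combining(c)
)
# Keep the original for display; use the folded form only as a key.
print("display:", s)
print("search :", search_key)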

Output:

display: Római-part | lángos | Fény | Széll Kálmán
search : Romai-part | langos | Feny | Szell Kalman

Bad substitution:

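s = "Római-part | lángos | Fény | Széll Kálmán"
# A hand-built replacement map that swaps accented letters for digits.
bad = s.translate(str.maketrans({"ó": "3", "é": "3", "á": "1"}))
# The map is lossy (ó and é both become 3), so it cannot be reversed.
# The fix is to not apply it and keep the original text.
fixed = s
print("bad  :", bad)
print("fixed:", fixed)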

Output:

bad  : R3mai-part | l1ngos | F3ny | Sz3ll K1lm1n
fixed: Római-part | lángos | Fény | Széll Kálmán

Those examples are only illustrations, not proof of what happened in the original case. But they show why Sz3ll K1lm1n should push you in a different direction from GellÃ©rt.

What made this more interesting is that a later live inspection made the situation messier, not cleaner. Alongside the original examples, the page also contained strings like R33mai-part, l33ngos, 33buda, Fő t33r, G##l Baba utca, Sz\u0000chenyi, and Andr\u0026aacute;ssy.

That is a stronger clue than the original neat leetspeak pattern on its own. Once you have digit substitution, hash substitution, entity leakage, and control-character-style corruption in one artifact, you should be less confident that you are looking at one tidy encoding bug.
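One way to tell these residue types apart is to scan for each of them separately. The sketch below is illustrative only: leak_report and both regular expressions are hypothetical names, not anything taken from the page's pipeline. It assumes the escape residue could appear either as literal backslash-u text or as a real control character, since the strings alone do not say which.

```python
import re
import unicodedata

# Named or numeric HTML character references, e.g. &aacute; or &#225;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
# Literal backslash-u escape text that was never decoded, e.g. \u0026
ESCAPE_RE = re.compile(r"\\u[0-9A-Fa-f]{4}")


def leak_report(text):
    """Hypothetical helper: list entity, escape, and control-character residue."""
    return {
        "entities": ENTITY_RE.findall(text),
        "escapes": ESCAPE_RE.findall(text),
        # Control characters other than ordinary whitespace controls
        "controls": [
            f"U+{ord(c):04X}"
            for c in text
            if unicodedata.category(c) == "Cc" and c not in "\t\n\r"
        ],
    }


print(leak_report("Andr&aacute;ssy"))
print(leak_report(r"Sz\u0000chenyi"))   # literal escape text
print(leak_report("Sz\x00chenyi"))      # an actual NUL character
```

A non-empty report does not say where the residue came from. It only tells you that an escaping or unescaping step was skipped or applied twice somewhere.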

If the corruption looks like decode garbage, start with encoding and decoding boundaries. Check UTF-8 versus Latin-1 assumptions, HTTP headers, database column settings, file encodings, and any handoff points between systems.
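A cheap first test for decode garbage is to try reversing the most common mistake. This is a minimal sketch, assuming the text was UTF-8 that got decoded as Windows-1252 or Latin-1; try_undo_mojibake is a hypothetical helper name, and the codec list is an assumption you would adjust for your own system.

```python
def try_undo_mojibake(text, wrong_codecs=("cp1252", "latin-1")):
    """Return (codec, repaired) if undoing a wrong decode changes the text,
    otherwise None. Hypothetical helper for triage, not a general repair tool."""
    for codec in wrong_codecs:
        try:
            # Re-create the original bytes, then decode them as UTF-8.
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the text
        if repaired != text:
            return codec, repaired
    return None


for sample in ["GellÃ©rt", "â€™", "Gellért", "Sz3ll K1lm1n"]:
    print(repr(sample), "->", try_undo_mojibake(sample))
```

Correctly decoded accented text and digit-substituted text both come back as None, which is the point: if the round trip does not repair anything, the bug is probably not at a UTF-8 decoding boundary.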

If the corruption looks like clean accent stripping, look for explicit normalization, slugification, ASCII fallback, or a text sanitisation step that was applied too aggressively.
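If you have a known-good version of a string, you can confirm the accent-stripping hypothesis directly. The following is a sketch under that assumption; ascii_fold and looks_folded are hypothetical helper names. Letters with no Unicode decomposition, such as ø, pass through this fold unchanged, so it will not match every folding library's output.

```python
import unicodedata


def ascii_fold(text):
    """Strip combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def looks_folded(expected, observed):
    """True if observed is exactly the accent-stripped form of expected."""
    return observed != expected and observed == ascii_fold(expected)


print(looks_folded("Széll Kálmán", "Szell Kalman"))   # clean accent stripping
print(looks_folded("Széll Kálmán", "Sz3ll K1lm1n"))   # something else
```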

If it looks like R3mai-part and Sz3ll K1lm1n, look for a replacement map, a bad preprocessing step, or section-specific upstream content handling.
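To find how far this pattern spreads, it can help to flag every word with digits wedged between letters. This is a rough heuristic sketch, not a reliable detector: SUBSTITUTION_RE and suspicious_tokens are hypothetical names, and legitimate tokens such as product codes would also match.

```python
import re

# One or more digits with a letter directly before and directly after.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or _).
SUBSTITUTION_RE = re.compile(r"(?<=[^\W\d_])\d+(?=[^\W\d_])")


def suspicious_tokens(text):
    """Return whitespace-separated tokens that contain letter-digit-letter runs."""
    return [token for token in text.split() if SUBSTITUTION_RE.search(token)]


print(suspicious_tokens("Sz3ll K1lm1n Square"))
print(suspicious_tokens("Route 66 to Gellért Hill"))
```

Running this over each section of a page separately shows whether the substitution is confined to one block of content, which is itself a hint about where the replacement step lives.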

If several corruption styles show up together, broaden the search. Mixed patterns suggest multiple transforms, multiple data sources, or a broken pipeline boundary where differently processed text got merged back together.

So the useful rule is simple: do not just notice that text is corrupted. Classify the pattern first. Then combine that with the scope of the problem. If several corruption styles show up at once, suspect a pipeline problem rather than one simple encoding mistake.

That is a small distinction, but it can save time. The shape of Unicode corruption is often diagnostic, and so is a messy mixture of shapes.

If I saw this in a real system, I would check four things next: the raw stored text, any normalization or sanitisation step, any entity-encoding or decoding boundary, and any merge point where content from different sources gets combined.
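For the first of those checks, it helps to look at code points and bytes rather than at rendered text. A minimal sketch, assuming you have already read the stored value into a Python string; describe is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """Return one line per character: code point, UTF-8 bytes, Unicode name."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
        utf8 = ch.encode("utf-8").hex(" ")
        lines.append(f"U+{ord(ch):04X}  {utf8:<11}  {name}")
    return lines


# Precomposed é versus e followed by a combining acute accent
for line in describe("é") + describe("e\u0301"):
    print(line)
```

Two strings that render identically can differ here, and a stored value that already contains a digit or an entity tells you the damage happened before the text reached the database.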
