Intro
So, when I was a little kid, my parents would take me with them when they’d go visit someone, somewhere. Of course, being among adults was not fun. Lucky me, wherever I was taken, there was a good chance that some Tintin comic books were available.
My favorite character was always (and probably still is) Captain Haddock. Specifically, when he’d be shouting out stuff like:
Billions and billions of blue blistering barnacles
Dictatorial duck-billed diplodocus
I remember spending entire hours trying to figure out what those words meant. (And mind you, I was born in the Encyclopedia era—25 volumes, sequentially ordered monstrosities.)
Then One Day…
Oh right! One day, my wife just casually mentioned she used to love The Adventures of Tintin as a kid!
A billion-watt brainwave hit me so powerfully that even the nuclear station called to ask if they could borrow some power. I gently put my gin & tonic down on the table, excused myself, and went to the computer.
The Dataset
Copyright issues? Maybe. For sure! Well, as long as I publish only the process, I should be fine. After all, I am not making money from this. This is now a personal issue. And it’s an issue that has been bothering me for almost 3 decades. Yes, that’s 30 years! Doddering donkeys!
The Aim?
To ask AI what Captain Haddock meant by those utterances! Of course!
How It Went - EDA - Exploratory Data Analysis
Of course, I could just go to any modern LLM and ask these questions. But that isn’t fun, is it? Let’s do it the old-school way and provide some embeddings to an open-source LLM. This meant digging into storage and finding those comics. Spent days scanning. Yep folks, this is how it goes sometimes.
PIL, scikit-image, scikit-learn. Hold on! Wait!
- Each comic starts with a full-screen page and a title.
- The remaining pages have comic slides.
- Split each slide into an image?
- Each comic gets its own embedding?
- Hold on. This needs more power.
Na na na. Let’s do it differently:
- Page 1: Get the title. Create a new document with that title.
- Loop over each slide and capture 2 things:
  - How many text bubbles there are in each slide.
  - The text from the bubbles, plus a description of what the image conveys. (Thank you, Python!)
- Insert the captured text lists and descriptions of the images into a paragraph for context.
- Split all into chunks.
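
The steps above can be sketched roughly like this. It is a minimal sketch, not the actual pipeline: the `Slide` class, `build_document` and `split_into_chunks` are hypothetical names, and the chunk size and overlap are arbitrary values picked for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:
    """One comic slide: the text of its bubbles plus a description of the image."""
    bubbles: list[str] = field(default_factory=list)
    description: str = ""


def build_document(title: str, slides: list[Slide]) -> str:
    """Turn a title and its slides into one block of text with context."""
    paragraphs = [f"Title: {title}"]
    for number, slide in enumerate(slides, start=1):
        spoken = " / ".join(slide.bubbles) if slide.bubbles else "(no text)"
        paragraphs.append(
            f"Slide {number} has {len(slide.bubbles)} text bubble(s). "
            f"Scene: {slide.description} Text: {spoken}"
        )
    return "\n\n".join(paragraphs)


def split_into_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks
```

The overlap is there so that a shout cut in half at a chunk boundary still shows up whole in the neighbouring chunk.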
The Embeddings
- MongoDB
- Llama 3
- WSL (Windows Subsystem for Linux)
- PDFs of AI-interpreted scans.
- LangChain
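
How pieces like these usually fit together, in spirit: every chunk becomes a vector, the vectors get stored, and a question is answered by first pulling out the closest chunks. The sketch below assumes nothing about the real setup. `toy_embed` is a word-count stand-in for a real embedding model, a plain Python list stands in for the database, and every function name is hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_store(chunks: list[str]) -> list[dict]:
    """Stand-in for the database: one record per chunk with its vector."""
    return [{"text": chunk, "vector": toy_embed(chunk)} for chunk in chunks]


def retrieve(store: list[dict], question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(store, key=lambda rec: cosine(query, rec["vector"]), reverse=True)
    return [rec["text"] for rec in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text handed to the LLM: retrieved context, then the question."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Use only this context:\n{joined}\n\nQuestion: {question}"
```

Swapping `toy_embed` for a real embedding model and the list for a database collection changes the plumbing, not the idea.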
OCR vs LLMs
Did several tests, and the conclusion seems to point in one direction only: OCR tech is obsolete. Well, hold on. It’s not. In my case, that tech seems to be obsolete because if you feed a slide to any modern LLM, the model will be able to extract the text without breaking a sweat.
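
For illustration, handing a slide to a model could look something like the sketch below. The model call itself is deliberately left out: `ask_model` is a hypothetical callable (image bytes plus prompt in, text out) that you would wire to whatever multimodal model you use, and the JSON shape in the prompt is my assumption, not a standard.

```python
import json
from typing import Callable

PROMPT = (
    "Read this comic slide. Reply with JSON only, shaped like "
    '{"bubbles": ["text of each speech bubble"], "description": "what the image shows"}.'
)


def read_slide(image_bytes: bytes, ask_model: Callable[[bytes, str], str]) -> dict:
    """Ask a multimodal model for the bubble texts and a scene description."""
    reply = ask_model(image_bytes, PROMPT)
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # The model answered in prose; keep the raw reply instead of crashing.
        return {"bubbles": [], "bubble_count": 0, "description": reply.strip()}
    raw_bubbles = data.get("bubbles")
    if not isinstance(raw_bubbles, list):
        raw_bubbles = []
    # Drop empty entries and surrounding whitespace.
    bubbles = [str(b).strip() for b in raw_bubbles if str(b).strip()]
    return {
        "bubbles": bubbles,
        "bubble_count": len(bubbles),
        "description": str(data.get("description", "")).strip(),
    }
```

Keeping the model behind a plain callable also makes it easy to test the parsing with a fake model before burning any GPU time.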
Ready? Go!
No wait! We have problems. BIG problems. And the problems come in the shape of language! Tintin was originally written in French! And not only that: according to this article, the original translators had to fit their translations into the original speech bubbles.
Decision Time!
Either I proceed in English, or I return to EDA with the French originals. The second option will take a long time because most likely I will have to deal with the complexities of metaphorical meanings.
Well, sometimes, when one approaches a project or a binary decision, the best way to do it is by flipping a coin:
- Heads: English
- Tails: French with the metaphorical goulash!
Heads it was! English then! We are back on track!
Ready? (Again?) Go!
“Billions of blue blistering barnacles!” is a comedic exaggeration used to express extreme surprise, shock, or frustration. It is often used in situations where he finds himself in a difficult or absurd situation.
As it turns out, validation included, the definition seems to be true and solid. In more than one example, when Captain Haddock finds himself in a difficult situation, this is his go-to expression.
To Do!
- Take the other route, the French - Metaphor way. That will give much more in-depth meaning to these things (if).
- Pass the outputs through a filtering LLM to ensure outputs are child-safe.
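
The filtering item could be wired up along these lines. This is only a sketch: `is_child_safe` is a hypothetical callable that would wrap the filtering LLM, and `keyword_checker` is a crude stand-in so the code runs on its own.

```python
from typing import Callable

FALLBACK = "Sorry, I can't share that one."


def filter_answers(
    answers: list[str],
    is_child_safe: Callable[[str], bool],
    fallback: str = FALLBACK,
) -> list[str]:
    """Replace every answer the checker rejects with a fallback message."""
    return [answer if is_child_safe(answer) else fallback for answer in answers]


def keyword_checker(blocked: set[str]) -> Callable[[str], bool]:
    """Build a stand-in checker: an answer is unsafe if it contains a blocked word."""
    blocked_lower = {word.lower() for word in blocked}

    def check(answer: str) -> bool:
        # Compare whole words, ignoring case and trailing punctuation.
        words = {w.strip(".,!?;:\"'").lower() for w in answer.split()}
        return words.isdisjoint(blocked_lower)

    return check
```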
