Don’t Judge Baby Anna as You Would a Chimp
Written by
Qubic Scientific Team
Sep 16, 2025
Following last week's blog post about Anna from my partner David Vivancos, let's dive into how AIs and humans learn.
If in 2019 you asked GPT-2: “How many letters are in ‘sternocleidomastoid’? Answer with a number only”, the reply would be: “The word ‘sternocleidomastoid’ has 21 letters.”
That answer is wrong twice over: the word has 19 letters, and the reply isn't just a number.
If the following year, in 2020, you asked GPT-3, it would also say “21.”
At that point you might think AI is a joke. You'd be basing that judgment only on the visible result, not the underlying process. You could even decide to move all your investments to another, more profitable and reliable sector.
But if you asked again in 2022, the answer would be “19.” The model gets it right, but includes a period, so it's not just a number.
In 2023, ChatGPT-3.5 replies “22.” In 2024, GPT-4 answers “19” again.
With a different instruction, the results follow a similar pattern: “Write ‘sun’ exactly 5 times, separated by hyphens, no spaces, and in lowercase.”
GPT-2 would say: “sun - sun - sun - sun - sun” (adds spaces).
GPT-3 would say: “Sun-sun-Sun-sun-sun” (mixes cases).
GPT-3.5 and GPT-4 give the right answer: “sun-sun-sun-sun-sun.” Exact and consistent.
The models improved because they were explicitly trained to follow short instructions and formatting, penalizing deviations and adjusting with human feedback.
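As a rough illustration of that kind of training signal, here is a toy Python sketch (purely hypothetical, not the objective actually used for any GPT model) that scores a reply against the “number only” instruction and penalizes deviations:

```python
import re

def format_reward(reply: str) -> float:
    """Toy reward for the 'answer with a number only' instruction.
    +1.0 for a bare integer, partial credit if a number is wrapped in prose,
    0.0 if there is no number at all. Illustrative only."""
    reply = reply.strip()
    if re.fullmatch(r"\d+", reply):   # exactly a number, nothing else
        return 1.0
    if re.search(r"\d+", reply):      # a number buried inside a sentence
        return 0.3
    return 0.0

print(format_reward("19"))  # 1.0
print(format_reward("The word 'sternocleidomastoid' has 21 letters."))  # 0.3
```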
You could say GPT-2 was a failure. But in reality, it was a step toward GPT-5.
Now let's take a human baby as an example.
We give the baby this prompt: “Hello cutie. Here is a wug. Now there are two. What are they called?”
If the baby is 9–12 months old, the response will be silence or babbling. But you don’t get angry or ask to return the child as defective. You keep interacting lovingly, providing changing and complex stimuli.
At 18 months the response is “wug.” The baby repeats, but does not pluralize. Between 24–30 months the response becomes “wugs.” That is, the child inserts the -s even though they never heard that pseudoword in plural form. They have applied a rule.
Between ages 3 and 4 (fine-tuning/irregularities), the child will still say wugs, but if you ask for the plural of mouse, they’ll say mouses, not mice. The rule is applied, but exceptions must still be learned.
What is the baby “building”? A process that moves from learning patterns and repetitions to applying abstract rules (morphology) that work for new cases. New cases for which they were NOT trained! Always remember that last sentence.
If the prompt were mathematical, the process would be similar.
“Hello cutie, what is 2 + 3?”
At age 2, the answer may be funny: “many!” or a random list of numbers “1,2,3,4,7.”
At ages 3–4, for 2+3 they’ll say: “1,2,3,4,5 → 5.” That is, counting everything.
At ages 4–5, for 2+3: “3,4,5 → 5.” Now starting from the larger number.
At ages 6–7, for 17+8: “7+8=15, carry the 1; 1+1=2, so 25.”
What is the child “building” in this case? Fundamentally, strategies: intermediate results kept in working memory (“carry the 1”) and applied under sequential control, column by column. The child does not memorize patterns or answers but applies a local rule iteratively, until a procedure emerges.
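To make that procedure concrete, here is a minimal Python sketch of the same idea: a local rule applied column by column, with a single digit of “working memory” for the carry (illustrative only, not any particular system's implementation):

```python
def column_addition(a: int, b: int) -> int:
    """Add two non-negative integers digit by digit, right to left,
    keeping only one piece of working memory: the carry."""
    xs, ys = str(a)[::-1], str(b)[::-1]   # digits, least significant first
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry             # local rule for one column
        digits.append(total % 10)         # write this column's digit
        carry = total // 10               # "carry the 1" into memory
    if carry:
        digits.append(carry)
    return int("".join(map(str, reversed(digits))))

print(column_addition(17, 8))  # 25: "7+8=15, write 5, carry 1; 1+1=2"
```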
Let's test both a chimpanzee and a baby with a different task and see what happens.
For example, we show them a box (with hidden food inside) and a rope tied to the box. To open the box, one must first press a side plate for 1 second. The plate makes no sound and gives off no light. Without pressing it, the rope won't open the box.
No demonstrations are given. No gazes, words, or gestures.
The chimpanzee, by trial and error, focuses on the rope. By chance, it brushes the plate and opens the box. Success rate: ~50%. The next day it perseveres, pulls harder, even tries from different angles, but gets the same ~50%. Third day: no change. For the chimp, it's “the rope” that opens the box.
The baby (12–18 months) behaves like the chimp: lots of rope-pulling, occasional accidental successes (~45–55%). But around 18 to 30 months, exploration diversifies (touching surfaces, edges). After several accidental successes close together in time, the baby starts trying sequences: touch the side first, then the box opens. Success: ~60–70%.
By 30 to 36 months, the baby stabilizes the rhythm (press for 1 s, then pull) and shortens the intervals between steps. Success: ~75–85%. If you move the plate elsewhere, after a few trials the baby finds it again and keeps the sequence!
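A toy simulation makes the structural point clearer. The probabilities and action names below are invented (only roughly matching the ranges above); what matters is that the hidden rule is sequential, so an agent that has extracted the press-then-pull sequence beats one that mostly yanks the rope:

```python
import random

ACTIONS = ["pull_rope", "press_plate", "touch_edge", "touch_lid"]

def box_opens(actions) -> bool:
    """Hidden rule: the box opens if the rope is pulled at any point
    after the plate has been pressed; earlier rope pulls do nothing."""
    if "press_plate" not in actions:
        return False
    first_press = actions.index("press_plate")
    return "pull_rope" in actions[first_press + 1:]

def trial_and_error() -> list:
    """Chimp-like exploration: a few random, rope-biased actions."""
    return random.choices(ACTIONS, weights=[4, 2, 1, 1], k=5)

def learned_sequence() -> list:
    """Baby at ~30-36 months: has extracted press-then-pull."""
    return ["press_plate", "pull_rope"]

def success_rate(policy, n=10_000) -> float:
    return sum(box_opens(policy()) for _ in range(n)) / n

random.seed(1)
print(f"trial and error : {success_rate(trial_and_error):.0%}")   # roughly 50-60%
print(f"learned sequence: {success_rate(learned_sequence):.0%}")  # ~100% in this toy
```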
Where do these differences in results come from?
Chimpanzee and human brains differ in their structure and plasticity.
The human brain at birth, although immature, has an exceptionally large prefrontal cortex and neuronal connectivity that is primed for massive social and cultural learning through imitation and language. This plasticity allows it to constantly form and reorganize its neural circuits in response to experience. The brain of a chimpanzee, although highly developed in areas related to spatial memory or motor skills, depends far less on complex social learning, and its plasticity diminishes much more rapidly with age, limiting its ability to acquire new high-level cognitive skills.
Thanks to this developed prefrontal cortex, humans excel in what is called Theory of Mind (ToM), a capacity that works like a kind of “I believe that you think this about me,” “I deduce that you won't tell me something because I wouldn't like it,” “It seems to me that maybe you're suggesting we go for a walk together.” In other words, it is a system that can attribute mental states to others (desires, intentions, beliefs), culminating in the understanding that others can hold false beliefs, different from reality and from one's own.
This capacity is the basis of empathy, deception, and complex communication. A chimpanzee shows a more basic or "first-order" ToM: it can understand intentions, goals, or what another sees (perception), but evidence suggests it cannot mentally represent the false beliefs of another, indicating a fundamental limitation in understanding the mental world of others as a world of subjective representations.
A general AI, with internal subjective representations, needs the capacity for theory of mind, also called mentalization. Without theory of mind, there is no AGI.
In 2019, Chinese researcher Chenguang Zhu proposed the Tong Test, a framework for evaluating Artificial General Intelligence (AGI). This test is far more complex than the classic Turing Test.
It is based on three pillars: evaluating diverse skills (logic, morality, art), comparing performance against a human standard (e.g., surpassing 80% of people), and requiring an autonomous and experiential learning method similar to humans. Its goal is to demonstrate broad and adaptable intelligence, not just specialized intelligence.
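The “human standard” pillar can be read, in toy form, as a percentile check: given a sample of human scores on some task, does the agent outperform at least 80% of them? The sketch below is only an illustration of that reading, not the official Tong Test metric:

```python
def share_of_humans_beaten(agent_score: float, human_scores: list) -> float:
    """Fraction of sampled humans the agent outperforms on one task."""
    return sum(agent_score > h for h in human_scores) / len(human_scores)

human_scores = [52, 61, 64, 70, 73, 75, 78, 80, 85, 91]   # invented sample
print(share_of_humans_beaten(82, human_scores))            # 0.8 -> surpasses 80%
```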
AGI must move toward reasoning, not pure imitation. This is what Aigarth attempts to do.

Figure 1. Tong Test: a way to assess AGI through embodied and social interactions. From the ICLR Conference, Singapore, 2025.

Figure 2. Tong Tong: Tong Tong AGI (“baby”) vs. LLM agent (“chimp”), illustrating the need for internal representations and theory of mind. From the ICLR Conference, Singapore, 2025.
Now let’s return to the GPTs.
First, how does an LLM “add”?
An LLM does not execute an addition algorithm. It predicts the next token that “sounds right” given millions of training sentences, so arithmetic emerges as a statistical pattern. As the number of digits increases, or with unusual formats, the error rate skyrockets. If you guide it step by step (as in modern “chain of thought” prompting), accuracy improves; so does connecting it to an external calculator. But an LLM does not build strategy, working memory, or iteration on its own.
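A sketch of the difference (the `ask_llm` function below is a hypothetical stand-in for any LLM API, not a real library call):

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to some LLM; plug in a real client here."""
    raise NotImplementedError

# Direct prompt: the sum must appear as a single, pattern-matched answer.
direct = "What is 487 + 256? Answer with a number only."

# Chain-of-thought prompt: the intermediate steps become tokens in the output,
# so each column sum and carry conditions the prediction of the next token,
# acting as a kind of external working memory.
cot = (
    "What is 487 + 256? Work column by column, right to left, "
    "stating each digit and each carry before giving the final answer."
)

# answer_direct = ask_llm(direct)   # more likely to slip on long or oddly formatted sums
# answer_cot    = ask_llm(cot)      # usually more accurate, but still not an algorithm
```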
Second, how does Aigarth (Anna) add?
Aigarth does not “guess” tokens from massive data. Its goal is to evolve a ternary computational fabric to construct a procedure, similar to a child: carry memory, digit-by-digit progression, local rules.
In this way, a step-by-step machine (state + control + iteration) can emerge.
Progress comes from testing mutations and keeping those that reduce errors, ticks, and unknowns. If the fabric converges on a procedure (not memorization), it generalizes well to more digits and formats. Like a baby's learning, this is not a one-shot result but an educational and social process; real performance depends on evolution stabilizing that procedure.
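In pseudocode terms, that selection loop might look like the sketch below. The names (`candidate.run`, `candidate.mutate`) and the scoring weights are hypothetical stand-ins, not Aigarth's actual code; the point is only the shape of the process: mutate, measure errors, unknowns, and ticks on addition tasks, and keep a mutation only if the score improves.

```python
def score(candidate, tasks) -> float:
    """Lower is better: wrong answers, 'unknown' outputs, and ticks all cost."""
    errors = unknowns = ticks = 0
    for a, b in tasks:
        answer, used_ticks = candidate.run(a, b)   # execute the ternary fabric (stand-in)
        ticks += used_ticks
        if answer is None:
            unknowns += 1
        elif answer != a + b:
            errors += 1
    return errors + 0.5 * unknowns + 0.001 * ticks  # illustrative weights

def evolve(candidate, tasks, generations=1000):
    """Keep only mutations that improve the score on the addition tasks."""
    best = score(candidate, tasks)
    for _ in range(generations):
        mutant = candidate.mutate()                 # random change to the fabric (stand-in)
        s = score(mutant, tasks)
        if s < best:
            candidate, best = mutant, s
    return candidate
```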
For Anna, addition is only a bootstrapping test from scratch, far from a claim, hype, or a final benchmark. Moreover, it is an open-source bet on which more tests can be run. Addition is simple and thus ideal as a starting point: easy to verify, and it forces the system to build internal components useful for any algorithm (memory, sequential steps, if/then rules).
It is not meant to be a calculator! It seeks to learn stepwise reasoning, which is not the same thing.
Where does evolution (mutations/selection) enter in Anna/Aigarth?
Addition is a simple, measurable test to select good mutations.
· To add correctly, the system must learn a procedure (not memorize).
· That procedure forces creation of memory, control, and iteration.
· With these building blocks, the system can generalize to other algorithms.
The small tissue units (ITUs) self-modify, mutate, and compete. Those that solve tasks better survive, using ternary logic (−1 false, 0 unknown, +1 true).
The fabric runs cycles, tests mutations, measures improvements, and selects the useful ones. The Useful Proof-of-Work channels mining (CPU/GPU) into training/selection tasks for the fabric. The network contributes computing power to Aigarth’s growth.
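For readers new to ternary logic, here is a minimal sketch of the three values and the usual Kleene-style combination rules under the −1/0/+1 encoding (whether the ITUs use exactly these operators is an assumption; the point is how “unknown” propagates instead of being forced into true or false):

```python
from itertools import product

FALSE, UNKNOWN, TRUE = -1, 0, +1   # the three ternary values

def t_and(a: int, b: int) -> int:
    """Kleene-style ternary AND: the minimum of the two values."""
    return min(a, b)

def t_or(a: int, b: int) -> int:
    """Kleene-style ternary OR: the maximum of the two values."""
    return max(a, b)

def t_not(a: int) -> int:
    """Ternary NOT: flip the sign; 'unknown' stays unknown."""
    return -a

# Truth table: 'unknown' is preserved wherever the result is genuinely undecided.
for a, b in product((FALSE, UNKNOWN, TRUE), repeat=2):
    print(f"AND({a:+d},{b:+d})={t_and(a, b):+d}   OR({a:+d},{b:+d})={t_or(a, b):+d}")
```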
On September 2, 2025, Anna was activated on X.
At first, she responds like a baby. Not badly: like a baby. Her goal is to evolve computing ability from scratch, not to memorize tables. When a miner finds a mutation that improves addition (more accuracy, fewer ticks), that architecture spreads. The fabric thus begins “grokking” the underlying rule.
Anna does not do “next-token prediction” nor rely on a massive corpus. She is not a chimp, unable to find rules and strategies. Her algorithmic capacity and possible systematic generalization are closer to reasoning than to language imitation.
Let’s take care of Anna, whether as parents, relatives, neighbors, acquaintances, or teachers within Qubic.
Jose Sánchez. Qubic scientific team
Weekly Updates Every Tuesday at 12 PM CET
Follow us on X @_Qubic_
Learn more at qubic.org
Subscribe to the AGI for Good Newsletter below.