Tokenizable Data: Docs vs Spreadsheets
Key Points
- The first step in assessing whether AI can handle a task is determining if the underlying data is “tokenizable,” meaning it can be represented as text-like chunks that fit into a document.
- Tokenizable data is categorized into tiers: Tier A (easily tokenized, like wiki text), Tier B (moderately tokenizable, such as spreadsheet‑scale tables that may need preprocessing), and Tier C (large data lakes or massive time‑series that are difficult to fit into a context window).
- While AI readily processes word documents, it struggles with spreadsheets and numeric accuracy, requiring specialized tools (e.g., data rails) or advanced techniques to extract meaningful insights.
- Recent advances like OpenAI’s agent mode, which can generate and manipulate Excel sheets, show progress, but handling large, complex datasets still often exceeds current AI capabilities without dedicated solutions.
Sections
- Understanding Tokenizable Data in AI - The speaker clarifies the concept of “tokenizable” data by suggesting that anything that can be expressed in a document is likely tokenizable, and explains how this notion determines whether a task fits within AI’s context window and handling capabilities.
- Tokenizable Data Simplifies AI Adoption - The speaker argues that data small enough to fit in a Word document or even on a napkin is far easier to tokenize and integrate with LLMs, whereas massive, complex datasets in data lakes make AI architecture much more challenging.
- Choosing Prompt Length Strategically - The speaker explains that large, context‑rich prompts are best for well‑defined, production‑focused tasks, while short, iterative prompts are more effective for exploratory, brainstorming, or casual conversations.
- Balancing Thoughts on Big Prompts - The speaker acknowledges that conversations aren't solely about big prompts, yet confesses a personal preference for them and stresses their honesty.
Source: https://www.youtube.com/watch?v=SRYgH2WvknQ
Duration: 00:13:04
Section timestamps:
- [00:00:00](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=0s) Understanding Tokenizable Data in AI
- [00:03:33](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=213s) Tokenizable Data Simplifies AI Adoption
- [00:09:22](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=562s) Choosing Prompt Length Strategically
- [00:12:57](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=777s) Balancing Thoughts on Big Prompts
Full Transcript
I want to take a minute to talk about
three tricky ideas in AI. I want to
explain why they're confusing, why
they're hard to understand, why I often
get questions about them, and I want to
make sure that I explain them clearly
enough that you can understand and teach
them to others because they underlie a
lot of the concepts I teach and talk
about. And I find that people often
misunderstand them. Number one, what is
tokenizable data? I talk about something
that's tokenizable. I talk about
tokenizable distributions, and I can
kind of see people's eyes glazing over:
what does “tokenizable” even mean? Very
simply, ask yourself if a piece of data
in your business or a piece of data in
your world could appear in a document.
If it could, that's a really good sign.
It's probably tokenizable. If you can't
imagine it fitting in a document, it's
probably not tokenizable. And so when
you ask if AI can do this, people often
think about the task as a whole, but I
always ask about the data in the tokens
first. Can I even fit it in? Can I even
see if the tokens will go into the
system? Then we get to subsequent
questions I talk about a lot, like, is
there too much data here? Is it too big
for the context window? Is the task too
multifaceted for a single prompt? Or is
the task too complex for an AI to handle
with nuance? AI often sort of polishes
off the nuance in a task. Those are all
questions that are downstream of
tokenizable data. Understand that the
way to think about whether AI can do
something starts with the token. It's
just a little chunk that passes into the
transformer. It's a piece of a word.
It's about four characters. So, as an
example of something that doesn't easily
tokenize, spreadsheets. You have to have
special techniques for spreadsheets. AI
is still much farther behind on
spreadsheets than it is on Word docs. Is
it getting better? Absolutely. It's not
where word doc processing is. You can
hand a very large word document to an AI
and ask it to at least give you a sense
of what's in there. If you handed a
large spreadsheet to an AI because you
value accuracy in numbers, you're not
going to get nearly as lucky in most
cases unless you have a specialized
tool. And that's why tools like
Datarails exist. You need specialized tools
that help. And we are seeing progress.
Notably, agent mode came out from OpenAI
and they can now create Excel sheets.
It's getting better. But if you look at
tokenized data as like tier A is easily
tokenized. Anything in your wiki is
tokenizable, super easily tokenized.
Tier B would be data that is at
spreadsheet scale, right? It fits in a
spreadsheet. It's not super easy to
tokenize, but there's probably some
stuff you can do to massage it and get
it in there. Tier C, data in a data
lake. It can be available for search
potentially through concepts like
agentic search. But it is simply too
big. It's not something that easily
tokenizes because it's like hundreds of
thousands and millions of rows of time
series data that you have to relate. And
so the traditional LLM
transformer architectures don't do well
with that kind of data. Now, you can
take small pieces of it and you can look
at tokenization and maybe learning
something there. But most of the time
when people talk about how they hook up
LLMs to large sources of data, what
they're really saying is they have
figured out how to search the data lake
in order to retrieve useful pieces of
information that they can ladder into
insights. And there are some preparatory
steps they need to take to do that.
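As a minimal sketch of the kind of "massaging" Tier B, spreadsheet-scale data might need before it can tokenize like document text, here's one way to flatten tabular rows into plain sentences. The column names and figures are made up for illustration; a real pipeline would read an actual export.

```python
import csv
import io

# Hypothetical spreadsheet export; in practice this would come from a real CSV file.
raw = """region,quarter,revenue
East,Q1,120000
West,Q1,95000
East,Q2,134000
"""

def rows_to_text(csv_text: str) -> str:
    """Flatten tabular rows into plain sentences so they tokenize like document text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        f"In {row['region']}, {row['quarter']} revenue was {row['revenue']}."
        for row in reader
    ]
    return "\n".join(lines)

print(rows_to_text(raw))
```

Once the table reads as sentences, it can go into a prompt the same way any document would.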
Well, I'm trying to keep it simple.
We're not going to go too far down that
path today. Think of tokenizable data as
stuff you can fit in a Word doc first.
Second, maybe it can go in an Excel.
Third, anything that is so big and
massive and complex and structured that
it has to go into a data warehouse or a
data lake, that is going to be
much harder. And what's interesting is
the easier it is to tokenize, the more
you have a chance to shape your destiny
with AI and that content. It is actually
quite difficult if you're working with
data lakes to pivot and figure out how
to architect AI solutions over the top.
Organizations wrestle with this all the
time. But if you have something that's
much simpler, if you have like company
policies and how you write your
documents and this and that all in one
neat word doc or three or four, you can
easily get that into LLMs and
immediately control your destiny and be
off to the races. So think in terms of
tokenizable data. Think in terms of
whether it fits in a word doc, maybe
whether you can sketch it on a napkin
because whether you can sketch it on a
napkin, by the way, is also a really
handy test for context window size. If
you can draw the complexity on a napkin,
AI can be very helpful. If you can't fit
the complexity onto the napkin, it may
be too complex for a nuanced perspective
from AI.
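The four-characters-per-token rule of thumb above can be turned into a quick back-of-the-napkin check of whether a piece of text plausibly fits a context window. This is only an estimate under that assumption (the window size and headroom figure here are illustrative; an exact tokenizer library would give precise counts):

```python
def rough_token_count(text: str) -> int:
    """Rough estimate: English text averages about 4 characters per token."""
    return max(1, len(text) // 4)

def fits_context_window(text: str, window_tokens: int = 128_000) -> bool:
    """Back-of-the-napkin check, leaving ~20% headroom for instructions and the response."""
    return rough_token_count(text) <= int(window_tokens * 0.8)

doc = "Company policy: all expense reports are due by the 5th. " * 1000
print(rough_token_count(doc), fits_context_window(doc))
```

If this rough check fails, that's the napkin telling you the data is probably Tier B or Tier C and needs preprocessing or retrieval rather than a single prompt.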
Okay, concept number two, moving on from
tokenization, jagged intelligence. I
talk about jagged intelligence a lot and
again, people's eyes glaze over. Jagged
intelligence simply means that we have
AIs that are in some ways as smart as
Einstein and in other ways worse than
the worst intern you've ever met. I was
a pretty bad intern to be honest with
you. The problem here is that AI is
not a continuous intelligence surface.
It has really, really large gaps driven
by well-known issues, particularly around
memory. If AI can't remember something,
it can't learn as it goes. And yes, you
know, LLM teams are working on this, but
it's a hard problem. They haven't made a
ton of progress yet. And for the moment,
it is very very difficult to get an AI
to consistently do certain simple things
that require memory. So for example, if
you talk to the AI about your role and
ask it to fulfill an assignment and
write you an excellent article or write
me a proposal or write me this email,
you have to re-explain it again and again
and again and again. And if you make any
mistake, it will make a mistake. That's
jagged intelligence. It's good enough to
write those emails. It's good enough to
write that proposal or that article. It
is not good enough to remember how to do
it. It's not good enough to not be
extremely sensitive to mistakes you make
in the briefing. In a sense, you have
Shakespeare who is just obsessed
with following instructions. And if you
make any mistake in the instructions,
Shakespeare is going to make mistakes.
This is why prompting matters so much:
in prompting, you're essentially trying
to get the LLM to do what
it does best rather than getting stuck
in a place it doesn't do well. Other
examples of places that LLMs don't do
well, the low points in jagged
intelligence, math. They will call other
tools to do math and there are
specialized models. So Gemini has one,
OpenAI apparently has one that does math
Olympiad problems. But when it comes to
whether 9.9 or 9.11 is bigger, LLMs can
still struggle. And so if you were
trying to look at mathematical modeling
of concepts, if you're trying to
understand how to weigh the levers of a
business, you get some insight from AI,
but I find that the insight tends to
cluster around the existing distribution
on strategic advice, McKinsey decks more
broadly. It doesn't tend to be deeply
insightful unless you are
extraordinarily good at giving it
strategic intent and excellent context
and then it can reason across your
information specifically. And that
highlights another one of the tricky
things about jagged intelligence. Jagged
intelligence can be made less jagged if
you prompt better. And so if you are
better at communicating your intent, you
can erase some of those gaps a little
bit. You're still going to feel the
gaps. I still feel it because I find
that AI is really, really good at things
like outlining and often not as good at
things like capturing tone the way I
want it to capture tone. I'm very picky
about tone. I feel it when I'm asking AI
to think about strategy and I feel like
AI is good as a sounding board, but it
doesn't feel like it's as refined as it
needs to be. The more you cultivate high
taste, the more you cultivate saying it
can be better and I know how it can be
better, the more you are going to be
sensitive to jagged intelligence. And so
my challenge to you is basically to ask
where is my taste bar? If I can't sense
jagged intelligence, have I insisted on
a high enough bar with AI? Because I bet
you know something better than AI and
you can start to insist on a high bar
there. So that's the idea of jagged
intelligence. You have a sense of a
Shakespeare that has to follow
instructions and has amnesia. Third
concept, when do you apply big prompts
versus casual chats? I get this question
because I think people perceive me as
the kind of guy that always does the
fancy prompts. I get it. I I write the
big prompts. I understand that. What I
want to suggest to you is that the
planning and the thoughtfulness that go
into an excellent prompt pay off when
you have an important task that you want
to do. And when you have something that
you need to iterate on and discover as
you go, it pays more often just to start
with a sharp one- or two-liner and go from
there. And so really what I'm saying,
and maybe people are like, well that's
obvious, right? Obviously if it's
important you put more time into it.
It's a little more nuanced than that. If
it's important and if it needs to be
anchored around a lot of context that
you give it, big prompt can make sense.
If it's casual and or if it's iterative
in nature, it can make sense to have a
longer sort of conversation as you go
with shorter prompts to start. In other
words, iterative tasks, you are
discovering meaning with the AI as you
go. So, it makes sense to start shorter
and just have a little bit of intent.
Larger prompts are for anchoring around
a specific topic. And sure, you're going
to have multi-turn, but it's going to be
a conversation that happens inside that
box you've set with a big piece of
strategic intent at the top in a big
prompt. If you're trying to iterate and
riff and brainstorm and think through
things, it often is actually much more
useful to start with a very short prompt
and leave the model room to expand. And
so to me, it's a little bit deceptive to
think that meaningful work only gets
done with big prompts, because I can get
very meaningful work done with short
prompts that I am kicking back and forth
rapidly if I need to discover the
meaning iteratively. And so my
encouragement to you is, is this piece
of work something that I already know
enough about that I want to be focused
on production? Probably a bigger prompt.
Is this piece of work something I don't
even know the shape of and I need to
discover it? Probably a shorter prompt.
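The two starting points can be sketched as simple templates. This is a toy illustration, not a prescribed format; the field names and example strings are made up:

```python
def big_prompt(intent: str, context: str, task: str) -> str:
    """Production-focused: anchor the model with strategic intent and rich context up front."""
    return (
        f"Role and intent: {intent}\n\n"
        f"Context:\n{context}\n\n"
        f"Task: {task}"
    )

def short_prompt(idea: str) -> str:
    """Exploratory: a sharp one-liner that leaves the model room to expand."""
    return f"Let's riff on this: {idea}"

print(big_prompt("You are our proposal writer.",
                 "We sell analytics software to mid-size retailers.",
                 "Draft a one-page proposal outline."))
print(short_prompt("pricing models for analytics software"))
```

The big template front-loads everything the conversation must stay anchored to; the short one deliberately leaves the shape of the work undefined.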
Both can be valuable. And for casual
chats where you're just trying to like
brainstorm and riff around, again,
iterative, it's going to be a shorter
prompt. You can use a more formal
brainstorming process if you have
context and you want to constrain it and
set your assumptions. This is often what
we do with humans when we have a formal
brainstorming session. But we all know
that humans also think well at drinks
after work. And so you can have that
equivalent conversation with AI and
still get a ton of value. And so one of
the things I want to encourage
you with is this:
don't think of it as Nate writes big
prompts and I have to write them too.
Think of it as know when to use a larger
more formal prompt versus when to use a
casual one. So if you put this all
together I think that you are going to
get farther with AI this week if you can
do a couple of little exercises that
help you to think through these
challenges. So find something you can
tokenize this week. Find something,
maybe, that you haven't tokenized before. I
scribble stuff on notepads all the time.
It's terrible handwriting. I find that
with the right model (o3 is better at
this), it can visually process that data
and get it into text so I can tokenize
it. That's an example of tokenization
for me. You can find one for you.
Second, look for something that feels
jagged with AI and be intentional about
how you cultivate the strength, the
peak, the good part, the Shakespeare
part of AI rather than the part that
isn't so intelligent, isn't so good. And
then third, just keep an eye on how
often you feel like the prompt fit the
project. If you feel like the prompt fit
the project some of the time, like 60 to
70% of the time, that's about
where I am. I sometimes have to restart
prompts because I'm like no that wasn't
the right prompt. Let me retry this. And
if you feel like you are just never
getting the right prompt, that's a signal.
You can dive in; I've got lots of
material on prompts. It's a signal for
you to think about how you communicate
intent and what kind of work you want to
do. Where do you want to iterate for
value versus where do you want to anchor
and define and have a big conversation
first? Those three pieces, if you get
them, it's going to help you enormously
in understanding how to use AI and
augment it. So, I hope this has been
helpful for you. I hope you understand
tokenization better. I hope you
understand jagged intelligence better.
And I hope you understand that it's not
just all about big prompts. But I do
like big prompts. And I cannot lie.