Documents worth chatting to

Posted on Jan 9, 2024

I have reams of documents from over the years: handwritten at first, then digital text after college. They cover communications, all kinds of notes, recordings, personal metadata, etc. To plumb together a personal RAG (retrieval-augmented generation) system I’ll need to nail down the sources.

Let’s start by brainstorming easily accessible digital sources:

  • org/obsidian files (how to chunk?)
  • emails (pull using offlineimap)
  • calendar (one-time export)
  • health data (Apple HealthKit, Google Fit)
  • voice/video chat recordings (which I don’t have)
  • source code (not really useful to bring in)
  • ‘sketches’, blueprints, mind maps which don’t really exist
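On the “how to chunk?” question for org/Obsidian files, one simple starting point is to split each note at its headings, so every chunk carries its own title. Here is a minimal sketch under that assumption; the function name `chunk_by_heading` and the `marker` parameter are made up for illustration, and it does not handle headings-lookalikes inside code blocks or size limits on long sections.

```python
import re


def chunk_by_heading(text, marker="#"):
    """Split a note into (heading, body) chunks.

    marker is "#" for Markdown/Obsidian headings or "*" for org headings.
    Text before the first heading becomes a chunk with an empty heading.
    """
    # One or more marker characters at line start, then whitespace, then the title.
    pattern = re.compile(r"^" + re.escape(marker) + r"+\s+(.*)$")
    chunks = []
    heading, body = "", []

    def flush():
        # Keep a chunk only if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    note = "intro line\n# Title\nbody one\n\n## Sub\nbody two\n"
    for title, content in chunk_by_heading(note):
        print(repr(title), repr(content))
```

Splitting on headings keeps each chunk topically coherent, which tends to matter more for retrieval than hitting a uniform chunk size. Passing `marker="*"` gives the same behavior for org files.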

Beyond the digital sources I’ve already been accumulating, there are the potential sources of the future. For example, I could log my thoughts and processes more thoroughly, in the hope that LLMs will give that personal text personal utility and accessibility (yes, that’s what this doc is).

Another potential source could come from “auto-logging” all of my digital actions to produce a digital paper trail which can be leveraged. Bringing the auto-logging idea into meatspace yields ambient recording, e.g., those Snapchat/Oakley glasses that include a camera and can capture what someone is seeing and/or hearing. I imagine an ambient source that deep and rich could accumulate into the quantity of data needed for training.

Getting way out on a limb, I wonder if there is some sort of ‘auto-logging’ for thoughts, an easy way to collect Chain-of-Thought material. To a certain extent that’s what a journal or notebook provides, e.g., one might learn about learning by looking at the notes of everyone who, say, took a class. A textbook might provide a better absolute reference, but the notes would show the path to grokking the concept across a population of students.

While analog documents require a digitization step, unlocking their value re: LLMs/software can be super compelling – we’ve seen photos come into focus over the last ten years of ML, and other artifacts might undergo similar expansion. The most obvious untapped source is handwritten notes, though I’d expect the signal-to-noise ratio of all my raw notes to be quite low. I will try to start handwriting my notes so that they can be easily mulched by AIs, e.g., with clear handwriting and a consistent markup format.

More to come. Over the next few days I’ll aim to pull, index, and query against my emails.
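
As a first step toward that, here is a rough sketch of turning synced mail into plain-text documents ready for indexing. It assumes offlineimap has already written a local Maildir folder and uses Python’s standard `mailbox` and `email` modules; the function name `iter_email_docs` and the dictionary keys are placeholders of mine, not a settled design, and messages with unusual encodings may need extra handling.

```python
import email
import mailbox
from email import policy


def iter_email_docs(maildir_path):
    """Yield one dict per message in a single Maildir folder.

    Assumes maildir_path points at an existing Maildir (with cur/new/tmp),
    such as one folder synced locally by offlineimap.
    """
    box = mailbox.Maildir(maildir_path, factory=None, create=False)
    for key in box.iterkeys():
        # Parse the raw bytes with the modern policy to get EmailMessage objects.
        msg = email.message_from_bytes(box.get_bytes(key), policy=policy.default)
        # Prefer the text/plain part; skip HTML-only bodies in this sketch.
        part = msg.get_body(preferencelist=("plain",))
        yield {
            "subject": str(msg.get("subject", "")),
            "from": str(msg.get("from", "")),
            "date": str(msg.get("date", "")),
            "text": part.get_content() if part is not None else "",
        }
```

Each yielded dict can then be chunked and embedded like any other note, with the subject, sender, and date kept as metadata for filtering at query time.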