Duplicate the --save feature from openai-to-sqlite similar #230

Open
simonw opened this issue Sep 5, 2023 · 8 comments
Labels
embeddings enhancement New feature or request

Comments

Owner

simonw commented Sep 5, 2023

https://github.com/simonw/openai-to-sqlite/blob/361d98a7f260a1420e6e698481f298848b922253/README.md#saving-similarity-calculations-to-the-database

This is the feature that can be used to save calculated similarity scores to the database. I use it to serve related TILs on my TILs site: https://til.simonwillison.net/llms/openai-embeddings-related-content

openai-to-sqlite similar embeddings-bjcp-2021.db \
  --all --save

And this feature too:

openai-to-sqlite similar embeddings-bjcp-2021.db \
  '23G Gose' '01A American Light Lager' \
  --save \
  --recalculate-for-matches \
  --count 20
simonw added the enhancement and embeddings labels Sep 5, 2023
Owner Author

simonw commented Sep 5, 2023

The similarities table is pretty simple: https://til.simonwillison.net/tils/similarities

CREATE TABLE [similarities] (
   [id] TEXT,
   [other_id] TEXT,
   [score] FLOAT,
   PRIMARY KEY ([id], [other_id])
);

For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.
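As a rough illustration of what saving scores into that table involves, here is a minimal sketch using only the Python standard library. It is not the openai-to-sqlite implementation: the function names `save_similarities` and `related`, the toy vectors, and the `count` parameter are all assumptions made for this example, and it assumes no embedding is a zero vector.

```python
import math
import sqlite3


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def save_similarities(conn, embeddings, count=10):
    """For each id, store its `count` most similar other ids."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS similarities (
            id TEXT,
            other_id TEXT,
            score FLOAT,
            PRIMARY KEY (id, other_id)
        )
        """
    )
    for id_, vector in embeddings.items():
        scored = sorted(
            (
                (cosine_similarity(vector, other_vector), other_id)
                for other_id, other_vector in embeddings.items()
                if other_id != id_
            ),
            reverse=True,
        )[:count]
        # INSERT OR REPLACE so that re-running overwrites stale scores
        conn.executemany(
            "INSERT OR REPLACE INTO similarities (id, other_id, score) "
            "VALUES (?, ?, ?)",
            [(id_, other_id, score) for score, other_id in scored],
        )
    conn.commit()


def related(conn, id_, limit=5):
    """Return the ids most similar to id_, best match first."""
    rows = conn.execute(
        "SELECT other_id FROM similarities "
        "WHERE id = ? ORDER BY score DESC LIMIT ?",
        (id_, limit),
    )
    return [row[0] for row in rows]


# Toy vectors, invented for illustration only
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
conn = sqlite3.connect(":memory:")
save_similarities(conn, embeddings, count=2)
print(related(conn, "a"))  # ['b', 'c']
```

Because the scores are precomputed, serving related content becomes a single indexed lookup on `id` rather than a comparison against every stored embedding.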

Owner Author

simonw commented Sep 5, 2023

Maybe similarity score functions should be provided by plugins, and stored in a scoring_functions table with an integer primary key (as a foreign key from similarities) plus a text column that stores the path to the function - so if it's in core it's llm.scoring.cosine_similarity but if it's from some plugin it's llm_manhattan.manhattan.
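A minimal sketch of how that lookup might work, assuming the text column holds a dotted import path. The helper names `resolve_function` and `scoring_function_id` and the `UNIQUE` constraint are assumptions for this example, and `math.dist` from the standard library stands in for a real scoring function, since `llm.scoring.cosine_similarity` and `llm_manhattan.manhattan` are only proposed names here.

```python
import importlib
import sqlite3


def resolve_function(path):
    """Turn a dotted path such as 'package.module.func' into the callable."""
    module_path, _, name = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


def scoring_function_id(conn, path):
    """Return the integer id for a path, registering it on first use."""
    conn.execute(
        "INSERT OR IGNORE INTO scoring_functions (path) VALUES (?)", (path,)
    )
    row = conn.execute(
        "SELECT id FROM scoring_functions WHERE path = ?", (path,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scoring_functions (id INTEGER PRIMARY KEY, path TEXT UNIQUE)"
)

# math.dist is a stand-in; a plugin would register its own dotted path
function_id = scoring_function_id(conn, "math.dist")
path = conn.execute(
    "SELECT path FROM scoring_functions WHERE id = ?", (function_id,)
).fetchone()[0]
score = resolve_function(path)
print(score([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Storing the path rather than the function itself means a row can outlive the plugin that registered it, so a real implementation would need to handle an `ImportError` or `AttributeError` when a plugin has been uninstalled.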

The same mechanism could work for chunking functions too, see:

@simonw simonw added this to the 0.10 milestone Sep 10, 2023
Owner Author

simonw commented Sep 12, 2023

> For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.

Since I have a migrations system in place I can ignore that idea for the moment and add it in the future if appropriate.

Owner Author

simonw commented Sep 12, 2023

I'm going to implement --save and --print and --recalculate-for-matches but not --table.

Owner Author

simonw commented Sep 12, 2023

I need to land this first, since it has a migration in already:

Owner Author

simonw commented Sep 12, 2023

The migration for this will be:

@embeddings_migrations()
def m006_similarities(db):
    db["similarities"].create({
        "collection_id": int,
        "id": str,
        "other_id": str,
        "score": float,
    }, pk=("collection_id", "id", "other_id"))
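To show what the three-column primary key buys, here is a small sketch using the standard library `sqlite3` module with a hand-written table that approximates the one the migration would create. The `save_score` helper and the score values are invented for illustration; the ids are borrowed from the beer style example earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Approximation of the table from the proposed migration
conn.execute(
    """
    CREATE TABLE similarities (
        collection_id INTEGER,
        id TEXT,
        other_id TEXT,
        score FLOAT,
        PRIMARY KEY (collection_id, id, other_id)
    )
    """
)


def save_score(conn, collection_id, id_, other_id, score):
    # OR REPLACE overwrites any row with the same compound primary key
    conn.execute(
        "INSERT OR REPLACE INTO similarities VALUES (?, ?, ?, ?)",
        (collection_id, id_, other_id, score),
    )


save_score(conn, 1, "23G", "01A", 0.71)
save_score(conn, 2, "23G", "01A", 0.65)  # same pair, different collection
save_score(conn, 1, "23G", "01A", 0.74)  # recalculated: replaces the first row

rows = conn.execute(
    "SELECT collection_id, score FROM similarities ORDER BY collection_id"
).fetchall()
print(rows)  # [(1, 0.74), (2, 0.65)]
```

Including `collection_id` in the key lets the same pair of ids carry a separate score in each collection, while a recalculation within one collection replaces the old row instead of duplicating it.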

Owner Author

simonw commented Sep 12, 2023

The compound primary keys make this a bit harder, since sqlite-utils and Datasette don't really support those for foreign keys yet. Already filed one bug:

simonw added a commit that referenced this issue Sep 12, 2023
Owner Author

simonw commented Sep 12, 2023

This was getting a bit fiddly, so I decided to drop it from 0.10.

simonw modified the milestones: 0.10, 0.11 Sep 12, 2023
simonw modified the milestones: 0.11, 0.12 Sep 19, 2023
simonw removed this from the 0.12 milestone Jan 26, 2024