Auto-generate dataloaders from sqlc queries #1233

radazen · 2023-10-18T15:31:55Z

The README included in this PR serves as a decent PR description and is pasted below. Also worth noting: this PR introduces the sqlc code generator, but it doesn't replace our existing dataloaders with the new generated code. I'll handle that in a follow-up PR.

Dataloader Generator

Automatically generates dataloaders based on sqlc queries

Requirements

sqlc.yaml must be set up to use sqlc's sqlc-gen-json example plugin to generate a JSON manifest file with information about generated queries

Quickstart

From the go-gallery root directory, run:

make sqlc-generate

Overview

This tool will read the manifest created by sqlc-gen-json and use the go/types package to figure out which SQL statements can be turned into dataloaders.

By default, all :batchone and :batchmany statements will create dataloaders
Dataloaders can also be generated for SQL queries that don't use sqlc's :batch syntax. See Custom Batching.

A dataloader can receive and cache results from other dataloaders. This happens automatically for dataloaders that appear to look up objects by their IDs, and can be set up for other dataloaders with minimal effort. See Caching Results.

Configuration options for individual dataloaders can be set with a -- dataloader-config: comment in the sqlc queries file. For example:

-- name: GetUserByID :batchone
-- dataloader-config: maxBatchSize=10 batchTimeout=2ms publishResults=false

See Configuring Dataloaders for a full list of available options.

Generated dataloaders are aware of sqlc.embed syntax, which can be used to return multiple generated types from a single query (e.g. a coredb.Token and a coredb.Contract). Each embedded type will be sent to dataloaders that can cache objects of that type (e.g. the coredb.Token in the example above will be sent to dataloaders that can cache coredb.Token results).

It's possible for sqlc to generate parameter types that Go doesn't consider comparable. For example, a query might accept a list of Chains as a parameter, but a Go struct with a slice field (e.g. chains []Chain) is not comparable. Generated dataloaders support these non-comparable keys by converting them to JSON internally and using the resulting JSON strings as comparable cache keys.
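As a rough illustration of the JSON-key idea, here is a minimal sketch. The names Chain, getTokensParams, and jsonKey are hypothetical stand-ins, not the generator's actual code; the only assumption is that the key type can be marshaled with encoding/json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Chain and getTokensParams are hypothetical stand-ins for sqlc-generated types.
type Chain int

type getTokensParams struct {
	OwnerID string
	Chains  []Chain // the slice field makes this struct non-comparable
}

// jsonKey converts any key into a comparable string by JSON-encoding it.
// Hypothetical helper: the generated code may do this differently.
func jsonKey[TKey any](key TKey) (string, error) {
	b, err := json.Marshal(key)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cache := map[string]string{}

	a := getTokensParams{OwnerID: "user1", Chains: []Chain{1, 2}}
	b := getTokensParams{OwnerID: "user1", Chains: []Chain{1, 2}}

	keyA, err := jsonKey(a)
	if err != nil {
		panic(err)
	}
	keyB, err := jsonKey(b)
	if err != nil {
		panic(err)
	}

	// a and b cannot be compared with ==, but their JSON strings can.
	cache[keyA] = "cached result"
	_, hit := cache[keyB]
	fmt.Println(keyA, hit)
}
```

Two parameter structs with the same field values encode to the same string, so the second lookup hits the entry stored by the first.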

Running make sqlc-generate creates three files: manifest.json, dataloaders_gen.go, and api_gen.go

manifest.json is the JSON manifest generated by the sqlc-gen-json plugin
dataloaders_gen.go contains definitions for all the generated dataloaders
api_gen.go contains a Loaders struct with fields for all the generated dataloaders, and sets up connections between them to cache results from one dataloader in another

Caching Results

Dataloaders will attempt to publish their results for other dataloaders to cache. A dataloader can opt in for caching by implementing one of these interfaces (where TKey and TResult are the key and result types of the dataloader itself):

// Given a TResult to cache, return the TKey value to use as its cache key
type autoCacheWithKey[TKey any, TResult any] interface {
	getKeyForResult(TResult) TKey
}

// Given a TResult to cache, return multiple TKey values to use as cache keys.
// The TResult value will be cached once for each provided cache key.
// Useful for things like GetGalleryByCollectionID, where the same Gallery result
// should be cached with each of its child collection IDs as keys.
type autoCacheWithKeys[TKey any, TResult any] interface {
	getKeysForResult(TResult) []TKey
}

If a sqlc query appears to look up an object by its ID, the generated dataloader will automatically implement autoCacheWithKey for that object type. This happens if the dataloader has:

a persist.DBID key type, and
a sqlc-generated result type (e.g. a coredb.Xyz) with a persist.DBID field named ID

Because ID-based lookups are the most common caching need, it's rare to need to implement one of the autoCache interfaces manually. If the need arises, add an entry to autocache.go.
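For reference, a manual implementation might look roughly like the sketch below. DBID, Gallery, and galleryByCollectionID are hypothetical stand-ins (the real types live in persist, coredb, and the generated code), and the Collections field name is an assumption; only the interface shape is taken from the definitions above.

```go
package main

import "fmt"

// DBID and Gallery are hypothetical stand-ins for persist.DBID and coredb.Gallery.
type DBID string

type Gallery struct {
	ID          DBID
	Collections []DBID // assumed field holding the gallery's child collection IDs
}

// Same shape as the autoCacheWithKeys interface described above.
type autoCacheWithKeys[TKey any, TResult any] interface {
	getKeysForResult(TResult) []TKey
}

// galleryByCollectionID stands in for a generated dataloader type.
type galleryByCollectionID struct{}

// Cache the same Gallery once under each of its child collection IDs.
func (galleryByCollectionID) getKeysForResult(g Gallery) []DBID {
	return g.Collections
}

// Compile-time check that the interface is satisfied.
var _ autoCacheWithKeys[DBID, Gallery] = galleryByCollectionID{}

func main() {
	loader := galleryByCollectionID{}
	g := Gallery{ID: "gallery1", Collections: []DBID{"colA", "colB"}}

	// Simulate what a dataloader might do with a published result.
	cache := map[DBID]Gallery{}
	for _, key := range loader.getKeysForResult(g) {
		cache[key] = g
	}
	fmt.Println(len(cache), cache["colB"].ID)
}
```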

Configuring Dataloaders

Configuration options for individual dataloaders can be set with a -- dataloader-config: comment in the sqlc queries file. For example:

-- name: GetUserByID :batchone
-- dataloader-config: maxBatchSize=10 batchTimeout=2ms publishResults=false

Available options:

maxBatchSize: the maximum number of keys to fetch in a single batched query. Defaults to 100.
batchTimeout: the duration to wait before sending a batch (unless it reaches maxBatchSize first, at which point it will be sent immediately). Defaults to 2ms.
publishResults: whether to publish results for other dataloaders to cache. Defaults to true.
skip: whether to skip generating a dataloader for this query. Defaults to false.
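
To make the option format concrete, here is a minimal sketch of how such a comment could be parsed. The config struct and parseConfig function are hypothetical and are not the generator's actual parser; only the option names and defaults come from the list above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// config is a hypothetical holder for the documented options.
type config struct {
	MaxBatchSize   int
	BatchTimeout   time.Duration
	PublishResults bool
	Skip           bool
}

const configPrefix = "-- dataloader-config:"

// parseConfig starts from the documented defaults and applies any
// name=value pairs found in a dataloader-config comment line.
// Lines without the prefix leave the defaults untouched.
func parseConfig(comment string) (config, error) {
	cfg := config{MaxBatchSize: 100, BatchTimeout: 2 * time.Millisecond, PublishResults: true}

	line := strings.TrimSpace(comment)
	if !strings.HasPrefix(line, configPrefix) {
		return cfg, nil
	}

	for _, field := range strings.Fields(strings.TrimPrefix(line, configPrefix)) {
		name, value, found := strings.Cut(field, "=")
		if !found {
			return cfg, fmt.Errorf("expected name=value, got %q", field)
		}
		var err error
		switch name {
		case "maxBatchSize":
			cfg.MaxBatchSize, err = strconv.Atoi(value)
		case "batchTimeout":
			cfg.BatchTimeout, err = time.ParseDuration(value)
		case "publishResults":
			cfg.PublishResults, err = strconv.ParseBool(value)
		case "skip":
			cfg.Skip, err = strconv.ParseBool(value)
		default:
			err = fmt.Errorf("unknown option %q", name)
		}
		if err != nil {
			return cfg, err
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig("-- dataloader-config: maxBatchSize=10 batchTimeout=2ms publishResults=false")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```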

Custom Batching

The easiest and most common way to generate dataloaders is to use sqlc's :batch syntax, which uses the Postgres batching API to send many queries to the database in a single round trip. The batching API reduces round trip overhead, but it still executes one SQL query for each provided key. In some performance-critical circumstances (e.g. routinely looking up thousands of objects by their IDs), it's better to perform a single query that returns an entire batch of results.

A dataloader will be generated for SQL statements that don't use sqlc's :batch syntax, if:

the query uses the sqlc :many keyword
the query returns an int column named batch_key_index

batch_key_index should be a 1-based index that maps keys to results, and is typically created via the generate_subscripts function. For example, to create a dataloader that looks up contracts by their IDs:

with keys as (
    select unnest (@contract_ids::varchar[]) as id
         , generate_subscripts(@contract_ids::varchar[], 1) as batch_key_index
)
select k.batch_key_index, sqlc.embed(c) from keys k
    join contracts c on c.id = k.id
    where not c.deleted;

This example is a good template for looking up objects by IDs via custom batching, and can be reused for other types.
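
On the Go side, batch_key_index is what lets results be lined up with the keys that requested them. The sketch below shows one way that alignment could work; Contract, row, errNotFound, and alignResults are hypothetical names, not the generated code. The ([]TResult, []error) return shape follows the fetchFunc signature used by NewDataloader in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Contract and row are hypothetical stand-ins for the sqlc-generated row type.
type Contract struct {
	ID string
}

type row struct {
	BatchKeyIndex int32
	Contract      Contract
}

// errNotFound is a hypothetical sentinel for keys with no matching row.
var errNotFound = errors.New("not found")

// alignResults places each row at the position of the key that produced it.
// Rows may arrive in any order, and keys with no row keep a not-found error.
func alignResults(keys []string, rows []row) ([]Contract, []error) {
	results := make([]Contract, len(keys))
	errs := make([]error, len(keys))
	for i := range errs {
		errs[i] = errNotFound
	}

	for _, r := range rows {
		i := int(r.BatchKeyIndex) - 1 // batch_key_index is 1-based
		if i < 0 || i >= len(keys) {
			continue // ignore out-of-range indexes
		}
		results[i] = r.Contract
		errs[i] = nil
	}
	return results, errs
}

func main() {
	keys := []string{"a", "b", "c"}
	rows := []row{
		{BatchKeyIndex: 3, Contract: Contract{ID: "c"}},
		{BatchKeyIndex: 1, Contract: Contract{ID: "a"}},
	}

	results, errs := alignResults(keys, rows)
	for i, key := range keys {
		fmt.Println(key, results[i].ID, errs[i])
	}
}
```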

Note: because the dataloader generated from the SQL query above does not have a persist.DBID key type (the query takes a varchar[] parameter), it will not automatically implement autoCacheWithKey for the result type. autoCacheWithKey will need to be implemented manually.

radazen · 2023-10-18T15:33:54Z

db/gen/coredb/manifest.json

This is a gigantic (5MB) generated file that will change every time we add new sqlc queries. We could add it to .gitignore, or we could just commit it and not pay attention to it. I think both approaches are fine! I kept it around for the time being.

jarrel-b

This is so cool, I'm excited to see this in action and start using it. Also appreciate all the comments and documentation, it made it a lot easier to follow 👍

jarrel-b · 2023-10-20T16:44:45Z

cmd/dataloaders/README.md

+```
+See **[Configuring Dataloaders](#configuring-dataloaders)** for a full list of available options.
+
+Generated dataloaders are aware of `sqlc.embed` syntax, which can be used to return multiple generated types from a single query (e.g. a `coredb.Token` and a `coredb.Contract`). Each embedded type will be sent to dataloaders that can cache objects of that type (e.g. the `coredb.Token` in the example above will be sent to dataloaders that can cache `coredb.Token` results).


This is so cool

jarrel-b · 2023-10-20T19:10:49Z

cmd/dataloaders/generator/dataloader.go

+
+		// Prevent lock contention within a batch by allowing only the first maxBatchSize callers
+		// to obtain the lock.
+		numAssigned := atomic.AddInt32(&b.numAssigned, 1)


nit: I got a bit confused with the name numAssigned, since it refers to the number of callers so far in the batch, maybe something like num callers, current caller count, caller slot, etc?

Good call! This is the last thing I added, and I kind of threw it in there haphazardly. I'll call it numCallers!

jarrel-b · 2023-10-20T19:20:42Z

cmd/dataloaders/generator/dataloader.go

+
+func NewDataloader[TKey comparable, TResult any](ctx context.Context, maxBatchSize int, batchTimeout time.Duration, cacheResults bool, publishResults bool,
+	fetchFunc func(context.Context, []TKey) ([]TResult, []error)) *Dataloader[TKey, TResult] {
+	return newDataloader(ctx, maxBatchSize, batchTimeout, cacheResults, publishResults, fetchFunc, indexOf[TKey])


Could searching through keys linearly become an issue, or is the batch size never very large in practice?

I think the linear search should be okay in practice. The existing dataloaders do it, and I've never noticed a bottleneck there. I was debating whether we should create a map per batch to make these lookups faster, but I'm honestly not sure if speed would improve, and memory usage would definitely go up a bit.

jarrel-b · 2023-10-20T19:28:43Z

graphql/dataloader/dataloaders_gen.go

+	return d
+}
+
+func loadCountAdmiresByFeedEventIDBatch(q *coredb.Queries) func(context.Context, []persist.DBID) ([]int64, []error) {


Wow, this is unreal!!

jarrel-b · 2023-10-20T19:32:59Z

graphql/dataloader/api_gen.go

🔥 🔥 🔥

radazen added 2 commits October 18, 2023 11:26

The rest of the owl

1190b31

Merge remote-tracking branch 'origin/main' into ezra/dataloaders

5b6f669

radazen requested review from benny-conn and jarrel-b as code owners October 18, 2023 15:31

radazen requested review from Robinnnnn and kaitoo1 October 18, 2023 15:32

radazen commented Oct 18, 2023

radazen added 3 commits October 19, 2023 09:36

Replace old dataloaders with new ones

01f5909

Better lock contention handling within batches

a59b27c

Add a comment

6d1d44e

github-actions bot added query dataloader labels Oct 20, 2023

radazen added 3 commits October 20, 2023 12:28

Better handling for media lookups

bb5936f

MediaByTokenID -> MediaByMediaID

6681569

More TokenID -> TokenMediaID updates

e2e3ae9

jarrel-b approved these changes Oct 20, 2023

radazen added 7 commits October 25, 2023 18:36

Better not found error handling, v1

4db784a

Rename numAssigned to numCallers

a5de6fd

Add getNotFoundError implementations for existing pgx.ErrNoRows cases

34f5301

Update the README with docs for pgx.ErrNoRows

5a810b1

Merge remote-tracking branch 'origin/main' into ezra/dataloaders

56f5f23

make sqlc-generate after merging main

c10e2fd

Fix tests

7f0be23

radazen merged commit a5562d7 into main Oct 31, 2023
7 checks passed

radazen deleted the ezra/dataloaders branch October 31, 2023 15:56
