Solving the GraphQL N+1 Problem

📖 Click here for a fully-worked example of the code in this blog post

If I log all SQL queries in my GraphQL consultancy application and then execute a nested query against the GraphQL schema:

{
  developers {
    name,
    project {
      name
    }
  }
}

The following output is printed to the console:

SELECT d.id, d.name, d.role FROM developer d  - undefined
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":1}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":2}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":3}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":4}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":5}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":6}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":7}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":8}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":9}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":10}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":11}
SELECT a.projectId, p.name, p.description FROM assignments a LEFT JOIN project p ON (p.id = a.projectId) WHERE a.developerId = $id - {"$id":12}

That's a lot of SQL for a relatively small query!

Breaking it down, the first line isn't too bad. It's retrieving all of the developers in one go - you can probably guess that this is analagous to the developers field on the GraphQL root query. But as the query above is requesting the project field on each Developer type that is returned, for each developer there is a subsequent database lookup.

This is a common problem in the area of object-relational mapping, known as the n+1 problem. ORM frameworks such as Hibernate and Entity Framework can normally help you avoid this problem by batching requests, which is incredibly important when your database is slow, or scalability and performance are a concern. Efficient batching of SQL also helps avoid strange data inconsistencies that can arise from inefficient queries.

Over this series I have deliberately built the GraphQL application in a naïve manner, mainly to demonstrate that it is quite easy to run into performance issues such as this, if you're not careful. I see this as both a strength and weakness of GraphQL. When Facebook conceived GraphQL, they didn't want to replace an entire carefully-optimized backend stack. GraphQL was therefore envisaged as a thin wrapper around an existing backend stack. As such it makes no assumptions about your underlying data store. It could be SQL, it could be text files, it could even come from another web service. There's little magic under the hood, so effective GraphQL usage requires profiling, inspection and optimisation based on your own context. You are best-judged to make those decisions.

So, how can I solve this? First, let's take a step back.

What's in a Query?

Ultimately a GraphQL query is just a bunch of function calls. At any point in time there's an object in scope that defines a set of functions. These are your GraphQL types, and the fields on that type, respectively.

The query above therefore breaks down to:

On the root query object, call the function to get all developers.
On each object that is returned, call the function to get the project.

GraphQL is asynchronous by default. It will not respond to a caller until all fields have been resolved. Some fields can be resolved immediately (e.g. returning a property on an already-initialised object), whereas others may require some time to resolve (e.g if they require a database lookup). The entity that handles incoming queries will await all fields resolving.

To avoid the n+1 problem, I'd like to somehow batch the SQL queries into a single query. Something like SELECT * from project WHERE id IN (1, 2, 3) would be ideal. This would require us to reduce the incoming requests into one query and then map the results back out to the individual fields.

So here's an idea. Rather than making an individual SQL request per project, I could instead:

In the root query field, create a data structure to represent queries that should be batched.
Add the id to a collection of queries to be batched
Compose an optimized SQL query
Resolve each field with the entities from the database.

Introducing Data Loader

DataLoader is Facebook's open-source utility for asynchronous batching of data loading. This is commonly used within GraphQL applications to do just the type of batching discussed above.

A DataLoader instance is constructed by passing a function, which maps an array of ids into an array of values. As this too is asynchronous by default you can return a Promise that resolves to the array of values.

In my code, I define a function that accepts the database accessor and returns an object of loaders. One DataLoader is defined to retrieve the project assignments for a list of developers, identified by id.

const buildLoaders = db => ({
  projectAssignments: new DataLoader(developerIds => getProjectAssignments(developerIds, db))
});

When constructing your GraphQL endpoint, it's possible to add to a context object, which is passed down to all fields. This technique is useful if you have a singleton that you'd like to share. I already use this to give fields access to the database object, so I can also add the object containing my loaders to it:

sqlite.open('consultancy.db', { cached: true })
  .then(() => sqlite.run('PRAGMA foreign_keys=on'))
  .then(() => {
    const graphQLApp = express();
    const db = {
      get: (...args) => sqlite.get(...args),
      all: (...args) => sqlite.alll(...args),
      run: (...args) => sqlite.run(...args)
    };
    const loaders = buildLoaders(db);

    graphQLApp.use('/', graphQLHTTP({
      graphiql: true,
      pretty: true,
      schema,
      rootValue,
      context: {
        db,
        loaders
      }
    }));

    graphQLApp.listen(GRAPHQL_PORT);
  });

Now on the field project on type Developer, rather than calling a function that accesses the database directly, I add the id onto the loader:

class Developer {
  constructor(id, name, role) {
    this.id = id;
    this.name = name;
    this.role = toRole(role);
  }

  project(obj, ctx) {
    return ctx.loaders.projectAssignments.load(this.id);
  }
}

When the data loader function resolves with its array of values, it fans these values back out to their callers, so the result of the load function above will be a single object.

Finally, I need to change the database query that I construct, because now it's looking up several entities rather than a just one:

export const getProjectAssignments = (ids, db) =>
  db.all('
    SELECT d.id, p.id as projectId, p.name, p.description
    FROM developer d
    LEFT JOIN assignments a
    ON (d.id = a.developerId)
    LEFT JOIN project p
    ON (p.id = a.projectId)
    WHERE d.id
    IN (${ids.map(i => '?')})
  ', ids.join()).then(rows => rows.map(
      id => {
        const entry = rows.find(row => row.id.toString() === id);
        if (entry) {
          if (entry.projectId) {
            return new Project(r.projectId.toString(), r.name, r.description);
          }
          return null;
        }
        return new Error(`No Developer exists with id ${id}`);
      }
    ));

Obviously the latter is more complex than the former. It took me a while to get right.

DataLoader throws an error if the array of values is different in length to the array of ids. Therefore it's important to make sure that the loader function returns something for each requested key. The order matters too. The query above uses bound parameters, so that if an id isn't found in the database it doesn't cause the entire query to fail. This also gives you some protection against SQL injection, compared to embedding the ids into the query directly. But, when mapping the resultant rows to instances of the Project type, I must ensure that for ids that didn't produce a result they're padded out with an Error, and for ids that did produce a result but not an inner result (i.e a developer with no project), I return a null.

(Note: I ended up extracting the above function into a generic "whereIn" function to avoid code duplication. You can see that form of the solution in the example linked at the beginning of the post.)

Re-running the original GraphQL query, the database logging output becomes:

SELECT d.id, d.name, d.role FROM developer d  - undefined
SELECT d.id, p.id as projectId, p.name, p.description
  FROM developer d
  LEFT JOIN assignments a
  ON (d.id = a.developerId)
  LEFT JOIN project p
  ON (p.id = a.projectId)
  WHERE d.id IN (?,?,?,?,?,?,?,?,?,?,?,?) - ["1","2","3","4","5","6","7","8","9","10","11","12"]

Hooray!

Where Else?

One other area that you can consider using DataLoader for is when your schema has a field to resolve a single instance of some type. Seems strange, but remember that with GraphQL, you can assign these to values, making it possible to actually request several single instances in one query.

This query does just that, except one of the ids is obviously erroneous.

query {
  sam: developer(id: "2") {
    name
  },
  gary: developer(id: "3") {
    name
  },
  dustin: developer(id: "eleven") {
    name
  }
}

By defining another DataLoader to load developers, these can be batched. As the method I used above maps any unfound entries to instances of an Error, GraphQL can accommodate this in its response, by highlighting the error:

{
  "errors": [
    {
      "message": "No developer exists with id eleven",
      "locations": [
        {
          "line": 8,
          "column": 3
        }
      ],
      "path": [
        "dustin"
      ]
    }
  ],
  "data": {
    "sam": {
      "name": "Gary"
    },
    "gary": {
      "name": "Chris"
    },
    "dustin": null
  }
}

Conclusion

We've seen how easy it can be to write sub-optimal SQL using GraphQL. DataLoader is a simple mechanism by which you can avoid the n+1 problem by batching multiple database requests into a more efficient database call. I'd highly recommend considering this solution early in development to prevent scalability issues. But always remember to profile these things!