My next job for the Pick'Em app is to define the application's data model. In the previous post, I set up all the cloud infrastructure with AWS Amplify.

I chose to use DynamoDB for the backend storage, mainly to familiarise myself more with this style of database. DynamoDB is a scalable 'NoSQL' database which, in some respects, can be thought of as a key-value hash map with arbitrary runtime data attributes.

Traditional SQL databases specify a data schema up front, encoding entities and their relationships into tables, with primary and foreign keys as identifiers. Data storage is highly normalized, avoiding data duplication. At runtime, a user can execute arbitrary queries through constructs such as JOINs. Although there have been decades of optimization, SQL performance degrades as data size grows, with the joining becoming a bottleneck.

DynamoDB does away with all of this in the name of scalability. Using DynamoDB effectively requires putting quite a lot of ingrained SQL knowledge aside and paying close attention to the application's data access patterns. Indeed, the preferred practice in DynamoDB land is to represent the application's data with a single table! DynamoDB's scalability comes from using the key-value structure as a means of partitioning data - with the right partitioning strategy, reads are distributed and fast, even under high loads. This also comes with the benefit of a much simpler query structure, as SQL-style data joins are not supported.

The resulting schema looks nothing like what you may be used to from a traditional SQL database. Before delving into the thinking and the schema I devised for the Pick'Em app, a note of caution: I'm in no way an expert, and I'm sure there are improvements that could be made here. Purely out of personal curiosity, please do reach out if there's anything I should be aware of! As it stands, though, I'm quite happy with this design.

Entities and Relationships and Keys (Oh My!)

NoSQL doesn't mean there aren't entities and relationships! They're still crucial to developing a workable schema.

In a nutshell: Users will make Predictions for the Matches in an Event, and their Results will be aggregated into league Standings.

As I am aiming for a single table, I've opted for generic names for the key attributes.

  • The partition key is named PK.
  • The sort key is named SK.
  • The attribute that acts as the partition key for the global secondary index is GSI_PK.
  • The attribute that acts as the sort key for the global secondary index is GSI_SK.
  • Each data item has an attribute called type defining which entity it represents.

Together, the partition key and sort key form a composite key that uniquely identifies a data item. It is possible to store many items with the same partition key in DynamoDB, provided the sort key distinguishes them. Further, as the name suggests, the sort key determines the order in which queried items are returned.

With generic names for the key attributes, it's natural to use a generic structure for their values too. A DynamoDB convention is to use the ENTITY#ID format for key values; in some cases it may be necessary to go further, to ENTITY#ID#EXTRAS. The purpose of this structure is to enforce consistency, which becomes particularly important when it comes to sorting: as strings, these keys are lexicographically sorted by default.
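
For illustration, a tiny helper along these lines can keep key construction consistent. This is just a sketch in TypeScript; the key function and the example values are mine, not from the app itself:

const key = (entity: string, ...parts: string[]): string =>
  [entity, ...parts].join("#");

key("USER", "sam");                         // "USER#sam"
key("EVENT", "2024-03-03-aew-revolution");  // "EVENT#2024-03-03-aew-revolution"
key("SCORE", "00140", "sam");               // "SCORE#00140#sam"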

The Database Schema

Working this back-to-front, here's the single-table design I came up with. It requires a main table and one global secondary index. After that, I'll cover how this satisfies each of the data access patterns. Trust me, it will make sense.

Main Table

Partition Key: PK, Sort Key: SK

  • PK: USER#sam, SK: USER
    type: user, name: Sam Hogarth
  • PK: USER#sam, SK: STANDINGS#year
    type: standings, GSI_PK: STANDINGS#year, GSI_SK: SCORE#00140#sam, belongsto: USER#sam, eplayed: 1, mcorrect: 10, points: 140
  • PK: USER#sam, SK: EVENT#2024-03-03-aew-revolution
    type: prediction, GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: SCORE#140#sam, mcorrect: 10, points: 140, predictions: { ... }
  • PK: EVENT, SK: EVENT#2024-03-03-aew-revolution
    type: event, GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: EVENT, date: 2024/03/03, name: AEW Revolution, state: scored
  • PK: EVENT#2024-03-03-aew-revolution, SK: MATCH#03973653...
    type: match, GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: MATCH#01, id: 03973653-e1f9-4262-86d5-02cbfd2cf962, championships: [ ... ], name: Tag Team, points: 5, preshow: true, result: 1, teams: [ ... ]

Global Secondary Index

Partition Key: GSI_PK, Sort Key: GSI_SK

  • GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: EVENT
    type: event, PK: EVENT, SK: EVENT#2024-03-03-aew-revolution, date: 2024/03/03, name: AEW Revolution, state: broadcast
  • GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: MATCH#01
    type: match, PK: EVENT#2024-03-03-aew-revolution, SK: MATCH#03973653..., id: 03973653-e1f9-4262-86d5-02cbfd2cf962, championships: [ ... ], name: Tag Team, points: 5, preshow: true, result: 1, teams: [ ... ]
  • GSI_PK: EVENT#2024-03-03-aew-revolution, GSI_SK: SCORE#140#sam
    type: prediction, PK: USER#sam, SK: EVENT#2024-03-03-aew-revolution, mcorrect: 10, points: 140, predictions: { ... }
  • GSI_PK: STANDINGS#year, GSI_SK: SCORE#00140#sam
    type: standings, PK: USER#sam, SK: STANDINGS#year, belongsto: USER#sam, eplayed: 1, mcorrect: 10, points: 140

Data Access Patterns

In SQL schema design, it's possible to define a normalized schema up front, pretty much independently of the application's data access patterns; in DynamoDB single-table design this isn't really an option. I foresee the following data requests from the application:

  1. Get a list of events, most recent first, which may be filtered/limited/paginated in future as the number of events grows
  2. Get the league standings - a table of users' scores, highest first
  3. Get the details of an event (broadcast time, matches) so that a user can make a prediction (and if they've already made a prediction, return that too so they can amend it)
  4. The same view as 3, but when the event has concluded and been scored, also return the results and a league table based only on scores for this event.
  5. Get a user's details - their league standings and their points totals from previously-broadcast events.

These map pretty closely to the 'screens' of the application. What I'm aiming for is to get all of the data needed for one screen of the application in a single query. I don't want to make multiple calls to the database, whether independent or sequential, to service one screen - as this is where latency catches up with you.

With this schema design, I've aimed to utilise sort keys to sort:

  • the reverse-chronological order of events (most recent first)
  • the order of matches on an event card (in general, pre-show first, main event last)
  • the league standings (highest points first)

Footgun: Changeable Sort Orders

Sort keys on the main table can't be changed by a PUT or UPDATE operation; technically, the only way to 'change' one is a transaction that combines a DELETE with a PUT. This constraint doesn't apply to secondary index sort keys! So I essentially had to 'flip' the data schema around: any sortable data whose sort order can change has its sort key stored in GSI_SK. Anything static, such as a date, can remain on SK.
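
To make the distinction concrete, here's a sketch using the AWS SDK for JavaScript v3 document client. The table name PickEm is a placeholder of mine; the point is that GSI_SK is an ordinary attribute on the base table and can be updated in place, whereas changing SK would mean rewriting the item:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// GSI_SK is a regular attribute, so a sort-order change (here, a new score)
// is a plain update...
await ddb.send(new UpdateCommand({
  TableName: "PickEm", // placeholder table name
  Key: { PK: "USER#sam", SK: "STANDINGS#2024" },
  UpdateExpression: "SET GSI_SK = :sk",
  ExpressionAttributeValues: { ":sk": "SCORE#00165#sam" },
}));
// ...whereas "changing" the main table's SK would mean a transaction that
// deletes the old item and puts a replacement.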

User data, such as a user's predictions and their individual standing in the league, is stored within a partition keyed on the user's ID. This is then projected out into the secondary index so that queries can be made for a particular league's standings or a particular event.

Get a List of Events

The details of an event, such as its state and broadcast date, are stored in a single item. On the main table these items live as entries in a single partition, with PK being EVENT. Although this creates a 'hot' partition, it does mean that querying for just the list of events is simple to achieve. The sort key (SK) for each event is the lexicographically-sortable EVENT#date-name.

To retrieve these items in reverse-chronological order, specify ScanIndexForward=false in the query.

PK = EVENT, ScanIndexForward = false
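
As a sketch of what that query looks like with the AWS SDK for JavaScript v3 document client (the table name PickEm is a placeholder):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// All event items, most recent first: SK is EVENT#date-name, so reversing
// the lexicographic sort gives reverse-chronological order.
const { Items: events } = await ddb.send(new QueryCommand({
  TableName: "PickEm", // placeholder table name
  KeyConditionExpression: "PK = :pk",
  ExpressionAttributeValues: { ":pk": "EVENT" },
  ScanIndexForward: false,
}));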

Each event item swaps around the PK and SK values for the corresponding GSI_PK and GSI_SK attributes, effectively creating a partition key on the secondary index that is the unique ID for the event.

Get the League Standings

As a user's standing in the league lives within their partition, getting all the standings relies on the secondary index. Each item of the standings type has a GSI_PK of the league name, i.e. STANDINGS#year. The GSI_SK is where the heavy lifting is done: it's a combination of SCORE, the number of points, and the username. If user sam has 140 points, this is stored as SCORE#140#sam (padded, as explained below).

Earlier, I noted that strings in DynamoDB are lexicographically sorted. This is important, as it leads to a potential hazard.

Consider the list of unsorted numbers: 11, 2, 10, 1:

  • Sorted numerically: 1, 2, 10, 11
  • Sorted lexicographically: 1, 10, 11, 2

Avoiding this requires padding the numbers. Based on the point scoring system, the number of matches in a typical event and the number of expected events in a year, I've estimated that an extremely gifted user could score on the order of tens of thousands of points within a calendar year. Therefore, scores are zero-padded to five digits in the sort key: 10000, 01000, 00100, 00010.

Appending the username to the sort key adds further disambiguation in the event of a tie in scoring.

Index=GSI, GSI_PK = STANDINGS#2024, ScanIndexForward=false
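
Sketching this in TypeScript, assuming the same placeholder table name and an index named GSI (the scoreSortKey helper is mine, shown to illustrate the padding):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Zero-pad to five digits so lexicographic order matches numeric order.
const scoreSortKey = (points: number, username: string): string =>
  `SCORE#${String(points).padStart(5, "0")}#${username}`;
// scoreSortKey(140, "sam") === "SCORE#00140#sam"

// League standings for 2024, highest score first.
const { Items: standings } = await ddb.send(new QueryCommand({
  TableName: "PickEm",  // placeholder table name
  IndexName: "GSI",     // placeholder index name
  KeyConditionExpression: "GSI_PK = :pk",
  ExpressionAttributeValues: { ":pk": "STANDINGS#2024" },
  ScanIndexForward: false,
}));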

Get the Details of an Event

To predict, the user must first see which matches are on the card for an event. They may also want to revise existing predictions, so if the user has any, those should be returned too.

Each match is represented as a single item, with the PK being the event's unique ID and the SK being the match's UUID. Given that I'm already projecting entries with GSI_PK=EVENT#id out into the secondary index, I can piggyback on this to add the matches to this view. Bearing in mind the constraint around changeable sort orders on the main SK (note: the match card order can and does change!), I set the GSI_SK to be the match's sort order. The same lexicographical sorting technique used for points applies here too; sort orders are padded to support tens of matches per event.

There are also a bunch of attributes to list the championship(s) being defended, how many points are available, whether it's a pre-show match (boolean), and an array of teams. And, for when it is eventually scored, a result attribute storing the index of the winning team.

Actually, it's probably worth diving into the teams object. Wrestling matches can have multiple participants, optionally grouped into teams. Those teams may have names. One (or more!) of the participants, or teams, may be defending a championship. To satisfy the available combinations, a match consists of:

  • An array of teams, minimum 2
  • Each team has at least one member
    • A team may (optionally) have a name
    • A team may (optionally) be defending a championship

Let's illustrate with an example:

{
  "championships": [ "AEW World Tag Team Championship" ],
  "id": "05389aab-1f1b-44d0-9cfc-2e6d9cfd487f",
  "name": "Tornado Tag Team",
  "points": 25,
  "result": 0,
  "teams": [
    {
      "champion": true,
      "members": [ "Sting", "Darby Allin" ]
    },
    {
      "members": [ "Matthew Jackson", "Nicholas Jackson" ],
      "name": "The Young Bucks"
    }
  ]
}
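
For reference, the shape of a match item could be described with TypeScript types along these lines; the attribute names follow the schema above, but the type names themselves are just illustrative:

// A sketch of the match item's shape; field names match the attributes shown
// above, the interface names are mine.
interface Team {
  members: string[];        // at least one participant
  name?: string;            // optional team name, e.g. "The Young Bucks"
  champion?: boolean;       // true if this team is defending the championship
}

interface Match {
  id: string;               // match UUID
  name: string;             // match stipulation, e.g. "Tornado Tag Team"
  championships?: string[]; // title(s) on the line, if any
  points: number;           // points available for a correct prediction
  preshow?: boolean;        // whether the match is on the pre-show
  result?: number;          // index into teams[] of the winner, once scored
  teams: Team[];            // minimum of two teams
}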

Users' predictions are partitioned by the user's ID, so, similar to the standings, predictions are projected out into the secondary index. To do this I set GSI_PK=EVENT#id on items of type prediction. A prediction item also contains the user's score for that event. Again, given the changeable sort order, the score value is used as the GSI_SK. The secondary index sort key uses the same SCORE#value#username format as the league standings, with the points value padded to support hundreds of points. In order to associate a prediction/results item with a user, I add a belongsto attribute, which is crucial when trying to retrieve just a single user's predictions.

In the spirit of DynamoDB, with this schema I can get all of the data about a single event in a single query, just by looking at the partition key:

Index=GSI, GSI_PK = EVENT#id

In fact, once an event is in the scored state (i.e. the results are in!), this is exactly what I query for, with ScanIndexForward = false so that the highest scores come first.

When an event is still upcoming and accepting predictions, I just need to return the event details, the matches and that particular user's predictions. DynamoDB supports filter expressions, which are applied after the query reads items but before the results are returned. Using the belongsto attribute on the user's prediction/result item, it's possible to create a filter expression that says "if the belongsto attribute is present on an item, it must equal the user's ID". This still passes through the event detail and match items, but only passes through that particular user's predictions, rather than everyone's.

Index=GSI, GSI_PK = EVENT#id, Filter = attribute_not_exists(belongsto) Or belongsto = user
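
As a sketch with the v3 document client (again, PickEm and GSI are placeholder names):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Event details, its matches, and only this user's prediction item.
const { Items } = await ddb.send(new QueryCommand({
  TableName: "PickEm",  // placeholder table name
  IndexName: "GSI",     // placeholder index name
  KeyConditionExpression: "GSI_PK = :event",
  FilterExpression: "attribute_not_exists(belongsto) OR belongsto = :user",
  ExpressionAttributeValues: {
    ":event": "EVENT#2024-03-03-aew-revolution",
    ":user": "USER#sam",
  },
}));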

Get a User's Details

Given that I'm already partitioning the user's predictions and standings alongside an item of type user holding the user's name, this is a simple read from the main table. I still set ScanIndexForward=false so that date-based data is returned reverse-chronologically.

PK = USER#id, ScanIndexForward=false
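
A sketch of that read, splitting the returned items by their type attribute (the table and key values are placeholders):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// One query against the user's partition returns their profile, standings
// and predictions; the type attribute distinguishes the item kinds.
const { Items = [] } = await ddb.send(new QueryCommand({
  TableName: "PickEm", // placeholder table name
  KeyConditionExpression: "PK = :pk",
  ExpressionAttributeValues: { ":pk": "USER#sam" },
  ScanIndexForward: false,
}));

const profile = Items.find((item) => item.type === "user");
const standings = Items.filter((item) => item.type === "standings");
const predictions = Items.filter((item) => item.type === "prediction");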

To summarise, each requirement maps to a table, its parameters and a query operation:

  • Get Events (no params): main table; PK = EVENT, ScanIndexForward = false
  • Get Standings (year): GSI; GSI_PK = STANDINGS#year, ScanIndexForward = false
  • Get Event Listings and Predictions (event, user): GSI; GSI_PK = EVENT#id, Filter = attribute_not_exists(belongsto) Or belongsto = user
  • Get Event Listings, Scores and User's Predictions (event, user): GSI; GSI_PK = EVENT#id, ScanIndexForward = false
  • Get a User's Details (user): main table; PK = USER#id, ScanIndexForward = false

Where We're Going, We Don't Need Servers

Now that the data layer is in place, it's time to move over to the world of serverless! The next post will cover implementing the HTTP API for the application user interface, using AWS API Gateway and a Lambda function.