Single Table Design for the Pick'Em App
My next job for the Pick'Em app is to define the application's data model. In the previous post, I set up all the cloud infrastructure with AWS Amplify.
I chose to use DynamoDB for the backend storage, mainly to familiarise myself more with this style of database. DynamoDB is a scalable 'NOSQL' database, which in some respects can be considered a Key-Value hash map, with arbitrary runtime data attributes.
Traditional SQL databases specify a data schema up front, encoding entities and their relationships into tables, with primary and foreign keys as identifiers. Data storage is highly normalized, removing the need for data dpplication. At runtime, a user can execute arbitrary queries through constructs such as JOIN
s. Although there have been decades of optimization, SQL performance degrades with data size, where the joining becomes a bottleneck.
DynamoDB does away with all of this in the name of scalability. Using DynamoDB effectively requires putting quite a lot of ingrained SQL knowledge aside and paying close attention to the application's data access patterns. Indeed, the prefered practice in DynamoDB land is to represent the application's data with a single table! DynamoDB's scalability comes from using the key-value structure as a means of partitioning data - with the right partitioning strategy, reads are distributed and fast, even under high loads. This also comes with the benefit of having a much simpler query structure, as SQL-style data joins are not supported.
The resulting schema looks nothing like what you may be conventionally used to from a traditional SQL database. Before delving in to the thinking and schema I devised for the Pick'em app, a note of caution. I'm in no way an expert and I'm sure there are improvements that could be made here. Out of a personal curiosity, please do reach out if there's anything I should be aware of! As it stands though, I'm quite happy with this design.
Entities and Relationships and Keys (Oh My!)
NoSQL doesn't mean there aren't entities and relationships! They're still crucial to developing a workable schema.
In a nutshell: User
s will make Prediction
s for the Match
es in an Event
, and their Result
s will be aggregated into league Standing
s.
As I am aiming for a single table, I've opted for some generic names for some key attributes.
- The primary key, known as the partition key, is named
PK
- The sort key is named
SK
- The attribute that acts as the partition for the global secondary index is
GSI_PK
- The attribute that acts as the sort key for the global secondary index is
GSI_SK
. - Each data item will have an attribute called
type
defining which entity it represents.
Together, a partition and sort key comprise a composite key to uniquely identify a data item. It is possible to store many items with the same primary key (partition key) in DynamoDB, providing the sort key uniquely identifies them. Further, as expected by the name, the sort key identifies the order in which queries items are returned.
With generic names for the key attributes, it's natural that I would also use a generic structure for the values for these partition keys too. A DynamoDB convention is to use the ENTITY#ID
format for the key values. In some cases it may be required to go further, to ENTITY#ID#EXTRAS
. The purpose of this structure is to enforce consistency, which becomes particularly important when it comes to sorting. As strings, these will be lexicographically sorted by default.
The Database Schema
Working this back-to-front, here's the single-table design I came up with. It requires a main table and one global secondary index. After which, I'll cover how this satisfies each of the data access patterns. Trust me, it will make sense.
Main Table
Primary Key | Attributes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Partition Key: PK | Sort Key: SK | ||||||||||
USER#sam | USER | type | name | ||||||||
user | Sam Hogarth | ||||||||||
STANDINGS#year | type | GSI_PK | GSI_SK | belongsto | eplayed | mcorrect | points | ||||
standings | STANDINGS#year | SCORE#00140#sam | USER#sam | 1 | 10 | 140 | |||||
EVENT#2024-03-03-aew-revolution | type | GSI_PK | GSI_SK | mcorrect | points | predictions | |||||
prediction | EVENT#2024-03-03-aew-revolution | SCORE#140#sam | 10 | 140 | { ... } | ||||||
EVENT | EVENT#2024-03-03#aew-revolution | type | GSI_PK | GSI_SK | date | name | state | ||||
event | EVENT#2024-03-03#aew-revolution | EVENT | 2024/03/03 | AEW Revolution | scored | ||||||
EVENT#2023-03-03-aew-revolution | MATCH#03973653... | type | GSI_PK | GSI_SK | id | championships | name | points | preshow | result | teams |
match | EVENT#2023-03-03-aew-revolution | MATCH#01 | 03973653-e1f9-4262-86d5-02cbfd2cf962 | [...] | Tag Team | 5 | true | 1 | [ ... ] |
Global Secondary Index
Primary Key | Attributes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Partition Key: GSI_PK | Sort Key: GSI_SK | ||||||||||
EVENT#2023-03-03-aew-revolution | EVENT | type | PK | SK | date | name | state | ||||
event | EVENT | EVENT#2023-03-03-aew-revolution | 2024/03/03 | AEW Revolution | broadcast | ||||||
MATCH#01 | type | PK | SK | id | championships | name | points | preshow | result | teams | |
match | EVENT#2023-03-03-aew-revolution | MATCH#03973653... | 03973653-e1f9-4262-86d5-02cbfd2cf962 | [...] | Tag Team | 5 | true | 1 | [ ... ] | ||
SCORE#140#sam | type | PK | SK | mcorrect | points | predictions | |||||
prediction | USER#sam | EVENT#2024-03-03-aew-revolution | 10 | 140 | { ... } | ||||||
STANDINGS#year | SCORE#00140#sam | type | PK | SK | belongsto | eplayed | mcorrect | points | |||
standings | USER#sam | STANDINGS#year | USER#sam | 1 | 10 | 140 |
Data Access Patterns
In SQL schema design it's possible to initially define a normalized schema pretty much independent of the application's data access patterns, but in DynamoDB single table design this isn't really an option. I forsee the following data requests from the applications:
- Get a list of events, most recent first, which may be filtered/limited/paginated in future as the number of events grow
- Get the league standings - a table of user's scores, highest first
- Get the details of an event (broadcast time, matches) so that a user can make a prediction (and if they've already made a prediction, return that too so they can amend it)
- The same view as 3, but when the event has concluded and been scored, also return the results and a league table just based on scores for this event.
- Get a user's details - their league standings, their points totals from previously-broadcasted events.
These map pretty closely to the 'screens' of the application. What I'm aiming to do is get all of the data needed for one screen of the application in a single query. I don't want to have to make multiple calls to the databse, whether independent or sequential, to service one screen - as this is where latency catches up to you.
With this schema design, I've aimed to utilise sort keys to sort:
- the reverse-chronological order of events (most recent first)
- the order of matches on an event card (in general, pre-show first, main event last)
- the league standings (highest points first)
Footgun: Changeable Sort Orders
Sort keys on the main table can't be changed by a PUT
or UPDATE
operation. Technically, they can be changed by a transaction with a DELETE
and a PUT
combination. This constraint doesn't apply to secondary index sort keys! So, I essentially had to 'flip' the data schema around, any sortable data where the sort order can change has that sort key stored in GSI_SK
. Anything that's static, such as a date, can remain on SK
.
User data, such as a user's predictions and their individual standings in the league, are stored within a partition keyed on the user's ID. This is then projected out into the secondary index so that queries can be made for a particular league standings or event.
Get a List of Events
The details of an event, such as its state
and broadcaste date
are stored in a unique item. On the main table these live as entries in a single partition, with PK
being EVENT
. Although this creates a 'hot' partition, it does mean that querying for just the list of events is simple to achieve. The sort key (SK
) for each event is the lexicographically-sortable EVENT#date-name
.
To retrieve these items in reverse-chronological order, specify ScanIndexForward=false
in the query.
PK = EVENT, ScanIndexForward = false
Each event item swaps around the PK
and SK
values for the corresponding GSI_PK
and GSI_SK
attributes, effectively creating a partition key on the secondary index that is the unique ID for the event.
Get the League Standings
As a user's standing in the league lives within their partition, getting all the standings relies on the secondary index. Each item of the standings
type has a GSI_PK
of the league name, i.e STANDINGS#year
. The GSI_SK
is where the heavy-lifting is done. It's a combination of SCORE
, the number of points, and the username: if user sam
has 140
points, this is stored as SCORE#140#sam
.
Note earlier I mentioned that strings in DynamoDB are lexicographically sorted. This is important as it leads to a potential hazard.
Consider the list of unsorted numbers: 11, 2, 10, 1
:
- Sorted numerically:
1, 2, 10, 11
- Sorted lexicographically:
1, 10, 11, 2
Avoiding this requires applying padding to the numbers. Based on the point scoring system, the number of matches per typical event and the number of expected events in a year, I've estimated an extremely gifted user would be able to score a total amount of points in the order of tens-of-thousands within a calendar year. Therefore, scores would be stored in the sort key as: 10000
, 01000
, 00100
00010
.
Prefixing the username to the sort key adds further disambiguation in the event of a tie in scoring.
Index=GSI, GSI_PK = STANDINGS#2024, ScanIndexForward=false
Get the Details of an Event
To predict, the user must first see which matches are on the card for an event. They may also want to revise their existing predictions, so if a user has any existing predictions, they should be returned too.
Each match is represented as a single item, with a PK
being the event unique id, and the SK
being the unique uuid for the match. Given that I'm already projecting out into the secondary index entries that have GSI_PK=EVENT#id
, I can piggyback on this to add the matches to this view. Also bearing in mind the constraint around changeable sort orders on the main SK
(note, the match card order can and does change!), I can also set the GSI_SK
to be the match's sort order. The same lexicographical sorting technique for points is applicable here too. Sort orders are padded to support tens of matches per event.
There are also a bunch of attributes to list the championship(s) being defended, how many points are available, whether it's a pre-show match (boolean), and an array of teams
. And, for when it is eventually scored, a result
attribute storing the index of the winning team.
Actually, it's probably worth diving into the teams
object. Wrestling matches can have multiple participants, optionally grouped into teams. Those teams may have names. One (or more!) of the participants, or teams, may be defending a championship. To satisfy the available combinations, a match consists of:
- An array of teams, minimum 2
- Each team has at least one member
- A team may (optionally) have a name
- A team may (optionally) be defending a championship
Let's illustrate with an example:
{
"championships": [ "AEW World Tag Team Championship" ],
"id": "05389aab-1f1b-44d0-9cfc-2e6d9cfd487f",
"name": "Tornado Tag Team",
"points": 25,
"result": 0,
"teams": [
{
"champion": true,
"members": [ "Sting", "Darby Allin" ]
},
{
"members": [ "Matthew Jackson", "Nicholas Jackson" ],
"name": "The Young Bucks"
}
]
}
User's predictions are partitioned by the user's ID, so similar to the standings, predictions are projected out into the secondary index. To do this I set the GSI_PK=EVENT#id
on items of type prediction
. A prediction item also contains the user's score for that event. Again, given the changeable-sort-order, the score value is used as the GSI_SK
. The secondary index sort key uses the same SCORE#value#username
format as the league standings, with the points value supporting hundreds of points. In order to associate a prediction/results item to a user, I add a belongsto
attribute, which is crucial when trying to retrieve just a single user's prediction.
In the spirit of DynamoDB, with this schema I can get all of the data about a single query just by looking at the partition key:
Index=GSI, GSI_PK = EVENT#id
In fact, at the point where an event is in the scored
state (i.e. the results are in!), this is what I query for, with ScanIndexForward = false
so that the highest-points scores come first.
When an event is still upcoming
and accepting predictions, I just need to return the event details, matches and that particular user's predictions. DynamoDB supports filter expressions, which are executed after the Scan operation. Using the belongsto
attribute on the user's prediction/result item, it's possible to create a filter expression that says "on a result where the belongsto
attribute is present, it must equal the user's id". This will still pass through the event detail and match entries, but only then pass through a particular user's predictions, rather than everyone's.
Index=GSI, GSI_PK = EVENT#id, Filter = attribute_not_exists(belongsto) Or belongsto = user
Get a User's Details
Given that I'm already partitioning the user's predictions and standings, alongside an item of type user
for a user's name, this is a simple read from the main table. I still set ScanIndexForward
so that date-based data are returned reverse-chronologically.
PK = USER#id, ScanIndexForward=false
Requirement | Params | Table | Operation |
---|---|---|---|
Get Events | - | main | PK = EVENT, ScanIndexForward=false |
Get Standings | year | GSI | GSI_PK = STANDINGS#year, ScanIndexForward=false |
Get Event Listings and Predictions | event, user | GSI | GSI_PK = EVENT#id, Filter = attribute_not_exists(belongsto) Or belongsto = user |
Get Event Listings, Scores and User's Predictions | event, user | GSI | GSI_PK = EVENT#id, ScanIndexForward=false |
Get a User's Details | user | main | PK = USER#id, ScanIndexForward=false |
Where We're Going, We Don't Need Servers
Now that the data layer is in place, it's time to move over to the world of serverless! The next post will cover implementing the HTTP API for the application user interface, using AWS API Gateway and a Lambda function.