No matter the project, from the smallest microservice to a large monolith, data is a key consideration. Even though schemaless NoSQL databases have become more common, data design has not become any less important.
A data store can take on many different faces and have many different implementations; however, without sufficient consideration it can become a costly part of the system.
Draw it out, it will save a lot of pain later.
I am wary of using the word database throughout this post as it implies a single instance. Your data can take many different shapes depending on what is most applicable, so “data store” is probably more appropriate.
This blog is written from the perspective of starting BetaBud, which uses a single DynamoDB table for now...
This blog attempts to be as transparent as possible about how BetaBud has been implemented. Here we will lift the lid on the data implementation, down to the level of detail you usually only see in tutorials.
Take a look here for the architecture design of BetaBud.
Velocity, Variety, and Volume are three important factors to consider when creating your data store.
Velocity is the rate of reads and writes to the store. Are they roughly proportional, or are there far more reads than writes? Consider the rate at which datasets change and are queried. Perhaps not all data is read frequently, or some of it never changes once written.
Variety alludes to how structured the data is: whether the structure is highly defined and highly related, or closer to a large BLOB. Additionally, consider whether the store needs to be aware of any nested data.
Volume is both the size of the total dataset and the size of each individual record.
The first step to structuring data is to think about the patterns for both reading and writing. By patterns, I mean: what form will the commands and queries take? Will we always be selecting a list of the same object type? Will we be selecting individual pieces of data? What values will we have to query with? What will a typical write command look like?
All of these considerations should focus on the most frequent operations: design for the 99% scenarios, then worry about the 1% cases later.
Let's consider the fundamental scenarios for the BetaBud minimum viable product (MVP) and the data we have access to at each point:
Type | Description | Available Data |
---|---|---|
Query | Querying the main feed - https://betabud.io | We know it is a list of forms |
Query | Querying your feed - https://betabud.io/myforms | We also know it is a list of forms but they belong to a given user |
Command | Creating a form - https://betabud.io/create | The given user and all the data about a not yet saved form |
Query | Loading a form | The user and the form ID from either the main feed or own feed data |
Command | Responding to a form | The user and the form ID from loading the page, as well as all of the responses |
Query | Loading the user’s data - for instance the number of tokens they have | The given user |
As mentioned above, I acknowledge this store does not need to be a silver bullet that remains constant throughout and is used exclusively. The store will change, and growth will allow it to be broken down and reconsidered.
Yet, with this concise set of initial MVP requirements, all could be achieved with a single store.
Honestly, this was a biased call from the start: I was always inclined to use DynamoDB. Perhaps a relational database would have been easier to query and would have reduced duplication (something I will explore later), but I wanted a store that was quick to set up and configure, easy to maintain, and with very low latency.
All the data above is highly related, so it made sense to store it all together in a single table. The table uses the generic names “PK” and “SK” for the partition key and sort key attributes. This means they can take any form without creating confusion.
This single table also reduces the complexity when interacting with the data store, especially when using batch read/write operations.
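To make that concrete, here is a sketch of what the table definition could look like. BetaBud's actual stack isn't stated in this post, so the AWS SDK for JavaScript (v3), the table name "BetaBud", and on-demand billing are all assumptions; only the generic "PK" and "SK" attribute names come from the design above.

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Single table with generic partition ("PK") and sort ("SK") keys,
// so any item type can be stored without the key names implying one meaning.
await client.send(
  new CreateTableCommand({
    TableName: "BetaBud", // hypothetical table name
    AttributeDefinitions: [
      { AttributeName: "PK", AttributeType: "S" },
      { AttributeName: "SK", AttributeType: "S" },
    ],
    KeySchema: [
      { AttributeName: "PK", KeyType: "HASH" },  // partition key
      { AttributeName: "SK", KeyType: "RANGE" }, // sort key
    ],
    BillingMode: "PAY_PER_REQUEST", // on-demand suits an MVP with unknown traffic
  })
);
```

In practice the table would more likely be defined through infrastructure-as-code rather than an SDK call, but the shape is the same either way.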
A reason to consider splitting this table later could be a vastly differing rate of reads and writes per item type; perhaps one type needs higher provisioned throughput than the rest, or a single type requires a Global Secondary Index that the others don't. At the MVP stage this isn't the case.
Based on the above patterns, here is what the primary keys look like.
A form header item that includes any data required for the main feed
PK | SK |
---|---|
HEADER | 1712009988#username#formIdentifier |
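As a rough sketch of how the main feed could then be read (assuming the v3 document client and the hypothetical "BetaBud" table name from the earlier sketch), a single Query against the fixed "HEADER" partition is enough. The page size is illustrative.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Main feed: every form header item lives under the single "HEADER" partition.
const { Items: feed } = await doc.send(
  new QueryCommand({
    TableName: "BetaBud",
    KeyConditionExpression: "PK = :pk",
    ExpressionAttributeValues: { ":pk": "HEADER" },
    ScanIndexForward: false, // descending sort key = newest forms first
    Limit: 25, // hypothetical page size
  })
);
```

Because the epoch-seconds prefix of the sort key is fixed width, lexicographic ordering of the string lines up with chronological ordering, which is what makes the descending query return the newest forms first.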
The form body item includes all data about a form such as the different questions
PK | SK |
---|---|
username | 1712009988#formIdentifier |
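One way the /myforms feed could be served from these keys, and I'm inferring this from the key design rather than from anything stated above, is to query the body items under the user's own partition. Again a sketch with the v3 document client and placeholder values.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const username = "some-user"; // placeholder: the logged-in user

// "My forms": all body items for this user, newest first.
const { Items: myForms } = await doc.send(
  new QueryCommand({
    TableName: "BetaBud",
    KeyConditionExpression: "PK = :pk",
    ExpressionAttributeValues: { ":pk": username },
    ScanIndexForward: false, // SK is prefixed with an epoch timestamp
  })
);
```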
The form response item is created when another user responds to a form
PK | SK |
---|---|
RESPONSE#username | formIdentifier#respondeeUsername |
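Responding to a form could then be a single write of a response item. The variable names, the answer payload, and the condition expression guarding against duplicate responses are illustrative assumptions; the post only defines the key shape, and I'm reading the username in the partition key as the form owner's.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Placeholder values for the form owner, the form, the responder and their answers.
const formOwner = "form-owner";
const formId = "formIdentifier";
const respondee = "respondee-user";
const answers = { q1: "Love it", q2: 4 };

await doc.send(
  new PutCommand({
    TableName: "BetaBud",
    Item: {
      PK: `RESPONSE#${formOwner}`,
      SK: `${formId}#${respondee}`,
      answers,
    },
    // Optional guard: reject a second response from the same user to the same form.
    ConditionExpression: "attribute_not_exists(PK)",
  })
);
```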
A user token item that contains other data about the user, such as tokens earned and spent
PK | SK |
---|---|
USERTOKEN | username |
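Loading the user's data becomes an exact-key lookup, again sketched with the v3 document client and a placeholder username.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch a single user's token item by its exact key.
const { Item: userTokens } = await doc.send(
  new GetCommand({
    TableName: "BetaBud",
    Key: { PK: "USERTOKEN", SK: "some-user" }, // "some-user" is a placeholder username
  })
);
```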
An entity relationship diagram showing how the different entities can be stored in one table
Time to Live (TTL) is a clever piece of DynamoDB functionality. At the table level, you nominate a numeric attribute holding an epoch timestamp to act as the TTL, and once that time has elapsed, DynamoDB deletes the item automatically (though not necessarily immediately).
Currently, there is only a TTL applied to the form header record within BetaBud. The absence of this header indicates that a form is archived and won't show in the main feed. This is currently set to happen after 28 days.
This in-built trigger functionality reduces the need for any scheduled process that would need to be implemented by another resource.
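For reference, here is roughly what that looks like in practice: enabling TTL is a one-off table setting pointing at a numeric epoch attribute, and each header item is stamped with its expiry when written. The attribute name "expiresAt" is my assumption; the post only states that headers expire after 28 days.

```typescript
import { DynamoDBClient, UpdateTimeToLiveCommand } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const doc = DynamoDBDocumentClient.from(client);

// One-off: tell the table which attribute holds the expiry time (epoch seconds).
await client.send(
  new UpdateTimeToLiveCommand({
    TableName: "BetaBud",
    TimeToLiveSpecification: { AttributeName: "expiresAt", Enabled: true },
  })
);

// When creating a form header, stamp it with an expiry 28 days out.
const twentyEightDays = 28 * 24 * 60 * 60;
await doc.send(
  new PutCommand({
    TableName: "BetaBud",
    Item: {
      PK: "HEADER",
      SK: "1712009988#username#formIdentifier", // illustrative key from the tables above
      expiresAt: Math.floor(Date.now() / 1000) + twentyEightDays,
    },
  })
);
```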
As mentioned, this is the MVP implementation of BetaBud. A data store is an evolving ecosystem as much as your front-end design is. The more consideration you give it upfront, the less pain you will have in the long run.
There is always the opportunity to perform a data migration later on, but let's try to reduce the headaches for our future selves with that initial design.