During a streamed poker game, a show will collect a number of metrics related to player performance and style. Typical metrics include:
- Cumulative winnings - The cumulative winnings (or losses) of a given player at the conclusion of the stream.
- Chip count - The size of a player's stack at the conclusion of a stream.
- Pre-flop raise - The frequency at which a player elects to raise preflop.
- VPIP (voluntarily put money in pot) - How frequently a player voluntarily enters a pot; for example, a player who voluntarily puts chips in on 30 of 120 dealt hands has a VPIP of 25%.
This project currently collects and aggregates these metrics for all players from the most popular operator (HCL), providing some insight into the on-stream performance of players over time.
This project deploys a number of microservices to coordinate the collection of these statistics:
- Pipeline - Routes events between services.
- Ingest - Queries for new video assets.
- Asset Ripper - Downloads and slices streams into individual frames for analysis.
- Frame Analysis - Detects frames of interest and extracts statistics.
- Inventory - Creates an API and useful read model for the statistics.
- Client - A front-end for consuming the statistics.
The ingest API is responsible for querying and dispatching new video assets into the pipeline. The service maintains a minimal read model to keep track of videos that have already been discovered.
To ensure downstream services have access to the full range of metadata and encoded versions of the asset, a fixed duration of time must pass after publishing before the asset is considered discoverable.
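A minimal sketch of how that discoverability check might look, assuming a simple delay constant and video shape (both are illustrative, not the service's actual implementation):

// Sketch only: the two-hour delay and DiscoveredVideo shape are assumptions.
const DISCOVERY_DELAY_MS = 2 * 60 * 60 * 1000;

type DiscoveredVideo = {
  videoId: string;
  publishedAt: string; // ISO timestamp from the operator's channel feed
};

// A video is dispatched into the pipeline only when it is new to the read model
// and has been published for at least the fixed delay.
const isDiscoverable = (
  video: DiscoveredVideo,
  alreadySeen: Set<string>,
  now = Date.now(),
): boolean =>
  !alreadySeen.has(video.videoId) &&
  now - new Date(video.publishedAt).getTime() >= DISCOVERY_DELAY_MS;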
This service is responsible for downloading a segment of the target show and slicing out a number of individual frames for further analysis. Commands dispatched to this service can be either a video ID or a URL. Once assets are stored, their location and metadata are recorded.
This service is deployed with a custom Dockerfile that brings in some additional dependencies:
- yt-dlp - A Python package written to download YouTube videos and metadata.
- ffmpeg - The swiss army knife of video, used to extract individual frames.
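As a rough illustration of how these tools fit together, the service could shell out to them along these lines; the flags, file names and one-frame-per-30-seconds rate are assumptions for the sketch, not the real invocation:

import { execFileSync } from "node:child_process";

// Sketch only: format selection, file names and frame rate are illustrative.
const ripFrames = (videoUrl: string, workDir: string) => {
  // Download the asset with yt-dlp, preferring an mp4 container.
  execFileSync("yt-dlp", ["-f", "best[ext=mp4]", "--output", `${workDir}/asset.mp4`, videoUrl]);

  // Slice the download into individual frames using ffmpeg's fps filter.
  execFileSync("ffmpeg", [
    "-i", `${workDir}/asset.mp4`,
    "-vf", "fps=1/30",
    `${workDir}/frame_%04d.png`,
  ]);
};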
This service is dispatched commands to perform analysis on individual frames. The following takes place during analysis:
- A frame is taken as input.
- The service was tested with 35 random samples from the corpus.
- The samples were pre-labelled or validated at each stage of the analysis with tests specifically suffixed with "DataBuilder".
- These tests assisted with an exploratory approach to understanding the data, and are thus distinct from the other tests, which have much more focused test cases.
- Frames are preprocessed (see the sketch after this list).
- The center area is cropped.
- A binary threshold is applied to clear up noise.
- OCR is applied to classify the frame as interesting or not.
- This service runs a classification process as a cost saving measure, since detailed analysis with the more accurate Textract service is costly.
- The OCR document is fuzzy matched to certain trigger words.
- If classified as interesting, Textract is used for a more accurate OCR.
- The results include words, tables and geometry of detected words.
- The geometry of certain words is used to locate the arrows indicating whether a figure represents a win or a loss.
- A traditional algorithm is applied to detect whether a shape is an up or down arrow.
- The extracted statistics are recorded.
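The preprocessing and classification steps might look something like the following sketch, assuming sharp for image manipulation; the crop region, threshold value and trigger words are illustrative, and the real service's fuzzy-matching approach may differ:

import sharp from "sharp";

// Sketch only: crop region, threshold and trigger words are assumptions.
const TRIGGER_WORDS = ["winnings", "stack"];

// Crop the centre of a 1920x1080 frame and apply a binary threshold to clear up noise.
const preprocess = (frame: Buffer) =>
  sharp(frame)
    .extract({ left: 480, top: 270, width: 960, height: 540 })
    .greyscale()
    .threshold(160)
    .toBuffer();

// Minimal Levenshtein distance, used to fuzzy match OCR tokens against trigger words.
const distance = (a: string, b: string): number => {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return d[a.length][b.length];
};

// A frame is classified as interesting when any OCR token sits close to a trigger word.
const isInteresting = (ocrTokens: string[]) =>
  ocrTokens.some((token) =>
    TRIGGER_WORDS.some((word) => distance(token.toLowerCase(), word) <= 2),
  );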
The data model fits into the following hierarchy and relationships:
The inventory builds a read model using a DynamoDB table. On-demand pricing keeps costs minimal given the low volume of writes, while a CDN with a high TTL can protect the read workload.
The schema uses a single-table design, to support one-shot fetching of related entities. Partitions are designed along the axis of operator, show, player and stat type to support the following queries:
- All shows for a given operator.
- All data for a given show.
- All data for a given player.
- All stats of a given type.
Visually, each partition is organised according to these access patterns in the following way:
The key schema to build these partitions is documented below:
export type ShowStorage = Show & {
  entity_type: "show";
  pk: `operator#${OperatorId}`;
  sk: `show#date#${Date}#slug#${ShowId}#`;
  gsi1pk: `slug#${ShowId}`;
  gsi1sk: "show#";
};

type PlayerAppearanceStorage = PlayerAppearance & {
  entity_type: "player_appearance";
  pk: `player#${PlayerId}`;
  sk: `appearance#slug#${ShowId}#`;
  gsi1pk: `slug#${ShowId}`;
  gsi1sk: `appearance#player#${PlayerId}#`;
};

export type StatStorage = Stat & {
  entity_type: "player_stat";
  pk: `player#${PlayerId}`;
  sk: `stat#stat_type#${StatType}#slug#${ShowId}#`;
  gsi1pk: `slug#${ShowId}`;
  gsi1sk: `stat#stat_type#${StatType}#player#${PlayerId}#`;
  gsi2pk: `stat_type#${StatType}`;
  gsi2sk: `stat#player#${PlayerId}#slug#${ShowId}#`;
};
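As an illustration of the first access pattern, fetching all shows for a given operator becomes a single query against the base table; the table name and the use of the AWS SDK document client are assumptions for the sketch:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Sketch only: the "inventory" table name is an assumption.
const showsForOperator = async (operatorId: string) => {
  const { Items } = await client.send(
    new QueryCommand({
      TableName: "inventory",
      KeyConditionExpression: "pk = :pk AND begins_with(sk, :prefix)",
      ExpressionAttributeValues: { ":pk": `operator#${operatorId}`, ":prefix": "show#" },
    }),
  );
  return Items ?? [];
};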
Some of the most interesting insights come from the aggregate of data points spanning the whole dataset. For the volume of data produced by a single operator, each partition could grow by an order of magnitude before impacting query performance. Querying for all data points and aggregating on demand may eventually prove not to scale, but it works for the volume of data expected in the foreseeable future.
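For example, a cumulative-winnings leaderboard can be produced by querying the stat-type partition on the second global secondary index and reducing in memory. The index and table names, the stat-type value and the player_id/value attributes are assumptions beyond what the key schema above documents:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Sketch only: index/table names and item attributes are assumptions.
const winningsLeaderboard = async () => {
  const { Items = [] } = await client.send(
    new QueryCommand({
      TableName: "inventory",
      IndexName: "gsi2",
      KeyConditionExpression: "gsi2pk = :pk",
      ExpressionAttributeValues: { ":pk": "stat_type#cumulative_winnings" },
    }),
  );

  // Aggregate every data point for the stat type into a per-player total.
  const totals = new Map<string, number>();
  for (const item of Items) {
    totals.set(item.player_id, (totals.get(item.player_id) ?? 0) + item.value);
  }
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
};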
The client is an SPA deployed to an S3 bucket using the following key libraries:
- swr for data fetching.
- Next.js using the export bundling mode (SPA with no SSR or server components).
- Chakra UI as a component library.
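A typical data-fetching hook in the client looks something like the following; the endpoint path and response shape are assumptions for the sketch:

import useSWR from "swr";

// Sketch only: the inventory endpoint and Leaderboard shape are assumptions.
type Leaderboard = { player: string; value: number }[];

const fetcher = (url: string) => fetch(url).then((res) => res.json());

export const useLeaderboard = (statType: string) => {
  const { data, error, isLoading } = useSWR<Leaderboard>(`/api/stats/${statType}`, fetcher);
  return { leaderboard: data, error, isLoading };
};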
The client also ships with a debug mode that can be switched on globally to show contextually relevant information, helping to debug and observe behaviour in the app. It shows responses from the inventory API for the current page and links directly to the logs of services, filtered by the content asset you are looking at:
This mode is activated by spamming the shift key in rapid succession, inspired by the activation of the Windows XP accessibility feature "sticky keys".
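A rough sketch of how that activation could be wired up as a React hook; the five-press threshold and two-second window are illustrative assumptions:

import { useEffect, useState } from "react";

// Sketch only: the press count and time window are assumptions.
export const useDebugMode = (presses = 5, windowMs = 2000) => {
  const [enabled, setEnabled] = useState(false);

  useEffect(() => {
    let timestamps: number[] = [];
    const onKeyDown = (event: KeyboardEvent) => {
      if (event.key !== "Shift") return;
      const now = Date.now();
      timestamps = [...timestamps, now].filter((t) => now - t <= windowMs);
      if (timestamps.length >= presses) {
        setEnabled((current) => !current);
        timestamps = [];
      }
    };
    window.addEventListener("keydown", onKeyDown);
    return () => window.removeEventListener("keydown", onKeyDown);
  }, [presses, windowMs]);

  return enabled;
};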
Where possible, all components use on-demand pricing (DynamoDB, ECS, Fargate), to keep costs low when infrastructure is idle.
This project deploys to AWS using infrastructure-as-code via a number of CDK Stacks. CDK provides constructs at varying degrees of abstraction for orchestrating the creation of AWS services, using CloudFormation templates as an intermediary.
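As an example of the level of abstraction involved, the inventory table from the read model above can be declared in a few lines of CDK; the stack and construct IDs are illustrative:

import { Stack, StackProps } from "aws-cdk-lib";
import { AttributeType, BillingMode, Table } from "aws-cdk-lib/aws-dynamodb";
import { Construct } from "constructs";

// Sketch only: construct IDs are illustrative; the key attributes match the schema above.
export class InventoryStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Single-table read model with on-demand pricing to keep idle costs low.
    const table = new Table(this, "InventoryTable", {
      partitionKey: { name: "pk", type: AttributeType.STRING },
      sortKey: { name: "sk", type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
    });

    table.addGlobalSecondaryIndex({
      indexName: "gsi1",
      partitionKey: { name: "gsi1pk", type: AttributeType.STRING },
      sortKey: { name: "gsi1sk", type: AttributeType.STRING },
    });
  }
}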
CloudWatch and X-Ray provide the foundation for logging and traces respectively. The "Trace Map" feature provides a useful visualisation for tracking down the root cause of errors as commands and events propagate through services:
The results are a number of leaderboards refreshed daily, with the ability to drill down on the players and shows that are interesting. Currently hosted at poker.sam152.com.