2) Data Ingestion – Raw & Bronze Layers

Let’s Get On With It

I don’t want this case study to be just blocks of code. The repo for this project is public, with the ingestion notebooks here. These posts are more snippets of my thought process and how I have applied the concepts to this particular FPL project.

Here, I will ingest the historic player gameweek stats into bronze Delta tables, as well as build a script that can be run after each gameweek to extract the most recent stats from the FPL API. Getting this data from source systems and putting it in one place allows us to begin to cleanse and transform the data into the silver layer.

You can go back to the case study here.

A word on Medallion Architecture

I’m not saying anything that hasn’t been said before; many people have opinions on ‘medallion architecture’, and I don’t need to make a case for or against it. Medallion architecture, as defined in articles such as What is the medallion lakehouse architecture?, is a rebrand of the way data engineers, architects and DBAs have been managing data for decades. It is a logical flow of how data should progress through your platform, growing in maturity, usability and trust.

  • Put your raw, dirty data somewhere (Bronze)
  • Clean it up a bit with ETL so you don’t have to do it every time (Silver)
  • Model the data with business logic ready for use by end-users (Gold)

Of course, this may not be enough. The ‘traditional’ medallion architecture has all three layers living in Delta tables, but every use case is different, and I prefer the concept of medallion over a strict implementation.
What if you want to store the original files and formats? Add another layer.
What if there’s a step between silver and gold that uses silver data but isn’t quite gold yet? Add another layer.

Like in most things data, understanding the concepts and having the foundational knowledge is infinitely more transferable than pigeon-holing yourself into one method, tool or software. If you can justify why another layer is needed, don’t get too hung up on whether or not it follows documentation – just get on with it!

Because I am creating JSON files after each gameweek using the FPL API, I need somewhere to put them – so I am adding a ‘raw’ layer before bronze to dump CSV and JSON files. (I’ve even added a ‘feature’ layer between silver and gold later on for the ML aspect.)

Historic and Incremental Loading

Every time I design a pipeline, one of the first things I think about is historic and incremental loading. The historic pipeline ingests and processes all the data and overwrites the target tables; it runs initially and then whenever we need a full data refresh. The incremental pipeline only processes a subset and merges it into the tables, giving faster, less compute-intensive runs on a more regular basis.

I always start by designing the historic pipeline, as I need it to populate the data initially. Then I modularise the majority of the process into reusable functions (think of the DRY Principle), with the exceptions being the ingestion source and the overwrite/append/merge into the target tables. I handle those by adding a ‘PROTOCOL’ variable to the notebook/pipeline so I can switch protocols easily.
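
To make that concrete, here is a minimal sketch of the pattern, assuming a Databricks notebook where spark is already in scope. The function, table and parameter names are illustrative rather than lifted from the repo:

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame

PROTOCOL = "HIST"  # or "INCR" for an incremental run

def write_to_bronze(df: DataFrame, table_name: str, key_cols: list) -> None:
    """Overwrite the target on a historic run, merge into it on an incremental run."""
    if PROTOCOL == "HIST":
        df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    else:
        target = DeltaTable.forName(spark, table_name)  # `spark` provided by the notebook
        condition = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
        (target.alias("t")
               .merge(df.alias("s"), condition)
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())
```

Everything upstream of this call (reading, light shaping) stays in shared functions, so the only thing the PROTOCOL flag really changes is the source selection and this write.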

Historic Gameweek Stats CSVs

For seasons 2016/17 to 2023/24, I have downloaded the raw CSVs from Vaastav’s Repo. This gives me every player’s stats for every game since the 2016/17 season. From it, I can infer the fixtures, results, goalscorers, team stats etc. Pretty cool, right?

I have also downloaded all of the players_raw data, which contains all the data about player names and teams. It has overall season stats too, but, as I say, I can infer all of this from the individual gameweek stats above.

I have saved all the CSVs in a volume in the fpl_raw schema (before the bronze layer), along with all of the API JSON results (below). We only really need to do this once, as part of a big historic load, so there is no need for an incremental version of this step.
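
For illustration only, landing those downloaded CSVs in the raw volume could look something like this; the volume path and folder layout are assumptions to show the idea, not the exact paths used in the project:

```python
import pathlib
import shutil

RAW_ROOT = "/Volumes/fpl/fpl_raw/landing"      # assumed Unity Catalog volume path
local_dir = pathlib.Path("downloads/vaastav")  # wherever the CSVs were saved locally

for csv_path in local_dir.rglob("*.csv"):
    season = csv_path.parent.name              # e.g. "2016-17" (assumed local layout)
    dest = pathlib.Path(RAW_ROOT) / "historic_csv" / season / csv_path.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(csv_path, dest)
```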

API Ingestion

This really helpful blog documents all the different endpoints that FPL exposes. For summary data, ‘bootstrap-static’ contains loads of information about teams, players etc. I want something a bit more granular each week, so I get the data from the ‘element-summary’ endpoint (while also ingesting bootstrap-static, because why not?). ‘element-summary’ contains the granular week-by-week stats for every player. The fixtures endpoint is also useful: fixture timings often change due to TV listings, so we want to keep the most up-to-date record of future fixtures.
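
As a rough sketch, hitting those three endpoints looks something like this; the base URL and endpoint names are the publicly documented ones, while the helper function names are my own:

```python
import requests

BASE_URL = "https://fantasy.premierleague.com/api"

def get_bootstrap_static() -> dict:
    # Summary data: teams, players ("elements"), events (gameweeks), etc.
    return requests.get(f"{BASE_URL}/bootstrap-static/", timeout=30).json()

def get_fixtures() -> list:
    # Full fixture list, including kick-off times that can shift for TV scheduling.
    return requests.get(f"{BASE_URL}/fixtures/", timeout=30).json()

def get_element_summary(player_id: int) -> dict:
    # Granular week-by-week history for a single player.
    return requests.get(f"{BASE_URL}/element-summary/{player_id}/", timeout=30).json()
```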

I did add a protocol here, to ingest either the stats for every gameweek (PROTOCOL = ‘HIST’) or just the most recent gameweek (PROTOCOL = ‘LATEST’), using the fixtures and events endpoints to work out which was the most recent gameweek round.
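
One way to resolve “the most recent gameweek” for the LATEST protocol is to lean on the events list that bootstrap-static returns. The helper below is a sketch of that idea, relying on the finished flag the API exposes per event; treat the exact field usage as my assumption rather than the repo’s logic:

```python
def latest_finished_gameweek(bootstrap: dict):
    # 'events' is the list of gameweeks in bootstrap-static; each entry carries
    # a 'finished' flag once the round is complete.
    finished = [event["id"] for event in bootstrap["events"] if event.get("finished")]
    return max(finished) if finished else None

# e.g. gameweek = latest_finished_gameweek(get_bootstrap_static())
```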

These JSON files are stored in the raw data layer in a {SEASON}/{GAMEWEEK} folder. So each week we have bootstrap_static, fixtures and player_gameweek_stats.
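
A small sketch of how each payload could be landed under that {SEASON}/{GAMEWEEK} structure; the volume root and the “GW” folder prefix are assumptions, while the file names follow the ones mentioned above:

```python
import json
import pathlib

def save_raw_json(payload, season: str, gameweek: int, name: str,
                  root: str = "/Volumes/fpl/fpl_raw/landing") -> None:
    # e.g. /Volumes/fpl/fpl_raw/landing/2024-25/GW7/fixtures.json (paths assumed)
    folder = pathlib.Path(root) / season / f"GW{gameweek}"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{name}.json").write_text(json.dumps(payload))

# e.g. save_raw_json(get_fixtures(), "2024-25", 7, "fixtures")
```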

Raw to Bronze

Ingesting the raw CSV and JSON files is fairly straightforward. As the bronze layer is an uncleansed version of the source data, the task here is just to build a slick pipeline to read the files and write them into our bronze layer in either a HIST way (all the data overwriting the tables) or an INCR way (just the most recent gameweek).
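
A minimal sketch of the read side, again assuming a Databricks notebook with spark in scope and the illustrative volume layout used earlier. The HIST branch globs every gameweek folder, INCR reads only the latest one, and the write_to_bronze() helper sketched earlier handles the overwrite/merge (table and column names below are examples, not the repo’s):

```python
RAW_ROOT = "/Volumes/fpl/fpl_raw/landing"  # assumed raw volume path
PROTOCOL = "INCR"                          # or "HIST" for a full reload

def read_player_gameweek_json(season: str, gameweek=None):
    # HIST (or no gameweek supplied): read every gameweek folder for the season.
    # INCR: read only the latest gameweek folder identified earlier.
    if PROTOCOL == "HIST" or gameweek is None:
        path = f"{RAW_ROOT}/{season}/*/player_gameweek_stats.json"
    else:
        path = f"{RAW_ROOT}/{season}/GW{gameweek}/player_gameweek_stats.json"
    return spark.read.option("multiLine", "true").json(path)

# e.g. write_to_bronze(read_player_gameweek_json("2024-25", gameweek),
#                      "fpl_bronze.player_gameweek_stats", ["element", "round"])
```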

I did have some problems with reading arrays inside nested JSON into a dataframe/Delta table column, so I have slightly cheated on the ‘pure’ medallion approach of leaving the data exactly as it comes from source: I explicitly set the schema of a few of the bronze tables and exploded the data a little. Don’t report me to the data police!
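
Where the nested arrays caused trouble, the workaround looks roughly like this: declare an explicit schema for the fields you care about and explode the array into one row per gameweek. This assumes the saved player_gameweek_stats files keep the element-summary shape with its ‘history’ array, and the schema shown is a trimmed-down assumption, not the full one used in the repo:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

RAW_ROOT = "/Volumes/fpl/fpl_raw/landing"  # assumed raw volume path, as above

# Trimmed-down schema for the 'history' array in each element-summary payload.
history_schema = StructType([
    StructField("element", IntegerType()),       # player id
    StructField("round", IntegerType()),         # gameweek number
    StructField("total_points", IntegerType()),
    StructField("minutes", IntegerType()),
    StructField("kickoff_time", StringType()),
])
summary_schema = StructType([StructField("history", ArrayType(history_schema))])

player_gw_df = (
    spark.read.schema(summary_schema)
         .option("multiLine", "true")
         .json(f"{RAW_ROOT}/2024-25/*/player_gameweek_stats.json")  # example season
         .select(F.explode("history").alias("gw"))  # one row per gameweek entry
         .select("gw.*")
)
```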

So now that we have a set of bronze tables, we are ready to apply some ETL and get to our silver layer of normalised tables – see you next time!

BW

Next Time – Bronze to Silver ETL
