Writing ETLs from Scratch

As I mentioned in my last post, I wrote Extract, Transform, Load (ETLs) from scratch. In this post I will elaborate on how I went about doing that, the design choices I made, and the modularity I incorporated.

What are ETLs?

ETLs, in short, “extract” data from one source in a certain format, “transform” it as necessary, and “load” it into a destination in a different format.

Why ETLs?

You could go about manipulating your data in multiple different ways. Before using ETLs I did it in a pretty procedural way. I had multi-hundred-line Rake tasks with a lot of duplication between them. They were a pain to maintain, hard to understand, and prone to bugs because variables were defined at the top and used throughout the task.

As a step up from that, I switched to using importers, readers, and writers. It was like ETLs but without the transform step. It’s amazing how much more organized the code is with that extra step.

Design Choices

The design is heavily influenced by the Kiba gem. I chose to not use the gem because it’s both inflexible and too complicated for my needs. It’s inflexible because the preprocessor is limited. It’s too complicated because it requires using a DSL to define ETLs which seemed unnecessary.

Instead, I chose to write my own and used Kiba as an inspiration. The essence of the ETL is this: Sources have an each method that iterates over the source and yields. In the block it passes the row from the source to the transform’s process method and then calls the destination’s write method passing it the transformed row.

Modularity

Since the motivation behind moving away from procedural rake tasks is to reduce duplication, implementing ETLs this way provided me with the modularity I needed. Once I’ve written a source, a transform, or a destination, it could be used by any ETL.

I have tasks that bootstrap data from certain websites. The logic is fairly similar between them but they have slight differences such as the name and format of the CSV file. The best solution I could come up with for that is to design the ETL classes to receive parameters for all the things that could be different. To use the ETLs, I wrote website-specific ETL classes that use the actual ETL classes and encapsulate the website-specific parameters.

Lessons Learned

ETLs are a great solution for a specific category of problems. Had I known about it sooner it would have saved me quite a bit of development time. I discovered it because it was being discussed at work. Maybe a little more research would have lead me to discover them sooner?

More generally, this project was the perfect demonstration for me for why Object-Oriented Programming (OOP) is superior to procedural programming for anything more than a script.