Software Engineering

Writing ETLs from Scratch

As I mentioned in my last post, I wrote Extract, Transform, Load (ETL) pipelines from scratch. In this post I will elaborate on how I went about doing that, the design choices I made, and the modularity I incorporated.

What are ETLs?

ETLs, in short, “extract” data from one source in a certain format, “transform” it as necessary, and “load” it into a destination in a different format.

Why ETLs?

You could go about manipulating your data in many different ways. Before using ETLs, I did it in a pretty procedural way: I had multi-hundred-line Rake tasks with a lot of duplication between them. They were a pain to maintain, hard to understand, and prone to bugs because variables were defined at the top and used throughout the task.

As a step up from that, I switched to using importers, readers, and writers. It was like ETLs but without the transform step. It’s amazing how much more organized the code is with that extra step.

Design Choices

The design is heavily influenced by the Kiba gem. I chose not to use the gem because it’s both inflexible and too complicated for my needs. It’s inflexible because its preprocessor is limited. It’s too complicated because it requires a DSL to define ETLs, which seemed unnecessary.

Instead, I chose to write my own and used Kiba as an inspiration. The essence of the ETL is this: sources have an each method that iterates over the source and yields each row. The row is passed to the transform’s process method, and the transformed row is then passed to the destination’s write method.
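That loop can be sketched in a few lines. The class and method names below are my own for illustration, not the actual code:

```ruby
# A minimal sketch of the ETL runner described above: a source that
# yields rows, a transform with a process method, and a destination
# with a write method.
class EtlRunner
  def initialize(source:, transform:, destination:)
    @source = source
    @transform = transform
    @destination = destination
  end

  # Iterate over the source, transform each row, and write it out.
  def run
    @source.each do |row|
      @destination.write(@transform.process(row))
    end
  end
end

# Example components: an in-memory source and destination.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

class UpcaseTransform
  def process(row)
    row.transform_values(&:upcase)
  end
end

class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end
end
```

Because the runner only cares about those three method names, any source, transform, or destination can be swapped in.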


Since the motivation behind moving away from procedural Rake tasks was to reduce duplication, implementing ETLs this way provided the modularity I needed. Once I’ve written a source, a transform, or a destination, it can be used by any ETL.

I have tasks that bootstrap data from certain websites. The logic is fairly similar between them, but they have slight differences such as the name and format of the CSV file. The best solution I could come up with was to design the ETL classes to receive parameters for everything that could differ. To use the ETLs, I wrote website-specific ETL classes that wrap the generic ETL classes and encapsulate the website-specific parameters.

Lessons Learned

ETLs are a great solution for a specific category of problems. Had I known about them sooner, they would have saved me quite a bit of development time. I discovered them because they were being discussed at work. Maybe a little more research would have led me to discover them sooner?

More generally, this project was the perfect demonstration for me for why Object-Oriented Programming (OOP) is superior to procedural programming for anything more than a script.


Lessons from Launching a Side Project

Almost two years after I got the idea, I finally launched my side project.


Let’s start with the stack. I’m still using Rails for the back end, except now Rails 6 is out. I did try Vue.js for a while, but I quickly abandoned it in favour of React. While Vue is much simpler than Angular, it still has a lot of extra complexity compared to React; the distinction between computed properties and methods, and the template syntax, come to mind. I will discuss the extra technologies I ended up adding along the way later on.

I started using React without Redux because I’d seen how much boilerplate having a datastore adds, and I thought my application was simple enough that I wouldn’t need Redux. This assumption held true at first. However, after a while, I started getting frustrated with how deep I needed to pass some properties and handlers through components that didn’t need to know about them. So, I decided to add Redux. While it solved the problem of having to pass props and handlers down, it introduced a few problems of its own:

  • Boilerplate.
  • Sharing logic between pages is convoluted.
  • Writing features for the front end became a drag. To add a simple button that changes the state you need to: create the button with a handler that calls a prop function; change the component to accept said prop; change the container component to pass the prop; add the action type; create the action; and create the reducer.

Enter React Hooks. I didn’t know how hooks were supposed to be used, so I winged it, and I’m pretty happy with how it turned out. I will probably write a separate post about hooks, so I’ll gloss over them for now.

In addition to all the front end rewrites, I also rewrote the content import code to borrow from the principles of ETL. Inspired by the Kiba gem, I wrote my own ETL from scratch with different sources, transforms, and destinations that I can combine flexibly to import content from APIs and CSVs and populate the database by creating records one at a time, creating records in bulk using PostgreSQL’s COPY command, or updating individual records’ attributes.
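As a sketch of the bulk path, a destination can format rows in COPY-friendly CSV form. The class below is hypothetical and writes to a plain IO so it stays self-contained; the real destination streams the same lines into the database through the pg driver.

```ruby
require "csv"

# Hypothetical destination that formats rows for PostgreSQL's
# COPY ... FROM ... WITH (FORMAT csv). Writing to an injected IO keeps
# it testable; production code would write into the DB connection.
class CopyDestination
  def initialize(io:, columns:)
    @io = io
    @columns = columns
  end

  def write(row)
    @io.write(CSV.generate_line(@columns.map { |column| row[column] }))
  end
end
```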

Were all the upgrades and technology switches worth it? From a development perspective? Absolutely! They make development so much more enjoyable and faster. From a pure business-value perspective? I don’t know. On one hand, I get features out much faster. On the other hand, I don’t know where the website would be by now had I launched sooner without all the rewrites.


The plan was simple: create the database, seed it, add content from the websites I support for the initial launch, and start telling people about it. Oh, how I wish it was that easy!

Lack of (Thorough) Testing

The first issue I encountered was with the import script. The COPY command works by taking a CSV file and copying its contents into a table. The import script shards the many-to-many table data every 10 million records. One mistake I made was simply not testing that during the import script rewrite, because I was only ever running the script with 10,000 records at a time. There was a bug in the sharding that prevented content population from completing. The fix was simple, but hunting the bug down took a while. This is 100% my fault (then again, so is everything that goes wrong with this project since I’m the only one working on it).
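The sharding arithmetic itself is trivial, which is exactly why the bug hid: with a 10-million-record shard size, a 10,000-record test run never leaves shard zero. A sketch with hypothetical names:

```ruby
SHARD_SIZE = 10_000_000

# Map a zero-based record index to its shard number.
def shard_for(index, shard_size: SHARD_SIZE)
  index / shard_size
end

# Hypothetical name of the table a record lands in.
def shard_table(base_name, index, shard_size: SHARD_SIZE)
  "#{base_name}_#{shard_for(index, shard_size: shard_size)}"
end
```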

Unforeseen Scale Issues

Along the way, seeing certain queries being run repeatedly while returning the same results (the content of the home page, for example), I added caching using Redis (the first technology added along the way). I plan on improving the cache invalidation mechanism eventually but, for now, it caches the content for 5 minutes before invalidating it. It’s simple but it gets the job of reducing queries done.
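The pattern is just fetch-with-expiry: in the app it’s the Rails cache backed by Redis with a five-minute expiry, but a self-contained sketch of the idea (the class is hypothetical) fits in a few lines.

```ruby
# Minimal fetch-with-TTL cache illustrating the five-minute expiry
# described above; the real app uses the Rails cache backed by Redis.
class TtlCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl:)
    @ttl = ttl
    @store = {}
  end

  # Return the cached value, or run the block and cache its result.
  # The clock is injectable so expiry can be tested deterministically.
  def fetch(key, now: Time.now)
    entry = @store[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @store[key] = Entry.new(value, now + @ttl)
    value
  end
end
```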

I always imported 10,000 records and used that for testing on my local development machine, a mid-2013 MacBook Air with 128 GB of storage. As a result, I didn’t know what issues I would have at scale until I had almost 19 million entities!

The first issue that presented itself was search. Filtering content based on certain criteria such as tags took a really long time using SQL queries. Selecting entries that have any of the tags present is easy. Selecting entries with all of the tags present is much, much slower. My first instinct was to split the tags table, which was a polymorphic association with three different entities. This helped reduce the size of the database and sped up the search, but it still wasn’t enough.
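The gap between the two queries is easier to see in plain terms: "any tag" is a membership test, while "all tags" is a superset check, which in SQL means joining, grouping, and counting matches per entry. An in-memory illustration with made-up data:

```ruby
require "set"

entries = [
  { id: 1, tags: Set["ruby", "rails"] },
  { id: 2, tags: Set["ruby"] },
  { id: 3, tags: Set["elixir"] },
]

wanted = Set["ruby", "rails"]

# "Any tag present": keep entries whose tags intersect the wanted set.
any_match = entries.select { |entry| entry[:tags].intersect?(wanted) }

# "All tags present": keep entries whose tags are a superset of it.
all_match = entries.select { |entry| entry[:tags] >= wanted }
```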

Enter Searchkick. Searchkick is a gem that acts as a wrapper for Elasticsearch (the second technology added along the way). Searching is now almost instant! It came at the cost of the complexity of another technology and the expense of running it.

The second issue was a consequence of using Elasticsearch. Elasticsearch provides a recommendation feature out of the box by looking at similar documents. This feature provided excellent recommendations but it was slow. It would take 30 seconds and time out for some entries. It turns out Elasticsearch can do extra indexing that would speed this up. Adding term vectors increased the index size 1.5x but made the recommendations almost instant, too!

Post Launch

Finally! I added all the features I wanted, they work, and they work fast. It’s time to start telling people about it.

The plan all along was to promote it on Reddit: Post content from my website to Reddit, people see it, visit my website, immediately love it, create accounts, and keep coming back. Simple, right?

No. One issue I ran into is that subreddits restrict which websites can be linked to, which limited my options when it came to which subreddits I could post in. No problem, I was still getting thousands of daily visitors; surely some of them would register, right?

No. Due to the nature of the site, and probably other factors, users are really, really hesitant to sign up. Of the over 22,000 users so far, 84 clicked on UI elements that require signing up and saw a message asking them to sign up; of those, 16 visited the sign-up page; and of those, 0 registered. No problem: clearly, people aren’t seeing the need to register, so they’re not registering. I will show a prompt that tells users about the features that come with having an account. Surely that will work, right?

No. I decided to try different versions of the prompt. To accomplish that, I set up goals in Google Analytics and set up an A/B experiment with Google Optimize with 4 options: No prompt (control), a prompt asking people to register, a prompt that mentioned the possibility of signing up with Reddit, and a prompt that assured users that emails are optional during sign up. Over 3,700 users saw the prompts… 0 registrations.

Along the way (and this probably didn’t help with encouraging users to sign up), I ran into issues connecting to Elasticsearch. They were intermittent, frequent, and lasted for hours at a time. While that was happening, recommendations and search wouldn’t work, and a notification saying something went wrong would show up. It turns out this was caused by a gem that Searchkick recommends: Typhoeus. It is supposed to speed up Elasticsearch searches by maintaining an open connection. Instead, it brought them to a halt. Removing it eliminated the problem. While I added the gem before launch, the issue only became apparent after launch. Lesson learned, though: be careful with what you add in production. When things fail, they’re seen by people. Although it’s not as scary as it seems: the stress of worrying about the launch and how important the data is to users is far worse than the stress of something going wrong while the website is live, especially if it’s still usable.

Current State

So, where do I stand? Well, this is the beauty of having my own project: it’s entirely up to me. In addition to learning a lot of technologies that helped me in my daily job (they helped me fix an Elasticsearch bug at work much faster than if I hadn’t used them before), it is really up to me to determine what I want to do and how I want to do it. While the website is far, far from making enough to cover its costs, it costs little enough that I can afford to keep it up for a while as I come up with other ways to promote it.

Upcoming Posts

I left a few things undiscussed that will likely have their own posts soon:

  • Isomorphic Applications: Since SEO is important in this area, I had to make sure that pages load with their content pre-loaded instead of acting like a single-page application.
  • React Hooks.
  • Caching: When I get around to improving cache invalidation I will likely have a post to talk about that.