
Writing ETLs from Scratch

As I mentioned in my last post, I wrote Extract, Transform, Load (ETL) pipelines from scratch. In this post I will elaborate on how I went about doing that, the design choices I made, and the modularity I incorporated.

What are ETLs?

ETLs, in short, “extract” data from one source in a certain format, “transform” it as necessary, and “load” it into a destination in a different format.

Why ETLs?

You could go about manipulating your data in any number of ways. Before using ETLs, I did it in a pretty procedural way: I had multi-hundred-line Rake tasks with a lot of duplication between them. They were a pain to maintain, hard to understand, and prone to bugs because variables were defined at the top and used throughout the task.

As a step up from that, I switched to using importers, readers, and writers. It was like ETLs but without the transform step. It’s amazing how much more organized the code becomes once that transform step is added.

Design Choices

The design is heavily influenced by the Kiba gem. I chose not to use the gem itself because it is both too inflexible and too complicated for my needs. It’s inflexible because its preprocessor is limited. It’s too complicated because it requires using a DSL to define ETLs, which seemed unnecessary.

Instead, I chose to write my own, using Kiba as inspiration. The essence of the ETL is this: sources have an each method that iterates over the source and yields each row. The ETL passes each row to the transform’s process method and then calls the destination’s write method with the transformed row.
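To make that concrete, here is a minimal sketch of that loop. Only the each/process/write contract comes from the actual design; the class name and keyword arguments are illustrative.

    class Etl
      def initialize(source:, transform:, destination:)
        @source = source
        @transform = transform
        @destination = destination
      end

      # The source yields one row at a time; each row goes through the
      # transform, and the result is handed to the destination.
      def run
        @source.each do |row|
          @destination.write(@transform.process(row))
        end
      end
    end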

Modularity

Since the motivation behind moving away from procedural Rake tasks was to reduce duplication, implementing ETLs this way provided me with the modularity I needed. Once I’ve written a source, a transform, or a destination, it can be used by any ETL.

I have tasks that bootstrap data from certain websites. The logic is fairly similar between them, but they have slight differences such as the name and format of the CSV file. The best solution I could come up with was to design the ETL classes to receive parameters for all the things that could differ. To use the ETLs, I wrote website-specific ETL classes that wrap the generic ETL classes and encapsulate the website-specific parameters, as sketched below.
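As a hypothetical illustration (the class names, file name, and options here are made up, and the Etl class is the one from the sketch above), a generic CSV source takes its differences as parameters, and a website-specific class simply supplies them:

    require "csv"

    # Generic, reusable source: everything that can differ between
    # websites (file path, CSV options) is passed in as a parameter.
    class CsvSource
      def initialize(path:, csv_options: {})
        @path = path
        @csv_options = csv_options
      end

      def each(&block)
        CSV.foreach(@path, **@csv_options, &block)
      end
    end

    # Website-specific wrapper: encapsulates the parameters for one
    # website and delegates to the generic pieces. ExampleTransform and
    # RecordDestination stand in for other reusable pieces.
    class ExampleSiteBootstrapEtl
      def run
        Etl.new(
          source: CsvSource.new(path: "example_site.csv", csv_options: { headers: true }),
          transform: ExampleTransform.new,
          destination: RecordDestination.new
        ).run
      end
    end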

Lessons Learned

ETLs are a great solution for a specific category of problems. Had I known about them sooner, they would have saved me quite a bit of development time. I only discovered them because they were being discussed at work. Maybe a little more research would have led me to them sooner?

More generally, this project was the perfect demonstration for me of why Object-Oriented Programming (OOP) is superior to procedural programming for anything more than a script.


Lessons from Launching a Side Project

Almost two years after I got the idea, I finally launched my side project.

Development

Let’s start with the stack. I’m still using Rails for the back end, except now Rails 6 is out. I did try Vue.js for a while, but I quickly abandoned it in favour of React. While Vue is much simpler than Angular, it still has a lot of extra complexity compared to React; the distinction between computed properties and methods, and the template syntax, come to mind. I will discuss the extra technologies I ended up adding along the way later on.

I started using React without Redux because I had seen how much boilerplate having a datastore adds, and I thought my application was simple enough that I wouldn’t need Redux. This assumption held true at first. However, after a while, I started getting frustrated with how deep I needed to pass some properties and handlers through components that didn’t need to know about them. So, I decided to add Redux. While it solved the problem of having to pass props and handlers down, it introduced a few problems of its own, such as:

  • Boilerplate.
  • Sharing logic between pages is convoluted.
  • Writing features for the front end became a drag. To add a simple button that changes the state you need to: create the button with a handler that calls a prop function; change the component to accept said prop; change the container component to pass the prop; add the action type; create the action; and create the reducer.

Enter React Hooks. I didn’t know how hooks were supposed to be used, so I winged it, and I’m pretty happy with how it turned out. I will probably write a separate post about hooks, so I’ll gloss over them for now.

In addition to all the front end rewrites, I also rewrote the content import code to borrow from the principles of ETL. Inspired by the Kiba gem, I wrote my own ETL from scratch with different sources, transforms, and destinations that I can combine flexibly to import content from APIs and CSVs and populate the database by creating records one at a time, creating records in bulk using PostgreSQL’s COPY command, or updating individual records’ attributes.
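As an example of what one of those destinations might look like, here is a hedged sketch of a bulk destination built on the pg gem’s COPY support. The class name and the buffer-then-flush design are mine; the real code may well be structured differently.

    require "pg"
    require "csv"

    # Buffers rows from the ETL and loads them with one COPY statement.
    # The ETL would need to call flush once the source is exhausted.
    class CopyDestination
      def initialize(connection:, table:, columns:)
        @connection = connection
        @table = table
        @columns = columns
        @rows = []
      end

      def write(row)
        @rows << row
      end

      def flush
        sql = "COPY #{@table} (#{@columns.join(', ')}) FROM STDIN WITH (FORMAT csv)"
        @connection.copy_data(sql) do
          @rows.each { |row| @connection.put_copy_data(CSV.generate_line(row)) }
        end
      end
    end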

Were all the upgrades and technology switches worth it? From a development perspective? Absolutely! They make development so much more enjoyable and faster. From a pure business-value perspective? I don’t know. On one hand, I get features out much faster. On the other hand, I don’t know where the website would be by now had I launched sooner without all the rewrites.

Launch

The plan was simple: create the database, seed it, add content from the websites I support for the initial launch, and start telling people about it. Oh, how I wish it was that easy!

Lack of (Thorough) Testing

The first issue I encountered was with the import script. The COPY command works by taking a CSV file and copying its contents into a table. The import script shards the many-to-many table data every 10 million records. One mistake I made was simply not testing that part during the import script rewrite, because I was only ever running the script with 10,000 records at a time. There was a bug in the sharding that prevented content population from completing. The fix was simple, but hunting the bug down took a while. This is 100% my fault (then again, so is everything that goes wrong with this project, since I’m the only one working on it).
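For context, the sharding itself is conceptually simple, which is exactly why it felt safe to skip testing it at full scale. A rough, hypothetical sketch (the shard size is real, the file naming and method are not):

    require "csv"

    SHARD_SIZE = 10_000_000

    # Split the many-to-many rows into CSV shards of 10 million rows
    # each, so every shard can then be loaded with a single COPY.
    def write_shards(rows, prefix: "many_to_many")
      rows.each_slice(SHARD_SIZE).with_index do |shard, index|
        CSV.open("#{prefix}_#{index}.csv", "w") do |csv|
          shard.each { |row| csv << row }
        end
      end
    end

A 10,000-record test run never fills more than one shard, so logic like this is never really exercised.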

Unforeseen Scale Issues

Along the way, after seeing certain queries run repeatedly while returning the same results (the content of the home page, for example), I added caching using Redis (the first technology added along the way). I plan on improving the cache invalidation mechanism eventually but, for now, it caches the content for 5 minutes before invalidating it. It’s simple, but it gets the job of reducing queries done.
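The caching itself is nothing fancy. With Rails backed by Redis it can be as small as a fetch with an expiry; the cache key, model, and query below are placeholders, not the real ones:

    # Assumes config.cache_store = :redis_cache_store in the environment.
    # Caches the expensive home page query for 5 minutes.
    def home_page_things
      Rails.cache.fetch("home_page_things", expires_in: 5.minutes) do
        Thing.order(created_at: :desc).limit(50).to_a
      end
    end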

I had always imported 10,000 records and used those for testing on my local development machine, which is a mid-2013 MacBook Air with 128 GB of storage. As a result, I didn’t know what issues I would have with scale until I had all of the almost 19 million entities!

The first issue that presented itself was search. Filtering content based on certain criteria, such as tags, took a really long time using SQL queries. Selecting entries that have any of the given tags is easy; selecting entries that have all of the given tags is much, much slower. My first instinct was to split up the tags table, which was a polymorphic association spanning 3 different entities. This helped reduce the size of the database and sped up the search, but it still wasn’t enough.

Enter Searchkick. Searchkick is a gem that acts as a wrapper for Elasticsearch (the second technology added along the way). Searching is now almost instant! It came at the price of the complexity, and the hosting cost, of yet another technology.
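For reference, the Searchkick side of this is surprisingly small. The model and field names below are placeholders for the real schema; the point is that the “all tags present” filter becomes a simple where clause handled by Elasticsearch:

    # Hypothetical model indexed with Searchkick.
    class Thing < ApplicationRecord
      searchkick

      # Controls what gets indexed in Elasticsearch.
      def search_data
        { title: title, tags: tags.map(&:name) }
      end
    end

    # Entries matching the query that have *all* of the given tags.
    Thing.search("*", where: { tags: { all: ["history", "science"] } })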

The second issue was a consequence of using Elasticsearch. Elasticsearch provides a recommendation feature out of the box by looking at similar documents. This feature provided excellent recommendations, but it was slow: it would take 30 seconds and time out for some entries. It turns out Elasticsearch can do extra indexing that speeds this up. Adding term vectors increased the index size 1.5x but made the recommendations almost instant, too!
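In Searchkick terms, that change amounted to something like the following; the model and field names are placeholders, and the term_vector mapping is the part that made the similar-documents queries fast:

    # Hypothetical sketch: store term vectors for the fields used for
    # similarity so more-like-this queries don't have to re-analyze
    # documents at query time.
    class Thing < ApplicationRecord
      searchkick merge_mappings: true, mappings: {
        properties: {
          title: { type: "text", term_vector: "with_positions_offsets" }
        }
      }
    end

    # Recommendations: documents similar to this one.
    thing.similar(fields: [:title])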

Post Launch

Finally! I added all the features I wanted, they work, and they work fast. It’s time to start telling people about it.

The plan all along was to promote it on Reddit: Post content from my website to Reddit, people see it, visit my website, immediately love it, create accounts, and keep coming back. Simple, right?

No. One issue I ran into is that subreddits restrict which websites can be linked to, which limited my options when it came to which subreddits I could post in. No problem, I was still getting thousands of daily visitors; surely some of them will register, right?

No. Due to the nature of the site, and probably other factors, users are really, really hesitant to sign up. Of the over 22,000 users so far, 84 clicked on UI elements that require signing up and saw a message asking them to sign up; of those, 16 visited the sign-up page; and of those, 0 registered. No problem: clearly people aren’t seeing the need to register, so they’re not registering. I will show a prompt that tells users about the features that come with having an account. Surely that will work, right?

No. I decided to try different versions of the prompt. To accomplish that, I set up goals in Google Analytics and an A/B experiment in Google Optimize with 4 variants: no prompt (control), a prompt asking people to register, a prompt that mentioned the possibility of signing up with Reddit, and a prompt that assured users that emails are optional during sign-up. Over 3,700 users saw the prompts… 0 registrations.

Along the way (and this probably didn’t help with encouraging users to sign up), I ran into issues connecting to Elasticsearch. They were intermittent, frequent, and lasted for hours at a time. While that was happening, recommendations and search wouldn’t work, and a notification saying something went wrong would show up. It turns out this was caused by a gem that Searchkick recommends: Typhoeus. It is supposed to speed up Elasticsearch searches by maintaining an open connection; instead, it brought them to a halt. Removing it eliminated the problem. While I added the gem before launch, the problem only became apparent after launch. Lesson learned, though: be careful with what you add in production, because when things fail, they’re seen by people. Although, it’s not as scary as it seems. The stress of worrying about the launch, and about how much the data matters to users, is far worse than the stress of something going wrong while the website is live, especially if it’s still usable.

Current State

So, where do I stand? Well, this is the beauty of having my own project: it’s entirely up to me. In addition to learning a lot of technologies that have helped me in my daily job (it helped me fix an Elasticsearch bug at work much faster than if I hadn’t used it before), it is really up to me to determine what I want to do and how I want to do it. While the website is far, far from making enough to cover its costs, it costs little enough that I can afford to keep it up while I come up with other ways to promote it.

Upcoming Posts

I left a few things undiscussed that will likely have their own posts soon:

  • Isomorphic Applications: Since SEO is important in this area, I had to make sure that pages load with their content pre-loaded instead of acting like a single-page application.
  • React Hooks.
  • Caching: When I get around to improving cache invalidation I will likely have a post to talk about that.

Minimum Viable Product (MVP), Vision, and CRUD

If I were to write my own definition of a Minimum Viable Product (MVP) for a textbook, it would be the following: the set of features necessary to gain and retain early users.

It doesn’t sound that hard, does it? Twitter? Users should be able to post messages, they should be able to follow other users, and they should have a feed where they see posts from users they follow. Stack Overflow? Users should be able to ask, vote on, answer, and comment on questions. A todo list app? Users should be able to create tasks and be able to check and uncheck them.

Creating an MVP is hard for one reason: vision.

Vision

The creator of a product has a vision for what they want their product to be, what features it should have, and how it’s going to improve its users’ lives. It’s good to be ambitious and want a lot of features but it makes launching so much harder. Let’s look at Twitter again. Let’s assume the creator had the vision for retweeting and hashtags when they first created Twitter. To them, all these features are part of their vision of how great the website will be. “Imagine not having to copy and paste a Tweet to share it!” or “Imagine clicking any tag and seeing all the other tweets that have the same tag! Wouldn’t that be cool‽”.

To be able to create a realistic MVP, one must be able to put their vision aside and focus on the users and their needs. Of course users would love to have all these features but is it necessary to have them all at launch for Twitter to be successful? No. Users just need a platform to share short text posts on.

The challenge here is putting aside vision. How does one ignore the desire to create the best product possible? It’s the fruit of their labour. They will have spent months on it before they launch it. They want it to be the best it can be. Anything short of that is a compromise.

The answer, in my opinion, is simply to realize that the choice is not between launching with feature X or without it. Rather, the choice is between launching with the right set of features and seeing their vision come to reality slowly or being late and seeing their vision fail miserably.

Without this realization, there are no consequences to adding more features: throw more features in, the more the merrier! With it, there is now a risk associated with adding features that the product can do without initially, and that ought to make it easier to accept that the product will launch without the full set of features.

The CRUD Problem

Once the creator decides on an MVP, they might realize that their application does nothing more than create, read, update, and/or delete (CRUD) entities. CRUD is seen as being very basic; it’s almost an insult to call an application “CRUDy”. A task that’s “CRUDy” in nature is seen as simple and low effort. Back to Twitter. The basic set of features described above is very CRUDy: users should be able to CREATE post messages, they should be able to CREATE follow relationships with other users, and they should be able to READ a feed where they see posts from users they follow. Of course, they can DELETE posts too.

The problem arises when, after spending months working on an application, the creator realizes they have nothing but a CRUDy application to show for it. What were those months spent on? Was it all a waste of time? Were they slow? Are they inefficient? Probably not.

The likely reason for the application taking a long time is all the details that need to be ironed out. How many characters should a tweet be limited to? How are tweets in feeds sorted? You get the idea. And, if, at some point during development, they realize they need to change one of these details, the changes might not be trivial.

I’m afraid the solution to this is another realization: Users do not care what goes on behind the scenes and whether the product is “CRUDy” or not. All they care about is features. Build a product people want and they will stick around.


Complexity in Software

When it comes to complexity, somewhere between Hello World and Google exists every other piece of software ever written. What makes one piece of software more complex than another? Lines of code might be an obvious answer, especially considering the link above, but I believe that complexity causes an increase in the number of lines of code, not the other way around.

The number of components that need to play well together is directly proportional to complexity. Web applications need a database, a back end, and a front end. The front end is rendered, it asks the back end for data, and the back end retrieves the data from the database, formats it, and sends it to the front end. Let’s think of software in terms of MVC. Even though the front end and the back end both seem to utilize a variation of MVC, for our purposes, we’ll consider the database to be the model, the front end to be the view, and the back end to be the controller. Most features involve changing all three components, and they all need to work well, and work well together, for the feature to work. This adds complexity. Not only because there’s code whose sole purpose is to communicate with the other layers, but also because you need to test all three components together.

This applies to all software of the same category, though; some pieces of software are inherently more complex than others. Let’s take the current side project I’m working on. It is, at its core, a very CRUDy application: users are presented with things, they can save said things, and the things they saved are presented back to them. Simple, right? Well, sure, but that’s just the techy elevator pitch. The things need to be presented nicely, they need to be laid out in a meaningful way, the pages need to make sense, the things need to have other metadata associated with them, the things need to be imported from 3rd parties, and there needs to be a way to keep the things up to date. The list goes on, and on, and on.

The number of features in a piece of software is an indicator of complexity, but it’s not enough on its own since features have different complexities. A list of featured items is not as complex as a recommendation engine. In the former, data needs to be retrieved from the database, aggregated, and displayed. A recommendation engine, assuming one is written from scratch, requires certain usage metrics to be saved and an algorithm to be written, with a lot of planning involved, to determine what people are interested in.

If lines of code are caused by complexity, and features don’t correlate 1:1 with complexity, then what determines complexity in a piece of software? Requirements do. Features have multiple requirements. Generally speaking, the more complex a feature is, the more requirements it has and/or the more complex each requirement is.

Let’s take the two features above as an example. Both display a set of things to the user. The first, a list of featured things, might have the following requirements:

  • Featured things must refresh every 10 minutes.
  • Featured things are the things saved most recently by users.

Now let’s look at the recommendation engine’s requirements:

  • Each user sees a recommendation based on their own activity.
  • Similar things are determined based on what users save together.

Both have two requirements, so how can we say that one is more complex than the other? Because requirements don’t map directly to code. The second requirement of the recommendation engine involves creating new tables, writing code to save user activity, and writing a relatively complex query to determine which items are saved together, as sketched below. More granular requirements and subtasks of a feature are what surface these details.
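To make that concrete, the “saved together” requirement alone implies something like a saves join table and a self-join aggregation over it. This is a hypothetical sketch of the schema and query, not an actual implementation:

    # Hypothetical join table: one row per (user, thing) save.
    class Save < ApplicationRecord
      belongs_to :user
      belongs_to :thing
    end

    # Things most often saved by the same users who saved the given
    # thing, i.e. a self-join on saves, counted per candidate thing.
    def saved_together_with(thing_id, limit: 10)
      Save
        .where(thing_id: thing_id)
        .joins("INNER JOIN saves AS other ON other.user_id = saves.user_id " \
               "AND other.thing_id <> saves.thing_id")
        .group("other.thing_id")
        .order(Arel.sql("COUNT(*) DESC"))
        .limit(limit)
        .count
    end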

Once we get to very granular details like “create users table”, “create API endpoint to sign in”, and “create sign-in page”, we have a pretty good idea of what the complexity will be. But almost no one bothers to create such granular subtasks beforehand, and that’s why, when a developer sets out to build a “simple” application, they underestimate the complexity involved. This is also what makes estimating so hard, but that’s another topic for another day.