How to onboard new real estate data sets in a scalable, efficient, and flexible way.
After years of learning and refining our process at Cherre, we have reached a point where we know what works and what does not, where the bottlenecks in our approach are, and what we need to do to scale at a level unprecedented in Cherre’s history (not kidding here).
Let’s briefly take a trip down memory lane to understand how Cherre got to 2022. Previously, we were not using dbt, our data transformation tool, or Meltano (Singer), our data ingestion tool. We were building everything from scratch. Building everything from scratch is a huge problem: there is no standard to follow, everyone has their own unique implementation, and optimizing at scale is infeasible. It violates the DRY (don’t-repeat-yourself) principle and the Single Responsibility Principle.
Scalability was nearly impossible. Processing any new data set would take months, which neither our product team nor our clients were happy about. We knew it did not have to be this way. The issue was not the data; the main problem was that the entire pipeline was coupled as one service.
Ingestion is arguably one of the most challenging tasks in our pipeline. Working with an external source is not easy: we have to deal with authentication issues, encoding issues, nested data, et cetera. Using Meltano helped isolate some of these problems. Meltano is built around two concepts: taps and targets. A tap extracts data from an external source, and a target loads it into a destination (in our case, our data lake). Decoupling ingestion into these two parts made our lives easier. As we would learn time and time again, decoupling services and code reduces dependency issues.
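To make the tap/target split concrete, here is a toy sketch of the Singer pattern Meltano builds on, with made-up data: the tap writes JSON record messages to stdout, the target reads them from stdin, and the two sides only have to agree on the message format, not on each other.

```python
# singer_sketch.py -- a toy illustration of the tap/target split (dataset and
# fields are made up; real Singer taps also emit SCHEMA and STATE messages).
import json
import sys


def tap() -> None:
    """Tap side: knows the external source, knows nothing about the target."""
    records = [
        {"parcel_id": "123", "city": "New York"},
        {"parcel_id": "456", "city": "Brooklyn"},
    ]
    for record in records:
        message = {"type": "RECORD", "stream": "parcels", "record": record}
        sys.stdout.write(json.dumps(message) + "\n")


def target() -> None:
    """Target side: knows the data lake, knows nothing about the tap."""
    for line in sys.stdin:
        message = json.loads(line)
        if message["type"] == "RECORD":
            print(f"loading into data lake: {message['record']}")


if __name__ == "__main__":
    target() if "--target" in sys.argv else tap()
```

Running `python singer_sketch.py | python singer_sketch.py --target` pipes the tap’s output into the target; Meltano manages this contract for us in a far more robust way.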
After a few more iterations, we reached a point where the tap is our main focus when ingesting data. Even better, we built a Cherre ingestion framework on top of Meltano and containerized the ingestion process. What does this mean?
We have decoupled our ingestions and services from one another, decreasing the likelihood of dependency issues. Our Cherre ingestion framework is well-tested and versioned, so we can confidently build new taps (i.e., ingest new data sources) without fear of breaking other ingestions, and code changes by other engineers are far less likely to break ours.
Breaking our ingestion process into individual components, where each ingestion stands on its own, has helped us create new and custom ingestions more quickly, making ingestion scalable.
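To illustrate why a shared, versioned framework makes new taps cheap to build, here is a purely hypothetical sketch (none of these class or method names come from Cherre’s actual framework): the framework owns the common plumbing, and a new ingestion only fills in the source-specific pieces.

```python
# Hypothetical sketch of a tap built on a shared ingestion framework.
# The base class stands in for framework-owned plumbing (retries, record
# emission, state handling); a new tap only describes its source.
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class BaseTap(ABC):
    """Framework side: tested and versioned once, reused by every ingestion."""

    @abstractmethod
    def fetch_records(self) -> Iterator[Dict[str, Any]]:
        """Yield raw records from the external source."""

    def run(self) -> None:
        for record in self.fetch_records():
            self.emit(record)

    def emit(self, record: Dict[str, Any]) -> None:
        # A real framework would write Singer RECORD messages; we just print.
        print(record)


class CountyAssessorTap(BaseTap):
    """A new ingestion: only the source-specific logic lives here."""

    def fetch_records(self) -> Iterator[Dict[str, Any]]:
        # A real tap would page through an API or read files from SFTP.
        yield {"parcel_id": "123", "assessed_value": 450000}


if __name__ == "__main__":
    CountyAssessorTap().run()
```

In a setup like this, containerizing each tap and pinning it to a framework version means shared-code changes reach an ingestion deliberately rather than by accident.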
Metadata is probably an overused term nowadays, but at Cherre, metadata describes the pipeline. It specifies the processes and transformations needed before the data is available via our API. We no longer develop or integrate our Connections Services (name service, address service) directly into the pipeline. Before our metadata model, engineers needed a deep understanding of our Connections Services. Now, the metadata tells us whether a service is needed and, if it is, what inputs to pass it.
Microservices help us segregate responsibilities. Our services are independent, so as long as the interface remains the same, we can change a service’s implementation with far less mental overhead. With metadata, we significantly reduce the amount of code we write from scratch: we can generate it, which is exactly what we do for dbt. We do not have to violate DRY, and we no longer write Airflow DAGs by hand. We only have one DAG that creates DAGs by iterating through the metadata, as sketched below.
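As a simplified illustration of that single DAG-generating pattern (the metadata fields, file layout, and commands below are hypothetical, not Cherre’s actual schema), one module can iterate over YAML metadata files and register one Airflow DAG per dataset:

```python
# dag_factory.py -- a simplified, hypothetical metadata-driven DAG generator.
# One module iterates over YAML metadata files and registers one Airflow DAG
# per dataset; no per-dataset DAG code is written by hand.
from datetime import datetime
from pathlib import Path

import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator

METADATA_DIR = Path(__file__).parent / "metadata"  # hypothetical location

for metadata_file in sorted(METADATA_DIR.glob("*.yaml")):
    meta = yaml.safe_load(metadata_file.read_text())

    with DAG(
        dag_id=f"{meta['dataset']}_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval=meta.get("schedule", "@daily"),
        catchup=False,
    ) as dag:
        # Ingest with Meltano: tap and target names come from the metadata.
        ingest = BashOperator(
            task_id="ingest",
            bash_command=f"meltano elt {meta['tap']} {meta['target']}",
        )
        # Transform with dbt: models are selected per dataset.
        transform = BashOperator(
            task_id="dbt_run",
            bash_command=f"dbt run --select {meta['dataset']}",
        )
        ingest >> transform

    # Airflow discovers DAGs through module-level globals.
    globals()[dag.dag_id] = dag
```

Onboarding a new data set then means adding a metadata file, not writing another DAG.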
We are still refining the final version of Cherre metadata, but the vision is that YAML files will describe the metadata, and our pipeline will use them to orchestrate the data transformations, invoke any services needed, and expose the data via a variety of options. There will still be cases where we need more complex, hand-written transformations.
The metadata is the source of truth. It helps flag, skip, and automate steps, making our codebase and Connections Services efficient.
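As a concrete, hypothetical example of that flagging and skipping (the field names and dispatch logic below are invented for illustration), a metadata file can declare which Connections Services a dataset needs and which inputs they receive:

```python
# Hypothetical example of metadata acting as the source of truth for which
# Connections Services run and what inputs they receive.
import yaml

EXAMPLE_METADATA = """
dataset: county_tax_assessor
connections:
  address_service:
    enabled: true
    input_columns: [street_address, city, state, zip_code]
  name_service:
    enabled: false
"""

meta = yaml.safe_load(EXAMPLE_METADATA)

for service_name, config in meta["connections"].items():
    if not config.get("enabled", False):
        # The metadata flags this step as unnecessary, so the pipeline skips it
        # without anyone needing to know the service's internals.
        print(f"skipping {service_name}")
        continue
    print(f"invoking {service_name} with columns {config['input_columns']}")
```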
While leveraging microservices has worked well for us, we still need to share code across our codebase. Having code that performs an identical function in different locations makes maintenance almost impossible.
Internal versioned libraries help us share code throughout our codebase. There are two main benefits: we define functionality in one location, and we avoid side effects when updating our internal libraries.
Each library has a purpose, whether that is defining functionality for Cherre-specific concepts or wrapping a third-party library to better fit our needs. By versioning our libraries, we can freely make updates while limiting side effects: code using our internal libraries is pinned to a specific version, so new releases do not instantly modify existing behavior.
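As a hypothetical example of both points (the package name, module, and retry policy are invented), an internal library might wrap `requests` with a standard retry policy so the behavior is defined once, and consumers pin the library’s version so upgrades are opt-in:

```python
# cherre_http/client.py -- hypothetical internal wrapper library.
# The retry policy is defined once here instead of being copy-pasted into
# every service that calls an external API.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_session(total_retries: int = 5, backoff_factor: float = 0.5) -> requests.Session:
    """Return a requests.Session configured with our standard retry policy."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


# A consumer pins the wrapper in its requirements file, so an update to the
# retry policy only reaches it when the version is bumped deliberately, e.g.:
#
#   cherre-http==2.3.1
```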
Our internal libraries help us avoid writing duplicate code. We can continuously improve their functionality without instantly impacting the code that uses them, making our codebase flexible.
Alex Guanga is a Data Engineer at Cherre.