Core principles for dataset onboarding automation at Cherre: measure everything, optimize relentlessly, and deliver fast and often.
At Cherre, we do a lot of cool stuff with data: owner unmasking, building the industry’s largest knowledge graph, providing quick updates via continuous data delivery, and unlimited query flexibility via our GraphQL API. It’s all powered by the data we pull into our highly secure environment in a process we refer to as Dataset Onboarding. With our client and data partner networks growing faster than ever, we are optimizing this process to continue expanding the set of data we have available to power insights and serve our clients.
Real estate data comes in many shapes and sizes, and from a huge range of sources. Cherre has datasets with billions of rows, and datasets with only a dozen rows. We ingest data from modern APIs, legacy government FTP servers and static files, and everything in between. This variance is one of the greatest challenges to the dataset onboarding process, where speed and quality are critical.
Our goal is to provide our customers with fully connected real estate data within hours, if not minutes, while maintaining our rigorous standards for security and data tenancy.
To know we are improving, we have to understand where we are coming from.
Early on, our only benchmarks for the timing of the dataset onboarding process were manual and anecdotal. The data variance mentioned above made it difficult to establish a consistent workflow, let alone track one – the common ground we did find was high-level and required significant interpretation to implement.
Now that Cherre has onboarded enough datasets, once-opaque patterns are becoming clear. We have created a granular, thoroughly documented, and, crucially, highly repeatable workflow that can be used for any new dataset.
We’re using DevOps principles to monitor and improve this new standard workflow. Metrics like percent complete and accurate (a measure of the quality of inputs from one stage of the process to the next), lead time (the time between initiation and completion of a process), and process time (the actual work time during a process) support our data-driven decisions around process improvement.
The most glaring observation from our tracking efforts is that lead time, which in our case arises primarily from handoffs, is by far the biggest drag on our efficiency, often running 5-10x the actual process time. Automation and technical improvements only shrink process time, skewing this ratio even further, so our focus is drawn towards workflow refinement instead.
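To illustrate how these metrics fit together, here is a minimal sketch in Python of computing lead time, process time, and percent complete and accurate from per-stage tracking records. The record fields and structure are hypothetical stand-ins, not our actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical per-stage tracking record; field names are illustrative.
@dataclass
class StageRecord:
    stage: str
    received_at: datetime   # when the handoff reached this stage
    started_at: datetime    # when hands-on work actually began
    finished_at: datetime   # when the stage's output was handed off
    needs_rework: bool      # downstream flagged the output as incomplete or inaccurate

def summarize(records: list[StageRecord]) -> dict[str, float]:
    lead = sum((r.finished_at - r.received_at).total_seconds() for r in records)
    process = sum((r.finished_at - r.started_at).total_seconds() for r in records)
    clean = sum(1 for r in records if not r.needs_rework)
    return {
        "lead_time_hours": lead / 3600,
        "process_time_hours": process / 3600,
        "lead_to_process_ratio": lead / process,       # the 5-10x drag described above
        "percent_complete_accurate": 100 * clean / len(records),
    }
```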
To minimize handoffs between our engineering and product teams, we restructured our process to frontload QA. Early product sign-off on internal technical “contracts” (tests) that dictate how the data looks upon delivery allows our engineers to work freely within our standards to build pipelines that are guaranteed to match client expectations. In practice, our product team collaborates with the client and Cherre engineers to establish the end schema, usually in the form of a GraphQL API query. We then codify this as a schema test, and build a pipeline that makes that test pass.
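As a simplified illustration of what such a contract can look like, here is a pytest-style schema test. The dataset name, columns, and the in-memory stand-in for a warehouse schema lookup are hypothetical, not our production setup.

```python
# Hypothetical contract: the columns and types the client expects to see when
# querying the delivered dataset through the GraphQL API.
EXPECTED_SCHEMA = {
    "parcel_id": "STRING",
    "owner_name": "STRING",
    "assessed_value": "NUMERIC",
    "last_sale_date": "DATE",
}

# In practice this would be looked up from the warehouse's information schema;
# a dict keeps the sketch self-contained.
DELIVERED_SCHEMAS = {
    "tax_assessor_passthrough": {
        "parcel_id": "STRING",
        "owner_name": "STRING",
        "assessed_value": "NUMERIC",
        "last_sale_date": "DATE",
    },
}

def test_delivered_schema_matches_contract():
    actual = DELIVERED_SCHEMAS["tax_assessor_passthrough"]
    # Every agreed-upon column must be present with the agreed-upon type, and
    # nothing unexpected may leak into the client-facing schema.
    assert actual == EXPECTED_SCHEMA
```

Engineering then builds and reworks the pipeline until this test passes, which leaves the team free to change internals without risking the agreed-upon client-facing shape.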
Even the best requirements are subject to change, and this is especially true with the varied use-cases and datasets we see at Cherre. No matter how much we refine our onboarding process, some datasets need further enhancement after their initial delivery.
To deliver these improvements as quickly and flexibly as possible, we take an iterative approach that can be likened to the “yes, and” model from improv comedy. Just as an improviser accepts a premise and builds on it, we make our initial data delivery a “passthrough,” preserving the structure and typing of the data exactly as we receive it from the source and giving ourselves a foundation to build upon.
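Under those assumptions, a passthrough load can be as simple as the sketch below (the file path and use of pandas are illustrative): land the source as-is and defer any renaming, casting, or filtering to later layers.

```python
import pandas as pd

def load_passthrough(source_path: str) -> pd.DataFrame:
    """Hypothetical passthrough load: land the data exactly as the source provides it."""
    # Reading every column as a string avoids imposing our own typing on the
    # source's representation; interpretation is deferred to enhancement layers.
    return pd.read_csv(source_path, dtype=str, keep_default_na=False)
```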
Once we’ve established this initial pipeline, we begin any improvements to it by duplicating the pipeline and working on the new copy. This ensures that any schema provided to our clients is maintained, and fully preserves data lineage. After we’ve verified that the new work meets our standards, we can release it via our continuous data delivery and preserve the old pipeline in case we need to roll back. For any subsequent enhancements, the process begins again. This “layering” approach allows us to quickly deliver value without compromising on flexibility for future change.
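The release step of that layering approach might look something like the sketch below, where the stable client-facing name is repointed at a new copy of the pipeline only after it passes its contract tests. The view and table names and the test stub are hypothetical.

```python
# Hypothetical "layering" release flow: each enhancement is built as a new copy
# of the pipeline (v1, v2, ...), and the client-facing name is repointed only
# once the copy passes its checks. The old copy stays available for rollback.

views: dict[str, str] = {"tax_assessor_current": "tax_assessor_v1"}

def passes_contract_tests(table: str) -> bool:
    # Stand-in for the schema/contract tests described above.
    return True

def release_layer(view: str, new_table: str) -> str:
    previous = views[view]
    if not passes_contract_tests(new_table):
        return previous              # the old layer stays live
    views[view] = new_table          # repoint the client-facing view to the new copy
    return previous                  # retained so rollback is a single swap back

# Release an enhanced copy; v1 is left untouched in case we need to roll back.
previous_table = release_layer("tax_assessor_current", "tax_assessor_v2")
```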
Cherre has come a long way with our process for dataset onboarding, and the journey is only beginning. We’re currently working on completely automating the delivery of the initial layer of passthrough data, meaning that our engineers and data analysts can soon spend all their time adding value above and beyond our baseline of connected real estate data. After we’ve reached that summit, the next peak is self-service onboarding: empowering clients and potential clients to explore the power of connected data themselves.
There’s a lot more climbing to be done, and we’re excited about it – after all, “at the top of the mountain, we’re all snow leopards.”
Gus Rasch is a data engineer at Cherre.