Building a Knowledge Graph Using Messy Real Estate Data

Building an effective knowledge graph requires a combination of technical skills (data science and data engineering) and domain knowledge. These skills are especially important for commercial real estate (CRE) data, which poses many unique challenges.

During his presentation this week at the NYC Data Council, Senior Data Scientist John Maiden shared some of the data science and data engineering challenges that Cherre has encountered along the way. But first, he cleared up an important question – just what IS a knowledge graph?

In its simplest form, a knowledge graph is just that: a graph that stores data as relationships between entities, in contrast to a traditional knowledge base. It’s easier to visualize, it’s traversable, and relationships are a core component that can be analyzed and measured. It’s also straightforward to add new connections.
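To make the idea concrete, here is a minimal sketch of a graph stored as (subject, relation, object) triples. The `KnowledgeGraph` class, the relation names, and the sample entities are all hypothetical and chosen for illustration; this is not a description of Cherre’s implementation.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy in-memory graph: each edge is a (subject, relation, object) triple."""

    def __init__(self):
        self._edges = defaultdict(set)  # node -> {(relation, neighbor)}

    def add(self, subject, relation, obj):
        # Store both directions so the graph can be traversed from either end.
        self._edges[subject].add((relation, obj))
        self._edges[obj].add((f"inverse:{relation}", subject))

    def neighbors(self, node):
        return sorted(self._edges.get(node, ()))

    def reachable(self, start, max_hops):
        """Breadth-first traversal: hop count for every node within max_hops."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand past the hop limit
            for _, nxt in self._edges.get(node, ()):
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        del seen[start]
        return seen


# Made-up sample data.
kg = KnowledgeGraph()
kg.add("ACME HOLDINGS LLC", "owns", "123 MAIN ST")
kg.add("JANE DOE", "officer_of", "ACME HOLDINGS LLC")
kg.add("BIG BANK", "lends_on", "123 MAIN ST")

hops = kg.reachable("JANE DOE", max_hops=2)
```

Adding a new connection is a single `add` call, and `reachable` shows what “traversable” means in practice: starting from a person, the walk reaches the company in one hop and the property in two.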

Knowledge graphs are only as good as the questions that drive them. For example, we use knowledge graphs of available CRE data to answer questions like:

  • Who is the true owner for the property?
  • Which properties has this owner bought and sold in the past five years?
  • Which lenders are seeing a larger-than-average number of defaults?
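As a rough illustration of how two of these questions could be phrased against edge data, the sketch below works on small lists of made-up records. The record layouts, the function names, and the reading of “larger than average” as “more defaults than the mean per lender” are assumptions for the example, not the actual schema or logic.

```python
from collections import Counter
from datetime import date

# Hypothetical edge records; a real graph would carry many more attributes.
deeds = [
    # (owner, action, property, date)
    ("ACME HOLDINGS LLC", "bought", "123 MAIN ST", date(2016, 3, 1)),
    ("ACME HOLDINGS LLC", "sold", "123 MAIN ST", date(2018, 7, 15)),
    ("ACME HOLDINGS LLC", "bought", "9 ELM ST", date(2009, 5, 20)),
]
loans = [
    # (lender, property, defaulted)
    ("BIG BANK", "123 MAIN ST", True),
    ("BIG BANK", "9 ELM ST", True),
    ("BIG BANK", "77 OAK AVE", False),
    ("SMALL CREDIT UNION", "5 PINE RD", False),
    ("SMALL CREDIT UNION", "8 LAKE DR", False),
    ("CITY LENDER", "40 PARK PL", True),
]


def years_before(d, years):
    """Same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def recent_activity(owner, as_of, years=5):
    """Purchases and sales by `owner` in the `years` before `as_of`."""
    cutoff = years_before(as_of, years)
    rows = [(when, action, prop) for who, action, prop, when in deeds
            if who == owner and cutoff <= when <= as_of]
    return sorted(rows)


def lenders_above_average_defaults(loan_rows):
    """Lenders whose default count exceeds the mean count per lender."""
    defaults = Counter({lender: 0 for lender, _, _ in loan_rows})
    for lender, _, defaulted in loan_rows:
        defaults[lender] += int(defaulted)
    average = sum(defaults.values()) / len(defaults)
    return sorted(name for name, n in defaults.items() if n > average)


activity = recent_activity("ACME HOLDINGS LLC", as_of=date(2019, 11, 1))
flagged = lenders_above_average_defaults(loans)
```

The “true owner” question is harder to sketch this way, because it depends on first resolving which names and addresses refer to the same entity.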

We’re also working on analyzing questions like:

  • What is the owner’s strategy? What types of properties do they buy?
  • What models can we build from graph data?

But here’s the challenge, and where the data science and data engineering come in. This is what the New York City graph looks like:

The NYC graph alone has millions of edges and nodes! 

All data gets equal weight in a knowledge graph – nodes can be properties, people, corporations, or even contact information. Adding to the complexity, addresses and names (of people and corporations) come in different formats, with spelling variations or typos. There’s also a need to disambiguate common names.

We do a lot to programmatically standardize the data (names, addresses, and many other objects), which helps us build a cleaner knowledge graph:
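A minimal sketch of what programmatic standardization can look like is shown here. The lookup tables are deliberately tiny and hypothetical, and the rules (uppercase, strip punctuation, map common suffixes to one spelling) are assumptions for illustration rather than the rules used in production.

```python
import re

# Hypothetical, deliberately small lookup tables.
STREET_SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "ROAD": "RD"}
ENTITY_SUFFIXES = {"INCORPORATED": "INC", "CORPORATION": "CORP", "LIMITED": "LTD"}


def _tokens(text):
    """Uppercase, drop periods and apostrophes, split on everything else."""
    cleaned = re.sub(r"[.']", "", text.upper())       # "L.L.C." -> "LLC"
    cleaned = re.sub(r"[^A-Z0-9]+", " ", cleaned)     # other punctuation -> space
    return cleaned.split()


def standardize_address(text):
    """Map each token through the street-suffix table and rejoin."""
    return " ".join(STREET_SUFFIXES.get(tok, tok) for tok in _tokens(text))


def standardize_name(text):
    """Map each token through the entity-suffix table and rejoin."""
    return " ".join(ENTITY_SUFFIXES.get(tok, tok) for tok in _tokens(text))


addr_a = standardize_address("123 Main Street")
addr_b = standardize_address("123 main st.")
name_a = standardize_name("Acme Holdings, L.L.C.")
name_b = standardize_name("ACME HOLDINGS LLC")
```

Once both spellings of an address or a name standardize to the same string, they collapse into a single node instead of two disconnected ones.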

When it comes to standardization, here are some of the key lessons we learned:

  • Business knowledge is critical because it helps provide context to understand the data.
  • Learn to deal with scale. We have to ingest hundreds of millions of names and addresses.
  • Be “OK” with ambiguity in some of the data (it can be really noisy!), but work hard to improve it over time.
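The scale and ambiguity points can be sketched together. Comparing every name against every other name is not feasible at hundreds of millions of records, so one common approach is blocking: only compare names that share a cheap key. The sketch below assumes a hypothetical blocking key (the first three characters) and uses the standard library’s `difflib.SequenceMatcher` as a stand-in similarity measure. It keeps the score alongside each candidate pair rather than forcing a yes/no decision, so uncertain matches can be revisited later.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name):
    # Hypothetical blocking rule: only names sharing their first three
    # characters are ever compared with each other.
    return name[:3]


def candidate_pairs(names, threshold=0.9):
    """Return (name_a, name_b, score) for similar names within each block."""
    blocks = defaultdict(set)
    for name in names:
        blocks[block_key(name)].add(name)

    pairs = []
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((a, b, score))
    # Highest-confidence candidates first.
    return sorted(pairs, key=lambda p: -p[2])


# Made-up sample names; the second is a typo of the first.
names = [
    "ACME HOLDINGS LLC",
    "ACME HOLDNGS LLC",
    "ACME REALTY CORP",
    "ZENITH PARTNERS LP",
]
pairs = candidate_pairs(names)
```

The trade-off is that a typo in the first three characters would put two matching names in different blocks, which is one reason to accept some ambiguity up front and improve the rules over time.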

If you’re interested in joining the team transforming real estate investing and underwriting into a science, check out our open jobs – we are hiring!