Skim Engine Updates: Address Extraction

Updates with our customers in mind

The main advantage of our address model is its ability to accurately detect and extract addresses from any web page.
With this new feature of the Skim Engine™, we aim to fill the gap left by the absence of any address detection model, whether open source or commercial.

Technical approach

Our new address feature was developed by following a two-step approach: first, we developed an address detection model, i.e., a Machine Learning classification model that predicts which blocks in a web page are likely to contain an address; then, we combined it with an address parser in order to extract the components of the address (e.g., road, postcode, city, country).

We framed the problem of detecting address blocks in web pages as a Machine Learning classification problem: the goal was to predict whether a given HTML block contains an address. To do so, we devised a broad set of features that gives the Machine Learning algorithm the information it needs to perform the task accurately. This set included HTML-based, NLP-based, and domain-knowledge-inspired features. We also had to account for the natural imbalance of the data (i.e., the proportion of address blocks to non-address blocks in web pages is typically low) and adopt strategies to balance the training dataset. The final address detection model performed well on new, unseen web pages: the F1 measure of the address class is above 90% on the test set.
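To make the ideas in this step more concrete, here is a minimal Python sketch of what block-level features, dataset balancing, and the F1 measure could look like. It is an illustration only, not the Skim Engine™ implementation: the function names, the street keyword list, the UK-style postcode pattern, and the choice of random oversampling as the balancing strategy are all assumptions made for this example.

```python
import random
import re

# Hypothetical keyword list and UK-style postcode pattern (illustrative assumptions).
STREET_WORDS = {"street", "road", "avenue", "lane", "drive", "st", "rd", "ave"}
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def block_features(tag, text):
    """Turn one HTML block (tag name plus visible text) into a feature dict."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return {
        # HTML-based feature: is the block wrapped in an <address> tag?
        "is_address_tag": tag.lower() == "address",
        # NLP-style features: token count and share of digit characters.
        "n_tokens": len(tokens),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        # Domain-knowledge features: street keywords and postcode shape.
        "has_street_word": any(t in STREET_WORDS for t in tokens),
        "has_postcode": bool(POSTCODE_RE.search(text)),
    }


def oversample_minority(examples, seed=0):
    """Balance (features, label) pairs by randomly duplicating the smaller class.

    Labels are 1 for address blocks and 0 for everything else.
    """
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    small, large = sorted((pos, neg), key=len)
    if not small:
        return list(examples)  # nothing to resample from
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    balanced = pos + neg + extra
    rng.shuffle(balanced)
    return balanced


def f1_score(y_true, y_pred, positive=1):
    """F1 measure of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like this, the feature dicts would be fed to a classifier of your choice, and balancing would be applied to the training split only, so that the F1 measure on the test set still reflects the real proportion of address blocks.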

Once the address detection model has predicted which blocks are likely to contain an address, we apply an address parser to the corresponding text strings to extract the address elements and output a normalised version of the address. The locations associated with each detected address are also geotagged and provided separately in the locations feature of the Skim Engine™.
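The parsing step can be pictured with a small Python sketch. This is a toy, rule-based stand-in written for illustration, not the parser used in the Skim Engine™: it assumes a comma-separated address with a UK-style postcode, and the names `parse_address` and `normalise` are hypothetical.

```python
import re

# UK-style postcode pattern (an assumption for this example), case-insensitive.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)


def parse_address(text):
    """Split a comma-separated address string into labelled components."""
    components = {}
    match = POSTCODE_RE.search(text)
    if match:
        # Normalise the postcode: upper case, single space before the last 3 characters.
        raw = re.sub(r"\s+", "", match.group()).upper()
        components["postcode"] = raw[:-3] + " " + raw[-3:]
        # Remove the postcode so it does not pollute the other components.
        text = text[:match.start()] + text[match.end():]
    # Collapse repeated whitespace in each part and drop empty parts.
    parts = [" ".join(p.split()) for p in text.split(",")]
    parts = [p for p in parts if p]
    # Assume the remaining parts appear in the order road, city, country.
    for key, value in zip(("road", "city", "country"), parts):
        components[key] = value.title()
    return components


def normalise(components):
    """Render the parsed components as one string in a fixed order."""
    order = ("road", "city", "postcode", "country")
    return ", ".join(components[k] for k in order if k in components)
```

For example, `normalise(parse_address("10 downing street,  london sw1a 2aa, united kingdom"))` gives `10 Downing Street, London, SW1A 2AA, United Kingdom`. Real-world addresses vary far more than this sketch allows for, which is why a dedicated address parser is used for this step.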

Solve a problem

The Postal Address extraction model recently released to the Skim Engine™ lets our clients extract an address from a website, PDF, or other unstructured source (including cells in an Excel sheet). This is a highly valuable feature for many sectors. We originally developed it for the insurance industry, where premises location and distance to emergency services can be used for further risk modelling. Since then, we have found multiple other use cases in financial services and healthcare. If you have a use for our Address Extraction feature, please get in touch.

Other recent releases can be found here: Geo Tagging and Entity Extraction