Extracting Media from HTML

Over the last decade, the web has developed rapidly from static pages with little interactive content to media-rich, dynamic pages. This can be attributed to new web technologies that make embedding multimedia content remarkably simple. For example, adding a video in HTML5 is as easy as wrapping its source in <video> tags. At the same time, a variety of JavaScript libraries allow easy embedding of external content, such as react-player, which supports videos from all major video platforms. Thus, even a hobbyist web developer can easily use multimedia in their website.

This development facilitates the use not only of multimedia, but of any external content that may be useful within a web page for reference. For example, news articles now often contain Tweet reactions as well as gallery applications associated with the reported news event. This has strong implications for the task of content extraction: a good HTML extractor needs to extract not only the correct text from the page, but also all relevant media, and it must distinguish them from boilerplate media such as background sound and icon images. While most extraction services focus primarily on article-like content, we at Deeper Insights™ (formerly Skim Technologies) believe there is a growing demand for a more accurate and versatile content extractor. As a consequence, we rise to the challenge of performing equally well on text-focussed pages such as articles and forum discussions, and on media-rich pages such as galleries, social media posts and landing pages. After looking at many web pages containing different types of embedded content, we identified the four most frequent embedded content types: images, videos, audio and social media posts.

In this post we would like to share our journey and the lessons we learned when we set out to extract these four content types. Finally, we present a short evaluation of our extractor’s performance.

Why not extract everything?

As a minimum requirement, a media extractor has to be able to filter the media encountered on the page. We believe that most of our users are interested in the media that are part of the main content, but not in the media contributing to the page’s design or navigation (also called boilerplate media). Consider a typical news article: the relevant media are the main video and the embedded tweet that the article refers to. Additionally, the author image could be relevant for some use cases. In contrast, the images in the sidebar serve a purely navigational purpose and thus should not be extracted. Optimally, an extractor should not only clean the output of boilerplate, but also assign functional roles to the extracted media. In this particular example, the extracted media could be labelled with roles like AUTHOR_IMAGE, MAIN_VIDEO and TWITTER_REACTION.
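
To make this concrete, the output of such a role-aware extractor could look roughly like the sketch below. The schema and the URLs are illustrative, not a description of our production format:

    # Hypothetical output of a role-aware media extractor.
    # Role names follow the examples above; everything else is illustrative.
    extracted_media = [
        {"type": "video", "role": "MAIN_VIDEO",
         "src": "https://example.com/videos/main.mp4"},
        {"type": "social", "role": "TWITTER_REACTION",
         "src": "https://twitter.com/user/status/123456789"},
        {"type": "image", "role": "AUTHOR_IMAGE",
         "src": "https://example.com/images/author.jpg"},
    ]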

Why the variety of web technologies makes extraction more difficult

The biggest challenge of media extraction is the sheer number of ways to embed media into a web page. For example, a simple image can be defined in SVG format or as a byte array within the main HTML file. More commonly, however, images are fetched from an external source. HTML5 supports numerous ways of embedding external data into an HTML file. An image source can be specified in the metadata of the page or within the page’s body. Within the body, an image can be specified in <img> tags, added via CSS styling, generated from a script, or inserted by a web plugin (note that plugin technologies are becoming increasingly obsolete). When a browser renders a web page, one of its tasks is to generate images from SVG or byte specifications where necessary and to fetch all required external resources. Thus, to capture all media within a page, a model would have to execute the same steps as a browser. However, running a browser for the task of content extraction is costly, so it is often preferable to employ a simpler model at the expense of missing some media. Such a model could, for example, spot a JavaScript-embedded image or video in the HTML without needing to execute the script that generates the corresponding <img> or <video> elements.
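
As a simple illustration of this idea, the following sketch scans the static HTML for media sources without rendering the page. It uses the BeautifulSoup library; the tag and attribute lists are simplified and would miss CSS- and plugin-embedded media:

    from bs4 import BeautifulSoup

    def spot_static_media(html):
        """Collect media sources declared directly in the markup,
        without rendering the page or executing any scripts."""
        soup = BeautifulSoup(html, "html.parser")
        sources = set()
        for tag in soup.find_all(["img", "video", "audio", "source", "iframe", "embed"]):
            # Direct source attributes (data-src is a common lazy-loading convention).
            for attr in ("src", "data-src", "poster"):
                if tag.get(attr):
                    sources.add(tag[attr])
            # srcset lists several size variants of the same image.
            for candidate in tag.get("srcset", "").split(","):
                url = candidate.strip().split(" ")[0]
                if url:
                    sources.add(url)
        return sources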

How to choose only the relevant media from a page

As mentioned above, media have a functional role within the web page, which determines whether they are considered content or boilerplate with regard to a specific purpose. We started out by building a general-purpose media classifier distinguishing between content and boilerplate images. We believe that this is a first step towards a more fine-grained media classification approach. Note that the content-boilerplate distinction is primarily relevant for images, since videos, audio and social media posts are very rarely boilerplate (with the exception of background sound and video ads).

In general, content media tend to be close to or embedded within the main text content and are often more salient than boilerplate images, for example by being larger. Boilerplate images are more likely to be linked to external content and thus serve a navigational purpose. However, real pages often challenge these assumptions: unrelated images can lie in close proximity to the text content, relevant images can be anchored (linking to related articles), and relevant gallery image icons can be the same size as unrelated boilerplate images.
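
To illustrate, the signals described above can be turned into features for such a classifier. The sketch below computes a few of them for a parsed <img> element (e.g. a BeautifulSoup tag); the feature set is illustrative rather than a description of our production model:

    def _to_int(value):
        """Parse a numeric HTML attribute, tolerating missing or non-numeric values."""
        try:
            return int(value)
        except (TypeError, ValueError):
            return 0

    def image_features(img_tag):
        """Toy feature set for a content-vs-boilerplate image classifier."""
        return {
            # Salience: content images tend to be larger than icons and banners.
            "area": _to_int(img_tag.get("width")) * _to_int(img_tag.get("height")),
            # Navigational images are often wrapped in a link.
            "is_linked": img_tag.find_parent("a") is not None,
            # Content images are more often described for the reader.
            "has_alt_text": bool(img_tag.get("alt")),
            # A caption is a strong signal for content images.
            "has_caption": img_tag.find_next_sibling("figcaption") is not None,
        }

As the examples above show, none of these signals is reliable on its own, which is why we treat them as soft features for a classifier rather than as hard rules.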

What media should be extracted from pages that carry very little text content, so that the extracted text cannot be used as a point of reference? Consider a news homepage such as the BBC’s: it does not contain a single main piece of content. Instead, there are links to one main story and many other currently relevant articles. Should only the image leading to the main story be extracted, or also the images linking to the other stories? And if so, should the images linking to the BBC Three media in the footer be included? At the very least, all icons and banner images should be filtered out during extraction. These are the sorts of decisions we face every day at Deeper Insights.

How we extract social media posts

What if the page you are extracting contains embedded social media posts? Many large content-sharing platforms specify a standard way of embedding their content into your page, which is commonly used. For example, YouTube video pages and Twitter tweets let the user generate an HTML code snippet to embed the relevant piece of content into their page. This enables us to apply simple rules to spot embedded content from large content providers, but it also makes these rules vulnerable to any changes in the embed pattern by the provider. Moreover, all large social media platforms implement the oEmbed standard, which allows one to retrieve an embed representation for a piece of content via a single API call. oEmbed was developed to facilitate the embedding of content on third-party sites without having to parse the resource directly. Figures 4 and 5 show a response to a Twitter oEmbed API call and the resulting embedded tweet representation. By querying the provider, we can thus use the returned information to support the extraction and summarisation of social media posts.
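
For example, a tweet’s embed representation can be retrieved with a single HTTP request. Below is a minimal sketch using the requests library and Twitter’s public oEmbed endpoint (correct at the time of writing):

    import requests

    def fetch_tweet_oembed(tweet_url):
        """Retrieve the oEmbed representation of a tweet. The response is a
        JSON object with fields such as 'html' (the embeddable snippet)
        and 'author_name', as defined by the oEmbed standard."""
        response = requests.get(
            "https://publish.twitter.com/oembed",
            params={"url": tweet_url},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()

    # Example: the returned 'html' field contains the same blockquote
    # snippet that Twitter's own embed dialog generates.
    # oembed = fetch_tweet_oembed("https://twitter.com/jack/status/20")
    # print(oembed["html"])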

Media Extraction with Deeper Insights

We have been working hard at Deeper Insights to increase the quality of our extraction services on text as well as embedded media content. We believe that this is the only way to accurately and reliably extract media content, as embedded content becomes increasingly relevant and important to brands, businesses and their analysts. This shows in the quality of our image and video extraction, and we are currently working on the extraction of embedded audio content and social media posts, which we will share more about soon.

High-quality extraction is crucial to the success of data projects. However, extraction can be surprisingly tricky, because HTML can be used in creative ways to achieve the same effect. On one example page, Deeper Insights manages to extract all relevant images, while Competitor A only extracts the placeholder image of the main video. We are also currently deploying the functionality to extract embedded tweets; check out our APIs for this feature in a few weeks’ time. However, both services fail to extract the two videos on the page. These videos are embedded via a JavaScript-generated media player. Extracting such videos is a challenge we are currently working on, ensuring we offer the breadth of content types that people have come to expect from us.
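
One way to approach this, sketched below, is to scan the page’s inline scripts for direct video URLs without executing any JavaScript. The pattern is a simplification and will miss players that assemble URLs at runtime:

    import re
    from bs4 import BeautifulSoup

    # Direct video file URLs inside script code (an illustrative pattern;
    # player-specific configuration formats would need dedicated rules).
    VIDEO_URL_RE = re.compile(r"https?://[^\s\"']+\.(?:mp4|webm|m3u8)", re.I)

    def spot_scripted_videos(html):
        """Scan inline scripts for video URLs without executing them."""
        soup = BeautifulSoup(html, "html.parser")
        urls = set()
        for script in soup.find_all("script"):
            urls.update(VIDEO_URL_RE.findall(script.get_text()))
        return urls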

Media Extraction Benchmark Evaluation

To compare our performance to similar services on the market, we evaluated our image extraction and benchmarked it against Competitor A by posting snapshots of HTML pages to both services. We created a data set of 83 media-rich pages containing images. The data was gathered by making Google queries on 14th August 2017 for a broad range of topics (from healthcare to car engines). To assess image extraction performance, we labelled each page with all image URLs we would expect to be extracted for that page. Note that labelling web pages for the occurrence of media is a complex and partially subjective procedure. Firstly, the choice of browser and the browser settings can influence which media are displayed and thus may bias the labelling procedure. Note also that images are sometimes declared inline, for example via a definition within <svg> tags, and thus have no source URL. These images could not be processed automatically during evaluation and were revisited manually in the evaluation of both services. A final note is that media (especially images) are often represented in various sizes, formats and resolutions within one page. During the evaluation, we used a heuristic to identify image URLs that point to variations of the same image: we group images by filename, i.e. we strip off the URL path, query attributes and the file extension.
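
The grouping heuristic boils down to a few lines; here is a simplified sketch (the example URLs are made up):

    from os.path import basename, splitext
    from urllib.parse import urlparse

    def image_group_key(image_url):
        """Group image URL variants by their bare filename.
        E.g. 'https://cdn.example.com/small/cat.jpg?w=100' and
        'https://example.com/large/cat.png' both map to 'cat'."""
        path = urlparse(image_url).path           # drops scheme, host and query
        name, _extension = splitext(basename(path))
        return name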

The precision measure can be intuitively understood as how many of the selected images are actually relevant, while recall measures how many of the relevant images were selected. The F1 score combines both measures into one. The difference in extraction performance between Deeper Insights and Competitor A shows primarily in the precision measure, which is sensitive to the number of false positives. On our data set, we observed that Competitor A extracts a much larger number of images that do not qualify as relevant, for example icons, banners and navigation elements that are not related to the main content of the page. We also outperform Competitor A on image-level recall and on the number of documents containing false negatives, i.e. the Deeper Insights extractor misses fewer relevant images. Note that all errors (by the Deeper Insights extractor as well as by Competitor A) were manually reviewed, but are of course subject to the reviewer’s judgment.
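
For reference, the three measures are computed from the counts of true positives, false positives and false negatives:

    def precision_recall_f1(tp, fp, fn):
        """Compute the standard measures from true positives (tp),
        false positives (fp) and false negatives (fn)."""
        precision = tp / (tp + fp)   # how many extracted images are relevant
        recall = tp / (tp + fn)      # how many relevant images were extracted
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1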

As the world of unstructured data grows, we work tirelessly to build solutions that help bring structure to the web. Our content extractor is not the fastest in the world, but it is certainly one of the most accurate. If you are interested in working with us, or for us, in developing products that utilise structured, accurate and reliable data, then please get in touch.
