Extracting media from HTML

Published on

November 7, 2023

Over the last decade, there has been a rapid shift from static web pages with little interactive content to media-rich, dynamic web pages. This can be attributed to new web technologies that make embedding multimedia content incredibly simple. For example, adding a video in HTML5 is as easy as wrapping it in <video> tags. At the same time, a variety of JavaScript libraries allow easy embedding of external content, such as react-player, which supports videos from all the big video content platforms. Thus, even a hobby web developer can easily use multimedia on their website.

This development facilitates not only the use of multimedia, but also of any external content that may be useful within a web page for reference. For example, news articles now often contain Tweet reactions as well as gallery applications associated with the reported news event. This has strong implications for the task of content extraction. A good HTML extractor needs not only to extract the correct text from the page, but also to extract all relevant multimedia and distinguish it from boilerplate media such as background sound and icon images. While most extraction services focus primarily on article-like content, we at Deeper Insights™ (formerly Skim Technologies) believe there is a growing demand for a more accurate and versatile content extractor. As a consequence, we rise to the challenge of performing equally well on text-focussed pages like articles and forum discussions and on media-rich pages such as galleries, social media posts and landing pages. After looking at many web pages containing different types of embedded content, we identified the four most frequent embedded content types: images, videos, audio and social media posts.

In this post we would like to share our journey and the lessons we learned when we set out to extract those four content types. Finally, we present a short evaluation of our extractor's performance.

Why not extract everything?

As a minimum requirement, an extractor has to be able to filter the media encountered on the page. We believe that most of our users are interested in the media that are part of the main content, but not in the media contributing to the page's design or navigation (also called boilerplate media). In this example, the relevant media are the main video and the tweet that the article is referring to. Additionally, the author image could be relevant for some use cases. In contrast, the images in the sidebar serve a purely navigational purpose and thus should not be extracted. Ideally, an extractor should not only clean the output of boilerplate, but also assign functional roles to the extracted media. In this particular example, the extracted media could be labelled with roles like AUTHOR_IMAGE, MAIN_VIDEO and TWITTER_REACTION.

Why the variety of web technologies makes extraction more difficult

The biggest challenge of media extraction is the number of ways to embed media into a web page. For example, a simple image can be defined in SVG format or as a byte array within the main HTML file. More commonly, however, images are fetched from an external source. HTML5 supports numerous ways of embedding external data into an HTML file. An image source can be specified in the metadata of the page or within the page's body. Within the body, an image can be specified in <img> tags, added via CSS styling, generated from a script or inserted by a web plugin (note that plugin technologies are becoming increasingly obsolete). When a browser renders a web page, one of its tasks is to generate images from SVG or a byte specification where necessary and to fetch all required external resources. Thus, to be able to capture all media within a page, a model would have to execute the same steps as a browser. However, running a browser for the task of content extraction is costly, and thus it is often preferable to employ a simpler model at the expense of missing some media. Such a model could, for example, spot a JavaScript-embedded image or video in HTML without needing to execute the script that generates the corresponding <img> or <video> objects.
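
To make the "simpler model" idea concrete, here is a minimal sketch in Python of a static scan that collects media candidates from the embedding routes listed above without rendering the page or running any script. It is an illustration under stated assumptions, not a description of our production extractor: the class and function names are hypothetical, page metadata is represented only by the Open Graph og:image tag, CSS images are only read from inline style attributes, and the file-extension list used to spot media URLs inside scripts is an arbitrary choice.

```python
import re
from html.parser import HTMLParser

# url(...) references inside an inline CSS declaration such as background-image.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")
# Quoted strings in script text that end in a common media file extension
# (the extension list is an assumption made for this sketch).
MEDIA_IN_SCRIPT = re.compile(
    r"""["']([^"']+\.(?:jpe?g|png|gif|webp|mp4|webm))["']""", re.IGNORECASE
)


class MediaCandidateParser(HTMLParser):
    """Collects (origin, reference) pairs without rendering or executing scripts."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self._in_script = True
        elif tag == "img" and attrs.get("src"):
            src = attrs["src"]
            # data: URIs carry the image bytes inline instead of a remote source.
            origin = "inline-data" if src.startswith("data:") else "img"
            self.candidates.append((origin, src))
        elif tag == "meta" and attrs.get("property") == "og:image" and attrs.get("content"):
            self.candidates.append(("metadata", attrs["content"]))
        elif tag == "svg":
            # Inline SVG has no source URL, so only its presence is recorded.
            self.candidates.append(("inline-svg", None))
        # Any element may carry a CSS image in its inline style attribute.
        for url in CSS_URL.findall(attrs.get("style") or ""):
            self.candidates.append(("css", url))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are scanned as plain text; nothing is executed.
        if self._in_script:
            for url in MEDIA_IN_SCRIPT.findall(data):
                self.candidates.append(("script", url))


def collect_media_candidates(html):
    parser = MediaCandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

A scan like this is cheap, but it misses anything whose URL is assembled at runtime or defined in an external stylesheet, which is exactly the trade-off described above.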

How to choose only the relevant media from a page

As mentioned above, media have a functional role within the web page, which determines whether they are considered content or boilerplate with regard to a specific purpose. We started out by building a general purpose media classifier distinguishing between content and boilerplate images. We believe that this is a first step towards a more fine-grained media classification approach. Note that the content-boilerplate distinction is primarily relevant for images since videos, audio or social media posts are very rarely boilerplate (with the exception of background sound and video ads).

In general, content media tend to be close to or embedded within the main text content and are often more salient than boilerplate images, for example by being larger. Boilerplate images are more likely to be linked to external content and thus serve a navigational purpose. However, the current example challenges some of these assumptions: unrelated images lie in proximity to the text content, relevant images are anchored (linking to related articles), and relevant gallery image icons are the same size as unrelated boilerplate images.
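
The three cues above (proximity to the main text, size, and whether the image is a link) can be written down as a naive rule-based scorer. The sketch below is purely illustrative and is not the classifier we built: the feature names, the two-out-of-three voting rule and the area threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ImageFeatures:
    width: int
    height: int
    inside_main_text: bool  # embedded within or next to the main text block
    wrapped_in_link: bool   # the image sits inside an <a> element


def looks_like_content(img, min_area=40_000):
    """Naive vote over three cues; min_area is an arbitrary illustrative threshold."""
    score = 0
    if img.inside_main_text:
        score += 1
    if img.width * img.height >= min_area:
        score += 1
    if not img.wrapped_in_link:
        score += 1
    return score >= 2


# A large, unlinked image inside the article body is accepted ...
lead_image = ImageFeatures(800, 450, inside_main_text=True, wrapped_in_link=False)
# ... but a small, linked gallery thumbnail is rejected even if it is relevant.
gallery_thumbnail = ImageFeatures(120, 90, inside_main_text=True, wrapped_in_link=True)
```

The gallery thumbnail case shows why fixed rules of this kind break down on pages like the example: a relevant image can be both small and anchored.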

What media should be extracted on pages that carry very little text content and thus the extracted text cannot be used as a point of reference? Since the example we took into consideration is a homepage, it doesn't contain one single piece of content. Instead, there are links to one main story and many other currently relevant articles. Should only the image leading to the main story be extracted or images linking to the other stories too? And if so, should the images linking to the BBC Three media in the footer be included? At the very least, all icons and banner images should be filtered out by extraction. These are the sorts of decisions we face every day at Deeper Insights™.

How we extract social media posts

What if the page you are extracting contains embedded social media posts? Many large content sharing platforms specify a standard way of embedding their content into your page, which is commonly used. For example, YouTube video pages and Twitter tweets let the user generate an HTML code snippet to embed the relevant piece of content into their page. This enables us to apply simple rules to spot embedded content from large content providers, but at the same time makes us vulnerable to any changes the provider makes to the embed pattern. Moreover, all large social media platforms allow one to retrieve an embed representation for a piece of content via a single API call by implementing the oEmbed standard. oEmbed was developed to facilitate the embedding of content on third-party sites without having to parse the resource directly. Figures 4 and 5 show a response for a Twitter oEmbed API call and the embedded tweet representation. By asking the provider, we can thus use the returned information to support the extraction and summarisation of social media posts.
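
As a rough sketch of both ideas, the Python below spots embedded tweets and YouTube videos with simple pattern rules and then builds the matching oEmbed request URL. It rests on assumptions that should be checked against each provider's current documentation: that embedded tweets appear as a blockquote with the class twitter-tweet containing a link to the tweet, that YouTube embeds appear as an iframe pointing at /embed/<video id>, and that the oEmbed endpoints are the ones listed in OEMBED_ENDPOINTS. The class and function names are hypothetical, and no network request is made.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Embed patterns assumed for this sketch; providers may change them at any time.
TWEET_URL = re.compile(r"https?://(?:www\.)?twitter\.com/[^/]+/status/\d+")
YOUTUBE_EMBED = re.compile(r"(?:https?:)?//www\.youtube(?:-nocookie)?\.com/embed/([\w-]+)")

# oEmbed endpoints as assumed here; verify against the provider documentation.
OEMBED_ENDPOINTS = {
    "twitter": "https://publish.twitter.com/oembed",
    "youtube": "https://www.youtube.com/oembed",
}


class EmbedSpotter(HTMLParser):
    """Finds (provider, canonical content URL) pairs using simple rules."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._in_tweet = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote" and "twitter-tweet" in (attrs.get("class") or "").split():
            self._in_tweet = True
        elif tag == "a" and self._in_tweet:
            # Only the link to the tweet itself matches; hashtag links do not.
            match = TWEET_URL.match(attrs.get("href") or "")
            if match:
                self.found.append(("twitter", match.group(0)))
        elif tag == "iframe":
            match = YOUTUBE_EMBED.match(attrs.get("src") or "")
            if match:
                watch_url = "https://www.youtube.com/watch?v=" + match.group(1)
                self.found.append(("youtube", watch_url))

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self._in_tweet = False


def spot_embeds(html):
    spotter = EmbedSpotter()
    spotter.feed(html)
    spotter.close()
    return spotter.found


def oembed_request_url(provider, content_url):
    """Builds the oEmbed request URL; 'url' is the required oEmbed parameter."""
    return OEMBED_ENDPOINTS[provider] + "?" + urlencode({"url": content_url})
```

Fetching the resulting URL would return the provider's own description of the post, so the extractor does not need to parse the rendered embed itself.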

Media Extraction with Deeper Insights™

We have been working hard at Deeper Insights™ to increase the quality of our extraction services on text as well as embedded media content. We believe that this is the only way to accurately and reliably extract media content, as embedded content becomes increasingly relevant and important to brands, businesses and their analysts. This shows in the quality of our image and video extraction, and we are currently working on the extraction of embedded audio content and social media posts, which we'll share more about soon.

High-quality extraction is crucial to the success of data projects. However, extraction can be surprisingly tricky because HTML can be used in creative ways to achieve the same effect. Deeper Insights™ manages to extract all relevant images on the site, while Competitor A only extracts the placeholder image of the main video. We are also currently deploying the functionality to extract embedded tweets; check out our APIs for this feature in a few weeks' time. However, both services fail to extract the two videos on the site. The videos are embedded into the page via a JavaScript-generated media player. Extracting these videos is a challenge we're currently working on, so that we cover the breadth of content types that people are coming to expect from us.

Media Extraction Benchmark Evaluation

To compare our performance to similar services on the market, we evaluated our image extraction and benchmarked it against Competitor A by posting snapshots of HTML pages to both services. We created a data set consisting of 83 media-rich pages containing images. The data was gathered by making Google queries on 14 August 2017 for a broad range of topics (from healthcare to car engines). To assess the image extraction performance, we labelled each page with all image URLs we would expect to be extracted for that page. Note that labelling web pages for the occurrence of media is a complex and partially subjective procedure. Firstly, the choice of browser and the browser settings can influence which media are displayed and thus may bias the labelling procedure. Secondly, images are sometimes declared inline, for example via a definition within <svg> tags, and thus don't have a source URL. These images could not be processed automatically during evaluation and were revisited manually in the evaluation of both services. Finally, media (especially images) are often represented in various sizes, formats and resolutions within one page. During the evaluation, we used a heuristic to identify image URLs that point to variations of the same image: we used the filename to group images, i.e. we stripped off the URL path, attributes and the file extension.
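
The filename heuristic can be sketched in a few lines of Python. This is a reconstruction from the description above rather than the evaluation code itself: the function names are hypothetical, "attributes" is interpreted here as the URL's query string and fragment, and the example URLs are made up.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def image_key(url):
    """Reduces an image URL to its bare filename without path, query or extension."""
    path = urlsplit(url).path            # drops the query string and fragment
    filename = posixpath.basename(path)  # drops the directory part of the path
    stem, _extension = posixpath.splitext(filename)
    return stem


def group_variants(urls):
    """Groups URLs that are assumed to point to variations of the same image."""
    groups = defaultdict(list)
    for url in urls:
        groups[image_key(url)].append(url)
    return dict(groups)


# Made-up URLs: two renditions of one image plus one unrelated image.
example_groups = group_variants([
    "https://cdn.example.com/img/640/cat.jpg?w=640",
    "https://example.com/static/cat.webp",
    "https://example.com/static/dog.png",
])
```

Being a heuristic, this grouping has blind spots: renditions whose size is encoded in the filename itself would end up in separate groups, and unrelated images that happen to share a filename would be merged.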

Precision can be intuitively understood as the share of selected images that are actually relevant, while recall is the share of relevant images that were selected. The F1 score combines both measures into one. The difference in extraction performance between Deeper Insights™ and Competitor A shows primarily in the precision measure, which is associated with the number of false positives. On our data set, we observed that Competitor A extracts a much higher number of images that would not be classified as relevant, for example icons, banners or navigation elements that are not related to the main content of the page. However, we also outperform Competitor A on image-level recall and on the number of documents containing false negatives, i.e. the Deeper Insights™ extractor misses fewer relevant images. Note that all errors (by the Deeper Insights™ extractor as well as by Competitor A) were manually reviewed, but of course are subject to the reviewer's judgment.
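
For readers who prefer code to definitions, the three measures can be computed for a single page as follows. This is a minimal sketch, assuming that extracted and expected images are compared as sets of image keys; the function name and the toy keys are made up and are not taken from our data set.

```python
def precision_recall_f1(extracted, expected):
    """Image-level precision, recall and F1 for one page."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: how many of the extracted images are relevant.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: how many of the relevant images were extracted.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 images extracted, 3 expected, 2 in common.
precision, recall, f1 = precision_recall_f1(
    extracted={"lead", "chart", "logo", "banner"},
    expected={"lead", "chart", "author"},
)
```

In the toy example the two false positives ("logo" and "banner") lower precision to 2/4, while the one missed image ("author") lowers recall to 2/3.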

As the world of unstructured data grows, we tirelessly work to build solutions that help bring structure to the web. Our content extractor is not the fastest in the world, but is certainly one of the most accurate. If you are interested in working with us, or for us, in developing products that utilise structured, accurate and reliable data, then please get in touch.