Jan 28, 2017

Creating a site preview like in slack (using aiohttp)

Igor Tokarev

In this article we will write a small library that builds a preview for a site, similar to what Slack does. A description of how previews work in Slack can be found here.

Four data sources will be used:

  1. oEmbed
  2. Twitter Cards
  3. Open Graph
  4. HTML meta tags

We will try to retrieve the data in exactly this order.

Technology stack

For parsing we will use:

  • Beautiful Soup - to navigate and search through the tree of an HTML document.
  • html5lib - to parse HTML documents; it handles invalid HTML markup most gracefully.
  • aiohttp - an asynchronous HTTP client for fetching web pages plus an asynchronous server.

oEmbed

oEmbed is an open format designed to simplify embedding the content of one web page into another. The content may be photos, videos, links or other types of data. The specification and a description of the format can be found here.

The basic idea of how this format works: a site provides a specific endpoint to which you can make a GET request, passing the URL of the page you want to extract data for as a query parameter. Here is an example:

Instagram, a popular social network for sharing and rating photos and short videos, supports oEmbed.

Endpoint for receiving data: https://api.instagram.com/oembed

Page example: https://www.instagram.com/p/BOcL9FQFKAU/

The final URL for receiving information about the page (the URL has to be percent-encoded): https://api.instagram.com/oembed?url=https%3A%2F%2Fwww.instagram.com%2Fp%2FBOcL9FQFKAU%2F
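To see what such an endpoint returns, one GET request is enough, for example with aiohttp (a minimal sketch; aiohttp encodes the url parameter for us):

import asyncio

import aiohttp


async def main():
    params = {'url': 'https://www.instagram.com/p/BOcL9FQFKAU/'}
    async with aiohttp.ClientSession() as session:
        async with session.get('https://api.instagram.com/oembed', params=params) as resp:
            # a JSON document with fields such as title, author_name, thumbnail_url, html
            print(await resp.json())

asyncio.get_event_loop().run_until_complete(main())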

The main problem is knowing whether a site supports the oEmbed format at all. A complete list of providers does not really exist, so we have to add the providers we know about ourselves. For this, I decided to keep them in a separate JSON file.

The format is as follows (borrowed from the specification site, a small list can be found here):

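For example, an entry for Instagram might look like this (the structure follows the providers list published on the oEmbed site; the set of schemes is shortened here):

[
  {
    "provider_name": "Instagram",
    "provider_url": "https://instagram.com",
    "endpoints": [
      {
        "schemes": [
          "http://instagram.com/p/*",
          "https://instagram.com/p/*",
          "http://www.instagram.com/p/*",
          "https://www.instagram.com/p/*"
        ],
        "url": "https://api.instagram.com/oembed"
      }
    ]
  }
]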

If the link matches one of the patterns listed in schemes, the endpoint specified in url will be used.

Let's write helper functions to load the list of providers.

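Something along these lines (the function names are illustrative; the list can either be read from our local file or downloaded in full from oembed.com):

import json

import aiohttp

OEMBED_PROVIDERS_URL = 'https://oembed.com/providers.json'


def load_providers_from_file(file_path):
    """Read the list of oEmbed providers from a local JSON file."""
    with open(file_path) as f:
        return json.load(f)


async def download_providers(session):
    """Download the full provider list published on oembed.com."""
    async with session.get(OEMBED_PROVIDERS_URL) as resp:
        resp.raise_for_status()
        # content_type=None: do not insist on the application/json header
        return await resp.json(content_type=None)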

Then the parser itself:

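A minimal version could look like this (the class name and internals are a sketch): it walks through the providers, matches the page URL against the schemes masks and builds the request URL for the matched endpoint.

import fnmatch
from urllib.parse import urlencode


class OEmbedURLExtractor:
    """Builds oEmbed endpoint URLs for pages of known providers."""

    def __init__(self, providers, params=None):
        self._providers = providers
        self._params = params or {}  # e.g. maxwidth / maxheight

    def get_oembed_url(self, page_url):
        """Return the oEmbed request URL for page_url, or None if no scheme matches."""
        for provider in self._providers:
            for endpoint in provider.get('endpoints', []):
                for scheme in endpoint.get('schemes', []):
                    if fnmatch.fnmatch(page_url, scheme):
                        query = dict(self._params, url=page_url, format='json')
                        return '{0}?{1}'.format(endpoint['url'], urlencode(query))
        return None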

Example of using:

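Putting it together (the file name and the parameters here are arbitrary):

providers = load_providers_from_file('providers.json')
extractor = OEmbedURLExtractor(providers, params={'maxwidth': 600})

oembed_url = extractor.get_oembed_url('https://www.instagram.com/p/BOcL9FQFKAU/')
# -> the api.instagram.com/oembed URL with our page URL encoded in the query,
#    or None if no provider matched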

If the provider has been added to the file, we get a link to the oEmbed endpoint from which we can receive the data. get_oembed_url_from_html, which covers oEmbed discovery through the page HTML itself, will be described below.

Open Graph

The Open Graph protocol is a special set of meta tags embedded in a page's HTML code; the format was created by Facebook. A description of the protocol can be found here. Open Graph data sits directly in the page and has a simple structure, so it is quite easy to extract.

Open Graph parser code:

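A sketch of such a parser (it expects an already parsed BeautifulSoup document; the function name is illustrative):

from collections import defaultdict


def parse_open_graph(soup):
    """Collect first-level Open Graph properties; a property repeated several
    times becomes a list, nested (structured) properties are skipped."""
    values = defaultdict(list)
    for tag in soup.find_all('meta', property=True, content=True):
        prop = tag['property']
        if not prop.startswith('og:'):
            continue
        key = prop[len('og:'):]
        if ':' in key:  # og:image:width and the like - nested, not supported
            continue
        values[key].append(tag['content'])
    return {key: items[0] if len(items) == 1 else items for key, items in values.items()}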

This parser retrieves data from first-level Open Graph tags and supports arrays, but it does not support nested structures. That is, it can extract data from tags like these:

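For example (the repeated og:image is treated as an array; the values are modeled on the examples in the Open Graph documentation):

<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image" content="http://example.com/rock2.jpg" />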

but not from the nested ones:

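For instance, the structured image properties from the Open Graph documentation:

<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="400" />
<meta property="og:image:height" content="300" />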

Twitter Cards

This format is very similar to Open Graph: special meta tags containing information about the page are added to the page HTML. A description of the tags can be found here.

Parser code:

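A sketch of the parser (again assuming a ready BeautifulSoup document; the function name is illustrative):

def parse_twitter_cards(soup):
    """Collect data from <meta name="twitter:..." content="..."> tags."""
    prefix = 'twitter:'
    result = {}
    for tag in soup.find_all('meta', attrs={'name': True, 'content': True}):
        name = tag['name']
        if name.startswith(prefix):
            result[name[len(prefix):]] = tag['content']
    return result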

We retrieve data from all the meta tags whose name attribute starts with twitter:.

Meta tags

Parser:

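A sketch (the set of tags taken into account is kept to a minimum here):

def parse_meta_tags(soup):
    """Fall back to the <title> element and ordinary HTML meta tags."""
    result = {}
    if soup.title and soup.title.string:
        result['title'] = soup.title.string.strip()
    for tag in soup.find_all('meta', attrs={'name': True, 'content': True}):
        name = tag['name'].lower()
        if name in ('description', 'keywords', 'author'):
            result.setdefault(name, tag['content'])
    return result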

Retrieving data

We will fetch the data using the asynchronous HTTP client from aiohttp.

The two main functions are:

  • fetch_all - gets all the data from all sources
  • get_preview_data - gets the basic information (title, description, image) needed to create a preview

Both functions take the same parameters (only the first two parameters are mandatory):

  • session - aiohttp.ClientSession
  • url - the URL we want to get the data for
  • loop - asyncio event loop object
  • oembed_providers - a list of oEmbed providers
  • oembed_params - parameters for the oEmbed query (maxwidth and maxheight)

The session object is passed in so that a single session is created for the whole application; from the aiohttp documentation:

Don’t create a session per request. Most likely you need a session per application which performs all requests altogether. A session contains a connection pool inside, connection reusage and keep-alives (both are on by default) may speed up total performance.

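A simplified sketch of both functions, built on top of the parsers above (the real code in the library is more thorough; get_oembed_url_from_html here is the oEmbed discovery helper mentioned earlier):

from bs4 import BeautifulSoup


def get_oembed_url_from_html(soup):
    """oEmbed discovery: a <link rel="alternate" type="application/json+oembed">
    tag in the page points at the endpoint directly."""
    link = soup.find('link', type='application/json+oembed', href=True)
    return link['href'] if link else None


async def fetch_all(session, url, loop=None, oembed_providers=None, oembed_params=None):
    """Gather data for url from all four sources."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        html = await resp.text()
    soup = BeautifulSoup(html, 'html5lib')
    result = {
        'open_graph': parse_open_graph(soup),
        'twitter_cards': parse_twitter_cards(soup),
        'meta_tags': parse_meta_tags(soup),
    }
    extractor = OEmbedURLExtractor(oembed_providers or [], params=oembed_params)
    # a known provider first, oEmbed discovery as a fallback
    oembed_url = extractor.get_oembed_url(url) or get_oembed_url_from_html(soup)
    if oembed_url:
        async with session.get(oembed_url) as oembed_resp:
            if oembed_resp.status == 200:
                result['oembed'] = await oembed_resp.json(content_type=None)
    return result


async def get_preview_data(session, url, loop=None, oembed_providers=None, oembed_params=None):
    """Reduce the full data set to title, description and image."""
    data = await fetch_all(session, url, loop, oembed_providers, oembed_params)
    oembed = data.get('oembed', {})
    if 'thumbnail_url' in oembed:  # oEmbed names its image field differently
        oembed.setdefault('image', oembed['thumbnail_url'])
    preview = {}
    for field in ('title', 'description', 'image'):
        for source in ('oembed', 'twitter_cards', 'open_graph', 'meta_tags'):
            value = data.get(source, {}).get(field)
            if value:
                preview[field] = value[0] if isinstance(value, list) else value
                break
        else:
            preview[field] = None
    return preview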

Usage example

You can use it as a library in your application.

For data validation I use the small library marshmallow.

srv.py

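A sketch of such a server (the import path of fetch_all / get_preview_data and the schema details are assumptions of this sketch; the marshmallow 3.x API is used for validation):

import aiohttp
from aiohttp import web
from marshmallow import Schema, ValidationError, fields

# the import path is an assumption - adjust it to wherever
# fetch_all / get_preview_data live in your project
from aiounfurl.views import fetch_all, get_preview_data


class RequestSchema(Schema):
    url = fields.URL(required=True)


def _validated_url(request):
    return RequestSchema().load(dict(request.query))['url']


async def extract(request):
    try:
        url = _validated_url(request)
    except ValidationError as exc:
        return web.json_response(exc.messages, status=400)
    data = await fetch_all(request.app['session'], url)
    return web.json_response(data)


async def preview(request):
    try:
        url = _validated_url(request)
    except ValidationError as exc:
        return web.json_response(exc.messages, status=400)
    data = await get_preview_data(request.app['session'], url)
    return web.json_response(data)


async def init_session(app):
    # one ClientSession per application, as the aiohttp documentation advises
    app['session'] = aiohttp.ClientSession()


async def close_session(app):
    await app['session'].close()


app = web.Application()
app.router.add_get('/extract', extract)
app.router.add_get('/preview', preview)
app.on_startup.append(init_session)
app.on_cleanup.append(close_session)

if __name__ == '__main__':
    web.run_app(app, host='127.0.0.1', port=8080)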

requirements.txt

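The exact versions are up to you; the set of packages is roughly this:

aiohttp
beautifulsoup4
html5lib
marshmallow
aiounfurl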

Run python srv.py, then you can make a GET request to http://127.0.0.1:8080/extract?url=your_link to get all the site data, or to http://127.0.0.1:8080/preview?url=your_link to get only the data needed for the preview.

I also added an example with a GUI to the aiounfurl example repository.


Running the example in Docker

I added a Docker image with the example to http://hub.docker.com/ so that the sample can be run as a separate, independent service.

Running in the background:

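Something like the following, where <image_name> stands for the image published on Docker Hub:

docker run -d -p 8080:8080 <image_name>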

Then you can open the example at http://127.0.0.1:8080/.

Using the list of oEmbed providers (a JSON file with the list of providers, /path_to_file/providers.json, has to be created beforehand):

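The file can be mounted into the container; the path inside the container and the way the application is pointed at it depend on the image, so treat this as a template:

docker run -d -p 8080:8080 \
    -v /path_to_file/providers.json:/app/providers.json \
    <image_name>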

The application is on GitHub.
