The internet can be thought of as a giant database in which URLs are the keys and the values are unstructured raw HTML. But what is the use of all that knowledge if it is not easily comprehensible? Extracting structured information from raw HTML is vital to mining the knowledge available out there. Indix solves this problem in the product information space. Our parser infrastructure turns HTML into semi-structured product/price data; machine learning and AI are then applied to clean, extract, transform, and further refine it into a fully structured product record.
Here is an example – you can see the product HTML on the left and the structured content that our parser extracts on the right.
The Indix Cloud Catalog holds product records for over a billion products and insights are derived to enable Product-Aware apps and websites, among other use cases. Today, I want to talk about the parser ecosystem and the tools we’ve built that allow us to parse billions of products across thousands of sites.
Here are some key terms that recur throughout this article:
The first version of our parsers consisted of custom parsers written in Java, one per site. The parser job, running on task trackers in Hadoop, took the HTML pages written by our crawler infrastructure and applied that site's parser to extract the attributes, which were then sent to our downstream systems for analytics and storage.
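That first setup can be pictured as a registry mapping each site to a hand-written parser function. The sketch below is illustrative Python, not the original Java code; the site name, regexes, and attribute names are all made up for the example.

```python
import re

# Hypothetical per-site parser: takes the raw HTML of one product page
# and returns a dict of extracted attributes.
def parse_example_store(html):
    title = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    price = re.search(r'class="price"[^>]*>\$?([\d.]+)', html)
    return {
        "title": title.group(1).strip() if title else None,
        "price": float(price.group(1)) if price else None,
    }

# One hand-written parser per site, keyed by domain.
SITE_PARSERS = {"example-store.com": parse_example_store}

def parse_page(site, html):
    # The parser job looks up the site's parser and applies it.
    return SITE_PARSERS[site](html)

page = '<html><h1> Acme Widget </h1><span class="price">$19.99</span></html>'
print(parse_page("example-store.com", page))
```

Every new site meant writing, testing, and deploying another such function, which is exactly the scaling problem described next.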
This system worked well for a few years. However, we were growing at a very fast pace, which meant more customers and, in turn, more sites to get data from. Site ingestion started consuming significant engineering bandwidth. All our other systems scaled beautifully with the rapid growth – building quality systems for scale is a principle of every engineer at Indix. Site ingestion, however, was the one manual process, and we quickly realized that it was slowing us down for two major reasons:
We realized that our engineering resources could be better spent on other projects, such as moving our pipelines from batch to stream. This gave birth to the Tagger-Parser, and its success inspired us to move other stages of our site ingestion pipeline to a similar model.
This was a major milestone in automating our ingestion process. We went from ingestion that could only be done by engineers with programming skills to “anyone can tag a site” (taking inspiration from our favorite Ratatouille).
Tagging a site is as simple as clicking the element on the page that you are interested in extracting. In addition to the Parser Job, the ecosystem comprised a Parsing Library and an App for tagging a site by directly clicking on elements on the page. From these elements and the extracted selectors, the app creates a DSL (using the parsing library) and stores it in our central configuration store. This is a persistent store hosting all site configurations across systems, built using Event Sourcing and Akka Persistence. The parsing library takes a DSL and an HTML page and applies the parser to get the product attributes.
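Conceptually, such a DSL boils down to a mapping from attribute names to selectors, which the parsing library evaluates against a page. The sketch below is only a guess at the shape, not the actual Indix DSL; it uses Python's `xml.etree` with its limited XPath subset on a well-formed page fragment, where a real implementation would use full CSS selectors or XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical DSL: attribute name -> XPath-like selector.
# The real Indix DSL is selector-based, but its syntax is not shown here.
dsl = {
    "title": ".//h1[@id='product-title']",
    "price": ".//span[@class='price']",
}

def apply_dsl(dsl, html):
    # Parse the (well-formed) page and evaluate each selector,
    # taking the text of the first matching element.
    root = ET.fromstring(html)
    return {attr: root.findtext(sel) for attr, sel in dsl.items()}

page = (
    "<html><body>"
    "<h1 id='product-title'>Acme Widget</h1>"
    "<span class='price'>19.99</span>"
    "</body></html>"
)
print(apply_dsl(dsl, page))
```

The key shift from the first generation is that the per-site logic is now data (the DSL) rather than code, so it can be produced by a point-and-click app and stored in a central configuration store.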
Here is an example of tagging a site and the DSL that the app generates.
The Tagger-Parser was working very well for us. Soon, we had enough data about sites and selector patterns that we wanted to remove even this small manual tagging step and move to auto-generated parsers.
The first step we decided on was to recommend parsers; later, with enough confidence, we would be able to auto-generate them. The recommendation system suggests parsers based on different input signals. Most of these signals use machine learning to extract selectors for the elements of interest, looking at features such as location on the page, tag name, hierarchical position in the DOM, or co-location with other elements on the page. Some signals make decisions based on our existing corpus of parsers, page patterns, and so on. Another signal looks for schema.org markup on the sites. For an example, check out https://github.com/indix/web-auto-extractor.
As a step toward auto-generated parsers, we built the Attribute-Level Parser. Keeping the rest of the ecosystem as it was, we revamped the DSL of the old parser and designed it inductively: the attributes extracted from a site are parsers themselves, and collectively they form the parser for the site. The parser is also built as a functional model, with many features from the functional paradigm such as immutable variables, lazy evaluation, non-leaky values, and selector caches. It executes bottom-up, parsing each attribute first and finally the whole webpage.
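The inductive idea, where each attribute is itself a parser and the site parser is just the composition of its attribute parsers, can be sketched with plain functions. This is a Python illustration of the concept, not Indix's implementation; the lru_cache here merely stands in for the selector caches mentioned above.

```python
import re
from functools import lru_cache

# An attribute parser is just a function: html -> value.
def regex_attribute(name, pattern):
    # lru_cache plays the role of a selector cache: the same page is
    # not re-scanned when an attribute is evaluated again.
    @lru_cache(maxsize=256)
    def parse(html):
        match = re.search(pattern, html, re.S)
        return match.group(1).strip() if match else None
    parse.attribute = name
    return parse

def site_parser(*attribute_parsers):
    # The site parser is built inductively from attribute parsers:
    # it runs bottom-up, one attribute at a time.
    def parse(html):
        return {p.attribute: p(html) for p in attribute_parsers}
    return parse

parse_product = site_parser(
    regex_attribute("title", r"<h1[^>]*>(.*?)</h1>"),
    regex_attribute("price", r'class="price"[^>]*>\$?([\d.]+)'),
)

page = '<html><h1>Acme Widget</h1><span class="price">$19.99</span></html>'
print(parse_product(page))
```

Because attribute parsers are self-contained values, a recommendation system can generate, score, and swap them individually, which is what makes this design a stepping stone toward fully auto-generated parsers.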
Here’s an example of tagging a site and the new DSL from the Attribute-Level Parser:
Thus parsers, one of the core components of our data pipelines, have evolved over time: from manually written custom parsers, to taggable ones, and soon to auto-generated parsers. To quote some numbers, our parsers handle an average daily load of 10 million+ products, and there have been instances where we have seen more than 40 million products parsed in a day. Keep watching this space for more updates.
Also published on Medium.