At Indix, we deal with product information, and the primary source of that data is crawling and parsing retailer and brand websites. Parsing extracts the content that is directly or indirectly associated with a product. Parsing websites at scale is a complex problem, because each website has its own structure. If it is so complex, how do search engines identify and list product pages regardless of the page layout? This is where SEO techniques come in handy: many websites adopt SEO optimizations to rank higher in search results.
In the blog 9 Steps to Perfect Product Page SEO, we recommended SEO techniques for a perfect product page, one of which is “Product Schema Markup”. Product Schema Markup tells search engines that your page is about a product, what the product is, and the details of the product. Schema.org is a collaborative, community activity to come up with such schemas. It caters not only to product pages but also has a broader vision of viewing the internet as structured data. Because of that, and the benefits such structured data can enable, a majority of websites follow schema.org either partially or completely for their web pages. At Indix, we use the Product and Offer schemas and their sub-schemas to auto-parse retailer and brand websites and produce structured product information.
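To make the Product and Offer schemas concrete, here is a minimal sketch in Python of reading a product from JSON-LD, one of the formats in which schema.org markup can be embedded in a page. It is not how web-auto-extractor is implemented. The sample page, the `JsonLdCollector` class and the `extract_product` helper are all hypothetical and use only the standard library.

```python
import json
from html.parser import HTMLParser

# Hypothetical product page; the values are made up for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Mug",
 "offers": {"@type": "Offer", "price": "12.99", "priceCurrency": "USD"}}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD blocks instead of failing the page.


def extract_product(html):
    """Returns name, price and currency of the first Product found, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for item in collector.items:
        if isinstance(item, dict) and item.get("@type") == "Product":
            offers = item.get("offers") or {}
            if isinstance(offers, list):  # "offers" may hold several Offer objects.
                offers = offers[0] if offers else {}
            return {
                "name": item.get("name"),
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
            }
    return None


print(extract_product(SAMPLE_HTML))
```

Because the same `name`, `offers`, `price` and `priceCurrency` keys appear on every compliant page, one extractor like this can in principle work across many sites without site-specific rules.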
There are three different types of markup: Microdata, RDFa and JSON-LD.
There is a fourth type, meta tags, which is not specifically defined in schema.org but is defined in the W3C specification for SEO purposes, with the motivation of making the internet a structured repository of information.
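As an illustration of how product fields can be read from meta tags, the sketch below collects `<meta>` elements with Python's standard-library HTML parser. The article does not say which tag names are used, so the Open Graph properties (`og:title`, `og:image`, `og:description`) and the plain `description` tag are assumptions, and `MetaTagCollector` and `extract_meta_fields` are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical page head; the tag names and values are assumptions.
SAMPLE_HEAD = """
<head>
  <meta property="og:title" content="Example Mug">
  <meta property="og:image" content="https://example.com/mug.jpg" />
  <meta name="description" content="A sturdy ceramic mug.">
</head>
"""


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags keyed by their "property" or "name" attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first occurrence if a key is repeated.
            self.meta.setdefault(key.lower(), content)


def extract_meta_fields(html):
    """Maps a few common meta tags onto product-style field names."""
    collector = MetaTagCollector()
    collector.feed(html)
    meta = collector.meta
    return {
        "name": meta.get("og:title"),
        "image": meta.get("og:image"),
        "description": meta.get("og:description") or meta.get("description"),
    }


print(extract_meta_fields(SAMPLE_HEAD))
```

Meta tags carry only a handful of fields, which is why a sketch like this covers name, image and description but not offer details such as price.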
Code snippet: https://github.com/indix/web-auto-extractor#input
Code snippet: https://github.com/indix/web-auto-extractor#output
Code snippet: https://github.com/indix/web-auto-extractor#usage
We picked 1,840 retailer and brand websites to understand the extent to which their product pages are schema.org compliant. We then picked the top 12 fields within the Product and Offer schemas to measure the coverage of each field across the 1,840 websites. The table below presents the metrics from the collected information:
Below is the summary of the analysis:
As can be seen, not every field has 100% coverage. Meta tags are the highest contributor for Name, Image and Description, whereas RDFa is the poorest contributor for the majority of the fields. Microdata adoption is on the higher side for the top four fields (Name, Description, Image and Price). The surprising aspect is JSON-LD, which seems to have broader adoption across all the fields. As more websites adopt JSON, they are positioning themselves to support JSON-LD.
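For readers who want to run a similar analysis, here is a rough sketch of how per-field coverage could be computed once each site has been parsed. It assumes coverage means the percentage of sites where a given markup type yielded a non-empty value for a field; the article does not spell out its exact method. The sites and values below are made up and are not the results of the 1,840-site study.

```python
from collections import defaultdict

FIELDS = ["name", "image", "description", "price"]
FORMATS = ["microdata", "rdfa", "jsonld", "metatags"]

# Hypothetical extraction results: site -> markup type -> extracted fields.
SAMPLE_SITES = {
    "a.example": {"jsonld": {"name": "A", "price": "1.00"}, "metatags": {"name": "A"}},
    "b.example": {"metatags": {"name": "B", "image": "b.jpg"}},
    "c.example": {"microdata": {"name": "C"}},
    "d.example": {},
}


def field_coverage(sites):
    """Returns {(markup_type, field): percent of sites with a non-empty value}."""
    total = len(sites)
    if total == 0:
        return {}
    counts = defaultdict(int)
    for extracted in sites.values():
        for fmt in FORMATS:
            fields = extracted.get(fmt, {})
            for field in FIELDS:
                if fields.get(field):
                    counts[(fmt, field)] += 1
    return {
        (fmt, field): 100.0 * counts[(fmt, field)] / total
        for fmt in FORMATS
        for field in FIELDS
    }


coverage = field_coverage(SAMPLE_SITES)
for fmt in FORMATS:
    row = ", ".join(f"{field}: {coverage[(fmt, field)]:.0f}%" for field in FIELDS)
    print(f"{fmt:10s} {row}")
```

Keeping the counts per markup type, not just per field, is what makes it possible to compare formats field by field as in the summary above.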
To achieve the broader vision of making internet data structured, many more websites need to become structured and schema.org compliant. Indix shares this vision of organizing and structuring product information, and one of our contributions to it is web-auto-extractor. Try out the library and share your feedback or any issues you face in the GitHub repository: https://github.com/indix/web-auto-extractor.
Also published on Medium.