An easy way for online shops to improve their webpages is to categorize their products. Categorization lets users find products more easily, makes filtering by category possible, and lets you add a subpage for each category and group products there. Those subpages are additional pages that search engines can index, which can bring in more visits.
But how should one approach product categorization for an ecommerce shop?
The first step is to decide which taxonomy to use. By taxonomy we mean the set of categories that products can be assigned to.
For product categorization, one of the best candidates is the Google product taxonomy. You can learn more about it here:
https://support.google.com/merchants/answer/6324436?hl=en
The Google product taxonomy organizes categories into several tiers, so category paths have different depths.
Here are a few examples of Google taxonomy paths:
Apparel & Accessories > Costumes & Accessories > Costume Shoes
Apparel & Accessories > Costumes & Accessories > Costumes
Apparel & Accessories > Costumes & Accessories > Masks
Apparel & Accessories > Handbag & Wallet Accessories
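Since each path is a list of tiers separated by ">", it is easy to work with in code. The following is a minimal sketch, assuming the paths are available as plain strings like the examples above; the helper name split_taxonomy_path is hypothetical and not part of any Google tool.

```python
def split_taxonomy_path(path):
    """Split an 'A > B > C' taxonomy path into its tiers, top level first."""
    return [tier.strip() for tier in path.split(">") if tier.strip()]


paths = [
    "Apparel & Accessories > Costumes & Accessories > Costume Shoes",
    "Apparel & Accessories > Costumes & Accessories > Costumes",
    "Apparel & Accessories > Costumes & Accessories > Masks",
    "Apparel & Accessories > Handbag & Wallet Accessories",
]

for path in paths:
    tiers = split_taxonomy_path(path)
    # Depth of the path, its top-level category, and its most specific category.
    print(len(tiers), "|", tiers[0], "|", tiers[-1])
```

Keeping the tiers as a list makes it simple to group products by top-level category or to cut every path to a fixed depth.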
Another taxonomy for product categorization is the one from Facebook. You can find more information on their version here:
https://developers.facebook.com/docs/marketing-api/catalog/guides/product-categories/
Note that they offer a conversion between the Google product taxonomy and the Facebook product taxonomy. This is very useful if you have categorized your products in one taxonomy and want to switch, or want to have the products in the other one as well.
Once you have decided on a taxonomy, the next step is to find a way to assign products to its categories. One approach is to train your own machine learning model.
The key to this approach is finding an appropriate training data set. You can scrape top online stores for different categories or buy off-the-shelf categorized product data sets.
Machine learning models
Once you have the data, you must decide which pre-processing steps and which machine learning models to use.
For the model, you can choose anything from standard ones like a Support Vector Machine to neural nets, e.g. recurrent neural nets or convolutional neural nets.
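To make the idea of a text classifier concrete, here is a minimal, dependency-free sketch. It swaps in a multinomial naive Bayes model, which is not one of the models named above, simply because it fits in a few lines of plain Python. The class name, the whitespace tokenizer and the four training titles are all made up for illustration; a real model would need far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pre-processor would do more."""
    return text.lower().split()


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def fit(self, titles, categories):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter(categories)    # category -> number of titles
        self.vocab = set()
        for title, category in zip(titles, categories):
            words = tokenize(title)
            self.word_counts[category].update(words)
            self.vocab.update(words)
        return self

    def predict(self, title):
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of every word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(title):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training titles, labelled with paths from the taxonomy.
titles = [
    "pirate costume for adults",
    "kids superhero costume",
    "venetian carnival mask",
    "halloween skull mask",
]
categories = [
    "Apparel & Accessories > Costumes & Accessories > Costumes",
    "Apparel & Accessories > Costumes & Accessories > Costumes",
    "Apparel & Accessories > Costumes & Accessories > Masks",
    "Apparel & Accessories > Costumes & Accessories > Masks",
]

model = NaiveBayesCategorizer().fit(titles, categories)
print(model.predict("superhero costume with cape"))
```

The same fit and predict shape carries over when the model is replaced by an SVM or a neural net; only the internals change.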
The accuracy you can achieve depends to a large degree on the amount of data in your training data set. High accuracy, preferably over 90%, is key to keeping the number of incorrectly categorized products in your online store low.
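Accuracy here is simply the share of products whose predicted category matches the correct one on data the model has not seen during training. A minimal sketch, assuming you already have two parallel lists of predicted and expected labels; the labels below are made up and shortened for readability.

```python
def accuracy(predicted, expected):
    """Fraction of products whose predicted category matches the label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical held-out labels and model predictions.
expected = ["Costumes", "Masks", "Masks", "Costume Shoes", "Costumes"]
predicted = ["Costumes", "Masks", "Costumes", "Costume Shoes", "Costumes"]

TARGET = 0.90  # the 90% level suggested in the text
score = accuracy(predicted, expected)
print(f"accuracy = {score:.2f}, meets target: {score >= TARGET}")
```

In this toy case four of five predictions are correct, so the model falls short of the target and would need more training data or a better model.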
Alternatively, you can opt for ready-made solutions that offer merchandise classifier tools via an API. One such solution is the productcategorization.com website, which offers free product categorization if you do not have too many requests.
If you decide to build product categorization on your own, then the tensorflow or sklearn libraries are a good choice for ML models. A good place to start is the following article, which provides many useful tips on product categorization:
https://medium.com/product-categorization/product-categorization-introduction-d62bb92e8515
An important part of product categorization ML models is the pre-processing layer. You can implement your own pre-processor, or you can use an article extractor for this purpose.
Article extractors are usually machine learning models that convert parts of a webpage into features that help distinguish whether a given part is article content or not.
For example, one feature is link density. Parts of the text that are menus have a high link density, at least much higher than the article content, so link density is a useful feature.
There are many others, e.g. which tags are used. Article content is more often found within <div> tags, whereas menus are more often within <ul> and <li> tags.
Here is a list of features used by article extractors, which are an important part of a product categorization API:
- The specific tag that contains the block (e.g. <p>, <u>, etc.).
- Link density – the percentage of words that are within anchor tags.
- The names of the ancestor and sibling tags.
- Counts of certain types of characters, such as whitespace and digits.
- Position of the block, both relative and absolute, in the source of the webpage document.
- Number of sentences in the block.
- Median sentence length, measured in tokens.
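As an illustration of one feature from this list, link density can be computed with Python's standard html.parser module. This is a minimal sketch, assuming words are split on whitespace and the result is returned as a fraction between 0 and 1 rather than a percentage. The class and function names and the two HTML snippets are made up for the example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts all words in a block and the words that sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_words = 0
        self.anchor_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n_words = len(data.split())
        self.total_words += n_words
        if self.anchor_depth > 0:
            self.anchor_words += n_words


def link_density(html_block):
    """Share of words inside anchor tags, from 0.0 to 1.0."""
    parser = LinkDensityParser()
    parser.feed(html_block)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words


menu = '<ul><li><a href="/">Home</a></li><li><a href="/shop">Shop</a></li></ul>'
article = '<div><p>This mask is hand painted. See <a href="/care">care tips</a>.</p></div>'

print(link_density(menu), link_density(article))
```

The menu snippet consists only of links, so its density is 1.0, while the article snippet scores much lower. That gap is what makes the feature useful for separating menus from content.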
Website categorization API
Another important set of text classification models, somewhat more general than those for product categorization, are the ones behind a website categorization API.
A RESTful website classification API that returns JSON results can classify either base domains, e.g. www.economist.com, or full URL paths that point to subpages of a website.
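If you want to classify at the base-domain level but only have full URLs, the host can be extracted with Python's standard urllib.parse module. A minimal sketch; the function name base_domain and the example subpage path are hypothetical.

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Return the host part of a URL, e.g. 'www.economist.com'."""
    # urlsplit only recognizes the host when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname


print(base_domain("https://www.economist.com/some/subpage"))
print(base_domain("www.economist.com"))
```

Both calls print www.economist.com, so subpages of the same site can be grouped under one base domain before classification.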
The result produced by such website classification tools is usually not a single category but a dictionary of the form category A: probability for category A, where the probability denotes how likely it is that the given URL belongs to category A.
This also lets the user assign more than one category to a URL, depending on the probability levels.
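Picking categories from such a dictionary can be as simple as applying a probability threshold. A minimal sketch, assuming the result has already been parsed from JSON into a Python dict; the category names, the probabilities and the default threshold of 0.3 are made up for illustration.

```python
def select_categories(probabilities, threshold=0.3):
    """Return (category, probability) pairs at or above threshold, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(category, p) for category, p in ranked if p >= threshold]


# Hypothetical classifier output for one URL.
result = {"News": 0.55, "Business": 0.35, "Sports": 0.07, "Travel": 0.03}

print(select_categories(result))                 # two categories pass
print(select_categories(result, threshold=0.5))  # only the top one passes
```

A lower threshold assigns more categories per URL, and a higher one keeps only the most likely category, so the right value depends on how the categories will be used.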
Node.js modules for website categorization API
Below are Node.js modules for the website categorization API:
https://npmmirror.com/package/websitecategorization
https://yarnpkg.com/package/websitecategorization