Product categorization API for online stores and ecommerce in general

An easy way for online shops and stores to improve their webpages is to categorize their products. Categorization lets users find products more easily, makes filtering by category possible, and lets you add category subpages that group products together. That means additional webpages present in search engines, and thus more visits from them.

But how does one approach product categorization for an ecommerce shop?

The first step is to decide on the taxonomy to use. By taxonomy we mean the set of categories that can be assigned to products.

For product categorization, one of the best candidates is the Google product taxonomy. You can learn more about it here:

The Google product taxonomy offers several tiers of categories, of different depths.

Here are a few examples of Google taxonomy paths:

Then there is another taxonomy for product categorization, by Facebook. You can find more information on their version here:

Note that they offer a conversion between the Google product taxonomy and their Facebook product taxonomy. This is very useful if you have categorized your products in one taxonomy and want to switch, or want to have the products in the other one as well.

Once you have decided on a particular taxonomy, the next step is finding an appropriate way to assign products to its categories. One approach is to train your own machine learning model.

The key in this approach is finding an appropriate training data set. For this, you can scrape top online stores for different categories or buy off-the-shelf data sets of categorized products.

Machine learning models

Once you have the data, you must decide which pre-processing steps and which machine learning models to use.

For machine learning models, you can choose anything from standard ones like Support Vector Machines to neural nets, e.g. recurrent neural nets or convolutional neural nets.
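As a rough illustration of the "standard model" route, here is a minimal sketch that assumes scikit-learn is installed and combines TF-IDF features with a linear Support Vector Machine. The product titles and category labels are invented toy data, not a real training set, and a usable model would need far more examples per category.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
titles = [
    "red cotton summer dress",
    "floral maxi dress with belt",
    "slim fit denim jeans",
    "gaming laptop 16gb ram",
    "ultrabook laptop 13 inch",
    "wireless bluetooth headphones",
]
labels = [
    "Apparel",
    "Apparel",
    "Apparel",
    "Electronics",
    "Electronics",
    "Electronics",
]

# Pre-processing (lowercasing, word and bigram TF-IDF) and the classifier
# are chained so the same steps run at training and prediction time.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(titles, labels)

predicted = model.predict(["cotton dress with pockets"])
print(predicted[0])
```

Keeping the vectorizer and the classifier in one pipeline is a design choice that avoids applying different pre-processing to training and live data.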

The accuracy you can achieve depends to a large degree on the amount of data in your training data set. High accuracy, preferably over 90%, is key to not having too many improperly categorized products in your online store.

Alternatively, you can opt for ready-made solutions that offer merchandise classifier tools via API. One such solution is website, which offers free product categorization if you do not make too many requests.

If you decide to build product categorization on your own, then the TensorFlow and scikit-learn (sklearn) libraries are good choices for ML models. A good way to start is the following article, which provides many useful tips on product categorization:

An important part of product categorization ML models is the pre-processing layer. You can implement your own pre-processor, or you can use an article extractor for this purpose.

Article extractors are usually machine learning models that convert webpages into features which help distinguish whether a certain part of a webpage is article content or not.

One such feature is link density. Parts of the text that are menus have a high link density, at least much higher than the article content, so link density is a useful feature for telling the two apart.

There are many others, e.g. which tags are used. Article content is more commonly found within <div> tags, whereas menus are more commonly found within <ul> and <li> tags.

Here is a fuller list of features for article extractors, which are an important part of a product categorization API:

  • The specific tag that contains the block (e.g. <p>, <u>, etc.)
  • Link density: the percentage of words that are within anchor tags
  • The names of the ancestor and sibling tags
  • Count of certain types of characters, like whitespace and digits
  • Position of the block, both relative and absolute, in the source of the webpage document
  • Number of sentences in the block
  • Median length of the sentences, counted in tokens
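
To make some of these features concrete, here is a minimal sketch using only the Python standard library. The function name `block_features`, the returned keys, and the naive sentence splitting on punctuation are assumptions made for illustration. A real article extractor would compute more features and feed them to a trained model.

```python
import re
from html.parser import HTMLParser
from statistics import median


class _BlockParser(HTMLParser):
    """Collects all words in an HTML fragment and the words inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.words = []
        self.anchor_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        tokens = data.split()
        self.words.extend(tokens)
        if self.anchor_depth > 0:
            self.anchor_words.extend(tokens)


def block_features(html_block):
    """Return a few of the listed features for one block of HTML."""
    parser = _BlockParser()
    parser.feed(html_block)
    parser.close()

    text = " ".join(parser.words)
    # Naive sentence split on ., ! and ? (an assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    total = len(parser.words)

    return {
        "link_density": len(parser.anchor_words) / total if total else 0.0,
        "digit_count": sum(ch.isdigit() for ch in text),
        "whitespace_count": sum(ch.isspace() for ch in text),
        "sentence_count": len(sentences),
        "median_sentence_length": median(lengths) if lengths else 0,
    }


menu = '<ul><li><a href="/a">Home</a></li><li><a href="/b">Shop now</a></li></ul>'
article = (
    '<p>This phone has 2 cameras. '
    'Read the <a href="/c">full review</a> for details.</p>'
)
print(block_features(menu)["link_density"])
print(block_features(article)["link_density"])
```

In this toy input every word of the menu sits inside a link, while only a small share of the article's words do, which is the contrast the link density feature relies on.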

Website categorization API

Another important set of text classification models, somewhat more general than those for product categorization, are the models behind a website categorization API.

A website classification RESTful API that returns JSON results can work either on base domains or on full URL paths, which can be subpages of a website.

The results produced by these website classification tools are usually not a single category but rather a dictionary of values of the form category A: probability for category A, where the probability denotes how likely it is that the given URL belongs to category A.

This also gives the user the ability to assign, depending on probability levels, more than one category to a given URL.
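
A minimal sketch of how such a result could be turned into one or more categories is shown below. The JSON shape, the category names, the threshold of 0.3 and the function name `select_categories` are all assumptions for illustration, since the actual response format depends on the API you use.

```python
import json


def select_categories(response_json, threshold=0.3, max_categories=3):
    """Pick every category whose probability clears the threshold."""
    probabilities = json.loads(response_json)
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [(cat, p) for cat, p in ranked if p >= threshold][:max_categories]
    # Fall back to the single most likely category if nothing clears the bar.
    if not chosen and ranked:
        chosen = [ranked[0]]
    return chosen


# Hypothetical response body for one URL.
sample = '{"Sports": 0.55, "Shopping": 0.35, "News": 0.10}'
print(select_categories(sample))
```

Lowering the threshold assigns more categories per URL, while raising it moves the result towards a single label.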

Node.js modules for website categorization API

Below are the Node.js modules for the website categorization API:







A few useful links for platform BittsAnalytics

If you are interested in the status of our BittsAnalytics API and dashboards, we publish it on

The API documentation was written with a very useful tool called Slate:

For our company Alpha Quantum we set up a Telegram channel that is available here:

You can find a summary of our blog posts on FeedBurner:

Notion is a very useful tool for tracking your projects, especially if you are collaborating with other people. We set up a Notion:

Recently we were playing with an interesting deep learning model which allows the style of an image to be transferred from one picture to another.

We posted a few of our results here:

We hope to publish a lot of articles on the theme of machine learning on our HackMD blog:

BittsAnalytics on

Best SEO niches and how to do niche keyword research

SEO is still one of the best and most cost-efficient ways to attract visitors. Although PPC advertising is always a good alternative, it is often expensive, and once you stop paying for the ads, the visitors mostly stop coming (there may be some residual effect due to recurring visitors). With SEO, by contrast, the investment you make has long-term effects, with traffic coming in over the years, provided of course that your optimisation is good enough.

To accomplish this, it is important to focus both on the right niches and on the right keywords within those niches. The best niches for SEO are often those where the products or services are of considerable value, e.g. high-ticket, expensive items or services, and where the competition is not yet too intense. This often happens in niches that are just emerging or are neglected.

How does one find the best niches to focus on?

A data-science-driven approach is to first compile a list of interesting niches. Next, one has to find the top relevant keywords for each niche and determine the SEO rankings, e.g. by using tools like Data For SEO API service.

Once one has the rankings, one can compute how difficult the niche is in terms of on-page optimization and in terms of the average OpenPageRank of the domains present in the rankings. One can also look at the average domain age: the younger the domains, the better (it is harder to compete against older domains).
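
The two domain-level metrics can be computed with a few lines of Python. This is a minimal sketch under stated assumptions: the row keys (`domain`, `open_page_rank`, `registered`), the function name `niche_difficulty` and the sample values are hypothetical, and the data would in practice come from whichever rankings and domain data sources you use.

```python
from datetime import date
from statistics import mean


def niche_difficulty(ranking_rows, today):
    """Average OpenPageRank and domain age over the distinct ranking domains."""
    # A domain may rank for several keywords, so keep one row per domain.
    by_domain = {row["domain"]: row for row in ranking_rows}
    ranks = [row["open_page_rank"] for row in by_domain.values()]
    ages = [
        (today - row["registered"]).days / 365.25
        for row in by_domain.values()
    ]
    return {
        "domains": len(by_domain),
        "avg_open_page_rank": mean(ranks),
        "avg_domain_age_years": mean(ages),
    }


# Hypothetical ranking rows for one niche.
rows = [
    {"domain": "a.com", "open_page_rank": 6.0, "registered": date(2010, 1, 1)},
    {"domain": "b.com", "open_page_rank": 4.0, "registered": date(2018, 1, 1)},
    {"domain": "a.com", "open_page_rank": 6.0, "registered": date(2010, 1, 1)},
]
print(niche_difficulty(rows, today=date(2020, 1, 1)))
```

Lower averages on both metrics would point to a niche that is easier to enter.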

The next thing is to look at the top 10 domains of the niche and how much of the search volume they take up across all the main keywords of the niche. This is possible with automated keyword research, where a script checks the rankings for all the top keywords of the niche.
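
One simple way to quantify this is sketched below. It assumes a hypothetical input mapping each keyword to its monthly search volume and its ordered list of ranking domains, and it credits a domain with a keyword's full volume whenever the domain appears in the top results. Weighting by ranking position would be a natural refinement but is not shown.

```python
from collections import defaultdict


def dominator_shares(keyword_rankings, top_n=10):
    """Share of total niche search volume for which each domain ranks in the top N."""
    total_volume = sum(info["volume"] for info in keyword_rankings.values())
    if total_volume == 0:
        return []
    reach = defaultdict(int)
    for info in keyword_rankings.values():
        # set() so a domain ranking twice for one keyword is counted once.
        for domain in set(info["domains"][:top_n]):
            reach[domain] += info["volume"]
    shares = {domain: vol / total_volume for domain, vol in reach.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical keyword data for one niche.
niche = {
    "buy nft": {"volume": 1000, "domains": ["a.com", "b.com"]},
    "nft marketplace": {"volume": 3000, "domains": ["a.com", "c.com"]},
}
print(dominator_shares(niche))
```

A niche where a handful of domains reach a share close to 1.0 is dominated and likely harder to break into.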

With the niche keyword research tool UnicornSEO, this is easily done. One of the features it supports is a dominators assessment for each niche.

Some of the best niches right now are in the areas of crypto, NFTs (non-fungible tokens) and others. Crypto has become interesting due to recent price increases, especially of Bitcoin and Ethereum. But after the recent surge of Ethereum to almost 4000 USD, it is hard to estimate where the current Ethereum support and resistance levels are.

The crypto market sector has gone through many boom and bust cycles in the past.



Portfolio Optimization Software supporting Mean CVaR

Portfolio optimization software has seen increased use in recent decades. Portfolio optimization is based on modern portfolio theory, which basically says that the expected return of a given financial asset and its risk are related: the higher the risk of an asset, the higher its expected return should be. Harry Markowitz, who introduced modern portfolio theory, and William Sharpe, who built on it with the capital asset pricing model, shared the 1990 Nobel Memorial Prize in Economic Sciences with Merton Miller.

There are, however, other approaches to determining the optimal portfolio composition or, in a specific case, the asset allocation.

One is the Black-Litterman approach. Fischer Black was co-inventor of the Black-Scholes formula for the valuation of options. He died in 1995, two years before the Nobel Prize for that work was awarded to Myron Scholes and Robert Merton. Litterman worked at Goldman Sachs Asset Management and wrote several interesting papers about asset allocation.

You can read more about the Black-Litterman approach here:

Portfolio optimization software most commonly uses the mean-variance approach, where the risk metric is the variance of returns.
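
The two quantities that the mean-variance approach trades off can be computed as in the minimal sketch below. The expected returns, covariance matrix and weights are invented numbers for a two-asset example, and the sketch only evaluates a given portfolio. It does not perform the optimization itself.

```python
def portfolio_return(weights, expected_returns):
    """Expected portfolio return: the weighted sum of asset expected returns."""
    return sum(w * mu for w, mu in zip(weights, expected_returns))


def portfolio_variance(weights, covariance):
    """Portfolio variance: sum over i, j of w_i * cov[i][j] * w_j."""
    n = len(weights)
    return sum(
        weights[i] * covariance[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )


# Invented two-asset example.
expected_returns = [0.08, 0.12]
covariance = [
    [0.04, 0.01],
    [0.01, 0.09],
]
weights = [0.5, 0.5]

print(portfolio_return(weights, expected_returns))
print(portfolio_variance(weights, covariance))
```

An optimizer would search over the weights for the lowest variance at a chosen target return, usually with constraints such as weights summing to one.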

However, there are other possible choices of risk metric. One is conditional value at risk, or CVaR. It is the expected loss in the tail beyond the value at risk (VaR) at a given confidence level, i.e. the average of the losses that are at least as bad as the VaR.
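
A minimal historical-simulation sketch of both metrics is shown below. It assumes a plain list of periodic returns and uses one common convention in which the tail includes the VaR observation itself. The sample returns and the function name `historical_var_cvar` are invented, and production software may use a different estimator.

```python
import math


def historical_var_cvar(returns, confidence=0.95):
    """Return (VaR, CVaR) as positive loss numbers from historical returns."""
    if not returns or not 0 < confidence < 1:
        raise ValueError("need returns and a confidence level between 0 and 1")
    # Losses are negated returns, sorted with the worst loss first.
    losses = sorted((-r for r in returns), reverse=True)
    # Number of observations in the worst (1 - confidence) tail.
    # round() guards against floating point noise before taking the ceiling.
    tail_size = max(1, math.ceil(round(len(losses) * (1 - confidence), 9)))
    tail = losses[:tail_size]
    var = tail[-1]                # smallest loss that is still in the tail
    cvar = sum(tail) / len(tail)  # average loss within the tail
    return var, cvar


# Invented sample of ten periodic returns.
sample_returns = [0.02, -0.05, 0.01, -0.10, 0.03, 0.00, -0.02, 0.04, 0.01, -0.01]
var, cvar = historical_var_cvar(sample_returns, confidence=0.8)
print(var, cvar)
```

Because CVaR averages over the whole tail, it is never smaller than the VaR at the same confidence level.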

In this case, the corresponding portfolio optimization approach is known as mean-CVaR portfolio optimization. One example of portfolio optimization software that supports both the mean-variance and mean-CVaR methods is the Alpha Quantum portfolio optimiser.

Here is a screenshot from their website, showing how the optimal portfolio weights change with target return:

Or another one, showing an interactive comparison of the current and optimal portfolios:


You can run many sensitivity analyses on various parameters. Here is an example from the software:

Backtesting is an important part of building quant strategies. It involves testing how your strategy would have performed over some past period. There are many ratios which allow you to evaluate the strategy:

  • Sharpe ratio
  • Sortino ratio
  • Jensen’s alpha
  • Treynor metric
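
As an illustration, the first two ratios can be computed from a series of periodic strategy returns as in the minimal sketch below. The sample returns are invented, the per-period risk-free rate and target return default to zero, and the annualization factor of 252 assumes daily returns. Jensen's alpha and the Treynor ratio additionally need benchmark returns and are not shown.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Mean excess return divided by its sample standard deviation, annualized."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Like Sharpe, but only returns below the target count as risk."""
    excess = [r - target for r in returns]
    downside = math.sqrt(mean([min(0.0, e) ** 2 for e in excess]))
    if downside == 0:
        raise ValueError("no returns below the target, ratio is undefined")
    return mean(excess) / downside * math.sqrt(periods_per_year)


# Invented backtest returns, one value per period.
backtest_returns = [0.01, -0.02, 0.03, -0.01, 0.02]
print(sharpe_ratio(backtest_returns))
print(sortino_ratio(backtest_returns))
```

The Sortino ratio comes out higher than the Sharpe ratio here because the upside swings, which inflate the standard deviation, are not penalized.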