Tuesday, 10 March 2015

What does it take to create a news classification app?

Fooyo recently received a request to make a news classification app. It sounded challenging and interesting, and we get excited about solving hard problems. The deadline was very tight, but we managed to deliver, and the client gave us good feedback.

The request was to crawl news data from various news sources, remove the duplicates, classify the news into different categories and then display it in a mobile app.

It sounds quite straightforward. However, each module involves hard components that need deep knowledge and understanding to solve.

In summary there are four components:
1. A web news crawler
2. A duplicate removal mechanism
3. A news classification mechanism
4. A mobile app which interacts with the server

A web crawler is used to crawl the news lists and the corresponding page links. The contents of the linked pages are then saved on the server. There are quite a few good open-source crawlers out there, e.g., Scrapy. Some sites have anti-crawler features which block IPs that request too many pages. Thus we may need several proxy servers, so that crawling can continue when some of them are blocked by the news site.
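The proxy idea can be sketched in a few lines. This is only an illustration, not the code we shipped: the `ProxyPool` class and the proxy names are hypothetical, and the actual page fetching is left to whatever crawler is used.

```python
class ProxyPool:
    """Round-robin pool of proxies that drops the ones a site has blocked."""

    def __init__(self, proxies):
        self.available = list(proxies)
        self.position = 0

    def next_proxy(self):
        # Hand out the remaining proxies in turn.
        if not self.available:
            raise RuntimeError("all proxies are blocked")
        proxy = self.available[self.position % len(self.available)]
        self.position += 1
        return proxy

    def mark_blocked(self, proxy):
        # Call this when a request through `proxy` is rejected by the site.
        if proxy in self.available:
            self.available.remove(proxy)
```

The crawler would ask the pool for a proxy before each request and call `mark_blocked` when the site starts refusing it.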

A duplicate removal mechanism first needs a good text parser, which involves the following tasks:

1. Texts need to be tokenised into lists of words (this step is especially important for languages like Chinese, which do not separate words with spaces).
2. Each word needs to be stemmed or lemmatised (say running and runs are both reduced to run; mapping an irregular form such as ran to run needs lemmatising rather than plain stemming).
3. Normalise the tokens: convert to lower case and remove punctuation and stop words (a, to, the, and, etc.; removing stop words may not be necessary).
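The parsing steps above can be sketched as follows. This is a toy version under stated assumptions: the stop-word list is a tiny made-up sample, `toy_stem` is a crude suffix stripper rather than a real stemmer (a library such as nltk would be used in practice), and the regular expression only handles ASCII text, so Chinese would need a proper word segmenter.

```python
import re

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}


def toy_stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_stop_words=True):
    # Lower-case, then keep runs of letters and digits; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]
```

The crude stemmer turns "running" into "runn", which shows why a real stemmer or lemmatiser is worth using.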

After parsing the text, we need to calculate the similarity between texts to remove the duplicates, using a measure such as cosine similarity or the Jaccard similarity coefficient.
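Both measures are short to write down. The sketch below works on token lists such as those produced by the parsing step; the function names and the 0.8 threshold in `is_duplicate` are assumptions for illustration, not values from our project.

```python
import math
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined one."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two word-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(tokens_a, tokens_b, threshold=0.8):
    # The threshold is a hypothetical value and would need tuning on real data.
    return cosine_similarity(tokens_a, tokens_b) >= threshold
```

Jaccard ignores how often a word appears, while cosine similarity on counts rewards repeated shared words.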

To categorise the news into different groups, we need to adopt a suitable text classification algorithm, e.g., Naive Bayes, usually on top of a feature weighting scheme such as tf-idf. Some recommended Python libraries include nltk and scikit-learn.
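To give a feel for how Naive Bayes classification works, here is a minimal multinomial version with add-one smoothing, written with the standard library only. It is a teaching sketch and not our production code; the class and method names are made up, and in practice a library such as scikit-learn or nltk would be the better choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def train(self, tokens, category):
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the most likely category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each token (add-one smoothing).
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category
```

Working with log probabilities avoids numeric underflow when an article has many words.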

To transfer the data to the client app, we also created a data storage mechanism which saves the crawled data into a database, from which it is retrieved by category and page number.
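Retrieval by category and page number could look like the sketch below. It uses an in-memory SQLite database purely for illustration; the table layout, the function names and the page size are assumptions, not our actual schema.

```python
import sqlite3

PAGE_SIZE = 20  # hypothetical number of articles per page


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE news ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " title TEXT NOT NULL,"
        " category TEXT NOT NULL,"
        " url TEXT NOT NULL)"
    )
    # The index keeps per-category paging fast as the table grows.
    conn.execute("CREATE INDEX idx_news_category ON news (category, id)")
    return conn


def save_article(conn, title, category, url):
    conn.execute(
        "INSERT INTO news (title, category, url) VALUES (?, ?, ?)",
        (title, category, url),
    )
    conn.commit()


def get_page(conn, category, page, page_size=PAGE_SIZE):
    """Return the titles on the given 1-based page, newest first."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT title FROM news WHERE category = ?"
        " ORDER BY id DESC LIMIT ? OFFSET ?",
        (category, page_size, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

The mobile app would then request, say, page 2 of one category and receive only that slice of the list.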

The final app has a news list table view page and a news detail page. The category tabs are displayed in a scroll view at the top, which makes the experience more user-friendly.

There were a lot of challenges. One of them came up in our duplicate removal algorithm: the complexity is high because each article has to be compared with every other one (on the order of n^2 comparisons). If the similarity comparison itself is not efficient enough, the whole process becomes very inefficient. Imagine there are 1,000 articles; then we need about 1,000,000 comparisons. At 1,000 comparisons per second, that is 1,000 seconds, or about 17 minutes, to process the whole set. If each comparison is slower than that, it would be a disaster. Initially, we compared the whole text body, which took far too long. We switched to comparing titles, which is much more efficient, and the result turned out to be OK. A better strategy would be to process the text body to extract the top N keywords and then compare the keywords instead of the whole article.
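The keyword strategy can be sketched like this. It is an idea we did not get to build, so everything here is an assumption: the function names, taking the most frequent tokens as "keywords", and the values of `n` and `threshold`.

```python
from collections import Counter
from itertools import combinations


def top_keywords(tokens, n=10):
    """Use the n most frequent tokens as a cheap stand-in for keywords."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def find_duplicates(articles, n=10, threshold=0.6):
    """articles maps an article id to its token list; returns duplicate id pairs."""
    # Extract keywords once per article, so each comparison only touches small sets.
    keywords = {aid: top_keywords(tokens, n) for aid, tokens in articles.items()}
    duplicates = []
    for id_a, id_b in combinations(keywords, 2):
        kw_a, kw_b = keywords[id_a], keywords[id_b]
        union = kw_a | kw_b
        # Jaccard similarity on the two keyword sets.
        if union and len(kw_a & kw_b) / len(union) >= threshold:
            duplicates.append((id_a, id_b))
    return duplicates
```

Comparing each unordered pair only once, as `combinations` does, needs n(n-1)/2 comparisons, which is 499,500 for 1,000 articles. That halves the work but is still quadratic, so keeping each single comparison cheap remains the main lever.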

The urgent task was not confirmed until 3 days before the deadline. It looked impossible to finish in such a short time at this level of complexity. Thankfully, three out of the four developers in our team had studied NLP/machine learning, and we confirmed it was doable. Our team worked really hard, staying up overnight, and together we made it happen. Thanks to the many open source communities that provide very good NLP/machine learning libraries. Also, special thanks to a friend from Tencent who gave us good guidance at a very early stage.

This article is a personal sharing and some parts may not be academically rigorous. Please feel free to point out anything you find improper. Thanks!

Monday, 9 March 2015


I officially quit my job at ViSenze in Feb 2015 to chase my little dream -- to create a lean tech startup that can make the world a better place.

In fact, my cofounders and I started working together in early 2013, when we developed an iPad clock game which solves a small problem in a fun way. Ever since then, we have never lost our passion for making products that can benefit people. Though we worked on many projects in different teams over the past two years, we still cannot forget those most memorable days when we worked together until midnight in the School of Computing and went for supper together. That was the best time in my life so far.

On an ordinary day in April 2014, I proposed the idea of making a small and beautiful tech studio that makes elegant apps/products. A lot of feedback came in, but most of it was just comments without any actionable suggestions. Only one friend, Rick (one of the cofounders), who was on exchange in Switzerland (ETH), saw my posts and then created a WeChat group together with two others to discuss the plan. The plan was to start with outsourced projects to build some real success cases. At that time, we had no clients, no money, nothing at all. But we never stopped looking. We named our studio "Fooyo", which sounds like "浮游" in Chinese. It stands for a pure and simple object that can bring beauty and freedom.

The first potential client was introduced by Zhixing, who is not only a developer but also a professional photographer. The potential client was a photographer who wanted to start her own photography studio. She needed an official website that could showcase her work and potentially help attract more photography clients. We spent a lot of time planning the proposal that would best fit her requirements. The plan was good. However, the client was not buying. The budget was beyond her expectation, though we had already given her a very friendly price. She didn't think an official website was worth that much when it could not attract customers the way social networks (Facebook pages, etc.) do. Her final decision was to build the website on Wix, which served the purpose and also saved some money. The failure helped us realise some facts:

Technology is changing a lot of industries, including the technology service industry itself. The low-end market will be taken over by automation, and there can hardly be any profit in services which still need manual labour. If we are to move on with the tech consultancy service, we've got three ways to go:

          1. Aim for the high-end clients.
          2. Develop products in areas which still don't have optimal automated solutions.
          3. Create a solution that is automatic and can scale with minimal marginal cost.

The first approach is quite hard, especially for young startups. High-end clients look for high-end consultants, and there are few chances for young ones. Though there are cases where high-end consultants pass some tasks on to newcomers, those tasks are usually not that interesting. Other factors may change the situation, say relationships. If we knew a big player, he might give us some good projects at a decent price. Unfortunately, that was not our case at that time.

The third approach requires initial investment, and it can take quite a lot of money. We were not at that stage at that time.

The real situation led us to the second approach, where we aim more for app development than website development. Since the four of us are full-stack developers who can make both apps and websites, our competitive advantage is making full-stack solutions for entrepreneurs who have already received funding, or for traditional industry leaders who would like to innovate more in IT.

Thankfully, we had helped some entrepreneurs build websites/apps before, and they trusted us for our ability and frankness. We got our first client, Wanmen.org, which had received seed funding from Renren Inc. and was desperately seeking a tech team that could really deliver good-quality products. As a friend and a previously active developer, I suggested that the CEO of Wanmen look for a good outsourcing team for the first stage. The CEO was not a tech person and had limited connections with good engineers, so it was indeed the best way out for the company at that time. After some comparisons, he found that reliability matters somewhat more than capability when multiple teams are capable of doing the job. He asked whether my team could do it. I consulted my team members via Skype (one in Toronto, one in Zurich, one in Singapore) and we confirmed that it was doable. He trusted us, and we made our first deal: a full-stack solution of website + iOS app + Android app.

The four of us divided the work according to our talents and really delivered the products on time (with some help from professional designers and project managers). The first version was quite a success. Within two weeks, there were around 10,000 registered users and 10,000 total app downloads. The investor put another 3 million USD into Wanmen, and the company is growing well.

Since then, we have been more confident that we can deliver high-quality full-stack products very efficiently. More project requests are coming in. More business entrepreneurs are trying to recruit us as their CTOs. That was also the time when I could no longer balance my work and my passion. After months of mental struggle, I decided to quit my job, although ViSenze is a nice company and I could probably have become a more professional engineer by staying there. I decided to be a tech entrepreneur creating tech products to improve people's lives rather than an obedient engineer.

Now, Fooyo is moving fast. We are starting our own products as well as helping other people realise their dreams. We've got passionate people working as Business Director, Senior UX Designer, etc. One of my Swedish friends has also joined us as Strategy Director. There are still a lot of problems we need to overcome along the way, but I'm confident that our team can move forward and really make a difference.