What does it take to create a news classification app?


Fooyo recently received a request to make a news classification app. It sounded quite challenging and interesting, and we get excited about solving hard problems. The deadline was very tight, but we managed to deliver and got good feedback from the client.

The request was to crawl news data from various news sources, delete the duplicates, classify the news into different categories and then display it in a mobile app.

It sounds quite straightforward. However, each module actually involves hard components that need deep knowledge and understanding to solve.

In summary there are four components:
1. A web news crawler
2. A duplication deletion mechanism
3. A news classification mechanism
4. A mobile app which interacts with the server

A web crawler is used to crawl the news list data and the corresponding page links. The web content of the linked pages is then saved on the server. There are quite a few good open-source crawlers out there, e.g., Scrapy. Some sites have anti-crawler features that block IPs that crawl too much, so we may need several proxy servers and switch to another one when some are blocked by the news site.
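The proxy fallback idea can be sketched as below. This is only an illustration, not our production crawler: `Blocked`, `fetch_with_proxies` and the `fetch` callable are hypothetical names, and the real download step (for example a Scrapy request) is assumed to raise `Blocked` when the site refuses a proxy.

```python
from typing import Callable, Iterable


class Blocked(Exception):
    """Hypothetical signal: the news site refused the request from this proxy."""


def fetch_with_proxies(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], str],
) -> str:
    """Try each proxy in turn and return the first successful page body."""
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except Blocked:
            # This proxy is blocked by the site, move on to the next one.
            continue
    raise Blocked(f"all proxies blocked for {url}")
```

Passing the downloader in as `fetch` keeps the retry logic independent of whichever crawling library is used.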

A duplication deletion mechanism first needs a good text parser, which involves the following tasks:

1. Texts need to be tokenised into lists of words (this step is important for languages like Chinese).
2. Each word needs to be reduced to its base form by stemming or lemmatising (say, ran, running and run are all reduced to run).
3. Remove stop words (a, to, the, and, etc.; this may not be a necessary step) and punctuation, convert everything to lower case, and so on.
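The parsing steps above can be sketched with the standard library alone. This is a minimal illustration for English text, under several assumptions: the stop word list is a tiny made-up sample, `toy_stem` is a crude suffix stripper standing in for a real stemmer or lemmatiser (it will not map an irregular form such as "ran" to "run"), and Chinese text would need a dedicated word segmenter instead of the regular expression used here.

```python
import re

# A tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "and", "of", "in", "is", "on"}


def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text: str) -> list[str]:
    """Lower-case, drop punctuation, tokenise, remove stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(parse("The markets jumped, and traders are cheering."))
```

In practice a library such as nltk would replace both the stop word list and the stemmer.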

After parsing the text data, we need to calculate the similarity between texts in order to remove the duplicates, using a measure such as cosine similarity or the Jaccard similarity coefficient.
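Both measures are short enough to write out. The sketch below assumes each article has already been parsed into a list of tokens, and uses raw term counts as the vector for cosine similarity (a weighted vector such as TF-IDF would work the same way). The function names are ours, chosen for illustration.

```python
import math
from collections import Counter


def jaccard(a: list[str], b: list[str]) -> float:
    """Size of the shared vocabulary divided by the size of the combined one."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a: list[str], b: list[str]) -> float:
    """Cosine of the angle between the two term-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two articles can then be treated as duplicates when the score passes a threshold; the right threshold depends on the data and has to be tuned.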

To categorize the news into different groups, we need to adopt a suitable text classification approach, e.g., a Naive Bayes classifier, often combined with TF-IDF term weighting as the feature representation. Some recommended Python libraries include nltk and scikit-learn.
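To show what a Naive Bayes classifier does under the hood, here is a minimal multinomial version with add-one smoothing written with the standard library only. It is a teaching sketch, not the code we shipped: the class name and the tiny training set are made up, and it works on raw word counts rather than TF-IDF weights. A real project would more likely use the implementations in scikit-learn or nltk.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self) -> None:
        self.class_docs: Counter = Counter()          # documents per category
        self.word_counts = defaultdict(Counter)       # word counts per category
        self.vocab: set[str] = set()

    def fit(self, docs: list[list[str]], labels: list[str]) -> "NaiveBayes":
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens: list[str]) -> str:
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior + sum of log likelihoods (logs avoid underflow)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    [["team", "wins", "match"], ["player", "scores", "goal"],
     ["stocks", "fall", "market"], ["bank", "raises", "rates"]],
    ["sports", "sports", "finance", "finance"],
)
print(model.predict(["market", "rates", "fall"]))
```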

To transfer the data to the client app, we also created a data storage mechanism that saves the crawled data into a database, from which it is retrieved by category and page number.
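Retrieval by category and page number can be sketched with SQLite from the Python standard library. The table layout, column names, page size and function names below are assumptions made for the example; they are not our actual schema or API.

```python
import sqlite3

PAGE_SIZE = 2  # illustrative; a real app would use a larger page


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with a single news table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE news ("
        "id INTEGER PRIMARY KEY, category TEXT NOT NULL, "
        "title TEXT NOT NULL, body TEXT NOT NULL)"
    )
    # The index lets the per-category page query avoid a full table scan.
    conn.execute("CREATE INDEX idx_news_category ON news (category, id)")
    return conn


def save_article(conn: sqlite3.Connection, category: str, title: str, body: str) -> None:
    conn.execute(
        "INSERT INTO news (category, title, body) VALUES (?, ?, ?)",
        (category, title, body),
    )


def get_page(conn: sqlite3.Connection, category: str, page: int,
             page_size: int = PAGE_SIZE) -> list[str]:
    """Return the titles on the given 1-based page, newest article first."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT title FROM news WHERE category = ? "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (category, page_size, offset),
    )
    return [row[0] for row in rows]
```

The mobile app would then request, for example, page 1 of the "sports" category and ask for the next page as the user scrolls.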

The final app looks like the following, with a news list table view page and a news detail page. The category tabs are displayed in a scroll view at the top, which makes the experience more user-friendly.

There were a lot of challenges. One of them came up in our news duplication deletion algorithm: the complexity is high because each article has to be compared with every other one (O(n^2)). If the similarity comparison itself is not efficient enough, the whole process becomes very inefficient. Imagine there are 1,000 articles; then we need on the order of 1,000,000 comparisons (about half that if each pair is compared only once). If we can process 1,000 comparisons per second, the whole run takes about 17 minutes. Anything less efficient would be a disaster. Initially, we compared the whole text body, which took far too long. We switched to comparing titles, which is much more efficient, and the results turned out to be OK. A better strategy would be to process the text body to extract the top N keywords and then compare the keywords instead of the whole article.
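The keyword strategy could look roughly like the sketch below. It is an illustration of the idea, not something we shipped: "top N keywords" is assumed here to mean the N most frequent tokens after parsing, the keyword sets are compared with the Jaccard coefficient, and the function names and the threshold are made up. Note that it still visits every pair of articles, n(n-1)/2 pairs in total; the saving is that each comparison touches only N keywords instead of the whole body.

```python
from collections import Counter
from itertools import combinations


def top_keywords(tokens: list[str], n: int = 5) -> set[str]:
    """The n most frequent tokens of one article."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def find_duplicates(articles: dict[str, list[str]], n: int = 5,
                    threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return the pairs of article ids whose keyword sets overlap enough."""
    keys = {aid: top_keywords(tokens, n) for aid, tokens in articles.items()}
    duplicates = []
    for a, b in combinations(keys, 2):  # each unordered pair exactly once
        union = keys[a] | keys[b]
        if union and len(keys[a] & keys[b]) / len(union) >= threshold:
            duplicates.append((a, b))
    return duplicates


articles = {
    "a1": ["fire", "fire", "city", "city", "rescue", "crew"],
    "a2": ["fire", "city", "city", "rescue", "rescue", "crew"],
    "a3": ["election", "vote", "vote", "poll"],
}
print(find_duplicates(articles, n=3))
```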

The urgent task was not confirmed until 3 days before the deadline. At this level of complexity, it looked impossible to finish in such a short time. Thankfully, three out of the four developers in our team had studied NLP/machine learning, and we confirmed it was doable. Our team worked really hard, staying overnight, and together we made it happen. Thanks to the many open source communities for providing very good NLP/machine learning libraries. Special thanks also go to a friend from Tencent who gave us good guidance at a very early stage.

This article is a personal write-up, and some parts may not be academically rigorous. Please feel free to point out anything you find improper. Thanks!



