
What does it take to create a news classification app?


Fooyo recently received a request to build a news classification app. It sounded challenging and interesting, and we enjoy solving hard problems. The deadline was very tight, but we managed to deliver it and received good feedback from the client.

The request was to crawl news from various news sources, remove the duplicates, classify the articles into different categories, and then display them in a mobile app.

It sounds quite straightforward. However, each module involves hard components that take real knowledge and understanding to solve.

In summary, there are four components:
1. A web news crawler
2. A deduplication mechanism
3. A news classification mechanism
4. A mobile app that interacts with the server

A web crawler collects the news list data and the corresponding page links. The content of each linked page is then saved on the server. There are quite a few good open-source crawlers out there, e.g., Scrapy. Some sites have anti-crawler features that block IPs which crawl too much, so we may need several proxy servers and switch to another one when some are blocked by the news site.
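The proxy fallback can be sketched as below. This is a minimal sketch rather than the crawler we shipped: `BlockedError`, `fetch_with_proxy_rotation` and the injected `fetch(url, proxy)` function are hypothetical names, and the real download logic (for example a Scrapy spider or an HTTP client) is assumed to live inside `fetch`.

```python
from typing import Callable, Iterable, Optional


class BlockedError(Exception):
    """Raised by a fetcher when the site refuses the request (hypothetical)."""


def fetch_with_proxy_rotation(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], str],
) -> Optional[str]:
    """Try each proxy in turn and return the first successful page body.

    `fetch(url, proxy)` is an injected function that downloads `url` through
    `proxy` and raises BlockedError when the site blocks that proxy.
    Returns None if every proxy is blocked.
    """
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except BlockedError:
            continue  # this proxy is blocked; move on to the next one
    return None
```

Passing the fetcher in as a parameter keeps the rotation logic independent of any particular HTTP library and makes it easy to try out with a fake fetcher.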

A deduplication mechanism first needs a good text parser, which involves the following tasks:

1. Tokenise the text into a list of words (this step is especially important for languages like Chinese, which do not separate words with spaces).
2. Stem or lemmatise each word (say, ran, running and run are all reduced to run).
3. Lowercase the text and remove punctuation and, optionally, stop words (a, to, the, and, etc.).
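The parsing steps above can be sketched with the standard library alone. This is a toy illustration under stated assumptions: `STOP_WORDS`, `naive_stem` and `parse_text` are hypothetical names, the stop-word list is deliberately tiny, and the suffix stripper is only a stand-in for a real stemmer or lemmatiser such as those in NLTK. The regex tokeniser only works for languages that separate words with spaces; Chinese would need a proper word segmenter.

```python
import re
from typing import List

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}


def naive_stem(word: str) -> str:
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse_text(text: str) -> List[str]:
    """Lowercase, tokenise on letters and digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(parse_text("The crawler crawled the pages."))
# ['crawler', 'crawl', 'page']
```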

After parsing the text, we calculate the similarity between articles to find and remove the duplicates, using a measure such as cosine similarity or the Jaccard similarity coefficient.
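Both measures are short to write down. The sketch below assumes each article has already been parsed into a list of tokens; the function names are ours, and the cosine version uses raw term counts rather than any weighting scheme.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: List[str], b: List[str]) -> float:
    """Cosine of the angle between the term-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: List[str], b: List[str]) -> float:
    """Size of the intersection of the two token sets divided by their union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Cosine similarity takes word frequency into account, while Jaccard only looks at which words appear, so Jaccard is cheaper but cruder. In either case, two articles would be treated as duplicates when the score exceeds a threshold chosen by experiment.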

To categorise the news into different groups, we need a suitable text classification approach, e.g., a Naive Bayes classifier, often combined with TF-IDF term weighting. Some recommended Python libraries include NLTK and scikit-learn.
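To show the idea behind Naive Bayes, here is a minimal multinomial version with add-one smoothing, written from scratch on raw word counts. It is a sketch for illustration, not the classifier we shipped; the class name and the sample data are made up, and in practice a library implementation such as the one in scikit-learn is the better choice.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.doc_counts: Counter = Counter()  # number of documents per category
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)  # word counts per category
        self.vocabulary: Set[str] = set()

    def train(self, samples: List[Tuple[List[str], str]]) -> None:
        """Each sample is (tokens, category)."""
        for tokens, category in samples:
            self.doc_counts[category] += 1
            self.word_counts[category].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, tokens: List[str]) -> Optional[str]:
        """Return the category with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                count = self.word_counts[category][token]
                # Add-one smoothing keeps unseen words from zeroing the product.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


classifier = NaiveBayesClassifier()
classifier.train([
    (["goal", "match", "team"], "sports"),
    (["stock", "market", "shares"], "finance"),
    (["team", "win", "league"], "sports"),
])
print(classifier.predict(["team", "match"]))  # sports
```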

To deliver the data to the client app, we also built a storage layer that saves the crawled articles in a database and retrieves them by category and page number.
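Retrieval by category and page number maps naturally onto a `LIMIT`/`OFFSET` query. The sketch below uses SQLite from the Python standard library purely for illustration; the table layout, the function names and the page size of 20 are assumptions, not a description of our production schema.

```python
import sqlite3
from typing import List, Tuple

PAGE_SIZE = 20  # assumed number of articles per page


def create_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) news table and an index for category lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT NOT NULL, category TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_news_category ON news (category, id)")


def save_article(conn: sqlite3.Connection, title: str, category: str, body: str) -> None:
    conn.execute(
        "INSERT INTO news (title, category, body) VALUES (?, ?, ?)",
        (title, category, body),
    )


def get_page(
    conn: sqlite3.Connection, category: str, page: int, page_size: int = PAGE_SIZE
) -> List[Tuple[int, str]]:
    """Return (id, title) rows for the 1-based `page`, newest first."""
    offset = (page - 1) * page_size
    cursor = conn.execute(
        "SELECT id, title FROM news WHERE category = ? "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (category, page_size, offset),
    )
    return cursor.fetchall()
```

The app would then request, say, page 2 of the "sports" category, and the server answers with `get_page(conn, "sports", 2)`.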

The final app looks like the following, with a news list table view page and a news detail page. The category tabs are displayed in a scroll view at the top, which makes the experience more user-friendly.

There were a lot of challenges. One of them came up in our deduplication algorithm: the complexity is high because each article has to be compared with every other one (O(n^2)). If the similarity comparison itself is not efficient enough, the whole process becomes very slow. Imagine there are 1,000 articles: that is about 1,000,000 comparisons (or roughly half that if each pair is compared only once). At 1,000 comparisons per second, 1,000,000 comparisons take about 17 minutes. With a slower comparison, that would be a disaster. Initially, we compared the whole text body, which took far too long. We switched to comparing titles, which is much more efficient, and the results turned out to be OK. A better strategy would be to extract the top N keywords from the text body and then compare the keywords instead of the whole article.
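The keyword idea can be sketched as follows. This is an assumption about how such a strategy might look, not what we shipped: "top N keywords" is taken to mean the N most frequent tokens after parsing, duplicates are detected with Jaccard similarity on the keyword sets, and the names and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from typing import List, Set


def top_keywords(tokens: List[str], n: int = 10) -> Set[str]:
    """Return the n most frequent tokens of an article as a set."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def deduplicate(articles: List[List[str]], n: int = 10, threshold: float = 0.8) -> List[int]:
    """Return the indices of the articles to keep, dropping near-duplicates.

    Each article is still compared against every article kept so far, so the
    number of comparisons stays quadratic in the worst case, but each
    comparison only touches at most n keywords instead of the full body.
    """
    kept: List[int] = []
    kept_keywords: List[Set[str]] = []
    for index, tokens in enumerate(articles):
        keywords = top_keywords(tokens, n)
        is_duplicate = False
        for other in kept_keywords:
            union = keywords | other
            if union and len(keywords & other) / len(union) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(index)
            kept_keywords.append(keywords)
    return kept


articles = [
    ["fire", "city", "hall", "fire"],
    ["city", "fire", "hall"],
    ["stock", "market", "rally"],
]
print(deduplicate(articles, n=3))  # [0, 2]
```

This does not reduce the number of comparisons, only the cost of each one, which is the same trade-off we made when switching from full bodies to titles.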

The task was not confirmed until 3 days before the deadline. At this level of complexity, it looked impossible to finish in such a short time. Thankfully, three of the four developers on our team had studied NLP and machine learning, and we confirmed it was doable. Our team worked really hard, staying overnight, and together we made it happen. Thanks to the many open-source communities that provide very good NLP and machine learning libraries. Special thanks also go to a friend from Tencent who gave us helpful guidance at the very early stage.

This article is a personal sharing, and some parts may not be academically rigorous. Please feel free to point out anything you find improper. Thanks!



