
What does it take to create a news classification app?


Fooyo recently received a request to build a news classification app. It sounded quite challenging and interesting, and we enjoy solving hard problems. The timeline was very tight, but we managed to deliver it and received good feedback from the client.

The request was to crawl news data from various news sources, remove the duplicates, classify the news into different categories and then display it in a mobile app.

It sounds quite straightforward. However, each module involves hard components that require deep knowledge and understanding to solve.

In summary there are four components:
1. A web news crawler
2. A duplication deletion mechanism
3. A news classification mechanism
4. A mobile app which interacts with the server

A web crawler is used to crawl the news list data and the corresponding page links. The web content behind those links is then saved on the server. There are quite a few good open source crawlers out there, e.g., Scrapy. Some sites have anti-crawler features that block IPs which crawl too much. Thus we may need several proxy servers, so that crawling can continue when some of them are blocked by a news site.
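The proxy fallback idea can be sketched with the Python standard library alone. This is only a minimal illustration, not the crawler we shipped: the function names `fetch_via_proxy` and `fetch_with_fallback` and the proxy addresses are hypothetical, and any network error is simply treated as "this proxy is blocked".

```python
import urllib.error
import urllib.request


def fetch_via_proxy(url, proxy, timeout=10):
    """Download `url` through one HTTP proxy, e.g. "http://10.0.0.1:8080"."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order and return the first successful response body."""
    last_error = None
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except (urllib.error.URLError, OSError) as error:
            # Assumption: any network failure (including HTTP 403/429)
            # means this proxy is blocked, so move on to the next one.
            last_error = error
    raise RuntimeError(f"all {len(proxies)} proxies failed for {url}") from last_error
```

The `fetch` parameter is there so the rotation logic can be exercised without touching the network. A real crawler would also want delays between requests and a way to retire proxies that keep failing.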

A duplication deletion mechanism first needs a good text parser, which involves the following tasks:

1. Tokenise the text into a list of words (this step is especially important for languages like Chinese, where words are not separated by spaces).
2. Reduce each word to its base form by stemming or lemmatising (say ran, running and run are all reduced to run).
3. Clean up the tokens: convert them to lower case, remove punctuation and, optionally, remove the stop words (a, to, the, and, etc.).
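The parsing steps above can be sketched in a few lines of Python. This is a toy version under clear assumptions: the stop-word list is a tiny illustrative sample, `naive_stem` is a hypothetical stand-in that only strips a few English suffixes (a library stemmer or lemmatiser would be used in practice), and the regular expression only handles space-separated languages, so Chinese text would need a dedicated word segmenter instead.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "are"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Tokenise, drop stop words, then stem what is left."""
    return [naive_stem(token) for token in tokenize(text) if token not in STOP_WORDS]


tokens = parse("The editors checked and posted the reports.")
# tokens == ["editor", "check", "post", "report"]
```

Punctuation disappears in the tokenising step because the pattern only matches letters and digits.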

After parsing the text data, we need to calculate the text similarity between articles in order to remove the duplicates, using a measure such as cosine similarity or the Jaccard similarity coefficient.
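Both measures are short enough to write out by hand. The sketch below assumes each article has already been parsed into a list of tokens. Cosine similarity is computed on raw term counts and Jaccard on the sets of distinct tokens. Returning 0.0 for empty input is a choice made here for convenience, not part of either definition.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty article matches nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty articles are not duplicates
    return len(set_a & set_b) / len(union)


first = ["market", "rise", "today"]
second = ["market", "rise", "again"]
# jaccard_similarity(first, second) is 2 / 4 = 0.5
# cosine_similarity(first, second) is 2 / 3, about 0.667
```

Two articles would then be treated as duplicates when the score exceeds a threshold that has to be tuned on real data.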

To categorize the news into different groups, we need to adopt a suitable text classification algorithm, e.g., Naive Bayes, usually combined with a term weighting scheme such as TF-IDF to turn the text into features. Some recommended Python libraries include nltk and scikit-learn.
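To show what a Naive Bayes classifier does under the hood, here is a small multinomial version with add-one smoothing in plain Python. It is a sketch, not what we ran in production: the class name `NaiveBayesClassifier` and the training data are made up for illustration, it works on raw word counts rather than TF-IDF weights, and in practice a library implementation such as the one in scikit-learn would be the more sensible choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, tokens, category):
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, doc_count in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log P(category) + sum of log P(word | category)
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training data, already parsed into tokens.
classifier = NaiveBayesClassifier()
classifier.train(["team", "win", "match"], "sport")
classifier.train(["player", "score", "goal"], "sport")
classifier.train(["stock", "market", "rise"], "finance")
classifier.train(["bank", "profit", "rise"], "finance")

category = classifier.predict(["team", "score"])  # "sport"
```

Working in log space avoids multiplying many tiny probabilities together, and the add-one smoothing keeps a word that was never seen in a category from driving its score to minus infinity.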

To transfer the data to the client app, we also created a data storage mechanism which saves the crawled data into a database, from which it is then retrieved by category and page number.
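Retrieval by category and page number could look roughly like the following. Everything specific here is an assumption made for illustration: SQLite as the database, the `news` table and its columns, the page size of 20 and the newest-first ordering.

```python
import sqlite3

PAGE_SIZE = 20  # assumed number of articles per page


def create_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) news table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " category TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " url TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_news_category ON news (category, id)"
    )
    return conn


def save_article(conn, category, title, url):
    """Insert one crawled article; an already-stored URL is silently skipped."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO news (category, title, url) VALUES (?, ?, ?)",
            (category, title, url),
        )


def get_page(conn, category, page, page_size=PAGE_SIZE):
    """Return the titles on a 1-based page of one category, newest first."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT title FROM news WHERE category = ?"
        " ORDER BY id DESC LIMIT ? OFFSET ?",
        (category, page_size, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

The mobile app would then ask the server for, say, page 2 of the "sport" category and receive the next batch of articles to append to its list.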

The final app looks like the following, with a news list table view page and a news detail page. The category tabs are displayed in a scroll view at the top, which makes the experience more user friendly.

There were a lot of challenges. One of them came from our news duplication deletion algorithm: the complexity is high because each article has to be compared with every other one (n^2). If the similarity comparison itself is not efficient enough, the whole thing becomes very inefficient. Imagine there are 1,000 articles; then we need roughly 1,000,000 comparisons. If we can process 1,000 comparisons per second, that is 1,000 seconds, or about 17 minutes, for the whole batch. With a slower comparison, that would be a disaster.

Initially, we compared the whole text body, which took far too long. We switched to comparing titles, which is much more efficient, and the results turned out to be OK. A better strategy would be to process the text body to extract the top N keywords and then compare the keywords instead of the whole article.
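The keyword strategy can be sketched as follows. This is an illustration of the idea rather than something we shipped: "top N keywords" is taken to mean the N most frequent tokens, the overlap is measured with the Jaccard coefficient, and the names `top_keywords` and `find_duplicates` as well as the default threshold of 0.8 are assumptions.

```python
from collections import Counter
from itertools import combinations


def top_keywords(tokens, n=10):
    """Return the n most frequent tokens of one article as a set."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def find_duplicates(articles, n=10, threshold=0.8):
    """Return pairs of article ids whose keyword sets overlap heavily.

    `articles` maps an article id to its list of parsed tokens.
    """
    # Extract the keywords once per article, not once per comparison.
    keywords = {aid: top_keywords(tokens, n) for aid, tokens in articles.items()}
    duplicates = []
    # combinations() visits each unordered pair once: n * (n - 1) / 2 pairs.
    for id_a, id_b in combinations(keywords, 2):
        set_a, set_b = keywords[id_a], keywords[id_b]
        union = set_a | set_b
        if union and len(set_a & set_b) / len(union) >= threshold:
            duplicates.append((id_a, id_b))
    return duplicates


articles = {
    1: ["market", "rise", "stock", "today"],
    2: ["stock", "market", "rise", "today"],
    3: ["team", "win", "match", "final"],
}
pairs = find_duplicates(articles)  # [(1, 2)]
```

The number of pairs is still quadratic in the number of articles, but each comparison now touches at most N keywords per article instead of the full text, which is where most of the time went.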

The urgent task was not confirmed until 3 days before the deadline. It looked impossible to finish in such a short time at this level of complexity. Thankfully, three out of the four developers on our team had studied NLP/machine learning, and we confirmed it was doable. Our team worked really hard, staying overnight, and together we made it happen. Thanks to the many open source communities for providing very good NLP/machine learning libraries. Also, special thanks to a friend from Tencent who gave us good guidance at a very early stage.

This article is a personal sharing, and some parts may not be academically rigorous. Please feel free to point out anything you find improper. Thanks!



