Deep Learning Methods for Short Text Analysis in Disease Control.

CHAPTER ONE

Aim and objectives of the study

Implement deep learning-based text analytics on short texts from social media.
Achieve disease event surveillance by leveraging social media.
Develop a recommendation system for disease control decision-making.

CHAPTER TWO

LITERATURE REVIEW

Introduction

Text analytics is a field in natural language processing (NLP). It aims to extract the semantic, syntactic and contextual information of any written language (Farzindar & Inkpen, 2015). The desired contents are extracted from a large pool of data for the purpose of knowledge discovery. According to Vijayarani, Ilamathi and Nithya (2015), information extraction and retrieval are common processes in the research areas of text mining, web mining, data mining, graph mining, multimedia mining and structural mining. This chapter reviews current methods in NLP together with their potential for disease control through social media resources. To better describe the effect of this work and its relevance to existing problems in the natural sciences, epidemiological compartments for infection transmission are also discussed.

Text preprocessing: the rudiments of NLP

Written information is a continuous sequence of letters that form words; words form phrases, and phrases form sentences. These chunks of information are further identified as parts of speech and named entities. Before computer programs can make these distinctions, a number of processes are carried out on the raw text. They include the following:

Tokenization: The process of splitting text into discrete words at punctuation marks and white space. These words form the vocabulary of the system.

Stop words elimination: Stop words are not the major language terms in documents; they usually comprise determiners and conjunctions. They can be removed using a compiled list of words that serve no purpose in a document other than grammatical completeness. According to Vijayarani et al. (2015), advanced approaches apply Zipf’s law based on these criteria: words that appear only once in a document, words that appear least in the pool of documents (inverse document frequency), and words whose frequency of appearance is excessively high.

Stemming: This maps the different inflected forms of a word to its base form. These forms include plurals, past tense and continuous tense.

Normalization: The different surface forms that words can take in different contexts are made consistent. The targets at this stage are hyphenated words, capitalization and acronyms; in query tasks, normalization also handles spelling errors.

Part-of-Speech (POS) Tagging: Words assume different functions depending on the sentence. The aim of POS tagging is to determine the part of speech that each word takes in its sentence. The task is semi-automated and combines stochastic and rule-based methods: models automatically tag the dataset, and human annotators later check the tags for consistency (Marcus, Santorini, & Marcinkiewicz, 1993). The Penn Treebank, containing a total of 4.5 million words, is the most commonly used POS-tagged corpus. A tagger is judged not just by its annotation accuracy but also by its consistency, syntactic function, efficacy and redundancy rating in tags (Marcus et al., 1993).
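
The first four steps above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the stop-word list, the suffix list, the function names and the 0.8 document-ratio threshold are all assumptions chosen for illustration, and the suffix stripper is a toy substitute for an established stemmer. The frequency-based function only approximates the Zipf-style criteria by flagging tokens that occur once in the whole collection or in most documents.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "is", "are"}

# Illustrative suffixes for a toy stemmer; this is not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(text):
    """Lowercase the text and split hyphenated words into separate words."""
    return text.lower().replace("-", " ")


def tokenize(text):
    """Break text into word tokens at white space and punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens found in the compiled stop-word list."""
    return [t for t in tokens if t not in stop_words]


def stem(token):
    """Strip one common suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def frequency_stop_words(documents, max_doc_ratio=0.8):
    """Flag tokens that occur once overall or in most of the documents."""
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in documents:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    n_docs = len(documents)
    return {
        t for t in term_counts
        if term_counts[t] == 1 or doc_counts[t] / n_docs > max_doc_ratio
    }


def preprocess(text):
    """Normalize, tokenize, remove stop words, then stem."""
    tokens = tokenize(normalize(text))
    return [stem(t) for t in remove_stop_words(tokens)]
```

Calling preprocess on a short message such as "The fever is spreading in well-known areas" returns a list of stemmed content words, which can then be passed to a POS tagger or a classifier.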

CHAPTER THREE

ANALYSIS AND PROPOSED METHODS

Introduction

Good disease control entails efficient monitoring of media sources. Although social media shows encouraging possibilities through real-time information sharing and wide coverage, it can only be a viable information source if the structure of the data drawn from it is well represented in analysis.

Data collection

Data is the major requirement to consider when designing a model. For generic applications, many relevant annotated corpora are readily available; for task-specific objectives, by contrast, data has to be sourced, annotated and aggregated from scratch.

Twitter was selected as the target social media platform for this work because of its wide coverage, provision for data streaming and search requests. Different events in different communities elicit different reactions from the populace; these variations and trends in the tweets generated in the locations of interest are extracted and captured to design a disease monitoring system.

CHAPTER FOUR

IMPLEMENTATION AND SIMULATION

Introduction

The Convolutional Neural Network (ConvNet) model, discussed in Section 3.3.2.2, used for short text analysis was implemented and its performance was measured by benchmark evaluations. All the results presented in this chapter were based on the data collated from Twitter using the procedures outlined in Section 3.1.1 and the output classes are as defined in Section 3.1.2.

CHAPTER FIVE

SUMMARY AND RECOMMENDATION

Summary

This work contributes to the emerging field of deep learning for disease control. We applied a character-level approach to text analytics that removes the need for the demanding model augmentation methods used in word vector learning and rule-based procedures. Even with comparatively little data, an NLP model with comparable performance in short text analysis was developed. The disease prediction model was built to address the weaknesses of earlier infectious disease monitoring and control methods that failed; for example, an increase in information searches on a particular disease may not translate to an outbreak.
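
To illustrate what a character-level approach means in practice, the following minimal Python sketch converts a short text into a fixed-size one-hot matrix of the kind a ConvNet can consume directly, without a word vocabulary or word vectors. The alphabet, the input length of 140 characters and the name encode are assumptions made for this illustration; they are not taken from the implementation described in this work.

```python
import string

# Assumed alphabet: lowercase letters, digits, punctuation and the space character.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

# Assumed fixed input length (hypothetical choice for short texts).
MAX_LENGTH = 140


def encode(text, max_length=MAX_LENGTH):
    """Return a max_length x len(ALPHABET) one-hot matrix as nested lists."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_length)]
    for position, ch in enumerate(text.lower()[:max_length]):
        index = CHAR_INDEX.get(ch)
        if index is not None:  # characters outside the alphabet stay all-zero
            matrix[position][index] = 1
    return matrix
```

Each row of the matrix corresponds to one character position. Texts shorter than the fixed length are padded with all-zero rows and longer texts are truncated, so every input has the same shape.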

Recommendation

It would be worthwhile to invest time and resources in making disease-related short text corpora publicly available to aid research in this area, and in implementing named entity recognition (NER) to track the locations of outbreak reports.

Work to identify and control the factors that directly or indirectly influence the inexplicable occurrence and reoccurrence of disease outbreaks in developing nations would reduce mortality rates. It would also extend epidemiology into more fields.

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems 29 (NIPS 2016), 1–17. Retrieved from https://papers.nips.cc/paper/6461-learning-to-learn-by-gradient-descent-by-gradient-descent.pdf
Baars, H., & Kemper, H.-G. (2008). Management support with structured and unstructured data: an integrated business intelligence framework. Information Systems Management, 25(2), 132–148. https://doi.org/10.1080/10580530801941058
Bansal, S., Chowell, G., Simonsen, L., Vespignani, A., & Viboud, C. (2016). Big data for infectious disease surveillance and modeling. Journal of Infectious Diseases, 214 (Suppl 4), S375–S379. https://doi.org/10.1093/infdis/jiw400
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. https://doi.org/10.1162/153244303322533223
Brauer, F., & Castillo-Chavez, C. (2012). Mathematical models in population biology and epidemiology. https://doi.org/10.1007/978-1-4614-1686-9
Chan, E. H., Brewer, T. F., Madoff, L. C., Pollack, M. P., Sonricker, A. L., Keller, M., … Brownstein, S. (2010). Global capacity for emerging infectious disease detection. Proceedings of the National Academy of Sciences, 107(50), 21701–21706. https://doi.org/10.1073/pnas.1006219107
Choi, J., Cho, Y., Shim, E., & Woo, H. (2016). Web-based infectious disease surveillance systems and public health perspectives: a systematic review. BMC Public Health, 16(1), 1238. https://doi.org/10.1186/s12889-016-3893-0
