An Analysis of YouTube Comments on Drug Health Effects

Andrew McKenzie, Mount Sinai School of Medicine, New York, United States
Michael Chary, Mount Sinai School of Medicine, New York, United States
Emily Park, Mount Sinai School of Medicine, New York, United States
Julia Sun, Massachusetts Institute of Technology, Boston, United States
Alex Manini, Mount Sinai School of Medicine, New York, United States
Nicholas Genes*, Mount Sinai School of Medicine, New York, United States

Track: Research
Presentation Topic: Web 2.0 approaches for behaviour change, public health and biosurveillance
Presentation Type: Rapid-Fire Presentation
Submission Type: Single Presentation

Building: Mermaid
Room: Room 4 - Queenshithe
Date: 2013-09-24 11:30 AM – 01:00 PM
Last modified: 2013-09-25

Background: The substances used by recreational drug users change frequently in response to trends in legislation and the drug market. Therefore, it is a challenge for frontline providers, poison centers and public health officials to recognize and treat patients in need of care. Social media provides large, real-time data sources that may contain clinically relevant information that would help physicians and policy makers in this regard. However, filtering out useful information is an intricate problem for natural language processing and machine learning, because the data are unstructured and heterogeneous.

Objective: We sought to develop a technique to extract the signs and symptoms of emerging drugs of abuse from YouTube comments. As a proof of principle, our method was applied to comments mined from videos about marijuana.

Methods: We used YouTube’s API to collect comments from 50 videos returned by the search term 'marijuana', ordered by relevance, on March 1, 2013. Two of the authors (EP, NG) classified 1000 randomly selected comments according to whether they discussed health effects of the drug. A Bernoulli Naive Bayes classifier was then trained to assign comments to one of these two categories using a small set of features, including the length of the comment and the presence or absence of the drug name and/or commonly used synonyms. The classifier was then applied to an unrated set of comments, and the resulting word frequency distributions for the positively and negatively classified comments were analyzed.
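The classification step can be illustrated with a minimal pure-Python sketch of a Bernoulli Naive Bayes classifier over two binary features. This is not the authors' implementation: the synonym list `DRUG_TERMS`, the length cutoff `LONG_COMMENT_CHARS`, and the reduction of comment length to a single long/short indicator are assumptions made only for illustration, since the abstract does not report these details.

```python
import math
import re

# Assumed values for illustration; the abstract does not list the synonyms
# or state how comment length was encoded as a feature.
DRUG_TERMS = {"marijuana", "weed", "cannabis", "pot"}
LONG_COMMENT_CHARS = 140


def extract_features(comment):
    """Map a comment to a tuple of binary features."""
    tokens = set(re.findall(r"[a-z0-9']+", comment.lower()))
    return (
        int(len(comment) > LONG_COMMENT_CHARS),  # is the comment long?
        int(bool(tokens & DRUG_TERMS)),          # does it name the drug?
    )


def train_bernoulli_nb(feature_rows, labels, alpha=1.0):
    """Estimate log priors and per-feature Bernoulli parameters.

    Laplace smoothing (alpha) keeps every probability strictly
    between 0 and 1, so the logarithms below are always defined.
    """
    n_features = len(feature_rows[0])
    model = {}
    for cls in set(labels):
        rows = [r for r, y in zip(feature_rows, labels) if y == cls]
        log_prior = math.log(len(rows) / len(labels))
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the class with the highest posterior log score."""
    def log_score(cls):
        log_prior, probs = model[cls]
        return log_prior + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(row, probs)
        )
    return max(model, key=log_score)
```

In use, the hand-labelled comments would be passed through `extract_features`, the resulting rows and labels given to `train_bernoulli_nb`, and `predict` applied to the features of each unrated comment.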

Results (research in progress): A total of 15223 comments were collected from the YouTube API. The character length of comments in our data set followed a bimodal distribution. The classifier identified negative comments with a recall of 93% and a precision of 91% on the testing set of 14223 comments. Increased comment length was the strongest feature for identifying positively rated comments, with an odds ratio of 2.5. Among the comments identified as discussing health effects, two of the top ten bigram collocations were "smoke weed" and "prop 215," while two of the top ten bigram collocations for the negative set of comments were "medical marijuana" and "cannabis oil." Two words that appeared among the 50 most frequent words in the set discussing health effects, but not in the negatively classified set, were "body" and "addictive."
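As a rough illustration of how such figures can be computed, the sketch below calculates precision and recall for a chosen class and counts the most frequent adjacent word pairs in a set of comments. The function names and the tokenization rule are hypothetical. Ranking bigrams by raw frequency is also a simplification: collocation analysis often uses an association measure instead, and the abstract does not say which was used.

```python
import re
from collections import Counter


def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class, e.g. the negative class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def top_bigrams(comments, n=10):
    """Return the n most frequent adjacent word pairs across comments."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z0-9']+", comment.lower())
        # Pair each token with its successor within the same comment.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)
```

Calling `top_bigrams` separately on the positively and negatively classified comments gives two ranked lists that can be compared side by side.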

Conclusions (research in progress): Our results demonstrate that it is possible to capture information about the health effects of drugs using YouTube comment mining and natural language processing. The bimodal distribution in comment length may reflect a distinction between comments written as a curt reply and those written with more thoughtful intent. When length was included as a feature for our classifier, greater length was a strong predictor that the comment discussed health effects of the drug. Further distinguishing between comments discussing health effects and those discussing the legality of the drug is an important topic for subsequent research. In addition, ongoing research will apply these tools to emerging, as opposed to relatively well-established, recreational drugs.

This work is licensed under a Creative Commons Attribution 3.0 License.