O'Reilly Live Online Training

Text Mining and Sentiment Analysis in R

An introduction to text analysis for effective, data-driven storytelling

Aleszu Bajak

This course helps participants develop fluency in the techniques and applications of textual analysis. It trains them in easy-to-use open-source tools and scalable, replicable methodologies that will make them stronger data scientists and more thoughtful communicators.

Using RStudio and several engaging, topical datasets sourced from politics, social science, and social media, this course introduces techniques for collecting, wrangling, mining, and analyzing text data. Participants will also derive insights from their textual analysis and communicate them using a set of data visualization methods. The techniques covered include n-gram analysis, sentiment analysis, and part-of-speech analysis.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • The techniques and applications of textual analysis
  • How to convert unstructured text from political science, data journalism, social science, and social media into data
  • Techniques like n-gram analysis, sentiment analysis, and part-of-speech analysis

And you’ll be able to:

  • Ingest various text formats into RStudio
  • Wrangle and analyze that text
  • Visualize and communicate insights about those textual data

This training course is for you because...

  • You’re a data scientist or analyst looking to explore more text data
  • You work with social media and other qualitative, text-based data
  • You want to become an expert in textual analysis techniques like n-gram and sentiment analysis

Prerequisites

  • Basic knowledge of R and RStudio
  • Familiarity with standard tidy R packages (such as tidytext and dplyr)

Recommended preparation:

  • Install R and RStudio Desktop, and download the GitHub repo contents locally, if possible.
  • As an alternative, code examples and exercises will be provided through RStudio Cloud. If possible, sign up for a free account with RStudio Cloud. This is optional, but recommended for the best experience.

About your instructor

  • Aleszu Bajak is a lecturer and graduate programs manager at Northeastern University's School of Journalism, where he teaches courses and conducts research on digital journalism, data reporting, and new media. He is the founding editor of Storybench.org, an under-the-hood guide to the future of digital storytelling; a faculty affiliate with Northeastern's Global Resilience Institute and the NU Lab for Texts, Maps, and Networks; and innovation lead for Northeastern’s Co-Laboratory for Data Impact.

    Bajak is a former Knight Science Journalism Fellow at M.I.T., radio producer for Science Friday and freelance reporter in Latin America. His writing has appeared in The Washington Post, M.I.T. Technology Review and Nature.

    He has taught courses and led workshops on data storytelling at the Nieman Foundation for Journalism at Harvard University, Brandeis University, and Boston University's Storytelling with Data Bootcamp. He has been invited to speak about digital journalism at the World Conference of Science Journalists in Seoul, South Korea; the European Science Journalists Conference in Toulouse, France; the National Association of Science Writers conference in San Antonio, Texas; the Power of Narrative Conference in Boston, Massachusetts; and the Iberoamerican Seminar of Science, Technology and Innovation Journalism in Querétaro, Mexico.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Text as data and real-world applications (30 minutes)

  • Presentation: Introduce real-world examples of textual and sentiment analysis sourced from journalism, marketing and finance.
  • Discussion: Discuss the strengths and caveats of these projects and how best to outline methodologies for different audiences.
  • Q&A

Text analysis methods (55 minutes)

  • Presentation: Introduce methods for tokenization, n-gram analysis and part-of-speech analysis.
  • Exercise: Run through an activity that isolates top words, top phrases and top parts of speech for a selected dataset.
  • Presentation: Participants will share their results via the group chat.
  • Q&A
  • Break (5 minutes)

Sentiment analysis methods (55 minutes)

  • Presentation: Introduce sentiment analysis methods and sentiment dictionaries.
  • Exercise: Participants will run through an exercise that performs sentiment analysis on a selected dataset.
  • Presentation: Introduce a social media dataset and perform sentiment analysis on it. Participants will discuss findings and caveats via the group chat.
  • Q&A
  • Break (5 minutes)

Visualization and communication (55 minutes)

  • Presentation: Introduce best practices of data-driven communication and effective visualization with text data.
  • Exercise: Participants will return to one of the textual methods introduced earlier and refine an insight they wish to communicate, sharing one to three bullet points and one visual via the group chat or a Google Doc.
  • Presentation: The instructor will provide feedback on select submissions.
  • Q&A