PyData Vermont 2024

Introduction to Machine Learning for Text Analysis and Classification with Python
07-30, 09:00–10:30 (US/Eastern), Filmhouse

Machine learning allows humans to create a model that can act as an extension of the creator’s mind and classify data based on predetermined categories. Manually tagging thousands of rows of data is often cumbersome and time-consuming. Forming a human-machine relationship to classify data can save researchers time and help catalyze data analysis and classification on projects that would otherwise take an untenable number of working hours.

This tutorial will teach participants how to use Python for machine learning and text classification, creating a human-machine relationship to process and classify textual datasets. Participants will learn how to use the Natural Language Toolkit (NLTK) to explore data, and how to use pandas, a Python library with extensive data manipulation functionality, to clean and reshape a dataframe (a table in pandas). Participants will also learn how to engineer textual features and build machine learning classification pipelines with scikit-learn (a popular open source machine learning library). Examples of projects that can be undertaken using these methods include identifying a behavioral health component in police incident narratives, identifying hate speech on Facebook, and identifying wildlife trafficking posts on Twitter. Participants should be familiar with basic Python data types and methods of manipulating strings.


Objective
The primary goal of this workshop is to equip participants with the knowledge and skills needed to leverage machine learning techniques for text classification tasks. By the end of the workshop, participants will be able to:

  1. Explore textual data using the Natural Language Toolkit (NLTK) and other Python libraries.
  2. Clean and manipulate textual data using pandas, a powerful Python library.
  3. Engineer textual features for machine learning models.
  4. Build and evaluate machine learning classification pipelines using scikit-learn.
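
As a rough illustration of the last two objectives, the sketch below engineers TF-IDF features and fits a classifier inside a single scikit-learn pipeline. The texts and the "flag"/"ok" labels are invented toy data, and logistic regression is just one reasonable choice of model; the workshop materials may use different data, features, and estimators.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data: made-up posts with made-up labels.
texts = [
    "rare parrot for sale, ships discreetly",
    "selling ivory carvings, message me for price",
    "exotic turtle available, cash only",
    "tiger cub for sale, no questions asked",
    "saw a beautiful parrot at the zoo today",
    "our class visited the turtle sanctuary",
    "great documentary about tigers last night",
    "museum exhibit on the history of ivory trade laws",
]
labels = ["flag", "flag", "flag", "flag", "ok", "ok", "ok", "ok"]

# Hold out a quarter of the rows for evaluation, keeping both classes present.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline turns raw strings into TF-IDF features, then fits a classifier.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(X_train, y_train)

# Evaluate on the held-out rows.
predictions = text_clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy on {len(X_test)} rows: {accuracy:.2f}")
```

With only eight rows the accuracy figure is not meaningful; the point is the shape of the workflow, where the same pipeline object handles both feature extraction and prediction.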

Target Audience
This workshop is designed for researchers, data analysts, and professionals interested in leveraging machine learning techniques for text classification, and is intended to be extremely beginner friendly. Materials for a 60-minute pre-workshop are available to onboard individuals who have less Python experience. Participants should have basic knowledge of Python programming, including data types and string manipulation.

Topics Covered
- Introduction to machine learning for text classification
- Exploratory data analysis using NLTK
- Data cleaning and manipulation with pandas
- Feature engineering for textual data
- Building machine learning classification pipelines with scikit-learn
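
The cleaning and exploration topics above might look something like the following minimal sketch. The dataframe contents and the column names ("text", "clean", "tokens") are hypothetical, and NLTK's regex-based `wordpunct_tokenize` is used here only because it needs no extra data downloads; the workshop may use a different tokenizer.

```python
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Hypothetical messy input: stray whitespace, mixed case, a missing value,
# and a row that duplicates another once normalized.
df = pd.DataFrame({
    "text": [
        "  Great workshop on Python!  ",
        None,
        "Python makes TEXT analysis easier.",
        "great workshop on python!",
    ]
})

# Clean with pandas: drop missing rows, normalize whitespace and case,
# then remove rows that are duplicates after normalization.
df = df.dropna(subset=["text"]).copy()
df["clean"] = df["text"].str.strip().str.lower()
df = df.drop_duplicates(subset="clean").reset_index(drop=True)

# Tokenize with NLTK and keep alphabetic tokens only (drops punctuation).
df["tokens"] = df["clean"].apply(
    lambda s: [t for t in wordpunct_tokenize(s) if t.isalpha()]
)

# Explore: count how often each token appears across the whole column.
freq = FreqDist(token for tokens in df["tokens"] for token in tokens)
print(freq.most_common(3))
```

Frequency counts like these are a quick way to spot dominant vocabulary, leftover noise, or candidate stop words before engineering features.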

Examples of Applications
- Identifying behavioral health components in police incident narratives
- Detecting hate speech on social media platforms like Facebook
- Identifying wildlife trafficking posts on Twitter

Format
The workshop will consist of a combination of lectures, demonstrations, and hands-on exercises. Participants will have the opportunity to work on real-world text classification projects and receive guidance from experienced instructors throughout the workshop.

Prerequisites
Participants should have a basic understanding of Python programming, including data types and string manipulation.

Duration
The workshop runs 90 minutes, from 09:00 to 10:30 (US/Eastern).

Outcome
By the end of the workshop, participants will have the knowledge and skills to apply machine learning techniques to classify textual data efficiently. They will be equipped to undertake text classification projects in their own research or professional work, saving time and improving accuracy.

Requirements
Participants must bring a laptop and have a Google account with access to Google Colab to run the sample code. Detailed installation instructions will be provided prior to the workshop.

Sean P. Rogers is the Assistant Director of the NULab for Texts, Maps, and Networks at Northeastern University, and a PhD student in Complex Systems and Data Science at the University of Vermont. His research spans computational social science, with broad interests in applying natural language processing, machine learning, and data science to understand complex social problems in collaboration with community stakeholders. His recent work has focused on understanding the human dimensions of captive wildlife tourism and wildlife trafficking, and on detecting signals of behavioral health components in police incident report narratives. Sean teaches workshops and modules on Python, natural language processing, statistical computing, and machine learning.