PyData Vermont 2024

Sean P. Rogers

Sean P. Rogers is the Assistant Director of the NULab for Texts, Maps, and Networks at Northeastern University, and a PhD student in Complex Systems and Data Science at the University of Vermont. His research spans computational social science, with broad interests in applying natural language processing, machine learning, and data science to understand complex social problems in collaboration with community stakeholders. His recent work has focused on understanding the human dimensions of captive wildlife tourism and wildlife trafficking, and on detecting signals of behavioral health components in police incident report narratives. Sean teaches workshops and modules on Python, natural language processing, statistical computing, and machine learning.


Sessions

07-30
09:00
90min
Introduction to Machine Learning for Text Analysis and Classification with Python
Sean P. Rogers

Machine learning allows humans to create a model that can act as an extension of the creator’s mind and classify data based on predetermined categories. Manually tagging thousands of rows of data is often cumbersome and time-consuming. Forming a human-machine relationship to classify data can save researchers time and help catalyze data analysis and classification on projects that would otherwise take an untenable number of working hours.

This tutorial will teach participants how to use Python for machine learning and text classification, creating a human-machine relationship to process and classify textual datasets. Learn how to use the Natural Language Toolkit (NLTK) to explore data. Use pandas, a Python library for data manipulation, to clean and reshape a dataframe (a table in pandas). Participants will also learn how to engineer textual features and build machine learning classification pipelines with scikit-learn (a popular open-source machine learning library). Examples of projects that can be undertaken using these methods include identifying a behavioral health component in police incident narratives, identifying hate speech on Facebook, and identifying wildlife trafficking posts on Twitter. Participants should be familiar with basic Python data types and methods of manipulating strings.
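As a rough illustration of the kind of workflow described above, the sketch below cleans a small dataframe with pandas and fits a scikit-learn classification pipeline. It is not taken from the tutorial materials: the toy rows, the `flag`/`ok` labels, and the choice of TF-IDF features with logistic regression are all assumptions made for the example, and the NLTK exploration step is left out.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical hand-labeled rows; a real project would load these from a file.
df = pd.DataFrame({
    "text": [
        "Rare parrot for sale, ships fast",
        "Selling ivory carvings, message me",
        "Tiger cub available to buyers",
        "Lovely walk in the park today",
        "My cat sleeps all afternoon",
        "  Great coffee with friends  ",
        None,  # a missing narrative, dropped during cleaning
    ],
    "label": ["flag", "flag", "flag", "ok", "ok", "ok", "ok"],
})

# Clean with pandas: drop rows with no text, trim whitespace, lowercase.
df = df.dropna(subset=["text"]).copy()
df["text"] = df["text"].str.strip().str.lower()

# Feature engineering and classification in one pipeline:
# TF-IDF over unigrams and bigrams, then logistic regression (assumed choices).
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(df["text"], df["label"])

# Classify new, unlabeled text with the fitted pipeline.
predictions = model.predict(["exotic parrot for sale", "coffee in the park"])
print(list(predictions))
```

With only six training rows the predictions are not meaningful; in practice the labeled data would be split into training and test sets so the classifier can be evaluated before it is applied to unlabeled rows.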

Filmhouse