🌃 Open Data Literacy Internship: NLP on Public Records

🤔 Problem Space

For eight weeks in summer 2018, I worked for the City of Seattle and the State of Washington on an open data-related project. The scope of the internship was to support data literacy for people in the public sector.

My partner and I collaborated on creating a governance document for an open data-related paraprofessional organization. In parallel, I spent time analyzing trends in public records requests to support proactive disclosure of datasets.

The project had two parts: data analysis and a policy document. I was in charge of the data analysis.

🛠 Process

We began by interviewing several records professionals, whom we recruited through snowball sampling. We spoke with each interview subject for about an hour, then performed thematic analysis on our notes and coded transcripts, which you can view in this repo. Our aim was to better understand the needs and expectations of data professionals who might be involved in the organization.

Once we identified the themes, my partner and I collaboratively drafted a document that acted as a charter for the organization, outlining the expectations for participants.

The end goal was for Open Data Champions (people who work with open data) to use an online tool or platform as a teaching aid for analyzing data within their own organizations. This practice would foster conversations about how to develop and use novel open data projects, bootstrapping interest across each organization.

As a first attempt at meeting this goal, I began with three datasets from different municipalities across Western Washington. I researched current techniques in natural language processing and clustering, then built a data processing pipeline and visualization toolkit.
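To give a rough sense of what clustering request text can look like, here is a minimal sketch in plain Python. It is not the pipeline from the notebook: the request summaries are invented for illustration, and TF-IDF weighting with a small k-means loop is an assumed stand-in for whichever NLP and clustering techniques the analysis actually used. The names `REQUESTS`, `tokenize`, `tfidf_vectors`, `cosine`, and `kmeans` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical request summaries, invented for illustration only.
REQUESTS = [
    "police report for traffic collision",
    "traffic collision police report copy",
    "building permit records for property",
    "property building permit history",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def kmeans(vectors, k, iterations=10):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    # Deterministic seeding: start from the first vector, then repeatedly
    # add the vector least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Recompute each centroid as the normalized sum of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = Counter()
            for member in members:
                for term, w in member.items():
                    total[term] += w
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[j] = {term: w / norm for term, w in total.items()}
    return labels


labels = kmeans(tfidf_vectors(REQUESTS), k=2)
for label, text in zip(labels, REQUESTS):
    print(label, text)
```

On this toy input, the two police-report requests land in one cluster and the two building-permit requests in the other. Real request text is far messier, so a production pipeline would likely need stop-word handling, stemming or lemmatization, and a principled way to choose the number of clusters.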

🎉 Outcomes

We traveled to Olympia to meet with some of the partners who helped facilitate the internship. We also presented our work for discussion, which you can see online.

The final analysis that I produced is available on GitHub as an interactive Jupyter notebook hosted with Binder. You can try it here.

Throughout the project, we also wrote about our experience. You can view some reflections here or here on Medium.