In the course of investigating the use of Natural Language Processing and machine learning tools to better extract and navigate our student comment data, it became apparent that many tools have shortcomings when used on this type of text. Such writing has a very specific style, vocabulary and context, which can limit the effectiveness of generic tools and pre-trained machine learning models. Given our access to a large corpus of student feedback comments, we decided to turn the problem around and see if we could use our data to build new tools.
One of the issues at the very foundation of Natural Language Processing is how to represent words and groupings of words in a format to which mathematical algorithms can be applied. One such representation, developed at Google, is the Word2Vec model, which represents words as “high-dimensional vectors”. That is, each word is represented by a set of numbers (usually a couple of hundred), and these numbers define how the word fits in with the other words in the language. This representation is learned entirely from the words and their placement relative to other terms in the corpus of text used to train the model; it is a purely data-driven machine learning approach, with no input knowledge or rules about the language itself.
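To make the representation concrete, the minimal sketch below looks up a word vector using the open-source gensim library. This is an illustrative choice of toolkit, not necessarily the one used in this work, and the pretrained Google News vectors are used here purely because they are easy to download.

```python
# Minimal sketch of the Word2Vec representation, using the gensim library
# (an illustrative choice of toolkit, not necessarily the one used in this work).
import gensim.downloader as api

# Pretrained 300-dimensional Google News vectors (a large one-off download).
wv = api.load("word2vec-google-news-300")

vector = wv["lecture"]   # a numpy array of 300 floating-point numbers
print(vector.shape)      # (300,)
print(vector[:5])        # the first few of the numbers that position the word
```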
To train our own Word2Vec model we used 250,000 student comments from the past four years of subject-level student feedback survey data. As a generic comparison model, we used a popular open-source Word2Vec model available through the Python Natural Language Toolkit (NLTK) package. This model was trained on a 100-billion-word corpus of Google News stories. While this corpus is far larger than the one behind our bespoke model, it is also less domain specific, and we demonstrate the effect of domain specificity in the examples below.
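A sketch of how a bespoke model like this can be trained is shown below, again using gensim. The file name, tokenisation and the phrase-detection step (which joins frequently co-occurring words into single terms such as real_world) are illustrative assumptions rather than our exact pipeline.

```python
# Sketch of training a bespoke Word2Vec model on survey comments with gensim.
# File name, preprocessing and parameter values are illustrative assumptions.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess

# One comment per line, e.g. exported from the student feedback survey system.
with open("student_comments.txt", encoding="utf-8") as f:
    comments = [simple_preprocess(line) for line in f]

# Join frequent word pairs into single terms such as "real_world" or "future_job".
bigrams = Phraser(Phrases(comments, min_count=20, threshold=10))
comments = [bigrams[comment] for comment in comments]

# Train the model; vector_size is the "couple of hundred" numbers per word.
model = Word2Vec(
    sentences=comments,
    vector_size=200,   # dimensionality of each word vector
    window=5,          # context words considered either side of a word
    min_count=10,      # ignore very rare terms
    workers=4,
)
model.save("student_feedback_w2v.model")
```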
The mathematical representation of words in Word2Vec makes it straightforward to compute a measure of the similarity between words (or terms), and we use some examples of this computed similarity to compare the two models.
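The similarity queries behind the tables below can be reproduced along the following lines. This sketch assumes the bespoke model saved in the training sketch above; most_similar ranks terms by cosine similarity between their vectors.

```python
# Sketch of the similarity queries used for the model comparison,
# assuming the bespoke model saved in the training sketch above.
from gensim.models import Word2Vec

model = Word2Vec.load("student_feedback_w2v.model")

# Terms most similar to a pair of query terms (cf. Table 1).
for term, score in model.wv.most_similar(positive=["occupation", "real"], topn=20):
    print(f"{term:<25s} {score:.3f}")

# Cosine similarity between two specific terms.
print(model.wv.similarity("lecture", "tutorial"))
```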
Table 1: Comparison of the most similar terms computed from the Word2Vec models for the query terms “occupation” and “real”. This example clearly demonstrates an advantage of a domain-specific model. In a university setting the correct context for “occupation” is as a reference to a job or work, and this context is correctly picked up by our bespoke model. The more generic model, however, relates these terms to military occupation, returning words such as imperialism, tyranny and war.
Rank | Most similar terms (Our Model: trained on 250,000 comments from the Student Feedback Survey) | Most similar terms (Comparison Model: trained on 100 Billion Words from Google News) |
--- | --- | --- |
1 | future_job | occupation |
2 | work_environment | war |
3 | work_place | oppression |
4 | project_manager | imperialism |
5 | legal_practitioner | subjugation |
6 | workplaces | genuine |
7 | may_face | tyranny |
8 | working_environment | imperialist |
9 | real_situation | actual |
10 | global_business | oppressors |
11 | real_world | colonialism |
12 | care_women | colonialist |
13 | intended_profession | liberation |
14 | customers | dispossession |
15 | real_project | disengagement |
16 | managers | postwar |
17 | insight_real_world | Zionism |
18 | planners | profoundest |
19 | humanity | militarism |
20 | professional_career | invasion |
Table 2: In the case of less ambiguous terms, it can still be argued that the bespoke model is superior to the larger but more generic one. Here we search for terms similar to “lecture” and “authentic”. While both models give good in-context results, the purpose-trained model returns some more detailed and specific terms (e.g. real-world_cases, class_debates, story_telling) that make sense in the context of student feedback.
Rank | Most similar terms (Our Model: trained on 250,000 comments from the Student Feedback Survey) | Most similar terms (Comparison Model: trained on 100 Billion Words from Google News) |
--- | --- | --- |
1 | interaction_class | lectures |
2 | real-world_cases | authentically |
3 | class_debates | lecture |
4 | practice_theory | lectures |
5 | story_telling | presentation |
6 | easy_remember | contemporary |
7 | economic_models | colloquium |
8 | us_think_critically | authenticity |
9 | pragmatic | seminar |
10 | class_discussion | informative |
11 | lecture_tutorial | presentations |
12 | indigenous_perspective | symposium |
13 | relaxed_learning_environment | enlightening |
14 | throughout_lecture | seminar |
15 | active_learning | oration |
16 | robust | travelogue |
17 | inviting | sermon |
18 | worldly | intellectuality |
19 | real-world_experience | storyteller |
20 | lectures_bit_boring | timeless |