Add word count / term frequency analysis for transcript
Frankfurt students and staff are in love with the transcript feature; it really serves their needs and interests. We figured out that it would be interesting to look at how often they repeat certain words/terms while giving a class (which is what they are recording and annotating). I wrote a simple JS script that outputs the most-used words/terms and their frequencies for a single transcript JSON file. However, they need me to run it, and there is no visualisation. So it would be great if we could implement it in the frontend.
Main focus at the moment is the transcript but we could later also offer it for manual annotations. However, it might need some more complexity first, to generate a meaningful statistical output.
This is the code I'm using locally, running the script in node (in case that helps for implementing it):
const fs = require('fs'); // only required for file loading from disk
const natural = require('/usr/local/lib/node_modules/natural'); // npm install natural
const tokenizer = new natural.WordTokenizer();
const porterStemmer = natural.PorterStemmer;
// Load the JSON file from disk
const data = JSON.parse(fs.readFileSync('sylvia-recording.mp4_transcript_Automatic.json'));
// Get the list of common English stop words from the 'natural' package. Stop words are words that are excluded from word counting (e.g. 'and', 'to', 'but'). It would be great if the stop words in use could be listed in the UI.
const stopWords = natural.stopwords;
// Add custom stop words; it would be great to have an input field for adding custom stop words
const customStopWords = ['yes', 'no', 'ok'];
const allStopWords = [...stopWords, ...customStopWords];
// Set this variable to true to enable word stemming or false to disable it; a toggle in the UI would be great
const useStemming = false;
// Number of top terms for output
const topTerms = 20;
// Create a function to tokenize, stem (optional), and count words
function analyzeText(text) {
  const tokens = tokenizer.tokenize(text);
  const wordCount = {};
  tokens.forEach((token) => {
    const word = token.toLowerCase();
    const stemmedWord = useStemming ? porterStemmer.stem(word) : word; // Apply Porter stemming if useStemming is true
    // Exclude stop words
    if (!allStopWords.includes(word)) {
      wordCount[stemmedWord] = (wordCount[stemmedWord] || 0) + 1;
    }
  });
  return wordCount;
}
// Function to calculate term frequency
function calculateTermFrequency(data) {
  const termFrequencies = {};
  for (const { text } of Object.values(data)) {
    if (text) {
      const wordCount = analyzeText(text);
      for (const [word, count] of Object.entries(wordCount)) {
        termFrequencies[word] = (termFrequencies[word] || 0) + count;
      }
    }
  }
  // Sort by term frequency and print the top terms
  const sortedTerms = Object.entries(termFrequencies).sort((a, b) => b[1] - a[1]);
  sortedTerms.slice(0, topTerms).forEach(([term, frequency], index) => {
    console.log(`${index + 1}. Word: ${term}, Frequency: ${frequency}`);
  });
}
// Run the term frequency analysis
calculateTermFrequency(data);
Output looks like this:
- Word: yeah, Frequency: 49
- Word: thank, Frequency: 15
- Word: know, Frequency: 8
- Word: okay, Frequency: 8 ...
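For the frontend, the same counting could probably be done without the `natural` dependency and without printing, so that the UI can render the result itself. A minimal sketch, assuming the transcript JSON has the same shape as in my script (an object whose values carry a `text` field). The function name `termFrequencies`, the option names and the short `DEFAULT_STOP_WORDS` list are made up for illustration, the regex tokenizer is only a rough stand-in for `natural.WordTokenizer`, and stemming is left out:

```javascript
// Hypothetical default list, for illustration only; the real list would
// come from the 'natural' package or from a setting in the UI.
const DEFAULT_STOP_WORDS = ['a', 'an', 'and', 'but', 'in', 'is', 'it', 'of', 'the', 'to'];

// Returns [[term, count], ...] sorted by count (descending),
// limited to the first `topTerms` entries.
function termFrequencies(transcript, { stopWords = DEFAULT_STOP_WORDS, topTerms = 20 } = {}) {
  const stop = new Set(stopWords.map((w) => w.toLowerCase()));
  const counts = new Map();

  for (const { text } of Object.values(transcript)) {
    if (!text) continue; // skip entries without text

    // Simple tokenizer: runs of Unicode letters and digits
    const tokens = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
    for (const token of tokens) {
      if (stop.has(token)) continue; // exclude stop words
      counts.set(token, (counts.get(token) || 0) + 1);
    }
  }

  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topTerms);
}

// Usage with made-up transcript data and custom stop words from the UI
const demo = {
  0: { text: 'Yeah, yeah, thank you' },
  1: { text: 'Yeah and thank you' },
};
console.log(termFrequencies(demo, { stopWords: [...DEFAULT_STOP_WORDS, 'you'], topTerms: 5 }));
```

Returning the sorted list instead of logging it would let the same data feed both a plain table and a chart, and keeping the stop words as a parameter makes it easy to show and edit them in the UI.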
It would be great to have a simple SVG bar chart that visualises the output and can also be saved as PNG/SVG. Further suggestions for the settings that should be configurable in the UI can be found in the code comments.
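To give an idea of how little is needed for the chart, here is a rough sketch that turns a sorted list of `[term, count]` pairs into an SVG string with horizontal bars. Everything in it (the function name `barChartSvg`, the option names, sizes and colour) is an assumption for illustration, not a proposal for the final design:

```javascript
// Builds a horizontal bar chart as an SVG string from [[term, count], ...].
// All names and default sizes here are illustrative assumptions.
function barChartSvg(terms, { width = 480, barHeight = 18, gap = 6, labelWidth = 120 } = {}) {
  const max = Math.max(1, ...terms.map(([, count]) => count));
  const height = terms.length * (barHeight + gap) + gap;
  const maxBarWidth = width - labelWidth - 40; // leave room for the count label

  // Escape characters that are special in XML so terms cannot break the markup
  const escapeXml = (s) => String(s).replace(/[&<>"']/g, (c) => `&#${c.charCodeAt(0)};`);

  const rows = terms.map(([term, count], i) => {
    const y = gap + i * (barHeight + gap);
    const barWidth = Math.round((count / max) * maxBarWidth);
    return [
      `<text x="${labelWidth - 8}" y="${y + barHeight - 4}" text-anchor="end" font-size="12">${escapeXml(term)}</text>`,
      `<rect x="${labelWidth}" y="${y}" width="${barWidth}" height="${barHeight}" fill="steelblue"/>`,
      `<text x="${labelWidth + barWidth + 4}" y="${y + barHeight - 4}" font-size="12">${count}</text>`,
    ].join('');
  });

  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" viewBox="0 0 ${width} ${height}">${rows.join('')}</svg>`;
}

// Usage with the first values from the sample output
const svg = barChartSvg([['yeah', 49], ['thank', 15], ['know', 8], ['okay', 8]]);
console.log(svg);
```

In the browser, the resulting string could be wrapped in a `Blob` with type `image/svg+xml` and offered as a download for the SVG export. PNG export would presumably mean drawing that SVG onto a `<canvas>` first, which I have not tried.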
Suggestion for UI integration:
--> this could open up another overlay or, even better, show the term frequency statistics and the settings for the analysis in place of the transcript within the same overlay. But maybe there is a better place for this.
You can google for different visualisations of a term frequency analysis. I would definitely go for a simple bar chart for now, but a word cloud would be another option.
@christian.hansen: is that something you could take a look at? I know, it's not 3D but still Dataviz!