Building A Meeting Transcript (Exploratory) Text Analysis Tool With Python & Power BI

Zhijing Eu
5 min read · Oct 23, 2022

“Post-pandemic, hybrid / virtual working is turning out to be the norm, and with every video meeting we attend, we generate huge volumes of textual data in the form of transcripts that we can mine for potential insights”

In this article I do a short walk-through of how anyone with a bit of Python know-how and Microsoft’s Power BI can create a simple exploratory tool for text analysis of meeting transcripts.

NOTE: Before we get started, remember to do this in an ETHICAL manner: let people know you are making a recording and capturing the transcript as well.

If you want to, you can also jump straight to this GitHub repo:

https://github.com/ZhijingEu/VTT_File_Cleaner

So if you work in a large organization, you probably use video conferencing tools like Zoom or Microsoft Teams that can automatically generate transcripts logging who said what and when. These tend to come as VTT (Web Video Text Tracks, or WebVTT) files that look like this:

Example VTT from MS Teams…

The problem with VTT files is that they tend to be a bit unstructured — it would be great to have these translated into a nice tabular format.

The good news is someone actually already thought of this and came up with a solution:

webvtt-py · PyPI

WebVTT is a Python module for reading/writing VTT caption files.

According to the docs, you can extract the information like so:

import webvtt

for caption in webvtt.read('captions.vtt'):
    print(caption.start)
    print(caption.end)
    print(caption.text)

You can read the docs yourself; the library is relatively simple to use. The bad news is that it is strict about timestamp formatting: it expects timestamps in the format 00:00:00.000, but sometime in 2022 Microsoft Teams seems to have defaulted to a format of 0:0:0.0, so feeding such a VTT file into it produces an error message:

Even after trawling through Stack Overflow and scanning the docs, I couldn’t find a solution, so I wrote my own simple clean-up routine (a series of plain string replacements rather than a true regex) to fix the timestamps.

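filename = "SampleWorkshopTranscript.vtt"

with open(filename) as f:
    text = f.read()

# Note: these replacements run over the whole file, not just the timestamp lines.
for i in range(0, 10):
    j = str(i)
    # Pad single-digit hour/minute/second fields and single-digit fractions
    text = text.replace(j + ":", "0" + j + ":")
    text = text.replace("." + j + " ", ".00" + j)
    text = text.replace(":" + j + ".", ":0" + j + ".")
    for k in range(0, 10):
        l = str(k)
        # Pad two-digit fractions, then undo the over-padding of two-digit fields
        text = text.replace("." + j + l + " ", ".0" + j + l)
        text = text.replace("." + j + l + "\n", ".0" + j + l + "\n")
        text = text.replace(":" + j + "0" + l + ":", ":" + j + l + ":")

# Second pass to catch any over-padded two-digit fields that are left
for i in range(0, 10):
    j = str(i)
    for k in range(0, 10):
        l = str(k)
        text = text.replace(":" + j + "0" + l + ":", ":" + j + l + ":")

# Pad single-digit fractions that sit at the end of a line
for i in range(0, 10):
    j = str(i)
    text = text.replace("." + j + "\n", ".00" + j + "\n")

# Save the result so webvtt can read it (the output file name is an assumption)
cleanedVTT = "Cleaned_" + filename
with open(cleanedVTT, "w") as f:
    f.write(text)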
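For comparison, here is a minimal sketch of how the same clean-up might look with an actual regular expression. It is not the code from the repo: the names `normalize_timestamp` and `normalize_vtt` are hypothetical, and it assumes every timestamp has the shape h:m:s.fraction as in the 0:0:0.0 example. It only touches lines containing "-->", and it right-pads the fractional part (so .2 is read as 200 ms), which differs slightly from the replacements above.

```python
import re

# h:m:s.fraction with any number of digits per field (assumed Teams-style input)
_TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)\.(\d+)")


def normalize_timestamp(match):
    """Rewrite one matched timestamp as HH:MM:SS.mmm."""
    hours, minutes, seconds, fraction = match.groups()
    # Right-pad the fraction and cut it to milliseconds: ".2" -> "200"
    millis = (fraction + "000")[:3]
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}.{millis}"


def normalize_vtt(text):
    """Normalize timestamps on cue timing lines only."""
    cleaned = []
    for line in text.splitlines():
        # Only cue timing lines contain "-->", so spoken text is left alone
        if "-->" in line:
            line = _TIMESTAMP.sub(normalize_timestamp, line)
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"
```

To use it, read the original file into a string, pass it through `normalize_vtt`, write the result to a new .vtt file, and hand that file's path to `webvtt.read`.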

After which it was relatively simple to pull the contents of the VTT into a table like so.

import pandas as pd
import webvtt

start = []
end = []
text = []
speaker = []

# cleanedVTT is the path to the .vtt file with cleaned-up timestamps
for caption in webvtt.read(cleanedVTT):
    start.append(caption.start)
    end.append(caption.end)
    text.append(caption.text)
    speaker.append(caption.raw_text)  # raw text still carries the <v Speaker Name> tag

df = pd.DataFrame(list(zip(start, end, text, speaker)),
                  columns=['StartTime', 'EndTime', 'Text', 'Speaker'])

# Keep what precedes the first '>' (i.e. "<v Speaker Name"), then drop the "<v " prefix
listx = df['Speaker'].str.split('>', n=1, expand=True)
df["Speaker"] = listx[0]
df["Speaker"] = df["Speaker"].str.replace("<v ", "")
df
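The speaker name comes from the voice tag that opens each caption's raw text, for example `<v Some Name>`. As a rough illustration of that step in isolation, here is a small standard-library sketch; the helper name `speaker_of` and the sample strings are made up for this example. Unlike the split-and-replace approach, it returns `None` when a caption has no voice tag at all.

```python
import re

# A WebVTT voice tag looks like <v Speaker Name>; capture the name part
_VOICE_TAG = re.compile(r"<v\s+([^>]+)>")


def speaker_of(raw_text):
    """Return the speaker named in the first voice tag, or None if there is none."""
    match = _VOICE_TAG.search(raw_text)
    return match.group(1).strip() if match else None


print(speaker_of("<v Simon Tan>Let's get started.</v>"))
print(speaker_of("A caption without a voice tag"))
```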

Now with the text tabulated in this manner, you can do a whole bunch of really neat things!

For example, with some cleverly placed key phrases, you can use the transcripts to take minutes. Call out an unusual phrase before each action item (e.g. “Alpha Charlie Tango: Action Party Simon to review XYZ by DD/MM/YYYY”) and then use text search to extract all the snippets that contain these key words.
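As a rough sketch of that idea in Python, the snippet below scans transcript rows for the marker phrase and keeps whatever was said after it. The function name `extract_actions` and the sample rows are invented for illustration; the rows are assumed to be plain dicts with the same column names as the table above (something like `df.to_dict("records")` would produce that shape).

```python
MARKER = "alpha charlie tango"  # the spoken trigger phrase from the example


def extract_actions(rows, marker=MARKER):
    """Return (speaker, start time, action text) for each row containing the marker."""
    actions = []
    for row in rows:
        text = row["Text"]
        position = text.lower().find(marker)
        if position == -1:
            continue
        # Keep only what was said after the marker phrase
        action = text[position + len(marker):].strip(" -:,.")
        actions.append((row["Speaker"], row["StartTime"], action))
    return actions


# Made-up sample rows, only to show the expected shape of the input
rows = [
    {"StartTime": "00:01:05.000", "Speaker": "Ann",
     "Text": "Let's move on to the budget."},
    {"StartTime": "00:04:10.500", "Speaker": "Ann",
     "Text": "Alpha Charlie Tango - Simon to review the budget by Friday."},
]

for speaker, when, action in extract_actions(rows):
    print(when, speaker, action)
```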

To take things further, we could open up the Python toybox and bring out natural language libraries like NLTK or spaCy, but for a change I went instead for Microsoft Power BI, which has some out-of-the-box text analytics features like Sentiment Analysis and Key Phrase Extraction: https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-ai-insights#using-text-analytics-and-vision

Next, we need to do some minor styling and add two useful (and free!) custom visualizations from the Power BI marketplace: one for dynamic Word Cloud generation and another for Text Search.

https://appsource.microsoft.com/en-us/product/power-bi-visuals/WA104380752?tab=Overview

https://appsource.microsoft.com/en-us/product/power-bi-visuals/wa104381309?tab=overview

And… hey presto! We’ve built ourselves a simple text exploration tool that can do things like:

  • View how the sentiment changed over the course of the meeting, which, coupled with Power BI’s native Anomaly Detection feature, can give interesting insights (e.g. do certain speakers tend to use more or fewer positive words?)
  • Identify which individual speaker had the most airtime (measured either in absolute time or in “turns” to speak)
  • Search for key words to see how frequently they were mentioned during the session, by whom and when, and generate dynamic word clouds specific to the transcript snippets where these key words were uttered
  • Etc. etc.
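If you would rather compute the airtime figures in Python than in Power BI, a minimal sketch could look like the one below. The names `to_seconds` and `airtime_and_turns` are hypothetical, the rows are assumed to be dicts with the StartTime, EndTime and Speaker columns from the table earlier, and a “turn” is assumed to mean an unbroken run of captions from the same speaker.

```python
from collections import defaultdict


def to_seconds(timestamp):
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def airtime_and_turns(rows):
    """Return (seconds spoken per speaker, number of turns per speaker)."""
    airtime = defaultdict(float)
    turns = defaultdict(int)
    previous = None
    for row in rows:
        speaker = row["Speaker"]
        airtime[speaker] += to_seconds(row["EndTime"]) - to_seconds(row["StartTime"])
        # A new turn starts whenever the speaker differs from the previous caption
        if speaker != previous:
            turns[speaker] += 1
        previous = speaker
    return dict(airtime), dict(turns)
```

Counting turns as runs of consecutive captions avoids inflating the number for speakers whose long monologues are split into many short captions.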

Anyway, the possibilities are endless, especially if you know a bit more Python. You could do things like:

Hopefully this was somewhat useful. Share in the comments below if you have any ideas on what you’d like to do with your meeting transcript info.

Till next time!
