I grew up watching Seinfeld and for better or worse it probably formed a large part of my sense of humor. I was too young to watch the early seasons live, but for the later ones I clearly remember the excitement I’d feel at 8:30pm Thursday night when Friends started because that meant it was only half an hour until Seinfeld would come on.
I also typically watched re-runs every night so I saw some of the older episodes, but they were mostly out of order. It wasn’t until the DVDs came out that I actually watched whole seasons end to end, and I developed a deep appreciation of the show.
At some point around then I wrote some Python code to scrape Seinfeld scripts from the web. I wanted to play around with various natural language processing algorithms using the scripts as a corpus. I recently thought it’d be fun to play around with the data again, so I cleaned up the scraping code and verified that everything still works.
If you want to generate this data yourself, check out the code on my GitHub. If you just want the SQLite DB file, feel free to send me a message on GitHub or email me.
Some Simple Statistics
There are a lot of potentially interesting things to do with this data, most of which would require further processing. Still, some basic but interesting questions can be answered with simple SQL queries.
Which characters speak the most lines?
That seems about right: the show is definitely dominated by the four main characters.
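For anyone who wants to run this kind of count themselves, here is a rough sketch using Python's built-in sqlite3 module. The table name `utterance` and its columns are assumptions made up for illustration (the real database schema may well differ), and the rows are placeholders rather than actual script data.

```python
import sqlite3

# Hypothetical schema: the real database's table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE utterance (episode_id INTEGER, speaker TEXT, text TEXT)")

# Placeholder rows for illustration only, not actual script data.
conn.executemany(
    "INSERT INTO utterance VALUES (?, ?, ?)",
    [
        (1, "JERRY", "placeholder line a"),
        (1, "GEORGE", "placeholder line b"),
        (1, "JERRY", "placeholder line c"),
        (2, "ELAINE", "placeholder line d"),
        (2, "JERRY", "placeholder line e"),
        (2, "GEORGE", "placeholder line f"),
    ],
)


def lines_per_speaker(conn, limit=10):
    """Return (speaker, line_count) pairs, most lines first."""
    query = """
        SELECT speaker, COUNT(*) AS num_lines
        FROM utterance
        GROUP BY speaker
        ORDER BY num_lines DESC, speaker ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


top_speakers = lines_per_speaker(conn)
print(top_speakers)
```

Against the real file you would replace the in-memory setup with `sqlite3.connect(...)` pointing at the DB and adjust the table and column names to match.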
Which characters have speaking roles in the greatest number of episodes?
The appearances are also dominated by the main cast. Interestingly, some lines are attributed to “MAN” and “WOMAN”, which points to some data quality issues. Ideally unnamed characters would have unique names like “MAN WATERING PLANTS”.
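A count like this could be sketched as follows, again with Python's sqlite3 module. As before, the `utterance` table, its columns, and the rows are assumptions for illustration only, not the actual schema or data. The toy rows also show how a generic label like "MAN" gets counted as one recurring character even when the lines may belong to different people.

```python
import sqlite3

# Hypothetical schema: the real database's table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE utterance (episode_id INTEGER, speaker TEXT, text TEXT)")

# Placeholder rows for illustration only, not actual script data.
conn.executemany(
    "INSERT INTO utterance VALUES (?, ?, ?)",
    [
        (1, "JERRY", "placeholder line a"),
        (1, "JERRY", "placeholder line b"),
        (1, "MAN", "placeholder line c"),
        (2, "JERRY", "placeholder line d"),
        (2, "MAN", "placeholder line e"),
        (2, "ELAINE", "placeholder line f"),
        (3, "JERRY", "placeholder line g"),
        (3, "ELAINE", "placeholder line h"),
    ],
)


def episodes_per_speaker(conn, limit=10):
    """Return (speaker, episode_count) pairs, counting each episode once."""
    query = """
        SELECT speaker, COUNT(DISTINCT episode_id) AS num_episodes
        FROM utterance
        GROUP BY speaker
        ORDER BY num_episodes DESC, speaker ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


appearances = episodes_per_speaker(conn)
print(appearances)
```

The `COUNT(DISTINCT episode_id)` is what separates this from the line count: a character with many lines in one episode still counts as a single appearance.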
One of the main reasons I initially got this data together was to learn some front-end skills by developing a better UI for browsing through Seinfeld scripts. I had imagined all kinds of cross-linking between episodes and possibly links off to Wikipedia.
Exploration of the Characters
- Do particular characters have catch phrases (maybe high TF-IDF ngrams where TF is within the character’s lines and IDF is for all speakers)?
- Are there characters who gain screen time over time?
- How many episodes are heavy on just a few of the main characters (e.g. a Jerry and George episode)?
- How positive, on average, are the various characters? Are there other interesting stylistic characteristics to look at?
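The catch phrase idea in the first bullet could be sketched roughly like this: treat all of a speaker's lines as one document, compute term frequency of each n-gram within that document, and compute inverse document frequency across speakers. This is only one plain TF-IDF variant (no smoothing), the function names are my own, and the speakers and lines below are made-up placeholders rather than real dialogue.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Split text on whitespace and return its lowercase word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def catch_phrase_scores(lines_by_speaker, n=2):
    """Score each speaker's n-grams by TF-IDF.

    TF is the n-gram's share of the speaker's own n-grams; IDF is
    log(number of speakers / number of speakers who use the n-gram).
    """
    counts = {
        speaker: Counter(g for line in lines for g in ngrams(line, n))
        for speaker, lines in lines_by_speaker.items()
    }
    num_speakers = len(counts)

    # For each n-gram, how many distinct speakers use it at least once.
    speakers_using = Counter()
    for speaker_counts in counts.values():
        speakers_using.update(speaker_counts.keys())

    scores = {}
    for speaker, speaker_counts in counts.items():
        total = sum(speaker_counts.values())
        scores[speaker] = {
            gram: (count / total) * math.log(num_speakers / speakers_using[gram])
            for gram, count in speaker_counts.items()
        }
    return scores


# Made-up placeholder lines, not actual dialogue.
toy_lines = {
    "SPEAKER_A": ["blue kettle on the table", "blue kettle again", "on the table"],
    "SPEAKER_B": ["on the table now", "green lamp on the table"],
}
scores = catch_phrase_scores(toy_lines)
best_for_a = max(scores["SPEAKER_A"], key=scores["SPEAKER_A"].get)
print(best_for_a)
```

Phrases that every speaker uses get an IDF of zero, so only n-grams that are both frequent for a character and rare elsewhere float to the top.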
Corpus for Exploring NLP Algorithms
I like playing with Wikipedia, but it’ll be fun to have something a bit smaller and closer to my heart. It’d be fun to play around with language models and to generate sentences for particular characters (e.g. a Kramerish sentence).
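As a starting point for character-specific generation, here is a minimal sketch of a bigram (first-order Markov) model trained on a single character's lines. It is about the simplest language model there is, not a claim about what would work best, and the training lines are made-up placeholders rather than real dialogue.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"


def train_bigram_model(lines):
    """Map each word to the list of words observed directly after it."""
    model = defaultdict(list)
    for line in lines:
        tokens = [START] + line.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev].append(nxt)
    return dict(model)


def generate(model, rng, max_words=20):
    """Sample a sentence by walking the bigram chain from START."""
    word, out = START, []
    while len(out) < max_words:
        successors = model.get(word)
        if not successors:  # empty model, nothing to sample from
            break
        word = rng.choice(successors)
        if word == END:
            break
        out.append(word)
    return " ".join(out)


# Made-up placeholder lines standing in for one character's dialogue.
toy_lines = [
    "the kettle is on the table",
    "the lamp is on the shelf",
    "the kettle is blue",
]
model = train_bigram_model(toy_lines)
sentence = generate(model, random.Random(0))
print(sentence)
```

Training one such model per speaker and sampling from each would give a crude first look at how distinct the characters sound; higher-order n-grams or a neural model would be the natural next steps.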