YouTube Data for Research

Sometimes I interact with folks interested in digital projects that entail some form of video analysis. These noble hypothetical folk, whether they know it or not, join a quest to augment Digital Humanities discourse with a format that doesn’t get enough attention. Brave souls. Just for starters, video data sources can be tough to gain access to. For projects of this type the Internet Archive (IA) is a shining star amidst a sea of negligible possibilities. Numerous (>1.9 million videos) and diverse content, combined with some excellent tutorials, makes it relatively easy to bulk download IA data. A boon for video research.

Up until a couple of days ago I would not have considered YouTube a source worth anything but headaches, or perhaps a laugh at a fainting goat, but a neat command line tool called Youtube-dl definitely changed that. For research purposes, if you want all video content produced by The White House, video content that matches the search query ‘Dragons’, a single video, or a custom playlist of videos, perhaps even with associated text transcripts, then Youtube-dl is a game changer, knowledge of which may make you start spinning in circles right at this very moment.

Enumerated game changing qualities:

  1. Low barrier use – no programming chops needed to get started
  2. Easy to scale – download one video, download all video from a playlist, download one/some/all video created by user (e.g. The White House), download video(s) that match a search query
  3. Manage data – impose file naming conventions on collected data derived from various components of the file (e.g. dateuploaded_user_title.mp4)
  4. Granular control – specify video format and quality, control dataset size (e.g. skip any file over 1 GB, or stop after a set number of downloads)
  5. More than video – download one or all available text transcripts, extract audio from video

In what follows I’ll work through how to install Youtube-dl and implement some of the awesome discussed above.


What You Need

Brew (Homebrew) – package management system, basically makes it easier for you to install software
FFmpeg – lets you manipulate multimedia content, basically all the things of multimedia work
Youtube-dl – command line program for downloading YouTube content


Installing FFmpeg and Youtube-dl

– Open Terminal
– Enter the following commands

brew install ffmpeg
brew install youtube-dl
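
If you want to confirm both tools actually landed on your PATH before moving on, something like the following should do. This is only a sketch: have_tool and check_tools are names I made up for illustration, not part of Brew, FFmpeg, or Youtube-dl.

```shell
# Hypothetical helpers: report whether each named command can be found.

# Succeeds if the command exists on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Prints one line per tool; returns 1 if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if have_tool "$tool"; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_tools ffmpeg youtube-dl should print an ok line for each tool once the installs above have finished.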

Use Case – Building a White House dataset 

You want to capture video related to the Obama Presidency. Starting with the Inaugural Address is probably as good a place as any. Maybe you want to study characteristics of video composition (video data), perform some audio analysis (audio data), and maybe even consider a text analysis of the inaugural speech (text data). Eventually you might even decide you want video produced by The White House between a certain period of time. Perhaps you might also want to build a playlist related to White House coverage of Ferguson and download that – videos, video descriptions, audio, and subtitle text data. What follows should give you what you need to approach all of the above.

– Create a folder to contain files you capture
– Open Terminal
– In Terminal navigate to the folder you created, e.g. cd ~/Desktop/youtubedl/whitehouse
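
If you’d rather create the folder and move into it in one go, a minimal sketch follows. enter_workdir is a hypothetical helper name, and the path in the example is just the one used above.

```shell
# Hypothetical helper: create the folder (and any missing parent folders),
# then change into it. Does nothing harmful if the folder already exists.
enter_workdir() {
  mkdir -p "$1" && cd "$1"
}
```

For example: enter_workdir "$HOME/Desktop/youtubedl/whitehouse"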

After making your way to the folder, you have a number of different ways to use Youtube-dl:

Single item, default to highest quality video

youtube-dl "https://www.youtube.com/watch?v=3PuHGKnboNY"

Single item, with file naming conventions imposed

youtube-dl --restrict-filenames -o "%(upload_date)s.%(uploader)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"
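
Before committing to a naming convention across a big download, it can help to see the filename a template produces. Youtube-dl’s --get-filename option prints the name without downloading the video. The wrapper below is a sketch; preview_name, YTDL, and TEMPLATE are names introduced here for illustration only.

```shell
# Assumed names: YTDL lets you point at a different binary,
# TEMPLATE holds the naming convention used in the command above.
YTDL="${YTDL:-youtube-dl}"
TEMPLATE='%(upload_date)s.%(uploader)s.%(title)s.%(ext)s'

# Print the filename the template would yield for one URL, without downloading.
preview_name() {
  "$YTDL" --get-filename --restrict-filenames -o "$TEMPLATE" "$1"
}
```

For example: preview_name "https://www.youtube.com/watch?v=3PuHGKnboNY"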

Single item, extract audio 

youtube-dl --restrict-filenames --extract-audio --audio-format "mp3" -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"

Single item, extract subtitles

youtube-dl --restrict-filenames --all-subs -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"
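
The command above grabs the video along with its subtitles. If text is all you’re after, a sketch like this one skips the video and asks for a single subtitle language. subs_only, YTDL, and TEMPLATE are hypothetical names, and whether a given language exists depends entirely on the video.

```shell
# Assumed names, introduced for illustration only.
YTDL="${YTDL:-youtube-dl}"
TEMPLATE='%(upload_date)s.%(uploader)s.%(title)s.%(ext)s'

# Download subtitles in one language (e.g. en) and skip the video itself.
# --write-sub covers uploader-provided subtitles; automatic captions
# would need --write-auto-sub instead.
subs_only() {
  lang="$1"
  url="$2"
  "$YTDL" --skip-download --write-sub --sub-lang "$lang" \
    --restrict-filenames -o "$TEMPLATE" "$url"
}
```

For example: subs_only en "https://www.youtube.com/watch?v=3PuHGKnboNY"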

Multiple items, download content from user between dates 

youtube-dl --dateafter 20150101 --datebefore 20150107 --restrict-filenames -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" https://www.youtube.com/user/whitehouse
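
Date ranges are easy to mistype, so if you plan to pull several windows it may be worth wrapping the command above in a small check. The following is a sketch under my own assumptions: is_yyyymmdd and fetch_between are made-up names, and the check only confirms eight digits and a sensible order, not that the date is a real calendar day.

```shell
# Assumed names, introduced for illustration only.
YTDL="${YTDL:-youtube-dl}"
TEMPLATE='%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s'

# Succeeds only if the argument is exactly eight digits.
is_yyyymmdd() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: fetch_between START END URL   (dates as YYYYMMDD)
fetch_between() {
  after="$1"
  before="$2"
  url="$3"
  if ! is_yyyymmdd "$after" || ! is_yyyymmdd "$before"; then
    echo "dates must be YYYYMMDD" >&2
    return 2
  fi
  if [ "$after" -gt "$before" ]; then
    echo "start date is later than end date" >&2
    return 2
  fi
  "$YTDL" --dateafter "$after" --datebefore "$before" \
    --restrict-filenames -o "$TEMPLATE" "$url"
}
```

For example: fetch_between 20150101 20150107 https://www.youtube.com/user/whitehouse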

Multiple items, search query ‘obama and ferguson’ – five videos

youtube-dl -t ytsearch5:"obama and ferguson" --restrict-filenames
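
The ytsearch5: prefix in the command above is what sets the number of results. If you expect to vary the count or the query, a tiny helper can assemble that argument for you. search_spec is a hypothetical name, and this sketch only builds the string; it doesn’t run a search.

```shell
# Hypothetical helper: build a search argument such as
# "ytsearch5:obama and ferguson" from a count and a query.
search_spec() {
  count="$1"
  shift
  case "$count" in
    ''|*[!0-9]*)
      echo "count must be a positive integer" >&2
      return 2
      ;;
  esac
  if [ "$count" -lt 1 ]; then
    echo "count must be a positive integer" >&2
    return 2
  fi
  # Remaining arguments are joined with spaces to form the query.
  printf 'ytsearch%s:%s\n' "$count" "$*"
}
```

For example: youtube-dl --restrict-filenames "$(search_spec 5 obama and ferguson)"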

Multiple items, build a playlist – download video, video descriptions, audio, and subtitles

youtube-dl --restrict-filenames --write-description --extract-audio --audio-format "mp3" -k --all-subs -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=PLf7yYLO8w1_lSVBqeZmp17dy7kvql6qTP"
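
A playlist download like this leaves several kinds of file in the folder, so a quick tally by extension is a cheap way to check that video, audio, descriptions, and subtitles all arrived. count_by_ext is a name I made up, and the sketch only counts whatever extensions it finds; it doesn’t know which ones Youtube-dl should have produced.

```shell
# Hypothetical helper: count regular files in a folder by final extension.
# Defaults to the current folder when no argument is given.
count_by_ext() {
  dir="${1:-.}"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    name="${f##*/}"
    case "$name" in
      *.*) printf '%s\n' "${name##*.}" ;;
      *)   printf '%s\n' "(none)" ;;
    esac
  done | LC_ALL=C sort | uniq -c | awk '{print $2 ": " $1}'
}
```

Run count_by_ext from inside the download folder, or pass the folder as an argument.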


And there you have it. YouTube data for research. After working through the above, consider some of Youtube-dl’s more advanced features.
