How to Extract YouTube Playlist Transcripts Using Python for NotebookLM & LLMs

 

Introduction

Ever since I started using NotebookLM, it’s completely changed how I deal with information. I’ll admit it — I’m a bit lazy when it comes to digging through dense resources. Got a research paper? I’ll let two AI bots debate it. Found an intriguing book? I just upload the PDF and chat with it. Two-hour podcast on YouTube? I’ll skim through an AI-generated summary and move on.

NotebookLM is this incredibly useful AI research and note-taking assistant that reshapes how you engage with content. It lets you sift through massive amounts of material, ask questions grounded in your own sources, and even discover insights you might have missed.

Problem

But there’s one snag: NotebookLM only accepts individual YouTube video URLs. You can’t just throw in an entire playlist — which becomes a real limitation when you’re working with full-length courses, lecture series, or curated educational content.

Solution

“Elementary, my dear Watson.”

— Sherlock Holmes, The Adventures of Sherlock Holmes

That’s where this automation comes in. I built a simple Python script to bridge that gap — by turning any YouTube playlist into a single, clean text file that’s ready for NotebookLM or any AI tool you prefer.

Tools Used

To make this automation both efficient and reliable, I leaned on a few powerful tools and libraries that handle the heavy lifting:

  1. yt-dlp: This is a command-line tool (and an actively maintained fork of youtube-dl) that lets you extract metadata from YouTube videos and playlists — including video IDs, titles, and more — without downloading the actual videos. It’s fast, script-friendly, and perfect for extracting structured info at scale.

  2. youtube-transcript-api: This Python library makes it easy to fetch transcripts from YouTube videos, whether they’re human-generated or auto-generated by YouTube. It also supports fallback options and handles cases where transcripts are partially available or in different languages.

  3. pandas: For collecting, organizing, and optionally analyzing the transcript data, pandas is a no-brainer. It helps structure everything into a clean DataFrame before exporting it to a .txt file.

  4. Standard Python Libraries: Modules like subprocess, json, and re are used to interact with command-line tools, parse metadata, and clean up the transcript text by removing things like [Music], [Laughter], etc.

These tools work together seamlessly, giving you a script that’s both robust and easy to modify if you want to extend its functionality later.
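
If you don’t already have them, a single pip command covers all three external dependencies:

pip install yt-dlp youtube-transcript-api pandas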


The Code

The code automates the extraction of transcripts from an entire YouTube playlist. It works by first using yt-dlp to gather all video URLs from the playlist. Then, for each video, it fetches the transcript using youtube-transcript-api, cleans it by removing unwanted annotations, and saves the final output as a single text file — organized by video title.

Step 1: Import Required Libraries

import subprocess, json, re
import pandas as pd
from youtube_transcript_api import (
    YouTubeTranscriptApi, TranscriptsDisabled,
    NoTranscriptFound, VideoUnavailable
)

What we’re doing here:

  • subprocess: to run yt-dlp commands.
  • json & re: to parse and clean data.
  • pandas: to structure and export results.
  • youtube_transcript_api: to fetch transcripts directly from YouTube.

Step 2: Extract All Video URLs from the Playlist

def get_playlist_urls(playlist_url):
    try:
        result = subprocess.run(
            ['yt-dlp', '--flat-playlist', '--dump-json', playlist_url],
            capture_output=True, text=True, check=True
        )
        return [
            f"https://www.youtube.com/watch?v={json.loads(line)['id']}"
            for line in result.stdout.strip().split('\n')
        ]
    except subprocess.CalledProcessError as e:
        print("yt-dlp error:", e)
        return []

Using yt-dlp, this function pulls only the video IDs from a playlist. We then reconstruct the full video URLs for each item. It’s efficient, fast, and doesn’t download any actual video content.
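
As a quick illustration, the function returns plain watch URLs (the playlist URL and video IDs below are placeholders, not real values):

urls = get_playlist_urls("https://www.youtube.com/playlist?list=PL_EXAMPLE")
print(urls[:2])
# ['https://www.youtube.com/watch?v=abc123def45', 'https://www.youtube.com/watch?v=XyZ987uvw_0']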

Step 3: Fetch Playlist Title

def get_playlist_title(playlist_url):
    try:
        result = subprocess.run(
            ['yt-dlp', '--flat-playlist', '--dump-single-json', playlist_url],
            capture_output=True, text=True, check=True
        )
        return json.loads(result.stdout).get('title', 'Unknown Playlist')
    except Exception as e:
        print(f"Failed to fetch playlist title: {e}")
        return "Unknown Playlist"

Step 4: Parse and Clean Each Video

Extract the Video ID

def extract_video_id(url):
    match = re.search(r'(?:v=|\/)([0-9A-Za-z_-]{11})', url)
    if not match:
        raise ValueError(f"Invalid YouTube URL: {url}")
    return match.group(1)
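
A quick check that it handles both common URL shapes (the ID here is just an example string):

print(extract_video_id("https://www.youtube.com/watch?v=abc123def45"))  # abc123def45
print(extract_video_id("https://youtu.be/abc123def45"))                 # abc123def45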

Fetch Video Title

def get_video_title(url):
    try:
        result = subprocess.run(
            ['yt-dlp', '--skip-download', '--print-json', url],
            capture_output=True, text=True, check=True
        )
        return json.loads(result.stdout).get('title', 'Unknown Title')
    except Exception:
        return "Unknown Title"

Again, using yt-dlp, we get the title of the video — handy for labeling each transcript in the final text file.

Clean Up Transcript Text

# Clean unwanted annotations in the subtitle/transcript ([Music], [Applause], etc.)
def clean_transcript(text):
    text = re.sub(r'\[.*?\]', '', text)        # strip [Music], [Applause], etc.
    return re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace

Transcripts often come with [Music], [Applause], and similar tags. This function strips those out and tightens up the spacing.
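
For example:

print(clean_transcript("so [Music] welcome back  everyone"))
# so welcome back everyone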

Get the Transcript

# Get video transcript with annotations removed
def get_transcript_text(video_id):
    try:
        raw_transcript = YouTubeTranscriptApi.get_transcript(video_id)
    except NoTranscriptFound:
        transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
        raw_transcript = transcripts.find_transcript(
            [t.language_code for t in transcripts]
        ).fetch()
    except (TranscriptsDisabled, VideoUnavailable):
        raise RuntimeError("Transcript disabled or video unavailable.")

    transcript_text = ' '.join(
        entry['text'].strip() for entry in raw_transcript if entry['text'].strip()
    )

    return clean_transcript(transcript_text)

This is the heart of the script. We attempt to fetch the transcript using the YouTubeTranscriptApi. If the default fetch fails (say, due to language issues), we fall back to a list of available transcripts and pick one that works.

It gracefully handles missing transcripts or disabled ones without crashing.
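
One note: the code above targets the classic dict-based interface (youtube-transcript-api releases before 1.0). If you’re on a newer version, the library moved to an instance-based API; a rough equivalent sketch (verify the details against the version you have installed) looks like this:

from youtube_transcript_api import YouTubeTranscriptApi

ytt_api = YouTubeTranscriptApi()
fetched = ytt_api.fetch(video_id)   # returns a FetchedTranscript object
transcript_text = ' '.join(snippet.text for snippet in fetched)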

Process a Single Video

def process_video(url):
    try:
        video_id = extract_video_id(url)
        title = get_video_title(url)
        transcript = get_transcript_text(video_id)
        print(f"Fetched: {title}")
        return {'video_id': video_id, 'title': title, 'transcript': transcript}
    except Exception as e:
        print(f"Skipped: {url} | Reason: {e}")
        return None

This ties everything together: grab the ID, fetch the title, pull the transcript, and return the structured result. If any part fails, the video is skipped, and we log the reason.
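
Running it on a single URL (the ID is a placeholder) returns either a dict or None:

row = process_video("https://www.youtube.com/watch?v=abc123def45")
if row:
    print(row['title'], '|', len(row['transcript']), 'characters')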

Step 5: Main Pipeline

def main(playlist_url):
    print(f"Processing playlist: {playlist_url}")

    playlist_title = get_playlist_title(playlist_url)
    video_urls = get_playlist_urls(playlist_url)

    print(f"Found {len(video_urls)} videos in playlist: {playlist_title}\n")

    results = []
    for url in video_urls:
        data = process_video(url)
        if data:
            data['playlist_title'] = playlist_title
            results.append(data)

    df = pd.DataFrame(results)
    print(f"\nCompleted transcripts extraction for {len(df)} videos.")

    filename = f"{playlist_title.replace(' ', '_').lower()}.txt"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"Playlist Title: {playlist_title}\n\n")
        for _, row in df.iterrows():
            f.write(row['title'].strip() + "\n")
            f.write(row['transcript'].strip() + "\n\n")

    print(f"Transcript file saved as: {filename}")
    return df

This is the high-level function that:

  • Reads the playlist.
  • Loops through each video.
  • Extracts transcripts.
  • Stores everything in a DataFrame.

  • Outputs a well-structured .txt file containing every video title and its transcript.

Here’s what the .txt output looks like:

Playlist Title: Intro to AI - Stanford

Lecture 1: Welcome to AI
[Transcript content here...]

Lecture 2: Search Algorithms
[Transcript content here...]

...

Step 6: Run It

playlist_url = "https://youtube.com/playlist?list=PLzfP3sCXUnxFH6JIZqHTLfV40Ii8Heu3v&si=MYJuF889HjfgsvCJ"
df_transcripts = main(playlist_url)

Just change the playlist_url to whatever playlist you want to process, and run the script. The output file will be saved in your current directory with a name based on the playlist title.
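
If you’d rather pass the playlist URL on the command line instead of editing the script, a small wrapper does the trick (a sketch, assuming everything above lives in the same file):

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract transcripts from a YouTube playlist")
    parser.add_argument("playlist_url", help="Full URL of the YouTube playlist")
    args = parser.parse_args()
    main(args.playlist_url)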

Final Thoughts

This script dramatically reduces the friction in working with YouTube-based learning content. It turns long-form videos — lectures, tutorials, seminars — into AI-ready text you can use with tools like NotebookLM, GPT-based agents, or custom RAG pipelines.

Whether you’re a researcher trying to digest entire courses or an engineer feeding transcripts into an LLM backend, this automation saves hours of manual effort and repetitive scraping.


Next Steps

You can build on this core script to:

  • Filter by language: Auto-detect transcript language and skip or translate unsupported videos.

  • Export to other formats: Support .csv, .json, or .md for richer data workflows or integration with note-taking tools (see the sketch after this list).

  • Translate non-English content: Use services like DeepL or the Google Translate API to convert transcripts automatically.

  • LLM Integration: Plug outputs into retrieval-augmented generation (RAG) setups using tools like LangChain, Haystack, or LlamaIndex.
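
Here’s a minimal sketch of the export idea, reusing the DataFrame that main() returns (the filenames are just examples):

# Assumes df_transcripts = main(playlist_url) has already been run
df_transcripts.to_csv("transcripts.csv", index=False)
df_transcripts.to_json("transcripts.json", orient="records", force_ascii=False)

# Markdown: one section per video
with open("transcripts.md", "w", encoding="utf-8") as f:
    for _, row in df_transcripts.iterrows():
        f.write(f"## {row['title']}\n\n{row['transcript']}\n\n")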


Alternative Approaches

If you’re looking for other ways to extract transcripts from YouTube playlists, here are some alternatives:

  1. Whisper-based scripts: Download audio and run OpenAI Whisper or faster-whisper locally for high-quality transcripts, especially for videos without captions (see the sketch below).

  2. Web-based tools: Use services like youtubetranscript.com or tactiq.io for quick single-video transcript extraction (playlist support varies).

  3. YouTube API + youtube-transcript-api: Fetch playlist items via the YouTube Data API and then pull transcripts via Python — similar to this script but with stricter quota management.

  4. Automation platforms: Use tools like Apify, Bardeen, or Zapier to chain URL scraping + transcript extraction + export to Sheets or Notion.

These methods vary in accuracy, control, and cost — your best choice depends on your workflow needs.
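
For the Whisper route, here’s a rough sketch (it assumes yt-dlp, ffmpeg, and the openai-whisper package are installed; the function and file names are illustrative):

import subprocess
import whisper  # pip install openai-whisper

def whisper_transcript(video_url, audio_path="audio.mp3", model_size="base"):
    # Download the audio track only, no video
    subprocess.run(
        ['yt-dlp', '-x', '--audio-format', 'mp3', '-o', audio_path, video_url],
        check=True
    )
    # Transcribe locally with Whisper
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return result['text']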


GitHub & Contributions

You can find the full codebase, install instructions, and example notebooks here: 👉 GitHub Repository: youtube-to-notebooklm

Feel free to fork, submit PRs, or raise issues — especially if you’re integrating this into your AI stack or extending it for production use.