A comprehensive Python toolkit for downloading YouTube videos, extracting transcriptions, and generating beautiful HTML pages with embedded videos, descriptions, and AI-enhanced content.
See the tool in action: YouTube Videos Collection - 2025
This demo shows 35 videos from 2025, all processed and organized using this YouTube Page Builder tool.
- Overview
- Features
- Quick Start
- Project Structure
- Input Structure
- Output
- Advanced Usage
- Error Handling
- Performance
- Troubleshooting
- Contributing
- License
- Support
This project consists of two main components:
- Audio-to-JSON Tool (audio-to-json/) - Downloads YouTube videos and converts them to structured JSON with transcriptions
- YouTube Page Builder (yt-page-builder/) - Generates beautiful HTML pages from processed video data
- YouTube Download: Download videos, playlists, or entire channels using yt-dlp
- Audio Extraction: Automatically extract audio in Opus format
- Metadata Extraction: Download video descriptions, thumbnails, and upload dates
- Speech Recognition: Transcribe audio using DistilWhisper with MPS acceleration
- JSON Output: Structured output with all video metadata and transcriptions
- Automatic video embedding: Extracts YouTube video IDs and creates embedded players
- Clean, responsive design: Modern, mobile-friendly HTML pages with CSS styling
- Video metadata: Displays video titles and publication dates
- Description section: Includes complete video descriptions with clickable hyperlinks and timestamp links
- Transcript section: Full video transcripts with semantic paragraph organization and filler word removal
- AI-generated tags: 3-5 relevant tags automatically generated using NLP analysis
- Custom links: Configurable links section (optional)
- Batch processing: Process all videos at once or limit for testing
- Index page: Generate a beautiful index page listing all videos with links
- Live demo available: See YouTube Videos Collection - 2025 for a real-world example
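The "Automatic video embedding" feature above depends on turning a video URL into an ID and then into an embed URL. The following is a minimal sketch of that idea, not the project's actual code: the function names `extract_video_id` and `embed_url` are hypothetical, and it assumes the URL comes from somewhere like the `webpage_url` field that yt-dlp writes into `.info.json`.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a common YouTube URL form, or None.

    Hypothetical helper: covers watch, youtu.be, embed and shorts URLs only.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard watch URLs carry the ID in the "v" query parameter.
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None


def embed_url(video_id: str) -> str:
    """Build the iframe source URL for an embedded player."""
    return f"https://www.youtube.com/embed/{video_id}"
```

URLs from other hosts return None, so a caller can skip the embedded player instead of emitting a broken iframe.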
- Python 3.8+
- PyTorch
- Transformers
- Librosa
- NumPy
- Accelerate
- yt-dlp
- spaCy (for NLP processing)
Important: YouTube often requires authentication cookies to download videos. Cookies are needed or helpful for:
- Age-restricted content: Age-restricted videos require a logged-in account
- Private videos: Private videos require cookies from an account that has been granted access (unlisted videos only need the link)
- Rate limiting: Authenticated requests have higher rate limits
- Geo-restrictions: Some videos are region-locked and require location-based authentication
- Premium content: YouTube Premium content requires authentication
- Anti-bot measures: YouTube uses cookies to distinguish legitimate users from automated bots
Method 1: Using Browser Developer Tools (Recommended)
- Open YouTube in your browser (Chrome, Firefox, Safari, or Edge)
- Log in to your YouTube/Google account
- Open Developer Tools:
  - Chrome/Edge: Press F12 or Ctrl+Shift+I (Windows/Linux) / Cmd+Option+I (Mac)
  - Firefox: Press F12 or Ctrl+Shift+I (Windows/Linux) / Cmd+Option+I (Mac)
  - Safari: Enable Developer menu in Preferences > Advanced, then press Cmd+Option+I
- Go to the Application/Storage tab:
  - Chrome/Edge: Click "Application" tab, then "Cookies" in the left sidebar
  - Firefox: Click "Storage" tab, then "Cookies"
  - Safari: Click "Storage" tab, then "Cookies"
- Find YouTube cookies:
  - Look for youtube.com or google.com in the domain list
  - Key cookies to export: SID, HSID, SSID, APISID, SAPISID, __Secure-3PAPISID
- Export cookies:
  - Right-click on each cookie and copy the name and value
  - Or use browser extensions like "Cookie Editor" to export all cookies
Method 2: Using Cookie Extensions
- Install a cookie export extension:
  - Chrome: "Cookie Editor" or "EditThisCookie"
  - Firefox: "Cookie Quick Manager"
  - Safari: "Cookie Editor"
- Export cookies:
  - Go to YouTube while logged in
  - Use the extension to export cookies
  - Save as a .txt file
Method 3: Using yt-dlp's Built-in Cookie Extraction
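# Extract cookies from Chrome into cookies.txt (yt-dlp needs a URL to run; use "edge" for Edge)
yt-dlp --cookies-from-browser chrome --cookies cookies.txt "VIDEO_URL"
# Extract cookies from Firefox
yt-dlp --cookies-from-browser firefox --cookies cookies.txt "VIDEO_URL"
# Extract cookies from Safari
yt-dlp --cookies-from-browser safari --cookies cookies.txt "VIDEO_URL"
Option 1: Automated Setup (Recommended)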
# Run the cookie setup helper
python setup_cookies.py
This interactive script will guide you through the entire cookie setup process.
Option 2: Cookie File
# Use cookies saved in a file
python audio_to_json.py --url "VIDEO_URL" --cookies cookies.txt
Option 3: Environment Variable
# Set cookies as environment variable
export YT_COOKIES="SID=value; HSID=value; SSID=value; APISID=value; SAPISID=value; __Secure-3PAPISID=value"
python audio_to_json.py --url "VIDEO_URL"
Option 4: Direct Cookie String
# Pass cookies directly
python audio_to_json.py --url "VIDEO_URL" --cookies "SID=value; HSID=value; SSID=value; APISID=value; SAPISID=value; __Secure-3PAPISID=value"
- Never share your cookies: They contain your authentication credentials
- Use a dedicated account: Consider creating a separate YouTube account for downloading
- Regular rotation: Update cookies periodically as they expire
- Local storage only: Store cookies locally, never commit them to version control
- Limited scope: Only use cookies for legitimate content you have permission to download
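The cookie string used for the environment variable and the direct `--cookies` argument is a list of `name=value` pairs separated by semicolons. As a rough sketch of how such a string could be checked before a download is attempted, the helpers below parse it and report which of the key cookies named earlier are missing. `parse_cookie_string`, `missing_cookies`, and `KEY_COOKIES` are hypothetical names and are not taken from `audio_to_json.py`.

```python
from typing import Dict, List

# Key cookie names listed in the export instructions of this README.
KEY_COOKIES = ("SID", "HSID", "SSID", "APISID", "SAPISID", "__Secure-3PAPISID")


def parse_cookie_string(raw: str) -> Dict[str, str]:
    """Parse 'name=value; name2=value2' into a dict, ignoring malformed parts."""
    cookies: Dict[str, str] = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        # Split on the first "=" only, since values may contain "=" themselves.
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


def missing_cookies(cookies: Dict[str, str]) -> List[str]:
    """Return the key cookie names that are absent from the parsed dict."""
    return [name for name in KEY_COOKIES if name not in cookies]
```

A check like this only confirms that the expected names are present. It cannot tell whether the values are still valid, so expired cookies still surface as download errors.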
Common Problems:
- "Video unavailable": Cookies may be expired or invalid
- "Age-restricted content": Need valid authentication cookies
- "Private video": Requires cookies from an account with access
- "Rate limited": Too many requests, try with authenticated cookies
Solutions:
- Refresh cookies: Get new cookies from your browser
- Check account access: Ensure your account can view the video
- Wait and retry: YouTube may temporarily block requests
- Use different account: Try with a different YouTube account
- Clone the repository:
git clone <repository-url>
cd yt-page-builder
- Run the automated setup:
python setup.py
- Set up development environment (optional but recommended):
python setup_dev.py
- Test the installation:
python test_installation.py

- Clone the repository:
git clone <repository-url>
cd yt-page-builder
- Install project dependencies:
# Install main project dependencies
pip install -r requirements.txt
# Install test dependencies (optional)
pip install -r requirements-test.txt
- Download spaCy language model (optional):
python -m spacy download en_core_web_sm

cd audio-to-json
# Download a single video
python audio_to_json.py --url "https://www.youtube.com/watch?v=VIDEO_ID"
# Download a playlist
python audio_to_json.py --url "https://www.youtube.com/playlist?list=PLAYLIST_ID"
# Download an entire channel
python audio_to_json.py --url "https://youtube.com/@channelname"
cd ../yt-page-builder
# Process all video folders in the input directory
python yt_page_builder.py
# Generate an index page for all videos
python create_index.py
yt-page-builder/
├── audio-to-json/ # YouTube download and transcription tool
│ ├── audio_to_json.py # Main transcription script
│ ├── requirements.txt # Component-specific dependencies
│ └── README.md # Detailed documentation
├── yt-page-builder/ # HTML page generation tool
│ ├── yt_page_builder.py # Main page builder script
│ ├── create_index.py # Index page generator
│ ├── requirements.txt # Component-specific dependencies
│ ├── input/ # Video folders (created by audio-to-json)
│ ├── output/ # Generated HTML pages
│ ├── logs/ # Processing logs
│ └── README.md # Detailed documentation
├── requirements.txt # Main project dependencies
├── setup.py # Automated setup script
├── example.py # Complete workflow example
├── test_installation.py # Installation verification
├── config.py # Configuration settings
├── config_template.py # Template configuration file
├── config_julien_simon.py # Julien Simon's specific configuration
├── setup_cookies.py # Cookie setup helper script
├── setup_dev.py # Development environment setup
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── pyproject.toml # Black and isort configuration
├── requirements-dev.txt # Development dependencies
├── update_badges.py # Badge updater script
├── run_tests.py # Test runner script
├── requirements-test.txt # Testing dependencies
├── pytest.ini # pytest configuration
├── tests/ # Test suite
│ ├── __init__.py # Tests package
│ ├── test_yt_page_builder.py # YouTube Page Builder tests
│ ├── test_create_index.py # Index creation tests
│ └── test_utilities.py # Utility script tests
├── .github/workflows/ # GitHub Actions
│ └── tests.yml # Automated testing workflow
├── LICENSE # MIT License
├── .gitignore # Git ignore rules
└── README.md # This file
The YouTube Page Builder expects video folders in the input directory with this structure:
input/
├── 20250103_Video_Title_Here/
│ ├── Video_Title_Here.info.json
│ ├── Video_Title_Here.description
│ ├── Video_Title_Here.webp
│ └── Video_Title_Here_transcription.json
├── 20250110_Another_Video_Title/
│ ├── Another_Video_Title.info.json
│ ├── Another_Video_Title.description
│ ├── Another_Video_Title.webp
│ └── Another_Video_Title_transcription.json
└── ...
Folders should follow the pattern: YYYYMMDD_Video_Title_Here
- YYYYMMDD: Publication date in YYYYMMDD format
- Video_Title_Here: Video title (underscores and hyphens are converted to spaces)
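The folder naming rule can be illustrated with a small parser. This is a sketch under the stated pattern only: `parse_folder_name` is a hypothetical name, and the real builder may handle edge cases differently.

```python
import re
from datetime import date, datetime
from typing import Tuple

# Eight digits, an underscore, then the title part.
FOLDER_PATTERN = re.compile(r"^(\d{8})_(.+)$")


def parse_folder_name(name: str) -> Tuple[date, str]:
    """Split 'YYYYMMDD_Video_Title_Here' into a date and a readable title."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Folder name does not match YYYYMMDD_Title: {name!r}")
    # strptime raises ValueError for impossible dates such as 20251340.
    published = datetime.strptime(match.group(1), "%Y%m%d").date()
    # Underscores and hyphens become spaces, as described above.
    title = re.sub(r"[_-]+", " ", match.group(2)).strip()
    return published, title
```

For example, `parse_folder_name("20250103_Video_Title_Here")` yields the date 2025-01-03 and the title "Video Title Here".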
For each video, the tool downloads:
- Audio file: .opus format (high quality)
- Thumbnail: .webp format
- Description: .description text file
- Metadata: .info.json with full video information
- Transcription: _transcription.json with transcription text
The YouTube Page Builder generates HTML files with:
- Video title (parsed from folder name)
- Publication date (formatted from folder name)
- Embedded YouTube video (responsive iframe)
- Description section (from .description file with automatic hyperlink and timestamp conversion)
- Transcript section (from *_transcription.json file with semantic paragraph organization and filler word removal)
- AI-generated tags (automatically extracted using spaCy NLP)
- Custom links (optional): Configurable links section that can be customized in config.py
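The hyperlink and timestamp conversion applied to the description could work roughly as follows. This is an illustrative sketch, not the builder's implementation: `linkify_description` and `timestamp_to_seconds` are hypothetical names, and linking timestamps to `watch?v=<id>&t=<seconds>s` is an assumption about how timestamp links are built.

```python
import html
import re

# One pass over the text: a URL, or a timestamp such as 1:23 or 1:02:03.
TOKEN = re.compile(
    r'(?P<url>https?://[^\s<>"]+)|(?P<ts>\b(?:\d{1,2}:)?\d{1,2}:\d{2}\b)'
)


def timestamp_to_seconds(stamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def linkify_description(text: str, video_id: str) -> str:
    """Escape the description and wrap URLs and timestamps in <a> tags."""
    pieces = []
    last = 0
    for match in TOKEN.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        if match.group("url"):
            # Trailing punctuation is usually sentence text, not part of the URL.
            token = match.group("url").rstrip(".,;:!?)")
            href = token
        else:
            token = match.group("ts")
            seconds = timestamp_to_seconds(token)
            href = f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"
        pieces.append(f'<a href="{html.escape(href)}">{html.escape(token)}</a>')
        last = match.start() + len(token)
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)
```

Using a single combined pattern means a timestamp-like fragment inside a URL is consumed as part of the URL and is not linked twice.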
The create_index.py script generates an index.html file that provides:
- Complete video list: All videos sorted by date (newest first)
- Clickable links: Direct links to each video page
- Video statistics: Total count and latest video date
- Responsive design: Works on all devices
- Navigation: Easy browsing of the entire collection
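The ordering and statistics of the index page can be sketched from folder names alone. The helpers below are hypothetical and not taken from `create_index.py`. In particular, the `href` value assumes each page is written as `<folder name>.html`, which the source does not confirm.

```python
from typing import Dict, List, Optional


def build_index_entries(folder_names: List[str]) -> List[Dict[str, str]]:
    """Turn 'YYYYMMDD_Title' folder names into entries sorted newest first."""
    entries = []
    for name in folder_names:
        prefix, _, rest = name.partition("_")
        if len(prefix) != 8 or not prefix.isdigit() or not rest:
            continue  # Skip anything that does not follow the naming pattern.
        entries.append(
            {
                "date": f"{prefix[:4]}-{prefix[4:6]}-{prefix[6:]}",
                "title": rest.replace("_", " ").replace("-", " "),
                "href": f"{name}.html",  # Assumed output file name.
            }
        )
    # ISO-style dates sort correctly as plain strings.
    entries.sort(key=lambda entry: entry["date"], reverse=True)
    return entries


def index_stats(entries: List[Dict[str, str]]) -> Dict[str, Optional[object]]:
    """Return the total count and the date of the latest video."""
    latest = entries[0]["date"] if entries else None
    return {"total": len(entries), "latest": latest}
```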
To add custom links to your generated pages, edit the config.py file:
# In config.py, modify the links section:
"links": {
"website": "https://your-website.com",
"youtube_channel": "https://youtube.com/@your-channel",
"github": "https://github.com/your-username",
"twitter": "https://twitter.com/your-handle",
}
The links will appear at the bottom of each video page and the index page. To disable links entirely, set the links dictionary to empty: "links": {}
Note: See config_template.py for a complete example configuration file.
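To make the effect of the links dictionary concrete, here is a minimal sketch of how it could be rendered into HTML, including the empty-dictionary case. `render_links` is a hypothetical helper, and the markup and the `links` CSS class are illustrative assumptions rather than the builder's actual output.

```python
import html
from typing import Dict


def render_links(links: Dict[str, str]) -> str:
    """Render the configured links as an HTML list, or nothing if empty."""
    if not links:
        # An empty dictionary disables the links section entirely.
        return ""
    items = []
    for key, url in links.items():
        # "youtube_channel" becomes "Youtube Channel" for display.
        label = key.replace("_", " ").title()
        items.append(
            f'<li><a href="{html.escape(url)}">{html.escape(label)}</a></li>'
        )
    return '<ul class="links">' + "".join(items) + "</ul>"
```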
The badges in this README are set to placeholder GitHub URLs. To update them for your repository:
python update_badges.py <your-github-username> <your-repository-name>
Example:
python update_badges.py myusername my-yt-builder
This will update all the GitHub badges to point to your repository.
# Specify output directory
python audio_to_json.py --url "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir "/path/to/output"
# Convert local audio files
python audio_to_json.py --file "audio_file.opus"
python audio_to_json.py --directory "/path/to/audio/files"
# Specify custom input/output directories
python yt_page_builder.py --input /path/to/input --output /path/to/output
# Process only first 5 folders (for testing)
python yt_page_builder.py --limit 5
# Short form options
python yt_page_builder.py -i input -o output -l 3
Run the complete example to see the full workflow:
python example.py
This will:
- Download a sample YouTube video
- Generate HTML pages
- Create an index page
- Show you where to find the results
See the tool in action with a live demo: YouTube Videos Collection - 2025
This demonstrates:
- 35 videos from 2025 processed and organized
- Beautiful responsive design
- AI-generated tags for each video
- Clean transcriptions and descriptions
- Professional styling and navigation
python run_tests.py quick
python run_tests.py all
python run_tests.py test_utilities
python run_tests.py test_create_index
python run_tests.py test_yt_page_builder
# Install test dependencies
pip install -r requirements-test.txt
# Run all tests with coverage
pytest tests/ -v --cov=yt-page-builder --cov=audio-to-json
# Run quick tests only
pytest tests/ -m quick
# Run tests in parallel (requires the pytest-xdist plugin)
pytest tests/ -n auto
- Unit Tests: Core functionality testing
- Integration Tests: End-to-end workflow testing
- Mock Tests: External API testing without real calls
- Configuration Tests: Settings validation
Note: Some tests require external dependencies (spaCy, API keys) and may fail in certain environments. The quick tests focus on core functionality that doesn't require external services.
To test the tools with a small subset of videos:
# Test audio-to-json with a single video
python audio_to_json.py --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Test page builder with limited folders
python yt_page_builder.py --limit 3
Both tools include robust error handling:
- Missing files: Gracefully handles missing files or corrupted data
- Invalid JSON: Skips folders with corrupted JSON data
- File permissions: Reports write errors for output files
- Network issues: Handles YouTube rate limiting and geo-restrictions
- Transcript processing: Manages long transcripts by chunking them
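The "missing files" and "invalid JSON" cases above amount to loading defensively and skipping a folder without aborting the whole batch. A minimal sketch of that pattern follows. `load_transcription` and the logger name are hypothetical; only the `*_transcription.json` file suffix comes from the input structure described in this README.

```python
import json
import logging
from pathlib import Path
from typing import Any, Optional

logger = logging.getLogger("yt_page_builder_sketch")  # Hypothetical logger name.


def load_transcription(folder: Path) -> Optional[Any]:
    """Load the folder's *_transcription.json, or return None if unusable."""
    matches = sorted(folder.glob("*_transcription.json"))
    if not matches:
        logger.warning("No transcription file in %s; skipping", folder)
        return None
    try:
        return json.loads(matches[0].read_text(encoding="utf-8"))
    except (ValueError, OSError) as exc:
        # ValueError covers invalid JSON and undecodable bytes;
        # OSError covers permission and other read failures.
        logger.error("Could not read %s: %s", matches[0], exc)
        return None
```

A caller can then treat None as "skip this folder" and continue with the next one.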
- Concurrent processing: Uses ThreadPoolExecutor for parallel video processing
- Memory management: Processes large transcripts in chunks
- Progress tracking: Shows progress bars and detailed logging
- GPU acceleration: Automatically uses an available GPU (CUDA, or MPS on macOS) and falls back to CPU otherwise
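Chunking and thread-based parallelism, as mentioned above, can be combined along these lines. This is a sketch with hypothetical helper names. The 500-word default chunk size and the four workers are arbitrary illustrative values, not the project's settings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunk_words(text: str, max_words: int = 500) -> List[str]:
    """Split a long transcript into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def process_in_parallel(
    items: Iterable[T], worker: Callable[[T], R], max_workers: int = 4
) -> List[R]:
    """Apply worker to every item using a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the items were submitted.
        return list(pool.map(worker, items))
```

Threads suit work that spends most of its time waiting on files or the network. CPU-bound steps gain less from them because of Python's global interpreter lock.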
- Missing spaCy model: Run python -m spacy download en_core_web_sm
- Large transcript errors: The tool automatically chunks long transcripts
- YouTube download failures: Check network connection and video availability
- Memory issues: Reduce the --limit parameter for testing
Check the following log files for detailed information:
- yt-page-builder/logs/error.log - Processing logs and errors
- yt-page-builder/output.log - Output processing logs
- Fork the repository
- Clone your fork:
git clone <your-fork-url>
cd yt-page-builder
- Set up development environment:
python setup_dev.py
This will install pre-commit hooks, black, isort, and other development tools.
- Create a feature branch:
git checkout -b feature/your-feature-name
- Make your changes - Pre-commit hooks will automatically format your code
- Run tests:
python run_tests.py all
- Commit your changes - Pre-commit hooks will run automatically
- Submit a pull request
This project uses:
- Black for code formatting (line length: 88)
- isort for import sorting (compatible with Black)
- Pre-commit hooks that run automatically on every commit
The pre-commit hooks will automatically modify files to ensure consistent formatting.
If you need to format files manually:
# Format all Python files
black .
# Sort imports
isort .
# Run all pre-commit hooks
pre-commit run --all-files
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the logs in the logs/ directory
- Review the detailed documentation in each component's README
- Test with a small subset of videos using the --limit option