[OSSCA] What I Learned from My Month-Long OSSCA Journey


킁킁잉 2025. 5. 27. 06:29

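The OSSCA completion date is already here. Over roughly a month, which felt both long and short, I learned and experienced a great deal. In this post I want to summarize what I did during that month and how I felt about it.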

Contributing to Pyvideo

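The first activity was collecting pycon-kr video data and contributing it to pyvideo. My teammates and I split up the pycon-kr videos, collected them, and each opened a PR. I additionally collected the Scipy-kr 2024 video data and opened a PR for it. Looking at it now, only the Scipy-kr PR has been merged while the Pycon-kr PRs are still open, so volunteering for the Scipy-kr collection turned out to be a really good decision. It was only a contribution of simple video data JSON files, but it gave me the experience of contributing to a global repository.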

https://github.com/pyvideo/data/pull/1216

Add Scipy Korea 2024 videos by Monixc · Pull Request #1216 · pyvideo/data
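For readers who have not seen this kind of contribution, here is a minimal sketch of turning one collected video record into a pyvideo-style JSON entry. The output field names (`recorded`, `speakers`, `thumbnail_url`, `videos`, and so on) reflect my understanding of the pyvideo/data layout and are an assumption, so check them against the repository's contributing guide before opening a PR. `to_pyvideo_entry` is a hypothetical helper, not part of pyvideo, and the sample record is made up.

```python
import json
from datetime import datetime


def to_pyvideo_entry(info, language="kor"):
    """Map a yt-dlp style info dict to a pyvideo-like record (assumed schema)."""
    upload_date = info.get("upload_date")  # yt-dlp reports this as 'YYYYMMDD'
    recorded = None
    if upload_date:
        # The upload date is only a stand-in for the real recording date.
        recorded = datetime.strptime(upload_date, "%Y%m%d").strftime("%Y-%m-%d")
    return {
        "description": info.get("description") or "",
        "duration": info.get("duration"),
        "language": language,
        "recorded": recorded,
        "speakers": [],  # has to be filled in by hand
        "thumbnail_url": info.get("thumbnail"),
        "title": info.get("title"),
        "videos": [{"type": "youtube", "url": info.get("webpage_url")}],
    }


# Made-up sample record; the URL is a placeholder, not a real video.
sample = {
    "title": "Example talk",
    "upload_date": "20241026",
    "duration": 1800,
    "webpage_url": "https://www.youtube.com/watch?v=XXXXXXXXXXX",
}
entry = to_pyvideo_entry(sample)
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

Speaker names and the actual recording date usually cannot be derived from YouTube metadata alone, so in practice those fields need a manual pass.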


Collecting Pycon Video Data

My contributions to a large repository like pyvideo ended with the pycon-kr 2024 and scipy-kr 2024 data. The activities that followed took place in our team organization.

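First, we collected as much PyCon video data as we could. Normally each person collects and uploads the Python conference information for a single country, but I got ambitious along the way, checked every unregistered video playlist, and filed them as Issues.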

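Then we moved on to collecting the video data. I tried to collect every PyCon held between 2017 and 2024 for each country, and since each year typically had at least 20 videos, it took longer than I expected.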

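Pasting URLs into the script, as we had been doing, was inefficient because there were too many playlists to collect. In many cases a year's videos were not organized into one playlist but were split by session category, by track, or by date, so that needed handling too.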

So I built a simple console menu system that lets me register playlist information in one go and then collect it.

import yt_dlp
import pandas as pd
import json
import time
import os

def get_playlist_videos(playlist_url):
    """
    Uses yt-dlp to get list of video URLs from a playlist

    Args:
        playlist_url (str): URL of the YouTube playlist

    Returns:
        tuple: (playlist title, list of video URLs)
    """
    try:
        # Configure yt-dlp options for playlist
        ydl_opts = {
            'quiet': True,
            'no_warnings': True,
            'extract_flat': True,  # Extract playlist entries only, without resolving each video
            'ignoreerrors': True,
        }

        # Extract playlist info using yt-dlp
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            playlist_info = ydl.extract_info(playlist_url, download=False)
            
            if not playlist_info:
                print("Failed to extract playlist information")
                return None, []

            playlist_title = playlist_info.get('title', 'Unknown Playlist')
            video_urls = []

            # Extract video URLs from playlist entries
            for entry in playlist_info.get('entries', []):
                if entry:
                    video_url = entry.get('url')
                    if video_url:
                        video_urls.append(video_url)

            print(f"Playlist Title: {playlist_title}")
            print(f"Number of videos in playlist: {len(video_urls)}")

            return playlist_title, video_urls

    except Exception as e:
        print(f"Error fetching playlist: {str(e)}")
        return None, []

def get_video_info(video_url, index):
    """
    Uses yt-dlp to extract detailed information about a video

    Args:
        video_url (str): URL of the YouTube video
        index (int): Position of the video in the playlist

    Returns:
        dict: Dictionary containing video information
    """
    try:
        print(f"🐍Fetching info for video {index + 1}: {video_url}")

        # Configure yt-dlp options
        ydl_opts = {
            'quiet': True,  # Suppress console output
            'no_warnings': True,  # Suppress warnings
            'skip_download': True,  # Don't download the video
            'extract_flat': False,  # We want full info extraction
            'ignoreerrors': True,  # Skip videos that cause errors
        }

        # Extract info using yt-dlp
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            video_info = ydl.extract_info(video_url, download=False)
            
            if not video_info:
                print(f"Failed to extract info for video {index + 1}: No data returned")
                return None

            # Create a dictionary with relevant information
            info = {
                'index': index + 1,
                'title': video_info.get('title'),
                'url': video_url,
                'video_id': video_info.get('id'),
                'uploader': video_info.get('uploader'),
                'uploader_id': video_info.get('uploader_id'),
                'uploader_url': video_info.get('uploader_url'),
                'upload_date': video_info.get('upload_date'),
                'duration': video_info.get('duration'),
                'view_count': video_info.get('view_count'),
                'like_count': video_info.get('like_count'),
                'comment_count': video_info.get('comment_count'),
                'tags': video_info.get('tags'),
                'categories': video_info.get('categories'),
                'description': video_info.get('description'),
                'thumbnail': video_info.get('thumbnail'),
                'age_limit': video_info.get('age_limit'),
                'is_live': video_info.get('is_live'),
                'was_live': video_info.get('was_live'),
                'availability': video_info.get('availability'),
                'webpage_url': video_info.get('webpage_url'),
                'original_url': video_info.get('original_url'),
            }

            return info

    except Exception as e:
        print(f"Error extracting info for {video_url}: {str(e)}")
        return None

def collect_playlist_info(playlist_url):
    """
    Collects information from all videos in a YouTube playlist

    Args:
        playlist_url (str): URL of the YouTube playlist

    Returns:
        tuple: (playlist title, list of video information dictionaries)
    """
    # Get playlist title and video URLs using yt-dlp
    playlist_title, video_urls = get_playlist_videos(playlist_url)

    if not video_urls:
        return playlist_title, []

    # List to store video information
    video_data = []

    # Process each video in the playlist
    for index, video_url in enumerate(video_urls):
        # Get video info using yt-dlp
        video_info = get_video_info(video_url, index)
        if video_info is not None:  # Append only when extraction succeeded
            video_data.append(video_info)

        # Add a small delay to avoid rate limiting
        time.sleep(0.2)

    return playlist_title, video_data

def save_data(playlist_title, video_data, output_dir, playlist_url):
    """
    Saves the collected video information to CSV and JSON files

    Args:
        playlist_title (str): Title of the playlist
        video_data (list): List of dictionaries containing video information
        output_dir (str): Directory to save the files in
        playlist_url (str): URL of the playlist
    """
    if not video_data:
        print("No data to save")
        return

    try:
        # Use existing pycon-playlist directory from parent directory
        base_dir = os.path.join("..", "pycon-playlist")
        if not os.path.exists(base_dir):
            print(f"Error: {base_dir} directory not found")
            return
            
        # Create output directory under base directory
        full_output_dir = os.path.join(base_dir, output_dir.lower())  # Lowercase the directory name
        if not os.path.exists(full_output_dir):
            os.makedirs(full_output_dir)
            print(f"Created directory: {full_output_dir}")

        # Create sanitized filename from playlist title (lowercased)
        safe_filename = "".join([c if c.isalnum() or c in ['-', '_'] else '_' for c in playlist_title.lower()])
        safe_filename = safe_filename[:50]  # Limit length

        # Save as CSV
        csv_filename = os.path.join(full_output_dir, f"{safe_filename}_data.csv")
        df = pd.DataFrame(video_data)

        # Handle potential list fields for CSV export
        for col in df.columns:
            if df[col].apply(lambda x: isinstance(x, list)).any():
                df[col] = df[col].apply(lambda x: ','.join(str(item) for item in x) if isinstance(x, list) else x)

        df.to_csv(csv_filename, index=False, encoding='utf-8-sig')
        print(f"Data saved to {csv_filename}")

        # Save as JSON
        json_filename = os.path.join(full_output_dir, f"{safe_filename}_data.json")
        with open(json_filename, "w", encoding="utf-8") as f:
            json.dump(video_data, f, ensure_ascii=False, indent=4, default=str)
        print(f"Data saved to {json_filename}")

        # Update README.md
        readme_path = os.path.join(full_output_dir, "README.md")
        readme_content = f"{playlist_title}: {playlist_url}\n"
        
        if os.path.exists(readme_path):
            with open(readme_path, "a", encoding="utf-8") as f:
                f.write(readme_content)
        else:
            with open(readme_path, "w", encoding="utf-8") as f:
                f.write("# Playlist URLs\n\n")
                f.write(readme_content)

    except Exception as e:
        print(f"Error saving data: {str(e)}")

def main():
    print("YouTube Playlist Scraping Tool")
    print("=" * 50)
    print("Usage:")
    print("1. Use 'add' command to add playlist URL")
    print("2. Use 'path' command to set save directory")
    print("3. Use 'list' command to check added playlists")
    print("4. Use 'start' command to begin scraping")
    print("5. Use 'merge' command to merge multiple playlists")
    print("6. Use 'exit' command to quit")
    print("=" * 50)

    playlist_urls = []  # All registered playlist URLs
    output_dir = None  # No save path until the user sets one
    path_changes = []  # History of path changes as (path, end_index)
    is_merge_mode = False  # Whether merge mode is active
    merge_urls = []  # URLs to merge

    while True:
        command = input("\nEnter command (add/path/list/start/merge/exit): ").strip().lower()

        if command == "exit":
            print("Exiting program...")
            break

        elif command == "add":
            if output_dir is None:
                print("Please set path first using 'path' command")
                continue
                
            url = input("Enter playlist URL: ").strip()
            if url:
                playlist_urls.append(url)
                print(f"Playlist added. (Total: {len(playlist_urls)})")

        elif command == "path":
            new_path = input("Enter save directory: ").strip()
            if new_path:
                if output_dir is not None:
                    # Record the end index of the previous path
                    path_changes.append((output_dir, len(playlist_urls)))
                    print(f"Added path change: {output_dir} at index {len(playlist_urls)}")
                output_dir = new_path
                print(f"Save directory set to: 'pycon-playlist/{output_dir.lower()}'")

        elif command == "list":
            if not playlist_urls:
                print("No playlists added yet.")
            else:
                print("\nAdded playlists:")
                for i, url in enumerate(playlist_urls, 1):
                    print(f"{i}. {url}")

        elif command == "merge":
            if output_dir is None:
                print("Please set path first using 'path' command")
                continue

            is_merge_mode = True
            merge_urls = []
            print("\nEnter playlist URLs to merge (enter 'done' to finish):")
            
            while True:
                url = input("Enter playlist URL (or 'done' to finish): ").strip()
                if url.lower() == 'done':
                    break
                if not url:
                    continue  # Ignore empty input, as the 'add' command does
                merge_urls.append(url)
                print(f"Added to merge list. (Total: {len(merge_urls)})")
            
            print(f"\nAdded {len(merge_urls)} playlists to merge list.")
            print("Use 'start' command to begin merging.")

        elif command == "start":
            if output_dir is None:
                print("Please set path first using 'path' command")
                continue

            if is_merge_mode:
                if not merge_urls:
                    print("No playlists added to merge list.")
                    continue

                merged_data = []
                first_playlist_title = None

                for url in merge_urls:
                    print(f"\n{'='*50}")
                    print(f"Processing playlist: {url}")
                    playlist_title, video_data = collect_playlist_info(url)
                    
                    if video_data:
                        if first_playlist_title is None:
                            first_playlist_title = playlist_title
                        
                        # Renumber indexes so they continue across playlists
                        start_idx = len(merged_data)
                        for video in video_data:
                            video['index'] = start_idx + 1
                            start_idx += 1
                            merged_data.append(video)
                        
                        print(f"Added {len(video_data)} videos to merged data")
                    else:
                        print("Failed to collect playlist information")
                    print(f"{'='*50}\n")

                if merged_data:
                    print(f"\nSaving merged data with {len(merged_data)} videos...")
                    save_data(first_playlist_title, merged_data, output_dir, "Merged Playlist")
                    print("Merge completed successfully!")
                else:
                    print("No data to save")

                # Reset merge mode
                is_merge_mode = False
                merge_urls = []

            else:
                if not playlist_urls:
                    print("No playlists added. Use 'add' command to add playlists first.")
                    continue

                # Record the end index of the last path
                path_changes.append((output_dir, len(playlist_urls)))
                print(f"Added final path change: {output_dir} at index {len(playlist_urls)}")

                print(f"\nPath changes: {path_changes}")
                print(f"Total playlists: {len(playlist_urls)}")

                # Save data according to the recorded path changes
                for i, (path, end_idx) in enumerate(path_changes):
                    start_idx = path_changes[i-1][1] if i > 0 else 0
                    current_playlists = playlist_urls[start_idx:end_idx]
                    print(f"\nProcessing path {i+1}/{len(path_changes)}: {path}")
                    print(f"Playlists {start_idx} to {end_idx}: {current_playlists}")
                    if current_playlists:
                        print(f"\nProcessing playlists for path: pycon-playlist/{path.lower()}")
                        for playlist_url in current_playlists:
                            print(f"\n{'='*50}")
                            print(f"Starting to collect information from playlist: {playlist_url}")
                            playlist_title, video_data = collect_playlist_info(playlist_url)
                            if video_data:
                                print(f"Successfully collected information for {len(video_data)} videos")
                                save_data(playlist_title, video_data, path, playlist_url)
                            else:
                                print("Failed to collect playlist information")
                            print(f"{'='*50}\n")

                print("All playlists have been processed.")
                playlist_urls = []  # Clear the list after processing
                path_changes = []  # Clear path changes
                output_dir = None  # Reset output directory

        else:
            print("Invalid command. Please try again.")

if __name__ == "__main__":
    main()

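When the script runs, a menu is shown in the console. First, enter path to set the directory where the collected video data files will be saved. Since everything is created under the pycon-playlist folder by default, you only need to enter the name of the new subfolder (e.g. pycon-br, pycon-kr).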

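Once the path is set, use add to enter the URL of a playlist to collect. It is slightly inconvenient that you have to type add for every single URL. Until a new path is entered, the video information for playlists entered via add is saved under the same path. Once you enter a new path, subsequent adds are saved under the new one. This lets me collect PyCons from several countries in a single run.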
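The path/add bookkeeping described above can be sketched as a small pure function. This is a simplified illustration of the `path_changes` idea from the script, not a drop-in replacement. `group_by_path` is a hypothetical name, and the strings in `urls` are placeholders rather than real playlist URLs.

```python
def group_by_path(playlist_urls, path_changes):
    """Split playlist_urls into (path, urls) groups.

    path_changes is a list of (path, end_index) pairs, where end_index is the
    number of URLs that had been added when that path stopped being active.
    """
    groups = []
    start = 0
    for path, end in path_changes:
        # Each path owns the URLs added between the previous end and its own.
        groups.append((path.lower(), playlist_urls[start:end]))
        start = end
    return groups


urls = ["kr-2023", "kr-2024", "br-2024"]  # placeholders, not real URLs
changes = [("pycon-kr", 2), ("pycon-br", 3)]
groups = group_by_path(urls, changes)
print(groups)
```

Storing only the end index per path keeps the interactive loop simple: each path command just records how many URLs exist at that moment, and the slicing is deferred until start.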

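For a single year whose videos are split across several playlists, I handle it with the merge command. After entering merge, you enter the URLs of the split playlists, and the videos from each are appended after those of the first URL entered.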
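As a rough illustration of the merge step, the sketch below concatenates several per-playlist video lists and renumbers the `index` field so it runs continuously. `merge_playlists` is a hypothetical helper and the sample records are made up. Unlike the script, which updates the dictionaries in place, this version copies each record.

```python
def merge_playlists(playlists):
    """Concatenate per-playlist video lists and renumber 'index' from 1."""
    merged = []
    for videos in playlists:
        for video in videos:
            renumbered = dict(video)  # copy so the input is left untouched
            renumbered["index"] = len(merged) + 1
            merged.append(renumbered)
    return merged


# Made-up records standing in for two playlists of the same conference year.
day1 = [{"index": 1, "title": "Keynote"}, {"index": 2, "title": "Talk A"}]
day2 = [{"index": 1, "title": "Talk B"}]
merged = merge_playlists([day1, day2])
print([(v["index"], v["title"]) for v in merged])
```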

I used this code to collect video data for a really large number of countries. Other teammates also saw the issues I had filed and collected more, which added a bit of pride.

Reflections

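The best part was getting to contribute to a global open-source repository. It was simple video information, but opening a PR to pyvideo and having it merged felt very rewarding. Having one merged felt so good and motivated me so much that I think I was able to work harder in the team organization activities that followed.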

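What I found disappointing was that, after the single pyvideo contribution, every subsequent activity took place within the team organization. The activities went in a rather different direction from the image people usually have of open-source contribution, which was a little disappointing. Still, I learned how to deploy a static page on GitHub and was introduced to a new style of data analysis using NotebookLM, so I think I learned a lot and broadened my view of how AI can be used.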

The last activity was an offline meetup for lunch, but I was in Changwon all weekend for a family event and couldn't attend. I think that is my biggest regret 🥲

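It was a short, month-long program, but I got to experience and learn a lot. I think it got me more interested in open source. This round felt close to a beginner course: learning how to use GitHub through simple video data contributions. Now that it has given me an interest in open source and the courage to open PRs against large repositories, I want to take on more serious open-source contributions. I want to learn more about how to get started with code contributions, so I would like to apply again when the next OSSCA recruitment opens.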