Why does the YouTube transcript API work in my non-production environment but not in production?
In my non-production environment, I can use the YouTube transcript API (youtube_transcript_api) to obtain a transcript.
In my production environment I cannot, even after much debugging and logging. Here are the logs:
2024-08-20T07:41:29.989747260Z [ANONYMIZED_IP] - - [20/Aug/2024:07:41:29 +0000] "GET /generate/youtubeSummary/ HTTP/1.1" 200 30723 "https://[ANONYMIZED_DOMAIN]/dashboard/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0"
2024-08-20T07:41:45.777775814Z Custom form options: {}
2024-08-20T07:41:45.778110014Z Form data debug: {'grade_level': '', 'video_url': 'https://www.youtube.com/watch?v=[ANONYMIZED_VIDEO_ID]', 'summary_length': None}
2024-08-20T07:41:45.778131714Z INFO 2024-08-20 07:41:45,777 views Generating summary for video URL: https://www.youtube.com/watch?v=[ANONYMIZED_VIDEO_ID]
2024-08-20T07:41:45.781101714Z INFO 2024-08-20 07:41:45,780 views YouTube IP address: [ANONYMIZED_IP]
2024-08-20T07:41:45.979793308Z INFO 2024-08-20 07:41:45,979 views YouTube connection status: 200
2024-08-20T07:41:45.980433708Z INFO 2024-08-20 07:41:45,980 views Attempting to connect to: www.youtube.com
2024-08-20T07:41:45.980820208Z INFO 2024-08-20 07:41:45,980 views Extracted video ID: [ANONYMIZED_VIDEO_ID]
2024-08-20T07:41:46.463787194Z INFO 2024-08-20 07:41:46,463 _universal Request URL: 'https://[ANONYMIZED_DOMAIN]/v2.1/track'
2024-08-20T07:41:46.463807494Z Request method: 'POST'
2024-08-20T07:41:46.463815194Z Request headers:
2024-08-20T07:41:46.463822194Z 'Content-Type': 'application/json'
2024-08-20T07:41:46.463830194Z 'Content-Length': '2373'
2024-08-20T07:41:46.463840094Z 'Accept': 'application/json'
2024-08-20T07:41:46.463847194Z 'x-ms-client-request-id': '[ANONYMIZED_REQUEST_ID]'
2024-08-20T07:41:46.463853694Z 'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.9.19 (Linux-5.15.158.2-1.cm2-x86_64-with-glibc2.28)'
2024-08-20T07:41:46.463863894Z A body is sent with the request
2024-08-20T07:41:46.485539393Z INFO 2024-08-20 07:41:46,485 _universal Response status: 200
2024-08-20T07:41:46.485558093Z Response headers:
2024-08-20T07:41:46.485565793Z 'Transfer-Encoding': 'chunked'
2024-08-20T07:41:46.485600593Z 'Content-Type': 'application/json; charset=utf-8'
2024-08-20T07:41:46.485609693Z 'Server': 'Microsoft-HTTPAPI/2.0'
2024-08-20T07:41:46.485616293Z 'Strict-Transport-Security': 'REDACTED'
2024-08-20T07:41:46.485622793Z 'X-Content-Type-Options': 'REDACTED'
2024-08-20T07:41:46.485629293Z 'Date': 'Tue, 20 Aug 2024 07:41:45 GMT'
2024-08-20T07:41:46.486316593Z INFO 2024-08-20 07:41:46,485 _base Transmission succeeded: Item received: 2. Items accepted: 2
2024-08-20T07:41:46.515040992Z ERROR 2024-08-20 07:41:46,513 views Error generating YouTube summary:
2024-08-20T07:41:46.515060292Z Could not retrieve a transcript for the video https://www.youtube.com/watch?v=[ANONYMIZED_VIDEO_ID]! This is most likely caused by:
2024-08-20T07:41:46.515068192Z
2024-08-20T07:41:46.515203692Z Subtitles are disabled for this video
2024-08-20T07:41:46.515214292Z
2024-08-20T07:41:46.515348292Z If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
2024-08-20T07:41:46.515375792Z Traceback (most recent call last):
2024-08-20T07:41:46.515384092Z File "/tmp/[ANONYMIZED_PATH]/theDashboard/views.py", line 2002, in generate_youtube_summary
2024-08-20T07:41:46.515390492Z transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
2024-08-20T07:41:46.515396292Z File "/tmp/[ANONYMIZED_PATH]/antenv/lib/python3.9/site-packages/youtube_transcript_api/_api.py", line 71, in list_transcripts
2024-08-20T07:41:46.515401992Z return TranscriptListFetcher(http_client).fetch(video_id)
2024-08-20T07:41:46.515407392Z File "/tmp/[ANONYMIZED_PATH]/antenv/lib/python3.9/site-packages/youtube_transcript_api/_transcripts.py", line 48, in fetch
2024-08-20T07:41:46.515413192Z self._extract_captions_json(self._fetch_video_html(video_id), video_id),
2024-08-20T07:41:46.515418692Z File "/tmp/[ANONYMIZED_PATH]/antenv/lib/python3.9/site-packages/youtube_transcript_api/_transcripts.py", line 62, in _extract_captions_json
2024-08-20T07:41:46.515424292Z raise TranscriptsDisabled(video_id)
2024-08-20T07:41:46.515429592Z youtube_transcript_api._errors.TranscriptsDisabled:
2024-08-20T07:41:46.515434992Z Could not retrieve a transcript for the video https://www.youtube.com/watch?v=[ANONYMIZED_VIDEO_ID]! This is most likely caused by:
2024-08-20T07:41:46.515440592Z
2024-08-20T07:41:46.515446092Z Subtitles are disabled for this video
2024-08-20T07:41:46.515451692Z
2024-08-20T07:41:46.515457192Z If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
2024-08-20T07:41:46.517604192Z [ANONYMIZED_IP] - - [20/Aug/2024:07:41:46 +0000] "POST /generate/youtubeSummary/ HTTP/1.1" 200 789 "https://[ANONYMIZED_DOMAIN]/generate/youtubeSummary/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0"
2024-08-20T07:41:51.462730956Z INFO 2024-08-20 07:41:51,462 _universal Request URL: 'https://[ANONYMIZED_DOMAIN]/v2.1/track'
2024-08-20T07:41:51.462753956Z Request method: 'POST'
2024-08-20T07:41:51.462761756Z Request headers:
2024-08-20T07:41:51.462838355Z 'Content-Type': 'application/json'
2024-08-20T07:41:51.462846755Z 'Content-Length': '1124'
2024-08-20T07:41:51.462853355Z 'Accept': 'application/json'
2024-08-20T07:41:51.462870155Z 'x-ms-client-request-id': '[ANONYMIZED_REQUEST_ID]'
2024-08-20T07:41:51.462877055Z 'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.9.19 (Linux-5.15.158.2-1.cm2-x86_64-with-glibc2.28)'
2024-08-20T07:41:51.462883755Z A body is sent with the request
2024-08-20T07:41:51.467871320Z INFO 2024-08-20 07:41:51,467 _universal Request URL: 'https://[ANONYMIZED_DOMAIN]/v2.1/track'
2024-08-20T07:41:51.467899520Z Request method: 'POST'
2024-08-20T07:41:51.467909320Z Request headers:
2024-08-20T07:41:51.467952720Z 'Content-Type': 'application/json'
2024-08-20T07:41:51.467962219Z 'Content-Length': '2397'
2024-08-20T07:41:51.467968519Z 'Accept': 'application/json'
2024-08-20T07:41:51.467974719Z 'x-ms-client-request-id': '[ANONYMIZED_REQUEST_ID]'
2024-08-20T07:41:51.467981319Z 'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.9.19 (Linux-5.15.158.2-1.cm2-x86_64-with-glibc2.28)'
2024-08-20T07:41:51.467987919Z A body is sent with the request
2024-08-20T07:41:51.472131390Z INFO 2024-08-20 07:41:51,471 _universal Response status: 200
2024-08-20T07:41:51.472146690Z Response headers:
2024-08-20T07:41:51.472154290Z 'Transfer-Encoding': 'chunked'
2024-08-20T07:41:51.472160590Z 'Content-Type': 'application/json; charset=utf-8'
2024-08-20T07:41:51.472167190Z 'Server': 'Microsoft-HTTPAPI/2.0'
2024-08-20T07:41:51.472195890Z 'Strict-Transport-Security': 'REDACTED'
2024-08-20T07:41:51.472204590Z 'X-Content-Type-Options': 'REDACTED'
2024-08-20T07:41:51.472210890Z 'Date': 'Tue, 20 Aug 2024 07:41:50 GMT'
2024-08-20T07:41:51.472633987Z INFO 2024-08-20 07:41:51,472 _base Transmission succeeded: Item received: 2. Items accepted: 2
2024-08-20T07:41:51.479943736Z INFO 2024-08-20 07:41:51,479 _universal Response status: 200
2024-08-20T07:41:51.479965136Z Response headers:
2024-08-20T07:41:51.479973036Z 'Transfer-Encoding': 'chunked'
2024-08-20T07:41:51.479980236Z 'Content-Type': 'application/json; charset=utf-8'
2024-08-20T07:41:51.479987236Z 'Server': 'Microsoft-HTTPAPI/2.0'
2024-08-20T07:41:51.479995336Z 'Strict-Transport-Security': 'REDACTED'
2024-08-20T07:41:51.480004735Z 'X-Content-Type-Options': 'REDACTED'
2024-08-20T07:41:51.480012935Z 'Date': 'Tue, 20 Aug 2024 07:41:50 GMT'
2024-08-20T07:41:51.480649231Z INFO 2024-08-20 07:41:51,480 _base Transmission succeeded: Item received: 1. Items accepted: 1
2024-08-20T07:43:41 No new trace in the past 1 min(s).
2024-08-20T07:44:41 No new trace in the past 2 min(s).
I know that my code is fine as it works in non-production.
def generate_youtube_summary(video_url, custom_form_options=None):
    logger.info(f"Generating summary for video URL: {video_url}")
    connectivity_results = test_youtube_connectivity()
    if not (connectivity_results["dns_resolution"] and connectivity_results["connection_status"]):
        logger.error("YouTube connectivity check failed. Details: %s", connectivity_results)
        return "Unable to connect to YouTube. Please check your internet connection and try again."
    logger.info(f"Attempting to connect to: {urlparse(video_url).netloc}")
    video_id = None
    if 'youtu.be/' in video_url:
        video_id = video_url.split('youtu.be/')[1]
    elif 'youtube.com/watch?v=' in video_url:
        video_id = video_url.split('v=')[1]
    elif 'youtube.com/embed/' in video_url:
        video_id = video_url.split('embed/')[1]
    logger.info(f"Extracted video ID: {video_id}")
    if video_id:
        try:
            transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
            logger.info(f"Retrieved transcript list for video ID: {video_id}")
            transcript = transcript_list.find_transcript(['en'])
            logger.info("Found English transcript")
            transcript_data = transcript.fetch()
            logger.info("Fetched transcript data")
            transcript_text = ' '.join([entry['text'] for entry in transcript_data])
            logger.info(f"Extracted transcript text (first 100 chars): {transcript_text[:100]}...")
            summary_prompt = f"""
            <role>YouTube Video Summarizer</role> """
            logger.info("Sending summary prompt to process_text function")
            summary = process_text(summary_prompt)
            logger.info(f"Received summary from process_text (first 100 chars): {summary[:100]}...")
            return summary
        except Exception as e:
            logger.error(f"Error generating YouTube summary: {str(e)}", exc_info=True)
            if "TranscriptsDisabled" in str(e):
                return "Unable to generate summary. Subtitles are disabled for this video."
            elif "NoTranscriptFound" in str(e):
                return "No transcript found for this video. It may not have subtitles available."
            else:
                return f"Failed to generate video summary. Error: {str(e)}"
    else:
        logger.warning(f"Invalid YouTube video URL: {video_url}")
        return "Invalid YouTube video URL. Please provide a valid URL."
Judging by the logs (the YouTube connection status is 200) and the fact that my OpenAI API calls work from the same environment, I don't think this is a general networking issue.
Besides a fix, I'm curious why this happens only in production.
What I have tried so far:
Debugging and logging
Checking the networking settings
Note that the error says subtitles are disabled for this video, but I can confirm they are not. This seems to be a blanket error message that is raised for any video.
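This is very likely a network issue specific to YouTube, as described in this issue on GitHub and in the related question "Error Fetching YouTube Transcript Using YouTubeTranscriptApi on Server but Works Locally". Your traceback shows TranscriptsDisabled being raised in _extract_captions_json, that is, while parsing the HTML of the watch page. So the error only tells you that the page served to your production machine contained no caption data; it does not prove that subtitles are actually disabled, and a 200 status from www.youtube.com does not rule this out. YouTube is widely reported to respond differently to requests coming from cloud-provider IP addresses, which would explain why the same code works in one environment and not in the other. Try routing the requests through a proxy or assigning a new IP address to the production machine, and make sure the library versions match between the two environments. The library relies on web scraping rather than an official API, so it can be fragile.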
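The two suggestions above (compare library versions, route requests through a proxy) can be sketched roughly as follows. This is only a sketch under assumptions: it assumes the installed release of youtube_transcript_api still accepts a `proxies` keyword on `list_transcripts`, as the 0.6.x releases do (newer releases may expose a different interface), and the helper names and the `YT_PROXY_URL` environment variable are hypothetical, not part of the library or of the original code.

```python
import os
from importlib import metadata


def installed_version(package="youtube-transcript-api"):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_proxies(proxy_url):
    """Build a requests-style proxies mapping, or None when no proxy is set."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def list_transcripts_via_proxy(video_id, proxy_url=None):
    """Call list_transcripts, routing through proxy_url when one is given."""
    # Imported lazily so the helpers above work without the library installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    # Assumption: this release accepts a `proxies` keyword (true for 0.6.x).
    return YouTubeTranscriptApi.list_transcripts(
        video_id, proxies=build_proxies(proxy_url)
    )


if __name__ == "__main__":
    # Log this in both environments and compare the two values.
    print("youtube-transcript-api version:", installed_version())
    # YT_PROXY_URL is a hypothetical variable name used for illustration.
    print("proxies:", build_proxies(os.environ.get("YT_PROXY_URL")))
```

In `generate_youtube_summary`, the existing `YouTubeTranscriptApi.list_transcripts(video_id)` call could then be swapped for `list_transcripts_via_proxy(video_id, os.environ.get("YT_PROXY_URL"))`, so that production uses a proxy while non-production, where the variable is unset, behaves as before.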