Convert Lecture Videos to PDF

Description

Want to get through lecture videos faster without missing any information? Wish you could read a lecture instead of watching it? Now you can! With this Python application, you can convert lecture videos to a PDF file. The PDF contains a screenshot of each lecture slide presented in the video, along with a transcription of your instructor explaining that slide. It can also handle instructors annotating their slides and mild amounts of PowerPoint animation.

Table of Contents

  • Walkthrough
  • Getting Started
  • Tweaking the Application
  • Next Steps
  • Usage
  • Credits
  • License

Walkthrough of this project

Users will need to download a video file of their lecture. For instance, the video file might look like this:

Users will also need a copy of the video's subtitles.

After running the command line tool, they will get a PDF that looks like this:

where each page contains an image from the lecture video and a transcription of the instructor explaining that slide.

Getting Started

  1. Ensure Python 3 and pip are installed on your machine

  2. Next, install package dependencies by running:

    pip3 install -r requirements.txt

  3. Now, run:

    python3 src/main.py tests/videos/input_1.mp4 -s tests/subtitles/subtitles_1.vtt -o output.pdf

    to generate a PDF of this lecture video with these subtitles

  4. The generated PDF will be saved as output.pdf
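The command in step 3 takes a positional video path plus the -s and -o options. As a rough illustration of how such a command line could be parsed, here is a minimal argparse sketch. It is not the code in src/main.py: the long option names (--subtitles, --output) and the default output name are assumptions made for this example, and only the short flags come from the command above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the README command."""
    parser = argparse.ArgumentParser(
        description="Convert a lecture video to a PDF (illustrative sketch)"
    )
    # Positional argument: path to the lecture video.
    parser.add_argument("video", help="path to the lecture video file")
    # -s is taken from the README; the long name --subtitles is an assumption.
    parser.add_argument("-s", "--subtitles", help="path to the subtitle file")
    # -o is taken from the README; the long name and default are assumptions.
    parser.add_argument("-o", "--output", default="output.pdf",
                        help="path of the PDF to write")
    return parser


# Parse the same arguments used in step 3.
args = build_parser().parse_args([
    "tests/videos/input_1.mp4",
    "-s", "tests/subtitles/subtitles_1.vtt",
    "-o", "output.pdf",
])
print(args.video, args.subtitles, args.output)
```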

Tweaking the Application

This application uses computer vision with OpenCV to detect when the instructor has moved on to the next PowerPoint slide, detect animations, etc.

You can adjust the sensitivity to video frame changes in the src/video_segment_finder.py file. You can also visualize how well the application detects transitions and animations with the src/plot.py tool.
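To give a feel for what a sensitivity setting controls, here is a minimal frame-differencing sketch. It is not the algorithm in src/video_segment_finder.py, which is not shown here. It assumes grayscale frames represented as nested lists of pixel values rather than OpenCV arrays, and the function names and the threshold parameter are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def find_slide_changes(frames, threshold=10.0):
    """Return indices of frames that differ strongly from the previous frame.

    A lower threshold is more sensitive: small changes such as annotations
    start to count as a new slide.
    """
    changes = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            changes.append(i)
    return changes


# Two dark frames followed by two bright frames: one change, at index 2.
blank = [[0] * 4 for _ in range(4)]
slide = [[200] * 4 for _ in range(4)]
frames = [blank, blank, slide, slide]
print(find_slide_changes(frames))
```

Raising the threshold makes the detector ignore smaller changes; lowering it makes annotations and animations more likely to register as transitions.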

Next Steps

  • Automatically generate subtitles
  • Wrap project into a web app?

Usage

Please note that this project is intended for educational purposes and not for commercial use. We are not liable for any damages or changes caused by this project.

Credits

Emilio Kartono, who made the entire project.

License

This project is protected under the GNU license. Please refer to LICENSE.txt for more information.

Comments
  • SRT files

    Hi, I got

    Getting subtitles for each frame
    Traceback (most recent call last):
      File "src/main.py", line 80, in <module>
        runner.run(sys.argv[1:])
      File "src/main.py", line 46, in run
        self.__run__(
      File "src/main.py", line 61, in __run__
        segment_finder = SubtitleSegmentFinder(subtitle_parser.get_subtitle_parts())
      File "C:\Users\Tilman\AppData\Local\Programs\Python\Python38\src\subtitle_segment_finder.py", line 52, in get_subtitle_parts
        for caption in webvtt.read(self.input_file):
      File "C:\Users\Tilman\AppData\Local\Programs\Python\Python38\lib\site-packages\webvtt\webvtt.py", line 60, in read
        parser = WebVTTParser().read(file)
      File "C:\Users\Tilman\AppData\Local\Programs\Python\Python38\lib\site-packages\webvtt\parsers.py", line 25, in read
        self._validate(content)
      File "C:\Users\Tilman\AppData\Local\Programs\Python\Python38\lib\site-packages\webvtt\parsers.py", line 258, in _validate
        raise MalformedFileError('The file does not have a valid format')
    webvtt.errors.MalformedFileError: The file does not have a valid format
    

    when trying to use a .srt subtitle file. Is it possible to add support for that?

    opened by tharos96 10
  • IndexError: string index out of range

    I'm getting this error. My subtitle file doesn't contain any dots (.), and I think that is the problem.

    This is the output

    Number of frames: 103
    Getting subtitles for each frame
    Traceback (most recent call last):
      File "src/main.py", line 87, in <module>
        runner.run(sys.argv[1:])
      File "src/main.py", line 51, in run
        self.__run__(
      File "src/main.py", line 70, in __run__
        segments = segment_finder.get_subtitle_segments(subtitle_breaks)
      File "D:\Study\Sixth-Sem\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 42, in get_subtitle_segments
        pos = self.__get_part_position_of_time_break__(time_break)
      File "D:\Study\Sixth-Sem\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 122, in __get_part_position_of_time_break__
        if self.parts[right_part_index].text[right_part_char_index] == ".":
    IndexError: string index out of range
    opened by s-bhagwat 4
  • Feature Request: Extract slides without subtitles

    Hi! I like the idea of your project but it's not what I actually want.

    I have a few lecture videos without subtitle files, and I'm not interested in having subtitles in the PDF document either. I just want to obtain a document with the slides in landscape orientation.

    Please consider adding the option to achieve the mentioned result.


    Side note: I noticed that your program depends on CPU power. Have you ever thought about utilizing the GPU instead?

    opened by MaZED-UP 3
  • Cropped slides in output.pdf

    python.exe src/main.py "D:\Downloads\vid1.m4v" -s "D:\Downloads\subt1.vtt" -o output.pdf
    
    C:\Users\User\AppData\Local\Programs\Python\Python38>python.exe src/main.py "D:\Downloads\vid1.m4v" -s "D:\Downloads\subt1.vtt" -o output.pdf
    Getting selected frames
    Getting subtitles for each frame
    Merging frames and subtitles
    C:\Users\User\AppData\Local\Programs\Python\Python38\lib\site-packages\fpdf\fpdf.py:710: UserWarning: Substitutting Arial by core font Helvetica
      warnings.warn("Substitutting Arial by core font Helvetica")
    

    resulted in cropped slides in output.pdf.

    Also, the subtitles are more or less split into two major parts. They were converted from .srt to .vtt via https://www.happyscribe.com and have the following exemplary format:

    1090 00:36:26.960 --> 00:36:29.660 inzwischen wird das war schon ein

    opened by tharos96 3
  • Type Error when running the script

    Hi,

    Thank you for sharing the code. I tried executing the script as per the instructions in the README file. It works fine when run with the example subtitle files (subtitles_1, subtitles_2, etc.). But when using my own subtitle file with a video, the program throws the following error (link to the subtitle file that was used: https://github.com/docstar1/Lecture-Video-to-PDF/blob/2daed0bc3426db515ad8da3ef3611834789c23db/tests/subtitles/subtitles_3.vtt):

    Traceback (most recent call last):
      File "src/main.py", line 77, in <module>
        runner.run(sys.argv[1:])
      File "src/main.py", line 47, in run
        video_segment_finder, video_filepath, subtitle_parser, output_filepath
      File "src/main.py", line 61, in run
        segments = segment_finder.get_subtitle_segments(subtitle_breaks)
      File "..\PycharmProjects\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 96, in get_subtitle_segments
        pos = self.get_part_position_of_time_break(time_break)
      File "..\PycharmProjects\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 152, in get_part_position_of_time_break
        part = self.parts[part_index]
    TypeError: list indices must be integers or slices, not NoneType

    opened by docstar1 2
  • feat: setup github actions

    Description:

    This PR sets up GitHub Actions so that tests run automatically each time a PR is made. It also adds a new integration test that generates a sample PDF file through the CLI.

    opened by EKarton 0
  • feat: support adding no subtitles

    Description:

    This PR contains multiple changes:

    1. Adding no subtitles to the generated PDF
    2. Creating integ tests for the main.py file
    3. Refactoring the test code

    Usage of (1)

    To add no subtitles to the generated PDF, one would need to supply the -S or --skip-subtitles flag to the cli tool, like:

    python3 -m src.main tests/videos/input_1.mp4 -S -o output.pdf   
    

    Running the command above will output a pdf with no subtitles, like:

    output.pdf

    opened by EKarton 0
  • fix: fix generating pdf with empty subtitles

    Description:

    Problem:

    When we have a video like https://www.youtube.com/watch?v=KN6OSdUfgyA with subtitles like subtitles_8-old.txt, we get an exception thrown:

    Number of frames: 103
    Getting subtitles for each frame
    Traceback (most recent call last):
      File "src/main.py", line 87, in <module>
        runner.run(sys.argv[1:])
      File "src/main.py", line 51, in run
        self.__run__(
      File "src/main.py", line 70, in __run__
        segments = segment_finder.get_subtitle_segments(subtitle_breaks)
      File "D:\Study\Sixth-Sem\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 42, in get_subtitle_segments
        pos = self.__get_part_position_of_time_break__(time_break)
      File "D:\Study\Sixth-Sem\Lecture-Video-to-PDF\src\subtitle_segment_finder.py", line 122, in __get_part_position_of_time_break__
        if self.parts[right_part_index].text[right_part_char_index] == ".":
    IndexError: string index out of range
    

    Cause:

    This is because the subtitle file contains subtitle parts that have no content in them.

    This causes the lookup self.parts[right_part_index].text[right_part_char_index] to fail: right_part_index is a valid index into self.parts, but the text of that part is empty and right_part_char_index is initially set to 0, so indexing into the empty string raises the IndexError.

    Solution:

    Each subtitle part needs to have some text. A subtitle part with no text is redundant. Hence, when we parse the subtitles, we need to remove the subtitle parts that have no text.
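The filtering step described in this solution can be sketched as follows. This is a minimal illustration, not the code from the PR: the SubtitlePart class and the drop_empty_parts function are hypothetical names, and treating whitespace-only text as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SubtitlePart:
    """Hypothetical stand-in for one parsed subtitle cue."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # caption text, possibly empty


def drop_empty_parts(parts):
    """Keep only the subtitle parts whose text has visible characters."""
    return [part for part in parts if part.text.strip()]


parts = [
    SubtitlePart(0.0, 1.5, "Welcome to the lecture."),
    SubtitlePart(1.5, 2.0, ""),      # empty cue, removed
    SubtitlePart(2.0, 4.0, "   "),   # whitespace only, removed (assumption)
    SubtitlePart(4.0, 6.0, "Today we cover graphs."),
]
cleaned = drop_empty_parts(parts)
print([part.text for part in cleaned])
```

With the empty parts removed up front, any later lookup of text[0] on a remaining part is safe.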

    Manual Test:

    1. Ran python3 src/main.py tests/videos/input_8.mp4 -s tests/subtitles/subtitles_8.1.vrt -o output.pdf. It did not crash, and it returned this PDF:

    output.pdf

    2. Unfortunately, that subtitle file is not the best, and the subtitle file at tests/subtitles/subtitles_8.srt is a better replacement for that video's subtitles. Running python3 src/main.py tests/videos/input_8.mp4 -s tests/subtitles/subtitles_8.srt -o output.pdf produced this PDF:

    output.pdf

    opened by EKarton 0
  • fix: generating pdf with subtitles containing no dots, and missing unicode fonts

    Description:

    Problem:

    This PR is about fixing two bugs:

    1. When the subtitles contain no periods (.), the program either goes into an infinite loop or distributes the subtitles unevenly across frames.
    2. When generating a PDF with non-English subtitles, it throws an error.

    Solution:

    For (1), the issue is in the code that finds the best break point in the subtitles for a given timestamp. The current codebase finds the index in the subtitle text that corresponds to the timestamp, and then searches iteratively left and right from that index for a period (.). If no period is found, it returns the entire subtitle file.

    The fix for (1) is to return the original index if no period is found to its left or right.
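The search with its fallback can be sketched as below. This is a simplified illustration, not the code in src/subtitle_segment_finder.py: it works on a single string instead of a list of subtitle parts, and the name find_break_index is hypothetical.

```python
def find_break_index(text, index):
    """Return the position of the period nearest to index.

    The search expands one character at a time to the left and to the
    right. If the text contains no period, the original index is
    returned instead of looping forever or consuming the whole text.
    """
    if not text:
        return 0
    # Clamp the starting position into the valid range.
    index = max(0, min(index, len(text) - 1))
    left, right = index, index
    while left >= 0 or right < len(text):
        if left >= 0 and text[left] == ".":
            return left
        if right < len(text) and text[right] == ".":
            return right
        left -= 1
        right += 1
    # No period anywhere: fall back to the original index.
    return index


print(find_break_index("First point. Second point", 15))
print(find_break_index("no periods here", 5))
```

The loop ends once both cursors have left the string, which guarantees termination even for subtitles without any periods.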

    For (2), the fix is to add Unicode fonts when writing content to the PDF: download a Unicode font, add the font to PyFPDF, and use that font.

    Manual Test:

    1. Ran python3 src/main.py tests/videos/input_7.mp4 -s tests/subtitles/subtitles_7.srt -o output.pdf
    2. The output file looks like this:

    output.pdf

    opened by EKarton 0
  • fix: set max image width on pdf

    Problem:

    When the resolution of a lecture video is too high, the images embedded into the PDF are larger than the PDF page itself.

    Solution:

    Set the width of images when embedding them in PDFs
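One way to compute the size of an image constrained to a maximum width is sketched below. This is an illustration under stated assumptions, not the change made in this PR: the fit_to_width function is hypothetical, it only shrinks images that are too wide (the PR may set the width unconditionally), and the 190 mm limit is an assumed usable width for an A4 page with 10 mm margins.

```python
def fit_to_width(image_width, image_height, max_width):
    """Scale an image down to max_width, preserving its aspect ratio.

    Images that already fit are returned unchanged.
    """
    if image_width <= max_width:
        return image_width, image_height
    scale = max_width / image_width
    return max_width, image_height * scale


# A 1920x1080 frame constrained to an assumed 190 mm usable page width.
width, height = fit_to_width(1920, 1080, 190)
print(width, round(height, 3))
```

Passing only the width to the PDF library and letting it derive the height is a common alternative; computing both explicitly makes the resulting layout easier to test.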

    opened by EKarton 0
  • feat: added srt support

    Description:

    This PR is about adding .srt support for subtitles. This PR should not affect existing behaviour in parsing .webvtt subtitles.

    If the subtitle filename ends in .srt, it will use the SRT subtitle parser. Else, it will use the WebVTT subtitle parser.
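The selection rule above can be sketched as a small dispatch function. This is a hypothetical illustration, not the PR's code: the function name and the returned labels are made up for the example, and the case-insensitive comparison is an assumption.

```python
def pick_subtitle_format(filename):
    """Choose a subtitle parser label from the file extension.

    Files ending in .srt use the SRT parser; everything else falls
    back to the WebVTT parser, as described in the PR.
    """
    if filename.lower().endswith(".srt"):
        return "srt"
    return "webvtt"


for name in ("tests/subtitles/subtitles_1.srt", "tests/subtitles/subtitles_1.vtt"):
    print(name, "->", pick_subtitle_format(name))
```

Because WebVTT is the fallback, a file with an unexpected extension is still handed to the WebVTT parser, which then reports a malformed file if the content does not match.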

    Manual Test:

    1. Ran pip3 install -r requirements.txt
    2. Ran python3 src/main.py tests/videos/input_1.mp4 -s tests/subtitles/subtitles_1.srt -o output.pdf. It outputted the subtitles correctly:

    output.pdf

    3. Ran python3 src/main.py tests/videos/input_2.mp4 -s tests/subtitles/subtitles_2.srt -o output.pdf. It outputted the subtitles correctly too:

    output.pdf

    4. Ran cd test && python3 -m unittest * and they all passed.
    opened by EKarton 0
Owner
Emilio Kartono
Computer Science Graduate from the University of Toronto
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

Matthew Stamy 5k Jan 4, 2023
Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, PDF viewer

null 1 Nov 30, 2021
PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

null 9 Jan 30, 2022
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
A simple Python script to convert multiple images (well technically also a single image) into a pdf.

PythonImage2PDF A simple Python script to convert multiple images into a single PDF-document. Created basically for only my own needs for converting m

Joona Gynther 1 Jun 28, 2022
Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

Will Fantom 1 Feb 9, 2022
Split given PDF document into 4 page groups and convert them to booklet format

PUTO: PDF to Booklet converter Split given PDF document into 4 page groups and convert them to booklet format. It creates a PDF like shown below: Fir

null 3 Mar 12, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

Yusuke Shinyama 4.9k Jan 4, 2023
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

Nemo 387 Dec 10, 2022
A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve

Martin D. 394 Dec 30, 2022
Performing the following operations using python on PDF.

Python PDF Handling Tutorial Python is a highly versatile language with a huge set of libraries. It is a high level language with simple syntax. Pytho

Prajwol Lamichhane 131 Dec 16, 2022
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

Jonas Lejon 1.9k Jan 1, 2023
Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

Duo Apps 6 Oct 3, 2022
Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 2, 2022
borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 1, 2023
WeasyPrint is a smart solution helping web developers to create PDF documents.

WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous statistical reports, invoices, tickets…

Kozea 5.4k Jan 8, 2023
Python lib for Simple PDF text extraction

Jason Alan Palmer 651 Jan 1, 2023