Saturday, May 4, 2019

Convert tables on all pages of a PDF into CSV with tabula-py


(bigdata) Donghuas-MacBook-Air:Documents donghua$ pip install tabula-py
(bigdata) Donghuas-MacBook-Air:Documents donghua$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tabula
>>> tabula.convert_into("/Users/donghua/Documents/SASMO-Result-Grade-3.pdf", "/Users/donghua/Documents/SASMO-Result-Grade-3.csv", output_format="csv", pages="all")
May 04, 2019 12:36:19 AM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO: To get higher rendering speed on JDK8 or later,
May 04, 2019 12:36:19 AM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
May 04, 2019 12:36:19 AM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
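
The same conversion can be wrapped in a small script instead of typed at the prompt. This is only a sketch: the helper names csv_path_for and pdf_tables_to_csv are made up for illustration, and the java_options value is simply the flag that the PDFBox INFO message above suggests. Whether it actually speeds up rendering on a given JDK is an assumption, not something measured here.

```python
import os


def csv_path_for(pdf_path):
    """Return the PDF path with its extension swapped for .csv (hypothetical helper)."""
    root, _ext = os.path.splitext(pdf_path)
    return root + ".csv"


def pdf_tables_to_csv(pdf_path, csv_path=None):
    """Write the tables found on all pages of pdf_path into a single CSV file."""
    # Imported here so csv_path_for can be used without tabula-py or Java installed.
    import tabula

    if csv_path is None:
        csv_path = csv_path_for(pdf_path)
    tabula.convert_into(
        pdf_path,
        csv_path,
        output_format="csv",
        pages="all",
        # Flag copied from the PDFBox hint in the log output; assumed optional.
        java_options=["-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider"],
    )
    return csv_path
```

Calling pdf_tables_to_csv("/Users/donghua/Documents/SASMO-Result-Grade-3.pdf") should write SASMO-Result-Grade-3.csv next to the PDF, as in the session above. tabula-py still needs a Java runtime on the PATH.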