How to use LlamaParse to convert PDF to text and extract complex table data. For Annual Reports, 10-Ks, Research Reports
Published: March 20, 2024
NEW: TIGZIG: Co-Analyst
app.tigzig.com - my open-source platform with 25+ micro-apps and tools for AI-driven analytics and data science, including a LlamaParse PDF-to-Markdown converter.
Extracting data, especially table data, from complex PDFs used to be a challenge. With the launch of LlamaParse by LlamaIndex, that is no longer the case.
Originally published on LinkedIn.
Note for developers doing the conversion themselves with Python/JS scripts:
- Calling the API directly is faster than using the Python package.
- Chunking the file before parsing improves speed.
- Currently, around 50 pages seems to be the optimal chunk size.
- Parsing in 50-page chunks is faster than parsing the full file at once, even for a report of only 100 pages or so.
- I tested chunk sizes from 25 to 100 pages; chunks both smaller and larger than 50 pages increased the conversion time.
- However, all of this can change rapidly, as LlamaParse is evolving quickly. For example, just a few days ago they raised the file size limit from 200 to 700 pages.
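The chunking step described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the script used for the tests: it assumes the third-party `pypdf` package for splitting, and the helper names `chunk_page_ranges` and `split_pdf` as well as the output file naming are hypothetical. The 50-page default simply mirrors the chunk size reported above and may need retuning as LlamaParse changes.

```python
from pathlib import Path


def chunk_page_ranges(total_pages, chunk_size=50):
    """Return half-open (start, end) page-index ranges covering the document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src, out_dir, chunk_size=50):
    """Write each page range of `src` to its own PDF and return the new paths."""
    # Assumption: pypdf is installed. It is imported lazily so that the
    # range helper above works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for start, end in chunk_page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        # Hypothetical naming scheme using 1-based page numbers,
        # e.g. report_p1-50.pdf
        path = out / f"{Path(src).stem}_p{start + 1}-{end}.pdf"
        with open(path, "wb") as fh:
            writer.write(fh)
        paths.append(path)
    return paths
```

Each file returned by `split_pdf` could then be sent to LlamaParse as a separate request, with the resulting Markdown joined back together in page order.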