The next step in my data processing project is to find strings matching certain patterns in the PDF data. Today I worked my way through the relevant chapter (chapter 7) of Al Sweigart's excellent Automate the Boring Stuff with Python.
I've left some sample code above as a reminder (mainly for myself) of the basic syntax you can use. I saw a slightly more concise way of running the search in Data Wrangling with Python, and I may experiment with that in the future. That has you running something like:
search_result = re.search(word, fulltext)
I'd guess one of the two approaches has a speed advantage, and even a small difference would add up when repeated over hundreds of thousands of pieces of text.
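To make the comparison concrete, here is a minimal sketch of the two styles side by side, with a rough timing of each. The sample text and the pattern are made up for illustration and are not from my actual data. One thing worth knowing: Python's re module caches recently compiled patterns, so the gap between the two may turn out to be small, but measuring is better than guessing.

```python
import re
import timeit

# Hypothetical sample data; the real project would use text extracted from PDFs.
fulltext = "Report filed on 2019-03-14 under case number AB-1234."
pattern = r"[A-Z]{2}-\d{4}"

# Style 1: compile a pattern object first, then search with it.
case_regex = re.compile(pattern)
compiled_result = case_regex.search(fulltext)

# Style 2: pass the pattern string straight to the module-level function.
direct_result = re.search(pattern, fulltext)

# Both return a match object, or None when nothing matches.
print(compiled_result.group())  # AB-1234
print(direct_result.group())    # AB-1234

# Rough timing of each style; the numbers will vary from machine to machine.
compiled_time = timeit.timeit(lambda: case_regex.search(fulltext), number=100_000)
direct_time = timeit.timeit(lambda: re.search(pattern, fulltext), number=100_000)
print(f"compiled: {compiled_time:.3f}s, direct: {direct_time:.3f}s")
```

Whichever style wins on your machine, it is worth checking for None before calling .group(), since a failed search returns None rather than an empty match.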
The next step with this project will be to connect this regex function with the code that splits the file. That way, when I split the file, I can rename each piece at the same time using a string that I've extracted with a regex search.
If you've read this far and you don't know what I'm talking about, there's an interesting article by Cory Doctorow where he argues that regular expressions should probably be taught as a foundational skill to children:
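I haven't written that part yet, but a rough sketch of the idea might look like the following. The function names (build_filename, rename_with_match), the fallback label and the .pdf extension are all placeholders of my own, and it assumes the text of each split file is already available as a string.

```python
import re
from pathlib import Path


def build_filename(text, pattern, fallback="unmatched"):
    """Build a .pdf filename from the first regex match found in text."""
    match = re.search(pattern, text)
    # Use a fallback label when the pattern is not found.
    label = match.group() if match else fallback
    # Replace runs of characters that are awkward in filenames with "_".
    safe_label = re.sub(r"[^A-Za-z0-9_-]+", "_", label)
    return f"{safe_label}.pdf"


def rename_with_match(path, text, pattern):
    """Rename the file at path using a string extracted from text."""
    path = Path(path)
    new_path = path.with_name(build_filename(text, pattern))
    path.rename(new_path)
    return new_path


# Example with a made-up pattern for a case number like "AB-1234".
print(build_filename("Case number AB-1234 filed", r"[A-Z]{2}-\d{4}"))  # AB-1234.pdf
```

Keeping the filename logic separate from the actual rename means it can be tested on plain strings without touching any files. A real version would also need to decide what to do when two pieces produce the same name.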
Knowing regexp can mean the difference between solving a problem in three steps and solving it in 3,000 steps. When you're a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.