I use Calibre to manage my libraries of eBooks, PDFs, and so on. When I'm importing batches of PDFs into my library, things generally run smoothly… except when they don't. I recently tried to import a block of PDFs whose file names followed two conventions: author - title.pdf and author - seriesName seriesIndex - title.pdf. By default, when Calibre reads metadata from the file name, it assumes the convention is title - author.pdf (or something similar).
For my PDF collection, the default fails spectacularly, which means I would have to edit every file's metadata by hand to correct it. Happily, I found this discussion thread on exactly the problem I was having (Yay Internet!) and got this regex to sort out all my file name parsing woes:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?(\[?(?P<series>[^0-9\-]+) (- )?(?P<series_index>[0-9.]+)\]?\s*-\s*)?(?P<title>.+)
Without going into a lot of detail, this regular expression splits the file name into author, series, series index, and title, using named groups for each. The series and series index are optional: if they are present, they are captured; otherwise, the match moves directly on to the title. For future reference, here's what my preferences panel looks like:
I ran this against my PDF collection of about 56 files that contained a mix of formats (some with complicated series titles and indexes) and the regex parsed them all beautifully.
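If you want to try the pattern outside Calibre before committing to an import, here is a minimal Python sketch. It is not how Calibre itself applies the expression. The function name parse_filename and the sample file names are made up for illustration, and I'm assuming the extension should be removed before matching and that stray whitespace around each field can be trimmed (the author group captures the space before the hyphen).

```python
import re
from pathlib import Path

# The regex from above, split across lines only for readability.
FILENAME_PATTERN = re.compile(
    r"^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))"
    r"(\s*-\s*)?"
    r"(\[?(?P<series>[^0-9\-]+) (- )?(?P<series_index>[0-9.]+)\]?\s*-\s*)?"
    r"(?P<title>.+)"
)


def parse_filename(filename):
    """Hypothetical helper: return the named groups as a dict, or None."""
    # Assumption: match against the name without its extension.
    stem = Path(filename).stem
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        return None
    # Assumption: trim whitespace; groups that did not take part stay None.
    return {
        name: (value.strip() if value else None)
        for name, value in match.groupdict().items()
    }


if __name__ == "__main__":
    # Made-up sample names covering both conventions.
    for name in (
        "Jane Doe - My Title.pdf",
        "Jane Doe - Great Saga 2 - My Title.pdf",
    ):
        print(name, "->", parse_filename(name))
```

Note that the pattern looks for a plain hyphen, so file names that use a different dash character would need the pattern adjusted.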
Special thanks to the MobileRead forums member Starson17 for taking the time to answer a question way back in December of 2009.