trevorprinn · 11-11-2014, 10:12 PM

I've written a little program to make it easier to extract the pages from a pdf and import them as songs into MobileSheets. It's a .net 4 program, so the runtime for that has to be installed first (but I think that comes with Windows 7).

You can download its install program from
http://tprinn.co.uk/RealBookExtractor/Re...-1.1.0.msi

There's a help page at
https://github.com/trevorprinn/RealBookExtractor/wiki

I've open sourced it, and the source is at
https://github.com/trevorprinn/RealBookExtractor

I've processed several books while I wrote it, and it takes about as long to process a book as it does for you to type in the artists and titles.

**GraemeJ** · 11-12-2014, 01:27 AM

Looks interesting, but I couldn't get it working in Win 8.1. Error message in screenshot.

trevorprinn · 11-12-2014, 01:53 AM

That's a shame. It's worked with all the ones I've tried it on, but I suppose they are all quite old.

I've used an open source library, pdfsharp, to extract the pages, and it looks like they haven't written code to cope with some recent changes that Adobe have made.

All is not lost. If you can extract them some other way into a folder, perhaps using the MobileSheets Companion app, then you should be able to enter that folder name into the main page of the Extractor, load it and work through them.

trevorprinn · (This post was last modified: 11-12-2014, 03:16 AM by trevorprinn.)

I've found a possible way of extracting pages. I'll try and have a look at putting it in later in the week. I might have trouble testing it, though, as I don't think I have any pdfs that use the more recent format.

Edit: Found one.

**GraemeJ** · 11-12-2014, 03:34 AM

OK - no panic. We'll wait and see how it turns out.

trevorprinn · 11-12-2014, 05:06 AM

I think I have sorted it out. There's a new version at
http://tprinn.co.uk/RealBookExtractor/Re...-1.2.0.msi

**GraemeJ** · 11-12-2014, 05:35 AM

(11-12-2014, 05:06 AM)trevorprinn Wrote: I think I have sorted it out. There's a new version at
http://tprinn.co.uk/RealBookExtractor/Re...-1.2.0.msi

Hmmm.. not here you haven't.

I can load and extract from a large pdf, but the pages don't display.

1 - The first and last pages (i.e. the front and back covers) displayed ok

2 - Pages in between the first and last mainly didn't show at all, a couple or three displayed garbage.

3 - I tried to open some of the non-displaying pages with another program. This reported that, although the files had the .jpg extension, they were really .png files. I renamed a couple, but they still showed nothing or garbage.

Screenshots attached.

A work in progress, I feel.

popoff · 11-12-2014, 06:05 AM

I see that you switched from pdfsharp to itextsharp (the agpl version).

FYI I had many troubles in the past with the 4.x version of this library with modern pdf's, so I dropped the use of it trying to port an app from linux (using poppler) to windows.

Regards.

**GraemeJ** · (This post was last modified: 11-12-2014, 06:50 AM by GraemeJ.)

Just for the record, the original pdf was v1.5 (Acrobat 6.x) and created in 1999.

Many of the pdf's that musos will be using (mainly fake book stuff) have been knocking around for years - there's not a lot that's new out there :(

popoff · 11-12-2014, 07:02 AM

PDFsharp seems to support only PDF's from 1.2 to 1.4 .... although it can create pdfs marked from 1.2 to 1.7.
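For anyone wanting to check which version a given file declares, here is a minimal Python sketch (not part of the Extractor, which is a .NET program). The function name `pdf_header_version` is made up for illustration. It only reads the `%PDF-x.y` header comment; a file can also override that with a `/Version` entry in its catalog, which this sketch does not look at.

```python
import re
from typing import Optional, Tuple


def pdf_header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the %PDF-x.y header, or None if absent.

    Hypothetical helper: only the first 1024 bytes are searched, and the
    catalog's optional /Version override is ignored.
    """
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Usage would be something like `pdf_header_version(open("book.pdf", "rb").read())`, where `book.pdf` is a placeholder path.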

I suppose that the new pdf engine that Trevor is using (iTextPdf) may support pdf 1.5 without problem. I only had problems with pdfs beyond 1.7 extension level 3 in my application using this library, so there must be a work in progress as you said.

trevorprinn · 11-12-2014, 10:52 PM

I've knocked this up in too much of a hurry. It works with everything I have here, but there's obviously more variety in formats than I realised. Always writing the files out with a jpg extension is an obvious mistake. I need to check the images I'm retrieving from the pdf and see what format they are. That doesn't explain why renaming them to png doesn't work.
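The format check mentioned above can be sketched by looking at the first few bytes of each extracted file. This is a Python illustration only, not the Extractor's code; `sniff_image_format` is a hypothetical name, and the list covers just a handful of common signatures.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess a file extension from the leading magic bytes.

    Hypothetical helper: returns "bin" when no known signature matches.
    """
    signatures = [
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"\xff\xd8\xff", "jpg"),
        (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),  # JPEG 2000 file signature box
        (b"\xff\x4f\xff\x51", "j2k"),                # raw JPEG 2000 codestream
        (b"BM", "bmp"),
        (b"II*\x00", "tif"),                         # little-endian TIFF
        (b"MM\x00*", "tif"),                         # big-endian TIFF
    ]
    for magic, extension in signatures:
        if data.startswith(magic):
            return extension
    return "bin"
```

A file written out as `.jpg` whose bytes start with the PNG signature would come back as `"png"`, which matches what the other program reported.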

The only thing I'm using iTextSharp for is to convert pdfs with iref streams into the older format, so they can be read by PdfSharp, which has a much easier to use API. I used the old version of iTextSharp because it has a less restrictive licence, although given that I've open sourced my program I suppose I might as well use a more recent version. The PdfSharp team have been saying they will try to support iref streams for about 5 years, but they aren't making enough from donations for anyone to be willing to commit to doing it.
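What PdfSharp calls iref streams are the cross-reference streams introduced in PDF 1.5. A rough way to tell which kind a file uses is to follow the last `startxref` offset and see whether it lands on the `xref` keyword (the older table) or on an object. The Python sketch below is an illustration under those assumptions, not what the Extractor does; `uses_xref_stream` is a made-up name, and hybrid files that have both a table and a stream are reported as using the table.

```python
import re


def uses_xref_stream(pdf_bytes: bytes) -> bool:
    """Return True if the last startxref points at a cross-reference stream.

    Hypothetical helper: assumes the final startxref sits in the last 2 KB
    of the file, as is usual for well-formed PDFs.
    """
    tail = pdf_bytes[-2048:]
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    if not offsets:
        raise ValueError("no startxref entry found")
    offset = int(offsets[-1])
    # A classic table begins with the keyword "xref"; a stream begins with
    # an object header such as "12 0 obj".
    at_offset = pdf_bytes[offset:offset + 20].lstrip()
    return not at_offset.startswith(b"xref")
```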

I have come across a couple of pdfs on my disk that it can't get the sheets out of at all. They aren't stored as images but are actually rendered by Acrobat. I think I probably downloaded them from Wikifonia. I don't think I want to put the work in to deal with those.

GraemeJ, is there any chance you could send me a copy of the pdf? I promise I'll delete it once I get it working.

**GraemeJ** · 11-12-2014, 11:15 PM

(11-12-2014, 10:52 PM)trevorprinn Wrote: GraemeJ, is there any chance you could send me a copy of the pdf? I promise I'll delete it once I get it working.

Yes - no problem, but you need to PM me your email address, as it's not possible to attach files to emails through the forum.

Don't worry about deleting it, it's available all over the net anyway :)

trevorprinn · 11-12-2014, 11:47 PM

I've just been having a look at it, and it seems the 1.2.0 version has decided to start extracting all the 1bpp images as negatives. I can't see why. The code appears to be identical to before.
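The cause of the negatives isn't known at this point in the thread, so the following is only a sketch of what undoing one looks like for a 1-bit-per-pixel image: flip every bit, then clear the unused padding bits at the end of each row. It is written in Python for illustration and assumes the PDF-style layout where each row starts on a byte boundary with the leftmost pixel in the high bit; `invert_1bpp` is a hypothetical name.

```python
def invert_1bpp(rows: bytes, width: int, height: int) -> bytes:
    """Invert packed 1-bit-per-pixel image data.

    Hypothetical helper: rows are byte-aligned, most significant bit first.
    """
    stride = (width + 7) // 8  # bytes per row, including padding bits
    if len(rows) != stride * height:
        raise ValueError("data length does not match width and height")
    out = bytearray(b ^ 0xFF for b in rows)
    pad_bits = stride * 8 - width
    if pad_bits:
        # Keep padding bits at zero so they do not show up as stray pixels.
        mask = (0xFF << pad_bits) & 0xFF
        for y in range(height):
            out[y * stride + stride - 1] &= mask
    return bytes(out)
```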

I also can't (easily) determine the image format that I've extracted from the pdf (the image is always a memory bmp), but I think I may be able to write it out converted to a specific format, once I have worked out the negative problem.

trevorprinn · 11-13-2014, 05:50 AM

I understand the problem with that file. I've been learning more about the innards of pdfs than I really intended to. The images in it are encoded using JPX (JPEG 2000) which is the one standard format not supported by the PdfSharp libraries. iTextSharp may support it, but I can't easily see how to get it to extract them.
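Each image stream in a pdf names the filter it was encoded with, and that name largely decides what an extractor has to do. The Python sketch below lays out that decision using the standard filter names from the PDF specification. It is an illustration, not the Extractor's logic: `extraction_plan` and the plan labels are made up, and it says nothing about which filters PdfSharp or iTextSharp actually handle.

```python
from typing import Optional, Tuple

# Filters whose stream bytes can be handed to a decoder for that format
# without decompressing them first.
PASSTHROUGH = {
    "/DCTDecode": "jpg",  # JPEG
    "/JPXDecode": "jp2",  # JPEG 2000 (the format in the problem file)
}

# Filters that yield raw pixel samples only after decoding, so the result
# has to be re-encoded (for example as PNG) before it is a usable image file.
NEEDS_DECODING = {
    "/FlateDecode",
    "/LZWDecode",
    "/RunLengthDecode",
    "/CCITTFaxDecode",
    "/JBIG2Decode",
}


def extraction_plan(filter_name: str) -> Tuple[str, Optional[str]]:
    """Return a (plan, extension) pair for an image stream's filter name."""
    if filter_name in PASSTHROUGH:
        return ("write-as-is", PASSTHROUGH[filter_name])
    if filter_name in NEEDS_DECODING:
        return ("decode-then-encode", "png")
    return ("skip", None)
```

Returning `("skip", None)` for anything unrecognised lets the caller carry on with the remaining pages instead of failing on one image.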

I'm going to have to leave it at that for now. I've got too much work on to spend much more time on it. The pdf renderer built into the MobileSheets Companion (File/Convert PDF To Images) does support it, so all I can suggest is that you use that for any that the Extractor can't cope with.

I'm going to put out a release later this evening with a few small changes. It will at least not collapse if it tries to read a JPX encoded image (or any other format it can't handle), will always force the output to be png, won't (I hope) ever write the music pages out as negatives, and will have a Browse button for selecting a folder.

**GraemeJ** · 11-13-2014, 07:12 AM

OK - I understand that you have more important things to work on. Thanks for the effort so far and I'm sorry you had to learn more about pdf's than you would have liked :)

I'll have a play around with using the MS renderer.
