How to convert PDF files to HTML or XML files in openSUSE

7 Comments

  1. Hi there,

    I just want to mention that it is generally extremely unwise to use a non-markup format like PDF as the source for markup language output like XML or HTML (or LaTeX, …).

    PDF does not contain the necessary meta information, so you always get a “flat” output of what you originally had in the master document. Whatever algorithm pdftohtml uses, it can only “guess” at structural information such as headers. You can compare this to running an OCR program over a scanned text document.

    However, it does this job very well. For my personal taste, though, it is often more useful to use the clipboard to get text out of a PDF file 😉

    More useful is the extraction of pictures, which is indeed one of the most advanced features.

    Another important feature is that (HTML) links are preserved too.

    pdftohtml also tries to produce an outline of the document. It is able to scan for headers and even does a good job when it comes to rendering scientific documents that contain a numbered header hierarchy.

    ----------

    I think such information is elementary when introducing a tool like pdftohtml, and it is a little weak (or at least not very careful in the preparation of the article) to just show a few installer screenshots and leave the user alone. A short link to another howto explaining “one-click-install”, and instead a concrete real-world sample with screenshots, would have been more useful to the reader.

    The author should not title his/her article “How to convert PDF files to HTML or XML files …” when he/she just explains common installation procedures and copy/pastes the console output.

    I hope “just publishing” is not the main motto of the author “admin”.

    Sorry about this comment, but I really hope the writer can take some of this into consideration for future articles 😉 More and more often I see articles like this cluttering the forums and howto sites, and I personally think this is not the way to teach people Linux or anything else.

    Nevertheless Greets

    P.S.: To install pdftohtml on Ubuntu: it is part of the “poppler-utils” package.


  2. Author

    @Axel: Thanks very much for your comment. I did think about showing screenshots of the converted file, but the outputs are just files, and there aren’t features that can be represented pictorially.

    I do agree that it’s not the best way to convert PDFs, but with a huge file, using the clipboard doesn’t help either… does it?

    I look at all levels of audience: not just experts but also beginners and end users.

    Anyway, thanks for your comments; I will remember the valid points you have raised.

    Thanks

  3. Now all we need is war2pdf, war2html, war2txt, war2ps, etc., for KDE/Konqueror’s web archive format. That format replaces saving HTML files the way IE6 (and maybe later versions) and, iirc, Firefox do it: they save the HTML file as one file and then create a sub-directory for the image files needed to render the page. The Konq web archive format instead compresses the HTML text and the image files together into a single compressed archive file.

  4. I believe PDF is a vector-based binary format. Is there a tool to convert it directly to SVG (preferably with embedded images if possible)? It would be even better if the HTML markup contained drawings as embedded SVG.
    Thanks.

  5. I have recently been looking for a solution to convert non-tagged PDFs into tagged ones. The only reasonable thing I found was to first convert them to some format OpenOffice can read and then use OpenOffice to export the tagged PDF file. But that is not so obvious either. I’ll try with this tool.
    If you wonder why I need it tagged: it is to be able to get a nice rendering on my (old) Windows Mobile PDA.

  6. I am impressed. It is almost there.

    I converted a 70-page OpenOffice document to PDF, then I used pdftohtml. I got some errors:
    Error: Embedded TrueType font is missing a required table (‘fpgm’)
    Error: Embedded TrueType font is missing a required table (‘prep’)

    The index is off because of the trailing ……. leading up to the page number. Maybe it is because of those TrueType fonts.

    It is also a little bit off for the justified lines, maybe again due to the TrueType font errors.

    One very good improvement would be to put NEXT and PREVIOUS buttons at the bottom of the page, get rid of the left navigation “panel”, and replace it with an “index” button.

    Also, add a switch to produce a single-page HTML document.

    As I wrote above, I am impressed mainly because it is a 0.36 version.

    Looking forward to the 1.0 version. It will be a small revolution for the HTML world.

    Michel-Andre
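
Since several comments above ask what an actual conversion call looks like, here is a minimal Python sketch that builds and runs a pdftohtml command. It assumes the poppler-utils build of pdftohtml, where `-xml` switches to XML output and `-noframes` writes a single HTML file without the frame-based navigation panel; the 0.36 version discussed above may behave differently. The function names and the file name `report.pdf` are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, output_base, *, xml=False, single_page=False):
    """Return the argument list for a pdftohtml call (nothing is executed here)."""
    command = ["pdftohtml"]
    if xml:
        # Assumption: poppler-utils pdftohtml, where -xml produces XML output.
        command.append("-xml")
    if single_page:
        # Assumption: -noframes yields one HTML file without the frame panel.
        command.append("-noframes")
    command += [str(pdf_path), str(output_base)]
    return command


def convert(pdf_path, output_base, **options):
    """Run pdftohtml and return the completed process; raises if the tool is missing."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH")
    command = build_pdftohtml_command(pdf_path, output_base, **options)
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(command, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    # Hypothetical input file; only prints the commands, nothing is converted.
    print(build_pdftohtml_command("report.pdf", "report"))
    print(build_pdftohtml_command("report.pdf", "report", xml=True))
    print(build_pdftohtml_command("report.pdf", "report", single_page=True))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to check which switches would be passed before any file is touched. Messages such as the TrueType font errors quoted above would normally end up in the `stderr` attribute of the result returned by `convert`.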
