Java - How to remove the HTML/CSS tags from the HTML email while retaining line breaks?

I have the email body:

[image: email body]

It is in HTML/CSS. I want to convert it to plain text in Java without changing the structure. I have tried the jsoup library:

Jsoup.clean(body, Whitelist.basic());

as well as solutions from this link, mainly the HTML2Text class that uses the Swing APIs. But both simply remove the CSS tags without retaining the structure, and the result looks like this:

[image: resulting plain text]

Everything ends up on one line. With plain HTML tags it works fine, but when CSS is involved the structure is lost. Is there any Java library that can handle the CSS as well and still retain the structure?

1 answer

  • answered 2018-07-11 06:35 agrim khanna

    I guess this is what you need. This code snippet first removes all tags except br and p, and then removes those tags as well while keeping the text. Parser.unescapeEntities then converts escaped entities such as '&nbsp;' back into plain characters.

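    // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.parser.Parser and
    // org.jsoup.safety.Whitelist (named Safelist in newer jsoup releases).
    String html = (String) bodyPart.getContent();
    // Pass 1: keep only <br> and <p>; pretty-printing writes the output across multiple lines.
    String prettyPrintedBodyFragment = Jsoup.clean(html, "", Whitelist.none().addTags("br", "p"),
                    new Document.OutputSettings().prettyPrint(true));
    // Pass 2: strip the remaining tags without re-flowing the text, then decode the entities.
    String result = Parser.unescapeEntities(Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(),
                    new Document.OutputSettings().prettyPrint(false)), false);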
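
If jsoup is not available, the same idea (turn br and p into line breaks, strip every other tag, then decode entities) can be roughly approximated with the JDK alone. The following is only a sketch and not the code from the answer above: the class name HtmlToText and the method toPlainText are hypothetical, regular expressions are used in place of a real HTML parser, style and script blocks are dropped outright, and only a handful of entities are decoded. It will likely misbehave on malformed markup, so a real parser such as jsoup remains the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a dependency-free approximation, not a real HTML parser.
final class HtmlToText {
    // <style> and <script> blocks hold CSS/JS rather than text, so drop them with their content.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(style|script)\\b[^>]*>.*?</\\1\\s*>");
    // <br> and the end of a paragraph or div are treated as line breaks.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</(p|div)\\s*>");
    // Anything else that looks like a tag.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    private HtmlToText() {}

    static String toPlainText(String html) {
        String text = NON_TEXT_BLOCKS.matcher(html).replaceAll("");
        // Collapse whitespace from the HTML source first, so only markup decides line breaks.
        text = text.replaceAll("\\s+", " ");
        text = LINE_BREAKS.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Decode a few common entities; &nbsp; becomes an ordinary space here,
        // and &amp; goes last so that "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Remove stray spaces around the line breaks.
        text = text.replaceAll(" *\n *", "\n");
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello&nbsp;Bob,</p><p>Line one<br>Line two</p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

With the sample input in main, the CSS rule disappears and the text is printed on three lines: "Hello Bob,", "Line one" and "Line two".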