Tip: Cleanup HTML
Just a few lines of code I put together to clean up HTML input by removing carriage returns and redundant whitespace (as much as possible). HtmlAgilityPack takes care of missing closing tags. There is one small bug in HtmlAgilityPack, though: it doesn't close unclosed <p> tags.
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace iFrameWorx.Core.Utilities.Extensions.Html
{
    public static class CleanupHtmlExtension
    {
        public static string CleanupHtml(this string input)
        {
            try
            {
                // HtmlAgilityPack takes care of nesting and missing closing tags.
                var html = new HtmlDocument
                {
                    OptionFixNestedTags = true,
                    OptionAutoCloseOnEnd = true
                };
                html.LoadHtml(input);

                // Remove whitespace between two adjacent tags.
                var cleanHtml = Regex.Replace(html.DocumentNode.OuterHtml, @">\s*<", "><", RegexOptions.IgnoreCase | RegexOptions.Multiline);

                // Remove whitespace directly after an opening tag (tags containing '/' are skipped).
                cleanHtml = Regex.Replace(cleanHtml, @"(<[^/>]*>)\s*", "$1", RegexOptions.IgnoreCase | RegexOptions.Multiline);

                // Remove line breaks. They are dropped outright, so words separated
                // only by a line break end up joined.
                cleanHtml = Regex.Replace(cleanHtml, "\r\n|\r|\n", string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);

                // Collapse any remaining run of whitespace into a single space.
                cleanHtml = Regex.Replace(cleanHtml, @"\s+", " ", RegexOptions.Multiline | RegexOptions.IgnoreCase);

                return cleanHtml;
            }
            catch
            {
                // ignore error, return input
            }

            return input;
        }
    }
}
The one thing I can't clean up is whitespace in front of a tag, because removing it would break up the text.
If I added a rule that turns "...end of the sentence. </p>" into "...end of the sentence.</p>", the same rule would also turn "...middle of the sentence <i>some word</i>..." into "...middle of the sentence<i>some word</i>...", which is obviously wrong, so it's better to leave this as is.
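To make that trade-off concrete, here is a minimal sketch that uses only Regex, without HtmlAgilityPack. It is not part of the extension above: the class name, both method names and the list of block-level tags are hypothetical, chosen for illustration. The first method is the naive rule described above; the second is one possible compromise that only touches whitespace in front of closing block-level tags, assuming you can agree on such a list for your input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceBeforeTagDemo
{
    // Hypothetical list of block-level tags; adjust it to the markup you expect.
    private const string BlockTags = "p|li|div|ul|ol|h[1-6]";

    // Naive rule: drop whitespace in front of every tag, opening or closing.
    public static string StripBeforeAnyTag(string html)
    {
        return Regex.Replace(html, @"\s+(?=</?[a-zA-Z])", string.Empty);
    }

    // Narrower rule: drop whitespace only in front of closing block-level tags.
    public static string StripBeforeClosingBlockTag(string html)
    {
        return Regex.Replace(
            html,
            @"\s+(?=</(?:" + BlockTags + @")>)",
            string.Empty,
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        const string sample = "<p>middle of the sentence <i>some word</i> end of the sentence. </p>";

        // Prints: <p>middle of the sentence<i>some word</i> end of the sentence.</p>
        // The words in front of <i> are now glued to the tag.
        Console.WriteLine(StripBeforeAnyTag(sample));

        // Prints: <p>middle of the sentence <i>some word</i> end of the sentence.</p>
        // Only the space in front of </p> is gone.
        Console.WriteLine(StripBeforeClosingBlockTag(sample));
    }
}
```

The narrower rule is still an approximation: it depends entirely on the tag list, and it does nothing about whitespace in front of inline closing tags, where removing a space can change the text just as easily.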
Note the whitespace before the closing </li> tag in the expected output of the test below.
using iFrameWorx.Core.Utilities.Extensions.Html;
using NUnit.Framework;

namespace iFrameWorx.Core.UnitTests.Extensions
{
    [TestFixture]
    public class CleanupHtmlUnitTest
    {
        [Test]
        public void CleanupHtmlExtension()
        {
            const string input = @"<b> <img src='http://datanews.rnews.be/images/resized/119/501/349/393/9/200_0_KEEP_RATIO_SCALE_CENTER_FFFFFF.jpg'> <u><li> De Amerikaanse zangeres Lady Gaga is de eerste beroemdheid geworden van wie de berichtjes op Twitter gevolgd worden door meer dan tien miljoen internetgebruikers. </li> ";

            var s = input.CleanupHtml();

            Assert.AreEqual(@"<b><img src='http://datanews.rnews.be/images/resized/119/501/349/393/9/200_0_KEEP_RATIO_SCALE_CENTER_FFFFFF.jpg'><u><li>De Amerikaanse zangeres Lady Gaga is de eerste beroemdheid geworden van wie de berichtjes op Twitter gevolgd worden door meer dan tien miljoen internetgebruikers. </li></u></b>", s);
        }
    }
}