Tip: Cleanup HTML

Sep 30, 2011 22:38:54   //   Geert Van Huychem   //   Freebies   //   0 comments

Just a few lines of code I put together to clean up HTML input by removing carriage returns and redundant whitespace (as much as possible...). HtmlAgilityPack takes care of missing closing tags. There's a little bug here in HtmlAgilityPack, though: it doesn't close unclosed <p> tags...

    using System.Text.RegularExpressions;

    using HtmlAgilityPack;

    namespace iFrameWorx.Core.Utilities.Extensions.Html
    {
        public static class CleanupHtmlExtension
        {
            public static string CleanupHtml(this string input)
            {
                try
                {
                    // Let HtmlAgilityPack repair nesting and close tags left open at the end.
                    var html = new HtmlDocument { OptionFixNestedTags = true, OptionAutoCloseOnEnd = true };

                    html.LoadHtml(input);

                    // Remove whitespace between two adjacent tags.
                    var cleanHtml = Regex.Replace(html.DocumentNode.OuterHtml, @">\s*<", "><");

                    // Remove whitespace right after any tag that contains no '/' (closing tags are skipped).
                    cleanHtml = Regex.Replace(cleanHtml, @"(<[^/>]*>)\s*", "$1");

                    // Replace line breaks with a space so words on adjacent lines are not glued together.
                    cleanHtml = Regex.Replace(cleanHtml, "\r\n|\r|\n", " ");

                    // Collapse the remaining runs of whitespace into a single space.
                    cleanHtml = Regex.Replace(cleanHtml, @"\s+", " ");

                    return cleanHtml;
                }
                catch
                {
                    //  ignore error, return input
                }

                return input;
            }
        }
    }
                                

The one thing I can't clean up is whitespace followed by a tag, because removing it would glue together words that belong apart.

If I add the functionality to replace "...end of the sentence. </p>" with "...end of the sentence.</p>", that would also mean that "...middle of the sentence <i>some word</i>..." would be replaced with "...middle of the sentence<i>some word</i>...", which is obviously wrong, so it's better to leave this 'as is'.
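To make the trade-off concrete, here is a minimal sketch of what such an extra rule might look like. The class WhitespaceBeforeTagDemo and the helper StripBeforeTags are hypothetical names made up for this illustration; they are not part of CleanupHtml, and the pattern is only one possible way to write the rule.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceBeforeTagDemo
{
    // Hypothetical helper, NOT part of CleanupHtml:
    // strips any whitespace that directly precedes a tag.
    public static string StripBeforeTags(string html)
    {
        return Regex.Replace(html, @"\s+<", "<");
    }

    public static void Main()
    {
        // Wanted: the space before the closing </p> disappears.
        Console.WriteLine(StripBeforeTags("end of the sentence. </p>"));
        // prints: end of the sentence.</p>

        // Unwanted: the space that separated two words disappears as well.
        Console.WriteLine(StripBeforeTags("middle of the sentence <i>some word</i> continues"));
        // prints: middle of the sentence<i>some word</i> continues
    }
}
```

The same pattern produces both results, so a plain regex cannot tell the harmless case from the harmful one without knowing which tags are inline.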

Note the whitespace before </li> in the expected string of the Assert.AreEqual call (line 24 of the test below).

    using iFrameWorx.Core.Utilities.Extensions.Html;

    using NUnit.Framework;

    namespace iFrameWorx.Core.UnitTests.Extensions
    {
        [TestFixture]
        public class CleanupHtmlUnitTest
        {
            [Test]
            public void CleanupHtmlExtension()
            {
                const string input = @"<b>

                    <img src='http://datanews.rnews.be/images/resized/119/501/349/393/9/200_0_KEEP_RATIO_SCALE_CENTER_FFFFFF.jpg'>

                        <u><li>
                    De Amerikaanse zangeres Lady Gaga is de eerste beroemdheid geworden van wie de berichtjes op Twitter gevolgd worden door meer dan tien miljoen internetgebruikers.
                            </li>
                    ";

                var s = input.CleanupHtml();

                Assert.AreEqual(@"<b><img src='http://datanews.rnews.be/images/resized/119/501/349/393/9/200_0_KEEP_RATIO_SCALE_CENTER_FFFFFF.jpg'><u><li>De Amerikaanse zangeres Lady Gaga is de eerste beroemdheid geworden van wie de berichtjes op Twitter gevolgd worden door meer dan tien miljoen internetgebruikers. </li></u></b>", s);
            }
        }
    }
                                
