From: Ammar Faizi
To: Muhammad Rizki
Cc: Alviro Iskandar Setiawan, GNU/Weeb Mailing List
Date: Wed, 19 Oct 2022 23:59:24 +0700
Subject: Re: [PATCH v1 4/7] atom: Improve fix_utf8_char()
Message-ID: <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>
In-Reply-To: <20221018081635.1617-5-kiizuha@gnuweeb.org>
References: <20221018081635.1617-1-kiizuha@gnuweeb.org> <20221018081635.1617-5-kiizuha@gnuweeb.org>

On 10/18/22 3:16 PM, Muhammad Rizki wrote:
> -def fix_utf8_char(text: str, html_escape: bool = True):
> +def fix_utf8_char(text: str, unescape: bool = True):
>      t = text.rstrip().replace("�"," ")
> -    if html_escape:
> -        t = html.escape(html.escape(text))
> +    if unescape:
> +        t = html.unescape(html.unescape(text))
> +        reg = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
> +        t = reg.sub('', t)
>      return t

You do html.unescape() twice, then remove all HTML special chars and tags.
I don't understand why we should do that. Can you explain a bit on what is
going on here?

-- 
Ammar Faizi
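
For reference, here is a minimal sketch of what the patched function appears to do, so the question above can be checked against concrete inputs. It assumes the two regex lines sit inside the `if unescape:` branch (the quoted diff does not preserve indentation, so this is a guess). The name TAG_OR_ENTITY and all sample strings are made up for illustration and are not from the patch.

```python
import html
import re

# Pattern copied from the quoted patch: matches anything that looks like an
# HTML tag, or an entity reference such as "&amp;", "&#38;" or "&#x26;".
# The constant name is hypothetical; the patch compiles it inline as `reg`.
TAG_OR_ENTITY = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def fix_utf8_char(text: str, unescape: bool = True) -> str:
    # "\ufffd" is the U+FFFD replacement character shown in the quoted patch.
    t = text.rstrip().replace("\ufffd", " ")
    if unescape:
        # As quoted, this starts again from `text`, so the rstrip()/replace()
        # result above is discarded whenever unescape is True.
        t = html.unescape(html.unescape(text))
        # Assumption: the regex step belongs to this branch.
        t = TAG_OR_ENTITY.sub('', t)
    return t


# Doubly escaped markup is decoded twice, then the resulting tags are removed.
print(repr(fix_utf8_char("&amp;lt;b&amp;gt;hi&amp;lt;/b&amp;gt;")))  # 'hi'

# Literal "<" and ">" in ordinary text are treated as one tag and removed
# together with everything between them.
print(repr(fix_utf8_char("a &lt; b and c &gt; d")))  # 'a  d'

# An entity that is still escaped after two decoding passes is deleted.
print(repr(fix_utf8_char("R&amp;amp;amp;D")))  # 'RD'
```

The second and third calls show the kind of data loss the review question is pointing at: text that only resembles a tag or entity is dropped, not just real markup.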