From: Muhammad Rizki
Date: Thu, 20 Oct 2022 00:23:13 +0700
Subject: Re: [PATCH v1 4/7] atom: Improve fix_utf8_char()
To: Ammar Faizi
Cc: Alviro Iskandar Setiawan, GNU/Weeb Mailing List
References: <20221018081635.1617-1-kiizuha@gnuweeb.org> <20221018081635.1617-5-kiizuha@gnuweeb.org> <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>
In-Reply-To: <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>

On 19/10/2022 23.59, Ammar Faizi wrote:
> On 10/18/22 3:16 PM, Muhammad Rizki wrote:
>> -def fix_utf8_char(text: str, html_escape: bool = True):
>> +def fix_utf8_char(text: str, unescape: bool = True):
>>      t = text.rstrip().replace("�"," ")
>> -    if html_escape:
>> -        t = html.escape(html.escape(text))
>> +    if unescape:
>> +        t = html.unescape(html.unescape(text))
>> +        reg = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
>> +        t = reg.sub('', t)
>>      return t
>
> You do html.unescape() twice, then remove all HTML special chars and
> tags. I don't understand why we should do that. Can you explain a bit
> on what is going on here?
>

You said earlier that HTML tags in the email payload should be emptied
or removed, so I added the re.sub() call to strip them. I can't
remember where you said that, though.
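
For reference, here is a minimal sketch of what the new `unescape` branch does to a payload. It reuses the pattern from the patch as-is. The names `TAG_OR_ENTITY` and `strip_markup` are made up for this illustration and are not part of the patch.

```python
import html
import re

# Same pattern as in the patch: either an HTML tag (<...>) or a
# character entity (&name; / &#123; / &#x1f;).
# NOTE: TAG_OR_ENTITY and strip_markup are hypothetical names.
TAG_OR_ENTITY = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def strip_markup(text: str) -> str:
    """Unescape twice, then delete tags and any entities still left."""
    # Two passes turn doubly escaped input such as "&amp;lt;" into "<".
    t = html.unescape(html.unescape(text))
    # Every match (tag or leftover entity) is replaced by an empty string.
    return TAG_OR_ENTITY.sub('', t)
```

With this, `strip_markup("&amp;lt;b&amp;gt;hi&amp;lt;/b&amp;gt;")` gives `"hi"`. One side effect worth noting: a plain-text payload such as `"1 < 2 and 3 > 1"` also loses everything between the `<` and the `>`, because `<.*?>` cannot tell a tag from ordinary comparison signs.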