Date: Thu, 20 Oct 2022 00:35:58 +0700
Subject: Re: [PATCH v1 4/7] atom: Improve fix_utf8_char()
To: Ammar Faizi
Cc: Alviro Iskandar Setiawan, GNU/Weeb Mailing List
References: <20221018081635.1617-1-kiizuha@gnuweeb.org>
 <20221018081635.1617-5-kiizuha@gnuweeb.org>
 <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>
 <785390ce-e3ed-f1eb-dec6-a383563d139b@gnuweeb.org>
From: Muhammad Rizki
In-Reply-To: <785390ce-e3ed-f1eb-dec6-a383563d139b@gnuweeb.org>

On 20/10/2022 00.27, Ammar Faizi wrote:
> On 10/20/22 12:23 AM, Muhammad Rizki wrote:
>> On 19/10/2022 23.59, Ammar Faizi wrote:
>>> On 10/18/22 3:16 PM, Muhammad Rizki wrote:
>>>> -def fix_utf8_char(text: str, html_escape: bool = True):
>>>> +def fix_utf8_char(text: str, unescape: bool = True):
>>>>       t = text.rstrip().replace("�"," ")
>>>> -    if html_escape:
>>>> -        t = html.escape(html.escape(text))
>>>> +    if unescape:
>>>> +        t = html.unescape(html.unescape(text))
>>>> +        reg = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
>>>> +        t = reg.sub('', t)
>>>>       return t
>>>
>>> You do html.unescape() twice, then remove all HTML special chars and
>>> tags. I don't understand why we should do that. Can you explain a bit
>>> on what is going on here?
>>>
>>
>> You said an HTML tag in the email payload should be empty or removed,
>> so I created the re.sub() to remove the HTML tag. I forgot where you
>> said that.
>
> How so?
>
> I don't think I said that. Won't this patch corrupt the email
> if it contains HTML special chars?
>

Ugh, I hate having to dig through the chat to find proof.

So, do you want the re.sub() removed or not?
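
To make the question about HTML special chars concrete, here is a minimal sketch of what the unescape branch in the quoted diff would do to a few plain-text lines. It assumes the branch behaves exactly as quoted; the helper name strip_like_patch() and the sample strings are hypothetical and only for illustration, not part of the patch.

```python
import html
import re

# Same pattern as in the quoted diff.
REG = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def strip_like_patch(text: str) -> str:
    # Hypothetical stand-in for the unescape branch of fix_utf8_char():
    # unescape twice, then delete anything that looks like a tag or entity.
    t = html.unescape(html.unescape(text))
    return REG.sub('', t)


# Hypothetical email payload lines that contain HTML special chars.
samples = [
    "#include <stdio.h>",
    "if (a < b && c > d) return;",
    "The &lt;b&gt; tag makes text bold",
]

for s in samples:
    # Print before/after so any lost text is easy to spot.
    print(repr(s), "->", repr(strip_like_patch(s)))
```

In this sketch, anything between a "<" and the next ">" is dropped, including a C header name or part of a comparison, and an escaped tag that was meant to be shown literally is unescaped first and then deleted as well.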