From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kiizuha@gnuweeb.org>
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on gnuweeb.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.8 required=5.0 tests=ALL_TRUSTED,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A,NO_DNS_FOR_FROM,
	URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6
Received: from [192.168.1.2] (unknown [101.128.125.123])
	by gnuweeb.org (Postfix) with ESMTPSA id D46758093B;
	Wed, 19 Oct 2022 17:23:16 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gnuweeb.org;
	s=default; t=1666200197;
	bh=SslIvTgFPHoOoDsThTdvWAZtjvA4zYe4TURZ8M6AnQw=;
	h=Date:Subject:To:Cc:References:From:In-Reply-To:From;
	b=EXUrRMNH6lSNbFxig4BUMu8zj85yUNcqjyZrUV8hDfht6um+H42ciAfxxzDo6DUT9
	 0KWLTDzKwJ4cdi9r/od2cwJLDLMoyg6BGbuXUYK00XH5YJob2zdPVcJtoT+pLWDFsE
	 /7bcUHQoIkd8bcpNiM4anZJDyNtR/nV/9d/Xg2VkJpEzBVg1pt/dURBh/HPXTN6KWQ
	 NEdBN3PWl+A++UITrQoxPWUAJm6yeLAPDgdC6bPQb13erCI7TAwr0cmkhhJWeLth+m
	 ffeShnaeVKoPw5adZEKyJF0m6huhmHzfszrNMRb9oG9Xn4ntiOxeonEFA16ov07l0Z
	 2E3W+siKZ+WFg==
Message-ID: <e28dc200-ea05-815f-5170-a5e7b68346cc@gnuweeb.org>
Date: Thu, 20 Oct 2022 00:23:13 +0700
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.3.3
Subject: Re: [PATCH v1 4/7] atom: Improve fix_utf8_char()
To: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Cc: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org>,
 GNU/Weeb Mailing List <gwml@vger.gnuweeb.org>
References: <20221018081635.1617-1-kiizuha@gnuweeb.org>
 <20221018081635.1617-5-kiizuha@gnuweeb.org>
 <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>
Content-Language: en-US
From: Muhammad Rizki <kiizuha@gnuweeb.org>
In-Reply-To: <14303851-8483-0737-8edc-649ee121f0ee@gnuweeb.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
List-Id: <gwml.vger.gnuweeb.org>

On 19/10/2022 23.59, Ammar Faizi wrote:
> On 10/18/22 3:16 PM, Muhammad Rizki wrote:
>> -def fix_utf8_char(text: str, html_escape: bool = True):
>> +def fix_utf8_char(text: str, unescape: bool = True):
>>       t = text.rstrip().replace("�"," ")
>> -    if html_escape:
>> -        t = html.escape(html.escape(text))
>> +    if unescape:
>> +        t = html.unescape(html.unescape(text))
>> +        reg = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
>> +        t = reg.sub('', t)
>>       return t
> 
> You do html.unescape() twice, then remove all HTML special chars and
> tags. I don't understand why we should do that. Can you explain a bit
> on what is going on here?
> 

You said that HTML tags in the email payload should be emptied or
removed, so I added the re.sub() call to strip them. I don't remember
where you said that, though.
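
To make the behavior concrete, here is a minimal standalone sketch of
what the unescape-then-strip step does. It assumes the payload can
arrive double-escaped, which is my guess at why html.unescape() is
called twice. The helper name strip_html() and the sample string are
made up for illustration and are not part of the patch.

```python
import html
import re

# Same pattern as in the patch: an HTML tag, or any entity that is
# still left after unescaping.
TAG_OR_ENTITY = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def strip_html(text: str) -> str:
    # Hypothetical helper, not the real fix_utf8_char().
    # Unescape twice so that "&amp;lt;b&amp;gt;" first becomes
    # "&lt;b&gt;" and then "<b>".
    t = html.unescape(html.unescape(text))
    # Drop the tags (and leftover entities) that are now visible.
    return TAG_OR_ENTITY.sub('', t)


# Made-up, double-escaped sample payload.
sample = "&amp;lt;b&amp;gt;hello&amp;lt;/b&amp;gt; world"
print(strip_html(sample))  # hello world
```

With a single html.unescape() the sample would still contain
"&lt;b&gt;", so the tag part of the regex would not match and the
entity part would remove only "&lt;" and "&gt;", leaving "bhello/b
world".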