GNU/Weeb Mailing List <[email protected]>
* [PATCH fb v1 0/6] Introducing cache for the Facebook scraper
@ 2023-05-09 10:46 Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 1/6] fb: Introduce `getCache()` and `setCache()` functions Ammar Faizi
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

Hi,

This series introduces a cache mechanism to speed up the web API. It
eases the pain of developing an app that uses the API, and it greatly
reduces the number of requests sent to the same endpoint within a short
period of time.

There are 6 patches in this series:

Patch #1: Introduce `getCache()` and `setCache()`.
A preparation patch to implement a better caching mechanism. All
methods that need caching will call these functions.

Patch #2: Replace old cache mechanism in `getTimelineYears()`.
Simplify the caching mechanism and make the `getTimelineYears()`
cache private to itself. This also means that the endpoint
"action=getTimelineYears" will utilize the cache.

Patch #3, #4: Implement cache in `getPost()` and `getTimelinePosts()`.
Make repeated calls within a short period fast.

Patch #5: Introduce `clearExpiredCaches()`.
When a cache is expired, it won't be deleted unless getCache() with the
corresponding key is invoked. Introduce a new function to scan for
expired caches and delete them.

Patch #6: Create cron.php to clear cache.
Allow the server to clear expired caches via a small PHP script,
cron.php. Periodically calling clearExpiredCaches() deletes old
expired caches, which saves storage space.

Signed-off-by: Ammar Faizi <[email protected]>
---

The following changes since commit 0d5e59e00359e165778a81f80122bb522f8edb0f:

  Merge branch 'rewrite_url' (Facebook Onion rewrite support) (2023-05-03 18:46:47 +0700)

are available in the Git repository at:

  https://gitlab.torproject.org/ammarfaizi2/Facebook.git dev.cache

for you to fetch changes up to d30f2dad8ca761b5a9c8de32ea48adbbdd201d03:

  fb: web: Create cron.php to clear cache (2023-05-09 17:33:12 +0700)

----------------------------------------------------------------
Ammar Faizi (6):
      fb: Introduce `getCache()` and `setCache()` functions
      fb: Post: Replace old cache mechanism in `getTimelineYears()`
      fb: Post: Implement cache in `getPost()`
      fb: Post: Implement cache in `getTimelinePosts()`
      fb: cache: Introduce `clearExpiredCaches()`
      fb: web: Create cron.php to clear cache

 src/Facebook/Facebook.php     | 99 ++++++++++++++++++++++++++++++----------
 src/Facebook/Methods/Post.php | 74 ++++++++++--------------------
 web/cron.php                  |  9 ++++
 3 files changed, 108 insertions(+), 74 deletions(-)
 create mode 100644 web/cron.php

base-commit: 0d5e59e00359e165778a81f80122bb522f8edb0f
-- 
Ammar Faizi



* [PATCH fb v1 1/6] fb: Introduce `getCache()` and `setCache()` functions
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 2/6] fb: Post: Replace old cache mechanism in `getTimelineYears()` Ammar Faizi
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

A preparation patch to implement a better caching mechanism. All
methods that need caching will call these functions.

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/Facebook/Facebook.php | 45 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
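
Note (not part of the patch): the hunk below stores an `exp` timestamp,
but `getCache()` only checks that the field exists and never compares it
with `time()`. Below is a minimal sketch of a read path that also honors
the expiry, written as standalone functions for illustration. The names
`cache_path()`, `set_cache()`, `get_cache()` and the `$dir` argument are
assumptions; the patch uses private methods and `$this->cache_dir`.
`JSON_INTERNAL_FLAGS` is defined elsewhere in the project, so the sketch
uses the default `json_encode()` flags instead.

```php
<?php
// Hypothetical standalone helpers; in the patch these are private
// methods that read $this->cache_dir instead of taking a $dir argument.

function cache_path(string $dir, string $key): string
{
	// Same sanitization as the patch: the key cannot leave $dir.
	$key = str_replace(["/", "\\"], "_", $key);
	return "{$dir}/{$key}.json";
}

function set_cache(string $dir, string $key, $data, int $expire = 600): void
{
	if (!is_dir($dir) && !@mkdir($dir, 0777, true) && !is_dir($dir)) {
		throw new \RuntimeException("Unable to create cache directory: {$dir}");
	}
	$payload = json_encode(["exp" => time() + $expire, "data" => $data]);
	file_put_contents(cache_path($dir, $key), $payload, LOCK_EX);
}

function get_cache(string $dir, string $key)
{
	$file = cache_path($dir, $key);
	if (!is_file($file)) {
		return null;
	}

	$entry = json_decode((string) file_get_contents($file), true);
	$valid = is_array($entry) && isset($entry["exp"], $entry["data"]);
	if (!$valid || $entry["exp"] < time()) {
		// Malformed or expired: drop the file and report a miss.
		@unlink($file);
		return null;
	}

	return $entry["data"];
}
```

With a check like this, an entry written with the default `$expire` of
600 seconds is treated as a miss once it is older than that, and the
stale file is removed on the next read of the same key.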

diff --git a/src/Facebook/Facebook.php b/src/Facebook/Facebook.php
index 2fe33f3b7cb6e9ff..6411c26709c24307 100644
--- a/src/Facebook/Facebook.php
+++ b/src/Facebook/Facebook.php
@@ -287,4 +287,49 @@ class Facebook
 
 		return $url;
 	}
+
+	/**
+	 * @param  string $key
+	 * @param  mixed  $data
+	 * @param  int    $expire
+	 * @return void
+	 */
+	private function setCache(string $key, $data, int $expire = 600): void
+	{
+		$key = str_replace(["/", "\\"], "_", $key);
+		$data = [
+			"exp"  => time() + $expire,
+			"data" => $data
+		];
+		$data = json_encode($data, JSON_INTERNAL_FLAGS);
+		if (!is_dir($this->cache_dir)) {
+			mkdir($this->cache_dir, 0777, true);
+			if (!is_dir($this->cache_dir)) {
+				throw new \Exception("Unable to create cache directory: {$this->cache_dir}");
+			}
+		}
+		file_put_contents("{$this->cache_dir}/{$key}.json", $data);
+	}
+
+	/**
+	 * @param  string $key
+	 * @return mixed
+	 */
+	private function getCache(string $key)
+	{
+		$key = str_replace(["/", "\\"], "_", $key);
+		$file = "{$this->cache_dir}/{$key}.json";
+
+		if (!file_exists($file)) {
+			return NULL;
+		}
+
+		$data = json_decode(file_get_contents($file), true);
+		if (!isset($data["exp"]) || !isset($data["data"])) {
+			unlink($file);
+			return NULL;
+		}
+
+		return $data["data"];
+	}
 }
-- 
Ammar Faizi



* [PATCH fb v1 2/6] fb: Post: Replace old cache mechanism in `getTimelineYears()`
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 1/6] fb: Introduce `getCache()` and `setCache()` functions Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 3/6] fb: Post: Implement cache in `getPost()` Ammar Faizi
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

`getTimelineYears()` always fetches the endpoint online and then sets
the cache based on the fetch result, so it never benefits from the
cache itself. The `getTimelinePosts()` method, on the other hand,
always tries to read the cache that `getTimelineYears()` sets.

Now, simplify the caching mechanism and make the `getTimelineYears()`
cache private to itself. This also means that the endpoint
"action=getTimelineYears" will utilize the cache.

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/Facebook/Facebook.php     | 24 ---------------
 src/Facebook/Methods/Post.php | 58 +++++------------------------------
 2 files changed, 8 insertions(+), 74 deletions(-)
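
Note (not part of the patch): after this change `getTimelineYears()`
follows a read-through pattern: try the cache, otherwise fetch and store
the result. A minimal sketch of that sequence, assuming a hypothetical
helper `read_through()` and in-memory stand-ins for `getCache()` and
`setCache()`; the key and the returned data are made up for the demo.

```php
<?php

/**
 * Hypothetical helper, not part of the patch.
 *
 * $get   stands in for getCache(), $set for setCache(),
 * $fetch for the code that performs the real HTTP request.
 */
function read_through(string $key, callable $get, callable $set, callable $fetch): array
{
	$hit = $get($key);
	if (is_array($hit)) {
		return $hit;            // cache hit: no network request
	}

	$fresh = $fetch();
	if (count($fresh) > 0) {    // as in the patch, empty results are not cached
		$set($key, $fresh);
	}
	return $fresh;
}

// In-memory stand-ins, for demonstration only.
$store = [];
$get = function (string $k) use (&$store) { return $store[$k] ?? null; };
$set = function (string $k, array $v) use (&$store): void { $store[$k] = $v; };

$calls = 0;
$fetch = function () use (&$calls): array {
	$calls++;
	return [2023 => "/profile.php?year=2023"];   // made-up data
};

$key    = "getTimelineYears" . "someuser";
$first  = read_through($key, $get, $set, $fetch);
$second = read_through($key, $get, $set, $fetch);   // served from $store
```

The second call returns the stored array without invoking `$fetch`,
which is the effect this patch gives the "action=getTimelineYears"
endpoint.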

diff --git a/src/Facebook/Facebook.php b/src/Facebook/Facebook.php
index 6411c26709c24307..16970714ddce7bdc 100644
--- a/src/Facebook/Facebook.php
+++ b/src/Facebook/Facebook.php
@@ -115,30 +115,6 @@ class Facebook
 		}
 	}
 
-	/**
-	 * @param string $username
-	 * @return string
-	 */
-	public function getUserCacheDir(string $username): string
-	{
-		$ret = $this->cache_dir."/".$username;
-		if (!is_dir($ret)) {
-			if (!mkdir($ret, 0755, true)) {
-				throw new \Exception("Cannot create user cache directory: {$ret}");
-			}
-		}
-
-		if (!is_writable($ret)) {
-			throw new \Exception("User cache directory is not writable: {$ret}");
-		}
-
-		if (!is_readable($ret)) {
-			throw new \Exception("User cache directory is not readable: {$ret}");
-		}
-
-		return $ret;
-	}
-
 	/**
 	 * @param string $user_agent
 	 * @return void
diff --git a/src/Facebook/Methods/Post.php b/src/Facebook/Methods/Post.php
index 988739568dddb9cb..81017c9122e6c341 100644
--- a/src/Facebook/Methods/Post.php
+++ b/src/Facebook/Methods/Post.php
@@ -38,66 +38,28 @@ trait Post
 		return $years;
 	}
 
-	/**
-	 * Cache timeline year links. 
-	 *
-	 * @param  string $username
-	 * @param  array  $years
-	 * @return void
-	 */
-	private function setCacheTimelineYears(string $username, array $years)
-	{
-		$years = json_encode($years, JSON_INTERNAL_FLAGS);
-		$dir = $this->getUserCacheDir($username);
-		file_put_contents("{$dir}/timeline_years.json", $years);
-	}
-
-	/**
-	 * @param  string $username
-	 * @return array|null
-	 */
-	private function getCacheTimelineYears(string $username): ?array
-	{
-		$dir = $this->getUserCacheDir($username);
-		$file = "{$dir}/timeline_years.json";
-
-		if (!file_exists($file)) {
-			return NULL;
-		}
-
-		/*
-		 * Max cache time: 10 minutes.
-		 */
-		if (time() - filemtime($file) > 600) {
-			unlink($file);
-			return NULL;
-		}
-
-		$years = json_decode(file_get_contents($file), true);
-		if (!is_array($years)) {
-			return NULL;
-		}
-
-		return $years;
-	}
-
 	/**
 	 * @param  string $username
 	 * @return array
 	 */
 	public function getTimelineYears(string $username): array
 	{
+		$cacheKey = __METHOD__.$username;
 		$username = trim($username);
 		if ($username === "") {
 			throw new \Exception("Username cannot be empty!");
 		}
 
+		$years = $this->getCache($cacheKey);
+		if (is_array($years))
+			return $years;
+
 		$username = urlencode($username);
 		$o = $this->http("/profile.php?id={$username}", "GET");
 		try {
 			$ret = $this->parseTimelineYears($o["out"]);
 			if (count($ret) > 0) {
-				$this->setCacheTimelineYears($username, $ret);
+				$this->setCache($cacheKey, $ret);
 				return $ret;
 			}
 		} catch (\Exception $e) {
@@ -118,7 +80,7 @@ trait Post
 
 		$ret = $this->parseTimelineYears($o);
 		if (count($ret) > 0) {
-			$this->setCacheTimelineYears($username, $ret);
+			$this->setCache($cacheKey, $ret);
 		}
 
 		return $ret;
@@ -134,11 +96,7 @@ trait Post
 	 */
 	public function getTimelinePosts(string $username, int $year = -1, bool $take_content = false, int $limit = -1): array
 	{
-		$years = $this->getCacheTimelineYears($username);
-		if (!is_array($years)) {
-			$years = $this->getTimelineYears($username);
-		}
-
+		$years = $this->getTimelineYears($username);
 		if ($year === -1) {
 			$year = max(array_keys($years));
 		}
-- 
Ammar Faizi



* [PATCH fb v1 3/6] fb: Post: Implement cache in `getPost()`
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 1/6] fb: Introduce `getCache()` and `setCache()` functions Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 2/6] fb: Post: Replace old cache mechanism in `getTimelineYears()` Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 4/6] fb: Post: Implement cache in `getTimelinePosts()` Ammar Faizi
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

Cache the result of `getPost()`, keyed by the post ID, so that repeated
calls within a short period skip the network request.

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/Facebook/Methods/Post.php | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/Facebook/Methods/Post.php b/src/Facebook/Methods/Post.php
index 81017c9122e6c341..3cf5d7e9896e74ce 100644
--- a/src/Facebook/Methods/Post.php
+++ b/src/Facebook/Methods/Post.php
@@ -353,6 +353,13 @@ trait Post
 	 */
 	public function getPost(string $post_id): array
 	{
+		$cacheKey = __METHOD__.$post_id;
+
+		$ret = $this->getCache($cacheKey);
+		if ($ret) {
+			return $ret;
+		}
+
 		/**
 		 * $post_id must be numeric or a string starts with "pfbid".
 		 */
@@ -372,9 +379,11 @@ trait Post
 		$content = $this->parsePostContent($o);
 		$content["embedded_link"] = $this->parseEmbeddedLink($orig);
 
-		return [
+		$ret = [
 			"content" => $content,
 			"info"    => $info
 		];
+		$this->setCache($cacheKey, $ret);
+		return $ret;
 	}
 }
-- 
Ammar Faizi



* [PATCH fb v1 4/6] fb: Post: Implement cache in `getTimelinePosts()`
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
                   ` (2 preceding siblings ...)
  2023-05-09 10:46 ` [PATCH fb v1 3/6] fb: Post: Implement cache in `getPost()` Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 5/6] fb: cache: Introduce `clearExpiredCaches()` Ammar Faizi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

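Cache the result of `getTimelinePosts()`, keyed by the username, year,
`$take_content` flag, and limit, so that repeated calls within a short
period skip the network requests.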

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/Facebook/Methods/Post.php | 7 +++++++
 1 file changed, 7 insertions(+)
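
Note (not part of the patch): the cache key in the hunk below joins the
username and the year without a separator, so two different argument
sets can map to the same key. Whether this matters in practice depends
on which inputs the web endpoint accepts. A minimal sketch of one way to
keep the parts distinct, assuming a hypothetical helper
`make_cache_key()`; the method name string and the arguments are made up
for the demo.

```php
<?php

// Hypothetical helper, not part of the patch.
function make_cache_key(string $method, ...$parts): string
{
	// json_encode() keeps part boundaries and types distinct;
	// sha1() gives a fixed-length, filesystem-safe suffix.
	$prefix = str_replace(["\\", "/", ":"], "_", $method);
	return $prefix . "_" . sha1(json_encode($parts));
}

$m = "Post::getTimelinePosts";

// Plain concatenation, in the same shape as the hunk below.
$concat = function (string $u, int $y, bool $c, int $l) use ($m): string {
	return $m . $u . $y . ($c ? 1 : 0) . sprintf("%010d", $l);
};

$a = $concat("user1", 2, false, -1);
$b = $concat("user", 12, false, -1);    // same string as $a

$ka = make_cache_key($m, "user1", 2, false, -1);
$kb = make_cache_key($m, "user", 12, false, -1);   // differs from $ka
```

Hashing also bounds the file name length, at the cost of keys that can
no longer be read back from the cache directory listing.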

diff --git a/src/Facebook/Methods/Post.php b/src/Facebook/Methods/Post.php
index 3cf5d7e9896e74ce..7fe38c5c1b982c72 100644
--- a/src/Facebook/Methods/Post.php
+++ b/src/Facebook/Methods/Post.php
@@ -96,6 +96,12 @@ trait Post
 	 */
 	public function getTimelinePosts(string $username, int $year = -1, bool $take_content = false, int $limit = -1): array
 	{
+		$cacheKey = __METHOD__.$username.$year.($take_content ? 1 : 0).sprintf("%010d", $limit);
+
+		$posts = $this->getCache($cacheKey);
+		if (is_array($posts))
+			return $posts;
+
 		$years = $this->getTimelineYears($username);
 		if ($year === -1) {
 			$year = max(array_keys($years));
@@ -155,6 +161,7 @@ trait Post
 			];
 		}
 
+		$this->setCache($cacheKey, $posts);
 		return $posts;
 	}
 
-- 
Ammar Faizi



* [PATCH fb v1 5/6] fb: cache: Introduce `clearExpiredCaches()`
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
                   ` (3 preceding siblings ...)
  2023-05-09 10:46 ` [PATCH fb v1 4/6] fb: Post: Implement cache in `getTimelinePosts()` Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 10:46 ` [PATCH fb v1 6/6] fb: web: Create cron.php to clear cache Ammar Faizi
  2023-05-09 11:06 ` [PATCH fb v1 0/6] Introducing cache for the Facebook scraper GNU/Weeb Facebook Team
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

When a cache is expired, it won't be deleted unless getCache() with the
corresponding key is invoked. Introduce a new function to scan for
expired caches and delete them.

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/Facebook/Facebook.php | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
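
Note (not part of the patch): a minimal standalone sketch of the same
scan, for experimenting outside the class. The function name
`clear_expired_caches()`, the `$dir` and `$now` parameters, and the
returned count are assumptions added for testability. The sketch also
returns early when the directory does not exist yet, a case in which
`scandir()` would return false.

```php
<?php

// Hypothetical standalone variant; the patch implements this as the
// public method clearExpiredCaches() operating on $this->cache_dir.
function clear_expired_caches(string $dir, ?int $now = null): int
{
	if (!is_dir($dir)) {
		return 0;               // nothing cached yet, nothing to scan
	}

	$now = $now ?? time();
	$removed = 0;

	foreach (scandir($dir) ?: [] as $name) {
		$file = "{$dir}/{$name}";
		if (!is_file($file)) {
			continue;           // skips ".", ".." and subdirectories
		}

		$entry = json_decode((string) @file_get_contents($file), true);
		$valid = is_array($entry) && isset($entry["exp"], $entry["data"]);

		// Remove unreadable, malformed, and expired entries alike.
		if (!$valid || $entry["exp"] < $now) {
			$removed += (int) @unlink($file);
		}
	}

	return $removed;
}
```

Passing `$now` explicitly makes the expiry comparison deterministic in
tests; callers that omit it get the current time, as in the patch.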

diff --git a/src/Facebook/Facebook.php b/src/Facebook/Facebook.php
index 16970714ddce7bdc..00aed812693248a8 100644
--- a/src/Facebook/Facebook.php
+++ b/src/Facebook/Facebook.php
@@ -308,4 +308,34 @@ class Facebook
 
 		return $data["data"];
 	}
+
+	/**
+	 * @return void
+	 */
+	public function clearExpiredCaches(): void
+	{
+		$scan = scandir($this->cache_dir);
+		foreach ($scan as $file) {
+			$file = "{$this->cache_dir}/{$file}";
+			if (!is_file($file)) {
+				continue;
+			}
+
+			$data = @file_get_contents($file);
+			if (!$data) {
+				unlink($file);
+				continue;
+			}
+
+			$data = @json_decode($data, true);
+			if (!isset($data["exp"]) || !isset($data["data"])) {
+				unlink($file);
+				continue;
+			}
+
+			if ($data["exp"] < time()) {
+				unlink($file);
+			}
+		}
+	}
 }
-- 
Ammar Faizi



* [PATCH fb v1 6/6] fb: web: Create cron.php to clear cache
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
                   ` (4 preceding siblings ...)
  2023-05-09 10:46 ` [PATCH fb v1 5/6] fb: cache: Introduce `clearExpiredCaches()` Ammar Faizi
@ 2023-05-09 10:46 ` Ammar Faizi
  2023-05-09 11:06 ` [PATCH fb v1 0/6] Introducing cache for the Facebook scraper GNU/Weeb Facebook Team
  6 siblings, 0 replies; 8+ messages in thread
From: Ammar Faizi @ 2023-05-09 10:46 UTC (permalink / raw)
  To: GNU/Weeb FB Team
  Cc: Ammar Faizi, GNU/Weeb Mailing List, Michael William Jonathan

Allow the server to clear expired caches via a small PHP script,
cron.php. Periodically calling clearExpiredCaches() deletes old
expired caches, which saves storage space.

Signed-off-by: Ammar Faizi <[email protected]>
---
 web/cron.php | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 web/cron.php

diff --git a/web/cron.php b/web/cron.php
new file mode 100644
index 0000000000000000..bc183bedea9f4062
--- /dev/null
+++ b/web/cron.php
@@ -0,0 +1,9 @@
+<?php
+
+require __DIR__."/../vendor/autoload.php";
+require __DIR__."/auth.php";
+
+use Facebook\Facebook;
+
+$fb = new Facebook($session_dir);
+$fb->clearExpiredCaches();
-- 
Ammar Faizi



* Re: [PATCH fb v1 0/6] Introducing cache for the Facebook scraper
  2023-05-09 10:46 [PATCH fb v1 0/6] Introducing cache for the Facebook scraper Ammar Faizi
                   ` (5 preceding siblings ...)
  2023-05-09 10:46 ` [PATCH fb v1 6/6] fb: web: Create cron.php to clear cache Ammar Faizi
@ 2023-05-09 11:06 ` GNU/Weeb Facebook Team
  6 siblings, 0 replies; 8+ messages in thread
From: GNU/Weeb Facebook Team @ 2023-05-09 11:06 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: GNU/Weeb Facebook Team, GNU/Weeb Mailing List, Michael William Jonathan

On Tue, 9 May 2023 17:46:52 +0700, Ammar Faizi wrote:

The pull request you sent on Sun, 07 May 2023 18:12:10 +0000:

> https://gitlab.torproject.org/ammarfaizi2/Facebook.git dev.cache

has been merged into ammarfaizi2/Facebook.git:
https://github.com/ammarfaizi2/Facebook/commit/68e95a61956e75ad08ad0bb68f10172fd2883816

Thank you!

[1/6] fb: Introduce `getCache()` and `setCache()` functions
      commit: 88952f396b1b4831eab3b8ed5d71959e42686a88
[2/6] fb: Post: Replace old cache mechanism in `getTimelineYears()`
      commit: 8bc6986c8b802b4b22ba69b86ca0892ff70546e7
[3/6] fb: Post: Implement cache in `getPost()`
      commit: eb5b43a9ac232b4e0d7e973e5feddf1922ca8415
[4/6] fb: Post: Implement cache in `getTimelinePosts()`
      commit: bf957bbe7e23d48360d6d3bb7b96364a25b0148a
[5/6] fb: cache: Introduce `clearExpiredCaches()`
      commit: 38622b9d3c33a44ee6c05cc75bb781a6c6f52cd5
[6/6] fb: web: Create cron.php to clear cache
      commit: cbed859c4a77521dcac840b18d4cf30ef493d747

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


