<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://lms.onnocenter.or.id/wiki/index.php?action=history&amp;feed=atom&amp;title=Scrapping%3A_Save_text_setiap_URL</id>
	<title>Scrapping: Save text setiap URL - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://lms.onnocenter.or.id/wiki/index.php?action=history&amp;feed=atom&amp;title=Scrapping%3A_Save_text_setiap_URL"/>
	<link rel="alternate" type="text/html" href="https://lms.onnocenter.or.id/wiki/index.php?title=Scrapping:_Save_text_setiap_URL&amp;action=history"/>
	<updated>2026-05-03T09:45:48Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://lms.onnocenter.or.id/wiki/index.php?title=Scrapping:_Save_text_setiap_URL&amp;diff=72179&amp;oldid=prev</id>
		<title>Unknown user: /* Tips: */</title>
		<link rel="alternate" type="text/html" href="https://lms.onnocenter.or.id/wiki/index.php?title=Scrapping:_Save_text_setiap_URL&amp;diff=72179&amp;oldid=prev"/>
		<updated>2025-03-28T23:21:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;Tips:&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 23:21, 28 March 2025&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l75&quot;&gt;Line 75:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 75:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Bisa diubah agar simpan ke `.txt` atau `.json` juga.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Bisa diubah agar simpan ke `.txt` atau `.json` juga.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Mau filter halaman yang &amp;#039;&amp;#039;&amp;#039;bukan berita&amp;#039;&amp;#039;&amp;#039;? Bisa ditambahkan regex atau `if &amp;quot;news&amp;quot; in url`.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Mau filter halaman yang &amp;#039;&amp;#039;&amp;#039;bukan berita&amp;#039;&amp;#039;&amp;#039;? Bisa ditambahkan regex atau `if &amp;quot;news&amp;quot; in url`.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;==Pranala Menarik==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* [[Scrapping]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Unknown user</name></author>
	</entry>
	<entry>
		<id>https://lms.onnocenter.or.id/wiki/index.php?title=Scrapping:_Save_text_setiap_URL&amp;diff=72178&amp;oldid=prev</id>
		<title>Unknown user: Created page with &quot;==FITUR:== * Input dari `keywords.txt` * Cari tiap keyword di Google (ambil `top-N` URL) * Kunjungi tiap URL dan ambil kontennya (judul + paragraf) * Simpan semua ke `scraped_...&quot;</title>
		<link rel="alternate" type="text/html" href="https://lms.onnocenter.or.id/wiki/index.php?title=Scrapping:_Save_text_setiap_URL&amp;diff=72178&amp;oldid=prev"/>
		<updated>2025-03-28T23:16:47Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;==FITUR:== * Input dari `keywords.txt` * Cari tiap keyword di Google (ambil `top-N` URL) * Kunjungi tiap URL dan ambil kontennya (judul + paragraf) * Simpan semua ke `scraped_...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;==FITUR:==&lt;br /&gt;
* Input dari `keywords.txt`&lt;br /&gt;
* Cari tiap keyword di Google (ambil `top-N` URL)&lt;br /&gt;
* Kunjungi tiap URL dan ambil kontennya (judul + paragraf)&lt;br /&gt;
* Simpan semua ke `scraped_results.csv`&lt;br /&gt;
&lt;br /&gt;
==Kebutuhan:==&lt;br /&gt;
&lt;br /&gt;
pip install googlesearch-python requests beautifulsoup4&lt;br /&gt;
&lt;br /&gt;
==SCRIPT FULL:==&lt;br /&gt;
&lt;br /&gt;
 from googlesearch import search&lt;br /&gt;
 import requests&lt;br /&gt;
 from bs4 import BeautifulSoup&lt;br /&gt;
 import csv&lt;br /&gt;
 import time&lt;br /&gt;
 &lt;br /&gt;
 def load_keywords(filename):&lt;br /&gt;
     with open(filename, &amp;#039;r&amp;#039;, encoding=&amp;#039;utf-8&amp;#039;) as f:&lt;br /&gt;
         return [line.strip() for line in f if line.strip()]&lt;br /&gt;
 &lt;br /&gt;
 def get_page_content(url):&lt;br /&gt;
     try:&lt;br /&gt;
         headers = {&lt;br /&gt;
             &amp;quot;User-Agent&amp;quot;: &amp;quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64)&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         response = requests.get(url, headers=headers, timeout=10)&lt;br /&gt;
         soup = BeautifulSoup(response.content, &amp;#039;html.parser&amp;#039;) &lt;br /&gt;
 &lt;br /&gt;
         # Ambil judul halaman&lt;br /&gt;
         title = soup.title.string if soup.title else &amp;#039;No Title&amp;#039;&lt;br /&gt;
         &lt;br /&gt;
         # Ambil konten paragraf utama&lt;br /&gt;
         paragraphs = soup.find_all(&amp;#039;p&amp;#039;)&lt;br /&gt;
         text_content = &amp;#039; &amp;#039;.join([p.get_text() for p in paragraphs[:5]])  # Batasi 5 paragraf pertama&lt;br /&gt;
         return title.strip(), text_content.strip()&lt;br /&gt;
 &lt;br /&gt;
     except Exception as e:&lt;br /&gt;
         return &amp;#039;Error&amp;#039;, f&amp;quot;Failed to fetch content: {e}&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 def google_scrape_with_content(keywords, num_results=5, output_file=&amp;#039;scraped_results.csv&amp;#039;):&lt;br /&gt;
     with open(output_file, mode=&amp;#039;w&amp;#039;, newline=&amp;#039;&amp;#039;, encoding=&amp;#039;utf-8&amp;#039;) as file:&lt;br /&gt;
         writer = csv.writer(file)&lt;br /&gt;
         writer.writerow([&amp;#039;Keyword&amp;#039;, &amp;#039;Rank&amp;#039;, &amp;#039;Title&amp;#039;, &amp;#039;URL&amp;#039;, &amp;#039;Content&amp;#039;]) &lt;br /&gt;
 &lt;br /&gt;
         for keyword in keywords:&lt;br /&gt;
             print(f&amp;quot;\n🔍 Searching for: {keyword}&amp;quot;)&lt;br /&gt;
             try:&lt;br /&gt;
                 results = search(keyword, num_results=num_results)&lt;br /&gt;
                 for i, url in enumerate(results):&lt;br /&gt;
                     print(f&amp;quot;  → Fetching: {url}&amp;quot;)&lt;br /&gt;
                     title, content = get_page_content(url)&lt;br /&gt;
                     writer.writerow([keyword, i+1, title, url, content])&lt;br /&gt;
                     time.sleep(2)  # Delay biar aman&lt;br /&gt;
             except Exception as e:&lt;br /&gt;
                 print(f&amp;quot;❌ Error while searching &amp;#039;{keyword}&amp;#039;: {e}&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
     print(f&amp;quot;\n✅ All results + content saved to &amp;#039;{output_file}&amp;#039;&amp;quot;)&lt;br /&gt;
 &lt;br /&gt;
 # Main&lt;br /&gt;
 if __name__ == &amp;#039;__main__&amp;#039;:&lt;br /&gt;
     keywords = load_keywords(&amp;#039;keywords.txt&amp;#039;)&lt;br /&gt;
     google_scrape_with_content(keywords, num_results=5)&lt;br /&gt;
&lt;br /&gt;
==Output (`scraped_results.csv`):==&lt;br /&gt;
&lt;br /&gt;
 | Keyword | Rank | Title | URL | Content |&lt;br /&gt;
 |--------|------|-------|-----|---------|&lt;br /&gt;
 | berita teknologi Indonesia | 1 | Judul dari halaman | https://... | Paragraf-paragraf pertama |&lt;br /&gt;
 | ... | ... | ... | ... | ... |&lt;br /&gt;
&lt;br /&gt;
==Tips:==&lt;br /&gt;
* Jangan pakai `num_results &amp;gt; 10` kalau nggak pakai delay besar.&lt;br /&gt;
* Bisa diubah agar simpan ke `.txt` atau `.json` juga.&lt;br /&gt;
* Mau filter halaman yang &amp;#039;&amp;#039;&amp;#039;bukan berita&amp;#039;&amp;#039;&amp;#039;? Bisa ditambahkan regex atau `if &amp;quot;news&amp;quot; in url`.&lt;/div&gt;</summary>
		<author><name>Unknown user</name></author>
	</entry>
</feed>