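PHP + CSS + JS Data Scraping and Display System
Below I'll build a complete system for you: a single-page front end for display and a PHP back end for scraping, with the PHP, CSS, and JavaScript kept in separate files and clean ("pseudo-static") URLs supported through rewrite rules.
File structure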
text
project/
├── assets/
│   ├── css/
│   │   └── style.css
│   └── js/
│       └── script.js
├── includes/
│   ├── config.php
│   ├── scraper.php
│   └── functions.php
├── index.php
└── .htaccess (for clean URLs)
1. Configuration file (includes/config.php)
php
<?php
// Database settings
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASS', 'password');
define('DB_NAME', 'scraping_db');
// Scrape targets: one entry per site, mapping each field name to a selector
define('TARGET_SITES', [
'example' => [
'url' => 'https://example.com/news',
'selectors' => [
'title' => 'h1.news-title',
'content' => 'div.news-content',
'date' => 'span.news-date'
]
],
'another_site' => [
'url' => 'https://another-site.com/articles',
'selectors' => [
'title' => 'article h2',
'summary' => 'article p.summary',
'author' => 'div.author-name'
]
]
]);
// Other settings
define('ITEMS_PER_PAGE', 10);
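A typo in a TARGET_SITES entry would otherwise only show up when the scraper runs. As a minimal sketch, assuming a hypothetical helper named validateSiteConfig() that is not part of the original code, each entry could be checked up front:

```php
<?php
// Hypothetical helper (not part of the original project): returns a list of
// problems found in one TARGET_SITES-style entry, or an empty array if it looks usable.
function validateSiteConfig(array $site): array {
    $errors = [];
    if (empty($site['url']) || filter_var($site['url'], FILTER_VALIDATE_URL) === false) {
        $errors[] = 'url is missing or not a valid URL';
    }
    if (empty($site['selectors']) || !is_array($site['selectors'])) {
        $errors[] = 'selectors must be a non-empty array';
    } elseif (!isset($site['selectors']['title'])) {
        // The scraper stores and de-duplicates rows by title, so it needs this key.
        $errors[] = 'selectors must define a "title" entry';
    }
    return $errors;
}

// Sample entries shaped like the ones in config.php
$good = ['url' => 'https://example.com/news', 'selectors' => ['title' => 'h1.news-title']];
$bad  = ['url' => 'not a url', 'selectors' => []];

echo count(validateSiteConfig($good)), " problem(s) in the first entry\n";
echo implode("; ", validateSiteConfig($bad)), "\n";
```

The "title" requirement reflects how the scraper in section 3 saves records; other fields are optional there.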
2. Database functions (includes/functions.php)
php
<?php
require_once __DIR__ . '/config.php';
class DB {
private static $instance = null;
private $connection;
private function __construct() {
$this->connection = new mysqli(DB_HOST, DB_USER, DB_PASS, DB_NAME);
if ($this->connection->connect_error) {
die("Connection failed: " . $this->connection->connect_error);
}
$this->connection->set_charset("utf8mb4");
}
public static function getInstance() {
if (!self::$instance) {
self::$instance = new DB();
}
return self::$instance;
}
public function getConnection() {
return $this->connection;
}
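public function query($sql, $params = []) {
$stmt = $this->connection->prepare($sql);
if ($stmt === false) {
throw new RuntimeException('Prepare failed: ' . $this->connection->error);
}
if (!empty($params)) {
// Bind integers as 'i' and floats as 'd' instead of treating everything as a string;
// MySQL can reject string values bound to LIMIT placeholders
$types = '';
foreach ($params as $param) {
$types .= is_int($param) ? 'i' : (is_float($param) ? 'd' : 's');
}
$stmt->bind_param($types, ...$params);
}
$stmt->execute();
return $stmt;
}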
public function fetchAll($sql, $params = []) {
$stmt = $this->query($sql, $params);
$result = $stmt->get_result();
return $result->fetch_all(MYSQLI_ASSOC);
}
public function fetchOne($sql, $params = []) {
$stmt = $this->query($sql, $params);
$result = $stmt->get_result();
return $result->fetch_assoc();
}
}
// Create the table on first use
function initDatabase() {
$db = DB::getInstance()->getConnection();
$sql = "CREATE TABLE IF NOT EXISTS scraped_data (
id INT AUTO_INCREMENT PRIMARY KEY,
source_site VARCHAR(50) NOT NULL,
title VARCHAR(255) NOT NULL,
content TEXT,
custom_fields JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)";
$db->query($sql);
}
// Fetch one page of records, newest first
function getPaginatedData($page = 1, $perPage = ITEMS_PER_PAGE) {
$offset = ($page - 1) * $perPage;
$sql = "SELECT * FROM scraped_data ORDER BY created_at DESC LIMIT ?, ?";
return DB::getInstance()->fetchAll($sql, [$offset, $perPage]);
}
// Count all records
function getTotalRecords() {
$sql = "SELECT COUNT(*) as total FROM scraped_data";
$result = DB::getInstance()->fetchOne($sql);
return $result['total'];
}
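The paging query relies on two small pieces of arithmetic: turning a page number into the offset and row count for LIMIT ?, ?, and choosing a bind_param() type letter for each value. A minimal sketch, assuming two hypothetical helpers (limitArgs() and bindTypes()) that do not exist in the original code:

```php
<?php
// Hypothetical helper: turns a 1-based page number into the offset and row
// count used by "LIMIT ?, ?". Both values stay integers.
function limitArgs(int $page, int $perPage): array {
    $page = max(1, $page);
    return [($page - 1) * $perPage, $perPage];
}

// Hypothetical helper: builds a mysqli bind_param() type string from PHP values
// ('i' = integer, 'd' = double, 's' = string).
function bindTypes(array $params): string {
    $types = '';
    foreach ($params as $param) {
        $types .= is_int($param) ? 'i' : (is_float($param) ? 'd' : 's');
    }
    return $types;
}

[$offset, $count] = limitArgs(3, 10);
echo "page 3 -> LIMIT $offset, $count\n";            // page 3 -> LIMIT 20, 10
echo bindTypes([$offset, $count, 'example']), "\n";  // iis
```

Keeping the LIMIT arguments as integers matters because MySQL may refuse string values in those two positions.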
3. Scraper (includes/scraper.php)
php
<?php
require_once __DIR__ . '/functions.php';
class Scraper {
private $siteConfig;
public function __construct($siteKey) {
$this->siteConfig = TARGET_SITES[$siteKey] ?? null;
if (!$this->siteConfig) {
throw new Exception("Invalid site key");
}
}
public function scrape() {
$html = $this->fetchContent($this->siteConfig['url']);
$data = $this->parseContent($html);
$this->saveData($data);
return $data;
}
private function fetchContent($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
$html = curl_exec($ch);
if (curl_errno($ch)) {
throw new Exception('Curl error: ' . curl_error($ch));
}
curl_close($ch);
return $html;
}
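private function parseContent($html) {
$dom = new DOMDocument();
// Collect libxml warnings about malformed HTML instead of silencing them with @
libxml_use_internal_errors(true);
// loadHTML() assumes ISO-8859-1 unless told otherwise; this hint assumes the page is UTF-8
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$data = [];
foreach ($this->siteConfig['selectors'] as $key => $selector) {
$nodes = $xpath->query($this->cssToXPath($selector));
$values = [];
foreach ($nodes as $node) {
$values[] = trim($node->nodeValue);
}
$data[$key] = $values;
}
// Convert the per-field value lists into one record per item
$records = [];
$maxItems = max(array_map('count', $data));
for ($i = 0; $i < $maxItems; $i++) {
$record = [];
foreach ($data as $key => $values) {
$record[$key] = $values[$i] ?? '';
}
$records[] = $record;
}
return $records;
}
// Translate a simple descendant selector such as "article p.summary" into XPath.
// Only tag names and classes are handled; a plain str_replace() would turn "h1.news-title"
// into an element name and match nothing. IDs, attributes, and other combinators are not supported.
private function cssToXPath($selector) {
$xpath = '';
foreach (preg_split('/\s+/', trim($selector)) as $part) {
$classes = explode('.', $part);
$tag = array_shift($classes);
$xpath .= '//' . ($tag !== '' ? $tag : '*');
foreach ($classes as $class) {
$xpath .= "[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
}
}
return $xpath;
}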
private function saveData($records) {
$db = DB::getInstance();
$sourceSite = array_search($this->siteConfig, TARGET_SITES);
foreach ($records as $record) {
// Skip records whose title has already been saved for this site
$existing = $db->fetchOne(
"SELECT id FROM scraped_data WHERE title = ? AND source_site = ?",
[$record['title'], $sourceSite]
);
if (!$existing) {
$customFields = array_diff_key($record, array_flip(['title', 'content']));
$db->query(
"INSERT INTO scraped_data (source_site, title, content, custom_fields) VALUES (?, ?, ?, ?)",
[
$sourceSite,
$record['title'],
$record['content'] ?? '',
json_encode($customFields, JSON_UNESCAPED_UNICODE)
]
);
}
}
}
}
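// Entry point for scheduled runs
function runScrapers() {
foreach (array_keys(TARGET_SITES) as $siteKey) {
try {
$scraper = new Scraper($siteKey);
$scraper->scrape();
echo "Scraped data from $siteKey successfully.\n";
} catch (Exception $e) {
echo "Error scraping $siteKey: " . $e->getMessage() . "\n";
}
}
}
// Run the scrapers only when this file is executed directly from the command line
// (for example by cron), not when another script merely includes it
if (PHP_SAPI === 'cli' && isset($_SERVER['SCRIPT_FILENAME']) && realpath($_SERVER['SCRIPT_FILENAME']) === __FILE__) {
initDatabase(); // make sure the scraped_data table exists
runScrapers();
}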
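The scraper pairs values by position: the first title goes with the first summary, the second with the second, and so on. Anything other than title and content is then packed into the custom_fields JSON column. A minimal sketch of both steps on made-up sample data, with the loop pulled out into a hypothetical standalone function buildRecords():

```php
<?php
// Per-selector value lists, shaped like the $data array built in parseContent().
// The sample values are made up for illustration.
$data = [
    'title'   => ['First post', 'Second post'],
    'summary' => ['Intro to the first post'],   // the second summary is missing
    'author'  => ['Alice', 'Bob'],
];

// Hypothetical standalone version of the record-building loop.
function buildRecords(array $data): array {
    $counts = array_map('count', $data);
    $maxItems = $counts ? max($counts) : 0;     // max() rejects an empty array
    $records = [];
    for ($i = 0; $i < $maxItems; $i++) {
        $record = [];
        foreach ($data as $key => $values) {
            $record[$key] = $values[$i] ?? '';  // pad missing values with ''
        }
        $records[] = $record;
    }
    return $records;
}

$records = buildRecords($data);

// Everything except title and content ends up in the custom_fields JSON column.
$customFields = array_diff_key($records[1], array_flip(['title', 'content']));
echo json_encode($customFields, JSON_UNESCAPED_UNICODE), "\n";
// {"summary":"","author":"Bob"}
```

Because the pairing is purely positional, an item that lacks one field in the middle of the page shifts every later value of that field up by one. Selecting one container node per item and querying inside it would avoid that, at the cost of a more involved configuration.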
4. Front-end page (index.php)
php
<?php
require_once __DIR__ . '/includes/functions.php';
initDatabase();
$page = isset($_GET['page']) ? max(1, intval($_GET['page'])) : 1;
$data = getPaginatedData($page);
$totalRecords = getTotalRecords();
// Report at least one page so an empty table reads "page 1 of 1"
$totalPages = max(1, (int) ceil($totalRecords / ITEMS_PER_PAGE));
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Scraped Data Display System</title>
<link rel="stylesheet" href="assets/css/style.css">
</head>
<body>
<div class="container">
<header>
<h1>Scraped Data</h1>
<div class="stats">
<?php echo $totalRecords; ?> records in total, page <?php echo $page; ?> of <?php echo $totalPages; ?>
</div>
</header>
<div class="data-list" id="dataContainer">
<?php foreach ($data as $item): ?>
<div class="data-item">
<h3><?php echo htmlspecialchars($item['title']); ?></h3>
<div class="meta">
<span class="source">Source: <?php echo htmlspecialchars($item['source_site']); ?></span>
<span class="date"><?php echo $item['created_at']; ?></span>
</div>
<div class="content">
<?php echo nl2br(htmlspecialchars($item['content'])); ?>
</div>
<?php
$customFields = json_decode($item['custom_fields'], true);
if ($customFields && is_array($customFields)): ?>
<div class="custom-fields">
<?php foreach ($customFields as $key => $value): ?>
<div class="field">
<strong><?php echo htmlspecialchars($key); ?>:</strong>
<span><?php echo htmlspecialchars($value); ?></span>
</div>
<?php endforeach; ?>
</div>
<?php endif; ?>
</div>
<?php endforeach; ?>
</div>
<div class="pagination">
<?php if ($page > 1): ?>
<a href="?page=<?php echo $page - 1; ?>" class="prev">Previous</a>
<?php endif; ?>
<?php
$startPage = max(1, $page - 2);
$endPage = min($totalPages, $page + 2);
for ($i = $startPage; $i <= $endPage; $i++): ?>
<a href="?page=<?php echo $i; ?>" class="<?php echo $i == $page ? 'active' : ''; ?>">
<?php echo $i; ?>
</a>
<?php endfor; ?>
<?php if ($page < $totalPages): ?>
<a href="?page=<?php echo $page + 1; ?>" class="next">Next</a>
<?php endif; ?>
</div>
</div>
<script src="assets/js/script.js"></script>
</body>
</html>
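The page links show a sliding window of at most two pages on either side of the current one. A minimal sketch of that calculation, using a hypothetical helper pageWindow() that is not in the original code:

```php
<?php
// Hypothetical helper mirroring the page-link loop in index.php: returns the
// page numbers to show, at most $radius pages on either side of the current one.
function pageWindow(int $page, int $totalPages, int $radius = 2): array {
    $start = max(1, $page - $radius);
    $end = min($totalPages, $page + $radius);
    return $end >= $start ? range($start, $end) : [];
}

echo implode(' ', pageWindow(1, 10)), "\n";   // 1 2 3
echo implode(' ', pageWindow(5, 10)), "\n";   // 3 4 5 6 7
echo implode(' ', pageWindow(10, 10)), "\n";  // 8 9 10
```

Near either end the window simply gets shorter; it is not padded back out to five links.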
5. CSS styles (assets/css/style.css)
css
/* Base styles */
body {
font-family: 'Arial', sans-serif;
line-height: 1.6;
color: #333;
background-color: #f5f5f5;
margin: 0;
padding: 0;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
}
header {
background-color: #2c3e50;
color: white;
padding: 20px;
border-radius: 5px;
margin-bottom: 30px;
}
header h1 {
margin: 0;
font-size: 28px;
}
.stats {
font-size: 14px;
opacity: 0.8;
margin-top: 10px;
}
/* Data list */
.data-list {
display: grid;
grid-gap: 20px;
}
.data-item {
background-color: white;
border-radius: 5px;
padding: 20px;
box-shadow: 0 2px 5px rgba(0,0,0,0.1);
transition: transform 0.3s ease, box-shadow 0.3s ease;
}
.data-item:hover {
transform: translateY(-3px);
box-shadow: 0 5px 15px rgba(0,0,0,0.1);
}
.data-item h3 {
margin-top: 0;
color: #2c3e50;
border-bottom: 1px solid #eee;
padding-bottom: 10px;
}
.meta {
font-size: 14px;
color: #7f8c8d;
margin-bottom: 15px;
}
.meta .source {
margin-right: 15px;
}
.content {
margin-bottom: 15px;
}
.custom-fields {
background-color: #f9f9f9;
padding: 10px;
border-radius: 3px;
font-size: 14px;
}
.field {
margin-bottom: 5px;
}
.field strong {
display: inline-block;
min-width: 80px;
color: #7f8c8d;
}
/* Pagination */
.pagination {
display: flex;
justify-content: center;
margin-top: 30px;
flex-wrap: wrap;
}
.pagination a {
display: inline-block;
padding: 8px 16px;
margin: 0 5px;
background-color: white;
border: 1px solid #ddd;
border-radius: 3px;
text-decoration: none;
color: #2c3e50;
transition: all 0.3s ease;
}
.pagination a:hover {
background-color: #f1f1f1;
}
.pagination a.active {
background-color: #2c3e50;
color: white;
border-color: #2c3e50;
}
/* Responsive layout */
@media (max-width: 768px) {
.container {
padding: 10px;
}
header {
padding: 15px;
}
.data-item {
padding: 15px;
}
.pagination a {
padding: 6px 12px;
margin: 3px;
}
}
6. JavaScript (assets/js/script.js)
javascript
document.addEventListener('DOMContentLoaded', function() {
// Interactive features such as AJAX loading or animations can go here
// Example: a button that scrolls smoothly back to the top
const scrollToTop = document.createElement('div');
scrollToTop.className = 'scroll-to-top';
scrollToTop.innerHTML = '↑';
scrollToTop.style.display = 'none';
scrollToTop.style.position = 'fixed';
scrollToTop.style.bottom = '20px';
scrollToTop.style.right = '20px';
scrollToTop.style.backgroundColor = '#2c3e50';
scrollToTop.style.color = 'white';
scrollToTop.style.width = '40px';
scrollToTop.style.height = '40px';
scrollToTop.style.borderRadius = '50%';
scrollToTop.style.textAlign = 'center';
scrollToTop.style.lineHeight = '40px';
scrollToTop.style.cursor = 'pointer';
scrollToTop.style.boxShadow = '0 2px 5px rgba(0,0,0,0.2)';
scrollToTop.style.zIndex = '1000';
document.body.appendChild(scrollToTop);
scrollToTop.addEventListener('click', function() {
window.scrollTo({
top: 0,
behavior: 'smooth'
});
});
window.addEventListener('scroll', function() {
if (window.pageYOffset > 300) {
scrollToTop.style.display = 'block';
} else {
scrollToTop.style.display = 'none';
}
});
// More interactive features can be added here...
console.log('System loaded');
});
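7. Clean URL configuration (.htaccess)
apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Skip requests for real files or directories.
# Note: RewriteCond directives apply only to the RewriteRule that immediately follows them.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Rewrite clean URLs to index.php
RewriteRule ^page/([0-9]+)/?$ index.php?page=$1 [L,QSA]
# index.php as shown reads only "page"; the "source" parameter is passed through but not yet used
RewriteRule ^source/([a-zA-Z0-9_-]+)/?$ index.php?source=$1 [L,QSA]
RewriteRule ^source/([a-zA-Z0-9_-]+)/page/([0-9]+)/?$ index.php?source=$1&page=$2 [L,QSA]
# Other rewrite rules...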
</IfModule>
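To see which query parameters each clean URL turns into, the three patterns can be tried out in plain PHP. This is only a sketch of the mapping, using a hypothetical helper resolveCleanUrl(); it does not model Apache's conditions or flags:

```php
<?php
// Hypothetical helper: applies the same three patterns as the .htaccess rules
// to a request path and returns the query parameters index.php would receive.
function resolveCleanUrl(string $path): ?array {
    // In a per-directory .htaccess the pattern is matched without the leading slash
    $path = ltrim($path, '/');
    if (preg_match('#^page/([0-9]+)/?$#', $path, $m)) {
        return ['page' => $m[1]];
    }
    if (preg_match('#^source/([a-zA-Z0-9_-]+)/?$#', $path, $m)) {
        return ['source' => $m[1]];
    }
    if (preg_match('#^source/([a-zA-Z0-9_-]+)/page/([0-9]+)/?$#', $path, $m)) {
        return ['source' => $m[1], 'page' => $m[2]];
    }
    return null; // no rule matches; Apache would serve the path as-is
}

foreach (['/page/2', '/source/example', '/source/example/page/3', '/about'] as $path) {
    echo $path, ' => ', json_encode(resolveCleanUrl($path)), "\n";
}
```

The captured values arrive as strings, which is why index.php passes page through intval() before using it.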
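Usage
- Initialize the database:
  - Create the database named in DB_NAME; initDatabase() creates the scraped_data table the first time index.php is loaded
  - Update the database settings in config.php
- Configure the sites to scrape:
  - Add each site and its selectors to the TARGET_SITES array in config.php
- Scheduled scraping:
  - Set up a cron job that runs the scraping script periodically
  - Example cron command: php /path/to/project/includes/scraper.php
- Clean URLs:
  - Once enabled, URLs such as /page/2 can be used in place of ?page=2
  - Make sure mod_rewrite is enabled on the server
  - index.php links its stylesheet and script with relative paths, which resolve to /page/assets/... when the page is served as /page/2; switch them to root-relative paths such as /assets/css/style.css (assuming the project sits at the web root) if you use the rewritten URLs
- Front-end access:
  - Open index.php to view the scraped data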
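Suggested extensions
- Add user authentication to protect the scraping features
- Implement more thorough data cleaning and processing
- Add a caching layer to improve performance
- Scrape several sites concurrently to improve throughput
- Log each scraping run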
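As one possible starting point for the logging suggestion, here is a minimal sketch. The function name logScrape(), the line format, and the log file location are all assumptions and not part of the original project:

```php
<?php
// Hypothetical logging helper for scraper runs: appends one line per event.
function logScrape(string $logFile, string $siteKey, string $status, string $message = ''): void {
    $line = sprintf("[%s] %s %s %s\n", date('c'), $siteKey, $status, $message);
    // FILE_APPEND keeps earlier entries; LOCK_EX avoids interleaved writes from parallel runs
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}

// Demo: write two made-up entries to a file in the system temp directory
$logFile = sys_get_temp_dir() . '/scraper-demo.log';
logScrape($logFile, 'example', 'OK', 'run finished');
logScrape($logFile, 'another_site', 'ERROR', 'Curl error: timeout');
echo file_get_contents($logFile);
```

In runScrapers() the two echo statements would be the natural places to call such a helper, one in the try branch and one in the catch branch.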
This system keeps the PHP, CSS, and JavaScript in separate files, supports clean URLs, and covers the full path from scraping to display. You can extend and refine it further as your needs grow.