Use Python to read HTML element by XPath
Ubuntu 16.04.4 LTS
- Use Python 3 to read an HTML element attribute, data-endpoint, by XPath and lxml
HTML file
...<div data-listing="article" data-endpoint="https://www.sample.com/article-list.json" ...
- Use the data-endpoint to fetch and parse the JSON data of the article list
JSON of an article list
[
  {
    "chunks": [
      {
        "title": "Test Title",
        "url": "https://www.sample.com/article1.html"
      },
      {...},
      {...}
    ]
  }
]
- Install lxml
sudo apt-get install python3-lxml
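With lxml installed, the XPath lookup can be tried offline before touching a real page. This is only a minimal sketch: the SAMPLE_HTML markup, the second "video" div with its URL, and the find_endpoints helper are made up for illustration and are not part of the original page or script.

```python
from lxml import html

# Hypothetical sample markup modelled on the div shown earlier; a real page will differ.
SAMPLE_HTML = """
<html><body>
  <div data-listing="article"
       data-endpoint="https://www.sample.com/article-list.json"></div>
  <div data-listing="video"
       data-endpoint="https://www.sample.com/video-list.json"></div>
</body></html>
"""


def find_endpoints(markup):
    """Return the data-endpoint values of all article-listing divs."""
    tree = html.fromstring(markup)
    # An XPath that ends in /@attribute yields the attribute values as strings.
    return tree.xpath('//div[@data-listing="article"]/@data-endpoint')


endpoints = find_endpoints(SAMPLE_HTML)
print(endpoints)
```

The predicate [@data-listing="article"] filters out the second div, so only the article endpoint is returned.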
- Python script
test.py
from lxml import html
import requests

urls = ["https://www.sample.com/sample.html"]

for url in urls:
    page = requests.get(url)
    content = html.fromstring(page.content)
    # Collect the data-endpoint attribute of every article-listing div
    endpoints = content.xpath('//div[@data-listing="article"]/@data-endpoint')
    for endpoint in endpoints:
        r = requests.get(endpoint)
        # The response is a list; its first item holds the "chunks" array
        data = r.json()[0]
        for article in data['chunks']:
            print(url, "\t", article['url'], "\t", article['title'])
- Run the script
python3 test.py
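
The JSON traversal in test.py can also be checked without a network connection. The sketch below assumes the response is a list whose first item has the "chunks" key that the script reads. The SAMPLE_JSON payload (including the second article) and the list_articles helper are hypothetical and only serve the illustration.

```python
import json

# Hypothetical payload following the shape that test.py expects.
SAMPLE_JSON = """
[
  {
    "chunks": [
      {"title": "Test Title", "url": "https://www.sample.com/article1.html"},
      {"title": "Second Title", "url": "https://www.sample.com/article2.html"}
    ]
  }
]
"""


def list_articles(payload):
    """Return (url, title) pairs from the first item's "chunks" list."""
    if not payload:
        return []
    # .get() avoids a KeyError when an endpoint returns a different shape.
    chunks = payload[0].get("chunks", [])
    return [(article.get("url"), article.get("title")) for article in chunks]


rows = list_articles(json.loads(SAMPLE_JSON))
for url, title in rows:
    print(url, "\t", title)
```

Returning an empty list for an empty or unexpected payload is one possible design choice; a stricter script might prefer to raise an error instead.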