Select links from HTML
Function: Select links from HTML
This action helps you automatically find and extract all the web links (URLs) from any HTML content you provide. You can choose to get all links, only links that stay within the same website (internal), or only links that go to other websites (external). This is useful for tasks like analyzing website content, building sitemaps, or gathering data from web pages.
Input
- HTML (STRING): The full HTML content (as text) from which you want to extract links. This is a required input.
- Link type (SELECT_ONE): Choose the type of links you want to find. This is a required input.
  - All: Extracts every link found in the HTML.
  - Internal: Extracts only links that point to pages within the same website or domain.
  - External: Extracts only links that point to pages on different websites or domains.
- Base domain (URL): Specify the main web address (e.g., www.example.com) of the website the HTML content belongs to. This is crucial if you want to filter for "Internal" or "External" links. If you don't provide it, the system will try to guess the base domain from the HTML itself. This input is only required if you select "Internal" or "External" for the "Link type".
Output
- Result (ARRAY): A list of all the URLs (as text) that were found and matched your selected criteria.
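The inputs and output described above can be sketched in Python using only the standard library. This is a minimal illustration, not the action's actual implementation, which is not documented here. The names `select_links` and `_HrefCollector` are hypothetical. The sketch assumes the base domain includes a scheme (for example `https://www.example.com`), compares hostnames exactly, and does not try to guess a missing base domain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def select_links(html, link_type="All", base_domain=None):
    """Return the URLs in `html` that match `link_type`.

    Hypothetical helper: `link_type` is "All", "Internal" or "External",
    and `base_domain` is assumed to include a scheme such as https://.
    """
    if link_type not in ("All", "Internal", "External"):
        raise ValueError(f"unknown link type: {link_type}")

    collector = _HrefCollector()
    collector.feed(html)

    # Resolve relative hrefs (e.g. "/features") against the base when given.
    links = [urljoin(base_domain, h) if base_domain else h
             for h in collector.hrefs]
    if link_type == "All":
        return links
    if not base_domain:
        raise ValueError("base_domain is needed for Internal/External")

    base_host = urlparse(base_domain).hostname
    if link_type == "Internal":
        return [u for u in links if urlparse(u).hostname == base_host]
    # External: a real hostname that differs from the base (skips mailto: etc.).
    return [u for u in links if urlparse(u).hostname not in (None, base_host)]
```

Calling `select_links(html, "Internal", "https://www.example.com")` on a page that links to `/features` and to `https://partner.com/promo` would keep only the first link, resolved to an absolute URL.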
Execution Flow
Real-Life Examples
Example 1: Extracting all links from a product description
Imagine you have an HTML snippet of a product description page and you want to quickly see all the links mentioned, regardless of where they lead.
- Inputs:
  - HTML: <html><body><h1>Product A</h1><p>Check out our <a href="/features">features</a> or visit <a href="https://partner.com/promo">our partner</a>.</p></body></html>
  - Link type: All
  - Base domain: (Left blank)
- Result: A list containing:
  - https://yourwebsite.com/features (assuming yourwebsite.com was inferred as base)
  - https://partner.com/promo
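The relative link in this example only becomes a full URL once it is resolved against a base. As a small sketch of that step, the snippet below uses Python's `urllib.parse.urljoin`. The base `https://yourwebsite.com` is the same assumption the example result makes, and whether the action resolves links this way is not confirmed by this page.

```python
from urllib.parse import urljoin

# Assumed base, matching the example's "inferred as base" note.
base = "https://yourwebsite.com"

# The two href values that appear in the example HTML.
hrefs = ["/features", "https://partner.com/promo"]

# Relative hrefs are resolved against the base; absolute ones are kept as-is.
resolved = [urljoin(base, href) for href in hrefs]
print(resolved)
# ['https://yourwebsite.com/features', 'https://partner.com/promo']
```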
Example 2: Finding internal navigation links on a company's "About Us" page
You've scraped the HTML content of a company's "About Us" page and want to identify all the links that lead to other pages within the same company website (e.g., "Careers," "Contact Us," "Our Team").
- Inputs:
  - HTML: (The full HTML content of https://www.example.com/about-us)
  - Link type: Internal
  - Base domain: https://www.example.com
- Result: A list containing URLs like:
  - https://www.example.com/careers
  - https://www.example.com/contact
  - https://www.example.com/team
  - (Excluding any links to https://www.facebook.com/example or https://blog.anotherdomain.com)
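One way to express the "internal" test used in this example is to compare each link's hostname with the base domain's hostname. The sketch below is an illustration under assumptions: `is_internal` and its `include_subdomains` flag are hypothetical names, and this page does not say whether the action treats subdomains of the base domain as internal.

```python
from urllib.parse import urlparse


def is_internal(url, base_domain, include_subdomains=False):
    """Return True if `url` is on the same host as `base_domain` (hypothetical helper)."""
    host = urlparse(url).hostname
    base_host = urlparse(base_domain).hostname
    if not host or not base_host:
        return False  # relative or non-web URLs are not classified here
    if host == base_host:
        return True
    if not include_subdomains:
        return False
    # Optional, assumed behaviour: also accept subdomains of the bare domain.
    bare = base_host[4:] if base_host.startswith("www.") else base_host
    return host == bare or host.endswith("." + bare)


base = "https://www.example.com"
print(is_internal("https://www.example.com/careers", base))   # True
print(is_internal("https://www.facebook.com/example", base))  # False
print(is_internal("https://blog.anotherdomain.com", base))    # False
```

With exact matching, a link such as `https://blog.example.com/` counts as external. Passing `include_subdomains=True` would treat it as internal instead.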
Example 3: Identifying external resources linked from a blog post
You have the HTML of a blog post and want to find all the links that point to external websites, such as research papers, news articles, or other blogs, to understand the sources or references used.
- Inputs:
  - HTML: (The full HTML content of a blog post from https://myblog.org/post-title)
  - Link type: External
  - Base domain: https://myblog.org
- Result: A list containing URLs like:
  - https://www.researchgate.net/publication/123
  - https://www.nytimes.com/article-about-topic
  - https://anotherblog.com/related-post
  - (Excluding any links to https://myblog.org/category/tech or https://myblog.org/author/john-doe)
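Once the external links are available, they can be grouped by host to see which sources a post relies on. The sketch below starts from the URLs named in this example and applies the same hostname comparison. It is an illustration only, and the counting step is an assumed follow-up rather than something the action does itself.

```python
from collections import Counter
from urllib.parse import urlparse

base_host = urlparse("https://myblog.org").hostname

# URLs taken from the example above (external and internal mixed).
links = [
    "https://www.researchgate.net/publication/123",
    "https://www.nytimes.com/article-about-topic",
    "https://anotherblog.com/related-post",
    "https://myblog.org/category/tech",
    "https://myblog.org/author/john-doe",
]

# External: a real hostname that differs from the base domain's hostname.
external = [u for u in links if urlparse(u).hostname not in (None, base_host)]

# Count how many external links point at each host.
sources = Counter(urlparse(u).hostname for u in external)
print(external)
print(sources)
```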