Set up Continuous Ingest for RSS Feeds
Set Up GraphGrid Ingest to Transform RSS Feed Data into a Knowledge Graph
GraphGrid Ingest can gather data from several RSS feeds and load it into a graph database. Disconnected data is ingested into a connected state, forming nodes, properties, and relationships. We can use GraphGrid Ingest to transform XML data into a graph format. To do this, we'll set up an ingest policy that takes in an RSS feed and transforms its data into article nodes.
Setting up the Feed Policy
The first step of setup is creating an RSS feed policy with data from our RSS feed source. For our example we'll be using Cinema Blend, an RSS feed source with the latest news, updates, and reviews of new movies and TV shows.
For our policy, we'll need relevant links to our RSS feed(s) and some CSS selector values to get our article content. In addition, we'll need to configure our feed groups within our policy.
Feed Groups
Feed groups are a helpful way to organize our RSS feeds if we decide to add more. Let's name our first feed group movies. Any RSS feed that is relevant to movies will go in this group.
Links
Let's get some movie reviews into the graph! The link value will be the RSS link, https://www.cinemablend.com/rss/topic/reviews/movies, and the baseLink value will be the Cinema Blend RSS index page, https://www.cinemablend.com/rss-index.html, so that we can keep track of the source when using multiple RSS feeds from the same website.
We can set GraphGrid Ingest to use the default refresh time of 120 seconds so that it pulls in new articles as they're published on the site. The Geequel we pass in takes the data from the RSS feed and transforms the content into article nodes, setting relevant properties such as the review title, source, author, and feed name.
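To make the idea of turning RSS XML into Geequel parameters more concrete, here is a minimal Python sketch. It is not how GraphGrid Ingest is implemented; it only illustrates the kind of mapping involved. The parameter names (link, title, description, author, pubDate, feedGroupName, feedName) are taken from the Geequel in the policy shown later in this guide. The sample feed, the item_to_params helper, and the assumption that each parameter comes from the RSS element of the same name are all hypothetical. The updatedDate, uri, and articleContent parameters are left out because this guide does not say where they come from in the feed.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Reviews</title>
    <item>
      <title>Example Movie Review</title>
      <link>https://example.com/reviews/example-movie</link>
      <description>A short summary.</description>
      <author>Jane Doe</author>
      <pubDate>Wed, 10 Jul 2019 18:54:15 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def item_to_params(item, feed_group_name, feed_name):
    """Build a parameter dict for one RSS <item> (hypothetical mapping)."""

    def text(tag):
        # findtext returns None when the element is missing.
        return (item.findtext(tag) or "").strip()

    return {
        "link": text("link"),
        "title": text("title"),
        "description": text("description"),
        "author": text("author"),
        "pubDate": text("pubDate"),
        # These two come from the policy, not from the feed itself.
        "feedGroupName": feed_group_name,
        "feedName": feed_name,
    }


root = ET.fromstring(SAMPLE_RSS)
params = [item_to_params(item, "movies", "reviews") for item in root.iter("item")]
print(params[0])
```

Each dict in params corresponds to one article node: link identifies the node in the MERGE, and the remaining values would be set as its properties.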
CSS Selectors
The defaultSelector value will be the CSS selector that houses the article's content. This ensures we don't get unwanted data like ads and comments. To find the value of the defaultSelector, open the main webpage for the feed you want to ingest.
To get movie review articles into our graph, go to the movie review section of Cinema Blend's site. Then click on a review and use Chrome Developer Tools to find the CSS selector that we'll need for the article content.
To find the appropriate CSS selector, open Chrome Developer Tools (Command+Option+I on Mac; Control+Shift+I on Windows, Linux, and Chrome OS) and click the element selector button.
Next, hover over the elements on the webpage and select the element that houses the main article content.
Cinema Blend houses their main article content in the #body.story-content selector. The selector for articles that we'll use for our policy is .story-content p. The p is included because we want to grab only content inside the <p></p> (paragraph) tags. This way we don't get ads or images; we ingest only the words of the articles.
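To see what the .story-content p selector keeps and what it drops, here is a small Python sketch that uses only the standard library. It is a simplified stand-in for a real CSS selector engine, not what GraphGrid Ingest runs internally. The sample HTML and the StoryParagraphs class are made up for illustration, and the sketch assumes well-formed markup in which every <p> is explicitly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StoryParagraphs(HTMLParser):
    """Collect the text of <p> elements nested inside a .story-content element."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current element nesting depth
        self.story_depth = None  # depth at which .story-content was opened
        self.p_depth = None      # depth at which the current <p> was opened
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        classes = (dict(attrs).get("class") or "").split()
        if self.story_depth is None and "story-content" in classes:
            self.story_depth = self.depth
        elif self.story_depth is not None and tag == "p" and self.p_depth is None:
            self.p_depth = self.depth
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.p_depth == self.depth:
            self.paragraphs.append("".join(self._buffer).strip())
            self.p_depth = None
        if self.story_depth == self.depth:
            self.story_depth = None
        self.depth -= 1

    def handle_data(self, data):
        # Only keep text while we are inside a matching <p>.
        if self.p_depth is not None:
            self._buffer.append(data)


SAMPLE_HTML = """
<div id="body" class="story-content">
  <p>First paragraph of the review.</p>
  <div class="ad"><span>Buy now!</span></div>
  <img src="poster.jpg">
  <p>Second paragraph with <em>emphasis</em>.</p>
</div>
<p>Comment outside the article body.</p>
"""

parser = StoryParagraphs()
parser.feed(SAMPLE_HTML)
print(parser.paragraphs)
```

The ad text, the image, and the paragraph outside the .story-content element are all skipped, which is the behavior we want from the defaultSelector.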
Save RSS Ingest Policy
With our feed groups configured with our relevant links and CSS selector, we can save our completed policy.
curl --location --request POST "${API_BASE}/1.0/ingest/default/saveFeedPolicy/cinema-blend" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${BEARER_TOKEN}" \
--data-raw '{
  "metadata": {
    "displayName": "cinema-feed-policy",
    "clusterName": "ongdb",
    "createdAt": "2019-07-10T18:54:15+00:00",
    "updatedAt": "2019-07-11T13:53:11+00:00"
  },
  "feedGroups": {
    "movies": {
      "feeds": {
        "reviews": {
          "link": "https://www.cinemablend.com/rss/topic/reviews/movies",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        }
      },
      "defaultSelector": ".story-content p",
      "baseLink": "https://www.cinemablend.com/rss-index.html"
    }
  }
}'
Load RSS Ingest Policy
Next we'll load in the feed policy, being sure to pass in the name of the policy we just saved using this request:
curl --location --request GET "${API_BASE}/1.0/ingest/default/loadFeedPolicy/cinema-blend" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${BEARER_TOKEN}"
Start RSS Ingest
Now that our policy is loaded, we can start the Ingest process! This request needs to be called each time the service starts up for the Ingest process to begin. We'll make a request to start ingest:
curl --location --request GET "${API_BASE}/1.0/ingest/default/startRSSIngest" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${BEARER_TOKEN}"
Head over to ONgDB and check on the progress. Run MATCH (a:Article) RETURN a and you should see new nodes that have been created from the RSS feed data. On average, about 50 articles will be created by the first start ingest request using Cinema Blend.
Stop RSS Ingest
To stop the Ingest process, make the stop ingest request:
curl --location --request GET "${API_BASE}/1.0/ingest/default/stopRSSIngest" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${BEARER_TOKEN}"
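The save, load, start, and stop requests above all share the same base URL and headers, so they are easy to script. The following Python sketch builds (but does not send) the same four requests with the standard library. The endpoint paths, HTTP methods, and headers are copied from the curl commands in this guide; the build_request helper, the ENDPOINTS table, and the example base URL and token are hypothetical names introduced for illustration.

```python
import json
import urllib.request

# Action name -> (HTTP method, path suffix), as used by the curl commands above.
ENDPOINTS = {
    "save": ("POST", "saveFeedPolicy/{policy}"),
    "load": ("GET", "loadFeedPolicy/{policy}"),
    "start": ("GET", "startRSSIngest"),
    "stop": ("GET", "stopRSSIngest"),
}


def build_request(api_base, token, action, policy_name=None, body=None):
    """Return an unsent urllib Request for one of the ingest actions."""
    method, path = ENDPOINTS[action]
    if "{policy}" in path and not policy_name:
        raise ValueError(f"action {action!r} needs a policy name")
    url = f"{api_base}/1.0/ingest/default/{path.format(policy=policy_name)}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; substitute your own API base URL and bearer token.
API_BASE = "https://example.test"
TOKEN = "example-token"

# The order in which this guide calls the endpoints.
for action in ("save", "load", "start", "stop"):
    request = build_request(API_BASE, TOKEN, action,
                            policy_name="cinema-blend",
                            body={"feedGroups": {}} if action == "save" else None)
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before anything reaches the service.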
RSS Feed Policy With Multiple Feed Groups
This is an example of a fuller RSS Ingest policy that configures several different RSS feed links and categories. This is how a policy with several feed groups is organized and configured:
{
  "metadata": {
    "displayName": "cinema-feed-policy",
    "clusterName": "ongdb",
    "createdAt": "2019-07-10T18:54:15+00:00",
    "updatedAt": "2019-07-11T13:53:11+00:00"
  },
  "feedGroups": {
    "movies": {
      "feeds": {
        "reviews": {
          "link": "https://www.cinemablend.com/rss/topic/reviews/movies",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        },
        "news": {
          "link": "https://www.cinemablend.com/rss/topic/news/movies",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        }
      },
      "defaultSelector": ".story-content p",
      "baseLink": "https://www.cinemablend.com/rss-index.html"
    },
    "popculture": {
      "feeds": {
        "popnews": {
          "link": "https://www.cinemablend.com/rss/topic/news/pop",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        }
      },
      "defaultSelector": ".story-content p",
      "baseLink": "https://www.cinemablend.com/rss-index.html"
    },
    "tv": {
      "feeds": {
        "popnews": {
          "link": "https://www.cinemablend.com/rss/topic/news/television",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        }
      },
      "defaultSelector": ".story-content p",
      "baseLink": "https://www.cinemablend.com/rss-index.html"
    },
    "games": {
      "feeds": {
        "gamenews": {
          "link": "https://www.cinemablend.com/rss/topic/news/games",
          "refresh": "120",
          "cypher": "MERGE (n:Article {link: {link}}) ON CREATE SET n.title = {title}, n.pubDate = {pubDate}, n.updatedDate = {updatedDate}, n.uri = {uri}, n.description = {description}, n.author = {author}, n.source = {feedGroupName}, n.feedName = {feedName}, n.content = {articleContent}",
          "overrideSelector": null
        }
      },
      "defaultSelector": ".story-content p",
      "baseLink": "https://www.cinemablend.com/rss-index.html"
    }
  }
}
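With several feed groups in one policy, it is easy to drop a key while copying and pasting. The Python sketch below is one way to check a policy locally before saving it. The key names come from the policy above, but which keys the service actually requires is an assumption made for this sketch, and find_policy_problems is a hypothetical helper, not part of GraphGrid.

```python
import copy

# Keys this sketch treats as required (an assumption, not a documented rule).
REQUIRED_GROUP_KEYS = ("feeds", "defaultSelector", "baseLink")
REQUIRED_FEED_KEYS = ("link", "refresh", "cypher")


def find_policy_problems(policy):
    """Return a list of human-readable problems found in a feed policy dict."""
    problems = []
    groups = policy.get("feedGroups") or {}
    if not groups:
        problems.append("policy has no feedGroups")
    for group_name, group in groups.items():
        for key in REQUIRED_GROUP_KEYS:
            if not group.get(key):
                problems.append(f"{group_name}: missing {key}")
        for feed_name, feed in (group.get("feeds") or {}).items():
            for key in REQUIRED_FEED_KEYS:
                if not feed.get(key):
                    problems.append(f"{group_name}.{feed_name}: missing {key}")
            refresh = feed.get("refresh")
            # The policies in this guide give refresh as a string of seconds.
            if refresh and not str(refresh).isdigit():
                problems.append(
                    f"{group_name}.{feed_name}: refresh {refresh!r} is not a whole number"
                )
    return problems


good_policy = {
    "feedGroups": {
        "movies": {
            "feeds": {
                "reviews": {
                    "link": "https://www.cinemablend.com/rss/topic/reviews/movies",
                    "refresh": "120",
                    "cypher": "MERGE (n:Article {link: {link}})",
                    "overrideSelector": None,
                }
            },
            "defaultSelector": ".story-content p",
            "baseLink": "https://www.cinemablend.com/rss-index.html",
        }
    }
}

# A deliberately broken copy: no defaultSelector and a non-numeric refresh.
bad_policy = copy.deepcopy(good_policy)
del bad_policy["feedGroups"]["movies"]["defaultSelector"]
bad_policy["feedGroups"]["movies"]["feeds"]["reviews"]["refresh"] = "2m"

print(find_policy_problems(good_policy))
print(find_policy_problems(bad_policy))
```

An empty list means the policy passed these basic checks; it does not guarantee that the service will accept it.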
After saving the policy above by sending it as the request body of the save policy request, you can check that your feed groups are configured correctly by running the feed policy report:
curl --location --request GET "${API_BASE}/1.0/ingest/default/getFeedPolicyReport" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${BEARER_TOKEN}"
If you used the exact request body shown above, your response should look like this:
{
  "Feed Groups": [
    "movies",
    "popculture",
    "tv",
    "games"
  ],
  "policyName": "cinema-blend",
  "clusterName": "ongdb",
  "Num Feed Groups": 4
}
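If you script this check, you can derive the expected report from the policy itself and compare it with the response. The sketch below mirrors the report keys shown above. It assumes that clusterName in the report is taken from the policy's metadata and that feed groups are listed in policy order; neither is confirmed by this guide, and expected_report is a hypothetical helper.

```python
def expected_report(policy, policy_name):
    """Build the report we would expect for a saved policy (assumed shape)."""
    group_names = list(policy.get("feedGroups", {}))
    return {
        "Feed Groups": group_names,
        "policyName": policy_name,
        "clusterName": policy.get("metadata", {}).get("clusterName"),
        "Num Feed Groups": len(group_names),
    }


# Trimmed-down policy with the same feed group names as the example above.
policy = {
    "metadata": {"clusterName": "ongdb"},
    "feedGroups": {"movies": {}, "popculture": {}, "tv": {}, "games": {}},
}

report = expected_report(policy, "cinema-blend")
print(report)
```

Comparing report with the parsed JSON response is then a single equality check; if the service returns groups in a different order, compare the group names as sets instead.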
Next you can start the RSS Ingest process and watch your newly transformed data become a connected and interactive knowledge graph!