Scraping GitHub

David Carr

2 min read - 17th Jul, 2015

For an upcoming project I need to dynamically fetch information about a GitHub repository: the number of stars, watchers and forks, plus the repo description and URL.

Looking at the API, I didn’t see a simple way of doing it, so I decided to scrape my repo instead.

Using Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net) the process is simple. First include simple_html_dom.php, then load the URL of my repo:

include('simple_html_dom.php');
$html = file_get_html('https://github.com/simple-mvc-framework/framework');

Next I need to get the watchers, stars and forks. Each is contained within an a link with a class of social-count. That’s perfect, as I can use the class to select those links:

$html->find('a.social-count', 0)->innertext;

The number is the index of the match. I could loop through the results using a foreach, but I wanted to be specific and add them to an array like this:

$info = array(
    'watching' => trim($html->find('a.social-count', 0)->innertext),
    'starred' => trim($html->find('a.social-count', 1)->innertext),
    'forked' => trim($html->find('a.social-count', 2)->innertext)
);

I’ve wrapped each result in trim to remove any surrounding whitespace.
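For comparison, the foreach alternative mentioned above might look like the sketch below. To keep it runnable without fetching anything, it swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath and runs against a made-up HTML string; the sample markup and the numbers in it are assumptions, not GitHub's real page.

```php
<?php
// Made-up markup standing in for the three counter links on the repo page.
$sample = '<ul>'
    . '<li><a class="social-count" href="/watchers"> 12 </a></li>'
    . '<li><a class="social-count" href="/stargazers"> 340 </a></li>'
    . '<li><a class="social-count" href="/network"> 56 </a></li>'
    . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// XPath equivalent of the CSS selector a.social-count
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " social-count ")]';

// Loop over every match instead of picking each one by index
$counts = array();
foreach ($xpath->query($query) as $link) {
    $counts[] = trim($link->textContent);
}

// Label the values in page order (assumes exactly three links were found)
$info = array_combine(array('watching', 'starred', 'forked'), $counts);

print_r($info);
```

The loop is shorter, but it relies on the links always appearing in the same order, which is the same assumption the indexed version makes.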

That’s the stats taken care of. Next is the repo description, which is stored in a div with a class of ‘repository-description’:

$html->find('div.repository-description', 0)->innertext;

Finally the repo URL, which is inside a div with a class of ‘repository-website’:

strip_tags($html->find('div.repository-website', 0)->innertext);

This time I want to remove the a link, so I use strip_tags, which removes all markup and leaves only the text.
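As a small illustration of what strip_tags and trim do together, here is a sketch that runs on a hard-coded string. The markup and the example.com address are made up for the demonstration.

```php
<?php
// Made-up innertext of the website div: a single link wrapping the URL.
$inner = ' <a href="http://example.com" rel="nofollow">http://example.com</a> ';

// strip_tags() removes the <a> element, trim() removes the leftover spaces.
$url = trim(strip_tags($inner));

echo $url; // http://example.com
```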

Putting this all together:

$info = array(
    'watching' => trim($html->find('a.social-count', 0)->innertext),
    'starred' => trim($html->find('a.social-count', 1)->innertext),
    'forked' => trim($html->find('a.social-count', 2)->innertext),
    'desc' => trim($html->find('div.repository-description', 0)->innertext),
    'sitelink' => trim(strip_tags($html->find('div.repository-website', 0)->innertext))
);

Now any time I want to display one of these I can call the relevant part, such as $info['starred'].
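Because the selectors depend on GitHub's markup, a lookup may find nothing if the page changes, and reading innertext from an empty result would then fail. One way to guard against that is a small wrapper like the sketch below. The function name text_or_default is hypothetical and not part of simple_html_dom; it assumes find() returns null when an indexed match is missing.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: return the trimmed
// innertext of the match at $index, or $default when nothing matches.
// Assumes $dom->find($selector, $index) returns null for a missing match.
function text_or_default($dom, $selector, $index = 0, $default = '')
{
    $node = $dom->find($selector, $index);

    return $node === null ? $default : trim($node->innertext);
}

// Usage with the $html object from above (shown as a comment so this
// block runs on its own):
// $desc = text_or_default($html, 'div.repository-description', 0, 'No description');
```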

Commits

Now that I have the stats, it would be nice to display recent commits, say the most recent 5.

This time I call a different URL. The commits are stored in a series of li elements with a class of commit, so this time I loop through them.

The commit meta and title are stored in variables, and str_replace is then used to make sure the URLs on the a links point to GitHub.

I only want 5, so a check is run on each pass: once $i is equal to 5, the loop breaks.

$i = 0;
$html = file_get_html('https://github.com/simple-mvc-framework/framework/commits/master');
foreach ($html->find('li.commit') as $e) {

    // Stop once five commits have been printed
    if ($i == 5) {
        break;
    }

    $commit = $e->find('div.commit-meta', 0)->innertext.'<br>';
    $title = $e->find('p.commit-title', 0)->innertext.'<br>';

    echo '<p>';
    // Rewrite relative links so they point to github.com
    echo str_replace('href="/', 'href="https://github.com/', $title);
    echo str_replace('href="/', 'href="https://github.com/', $commit);
    echo '</p>';

    $i++;
}
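The two ideas in that loop, limiting the list to five and rewriting relative links, can also be tried in isolation. The sketch below uses made-up commit titles rather than scraped ones, and array_slice in place of a counter; both are assumptions for illustration, not what the page returns.

```php
<?php
// Made-up commit titles with relative links, standing in for the scraped items.
$titles = array();
for ($n = 1; $n <= 8; $n++) {
    $titles[] = '<a href="/simple-mvc-framework/framework/commit/' . $n . '">Commit ' . $n . '</a>';
}

// array_slice() keeps the first five entries, so no counter is needed.
$recent = array_slice($titles, 0, 5);

// str_replace() accepts an array subject and rewrites every entry,
// turning the relative hrefs into absolute github.com links.
$recent = str_replace('href="/', 'href="https://github.com/', $recent);

foreach ($recent as $title) {
    echo '<p>' . $title . '</p>' . "\n";
}
```

Since find() without an index returns an array of matches, the same array_slice call could be applied to $html->find('li.commit') before looping.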

Conclusion

It would have been nice to use an official API, but it would have meant multiple calls for the information. Scraping the information is much quicker and easier in this case.
