Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 10:44 pm
by tyteen4a03
VirtLands wrote:Congrats, you're such a brain with that code (which I don't even know what is displayed there); it's some kind of Python, I guess?
I can see bits and pieces of webpages being slowly digested by that spider thing you created.
(The only programming I know is with Blitz.)
I suppose in the future you can ultimately make some (offline) search program for us, where when you input an author or level name, it returns all kinds of stuff.
I'm optimistic. I think you'll finish this in about a week.
Keep up the good work.
tyteen4a03 wrote:The spider is working...
The big bit displayed was the topics, with the format "topicid": (topicname, topicStarterUID). It will be passed on to other parts to crawl the posts inside each topic.
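Roughly, that structure looks something like this in Python (the IDs and names below are made up purely for illustration):
Code: Select all
# A minimal sketch of the mapping described above; the IDs and names are made up.
topics = {
    "1234": ("Example topic title", "42"),   # "topicid": (topicname, topicStarterUID)
    "1240": ("Another example topic", "7"),
}
# Each entry is later handed off to the part that crawls the posts inside the topic.
for topic_id, (topic_name, starter_uid) in topics.items():
    print(topic_id, topic_name, starter_uid)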
The offline search part is actually going to be in the new website - I guess you can call that offline search

(but the most accurate term is off-site search)
Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 11:02 pm
by VirtLands
Maybe you can make an offline version too - one that you can give us to download...
Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 11:17 pm
by tyteen4a03
VirtLands wrote:tyteen4a03 wrote:The offline search part is actually going to be in the new website - I guess you can call that offline search

(but the most accurate term is off-site search)
Maybe you can make an offline version too - one that you can give us to download; its info will be finite (dated up to a point).
All offline programs work faster than online ones.
Now that you've got the bot code off to a good start, I was wondering...
How will you prevent duplication of search efforts? What I mean is, how will you keep from searching the same links over and over?
For example, Link A links to several others (perhaps 10), and those links will eventually link back to Link A, in some strange round loop.
(Well, I'm going to lala land, will be back in a while...)
It won't, because the database backend for this bot has a unique key feature - it prevents the same post/topic/profile from being archived over and over again. The bot also does not follow links - it simply grabs the information it needs and then moves on.
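The unique key idea works roughly like the sketch below - SQLite stands in here for illustration only and is not necessarily the actual backend:
Code: Select all
# Sketch only: SQLite stands in for whatever database backend the bot actually uses.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (postID TEXT UNIQUE, topicID TEXT, content TEXT)")

def archive_post(post_id, topic_id, content):
    # Because postID is UNIQUE, re-inserting the same post is silently ignored,
    # so crawling a post twice can never create a duplicate row.
    db.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", (post_id, topic_id, content))
    db.commit()

archive_post("100", "5", "first crawl")
archive_post("100", "5", "second crawl of the same post")  # ignored
print(db.execute("SELECT COUNT(*) FROM posts").fetchone()[0])  # prints 1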
An offline version will depend on whether I can spare the time to work on a GUI application (which has been a bit of a failure for me). Development will also depend on whether MS wants to use this website to handle future level submissions - if that's the case, another application wouldn't be necessary. If the community really does want this application, work on it will start after the website's development finishes.
Posted: Fri Aug 03, 2012 6:54 pm
by tyteen4a03
Another status update...
I am now grabbing posts from the forum, and it required some special configuration... anybody who can guess what I did gets a (virtual) cookie.

Posted: Fri Aug 03, 2012 7:08 pm
by StinkerSquad01
Well, you changed the date to the post number(?).
Posted: Fri Aug 03, 2012 7:25 pm
by tyteen4a03
StinkerSquad01 wrote:Well, you changed the date to the post number(?).
It's not the post number...
(hint: Compare my picture to what you see on the main page)
Unix Time Stamps
Posted: Fri Aug 03, 2012 7:33 pm
by VirtLands
Posted: Fri Aug 03, 2012 7:51 pm
by StinkerSquad01
I was thinking that as well..
Posted: Sat Aug 04, 2012 8:23 am
by tyteen4a03
VirtLands wrote:
My wildest guess is that you've conveniently converted the date-time into
a special numeric format for easier sorting.
Does anyone see a pattern here? (I can't).
Fri Aug 03, 2012 11:33 am -- 1343947903 -- VirtLands
Fri Jul 13, 2012 8:17 am -- 1342196228
Sat Jul 14, 2012 5:15 pm -- 1342314949
Fri Aug 03, 2012 6:51 am -- 1344005460
Thu Aug 02, 2012 2:30 pm -- 1343946627
Thu Jul 19, 2012 5:25 pm -- 1342747557 -- hex:5008B3A5
I'll get back to you on this. 
They are Unix timestamps. They represent seconds since the Unix Epoch (1st January 1970, 00:00:00 GMT) and are a very convenient time format, since you can turn them into any displayable format.
(And yes, they can also be used for sorting.)
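For illustration, here is one of the timestamps quoted above converted with Python's standard library:
Code: Select all
# Converting a Unix timestamp (seconds since 1970-01-01 00:00:00 GMT) into readable forms.
from datetime import datetime

ts = 1342196228
print(datetime.utcfromtimestamp(ts))                           # 2012-07-13 16:17:08 (UTC)
print(datetime.utcfromtimestamp(ts).strftime("%a %b %d, %Y"))  # Fri Jul 13, 2012
# The forum displays the same timestamp in each viewer's local timezone.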
Unix Time Format
Posted: Sun Aug 05, 2012 4:46 am
by VirtLands
Re: Unix Time Format
Posted: Sun Aug 05, 2012 4:57 am
by tyteen4a03
VirtLands wrote:Aha. I was so close. 
( I was hunting for date-time formats, and it never occurred to me that it's Unix. )
Out of curiosity, I may compare Unix time to other formats to see 'complexity' vs 'convenience'.
Let us know of your eventual progress.
Here's some coffee to keep you going. 
Thanks! I've been having insomnia lately and can't focus on anything I do.
Here's a code snippet to show you the current progress. It might not be the tidiest, but it at least works.
This snippet requires knowledge of Python OOP and of the Scrapy and Beautiful Soup libraries.
Code: Select all
def parseTopics(self, response):
    soup = BeautifulSoup(response.body)
    # Find topic information
    topics = []
    for link, profileBit in zip(soup.find_all("a", attrs={"class": "topictitle"}),
                                soup.find_all("span", attrs={"class": "name"})):
        # Make sure Announcements and Stickies are not scanned twice,
        # while making sure they do get scanned at least once
        if link["href"].split("=")[1] in self.topics_to_ignore:
            continue
        aTopic = Topic()
        aTopic["forumID"] = response.meta["forumid"]
        aTopic["topicID"] = link["href"].split("=")[1]
        aTopic["topicName"] = link.string
        aTopic["posterID"] = profileBit.a["href"].split("=")[2]
        topics.append(aTopic)
        if link.previous_sibling.string in ["Announcement:", "Sticky:"]:  # We've scanned this topic before, let's skip it in the future
            self.topics_to_ignore.append(link["href"].split("=")[1])
    # Figure out if there's tomorrow
    hasMultiplePages = soup.find_all("td", align="right", valign="bottom", nowrap="nowrap")
    if hasMultiplePages:
        hasNextPage = hasMultiplePages[0].find_all("a", text="Next")
        if hasNextPage:
            yield Request((self.root_domain + "/" + hasNextPage[0]["href"]),
                          callback=self.parseTopics,
                          meta={"forumid": response.meta["forumid"]})
    for t in topics:
        if t["topicID"] not in ignoreList:  # ignoreList is defined elsewhere in the spider
            yield Request((self.root_domain + "/" + "viewtopic.php?t=" + t["topicID"]),
                          callback=self.parsePosts,
                          meta={"topicID": t["topicID"]})
Code is complex
Posted: Sun Aug 05, 2012 5:35 am
by VirtLands
Re: Code is complex
Posted: Sat Aug 11, 2012 6:30 pm
by tyteen4a03
VirtLands wrote:Hmmm, this post hasn't been updated in a while. Could be it's turning into a cobweb site.
Yes, I haven't had time to work on it for a while.
Here's a very important bit of the spider - post scraping. This 70-line code scrapes post information and attachments, while initiating the scraping of post content (I will explain later why that's a separate process), user profiles, and (of course) the next page.
It's 2:30 AM here now, I need to go to sleep.
Code: Select all
def parsePosts(self, response):
    """
    Parse post content.
    """
    soup = BeautifulSoup(response.body)

    def aName(tag):
        return tag.name == "a" and isinstance(tag["name"], int)

    def aHref(tag):
        return tag.name == "a" and tag["href"].startswith("profile.php?mode=viewprofile&u=")

    def spanClass(tag):
        return tag.name == "span" and tag["class"] == "postdetails" and tag.string.startswith("Posted: ")

    def spanClassPostBody(tag):
        return tag.name == "span" and \
            tag["class"] == "postbody" and not \
            tag.string.startswith("<br />_________________<br />")

    def attachURL(tag):
        return tag.name == "a" and tag["href"].startswith("download.php?id=")

    def determineContentFetchMode(postid):
        if soup.find("a", href="posting.php?mode=editpost&p=" + postid):
            return "edit"
        elif soup.find("a", href="posting.php?mode=quote&p=" + postid):
            return "quote"
        else:
            return "raw"

    # Find posts information
    posts = []
    attachments = []
    for (pid, userid, username, posttime, content) in zip(
        soup.find_all(aName),  # Post ID
        soup.find_all(aHref),  # User ID
        (span.b.string for span in soup.find_all("span", attrs={"class": "name"})),  # Username
        soup.find_all(spanClass),  # Post time
        soup.find_all(spanClassPostBody),  # Post body
    ):
        aPost = Post()
        aPost["postID"] = pid
        aPost["topicID"] = response.meta["topicID"]
        aPost["posterID"] = userid.strip("profile.php?mode=viewprofile&u=")
        aPost["postTime"] = posttime.strip("Posted: ")[0:9]  # Timestamps are always 10 digits
        attachTable = content.find_next_sibling("table", attrs={"class": "attachtable"})
        # Attachment?
        if attachTable:
            anAttachment = Attachment()
            anAttachment["postID"] = pid
            anAttachment["originalFilename"] = attachTable.find("span", attrs={"class": "gen"})  # The original name
            anAttachment["displayFilename"] = attachTable.find(attachURL)
            attachments.append(anAttachment)
        # Initiate Post content scraping
        mode = determineContentFetchMode(pid)
        if mode != "raw":
            yield Request((self.root_domain + "/" + "posting.php?mode=" +
                           ("editpost" if mode == "edit" else "quote") +
                           "&p=" + pid),
                          callback=self.parsePostContent,
                          meta={"postID": pid, "content": None, "mode": mode})
        # Initiate User scraping
        if username not in self.users_scanned:
            yield Request((self.root_domain + "/" + userid),
                          callback=self.parseUser)
        posts.append(aPost)
    yield posts
    # Figure out if there's tomorrow
    hasMultiplePages = soup.find("td", align="left", valign="bottom", colspan=2)
    if hasMultiplePages:
        hasNextPage = hasMultiplePages.find("a", text="Next")
        if hasNextPage:
            yield Request((self.root_domain + "/" + hasNextPage["href"]),
                          callback=self.parsePosts,
                          meta={"topicID": response.meta["topicID"]})
crawlers and stuff
Posted: Sat Aug 11, 2012 9:30 pm
by VirtLands
Teleport Pro
Posted: Sat Aug 11, 2012 11:26 pm
by VirtLands
Posted: Sun Aug 12, 2012 3:55 am
by tyteen4a03
I don't want to work with regexes (they are a pain in the butt), and all those customizations just hurt my brain.
And the code I'm writing is open-source (well, it will be soon; I haven't had time to upload it to GitHub yet).
For now, the best help would be to grab me coffee.

Zoolz
Posted: Mon Aug 13, 2012 2:55 am
by VirtLands
I had an idea: if you finish this project and wish to share your hard-earned data with us, then you can upload it to one of the following.
Create an account on Zoolz: http://goo.gl/6D4uT
or
SkyDrive: https://skydrive.live.com/
Zoolz link for downloaded http://www.pcpuzzle.com/forum/
Posted: Mon Aug 13, 2012 4:51 am
by VirtLands

Uploaded in zipped form, 240 MB (incomplete, but substantial).
Zoolz link: http://zlz.me/yq5i9
Re: Zoolz link for downloaded http://www.pcpuzzle.com/forum/
Posted: Mon Aug 13, 2012 12:24 pm
by jdl
VirtLands wrote:Surprisingly it only amounts to 240 MB, yet it states "complete".
{ 8878 files in 1058 folders }, maybe it's just a very, very compact format.

I think I found out why. I just tested your download, and for example, for the old Off-Topic forum it says "Goto page 1, 2, 3 ... 41, 42, 43", but pages 4-40 have not been archived. It appears that the "in-between" pages of all the subforums were not downloaded at all.
Posted: Mon Aug 13, 2012 1:11 pm
by tyteen4a03
search.php does not work because it's a dynamic page.
I also have my own storage space - I don't use cloud file services.
I also want to clarify why writing my own spider is better than archiving all the pages and then mining data out of them - it saves disk space. Because pages are scanned and mined on the fly, almost no disk space is needed to store unnecessary HTML files (which would be a lot of overhead).
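As a rough sketch of that on-the-fly idea (not the actual project code - the names here are illustrative), a Scrapy item pipeline can write each scraped item straight to a database while the downloaded HTML is thrown away:
Code: Select all
# Rough sketch, not the actual project code: scraped items go straight to storage,
# and the page's HTML is discarded once it has been parsed.
import sqlite3

class StorePostsPipeline(object):
    def open_spider(self, spider):
        self.db = sqlite3.connect("archive.db")
        self.db.execute("CREATE TABLE IF NOT EXISTS posts "
                        "(postID TEXT UNIQUE, topicID TEXT, postTime TEXT)")

    def process_item(self, item, spider):
        self.db.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)",
                        (item.get("postID"), item.get("topicID"), item.get("postTime")))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()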
offline and so fine
Posted: Mon Aug 13, 2012 7:29 pm
by VirtLands
I forgot to mention (c):
(a) It gets stuck when it wanders onto http://www.midnightsynergy.com/
(b) The search function ( http://www.pcpuzzle.com/forum/search.php ) doesn't work.
(c) Contains no attachments, and therefore no levels & customMedia
______________________________________________________
Thanks to JDL for his studying of the download.
I thought there was something fishy about it only being 240 MB.
Posted: Mon Aug 13, 2012 11:36 pm
by tyteen4a03
Yes, I blacklisted the login page.
The spider mines the topic lists, post lists, user profiles, and attachments of specific forums.
Offline Explorer Enterprise
Posted: Thu Aug 16, 2012 2:20 am
by VirtLands
I see.

-----------------------------------------------------------------------
Re: Offline Explorer Enterprise
Posted: Thu Aug 16, 2012 3:02 am
by tyteen4a03
VirtLands wrote:I see.
Well, I tried the demo of Offline Explorer Enterprise, and tried so many ways to set up "URL omissions" so that it won't log me out of http://www.pcpuzzle.com/forum/
But I could never get it to download attachments; I can only get it to download the regular HTML stuff (+ images).
I basically gave up on the attachments option.
Looks like we'll have to get someone to download all
the attachments for us. Any volunteers? 
-----------------------------------------------------------------------
So, how much progress have you made with your data mining ?
You need to be logged in to download attachments.
Today's the last day that I'm not free. Work will start tomorrow.
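For anyone wondering how a spider gets past that, one way a Scrapy spider can log into a phpBB board before crawling looks roughly like this (the form field names are a guess at the login page, not necessarily what this bot does):
Code: Select all
# Sketch only: log in first so that attachment links (download.php) become reachable.
# The "username"/"password" field names are assumptions about phpBB's login form.
import scrapy

class LoggedInSpider(scrapy.Spider):
    name = "loggedin"
    start_urls = ["http://www.pcpuzzle.com/forum/login.php"]

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "YOUR_NAME", "password": "YOUR_PASSWORD"},
            callback=self.after_login)

    def after_login(self, response):
        # Scrapy keeps the session cookie automatically, so protected pages can now be fetched.
        yield scrapy.Request("http://www.pcpuzzle.com/forum/index.php",
                             callback=self.parse_forum)

    def parse_forum(self, response):
        pass  # normal crawling of topics, posts and attachments continues from here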
Web Boomerang
Posted: Fri Aug 17, 2012 5:53 pm
by VirtLands

[ Time to start worrying about the sharks, they're coming after you.. ]
I changed my mind about Web Boomerang. It's awful.
Posted: Fri Aug 17, 2012 6:25 pm
by tyteen4a03
That site still exists? o.o
Anyway, work has resumed and I expect another update very soon. Hopefully I will be able to put the spider to work for the first time.
SID
Posted: Fri Aug 17, 2012 7:40 pm
by VirtLands
I'm temporarily back on the 'rack. (HTTrack)
I just learned what a SID is:
Whenever one logs onto a forum (such as this), we are provided
with a SID.
For example, my SID is:
sid=786b9ecdf00278##################
Aha! Did you really think I'd tell you my SID? (I covered it with #'s.)
So, folks, never give out your SID.
Definition and examples of SID:
http://kb.iu.edu/data/aotl.html
A SID contains:
User and group security descriptors
48-bit ID authority
Revision level
Variable sub-authority values
____________________________________________
Posted: Sat Aug 18, 2012 4:45 am
by tyteen4a03
SID means PHP Session ID. Every time you visit another page, it is refreshed in the database. There is absolutely no harm in giving out your PHP Session ID, as hackers can't really do anything with it.
(But if an exploit has been found in the software, this will not be the case - I won't explain that here.)
Re: HTTrack project 20GB
Posted: Sat Aug 18, 2012 5:06 am
by LexieTheFox