Friday 31 May 2013

Why Web Scraping Software Won't Help

How do you get a continuous stream of data from a website without getting blocked? Scraping logic depends on the HTML the web server sends out on page requests; if anything changes in that output, it is most likely going to break your scraper setup.

If you are running a website that depends on getting continuously updated data from other websites, it can be dangerous to rely on software alone.

Some of the challenges you should think about:

1. Webmasters keep changing their websites to make them more user-friendly and better-looking, which in turn breaks the delicate data extraction logic of a scraper (see the sketch after this list).

2. IP address blocking: if you keep scraping a website from your office day after day, your IP is going to get blocked by the site's "security guards" sooner or later.

3. Websites are increasingly using better ways to deliver data, such as Ajax and client-side web service calls, making it ever harder to scrape data off them. Unless you are an expert programmer, you will not be able to get the data out.

4. Think of a situation where your newly launched website has started flourishing, and suddenly the dream data feed you relied on stops. In today's world of abundant alternatives, your users will switch to a service that is still serving them fresh data.
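
To make the first and third challenges concrete, here is a minimal Python sketch of defensive extraction logic. Everything in it is hypothetical (the URL, the selectors, and the idea that you are scraping a price); the point is simply that a scraper can try a list of known selectors and fail loudly when the markup changes, rather than silently feeding bad data downstream.

    # A minimal sketch of defensive scraping (hypothetical URL and selectors).
    # Requires: pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    FALLBACK_SELECTORS = [
        "div.product-price span.amount",  # current markup (assumed)
        "span.price",                     # older markup (assumed)
    ]

    def extract_price(url: str) -> str:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for selector in FALLBACK_SELECTORS:
            node = soup.select_one(selector)
            if node:
                return node.get_text(strip=True)
        # Fail loudly so a layout change is noticed immediately.
        raise RuntimeError(f"No known selector matched {url}; markup may have changed")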

Getting over these challenges

Let experts help you: people who have been in this business for a long time and have been serving clients day in and day out. They run their own servers whose only job is to extract data. IP blocking is no issue for them, because they can switch servers in minutes and get the scraping exercise back on track. Try such a service and you will see what I mean.
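
As a rough illustration of that server-switching point, here is a hedged Python sketch of rotating requests through a pool of proxies. The proxy addresses are placeholders; a real scraping service would manage a much larger, constantly refreshed pool.

    # A rough sketch of proxy rotation (placeholder proxy addresses).
    import itertools
    import requests

    # Hypothetical pool of proxy servers; when one IP gets blocked,
    # the next request simply goes out through another.
    PROXIES = itertools.cycle([
        "http://203.0.113.10:8080",
        "http://203.0.113.11:8080",
        "http://203.0.113.12:8080",
    ])

    def fetch(url: str) -> str:
        proxy = next(PROXIES)
        response = requests.get(url, proxies={"http": proxy, "https": proxy},
                                timeout=10)
        response.raise_for_status()
        return response.text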


Source: http://ezinearticles.com/?Why-Web-Scraping-Software-Wont-Help&id=4550594

Monday 27 May 2013

Download all images from a website easily

Is there any way to download all images from a website automatically, without having to click through all the pages by hand? Yes, there is: Extreme Picture Finder. Simply enter the website address, select the folder on your local hard disk where the downloaded images should be saved, and click Start. That's all. You can now switch back to other tasks while Extreme Picture Finder works in the background, extracting, downloading and saving all those images.
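
For readers who prefer a script, here is a rough Python sketch of the same idea for a single page (Extreme Picture Finder itself crawls whole sites; the URL below is a placeholder): fetch the page, find every img tag, and save each file to a local folder.

    # A rough single-page sketch of downloading all images (placeholder URL).
    # Requires: pip install requests beautifulsoup4
    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def download_images(page_url: str, out_dir: str) -> None:
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
        for img in soup.find_all("img", src=True):
            img_url = urljoin(page_url, img["src"])   # resolve relative links
            name = os.path.basename(img_url.split("?")[0]) or "image"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(requests.get(img_url, timeout=10).content)

    download_images("http://example.com/gallery", "downloaded_images")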


The example below shows you how easy it is to download all images from a website automatically with Extreme Picture Finder and how to avoid downloading small images (like thumbnails or banners).

To make Extreme Picture Finder download images from a website, you have to create a project. Simply use the menu command Project - New project... or click the Create a new project to download all images from a website button on the program toolbar, and you will see the New Project Wizard window shown below.


Now, in the Starting address (URL) field, type the address of the website. If the site is password-protected, check the This site is password protected box and enter a valid username and password.

Basically, that is it. The default project settings are set to download all images from all pages of the site, so you can now click the Finish button and watch images flow to your hard disk. By the way, you can view the downloaded images while the rest are still being downloaded - Extreme Picture Finder has a built-in image viewer with thumbnails and a slideshow.
How to download only big or full-size images from a website

By default, Extreme Picture Finder will download all images from a website - big and small. But in most cases you need only the big or full-size images; you do not want thumbnails, banners or pieces of the website design. So instead of clicking the Finish button after entering the website address, click the Next button several times to reach the last step of the New Project Wizard.


Now check the Show advanced project properties box and then click the Finish button. You will see the Project properties window, where all project details can be modified.

In the Project properties window, select the Limits - File size section. This section lets you set the minimum and maximum file size of the images. Check the Do not download small target files, less than box and enter 25 in the corresponding edit field. You can also prevent the download of huge images by specifying a maximum file size.

Now click the OK button and Extreme Picture Finder will start downloading only the big images. You can easily make these settings the default for all projects by clicking the Make these properties default for all projects... button.
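
The same minimum-size rule is easy to express in script form. Here is a hedged Python fragment that skips files below a threshold by checking the Content-Length header before downloading; I am assuming the wizard's limit is in kilobytes, and the helper name is my own.

    # Skip images below a minimum size, checked with a HEAD request
    # before downloading the file itself.
    import requests

    MIN_KB = 25  # same threshold as in the wizard (assumed to be kilobytes)

    def is_big_enough(img_url: str) -> bool:
        head = requests.head(img_url, timeout=10, allow_redirects=True)
        size = int(head.headers.get("Content-Length", 0))
        return size >= MIN_KB * 1024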

Source: http://www.exisoftware.com/news/download-all-images-from-a-website.html

Friday 24 May 2013

Expand Your Resources With A Coupon Database

Why using multiple coupons is an important money-saving strategy: it multiplies the savings on items you buy, lets you build a stockpile, lets you make the most of store sales, and keeps you from paying full price when you could be buying at the best possible price.

Some great tools to get you there include the Coupon Database at Hotcouponworld. Use it to search, trade, and manage your own coupon inventory. I love the database because it gives you a list of every valid coupon available for an item, including printables, and it is updated daily.

Another tool is the Target Generator, which saves paper, ink, and clipping time by generating multiples of Target in-store coupons. Remember, in-store coupons can be combined with manufacturer's coupons for additional savings. You can also find out about the many great deals going on at Target here, including unadvertised deals, clearance, and other coupon match-ups that save you even more money.

For additional coupon resources, including printables, pricebooks, and more, go here.

Source: http://never2late2save.blogspot.in/2009/02/expand-your-resources-with-coupon.html

Friday 17 May 2013

Slickdeals.net

Hey, remember yesterday when I told you all about Fat Wallet and their forums? Well, guess what: I have another awesome web community filled with the newest and best deals across the internet. Slickdeals.net is a site I visit every day, throughout the day, because sometimes the best deals don't last very long, especially once the masses start sharing them.

When you first access Slickdeals.net, you are greeted with a list of featured deals that is updated throughout the day. These deals are the latest and greatest of all the posted deals on the site. The nice thing is, dead deals lose their boldface titles and are labeled expired in red, which definitely makes finding good deals easier and faster. Once you click on a headline, the deal is expanded and you can learn where and how you can get the product at a bargain price! To add to the convenience factor, an RSS feed is available so you can be one of the first people to jump on a deal.

At the top of the page is a ribbon bar with links to the homepage, a coupon database, the forums, research tools, and a FAQ. Let's start off with the FAQ. The Slickdeals.net FAQ explains anything and everything on the site. It teaches users how to use every feature to its fullest extent and is your best friend when you don't know how to do something or need help with something obscure. You can quickly search the FAQ using the search tool, which returns the most relevant answers.

Hovering over the Coupons link, a drop-down box lets you sort all the coupons by alphabetical range, the newest coupons, ones expiring soon, and by merchandise (apparel, tech, etc.). If you click the coupons link itself, a list of all the newest coupons is displayed. Whenever you buy something online, it is best to search Slickdeals.net for the shop you are buying from to see if you can save yourself a few dollars in a few seconds' time.

Similar to the Coupons link, the list of forum topics is shown when your mouse hovers over the Forums link. This is useful, but I usually just click the link to open the forums. The categories "The Deals" and "Deal Talk" are by far the best places to look for deals on the site. Each of the forums is pretty self-explanatory, and I'll leave you to explore them.

When looking at the forum topics, you will notice many different icons. Most of them are pretty self-explanatory; the paper clip, for example, means there is an attachment. A W stands for a wiki-post. These wiki-posts are created by the system and shown as the second post in the thread. All users can edit them and quickly add notes that the original poster left out (links, telephone numbers, lists, etc.). You will also see icons on the left side of the topics which categorize the deal. You can quickly find similar deals by clicking on the icon to show all the deals in that category.

Additionally, Slickdeals.net has some features that only registered users can use. You might have noticed an icon that looks like a circle with a dash or minus sign in the middle. This lets you ignore a topic, which is useful when a deal is dead or when you just don't care for the product being offered; it keeps the forums less cluttered and makes recently added deals easier to find. You can also create your own sticky topics by going into a topic, clicking the Thread Tools drop-down and choosing "Stick the thread to the top". Pretty useful, eh?

Now for the Research Tools, a feature I need to use a bit more often! Here you will find a Price Search tool, Store Ratings, Amazon Fillers, and Product Reviews. The Product Reviews link just takes you to a forum dedicated to users posting their own reviews of anything they have bought in the past.

The Price Search tool is powered by PriceGrabber, so the format may look familiar to you bargain hunters out there. All you do is search for the item you are looking for and enter your zip code. In return you are shown multiple vendors selling the product and which one has the lowest final cost after shipping and taxes. This portion is pretty much useless, though, since I know y'all have taken my advice and started using cashback sites!

Store Ratings are provided by a site called ResellerRatings.com. Here you can get a good idea of whether a small vendor you've never heard of is reliable when you find a deal that seems too good to be true. Using this tool can save you from handing over your credit card information to someone who will sell your identity to the highest bidder. You can quickly search for the store name, read reviews, and get other consumer feedback based on the store's past transactions.

Now I've saved my favorite Research Tool for last: the Amazon Filler Tool! Amazon offers free shipping on almost all purchases over $25. The problem is, they like to price things at $14.95 or $9.99. It's even more of a pain when you're only a few cents short of that $25 mark and you're stuck either paying $6 for shipping or finding something to add to your cart. Well, lucky for you, Slickdeals.net users have taken the time to create a database full of cheap items so you save the most money possible. All you do is input the amount you're short of free shipping, and you will be shown every item you can add to your cart that will qualify you for it!
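
The idea behind the filler tool is simple enough to sketch. Here is a toy Python version with a made-up catalog: given how far you are from the free-shipping threshold, it picks the cheapest item that still closes the gap.

    # A toy version of the filler-tool idea (the catalog is made up).
    FILLER_ITEMS = {
        "binder clips": 1.48,
        "pencil sharpener": 2.25,
        "notepad": 3.10,
    }

    def best_filler(shortfall: float) -> str:
        # Keep only items big enough to push the cart over the threshold...
        qualifying = {name: price for name, price in FILLER_ITEMS.items()
                      if price >= shortfall}
        # ...and take the cheapest of those.
        return min(qualifying, key=qualifying.get)

    print(best_filler(1.55))  # -> pencil sharpener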

Source: http://jesses-deals.blogspot.in/2008/05/slickdealsnet.html

Monday 6 May 2013

Splunk: Real-time (web) analytics, powerful data mining and cost effective single customer view

Splunk is a fantastic monitoring and operational intelligence tool, and now we are all trained up here at Datalicious, with certificates to prove it (see end of post). Its most frequent use case is systems administration, but we set out to play around with it and see how we could use it for web analytics. We realised that we could use its powerful, expressive search language and its intuitive charting & visualisation features to do analytics work that is more difficult, more expensive, or simply not possible in other web analytics suites.

The big philosophy of Splunk is that you just throw all your data into it and worry about how to report on it, and what to do with it, later. This is great for us: it means we can focus on gathering as much data as possible in the implementation stage of a project, and there's no risk of getting to the reporting & insights stage only to realise we've overlooked something.

We have a setup where all our Google Analytics data is cloned and sent into Splunk. We hacked together a simple, scalable pixel server in node which acts as an intermediary between Google Analytics and our Splunk installation. Our server can handle any pixel request, so we can supplement the data that Google Analytics gathers with anything we want in our tracking code, without having to set up Custom Variables in advance and without being limited to five of them.
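
Their server was written in node; purely as an illustration of the idea, here is a minimal Python sketch of a pixel endpoint (the port and log file name are arbitrary). It appends every request's query string to a file Splunk can index and answers with a 1x1 transparent GIF.

    # A minimal tracking-pixel server sketch (the real one was node;
    # this Python version only illustrates the idea).
    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Smallest transparent 1x1 GIF, served as the "pixel".
    PIXEL = base64.b64decode(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

    class PixelHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            with open("pixels.log", "a") as log:  # Splunk indexes this file
                log.write(self.path + "\n")       # path includes the query string
            self.send_response(200)
            self.send_header("Content-Type", "image/gif")
            self.end_headers()
            self.wfile.write(PIXEL)

    HTTPServer(("", 8080), PixelHandler).serve_forever()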

Once the data is in Splunk, its search language lets us get right at the data and do whatever we want with it. For example, maybe we want to see how many page views our website gets on average per session, to see how our latest site design is performing. We can run this search:

    eventtype=datalicious_GA earliest=-7d | stats avg(utms) AS avg | eval avg=round(avg, 2)

Broken down, it’s pretty simple: we’re looking at the event type called “datalicious_GA”, which has been defined elsewhere. The earliest results we want are 7 days ago. We “pipe” the output of that search to the “stats” command, and we get an average of “utms”, which is Google Analytics’ session counter. We then round it to two so that it looks a bit nicer, and we get this:

average page views

Fairly simple. But what happens if we realise we want to break those results down by some kind of segmentation that we didn't plan for in the past? It's no problem. If at any time in the future we get some additional metadata about our visitors, we can apply it retrospectively to generate segmentations across their full history. For example, let's say some visitors eventually "convert", which for our website simply means clicking one of the links to contact us. We could run this more complex search query:

    eventtype="datalicious_GA" | eval type="Non-Converter" | join type=outer datalicious [search eventtype="datalicious_GA" | join datalicious [search eventtype="datalicious_conversion"] | eval type="Converter"] | stats avg(utms) AS avg by type | eval avg=round(avg,2)

This just means we want to do a search for converters, join it to the search result for all visitors, and show the average per-session page views of each of those segments.
segmented average page views

It's trivial to look at something like conversions by channel, too.

Of course, no one wants to look at ugly search strings all day. That’s why we build visualisations:

individually segmented page views

It's important to emphasise that we can retrospectively apply a segmentation across the full history of all impressions, events and custom data at any time. In the above example, we built a little form and got people from around the office to fill in their names. We associated those with the unique cookie IDs they have on our website, and suddenly we can track their individual behaviour over all time. This didn't have to be the name; it could have been any meaningful segmentation: annual household income, country, favourite musical genre, etc.
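
Schematically (this is not their actual setup, just an illustration with made-up data), the retrospective join amounts to mapping cookie IDs to the later-acquired metadata and grouping the full event history by it:

    # Schematic of retrospective segmentation (made-up data): metadata
    # captured later is joined onto the full event history by cookie ID.
    from collections import Counter

    events = [  # full history of tracked events
        {"cookie": "abc", "page": "/home"},
        {"cookie": "abc", "page": "/contact"},
        {"cookie": "xyz", "page": "/home"},
    ]
    names = {"abc": "Alice"}  # metadata gathered later via the little form

    views_by_person = Counter(names.get(e["cookie"], "Unknown") for e in events)
    print(views_by_person)  # Counter({'Alice': 2, 'Unknown': 1})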

And of course, we can apply all of those segmentations across data like search keywords.

Source: http://blog.datalicious.com/splunk-blog-post-for-review/

Thursday 2 May 2013

Counter-terrorism using Data Mining

The Boston Marathon bombings terrorized the country. Shortly after the bombings, the FBI turned to data mining to narrow down the suspects. The FBI team analyzed 10 TB of data, such as cell phone tower call logs, text messages, social media data, photos and videos from surveillance cameras, and additional photos and videos from members of the public who were present at the marathon. Twitter data was also analyzed with the help of a company called Topsy Labs, which maintains a repository of tweets going back to 2010 along with the location each tweet originated from. The analysis covered not just the few days before the bombings but billions of tweets related to Boston and its suburbs.

This humongous amount of data was analyzed using the FBI's own software and common tools such as face recognition and position triangulation. Even though mining this data didn't lead directly to the capture of the suspect Dzhokhar Tsarnaev, it shows that data mining can be an effective tool for counter-terrorism. In the future, by developing models and drawing on techniques from Artificial Intelligence, terrorism could be reduced dramatically. Just as the precogs predict crimes in the movie "Minority Report", supercomputers could one day analyze data from satellite images, drone video feeds, and the photos and videos users upload to YouTube, Facebook, Twitter and other social media to predict a crime before it happens.
Predictive analysis seems to be the future of counter-terrorism.

Source: http://auburnbigdata.blogspot.in/2013/04/counter-terrorism-using-data-mining.html

Note:

Alyce Medina is an experienced web scraping consultant who writes articles on web data scraping, website data scraping, data scraping services, web scraping services, website scraping, eBay product scraping, forms data entry, and more.