Parsing data in C# can be a simple process that needs only basic
language syntax. The string methods IndexOf and Substring provide this
functionality and, when used properly, make data collection and simple
parsing operations much easier.
When parsing data from a source such as a web page, it is important to
first remove any unnecessary data at the beginning of the document, so
that the parsing utility does not find the wrong information. Consider
the following sample RSS feed:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>website design search results</title>
<link>http://randomexamplesiteurl.com/</link>
<language>en</language>
<pubDate>Wed, 15 Apr 2009 18:31:33 GMT</pubDate>
<lastBuildDate>Wed, 15 Apr 2009 18:31:33 GMT</lastBuildDate>
<image>
<title>website design - sample feed</title>
<url>http://randomexamplesiteurl.com/testimage1.gif</url>
<link>http://randomexamplesiteurl.com/</link>
</image>
<item>
<title>Small Businesses Receive Web Design Financing from Wildfire</title>
<link>http://randomexamplesiteurl.com/testlink1.html</link>
<pubDate>Wed, 15 Apr 2009 07:15:30 GMT</pubDate>
<description>This is a sample description I am using for testing purposes</description>
</item>
<item>
<title>Effective website design for successful ecommerce</title>
<link>http://randomexamplesiteurl.com/testlink2.html</link>
<pubDate>Wed, 15 Apr 2009 11:23:38 GMT</pubDate>
<description>This is a sample description I am using for testing purposes</description>
</item>
<description>website design - XML Sample</description>
</channel>
</rss>
Finding unique tags that mark the beginning of the data to be parsed is
the key to building an efficient parsing utility. In the above sample,
all of the text prior to the first "<item>" tag is irrelevant if you are
only attempting to gather the item data, and is not needed to complete
the parsing process. To remove it from your text, use the following
code (the code assumes the data is loaded in a string variable named
strData):
// Keep everything from the first <item> tag onward.
int intStartPos = strData.IndexOf("<item>");
string strWorkingRSS = strData.Substring(intStartPos);
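One caveat with the trimming step: IndexOf returns -1 when the search
string is not found, and Substring throws an
ArgumentOutOfRangeException when given a negative start index. The
following is a minimal sketch of a guarded version. The class and
method names (FeedTrimmer, TrimToFirst) are hypothetical and chosen
only for illustration, and returning an empty string on a miss is an
assumption about what the caller wants.

```csharp
using System;

public static class FeedTrimmer
{
    // Returns the part of strData that starts at the first occurrence of
    // strMarker. If the marker is missing, returns an empty string rather
    // than letting Substring throw on a negative index.
    public static string TrimToFirst(string strData, string strMarker)
    {
        int intStartPos = strData.IndexOf(strMarker, StringComparison.Ordinal);
        if (intStartPos < 0)
        {
            return string.Empty;
        }
        return strData.Substring(intStartPos);
    }
}
```

With this helper, the trimming step could be written as
string strWorkingRSS = FeedTrimmer.TrimToFirst(strData, "<item>");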
Once the irrelevant data has been removed, you can focus on parsing
the remainder of the string. This can be done by using any unique
string at the beginning and the end of the data you would like to
capture. IndexOf always stops at the first instance of the search
string, so if you continue to trim the text as you work, as in the
sample above, you can easily write a loop that pulls out each of the
items until all of the data has been parsed. The sample below assigns
the variable strTitle the text between the "<title>" and "</title>"
tags.
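string strOpenString = "<title>";
// Search the trimmed string so the first match is an item title, not the channel title.
intStartPos = strWorkingRSS.IndexOf(strOpenString) + strOpenString.Length;
// Look for the closing tag only after the opening tag.
int intEndPos = strWorkingRSS.IndexOf("</title>", intStartPos);
int intLength = intEndPos - intStartPos;
string strTitle = strWorkingRSS.Substring(intStartPos, intLength);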
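The loop mentioned above could look something like the following
sketch. It repeatedly captures the text between an opening and closing
string, then trims everything up to and including the closing string
before the next pass. The class and method names (TitleParser,
ExtractAll) are hypothetical, and stopping quietly on an unterminated
tag is an assumption rather than a requirement.

```csharp
using System;
using System.Collections.Generic;

public static class TitleParser
{
    // Collects the text between every strOpen/strClose pair found in strData.
    public static List<string> ExtractAll(string strData, string strOpen, string strClose)
    {
        var results = new List<string>();
        string strRemaining = strData;

        while (true)
        {
            int intOpenPos = strRemaining.IndexOf(strOpen, StringComparison.Ordinal);
            if (intOpenPos < 0) break;   // no more opening strings

            int intStartPos = intOpenPos + strOpen.Length;
            int intEndPos = strRemaining.IndexOf(strClose, intStartPos, StringComparison.Ordinal);
            if (intEndPos < 0) break;    // opening string without a close; stop

            results.Add(strRemaining.Substring(intStartPos, intEndPos - intStartPos));

            // Trim the text that has already been parsed before the next pass.
            strRemaining = strRemaining.Substring(intEndPos + strClose.Length);
        }

        return results;
    }
}
```

Called as TitleParser.ExtractAll(strWorkingRSS, "<title>", "</title>")
on the trimmed sample feed, this should return the titles of the two
items, since the channel and image titles were removed by the earlier
trimming step.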
This should be enough information to get any parsing project started.
The sample data here is XML, but the real value of this type of
parsing utility is in cases where data from an HTML site, or a group
of HTML pages, needs to be moved to a dynamic location such as a
database. Often the only viable option for such a data transfer is a
"screen scraping" application, and this code provides a general
outline for building one in most circumstances.
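When the same approach is applied to HTML, one practical difference is
that tag names may appear in upper or lower case. The sketch below
uses a case-insensitive search to cope with that. The class and method
names (HtmlScraper, ExtractBetween) and the sample markup in the usage
note are hypothetical, and returning null when nothing is found is
only one possible convention.

```csharp
using System;

public static class HtmlScraper
{
    // Returns the trimmed text between the first strOpen and the next
    // strClose, matching both without regard to case. Returns null when
    // either string is not found.
    public static string ExtractBetween(string strHtml, string strOpen, string strClose)
    {
        int intOpenPos = strHtml.IndexOf(strOpen, StringComparison.OrdinalIgnoreCase);
        if (intOpenPos < 0) return null;

        int intStartPos = intOpenPos + strOpen.Length;
        int intEndPos = strHtml.IndexOf(strClose, intStartPos, StringComparison.OrdinalIgnoreCase);
        if (intEndPos < 0) return null;

        return strHtml.Substring(intStartPos, intEndPos - intStartPos).Trim();
    }
}
```

For example, given a fragment such as <TD class="price"> 19.99 </TD>,
searching for <td class="price"> and </td> would capture the cell
contents even though the case of the tags differs.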