Panofi Blog: 2009

Monday, 28 September 2009

3.11 References and Resources

3.10 Information Architectures

As technology advances and Web technologies become ever more ubiquitous, there is a growing need to deal with the vast amounts of data being collected. To collect, store and extract information from these data efficiently, the Information Architecture disciplines have developed and introduced several techniques. One of the most widely used and supported technologies is the Relational Database Management System, or RDBMS.

An RDBMS implements special data structures represented as tables. Relationships between those tables, and the integrity of the data they contain, are enforced using constraints commonly known as primary and foreign keys. The database schema supporting an IT system is one of its core subsystems; a good database design helps to achieve efficient response times to user requests. Modern RDBMS also incorporate subsystems and mechanisms that improve availability, reliability and scalability (Connolly & Begg, Addison Wesley, 2005).

In contrast with Object Oriented architectures, which use custom objects and data types, an RDBMS is based on user-created entities stored as tables, with attributes as columns and the data represented using built-in data types such as integer, decimal and varchar. Such data types differ considerably in the storage they require (SQL Server 2008 Books Online).

From my experience, the choice of data types when designing a database schema is crucial. It is tempting to always opt for a bigger data type (e.g. int instead of smallint or tinyint) even when you know that the column will never store values needing more than 1 or 2 bytes. This has pros and cons: it eases future scaling of the system if the requirements change, but the database will not raise an error if the client application tries to store values that are inconsistent with the behaviour of the entity the table represents. Data integrity is then violated without the user being informed, which makes erroneous results more likely.
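The trade-off can be sketched in a few lines of Python. The ranges are those of SQL Server's integer types; the helper name `fits` and the 'number of bedrooms' example are made up for illustration and are not part of any real system.

```python
# Value ranges of SQL Server's integer types
# (tinyint is unsigned, the other two are signed).
INTEGER_TYPES = [
    ("tinyint", 0, 255),            # 1 byte
    ("smallint", -32768, 32767),    # 2 bytes
    ("int", -2**31, 2**31 - 1),     # 4 bytes
]

def fits(type_name, value):
    """Return True if value can be stored in the named integer type."""
    for name, low, high in INTEGER_TYPES:
        if name == type_name:
            return low <= value <= high
    raise ValueError("unknown type: " + type_name)

# Hypothetical 'number of bedrooms' column: 300 is clearly bad data.
print(fits("tinyint", 300))  # False: a tinyint column would reject the value
print(fits("int", 300))      # True: an int column would accept it silently
```

With the narrower type the bad value surfaces as an error at insert time; with the wider type it ends up in the table unnoticed.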

3.9 Client side programming

Working as a professional software developer involves a lot of programming, so one could assume that I would be bored by developing such a basic client application. But my work mainly involves middle-tier and back-end database programming, so I took the time to develop something that uses some of the techniques described on this blog, with a view to enhancing my understanding of the DOM and other front-end technologies.

I intended to use the meta information of BBC pages to decide which URLs to display, but I stumbled across the same-origin policy, which prevents JavaScript from accessing pages on different sites. So the choice was to hold the URLs within the application, which was developed in two phases.

- Initially the user selection was filtered using pre-populated HTML select elements, with onchange event handlers attached to display the appropriate URL on the page. This still happens, but with a few modifications and additions described below.

- When Session 08 described the way modern search engines work, with inverted files and so on, I considered it a better challenge to try to emulate some of those techniques. I certainly didn’t expect to face so many peculiar challenges imposed by current browsers, such as:

The use of an XML data source for the drop-down lists and for a basic emulation of the IR techniques. Here I faced the different ways in which browsers load an XML document: Mozilla and recent versions of IE use the XMLHttpRequest object, whereas older IE versions use the ActiveXObject("Microsoft.XMLHTTP") approach (w3schools).
The use of XPath to extract XML element and attribute values, which greatly simplified traversing the XML documents. However, there were cross-browser issues in how XPath expressions are evaluated, so I implemented cross-browser functionality based on recommendations by w3schools.

More information about the application can be found in the commented code. This is the result:
http://www.student.city.ac.uk/~abhp626/BBCinfo.html

3.8 Information Retrieval

Surely everyone who has used Google’s search engine has at some point been left bewildered by the amount of information that even the most obscure query returns. Many times I’ve deliberately searched using made-up words and got back at least some loosely related results, leaving me with the warm feeling that there are other like-minded people out there who wonder at the art of Information Retrieval, or IR.

IR is the process of retrieving information relevant to a user’s requirements. The differences between querying for IR and querying an RDBMS stem from the way the information is organised: in the database environment, information is structured and related to the underlying business model, and the process is deterministic, so the same query run by different users against the same data retrieves the same results. In contrast, because the information is unstructured, exists in many different formats and media, and its relevance depends on the user’s subjective perspective, an IR query could return different results and is highly probabilistic.
Techniques used in IR include removing stop words, stemming, and identifying synonyms in order to create document indexes. A widely used type of index is the inverted file, which is a list of terms, each pointing to a list of the documents that contain it. Additionally, more complex queries can be constructed using Boolean operators like OR and AND (Macfarlane, A., Raper, J. & Dykes, J., Lecture 08: Information Retrieval).
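A minimal sketch of an inverted file in Python, under some simplifying assumptions: the stop-word list and the three sample documents are made up, and stemming and synonym handling are left out to keep it short.

```python
import re

# A tiny, made-up stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_inverted_file(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_and(index, *terms):
    """Ids of documents containing every term (Boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Ids of documents containing at least one term (Boolean OR)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.union(*sets) if sets else set()

documents = {
    1: "The weather in London is rainy",
    2: "London news and sport",
    3: "Sport on the radio",
}
index = build_inverted_file(documents)
print(sorted(search_and(index, "london", "sport")))  # [2]
print(sorted(search_or(index, "london", "sport")))   # [1, 2, 3]
```

The lookup never scans the documents themselves: each query term is resolved through the index, and the Boolean operators become set intersection and union.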

The algorithms used by modern search engines to retrieve information efficiently are closely guarded secrets, as with the Google ranking algorithm built around PageRank. The penetration of the web into modern life, and the need for brand recognition, have made search engine rankings more and more important (source: SEMPO Survey). As a consequence a new IT field has arisen, namely Search Engine Optimization (SEO), which aims to apply various ranking-improvement methodologies in order to draw more visitors to a client’s site.

3.7 Databases

Database Management Systems (DBMS) are software programs that can be considered the foundation on which any organisation using modern computer systems builds. Compared with the file-based approach used in the early days of computing, they offer considerable advantages in managing information by allowing simultaneous access to centrally stored and systematically organised data. Additional benefits of using a DBMS include data independence, reduced data redundancy, quick data recovery and the enforcement of security policies (Connolly & Begg, 2005).

Relational databases are the most widely used type of DBMS. Information is stored in data structures that can be visualised as two-dimensional tables, each table representing a distinct entity in a model of a system. A good database design, which maintains referential and relational integrity among the stored data, is fundamental to a successful implementation of an organisation’s system, and helps towards efficient data retrieval and scaling up. Using Structured Query Language (SQL), queries can be constructed which return data based on filtering criteria, possibly by joining two or more related tables (source: Butterworth, R., Lecture 07: Structuring and querying information stored in databases).

Modern databases (like MS SQL Server 2008) are now spatially enabled and can be used in the field of GIS. Alongside the native SQL data types, new spatial types allow geo-coded information to be stored and indexed. Spatial operations can then be performed on it and the results visualised on a map, so that useful information hidden within the data can be extracted (Longley, P., Goodchild, M., Maguire, D. and Rhind, D., 2005).

As an example, using data describing property locations and sale prices, and using postcodes to obtain latitude-longitude coordinates, a query can be constructed which returns, ready to be displayed on a map, the properties that have been sold for a price over a given amount:

-- Properties whose highest recorded sale price exceeds 220000, with their coordinates
SELECT p.PropertyID, p.FullAddress, p.PostCode, MAX(s.SalePrice) AS MaxPrice, c.Latitude, c.Longitude
FROM dbo.Property p
    JOIN dbo.PropertySale s
        ON p.PropertyID = s.PropertyID
    JOIN dbo.Coordinates c
        ON p.PropertyID = c.PropertyID
GROUP BY p.PropertyID, p.FullAddress, p.PostCode, c.Latitude, c.Longitude
HAVING MAX(s.SalePrice) > 220000

PropertyID	FullAddress	PostCode	MaxPrice	Latitude	Longitude
1012	22 Mornington Road	E11 3BE	285000	51.5696	0.0143
1013	42 Abbot's Park Road	E10 6HX	222000	51.5725	-0.0039
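To try this kind of query without a SQL Server installation, here is a rough, self-contained sketch using Python's built-in sqlite3 module. It is an approximation under stated assumptions: SQLite stands in for SQL Server (so there is no dbo schema), the columns are spelled FullAddress, Latitude and Longitude, and property 1014 plus the individual sale prices are made up so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Property (PropertyID INTEGER PRIMARY KEY, FullAddress TEXT, PostCode TEXT);
CREATE TABLE PropertySale (PropertyID INTEGER, SalePrice INTEGER);
CREATE TABLE Coordinates (PropertyID INTEGER, Latitude REAL, Longitude REAL);
-- Row 1014 and the individual sale prices are invented sample data.
INSERT INTO Property VALUES
    (1012, '22 Mornington Road', 'E11 3BE'),
    (1013, '42 Abbot''s Park Road', 'E10 6HX'),
    (1014, '1 Made-up Street', 'E10 0ZZ');
INSERT INTO PropertySale VALUES
    (1012, 250000), (1012, 285000), (1013, 222000), (1014, 150000);
INSERT INTO Coordinates VALUES
    (1012, 51.5696, 0.0143), (1013, 51.5725, -0.0039), (1014, 51.5700, 0.0);
""")

QUERY = """
SELECT p.PropertyID, p.FullAddress, p.PostCode, MAX(s.SalePrice) AS MaxPrice,
       c.Latitude, c.Longitude
FROM Property p
    JOIN PropertySale s ON p.PropertyID = s.PropertyID
    JOIN Coordinates c ON p.PropertyID = c.PropertyID
GROUP BY p.PropertyID, p.FullAddress, p.PostCode, c.Latitude, c.Longitude
HAVING MAX(s.SalePrice) > ?
ORDER BY p.PropertyID
"""

# The price threshold is passed as a parameter rather than pasted into the SQL text.
rows = conn.execute(QUERY, (220000,)).fetchall()
for row in rows:
    print(row)
```

Property 1012 has two sales, so the GROUP BY collapses them into one row carrying the higher price, while 1014 is filtered out by the HAVING clause.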

3.6 CSS

The success of the Internet led to more visually and semantically complicated web documents. As a consequence, the HTML code describing a web page became cluttered with styling information, which had an adverse effect on the quality of the code. Additionally, the need to maintain common aesthetics across related web pages led the W3C to create the DOM (Document Object Model) and CSS (Cascading Style Sheets) standards to encourage better and more efficient web design.

The DOM is a concept that takes advantage of the XML structure of XHTML markup and provides ways to traverse, access, extract and set properties of the elements included in an HTML page.
It treats every HTML document as a tree of hierarchically arranged elements as its nodes, which in turn can have more nodes as children or can contain text (REF). Different web technologies (like JavaScript) implement the DOM in different ways, but the principles are the same.
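The tree view can be illustrated with Python's standard xml.dom.minidom module, which is one DOM implementation among many. The tiny page and the `outline` helper are made up for this sketch.

```python
from xml.dom import minidom

# A small, well-formed page, invented for this example.
page = minidom.parseString(
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text</p></body></html>"
)

def outline(node, depth=0):
    """Return one indented line per element or text node, depth-first."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

tree = outline(page.documentElement)
print("\n".join(tree))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the tree, with text appearing as leaf nodes under the element that contains it.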

CSS uses a special syntax to select elements inside an HTML page and apply visual styles to them, instructing the web browser how to display them. It helps to separate style from content by allowing a single CSS file to be used by multiple web pages on a site, so that the look of all of them can be changed at the same time. This also reduces network traffic, as the browser caches the files and doesn’t need to download them every time. (REF)

Not all web browsers interpret CSS rules the same way. Older versions of Internet Explorer have very limited support, which resulted in scrambled page styling. Web developers have found and implemented workarounds to bypass browser-specific limitations; however, this sometimes leads to cryptic and unintelligible CSS code, violating the principle of separating style from content.

On my personal web space I’ve used CSS to provide a consistent look and feel, and there are examples of CSS inheritance and override via the class and id attributes.

3.5 XML

XML... is so simple and elegant as a concept that one now wonders how it is possible that it was not invented years and years earlier. In fact, it is very hard to think of any computer data that cannot be described by some XML structure. The W3C created XML to allow web documents to be easily interpreted by humans and computers alike, and it is now widely used for the representation of arbitrary data structures, e.g. in web services, as well as being the underlying model of several data formats for desktop applications (such as the docx format of Microsoft Office 2007, or the formats used by Sun's OpenOffice).

XML is not a programming language per se. It is a descendant of SGML, with user-created mark-up and tags which give documents the semantics they need to be successfully interpreted by the applications that use them, thus greatly enhancing the interoperability of different computer systems. To this end, Document Type Definitions (DTD) and XSD schemas define the structures on which XML documents are based and against which they are validated, with the latter providing additional support for data definition and data types.

On the web front, XHTML, an XML-based version of HTML, added well-formedness and case-sensitivity restrictions. As a consequence, XHTML documents can be processed using standard XML tools like XPath (used to traverse the logical structure of an XML document in order to query and extract the encapsulated data) and the XSL stylesheet language, which allows the transformation of an XML document into another.
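As a small illustration of XPath-style querying, the sketch below uses Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The XML document, its element names and the example.org URLs are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up XML data source: pages grouped by topic.
XML_SOURCE = """
<pages>
  <page topic="news"><title>UK News</title><url>http://example.org/news/uk</url></page>
  <page topic="news"><title>World News</title><url>http://example.org/news/world</url></page>
  <page topic="sport"><title>Sport</title><url>http://example.org/sport</url></page>
</pages>
"""

root = ET.fromstring(XML_SOURCE.strip())

# Titles of every page whose 'topic' attribute equals 'news'.
news_titles = [t.text for t in root.findall("./page[@topic='news']/title")]

# URL of the first page about sport, searched anywhere below the root.
sport_url = root.find(".//page[@topic='sport']/url").text

print(news_titles)  # ['UK News', 'World News']
print(sport_url)    # http://example.org/sport
```

The path expression does the filtering, so no manual loop over child nodes is needed to pick out the matching elements.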

Finally, modern databases (like MS SQL Server 2005 or Oracle) have incorporated XML technology by supporting native XML data types and integrated versions of tools like XQuery, which allows access to and navigation of XML documents based on XPath 2.0.

3.4 Images and Graphics

Ever since the home computers of the early 80s, one factor has been decisive whenever computers were compared: which one has the best graphics. Even today, the need for ever better representation and manipulation of graphics on our computer screens is one of the driving forces behind computer technology (Graphics hardware, England, N., Computer Graphics and Applications).

In computer applications, graphics are represented digitally using vector or raster formats. The vector format uses points in space which may be connected by lines. We use vectors to represent graphics which are scalable and have well-defined boundaries. Graphic and 3D design artists use software that handles complex vectors to design their artefacts, and GIS applications use vectors to represent discrete objects like rivers or to create different layers of information overlaid on raster maps.

On the other hand, the raster format can represent heavily detailed images. Imagine a grid where each cell (some may even call it a pixel) contains a binary value holding the colour of the cell. A raster file is a series of such values, preceded by a header specifying the grid's dimensions.
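The idea of a header followed by cell values can be sketched as follows. The layout used here (two 16-bit integers for width and height, then one byte per cell) is invented for illustration and is not any real image format.

```python
import struct

def pack_raster(rows):
    """Serialise a grid of 0-255 cell values: header first, then cells row by row."""
    height = len(rows)
    width = len(rows[0])
    header = struct.pack(">HH", width, height)  # two unsigned 16-bit integers
    body = bytes(value for row in rows for value in row)
    return header + body

def unpack_raster(data):
    """Rebuild the grid by reading the dimensions from the 4-byte header."""
    width, height = struct.unpack(">HH", data[:4])
    cells = data[4:]
    return [list(cells[r * width:(r + 1) * width]) for r in range(height)]

grid = [
    [0, 255, 0],
    [255, 0, 255],
]
data = pack_raster(grid)
print(len(data))                    # 10: 4 header bytes + 6 cell bytes
print(unpack_raster(data) == grid)  # True
```

Without the header the reader could not tell a 3 x 2 grid from a 2 x 3 one, since the cell bytes alone are just a flat sequence.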

Compression techniques have evolved to help the distribution of large raster files over networks. Formats like the 8-bit GIF use an index into a palette of up to 256 colours to hold pixel information and recreate the image on the screen, and compress the result losslessly. Web designers can create complex backgrounds from small GIFs by repeating them across the web page. The lossy 24-bit JPEG format produces smaller file sizes by discarding some of the image information using sophisticated algorithms, and is widely used to store complex imagery such as photographs.
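The indexing idea behind palette-based formats can be sketched as below. This only shows the palette lookup, not GIF's actual file layout or its additional LZW compression; the function name `to_indexed` and the six sample pixels are made up.

```python
def to_indexed(pixels):
    """Replace each (r, g, b) pixel by an index into a palette of distinct colours."""
    palette = []   # distinct colours, in order of first appearance
    lookup = {}    # colour -> position in the palette
    indexes = []   # one small integer per pixel
    for colour in pixels:
        if colour not in lookup:
            lookup[colour] = len(palette)
            palette.append(colour)
        indexes.append(lookup[colour])
    return palette, indexes

RED = (255, 0, 0)
WHITE = (255, 255, 255)
pixels = [RED, RED, WHITE, RED, WHITE, WHITE]

palette, indexes = to_indexed(pixels)
print(indexes)  # [0, 0, 1, 0, 1, 1]

# Storage at one byte per channel and one byte per index:
raw_size = len(pixels) * 3                       # 18 bytes
indexed_size = len(palette) * 3 + len(indexes)   # 6 + 6 = 12 bytes
print(raw_size, indexed_size)  # 18 12

# The original pixels can be recovered exactly, so nothing is lost.
restored = [palette[i] for i in indexes]
print(restored == pixels)  # True
```

The saving grows with the image size, because the palette is stored once while every pixel shrinks from three bytes to one.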

The PNG format tackles both problems, large files and data loss, by compressing losslessly; it supports 24-bit colour as well as indexed palettes like the GIF. It is very useful for web galleries where JPEG is used to create thumbnails but the user downloads a high-quality PNG.

Source: Butterworth, R. & Dykes, J., Lecture 04: Graphical information.

3.3 Internet/WWW

With one breath: a world-wide network of networks of networks, built on a client-server architecture, which uses existing lines of communication to exchange digital messages split into 'packets' of data and transmitted through network controller hardware using special protocols like HTTP, FTP, Telnet and IPv4, enabling multiple types of computers (ranging from powerful, to less powerful, up to totally useless) to connect and communicate with each other - mostly sharing hyperlinked text and multimedia documents, or increasing computing power by combining processors from a number of machines 'over the cloud'. The DNS service translates human-readable domain names into the IP addresses of the servers on which web applications are hosted, offering various services ranging from the ultra-professional enterprise scale down to the most basic HTML page with blinking text, and allowing software clients such as Firefox or Internet Explorer to access them through the aforementioned communication channels using URLs (Internet - Wikipedia, the free encyclopedia).
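To make the parts of a URL concrete, this short sketch splits the address of my course web page with Python's urllib.parse. No network access is involved; the DNS lookup of the host name is deliberately left out so that the example runs offline.

```python
from urllib.parse import urlsplit

url = "http://www.student.city.ac.uk/~abhp626/index.html"
parts = urlsplit(url)

print(parts.scheme)    # http  (the protocol the client should speak)
print(parts.hostname)  # www.student.city.ac.uk  (the name DNS resolves to an IP address)
print(parts.path)      # /~abhp626/index.html  (the resource requested from that server)
print(parts.port)      # None  (no explicit port, so HTTP's default of 80 applies)
```

A browser performs the same decomposition before it resolves the host name and sends the request for the path.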

Working as a software developer for a company that builds web applications, I've added my bit of HTML to some of the zillions of pages existing on the web right now, although most of the time deployment is an automated process and I don't think too much about it. For this part of my MSc course, though, to set up a simple web site, I’ve used all of the above protocols and technologies in this order:

I created three basic HTML pages with links to other websites; in one page I've used Google Earth's API and the iframe tag to show off my favourite place in the world, which I thought made sense considering I'm doing an MSc in Geographic Information Science. I then used FTP (via SSH) to upload these files to my City file space, and published the pages using Telnet and the UNIX console: http://www.student.city.ac.uk/~abhp626/index.html

3.2 Text/HTML

All information used in computer systems is transformed into a series of bits, 0s and 1s, each represented by an electromagnetic state and stored on digital media. A series of bits, called a byte, is the smallest addressable element of a given computer architecture and can represent anything we want it to, be it an alphanumeric character or a pixel in a bitmap image (Butterworth, R. & Dykes, J).
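A quick way to see this in practice is to look at the bytes and bits behind a short piece of text. The sketch assumes the ASCII encoding, in which each character occupies exactly one 8-bit byte.

```python
text = "GIS"

# Encode the characters as bytes, then show each byte as eight bits.
data = text.encode("ascii")
bits = [format(byte, "08b") for byte in data]

print(list(data))  # [71, 73, 83]
print(bits)        # ['01000111', '01001001', '01010011']

# Going the other way: the same bits interpreted as ASCII give the text back.
decoded = bytes(int(b, 2) for b in bits).decode("ascii")
print(decoded)     # GIS
```

The same three bytes could equally be read as three grey-scale pixel values; it is the agreed interpretation, not the bits themselves, that makes them text.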

Collections of digital information, referred to as files, contain series of bytes, and computers interpret them according to rules known as file formats. The operating system keeps track of the files' physical locations on the disk, enabling users to access them via an interface. Additional data, called metadata, exist within files and provide information about their contents. Search engines categorise web pages using metadata stored in HTML meta elements (Butterworth, R. & Dykes, J).

But information is not just text; it can also be presented in the form of graphics. Different technologies provide mechanisms for presenting graphical information in electronic documents. Embedding, or the file-centred view, includes all graphics and external data in the binary file itself, which makes the document easier to distribute across different environments. Linking, or the document-centred view, is used in a local environment and keeps external data linked, so that changes to the linked data are reflected in the container document (Butterworth, R. & Dykes, J).

On the Internet, the usual approach is document-centred: web pages contain links to files on the server's filesystem, and the browser then requests each image from there over HTTP. Other options take a file-centred view via data URLs: instead of linking to an image stored on the server, the image is provided within the URL itself as a base64-encoded string. A drawback of this method is that the browser does not cache the image separately and downloads it every time it is used, but this can be mitigated with the use of CSS (http://www.sveinbjorn.org/dataurls_css).
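Building a data URL takes only a few lines. In this sketch the helper name `to_data_url` is made up, and a short text payload stands in for the image bytes to keep the output readable; for a real image the payload would be the file's contents and the MIME type something like image/png.

```python
import base64

def to_data_url(data, mime_type):
    """Wrap raw bytes in a base64-encoded data URL of the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:" + mime_type + ";base64," + encoded

payload = b"Hello"  # stand-in for the bytes of an image file
url = to_data_url(payload, "text/plain")
print(url)  # data:text/plain;base64,SGVsbG8=

# Decoding the part after the comma gives the original bytes back.
print(base64.b64decode(url.split(",", 1)[1]) == payload)  # True
```

Base64 turns every 3 bytes into 4 characters, so an embedded image is roughly a third larger than the file it replaces.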

3.1 Introduction

Once upon a time, the Greek philosopher Aristotle, in his work Organon, suggested a system of logic based on only two types of propositions: true and false. More than two thousand years later, George Boole gave symbolic form to Aristotle's system of logic by codifying relationships between mathematical quantities limited to one of two possible values: 1 or 0. Claude Shannon of MIT recognized how Boolean algebra could be applied to on-and-off circuits, where all signals are characterized as either "high" (1) or "low" (0). The modern binary system, documented by Gottfried Leibniz in the 17th century, uses 0 and 1 to handle any problem that decimal arithmetic can deal with. Computers were born.

Since then a lot of water has run under the bridge, and now we are living in the Information Age. The Internet and the technologies behind it have transformed the way people access information, and this has had a tremendous effect on every aspect of human life, from the personal to the professional and more (Stanford Inst. for the quantitative study of society). The blog is one of the technologies that have contributed to that (Du, H.S. & Wagner, C., 2006. Weblog success: Exploring the role of technology).

This blog was created as part of the coursework for the DITA module of my MSc course on Geographic Information Systems. I've chosen the http://www.blogger.com/ service to host it, because it is one of the few blog services I know that many people use, and because I find it easy to use in general. It is quite fast and gives you powerful tools to create, edit, administer and modify the contents of your blog at many levels: a novice user can alter the blog's contents to achieve a more professional or personal look and feel, and a web developer can add client scripts that enhance the overall functionality and impression of the blog. I'll let you explore it, in the hope that my entries will do justice to it.