Friday, September 25, 2009

Language Processing: Defining a Word - 01.01


Preface:
                When I began studying programming and computer science in general, I discovered that two major problems hold back modern programmers: representing data on a modern system, and storage space. However, because this will be a conceptual analysis of topics and not one of practical efficiency, let us assume that we have infinite storage.
                In my opinion, representing data is the most important cognitive problem in computer science. If we were able to fully quantify a person, then a computer would be able to think and process information just as we do. It would be able to truly learn and grow, and to create instead of just copying and manufacturing. It all comes down to forming connections. The human brain is made of trillions of connections that help us do everything from processing simple stimuli to adapting to a changing environment and creating new information. Once we're able to automate this, the possibilities for computing will be limitless.

But let's hop off the soapbox and start the very first issue of Binary Anomalies.

                Language processing has always been an interesting concept to me, so this is as good a place to start as any. Any language is composed of the same basic components:
·         Words which have meanings and connections to other words
·         Patterns of words which form new meanings and connect to other patterns
·         Groups of patterns which form a complete meaning. These point to words, patterns, and groups.
                These will in fact be my first three Binary Anomalies (in order). Now, anyone who has studied computer languages might be thinking, "That sounds like a circular multiply-linked list," and you would of course be right. This is a very, very large circular multiply-linked list that contains all the information and connections for every word, phrase, and pattern in any given language. For my non-technical readers, let us consult this diagram.
                This diagram shows a very simple circular doubly-linked list connecting Earth, New York, and People, in that order. It states that the Earth contains New York, which contains people. It is circular because the first entry, or node, "Earth" has a connection, or pointer, to the last entry, "People". It is considered a doubly-linked list because each entry has a connection to the next entry and also to the previous entry. Now, here is the important part: each connection between entries has a word or a phrase attached to it. This is the single most important part of my entire adventure into language processing. These words and phrases are the basis for how the computer will know what to do with the information it receives.
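For readers who prefer code to diagrams, the list described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the names Node, build_circular_list, and describe are hypothetical, and the label "live on" for the link from People back to Earth is invented for the example.

```python
class Node:
    """One entry in the list, for example "Earth"."""

    def __init__(self, name):
        self.name = name
        self.next = None        # pointer to the following entry
        self.prev = None        # pointer to the preceding entry
        self.next_label = None  # word or phrase attached to the link to `next`


def build_circular_list(names, labels):
    """Link the names into a circle; labels[i] describes names[i] -> names[i + 1]."""
    nodes = [Node(name) for name in names]
    count = len(nodes)
    for i, node in enumerate(nodes):
        node.next = nodes[(i + 1) % count]  # the last entry wraps to the first
        node.prev = nodes[(i - 1) % count]  # the first entry wraps to the last
        node.next_label = labels[i]
    return nodes[0]


def describe(head):
    """Walk the circle once and return each labeled connection as a sentence."""
    sentences = []
    node = head
    while True:
        sentences.append(f"{node.name} {node.next_label} {node.next.name}")
        node = node.next
        if node is head:
            break
    return sentences


# The "live on" label is an assumption made for this sketch.
earth = build_circular_list(
    ["Earth", "New York", "People"],
    ["contains", "contains", "live on"],
)
for sentence in describe(earth):
    print(sentence)
```

The point of the sketch is that the label lives on the connection rather than on the entry, so walking the list yields statements and not just a sequence of names.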
                Let's expand. More specifically, let's expand using the example "Let's expand". When you first read the words "let's expand", you didn't have to stop and think about the meaning of this phrase, at least not in the way that we perceive thinking. Your brain automatically pulled up the meaning of each word and formed the most likely logical meaning of the phrase in the context in which I used it. To do this, you must first understand the meaning of each individual word, then the link between these words and the other words they're being used with, and then the overall context of the entire phrase. You already knew what the words meant, you knew how the words connect to each other, and you received the context of the phrase when I introduced my opinion before stating "let's expand". These are all things that must be accessible to the machine if it is to process language the way people do.
                The tricky issue is that my example is flawed: I previously stated that it must be a multiply-linked list for this to work, and no real example in any given language will ever be as simple as this one. For instance, a proper chain from Earth to people may look like this: Earth - North America - America - New York - Albany - 55 Main Street - People. And of course, "people" should be linked to every location. There are people on Earth, in Albany, and most likely at 55 Main Street (although I haven't checked that last one). Now, imagine how big this list gets once you include every place on Earth. Luckily, there is a solution to this problem in some cases. Sometimes it is simply easier to list the places where things are not connected, and "people" is a great example of this case.
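To give a feel for how quickly the pointers pile up, here is a rough sketch of the longer chain as a multiply-linked structure. The class WordGraph, its methods, and the label "are found in" are all hypothetical names chosen for this illustration, not part of any existing library.

```python
class WordGraph:
    """A multiply-linked structure: each word may point to any number of others."""

    def __init__(self):
        # links[source][target] = word or phrase attached to that connection
        self.links = {}

    def connect(self, source, target, label):
        self.links.setdefault(source, {})[target] = label

    def targets(self, word):
        """Return every word that `word` points to, in insertion order."""
        return list(self.links.get(word, {}))

    def pointer_count(self):
        """Total number of pointers stored in the whole structure."""
        return sum(len(targets) for targets in self.links.values())


chain = ["Earth", "North America", "America", "New York", "Albany", "55 Main Street"]

graph = WordGraph()

# Each place contains the next, more specific place.
for outer, inner in zip(chain, chain[1:]):
    graph.connect(outer, inner, "contains")

# "People" needs its own pointer to every single place.
for place in chain:
    graph.connect("People", place, "are found in")

print(graph.targets("People"))
print(graph.pointer_count())
```

With only six places, "People" alone already accounts for six of the eleven pointers, and that share keeps growing with every place added. This is the growth that listing only the exceptions is meant to avoid.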
                For my technical readers: this could be accomplished by weighting each pointer, which I feel needs to be done in either case. People - Albany should be weighted more heavily than People - Earth by simple virtue of how likely a given connection is to occur. By making a weight negative and assigning a special "this node has all connections" modifier to a specific node, we can accomplish this exception without filling the list with countless new pointers, which helps maintain the list's integrity and speed.
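One possible reading of this idea, sketched under my own assumptions: a node flagged as having all connections is treated as linked to everything at some small default weight, explicit positive weights override that default, and a negative weight marks a place where the connection does not exist. The names WeightedNode, has_all_connections, and default_weight, the specific numbers, and the "The Sun" exception are all invented for the example.

```python
class WeightedNode:
    """A word whose pointers carry weights instead of being simply present or absent."""

    def __init__(self, name, has_all_connections=False, default_weight=0.1):
        self.name = name
        # The special "this node has all connections" modifier.
        self.has_all_connections = has_all_connections
        # Weight assumed for any target that has no explicit pointer.
        self.default_weight = default_weight
        self.weights = {}  # target name -> weight

    def link(self, target, weight):
        """Store an explicit pointer; a negative weight records an exception."""
        self.weights[target] = weight

    def weight_to(self, target):
        if target in self.weights:
            return self.weights[target]
        # Unlisted targets only count if the node connects to everything.
        return self.default_weight if self.has_all_connections else 0.0

    def is_connected(self, target):
        return self.weight_to(target) > 0


people = WeightedNode("People", has_all_connections=True)
people.link("Albany", 0.9)     # a likely, specific connection
people.link("Earth", 0.3)      # true, but less likely to be the one meant
people.link("The Sun", -1.0)   # exception: people are not connected here

albany = WeightedNode("Albany")  # an ordinary node without the modifier
albany.link("55 Main Street", 0.8)

print(people.is_connected("Tokyo"))     # never listed, covered by the modifier
print(people.is_connected("The Sun"))   # explicitly excluded
print(albany.is_connected("Tokyo"))     # ordinary nodes need explicit pointers
```

In this sketch "People" stores three pointers no matter how many places exist, so the cost grows with the number of exceptions rather than with the number of places.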
                Lastly, and most difficult: definitions of words. My current method of linking words works well for defining objects and actions but falls short when attempting to define things that are immaterial or intrinsic, such as "honor," "respect," or "fairness." Unfortunately, as I mentioned in my preface, this is one of the biggest issues facing computer scientists. We can come close, however. Many of these concepts can be related to something concrete (if in a roundabout way). Fairness in some cases can be related to equality, and although the full meaning of the word gets lost in the translation, it's a great place to start.
              Thanks for reading the very first issue of Binary Anomalies! My next entry, "Pattern Recognition," will be up next Friday. Also, please feel free to leave your own personal Interpretation (comment) below. Remember, there is
                -Brian Layne