Module 1: Language elements

  • are programming languages real languages?
  • what is the basic syntax of languages like C++?

Programming languages

As we go about learning how to design algorithms and abstractions, we need a language to describe our work. In Statics we’d use the language of mathematics and the visual language of the free-body diagram. In computer programming, we will use a textual language to describe our designs, and this textual language will be converted by a tool called a compiler into the specific instructions that a CPU knows how to execute. Programming languages genuinely are languages, but they are synthetic languages (like mathematics) rather than natural languages.

Programming languages really are languages

All languages have (at least) two central things to worry about:

Syntax
The rules of what makes for valid (“well-formed”) language. Natural languages have constructs like subjects (“I”, “you”, “they”), objects and verbs that can be combined into sentences. This sentence you are currently reading is a well-formed sentence. However, so is this one: ’Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe. That is a valid, well-formed sentence: you can identify subjects, objects, verbs, adjectives, etc., all of which follow the rules about word order and subject-verb agreement. Nonetheless, the sentence is nonsense. To make sense of language, we need more than syntax.
Semantics
The meaning of language. Semantics tell us what things mean, often including whether or not we agree that the words mean the same thing. “My dog is a pilot” is a syntactically valid sentence, but is semantically rather meaningless. On the other hand, the phrase “the pilot is high” has more than one possible interpretation… with rather dramatic potential consequences!

This distinction between syntax and semantics is also important for programming languages. The compiler (the tool mentioned above that converts our textual programs into instructions for the computer to execute) will spot syntactic problems with our programs and give up with an error message. For example, if you tried to type the following text in the middle of the C++ code for your next lab:

this sentence is not valid C++!

the compiler would print an error message and stop — it is syntactically invalid. However, if we wrote a program to calculate the average value of some data, but we forgot to divide by the number of elements (i.e., we calculated $\sum X$ rather than $\frac{\sum X}{n}$), it could be syntactically well-formed but semantically incorrect: the program would look like a valid program, but it would output incorrect results.

Programming languages are not natural languages

Although programming languages are real languages, we did say they are not natural languages (which evolve naturally over time) — they are synthetic languages, designed by engineers and computer scientists. This has some implications:

Good news
They have a very small, closed vocabulary and very regular rules.
Bad news
They are fussier about grammar than an eighth-grade English teacher!

Our objective is to get you familiar enough with C++ to write meaningful sets of instructions for computers. This will require a dive into rules of programming language syntax, so that we can figure out how to write well-formed C++ and express the semantics that we want the computer to understand. This is a two-step process:

  1. get the grammar right (syntax), then
  2. make it meaningful (semantics).

So without any further ado… to syntax!

C++ syntax

The syntax of the C++ programming language (or any language, for that matter!) can be viewed at different levels of abstraction. At a very low level, there are characters that come from an alphabet of acceptable characters. These characters are combined into tokens, a higher-level abstraction in programming languages. Once we’ve checked that a token is made of valid alphabet characters, we can start thinking about higher-level questions than “what characters are there”: we can start to think about the different semantics of different kinds of tokens.

Tokens can be used to build expressions and expressions can be incorporated into statements. When we talk about statements in a programming language, we are typically interested in answering questions like, “what does this statement do?”, while paying less attention to the sequence of characters contained in the statement. This is another example of abstraction: participating in a conversation about abstract things (statements) while ignoring some of the concrete details (characters, rules about tokens, etc.).

Here is an example of a C++ statement that contains one expression and is constructed from tokens, which are themselves constructed from characters in the C++ alphabet:

[Figure: Constructing statements from expressions, tokens and alphabet characters]

The alphabet

Every language uses an alphabet to represent its words. Most European languages use the Latin alphabet, although there are some variations with accents (e.g., “crêpe”, “Straße” or “římský”). Other languages use entirely different alphabets, such as the Cyrillic alphabet (“кириллицы”), the Hangul alphabet (“한글”) or Devanagari script (“संस्कृतम्”).

Most programming languages, including C++, use a Latin-derived alphabet that includes:

  • uppercase Roman letters (A-Z),
  • lowercase Roman letters (a-z),
  • the digits 0-9,
  • any combination of newline, tab and/or space characters (collectively known as whitespace) and
  • a specific set of symbols available on almost every keyboard: < > { } ( ) : ; , . ? / * & + - ! ^ ' " = | _

Tokens

Natural languages combine characters from the alphabet to form words. The most basic unit of a programming language is called a token. Tokens can include simple words or operators (e.g., the less-than sign <). Crucially, tokens do not contain whitespace (the one exception is text inside quotation marks, as in the character and string literals described below). The following are all C++ tokens:

while window { x George +

There are five tokens in the following expressions, no matter how much whitespace we add:

(x+b)
     (
           x
           +
           b
     )

There are effectively four kinds of tokens:

keywords
words defined for the language. Basically, its vocabulary.
identifiers
words created by the programmer as names for things.
symbols
Sometimes single characters as in { + * and sometimes pairs as != or /*.
literals
A constant written directly into the text such as 3.14159 or "Hello". See below.

Keywords

Keywords are sometimes called reserved words as programmers may not use them as identifiers. They are reserved for the language. There are only about a hundred keywords in C++ and we won’t be using all of them. Here is a list of the ones we will use:

int double char bool if else for while do using namespace return void true false const

And that’s all.

Identifiers

Identifiers are the names that we, as programmers, give to things. For example, if we want to create a variable called $x$, we can use an identifier x as a variable name. Identifiers must conform to the naming rules:

  • they can include alphabetic characters (a-z, A-Z)
  • they can include digits (0-9)
  • they can include the underscore character (_)
  • they may not start with a digit
  • they should not start with an underscore

Here are some legal identifiers:

x y i circle arg22 thisIsAVeryLongIdentifier
swap Swap PI DEFAULT_WINDOW _legalButDontUse

C++ is case sensitive so swap and Swap are different names. It is also possible to choose illegal or poor identifiers for things:

1x
thisIsAnIncrediblyLongIdentifier1 thisIsAnIncrediblyLongIdentifier2
default-window
1x
starts with a digit (illegal)
default-window
includes a - (illegal)
thisIsAnIncrediblyLongIdentifier1 thisIsAnIncrediblyLongIdentifier2
legal but too long (and, in fact, some compilers may ignore everything past character 31, making them indistinguishable)

Name Style Rules

Here are some style rules we will use for names.

  • names for constants should be uppercase, as PI
  • use underscore to separate words in constant names, as DEFAULT_WINDOW
  • other names should start with a lowercase letter, e.g., circle or starshipEnterpriseVoyage (often referred to as camelCase because of the capitals in the middle)

Symbols

C++ is rich in symbols, but just a few of them will get you most of the way to understanding most well-written C++ code. Some symbols are formed from a single character (e.g., +), but others are more complex (e.g., ++ or +=). Here are the symbols we will use in this course:

+ - * / % = == != < > <= >= { } ( ) [ ]
&& || ! ++ -- /* */ // ; , . += -= *= /= << >>

Literals

Another kind of token is a literal, whose value is literally what it says in the text of the source code. We will see four kinds of literals in our programs starting out:

integer literals
standard integers such as 1, 2, -17 or 2056. There are also some special ones such as 0xff, the hexadecimal (base 16) representation of 255.
double literals
real number literals using decimal notation (3.0, 3.14159 or -17.65) or floating point (exponential) notation (1.0e11 or -23.6e-3).
character literals
single characters such as 'a', 'x', 'H', '!' or '\n'.
string literals
a sequence of characters such as "hello world!\n", "programming" or "The quick fox jumped over the lazy dogs."

Note that a single character is always enclosed in single quotes while a sequence of characters (known as a string) is enclosed in double quotes.

Expressions

Just like in mathematics, an expression is something that can be evaluated, i.e., reduced to a single value. Simple values like 4 or the variable x are expressions, since they already are a single value. So are slightly more complex expressions like 4 + 1 or x - 2. So are quite complex expressions like 2 * pow(x, 3) - x + 17/t (the meaning of which you do not need to understand today).

Many expressions in programming languages look a lot like mathematical (or arithmetic) expressions, using some of the same operators:

Symbol Meaning
+ addition, e.g., x + 2
- subtraction, e.g., x - 2
* multiplication, e.g., 2 * x (vs the mathematical $2 \times x$ or $2x$)
/ division, e.g., x / 2 (vs the mathematical $x \div 2$ or $\frac{x}{2}$)
() evaluate the expression in parentheses first

As with mathematical expressions, the order of operations is significant. We will define this more precisely in a future lecture, but for now, multiplication and division happen before addition or subtraction and expressions in parentheses are evaluated first.

Statements

The equivalent of a sentence is a statement. A statement effectively tells the computer to do something. Statements are always terminated in C++ by a semicolon, e.g.:

x = y + z - 4;
cout << "Hello world!\n";

In this example, there are two statements, both of which contain some tokens. The whitespace is not significant. In fact, it’s not even necessary to stay on a single line:

x = 24 * gorganzola
    - 14 * emmenthaler
    + 17 * balderson;
cout  <<
	"The quick brown fox jumped over the lazy dogs\n";

are both legal.

Blocks

The C++ equivalent of a paragraph is a block: a set of statements enclosed in a pair of curly braces ({ and }). We can turn the above statements into a block as follows:

{
   x = y + z - 4;
   cout << "Hello world!\n";
}

It is considered good style to indent statements inside a block.

Comments

Comments are for people: the compiler ignores them, but other programmers can read them. This makes them a major mechanism for documenting programs. They come in two flavours:

// A single line comment. Ends at the end of the line
/*
 * Comments enclosed in slash-asterisk ... asterisk-slash can
 * extend over several lines
 */

(c) 2009–2016 Michael Bruce-Lockhart, Theo Norvell, Dennis Peters and Jonathan Anderson. Licensed under a Creative Commons Attribution–Noncommercial–Share-Alike 2.5 Canada License. Permissions beyond the scope of this license may be available at theteachingmachine.org.