New Publications of Linguistic Data Consortium
Author: admin Date: 2010-01-26
In this newsletter:
- Newly Expanded Press Release Section -
- Upcoming LDC Institute Seminar -
New Publications:
LDC2010T02
- Czech Broadcast News MDE Transcripts -
LDC2010T03
- GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 -
LDC2010T01
- NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations -
--------------------------------------------------------------------------------
Newly Expanded Press Release Section
Recall reading a newsletter article about the Reduced Licensing Fee but unsure what you did with the email? Curious as to which organization was the recipient of LDC's 15,000th corpus distribution nearly eight years ago? If so, be sure to visit LDC's newly expanded Press Release section on our What's New! What's Free! page to read about these topics and more. The Press Release section includes the articles of previous newsletters as well as major announcements from LDC. Information is organized into the following categories:
15th Anniversary Monthly Spotlight Archive - as part of our 15th Anniversary celebration in 2007, we highlighted one aspect of the LDC in our monthly newsletters. These features provided our members and data users with a glimpse of the broad range of LDC’s research activities.
Conference Attendance by LDC - recent publisher displays and conference participation by LDC.
Etc. - recent collaborations and grant awards plus other announcements.
Membership Mailbag Archive - to address the questions that our data users have asked, we introduced our Membership Mailbag series of newsletter articles in May 2008. This periodic series addresses frequently asked questions about LDC data, the LDC Intranet, and the benefits of an LDC membership.
Member Surveys - LDC conducted two end-of-year surveys to obtain feedback on satisfaction levels with LDC Membership and data releases as well as our corpus catalog, and to gather suggestions on future publications.
Milestones and Celebrations - information on our landmark corpora distributions and events to celebrate our 10th and 15th anniversary years.
Use of LDC Corpora in University Summer Schools - ways LDC corpora have been used for teaching purposes at university summer school programs.
The Press Release section will be updated as new announcements are made so we anticipate that this will be a great resource for information about LDC.
Upcoming LDC Institute Seminar
The LDC Institute will hold its next session on Tuesday, January 26, 2010, from 10:00 a.m. to 12:00 p.m. in the LDC Conference Room at LDC's Philadelphia offices, 3600 Market Street, Suite 810.
The topic of this session will be the U.S. Supreme Court Corpus (SCOTUS) presented by Daniel Katz, J.D., M.P.P., Fellow in Empirical Legal Studies, Michigan Law School, PhD Candidate, Political Science and Public Policy, University of Michigan, and Michael Bommarito, PhD Student, Political Science: Methods & Modeling, University of Michigan.
ABSTRACT:
The corpus of Supreme Court written opinions is a rich linguistic resource. Not only does this corpus provide a longitudinal sample of formal American English, but it is also a source of text with identified authors and vote-coded sentiment. Despite this value and years of qualitative and quantitative material of the United States Supreme Court, no compiled corpus of these opinions is currently available to researchers. The purpose of this talk is (1) to describe efforts to compile both the complete corpus of Supreme Court Opinions and associated metadata, (2) to outline a number of our current research projects utilizing this data, and (3) to discuss any criticism, potential projects, or possible collaboration.
Refreshments will be provided. If you are in the area, we hope to see you there!
New Publications
(1) Czech Broadcast News MDE Transcripts was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic. It consists of metadata extraction (MDE) annotations for the approximately 26 hours of transcribed broadcast news speech in Czech Broadcast News Transcripts (LDC2004T01). The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast News Speech (LDC2004S01). Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech broadcast data: Czech Broadcast Conversation Speech (LDC2009S02), Czech Broadcast Conversation MDE Transcripts (LDC2009T20), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).
The audio recordings were collected from February 1, 2000 through April 22, 2000 from three Czech radio stations and two television stations. The broadcasts included both public and commercial subjects and were presented in various styles, ranging from a formal style to a colloquial style more typical for commercial broadcast companies that do not primarily focus on news.
The goal of MDE research is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, standardized spelling and sensible conventions for representing speaker turns and identity are further elements in the readable transcript.
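The cleanup steps described above can be illustrated with a minimal sketch. It is not the method used to build this corpus: the English filler lists, the `<SU>` boundary token and the function name `make_readable` are all hypothetical, chosen only to show the idea of dropping filled pauses and discourse markers and placing each sentence-like unit on its own line. Removal of disfluent sections is not attempted here.

```python
import re

# Hypothetical English examples; the corpus itself is Czech and has its
# own annotation scheme for fillers and discourse markers.
FILLED_PAUSES = {"uh", "um", "er"}
DISCOURSE_MARKERS = ("you know", "i mean")

# Hypothetical token marking a boundary between sentence-like units.
BOUNDARY = "<su>"


def make_readable(raw):
    """Return one cleaned sentence per list item from a raw transcript string."""
    # Lowercasing simplifies matching but loses proper-noun capitalization.
    text = raw.lower()
    # Remove multi-word discourse markers first.
    for marker in DISCOURSE_MARKERS:
        text = re.sub(r"\b" + re.escape(marker) + r"\b", " ", text)
    lines = []
    for unit in text.split(BOUNDARY):
        # Drop filled pauses, then rebuild the unit.
        words = [w for w in unit.split() if w not in FILLED_PAUSES]
        if words:
            sentence = " ".join(words)
            # Restore sentence-initial capitalization and final punctuation.
            lines.append(sentence[0].upper() + sentence[1:] + ".")
    return lines
```

Called on a string such as "um the election results you know were announced today <SU> uh the weather will be cold", the function returns two lines, one per unit, with the fillers removed.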
The transcripts and annotations in this corpus are stored in two formats: QAn (Quick Annotator) and RTTM. Character encoding in all files is ISO-8859-2.
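Because the files are ISO-8859-2 rather than UTF-8, they need to be decoded explicitly in most modern tools. The following is a small sketch in Python; the helper names and the sample string are illustrative assumptions, not part of the corpus documentation.

```python
# Sample Czech text used only for illustration.
CZECH_SAMPLE = "Plzeň, Česká republika"


def decode_latin2(raw_bytes):
    """Decode bytes that were stored as ISO-8859-2 (Latin-2)."""
    return raw_bytes.decode("iso-8859-2")


def to_utf8(raw_bytes):
    """Re-encode ISO-8859-2 bytes as UTF-8 for downstream tools."""
    return decode_latin2(raw_bytes).encode("utf-8")


def read_latin2_file(path):
    """Read a whole text file, telling Python which encoding it uses."""
    with open(path, encoding="iso-8859-2") as handle:
        return handle.read()
```

Decoding such bytes as UTF-8 would typically either fail or produce wrong characters for letters like "ň" and "Č", which is why the encoding is passed explicitly.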
Czech Broadcast News MDE Transcripts is distributed via web download.
2010 Subscription Members will automatically receive two copies of this corpus on disc. 2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$750.
(2) GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 was prepared by LDC and contains 223,000 characters (98 files) of Chinese newsgroup text and its translation selected from twenty-one sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program.
Preparing the source data involved four stages of work: data scouting, data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout. Once the text was downloaded, its format was standardized so that the data could be more easily integrated into downstream annotation processes. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems.
The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
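The newsletter does not say which statistical approach was used for the ranking. As one possible illustration, the sketch below scores each candidate by cosine similarity between its word counts and the pooled word counts of the documents already labeled "good". The function names are hypothetical, and whitespace tokenization is a simplification: Chinese text would need word segmentation or character n-grams instead.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated tokens (a simplification; see lead-in)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(good_docs, candidates):
    """Return (score, document) pairs, most relevant first."""
    # Pool the already-selected documents into a single profile.
    profile = Counter()
    for doc in good_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(bag_of_words(c), profile), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In the workflow described above, an annotator would review the top of the ranked list, and documents judged suitable would be added to `good_docs` before the next ranking pass.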
Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU were identified: statement SU, question SU, and incomplete SU. After transcription and SU annotation, files were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features and quality control procedures applied to completed translations.
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 is distributed via web download.
2010 Subscription Members will automatically receive two copies of this corpus on disc. 2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
(3) NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations. NIST Open MT is an evaluation series to support research in, and help advance the state of the art of, technologies that translate text between human languages. Participants submit machine translation output of source language data to NIST (National Institute of Standards and Technology); the output is then evaluated with automatic and manual measures of quality against high-quality human translations of the same source data. This program supports the growing interest in system combination approaches that generate improved translations from output of several different machine translation (MT) systems. MT system combination approaches require data sets composed of high-quality human reference translations and a variety of machine translations of the same text. The NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations set addresses this need.
The data in this release consists of the human reference translations and corresponding machine translations for the NIST Open MT08 test sets, which consist of newswire and web data in the four MT08 language pairs: Arabic-to-English, Chinese-to-English, English-to-Chinese (newswire only) and Urdu-to-English. Two documents per language pair and genre were removed at random from the test sets for release. For the machine translations, only output from one submission per training condition (Constrained and Unconstrained training, where available) per participant is included. See section 2 of the MT08 Evaluation Plan for a description of the training conditions. The resulting data set has the following characteristics:
Arabic-to-English: 120 documents with 1312 segments, output from 17 machine translation systems.
Chinese-to-English: 105 documents with 1312 segments, output from 23 machine translation systems.
English-to-Chinese: 127 documents with 1830 segments, output from 11 machine translation systems.
Urdu-to-English: 128 documents with 1794 segments, output from 12 machine translation systems.
The data is organized and annotated in such a way that subsets for each language pair and/or data genre and/or training condition can be extracted and used separately, depending on the user's needs.
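As a rough illustration of working with per-language-pair subsets, the figures listed above can be held in a small table and filtered or totaled. The field names and helper functions below are hypothetical and do not reflect the actual file layout of the release.

```python
# Figures copied from the list above; the dictionary keys are illustrative.
MT08_SUBSETS = [
    {"pair": "Arabic-to-English", "documents": 120, "segments": 1312, "systems": 17},
    {"pair": "Chinese-to-English", "documents": 105, "segments": 1312, "systems": 23},
    {"pair": "English-to-Chinese", "documents": 127, "segments": 1830, "systems": 11},
    {"pair": "Urdu-to-English", "documents": 128, "segments": 1794, "systems": 12},
]


def select(pairs):
    """Keep only the rows for the requested language pairs."""
    return [row for row in MT08_SUBSETS if row["pair"] in pairs]


def totals(rows):
    """Sum document and segment counts over the given rows."""
    return {
        "documents": sum(row["documents"] for row in rows),
        "segments": sum(row["segments"] for row in rows),
    }
```

For example, `totals(select({"Arabic-to-English", "Urdu-to-English"}))` gives the combined size of the two subsets a user might extract for into-English experiments on those languages.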
NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations is distributed via web download.
2010 Subscription Members will automatically receive two copies of this corpus on disc. 2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$200.
-------------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA
Phone: (215) 573-1275
Fax: (215) 573-2175
ldc@ldc.upenn.edu
http://www.ldc.upenn.edu